1. 11 2月, 2023 1 次提交
  2. 10 2月, 2023 1 次提交
  3. 02 12月, 2022 2 次提交
  4. 23 11月, 2022 3 次提交
    • K
      dccp/tcp: Fixup bhash2 bucket when connect() fails. · e0833d1f
      Kuniyuki Iwashima 提交于
      If a socket bound to a wildcard address fails to connect(), we
      only reset saddr and keep the port.  Then, we have to fix up the
      bhash2 bucket; otherwise, the bucket has an inconsistent address
      in the list.
      
      Also, listen() for such a socket will fire the WARN_ON() in
      inet_csk_get_port(). [0]
      
      Note that when a system runs out of memory, we give up fixing the
      bucket and unlink sk from bhash and bhash2 by inet_put_port().
      
      [0]:
      WARNING: CPU: 0 PID: 207 at net/ipv4/inet_connection_sock.c:548 inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Modules linked in:
      CPU: 0 PID: 207 Comm: bhash2_prev_rep Not tainted 6.1.0-rc3-00799-gc8421681c845 #63
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-1.amzn2022.0.1 04/01/2014
      RIP: 0010:inet_csk_get_port (net/ipv4/inet_connection_sock.c:548 (discriminator 1))
      Code: 74 a7 eb 93 48 8b 54 24 18 0f b7 cb 4c 89 e6 4c 89 ff e8 48 b2 ff ff 49 8b 87 18 04 00 00 e9 32 ff ff ff 0f 0b e9 34 ff ff ff <0f> 0b e9 42 ff ff ff 41 8b 7f 50 41 8b 4f 54 89 fe 81 f6 00 00 ff
      RSP: 0018:ffffc900003d7e50 EFLAGS: 00010202
      RAX: ffff8881047fb500 RBX: 0000000000004e20 RCX: 0000000000000000
      RDX: 000000000000000a RSI: 00000000fffffe00 RDI: 00000000ffffffff
      RBP: ffffffff8324dc00 R08: 0000000000000001 R09: 0000000000000001
      R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000000
      R13: 0000000000000001 R14: 0000000000004e20 R15: ffff8881054e1280
      FS:  00007f8ac04dc740(0000) GS:ffff88842fc00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001540 CR3: 00000001055fa003 CR4: 0000000000770ef0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      PKRU: 55555554
      Call Trace:
       <TASK>
       inet_csk_listen_start (net/ipv4/inet_connection_sock.c:1205)
       inet_listen (net/ipv4/af_inet.c:228)
       __sys_listen (net/socket.c:1810)
       __x64_sys_listen (net/socket.c:1819 net/socket.c:1817 net/socket.c:1817)
       do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
       entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
      RIP: 0033:0x7f8ac051de5d
      Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
      RSP: 002b:00007ffc1c177248 EFLAGS: 00000206 ORIG_RAX: 0000000000000032
      RAX: ffffffffffffffda RBX: 0000000020001550 RCX: 00007f8ac051de5d
      RDX: ffffffffffffff80 RSI: 0000000000000000 RDI: 0000000000000004
      RBP: 00007ffc1c177270 R08: 0000000000000018 R09: 0000000000000007
      R10: 0000000020001540 R11: 0000000000000206 R12: 00007ffc1c177388
      R13: 0000000000401169 R14: 0000000000403e18 R15: 00007f8ac0723000
       </TASK>
      
      Fixes: 28044fc1 ("net: Add a bhash2 table hashed by port and address")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NMat Martineau <mathew.j.martineau@linux.intel.com>
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: NJoanne Koong <joannelkoong@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      e0833d1f
    • K
      dccp/tcp: Update saddr under bhash's lock. · 8c5dae4c
      Kuniyuki Iwashima 提交于
      When we call connect() for a socket bound to a wildcard address, we update
      saddr locklessly.  However, it could result in a data race; another thread
      iterating over bhash might see a corrupted address.
      
      Let's update saddr under the bhash bucket's lock.
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: NJoanne Koong <joannelkoong@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      8c5dae4c
    • K
      dccp/tcp: Reset saddr on failure after inet6?_hash_connect(). · 77934dc6
      Kuniyuki Iwashima 提交于
      When connect() is called on a socket bound to the wildcard address,
      we change the socket's saddr to a local address.  If the socket
      fails to connect() to the destination, we have to reset the saddr.
      
      However, when an error occurs after inet_hash6?_connect() in
      (dccp|tcp)_v[46]_conect(), we forget to reset saddr and leave
      the socket bound to the address.
      
      From the user's point of view, whether saddr is reset or not varies
      with errno.  Let's fix this inconsistent behaviour.
      
      Note that after this patch, the repro [0] will trigger the WARN_ON()
      in inet_csk_get_port() again, but this patch is not buggy and rather
      fixes a bug papering over the bhash2's bug for which we need another
      fix.
      
      For the record, the repro causes -EADDRNOTAVAIL in inet_hash6_connect()
      by this sequence:
      
        s1 = socket()
        s1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s1.bind(('127.0.0.1', 10000))
        s1.sendto(b'hello', MSG_FASTOPEN, (('127.0.0.1', 10000)))
        # or s1.connect(('127.0.0.1', 10000))
      
        s2 = socket()
        s2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
        s2.bind(('0.0.0.0', 10000))
        s2.connect(('127.0.0.1', 10000))  # -EADDRNOTAVAIL
      
        s2.listen(32)  # WARN_ON(inet_csk(sk)->icsk_bind2_hash != tb2);
      
      [0]: https://syzkaller.appspot.com/bug?extid=015d756bbd1f8b5c8f09
      
      Fixes: 3df80d93 ("[DCCP]: Introduce DCCPv6")
      Fixes: 7c657876 ("[DCCP]: Initial implementation")
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Acked-by: NJoanne Koong <joannelkoong@gmail.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      77934dc6
  5. 24 10月, 2022 1 次提交
  6. 13 10月, 2022 1 次提交
    • K
      tcp: Fix data races around icsk->icsk_af_ops. · f49cd2f4
      Kuniyuki Iwashima 提交于
      setsockopt(IPV6_ADDRFORM) and tcp_v6_connect() change icsk->icsk_af_ops
      under lock_sock(), but tcp_(get|set)sockopt() read it locklessly.  To
      avoid load/store tearing, we need to add READ_ONCE() and WRITE_ONCE()
      for the reads and writes.
      
      Thanks to Eric Dumazet for providing the syzbot report:
      
      BUG: KCSAN: data-race in tcp_setsockopt / tcp_v6_connect
      
      write to 0xffff88813c624518 of 8 bytes by task 23936 on cpu 0:
      tcp_v6_connect+0x5b3/0xce0 net/ipv6/tcp_ipv6.c:240
      __inet_stream_connect+0x159/0x6d0 net/ipv4/af_inet.c:660
      inet_stream_connect+0x44/0x70 net/ipv4/af_inet.c:724
      __sys_connect_file net/socket.c:1976 [inline]
      __sys_connect+0x197/0x1b0 net/socket.c:1993
      __do_sys_connect net/socket.c:2003 [inline]
      __se_sys_connect net/socket.c:2000 [inline]
      __x64_sys_connect+0x3d/0x50 net/socket.c:2000
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      read to 0xffff88813c624518 of 8 bytes by task 23937 on cpu 1:
      tcp_setsockopt+0x147/0x1c80 net/ipv4/tcp.c:3789
      sock_common_setsockopt+0x5d/0x70 net/core/sock.c:3585
      __sys_setsockopt+0x212/0x2b0 net/socket.c:2252
      __do_sys_setsockopt net/socket.c:2263 [inline]
      __se_sys_setsockopt net/socket.c:2260 [inline]
      __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260
      do_syscall_x64 arch/x86/entry/common.c:50 [inline]
      do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
      entry_SYSCALL_64_after_hwframe+0x63/0xcd
      
      value changed: 0xffffffff8539af68 -> 0xffffffff8539aff8
      
      Reported by Kernel Concurrency Sanitizer on:
      CPU: 1 PID: 23937 Comm: syz-executor.5 Not tainted
      6.0.0-rc4-syzkaller-00331-g4ed9c1e9-dirty #0
      
      Hardware name: Google Google Compute Engine/Google Compute Engine,
      BIOS Google 08/26/2022
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      f49cd2f4
  7. 24 9月, 2022 1 次提交
  8. 21 9月, 2022 5 次提交
    • K
      tcp: Save unnecessary inet_twsk_purge() calls. · edc12f03
      Kuniyuki Iwashima 提交于
      While destroying netns, we call inet_twsk_purge() in tcp_sk_exit_batch()
      and tcpv6_net_exit_batch() for AF_INET and AF_INET6.  These commands
      trigger the kernel to walk through the potentially big ehash twice even
      though the netns has no TIME_WAIT sockets.
      
        # ip netns add test
        # ip netns del test
      
        or
      
        # unshare -n /bin/true >/dev/null
      
      When tw_refcount is 1, we need not call inet_twsk_purge() at least
      for the net.  We can save such unneeded iterations if all netns in
      net_exit_list have no TIME_WAIT sockets.  This change eliminates
      the tax by the additional unshare() described in the next patch to
      guarantee the per-netns ehash size.
      
      Tested:
      
        # mount -t debugfs none /sys/kernel/debug/
        # echo cleanup_net > /sys/kernel/debug/tracing/set_ftrace_filter
        # echo inet_twsk_purge >> /sys/kernel/debug/tracing/set_ftrace_filter
        # echo function > /sys/kernel/debug/tracing/current_tracer
        # cat ./add_del_unshare.sh
        for i in `seq 1 40`
        do
            (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
        done
        wait;
        # ./add_del_unshare.sh
      
      Before the patch:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [031] ...1.   174.162765: cleanup_net <-process_one_work
          kworker/u128:0-8       [031] ...1.   174.240796: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [032] ...1.   174.244759: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [034] ...1.   174.290861: cleanup_net <-process_one_work
          kworker/u128:0-8       [039] ...1.   175.245027: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [046] ...1.   175.290541: inet_twsk_purge <-tcp_sk_exit_batch
          kworker/u128:0-8       [037] ...1.   175.321046: cleanup_net <-process_one_work
          kworker/u128:0-8       [024] ...1.   175.941633: inet_twsk_purge <-cleanup_net
          kworker/u128:0-8       [025] ...1.   176.242539: inet_twsk_purge <-tcp_sk_exit_batch
      
      After:
      
        # cat /sys/kernel/debug/tracing/trace_pipe
          kworker/u128:0-8       [038] ...1.   428.116174: cleanup_net <-process_one_work
          kworker/u128:0-8       [038] ...1.   428.262532: cleanup_net <-process_one_work
          kworker/u128:0-8       [030] ...1.   429.292645: cleanup_net <-process_one_work
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      edc12f03
    • K
      tcp: Access &tcp_hashinfo via net. · 4461568a
      Kuniyuki Iwashima 提交于
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use tcp_hashinfo directly in most places.
      
      Instead, access it via net->ipv4.tcp_death_row.hashinfo.
      
      The access will be valid only while initialising tcp_hashinfo
      itself and creating/destroying each netns.
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      4461568a
    • K
      tcp: Set NULL to sk->sk_prot->h.hashinfo. · 429e42c1
      Kuniyuki Iwashima 提交于
      We will soon introduce an optional per-netns ehash.
      
      This means we cannot use the global sk->sk_prot->h.hashinfo
      to fetch a TCP hashinfo.
      
      Instead, set NULL to sk->sk_prot->h.hashinfo for TCP and get
      a proper hashinfo from net->ipv4.tcp_death_row.hashinfo.
      
      Note that we need not use sk->sk_prot->h.hashinfo if DCCP is
      disabled.
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      429e42c1
    • K
      tcp: Don't allocate tcp_death_row outside of struct netns_ipv4. · e9bd0cca
      Kuniyuki Iwashima 提交于
      We will soon introduce an optional per-netns ehash and access hash
      tables via net->ipv4.tcp_death_row->hashinfo instead of &tcp_hashinfo
      in most places.
      
      It could harm the fast path because dereferences of two fields in net
      and tcp_death_row might incur two extra cache line misses.  To save one
      dereference, let's place tcp_death_row back in netns_ipv4 and fetch
      hashinfo via net->ipv4.tcp_death_row"."hashinfo.
      
      Note tcp_death_row was initially placed in netns_ipv4, and commit
      fbb82952 ("tcp: allocate tcp_death_row outside of struct netns_ipv4")
      changed it to a pointer so that we can fire TIME_WAIT timers after freeing
      net.  However, we don't do so after commit 04c494e6 ("Revert "tcp/dccp:
      get rid of inet_twsk_purge()""), so we need not define tcp_death_row as a
      pointer.
      
      Also, we move refcount_dec_and_test(&tw_refcount) from tcp_sk_exit() to
      tcp_sk_exit_batch() as a debug check.
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      e9bd0cca
    • K
      tcp: Clean up some functions. · 08eaef90
      Kuniyuki Iwashima 提交于
      This patch adds no functional change and cleans up some functions
      that the following patches touch around so that we make them tidy
      and easy to review/revert.  The changes are
      
        - Keep reverse christmas tree order
        - Remove unnecessary init of port in inet_csk_find_open_port()
        - Use req_to_sk() once in reqsk_queue_unlink()
        - Use sock_net(sk) once in tcp_time_wait() and tcp_v[46]_connect()
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      08eaef90
  9. 02 9月, 2022 1 次提交
    • E
      ipv6: tcp: send consistent autoflowlabel in SYN_RECV state · aa51b80e
      Eric Dumazet 提交于
      This is a followup of commit c67b8555 ("ipv6: tcp: send consistent
      autoflowlabel in TIME_WAIT state"), but for SYN_RECV state.
      
      In some cases, TCP sends a challenge ACK on behalf of a SYN_RECV request.
      WHen this happens, we want to use the flow label that was used when
      the prior SYNACK packet was sent, instead of another one.
      
      After his patch, following packetdrill passes:
      
          0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
         +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
         +0 bind(3, ..., ...) = 0
         +0 listen(3, 1) = 0
      
        +.2 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
         +0 > (flowlabel 0x11) S. 0:0(0) ack 1 <...>
      // Test if a challenge ack is properly sent (same flowlabel than prior SYNACK)
         +.01 < . 4000000000:4000000000(0) ack 1 win 320
         +0  > (flowlabel 0x11) . 1:1(0) ack 1
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20220831203729.458000-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      aa51b80e
  10. 25 8月, 2022 1 次提交
    • J
      net: Add a bhash2 table hashed by port and address · 28044fc1
      Joanne Koong 提交于
      The current bind hashtable (bhash) is hashed by port only.
      In the socket bind path, we have to check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      hashbucket's spinlock (see inet_csk_get_port() and
      inet_csk_bind_conflict()). In instances where there are tons of
      sockets hashed to the same port at different addresses, the bind
      conflict check is time-intensive and can cause softirq cpu lockups,
      as well as stops new tcp connections since __inet_inherit_port()
      also contests for the spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by
      port and sk->sk_rcv_saddr (ipv4) and sk->sk_v6_rcv_saddr (ipv6).
      Searching the bhash2 table leads to significantly faster conflict
      resolution and less time holding the hashbucket spinlock.
      
      Please note a few things:
      * There can be the case where the a socket's address changes after it
      has been bound. There are two cases where this happens:
      
        1) The case where there is a bind() call on INADDR_ANY (ipv4) or
        IPV6_ADDR_ANY (ipv6) and then a connect() call. The kernel will
        assign the socket an address when it handles the connect()
      
        2) In inet_sk_reselect_saddr(), which is called when rebuilding the
        sk header and a few pre-conditions are met (eg rerouting fails).
      
      In these two cases, we need to update the bhash2 table by removing the
      entry for the old address, and add a new entry reflecting the updated
      address.
      
      * The bhash2 table must have its own lock, even though concurrent
      accesses on the same port are protected by the bhash lock. Bhash2 must
      have its own lock to protect against cases where sockets on different
      ports hash to different bhash hashbuckets but to the same bhash2
      hashbucket.
      
      This brings up a few stipulations:
        1) When acquiring both the bhash and the bhash2 lock, the bhash2 lock
        will always be acquired after the bhash lock and released before the
        bhash lock is released.
      
        2) There are no nested bhash2 hashbucket locks. A bhash2 lock is always
        acquired+released before another bhash2 lock is acquired+released.
      
      * The bhash table cannot be superseded by the bhash2 table because for
      bind requests on INADDR_ANY (ipv4) or IPV6_ADDR_ANY (ipv6), every socket
      bound to that port must be checked for a potential conflict. The bhash
      table is the only source of port->socket associations.
      Signed-off-by: NJoanne Koong <joannelkoong@gmail.com>
      Signed-off-by: NJakub Kicinski <kuba@kernel.org>
      28044fc1
  11. 25 7月, 2022 1 次提交
  12. 16 7月, 2022 1 次提交
    • K
      tcp/udp: Make early_demux back namespacified. · 11052589
      Kuniyuki Iwashima 提交于
      Commit e21145a9 ("ipv4: namespacify ip_early_demux sysctl knob") made
      it possible to enable/disable early_demux on a per-netns basis.  Then, we
      introduced two knobs, tcp_early_demux and udp_early_demux, to switch it for
      TCP/UDP in commit dddb64bc ("net: Add sysctl to toggle early demux for
      tcp and udp").  However, the .proc_handler() was wrong and actually
      disabled us from changing the behaviour in each netns.
      
      We can execute early_demux if net.ipv4.ip_early_demux is on and each proto
      .early_demux() handler is not NULL.  When we toggle (tcp|udp)_early_demux,
      the change itself is saved in each netns variable, but the .early_demux()
      handler is a global variable, so the handler is switched based on the
      init_net's sysctl variable.  Thus, netns (tcp|udp)_early_demux knobs have
      nothing to do with the logic.  Whether we CAN execute proto .early_demux()
      is always decided by init_net's sysctl knob, and whether we DO it or not is
      by each netns ip_early_demux knob.
      
      This patch namespacifies (tcp|udp)_early_demux again.  For now, the users
      of the .early_demux() handler are TCP and UDP only, and they are called
      directly to avoid retpoline.  So, we can remove the .early_demux() handler
      from inet6?_protos and need not dereference them in ip6?_rcv_finish_core().
      If another proto needs .early_demux(), we can restore it at that time.
      
      Fixes: dddb64bc ("net: Add sysctl to toggle early demux for tcp and udp")
      Signed-off-by: NKuniyuki Iwashima <kuniyu@amazon.com>
      Link: https://lore.kernel.org/r/20220713175207.7727-1-kuniyu@amazon.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      11052589
  13. 11 7月, 2022 1 次提交
    • S
      net: Find dst with sk's xfrm policy not ctl_sk · e22aa148
      sewookseo 提交于
      If we set XFRM security policy by calling setsockopt with option
      IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
      struct. However tcp_v6_send_response doesn't look up dst_entry with the
      actual socket but looks up with tcp control socket. This may cause a
      problem that a RST packet is sent without ESP encryption & peer's TCP
      socket can't receive it.
      This patch will make the function look up dest_entry with actual socket,
      if the socket has XFRM policy(sock_policy), so that the TCP response
      packet via this function can be encrypted, & aligned on the encrypted
      TCP socket.
      
      Tested: We encountered this problem when a TCP socket which is encrypted
      in ESP transport mode encryption, receives challenge ACK at SYN_SENT
      state. After receiving challenge ACK, TCP needs to send RST to
      establish the socket at next SYN try. But the RST was not encrypted &
      peer TCP socket still remains on ESTABLISHED state.
      So we verified this with test step as below.
      [Test step]
      1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
      2. Client tries a new connection on the same TCP ports(src & dst).
      3. Server will return challenge ACK instead of SYN,ACK.
      4. Client will send RST to server to clear the SOCKET.
      5. Client will retransmit SYN to server on the same TCP ports.
      [Expected result]
      The TCP connection should be established.
      
      Cc: Maciej Żenczykowski <maze@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Steffen Klassert <steffen.klassert@secunet.com>
      Cc: Sehee Lee <seheele@google.com>
      Signed-off-by: NSewook Seo <sewookseo@google.com>
      Reviewed-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e22aa148
  14. 11 6月, 2022 1 次提交
  15. 21 5月, 2022 1 次提交
  16. 16 5月, 2022 1 次提交
  17. 13 5月, 2022 1 次提交
  18. 27 4月, 2022 1 次提交
    • E
      net: generalize skb freeing deferral to per-cpu lists · 68822bdf
      Eric Dumazet 提交于
      Logic added in commit f35f8219 ("tcp: defer skb freeing after socket
      lock is released") helped bulk TCP flows to move the cost of skbs
      frees outside of critical section where socket lock was held.
      
      But for RPC traffic, or hosts with RFS enabled, the solution is far from
      being ideal.
      
      For RPC traffic, recvmsg() has to return to user space right after
      skb payload has been consumed, meaning that BH handler has no chance
      to pick the skb before recvmsg() thread. This issue is more visible
      with BIG TCP, as more RPC fit one skb.
      
      For RFS, even if BH handler picks the skbs, they are still picked
      from the cpu on which user thread is running.
      
      Ideally, it is better to free the skbs (and associated page frags)
      on the cpu that originally allocated them.
      
      This patch removes the per socket anchor (sk->defer_list) and
      instead uses a per-cpu list, which will hold more skbs per round.
      
      This new per-cpu list is drained at the end of net_action_rx(),
      after incoming packets have been processed, to lower latencies.
      
      In normal conditions, skbs are added to the per-cpu list with
      no further action. In the (unlikely) cases where the cpu does not
      run net_action_rx() handler fast enough, we use an IPI to raise
      NET_RX_SOFTIRQ on the remote cpu.
      
      Also, we do not bother draining the per-cpu list from dev_cpu_dead()
      This is because skbs in this list have no requirement on how fast
      they should be freed.
      
      Note that we can add in the future a small per-cpu cache
      if we see any contention on sd->defer_lock.
      
      Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
      and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
      page recycling strategy used by NIC driver (its page pool capacity
      being too small compared to number of skbs/pages held in sockets
      receive queues)
      
      Note that this tuning was only done to demonstrate worse
      conditions for skb freeing for this particular test.
      These conditions can happen in more general production workload.
      
      10 runs of one TCP_STREAM flow
      
      Before:
      Average throughput: 49685 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() show high cost for
      skb freeing related functions (*)
      
          57.81%  [kernel]       [k] copy_user_enhanced_fast_string
      (*) 12.87%  [kernel]       [k] skb_release_data
      (*)  4.25%  [kernel]       [k] __free_one_page
      (*)  3.57%  [kernel]       [k] __list_del_entry_valid
           1.85%  [kernel]       [k] __netif_receive_skb_core
           1.60%  [kernel]       [k] __skb_datagram_iter
      (*)  1.59%  [kernel]       [k] free_unref_page_commit
      (*)  1.16%  [kernel]       [k] __slab_free
           1.16%  [kernel]       [k] _copy_to_iter
      (*)  1.01%  [kernel]       [k] kfree
      (*)  0.88%  [kernel]       [k] free_unref_page
           0.57%  [kernel]       [k] ip6_rcv_core
           0.55%  [kernel]       [k] ip6t_do_table
           0.54%  [kernel]       [k] flush_smp_call_function_queue
      (*)  0.54%  [kernel]       [k] free_pcppages_bulk
           0.51%  [kernel]       [k] llist_reverse_order
           0.38%  [kernel]       [k] process_backlog
      (*)  0.38%  [kernel]       [k] free_pcp_prepare
           0.37%  [kernel]       [k] tcp_recvmsg_locked
      (*)  0.37%  [kernel]       [k] __list_add_valid
           0.34%  [kernel]       [k] sock_rfree
           0.34%  [kernel]       [k] _raw_spin_lock_irq
      (*)  0.33%  [kernel]       [k] __page_cache_release
           0.33%  [kernel]       [k] tcp_v6_rcv
      (*)  0.33%  [kernel]       [k] __put_page
      (*)  0.29%  [kernel]       [k] __mod_zone_page_state
           0.27%  [kernel]       [k] _raw_spin_lock
      
      After patch:
      Average throughput: 73076 Mbit.
      
      Kernel profiles on cpu running user thread recvmsg() looks better:
      
          81.35%  [kernel]       [k] copy_user_enhanced_fast_string
           1.95%  [kernel]       [k] _copy_to_iter
           1.95%  [kernel]       [k] __skb_datagram_iter
           1.27%  [kernel]       [k] __netif_receive_skb_core
           1.03%  [kernel]       [k] ip6t_do_table
           0.60%  [kernel]       [k] sock_rfree
           0.50%  [kernel]       [k] tcp_v6_rcv
           0.47%  [kernel]       [k] ip6_rcv_core
           0.45%  [kernel]       [k] read_tsc
           0.44%  [kernel]       [k] _raw_spin_lock_irqsave
           0.37%  [kernel]       [k] _raw_spin_lock
           0.37%  [kernel]       [k] native_irq_return_iret
           0.33%  [kernel]       [k] __inet6_lookup_established
           0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
           0.29%  [kernel]       [k] tcp_rcv_established
           0.29%  [kernel]       [k] llist_reverse_order
      
      v2: kdoc issue (kernel bots)
          do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
          replace the sk_buff_head with a single-linked list (Jakub)
          add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      68822bdf
  19. 22 4月, 2022 1 次提交
  20. 07 4月, 2022 1 次提交
  21. 09 3月, 2022 1 次提交
  22. 03 3月, 2022 1 次提交
    • M
      net: Add skb->mono_delivery_time to distinguish mono delivery_time from (rcv) timestamp · a1ac9c8a
      Martin KaFai Lau 提交于
      skb->tstamp was first used as the (rcv) timestamp.
      The major usage is to report it to the user (e.g. SO_TIMESTAMP).
      
      Later, skb->tstamp is also set as the (future) delivery_time (e.g. EDT in TCP)
      during egress and used by the qdisc (e.g. sch_fq) to make decision on when
      the skb can be passed to the dev.
      
      Currently, there is no way to tell skb->tstamp having the (rcv) timestamp
      or the delivery_time, so it is always reset to 0 whenever forwarded
      between egress and ingress.
      
      While it makes sense to always clear the (rcv) timestamp in skb->tstamp
      to avoid confusing sch_fq that expects the delivery_time, it is a
      performance issue [0] to clear the delivery_time if the skb finally
      egress to a fq@phy-dev.  For example, when forwarding from egress to
      ingress and then finally back to egress:
      
                  tcp-sender => veth@netns => veth@hostns => fq@eth0@hostns
                                           ^              ^
                                           reset          rest
      
      This patch adds one bit skb->mono_delivery_time to flag the skb->tstamp
      is storing the mono delivery_time (EDT) instead of the (rcv) timestamp.
      
      The current use case is to keep the TCP mono delivery_time (EDT) and
      to be used with sch_fq.  A latter patch will also allow tc-bpf@ingress
      to read and change the mono delivery_time.
      
      In the future, another bit (e.g. skb->user_delivery_time) can be added
      for the SCM_TXTIME where the clock base is tracked by sk->sk_clockid.
      
      [ This patch is a prep work.  The following patches will
        get the other parts of the stack ready first.  Then another patch
        after that will finally set the skb->mono_delivery_time. ]
      
      skb_set_delivery_time() function is added.  It is used by the tcp_output.c
      and during ip[6] fragmentation to assign the delivery_time to
      the skb->tstamp and also set the skb->mono_delivery_time.
      
      A note on the change in ip_send_unicast_reply() in ip_output.c.
      It is only used by TCP to send reset/ack out of a ctl_sk.
      Like the new skb_set_delivery_time(), this patch sets
      the skb->mono_delivery_time to 0 for now as a place
      holder.  It will be enabled in a latter patch.
      A similar case in tcp_ipv6 can be done with
      skb_set_delivery_time() in tcp_v6_send_response().
      
      [0] (slide 22): https://linuxplumbersconf.org/event/11/contributions/953/attachments/867/1658/LPC_2021_BPF_Datapath_Extensions.pdfSigned-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a1ac9c8a
  23. 25 2月, 2022 1 次提交
  24. 23 2月, 2022 1 次提交
  25. 20 2月, 2022 4 次提交
  26. 27 1月, 2022 1 次提交
    • E
      tcp: allocate tcp_death_row outside of struct netns_ipv4 · fbb82952
      Eric Dumazet 提交于
      I forgot tcp had per netns tracking of timewait sockets,
      and their sysctl to change the limit.
      
      After 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()"),
      whole struct net can be freed before last tw socket is freed.
      
      We need to allocate a separate struct inet_timewait_death_row
      object per netns.
      
      tw_count becomes a refcount and gains associated debugging infrastructure.
      
      BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
      Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
      
      CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events pwq_unbound_release_workfn
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
       call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
       expire_timers kernel/time/timer.c:1466 [inline]
       __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
       __run_timers kernel/time/timer.c:1715 [inline]
       run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
      RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
      Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
      RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
      RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
      R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
      R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
       wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
       pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Allocated by task 3635:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:437 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
       kasan_slab_alloc include/linux/kasan.h:260 [inline]
       slab_post_alloc_hook mm/slab.h:732 [inline]
       slab_alloc_node mm/slub.c:3230 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807d5f9a80
       which belongs to the cache net_namespace of size 6528
      The buggy address is located 1216 bytes inside of
       6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
      The buggy address belongs to the page:
      page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
      head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
      memcg:ffff888070023001
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
      raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       skb_free_head net/core/skbuff.c:655 [inline]
       skb_release_data+0x65d/0x790 net/core/skbuff.c:677
       skb_release_all net/core/skbuff.c:742 [inline]
       __kfree_skb net/core/skbuff.c:756 [inline]
       consume_skb net/core/skbuff.c:914 [inline]
       consume_skb+0xc2/0x160 net/core/skbuff.c:908
       skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
       netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      fbb82952
  27. 25 1月, 2022 1 次提交
  28. 07 1月, 2022 1 次提交
    • M
      net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() · 91a760b2
      Menglong Dong 提交于
      The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
      __inet_bind() is not handled properly. While the return value
      is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
      exit:
      
      	err = BPF_CGROUP_RUN_PROG_INET4_POST_BIND(sk);
      	if (err) {
      		inet->inet_saddr = inet->inet_rcv_saddr = 0;
      		goto out_release_sock;
      	}
      
      Let's take UDP for example and see what will happen. For UDP
      socket, it will be added to 'udp_prot.h.udp_table->hash' and
      'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
      called success. If 'inet->inet_rcv_saddr' is specified here,
      then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
      to (because inet_saddr is changed to 0), and UDP packet received
      will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
      specified here, the sock will work fine, as it can receive packet
      properly, which is wired, as the 'bind()' is already failed.
      
      To undo the get_port() operation, introduce the 'put_port' field
      for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
      proto, it is udp_lib_unhash(); For icmp proto, it is
      ping_unhash().
      
      Therefore, after sys_bind() fail caused by
      BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
      means that it can try to be binded to another port.
      Signed-off-by: NMenglong Dong <imagedong@tencent.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20220106132022.3470772-2-imagedong@tencent.com
      91a760b2
  29. 21 12月, 2021 1 次提交
    • E
      inet: fully convert sk->sk_rx_dst to RCU rules · 8f905c0e
      Eric Dumazet 提交于
      syzbot reported various issues around early demux,
      one being included in this changelog [1]
      
      sk->sk_rx_dst is using RCU protection without clearly
      documenting it.
      
      And following sequences in tcp_v4_do_rcv()/tcp_v6_do_rcv()
      are not following standard RCU rules.
      
      [a]    dst_release(dst);
      [b]    sk->sk_rx_dst = NULL;
      
      They look wrong because a delete operation of RCU protected
      pointer is supposed to clear the pointer before
      the call_rcu()/synchronize_rcu() guarding actual memory freeing.
      
      In some cases indeed, dst could be freed before [b] is done.
      
      We could cheat by clearing sk_rx_dst before calling
      dst_release(), but this seems the right time to stick
      to standard RCU annotations and debugging facilities.
      
      [1]
      BUG: KASAN: use-after-free in dst_check include/net/dst.h:470 [inline]
      BUG: KASAN: use-after-free in tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
      Read of size 2 at addr ffff88807f1cb73a by task syz-executor.5/9204
      
      CPU: 0 PID: 9204 Comm: syz-executor.5 Not tainted 5.16.0-rc5-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Call Trace:
       <TASK>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x320 mm/kasan/report.c:247
       __kasan_report mm/kasan/report.c:433 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:450
       dst_check include/net/dst.h:470 [inline]
       tcp_v4_early_demux+0x95b/0x960 net/ipv4/tcp_ipv4.c:1792
       ip_rcv_finish_core.constprop.0+0x15de/0x1e80 net/ipv4/ip_input.c:340
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       common_interrupt+0x52/0xc0 arch/x86/kernel/irq.c:240
       asm_common_interrupt+0x1e/0x40 arch/x86/include/asm/idtentry.h:629
      RIP: 0033:0x7f5e972bfd57
      Code: 39 d1 73 14 0f 1f 80 00 00 00 00 48 8b 50 f8 48 83 e8 08 48 39 ca 77 f3 48 39 c3 73 3e 48 89 13 48 8b 50 f8 48 89 38 49 8b 0e <48> 8b 3e 48 83 c3 08 48 83 c6 08 eb bc 48 39 d1 72 9e 48 39 d0 73
      RSP: 002b:00007fff8a413210 EFLAGS: 00000283
      RAX: 00007f5e97108990 RBX: 00007f5e97108338 RCX: ffffffff81d3aa45
      RDX: ffffffff81d3aa45 RSI: 00007f5e97108340 RDI: ffffffff81d3aa45
      RBP: 00007f5e97107eb8 R08: 00007f5e97108d88 R09: 0000000093c2e8d9
      R10: 0000000000000000 R11: 0000000000000000 R12: 00007f5e97107eb0
      R13: 00007f5e97108338 R14: 00007f5e97107ea8 R15: 0000000000000019
       </TASK>
      
      Allocated by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:434 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:467
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       ip_route_input_slow+0x1817/0x3a20 net/ipv4/route.c:2340
       ip_route_input_rcu net/ipv4/route.c:2470 [inline]
       ip_route_input_noref+0x116/0x2a0 net/ipv4/route.c:2415
       ip_rcv_finish_core.constprop.0+0x288/0x1e80 net/ipv4/ip_input.c:354
       ip_list_rcv_finish.constprop.0+0x1b2/0x6e0 net/ipv4/ip_input.c:583
       ip_sublist_rcv net/ipv4/ip_input.c:609 [inline]
       ip_list_rcv+0x34e/0x490 net/ipv4/ip_input.c:644
       __netif_receive_skb_list_ptype net/core/dev.c:5508 [inline]
       __netif_receive_skb_list_core+0x549/0x8e0 net/core/dev.c:5556
       __netif_receive_skb_list net/core/dev.c:5608 [inline]
       netif_receive_skb_list_internal+0x75e/0xd80 net/core/dev.c:5699
       gro_normal_list net/core/dev.c:5853 [inline]
       gro_normal_list net/core/dev.c:5849 [inline]
       napi_complete_done+0x1f1/0x880 net/core/dev.c:6590
       virtqueue_napi_complete drivers/net/virtio_net.c:339 [inline]
       virtnet_poll+0xca2/0x11b0 drivers/net/virtio_net.c:1557
       __napi_poll+0xaf/0x440 net/core/dev.c:7023
       napi_poll net/core/dev.c:7090 [inline]
       net_rx_action+0x801/0xb40 net/core/dev.c:7177
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Freed by task 13:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track+0x21/0x30 mm/kasan/common.c:46
       kasan_set_free_info+0x20/0x30 mm/kasan/generic.c:370
       ____kasan_slab_free mm/kasan/common.c:366 [inline]
       ____kasan_slab_free mm/kasan/common.c:328 [inline]
       __kasan_slab_free+0xff/0x130 mm/kasan/common.c:374
       kasan_slab_free include/linux/kasan.h:235 [inline]
       slab_free_hook mm/slub.c:1723 [inline]
       slab_free_freelist_hook+0x8b/0x1c0 mm/slub.c:1749
       slab_free mm/slub.c:3513 [inline]
       kmem_cache_free+0xbd/0x5d0 mm/slub.c:3530
       dst_destroy+0x2d6/0x3f0 net/core/dst.c:127
       rcu_do_batch kernel/rcu/tree.c:2506 [inline]
       rcu_core+0x7ab/0x1470 kernel/rcu/tree.c:2741
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
      
      Last potentially related work creation:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       __kasan_record_aux_stack+0xf5/0x120 mm/kasan/generic.c:348
       __call_rcu kernel/rcu/tree.c:2985 [inline]
       call_rcu+0xb1/0x740 kernel/rcu/tree.c:3065
       dst_release net/core/dst.c:177 [inline]
       dst_release+0x79/0xe0 net/core/dst.c:167
       tcp_v4_do_rcv+0x612/0x8d0 net/ipv4/tcp_ipv4.c:1712
       sk_backlog_rcv include/net/sock.h:1030 [inline]
       __release_sock+0x134/0x3b0 net/core/sock.c:2768
       release_sock+0x54/0x1b0 net/core/sock.c:3300
       tcp_sendmsg+0x36/0x40 net/ipv4/tcp.c:1441
       inet_sendmsg+0x99/0xe0 net/ipv4/af_inet.c:819
       sock_sendmsg_nosec net/socket.c:704 [inline]
       sock_sendmsg+0xcf/0x120 net/socket.c:724
       sock_write_iter+0x289/0x3c0 net/socket.c:1057
       call_write_iter include/linux/fs.h:2162 [inline]
       new_sync_write+0x429/0x660 fs/read_write.c:503
       vfs_write+0x7cd/0xae0 fs/read_write.c:590
       ksys_write+0x1ee/0x250 fs/read_write.c:643
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807f1cb700
       which belongs to the cache ip_dst_cache of size 176
      The buggy address is located 58 bytes inside of
       176-byte region [ffff88807f1cb700, ffff88807f1cb7b0)
      The buggy address belongs to the page:
      page:ffffea0001fc72c0 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7f1cb
      flags: 0xfff00000000200(slab|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000000200 dead000000000100 dead000000000122 ffff8881413bb780
      raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 0, migratetype Unmovable, gfp_mask 0x112a20(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_HARDWALL), pid 5, ts 108466983062, free_ts 108048976062
       prep_new_page mm/page_alloc.c:2418 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4149
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5369
       alloc_pages+0x1a7/0x300 mm/mempolicy.c:2191
       alloc_slab_page mm/slub.c:1793 [inline]
       allocate_slab mm/slub.c:1930 [inline]
       new_slab+0x32d/0x4a0 mm/slub.c:1993
       ___slab_alloc+0x918/0xfe0 mm/slub.c:3022
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3109
       slab_alloc_node mm/slub.c:3200 [inline]
       slab_alloc mm/slub.c:3242 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3247
       dst_alloc+0x146/0x1f0 net/core/dst.c:92
       rt_dst_alloc+0x73/0x430 net/ipv4/route.c:1613
       __mkroute_output net/ipv4/route.c:2564 [inline]
       ip_route_output_key_hash_rcu+0x921/0x2d00 net/ipv4/route.c:2791
       ip_route_output_key_hash+0x18b/0x300 net/ipv4/route.c:2619
       __ip_route_output_key include/net/route.h:126 [inline]
       ip_route_output_flow+0x23/0x150 net/ipv4/route.c:2850
       ip_route_output_key include/net/route.h:142 [inline]
       geneve_get_v4_rt+0x3a6/0x830 drivers/net/geneve.c:809
       geneve_xmit_skb drivers/net/geneve.c:899 [inline]
       geneve_xmit+0xc4a/0x3540 drivers/net/geneve.c:1082
       __netdev_start_xmit include/linux/netdevice.h:4994 [inline]
       netdev_start_xmit include/linux/netdevice.h:5008 [inline]
       xmit_one net/core/dev.c:3590 [inline]
       dev_hard_start_xmit+0x1eb/0x920 net/core/dev.c:3606
       __dev_queue_xmit+0x299a/0x3650 net/core/dev.c:4229
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1338 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1389
       free_unref_page_prepare mm/page_alloc.c:3309 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3388
       qlink_free mm/kasan/quarantine.c:146 [inline]
       qlist_free_all+0x5a/0xc0 mm/kasan/quarantine.c:165
       kasan_quarantine_reduce+0x180/0x200 mm/kasan/quarantine.c:272
       __kasan_slab_alloc+0xa2/0xc0 mm/kasan/common.c:444
       kasan_slab_alloc include/linux/kasan.h:259 [inline]
       slab_post_alloc_hook mm/slab.h:519 [inline]
       slab_alloc_node mm/slub.c:3234 [inline]
       kmem_cache_alloc_node+0x255/0x3f0 mm/slub.c:3270
       __alloc_skb+0x215/0x340 net/core/skbuff.c:414
       alloc_skb include/linux/skbuff.h:1126 [inline]
       alloc_skb_with_frags+0x93/0x620 net/core/skbuff.c:6078
       sock_alloc_send_pskb+0x783/0x910 net/core/sock.c:2575
       mld_newpack+0x1df/0x770 net/ipv6/mcast.c:1754
       add_grhead+0x265/0x330 net/ipv6/mcast.c:1857
       add_grec+0x1053/0x14e0 net/ipv6/mcast.c:1995
       mld_send_initial_cr.part.0+0xf6/0x230 net/ipv6/mcast.c:2242
       mld_send_initial_cr net/ipv6/mcast.c:1232 [inline]
       mld_dad_work+0x1d3/0x690 net/ipv6/mcast.c:2268
       process_one_work+0x9b2/0x1690 kernel/workqueue.c:2298
       worker_thread+0x658/0x11f0 kernel/workqueue.c:2445
      
      Memory state around the buggy address:
       ffff88807f1cb600: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807f1cb680: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
      >ffff88807f1cb700: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                              ^
       ffff88807f1cb780: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
       ffff88807f1cb800: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 41063e9d ("ipv4: Early TCP socket demux.")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20211220143330.680945-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      8f905c0e
  30. 16 11月, 2021 1 次提交
    • E
      tcp: defer skb freeing after socket lock is released · f35f8219
      Eric Dumazet 提交于
      tcp recvmsg() (or rx zerocopy) spends a fair amount of time
      freeing skbs after their payload has been consumed.
      
      A typical ~64KB GRO packet has to release ~45 page
      references, eventually going to page allocator
      for each of them.
      
      Currently, this freeing is performed while socket lock
      is held, meaning that there is a high chance that
      BH handler has to queue incoming packets to tcp socket backlog.
      
      This can cause additional latencies, because the user
      thread has to process the backlog at release_sock() time,
      and while doing so, additional frames can be added
      by BH handler.
      
      This patch adds logic to defer these frees after socket
      lock is released, or directly from BH handler if possible.
      
      Being able to free these skbs from BH handler helps a lot,
      because this avoids the usual alloc/free assymetry,
      when BH handler and user thread do not run on same cpu or
      NUMA node.
      
      One cpu can now be fully utilized for the kernel->user copy,
      and another cpu is handling BH processing and skb/page
      allocs/frees (assuming RFS is not forcing use of a single CPU)
      
      Tested:
       100Gbit NIC
       Max throughput for one TCP_STREAM flow, over 10 runs
      
      MTU : 1500
      Before: 55 Gbit
      After:  66 Gbit
      
      MTU : 4096+(headers)
      Before: 82 Gbit
      After:  95 Gbit
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f35f8219