1. Jun 19, 2022 (2 commits)
  2. Jun 18, 2022 (1 commit)
  3. Jun 17, 2022 (6 commits)
  4. Jun 11, 2022 (3 commits)
  5. Jun 10, 2022 (2 commits)
  6. Jun 09, 2022 (3 commits)
  7. Jun 03, 2022 (1 commit)
  8. Jun 01, 2022 (1 commit)
    • tcp: tcp_rtx_synack() can be called from process context · 0a375c82
      Eric Dumazet committed
      Laurent reported the bug in the enclosed report [1].

      This bug triggers under the following conditions:
      
      0) Kernel built with CONFIG_DEBUG_PREEMPT=y
      
      1) A new passive FastOpen TCP socket is created.
         This FO socket waits for an ACK coming from client to be a complete
         ESTABLISHED one.
      2) A socket operation on this socket goes through lock_sock()
         release_sock() dance.
      3) While the socket is owned by the user in step 2),
         a retransmit of the SYN is received and stored in socket backlog.
      4) At release_sock() time, the socket backlog is processed while
         in process context.
      5) A SYNACK packet is cooked in response to the SYN retransmit.
      6) -> tcp_rtx_synack() is called in process context.
      
      Before the blamed commit, tcp_rtx_synack() was always called from BH
      context, from a timer handler.
      
      Fix this by using TCP_INC_STATS() & NET_INC_STATS(),
      which do not assume the caller is in a non-preemptible context.
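
      A minimal sketch of the distinction (assuming the standard SNMP
      helpers; this is not the exact diff of the commit):

        /* __TCP_INC_STATS()/__NET_INC_STATS() use __this_cpu_add(),
         * which requires preemption to be already disabled (BH/irq).
         * TCP_INC_STATS()/NET_INC_STATS() use this_cpu_add(), which is
         * safe from process context such as the release_sock() backlog
         * processing above.
         */
        TCP_INC_STATS(sock_net(sk), TCP_MIB_RETRANSSEGS);
        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSYNRETRANS);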
      
      [1]
      BUG: using __this_cpu_add() in preemptible [00000000] code: epollpep/2180
      caller is tcp_rtx_synack.part.0+0x36/0xc0
      CPU: 10 PID: 2180 Comm: epollpep Tainted: G           OE     5.16.0-0.bpo.4-amd64 #1  Debian 5.16.12-1~bpo11+1
      Hardware name: Supermicro SYS-5039MC-H8TRF/X11SCD-F, BIOS 1.7 11/23/2021
      Call Trace:
       <TASK>
       dump_stack_lvl+0x48/0x5e
       check_preemption_disabled+0xde/0xe0
       tcp_rtx_synack.part.0+0x36/0xc0
       tcp_rtx_synack+0x8d/0xa0
       ? kmem_cache_alloc+0x2e0/0x3e0
       ? apparmor_file_alloc_security+0x3b/0x1f0
       inet_rtx_syn_ack+0x16/0x30
       tcp_check_req+0x367/0x610
       tcp_rcv_state_process+0x91/0xf60
       ? get_nohz_timer_target+0x18/0x1a0
       ? lock_timer_base+0x61/0x80
       ? preempt_count_add+0x68/0xa0
       tcp_v4_do_rcv+0xbd/0x270
       __release_sock+0x6d/0xb0
       release_sock+0x2b/0x90
       sock_setsockopt+0x138/0x1140
       ? __sys_getsockname+0x7e/0xc0
       ? aa_sk_perm+0x3e/0x1a0
       __sys_setsockopt+0x198/0x1e0
       __x64_sys_setsockopt+0x21/0x30
       do_syscall_64+0x38/0xc0
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: 168a8f58 ("tcp: TCP Fast Open Server - main code path")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Laurent Fasnacht <laurent.fasnacht@proton.ch>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Link: https://lore.kernel.org/r/20220530213713.601888-1-eric.dumazet@gmail.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  9. May 31, 2022 (1 commit)
  10. May 28, 2022 (1 commit)
    • tcp: fix tcp_mtup_probe_success vs wrong snd_cwnd · 11825765
      Eric Dumazet committed
      syzbot got a new report [1] finally pointing to a very old bug,
      added in the initial support for MTU probing.
      
      tcp_mtu_probe() only starts an MTU probe if
      tcp_snd_cwnd(tp) >= 11.
      
      But nothing prevents tcp_snd_cwnd(tp) from being reduced later,
      before the MTU probe succeeds.

      This bug can lead to zero-divides.
      
      Debugging added in commit 40570375 ("tcp: add accessors
      to read/set tp->snd_cwnd") has paid off :)
      
      While we are at it, address potential overflows in this code.
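
      The shape of the fix, as a hedged sketch (using the
      tcp_snd_cwnd()/tcp_snd_cwnd_set() accessors from commit 40570375;
      not the exact diff):

        u64 val;

        /* Rescale cwnd for the new, larger MSS; 64-bit math avoids a
         * u32 overflow in cwnd * mss.
         */
        val = (u64)tcp_snd_cwnd(tp) * tcp_mss_to_mtu(sk, tp->mss_cache);
        do_div(val, icsk->icsk_mtup.probe_size);
        /* Clamp: never let snd_cwnd become 0, or later code divides
         * by zero.
         */
        tcp_snd_cwnd_set(tp, max_t(u32, 1U, val));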
      
      [1]
      WARNING: CPU: 1 PID: 14132 at include/net/tcp.h:1219 tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Modules linked in:
      CPU: 1 PID: 14132 Comm: syz-executor.2 Not tainted 5.18.0-syzkaller-07857-gbabf0bb9 #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      RIP: 0010:tcp_snd_cwnd_set include/net/tcp.h:1219 [inline]
      RIP: 0010:tcp_mtup_probe_success+0x366/0x570 net/ipv4/tcp_input.c:2712
      Code: 74 08 48 89 ef e8 da 80 17 f9 48 8b 45 00 65 48 ff 80 80 03 00 00 48 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 aa b0 c5 f8 <0f> 0b e9 16 fe ff ff 48 8b 4c 24 08 80 e1 07 38 c1 0f 8c c7 fc ff
      RSP: 0018:ffffc900079e70f8 EFLAGS: 00010287
      RAX: ffffffff88c0f7f6 RBX: ffff8880756e7a80 RCX: 0000000000040000
      RDX: ffffc9000c6c4000 RSI: 0000000000031f9e RDI: 0000000000031f9f
      RBP: 0000000000000000 R08: ffffffff88c0f606 R09: ffffc900079e7520
      R10: ffffed101011226d R11: 1ffff1101011226c R12: 1ffff1100eadcf50
      R13: ffff8880756e72c0 R14: 1ffff1100eadcf89 R15: dffffc0000000000
      FS:  00007f643236e700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007f1ab3f1e2a0 CR3: 0000000064fe7000 CR4: 00000000003506e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       <TASK>
       tcp_clean_rtx_queue+0x223a/0x2da0 net/ipv4/tcp_input.c:3356
       tcp_ack+0x1962/0x3c90 net/ipv4/tcp_input.c:3861
       tcp_rcv_established+0x7c8/0x1ac0 net/ipv4/tcp_input.c:5973
       tcp_v6_do_rcv+0x57b/0x1210 net/ipv6/tcp_ipv6.c:1476
       sk_backlog_rcv include/net/sock.h:1061 [inline]
       __release_sock+0x1d8/0x4c0 net/core/sock.c:2849
       release_sock+0x5d/0x1c0 net/core/sock.c:3404
       sk_stream_wait_memory+0x700/0xdc0 net/core/stream.c:145
       tcp_sendmsg_locked+0x111d/0x3fc0 net/ipv4/tcp.c:1410
       tcp_sendmsg+0x2c/0x40 net/ipv4/tcp.c:1448
       sock_sendmsg_nosec net/socket.c:714 [inline]
       sock_sendmsg net/socket.c:734 [inline]
       __sys_sendto+0x439/0x5c0 net/socket.c:2119
       __do_sys_sendto net/socket.c:2131 [inline]
       __se_sys_sendto net/socket.c:2127 [inline]
       __x64_sys_sendto+0xda/0xf0 net/socket.c:2127
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      RIP: 0033:0x7f6431289109
      Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
      RSP: 002b:00007f643236e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
      RAX: ffffffffffffffda RBX: 00007f643139c100 RCX: 00007f6431289109
      RDX: 00000000d0d0c2ac RSI: 0000000020000080 RDI: 000000000000000a
      RBP: 00007f64312e308d R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
      R13: 00007fff372533af R14: 00007f643236e300 R15: 0000000000022000
      
      Fixes: 5d424d5a ("[TCP]: MTU probing")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. May 21, 2022 (1 commit)
    • net: Add a second bind table hashed by port and address · d5a42de8
      Joanne Koong committed
      We currently have one TCP bind table (bhash), which hashes by port
      number only. In the socket bind path, we check for bind conflicts by
      traversing the specified port's inet_bind_bucket while holding the
      bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()).
      
      In instances where there are tons of sockets hashed to the same port
      at different addresses, checking for a bind conflict is time-intensive
      and can cause softirq CPU lockups; it also stalls new TCP connections,
      since __inet_inherit_port() contends for the same spinlock.
      
      This patch adds a second bind table, bhash2, that hashes by port and
      IP address. Searching the bhash2 table leads to significantly faster
      conflict resolution and less time holding the spinlock.
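
      The core idea, as a hypothetical sketch (bind2_bucket_hash is an
      illustrative name, not necessarily what the patch introduces): key
      the bucket on (netns, address, port) so a conflict scan only walks
      sockets bound to the same address.

        /* Mix the bound address into the bucket hash so that sockets on
         * the same port but different addresses land in different
         * chains.
         */
        static u32 bind2_bucket_hash(const struct net *net, __be32 addr,
                                     unsigned short port)
        {
                return jhash_1word((__force u32)addr,
                                   net_hash_mix(net)) ^ port;
        }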
      Signed-off-by: Joanne Koong <joannelkoong@gmail.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  12. May 20, 2022 (1 commit)
  13. May 18, 2022 (1 commit)
    • random32: use real rng for non-deterministic randomness · d4150779
      Jason A. Donenfeld committed
      random32.c has two random number generators in it: one that is meant to
      be used deterministically, with some predefined seed, and one that does
      the same exact thing as random.c, except does it poorly. The first one
      has some use cases. The second one no longer does and can be replaced
      with calls to random.c's proper random number generator.
      
      The relatively recent siphash-based bad random32.c code was added in
      response to concerns that the prior random32.c was too deterministic.
      Out of fears that random.c was (at the time) too slow, this code was
      anonymously contributed. Then out of that emerged a kind of shadow
      entropy gathering system, with its own tentacles throughout various net
      code, added willy nilly.
      
      Stop👏making👏bespoke👏random👏number👏generators👏.
      
      Fortunately, recent advances in random.c mean that we can stop playing
      with this sketchiness, and just use get_random_u32(), which is now fast
      enough. In micro benchmarks using RDPMC, I'm seeing the same median
      cycle count between the two functions, with the mean being _slightly_
      higher due to batches refilling (which we can optimize further if need be).
      However, when doing *real* benchmarks of the net functions that actually
      use these random numbers, the mean cycles actually *decreased* slightly
      (with the median still staying the same), likely because the additional
      prandom code means icache misses and complexity, whereas random.c is
      generally already being used by something else nearby.
      
      The biggest benefit of this is that there are many users of prandom who
      probably should be using cryptographically secure random numbers. This
      makes all of those accidental cases become secure by just flipping a
      switch. Later on, we can do a tree-wide cleanup to remove the static
      inline wrapper functions that this commit adds.
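
      The wrappers are thin shims; a sketch of what they plausibly look
      like (assuming the old prandom names are kept as static inlines):

        static inline u32 prandom_u32(void)
        {
                return get_random_u32();
        }

        static inline void prandom_bytes(void *buf, size_t nbytes)
        {
                return get_random_bytes(buf, nbytes);
        }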
      
      There are also some low-ish hanging fruits for making this even faster
      in the future: a get_random_u16() function for use in the networking
      stack will give a 2x performance boost there, using SIMD for ChaCha20
      will let us compute 4 or 8 or 16 blocks of output in parallel, instead
      of just one, giving us large buffers for cheap, and introducing a
      get_random_*_bh() function that assumes irqs are already disabled will
      shave off a few cycles for ordinary calls. These are things we can chip
      away at down the road.
      Acked-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  14. May 16, 2022 (8 commits)
  15. May 14, 2022 (1 commit)
  16. May 13, 2022 (5 commits)
    • Revert "tcp/dccp: get rid of inet_twsk_purge()" · 04c494e6
      Eric Dumazet committed
      This reverts commits:
      
      0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      d507204d ("tcp/dccp: add tw->tw_bslot")
      
      As Leonard pointed out, a newly allocated netns can happen
      to reuse a freed 'struct net'.
      
      While TCP TW timers were covered by my patches, other things were not:
      
      1) Lookups in the rx path (INET_MATCH() and INET6_MATCH()), as they
        look at the 4-tuple plus the 'struct net' pointer.
      
      2) /proc/net/tcp[6] and inet_diag, same reason.
      
      3) hashinfo->bhash[], same reason.
      
      Fixing all this seems risky; let's instead revert.
      
      In the future, we might have a per-netns TCP hash table, or
      a per-netns list of timewait sockets...
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Reported-by: Leonard Crestez <cdleonard@gmail.com>
      Tested-by: Leonard Crestez <cdleonard@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH() · 4915d50e
      Eric Dumazet committed
      INET_MATCH() runs without holding a lock on the socket.
      
      We probably need to annotate most reads.
      
      This patch makes INET_MATCH() an inline function
      to ease our changes.
      
      v2:

      We remove the 32-bit version of it, as modern compilers should
      generate the same code either way; there is no need to try to be
      smarter.

      Also make 'struct net *net' the first argument.
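
      A hedged sketch of the direction (inet_match_sketch is an
      illustrative name; the field names follow 'struct sock', but this
      is not the exact code of the commit):

        static inline bool inet_match_sketch(struct net *net,
                                             const struct sock *sk,
                                             __addrpair cookie,
                                             __portpair ports,
                                             int dif, int sdif)
        {
                int bound_dev_if;

                if (!net_eq(sock_net(sk), net) ||
                    sk->sk_portpair != ports ||
                    sk->sk_addrpair != cookie)
                        return false;

                /* Lockless read; paired with WRITE_ONCE() on the
                 * write side.
                 */
                bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
                return bound_dev_if == dif || bound_dev_if == sdif;
        }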
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: inet: Retire port only listening_hash · cae3873c
      Martin KaFai Lau committed
      The listen sk is currently stored in two hash tables,
      listening_hash (hashed by port) and lhash2 (hashed by port and address).
      
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the TCP-SYN lookup fast path does not use listening_hash.
      
      The commit 05c0b357 ("tcp: seq_file: Replace listening_hash with lhash2")
      also moved the seq_file (/proc/net/tcp) iteration usage from
      listening_hash to lhash2.
      
      There are still a few listening_hash usages left.
      One of them is inet_reuseport_add_sock(), which uses the listening_hash
      to search for a listen sk during the listen() system call. This turns
      out to be very slow on use cases that listen on many different
      VIPs at a popular port (e.g. 443), on top of the slowness of
      adding to the tail in the IPv6 case. A later patch in the series
      has a selftest to demonstrate this case.
      
      This patch takes the chance to move all remaining listening_hash
      usages to lhash2 and then retire listening_hash.

      Since most changes need to be done together, it is hard to cut
      the listening_hash-to-lhash2 switch into small patches.  The
      changes in this patch are highlighted here for review
      purposes.
      
      1. Because of the listening_hash removal, lhash2 can use
         sk->sk_nulls_node instead of icsk->icsk_listen_portaddr_node.
         This also keeps the sk_unhashed() check working as-is once we
         stop adding the sk to listening_hash.

         The union is removed from inet_listen_hashbucket because
         only nulls_head is needed.
      
      2. icsk->icsk_listen_portaddr_node and its helpers are removed.
      
      3. The current lhash2 users need to iterate with sk_nulls_node
         instead of icsk_listen_portaddr_node (see the sketch after
         this list).

         One case is inet[6]_lhash2_lookup().

         Another case is the seq_file iterator in tcp_ipv4.c.
         One thing to note is that sk_nulls_next() is needed,
         because the old inet_lhash2_for_each_icsk_continue()
         does a "next" first before iterating.
      
      4. Move the remaining listening_hash usages to lhash2:

         inet_reuseport_add_sock(), which this series is
         trying to improve.

         inet_diag.c and mptcp_diag.c are the final two
         remaining use cases and are moved to lhash2 as well.
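
      A hedged sketch of the post-switch iteration (bucket and helper
      names follow the commit text; this is not the exact diff):

        /* Walk the listeners in one lhash2 bucket under RCU, using the
         * nulls list instead of icsk_listen_portaddr_node.
         */
        struct sock *sk;
        struct hlist_nulls_node *node;

        sk_nulls_for_each_rcu(sk, node, &ilb2->nulls_head) {
                /* score and pick the best matching listener ... */
        }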
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: inet: Open code inet_hash2 and inet_unhash2 · e8d00590
      Martin KaFai Lau committed
      This patch folds the lhash2-related functions into __inet_hash and
      inet_unhash.  This will make the removal of the listening_hash
      in a later patch easier to review.
      
      First, this patch folds inet_hash2 into __inet_hash.
      
      For unhash, the current call sequence is
      inet_unhash() => __inet_unhash() => inet_unhash2().
      The specific tests in __inet_unhash() are mostly related to a
      TCP_LISTEN sk, and its caller inet_unhash() already has
      the TCP_LISTEN test, so this patch folds both __inet_unhash() and
      inet_unhash2() into inet_unhash().
      
      Note that all listening_hash users also have lhash2 initialized,
      so the !h->lhash2 check is no longer needed.
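
      The folded control flow, as a hedged sketch (not the exact
      resulting function, just the structure the text describes):

        void inet_unhash(struct sock *sk)
        {
                if (sk_unhashed(sk))
                        return;

                if (sk->sk_state == TCP_LISTEN) {
                        /* former __inet_unhash() + inet_unhash2() work:
                         * remove sk from listening_hash and lhash2
                         * under the listen bucket lock.
                         */
                } else {
                        /* non-listen case: remove from ehash under the
                         * ehash bucket lock.
                         */
                }
        }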
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    • net: inet: Remove count from inet_listen_hashbucket · 8ea1eebb
      Martin KaFai Lau committed
      After commit 0ee58dad ("net: tcp6: prefer listeners bound to an address")
      and commit d9fbc7f6 ("net: tcp: prefer listeners bound to an address"),
      the count is no longer used.  This patch removes it.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  17. May 07, 2022 (1 commit)
    • ipv4: drop dst in multicast routing path · 9e6c6d17
      Lokesh Dhoundiyal committed
      kmemleak reports the following when routing multicast traffic over an
      ipsec tunnel.
      
      Kmemleak output:
      unreferenced object 0x8000000044bebb00 (size 256):
        comm "softirq", pid 0, jiffies 4294985356 (age 126.810s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 80 00 00 00 05 13 74 80  ..............t.
          80 00 00 00 04 9b bf f9 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000f83947e0>] __kmalloc+0x1e8/0x300
          [<00000000b7ed8dca>] metadata_dst_alloc+0x24/0x58
          [<0000000081d32c20>] __ipgre_rcv+0x100/0x2b8
          [<00000000824f6cf1>] gre_rcv+0x178/0x540
          [<00000000ccd4e162>] gre_rcv+0x7c/0xd8
          [<00000000c024b148>] ip_protocol_deliver_rcu+0x124/0x350
          [<000000006a483377>] ip_local_deliver_finish+0x54/0x68
          [<00000000d9271b3a>] ip_local_deliver+0x128/0x168
          [<00000000bd4968ae>] xfrm_trans_reinject+0xb8/0xf8
          [<0000000071672a19>] tasklet_action_common.isra.16+0xc4/0x1b0
          [<0000000062e9c336>] __do_softirq+0x1fc/0x3e0
          [<00000000013d7914>] irq_exit+0xc4/0xe0
          [<00000000a4d73e90>] plat_irq_dispatch+0x7c/0x108
          [<000000000751eb8e>] handle_int+0x16c/0x178
          [<000000001668023b>] _raw_spin_unlock_irqrestore+0x1c/0x28
      
      The metadata dst is leaked when ip_route_input_mc() updates the dst for
      the skb. Commit f38a9eb1 ("dst: Metadata destinations") correctly
      handled dropping the dst in ip_route_input_slow() but missed the
      multicast case, which is handled by ip_route_input_mc(). Drop the dst in
      ip_route_input_mc(), avoiding the leak.
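
      The fix's shape, as a minimal sketch (assuming it mirrors what
      ip_route_input_slow() already does; not the exact diff):

        /* In ip_route_input_mc(): release any dst already attached to
         * the skb (e.g. the metadata dst set by the GRE rx path)
         * before attaching the multicast route, so the old dst is not
         * leaked.
         */
        skb_dst_drop(skb);
        skb_dst_set(skb, &rth->dst);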
      
      Fixes: f38a9eb1 ("dst: Metadata destinations")
      Signed-off-by: Lokesh Dhoundiyal <lokesh.dhoundiyal@alliedtelesis.co.nz>
      Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
      Reviewed-by: David Ahern <dsahern@kernel.org>
      Link: https://lore.kernel.org/r/20220505020017.3111846-1-chris.packham@alliedtelesis.co.nz
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  18. May 06, 2022 (1 commit)