1. 27 1月, 2022 1 次提交
    • E
      tcp: allocate tcp_death_row outside of struct netns_ipv4 · fbb82952
      Eric Dumazet 提交于
      I forgot tcp had per netns tracking of timewait sockets,
      and their sysctl to change the limit.
      
      After 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()"),
      whole struct net can be freed before last tw socket is freed.
      
      We need to allocate a separate struct inet_timewait_death_row
      object per netns.
      
      tw_count becomes a refcount and gains associated debugging infrastructure.
      
      BUG: KASAN: use-after-free in inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
      Read of size 8 at addr ffff88807d5f9f40 by task kworker/1:7/3690
      
      CPU: 1 PID: 3690 Comm: kworker/1:7 Not tainted 5.16.0-syzkaller #0
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      Workqueue: events pwq_unbound_release_workfn
      Call Trace:
       <IRQ>
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
       print_address_description.constprop.0.cold+0x8d/0x336 mm/kasan/report.c:255
       __kasan_report mm/kasan/report.c:442 [inline]
       kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
       inet_twsk_kill+0x358/0x3c0 net/ipv4/inet_timewait_sock.c:46
       call_timer_fn+0x1a5/0x6b0 kernel/time/timer.c:1421
       expire_timers kernel/time/timer.c:1466 [inline]
       __run_timers.part.0+0x67c/0xa30 kernel/time/timer.c:1734
       __run_timers kernel/time/timer.c:1715 [inline]
       run_timer_softirq+0xb3/0x1d0 kernel/time/timer.c:1747
       __do_softirq+0x29b/0x9c2 kernel/softirq.c:558
       invoke_softirq kernel/softirq.c:432 [inline]
       __irq_exit_rcu+0x123/0x180 kernel/softirq.c:637
       irq_exit_rcu+0x5/0x20 kernel/softirq.c:649
       sysvec_apic_timer_interrupt+0x93/0xc0 arch/x86/kernel/apic/apic.c:1097
       </IRQ>
       <TASK>
       asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:638
      RIP: 0010:lockdep_unregister_key+0x1c9/0x250 kernel/locking/lockdep.c:6328
      Code: 00 00 00 48 89 ee e8 46 fd ff ff 4c 89 f7 e8 5e c9 ff ff e8 09 cc ff ff 9c 58 f6 c4 02 75 26 41 f7 c4 00 02 00 00 74 01 fb 5b <5d> 41 5c 41 5d 41 5e 41 5f e9 19 4a 08 00 0f 0b 5b 5d 41 5c 41 5d
      RSP: 0018:ffffc90004077cb8 EFLAGS: 00000206
      RAX: 0000000000000046 RBX: ffff88807b61b498 RCX: 0000000000000001
      RDX: dffffc0000000000 RSI: 0000000000000000 RDI: 0000000000000000
      RBP: ffff888077027128 R08: 0000000000000001 R09: ffffffff8f1ea4fc
      R10: fffffbfff1ff93ee R11: 000000000000af1e R12: 0000000000000246
      R13: 0000000000000000 R14: ffffffff8ffc89b8 R15: ffffffff90157fb0
       wq_unregister_lockdep kernel/workqueue.c:3508 [inline]
       pwq_unbound_release_workfn+0x254/0x340 kernel/workqueue.c:3746
       process_one_work+0x9ac/0x1650 kernel/workqueue.c:2307
       worker_thread+0x657/0x1110 kernel/workqueue.c:2454
       kthread+0x2e9/0x3a0 kernel/kthread.c:377
       ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:295
       </TASK>
      
      Allocated by task 3635:
       kasan_save_stack+0x1e/0x50 mm/kasan/common.c:38
       kasan_set_track mm/kasan/common.c:46 [inline]
       set_alloc_info mm/kasan/common.c:437 [inline]
       __kasan_slab_alloc+0x90/0xc0 mm/kasan/common.c:470
       kasan_slab_alloc include/linux/kasan.h:260 [inline]
       slab_post_alloc_hook mm/slab.h:732 [inline]
       slab_alloc_node mm/slub.c:3230 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x202/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      The buggy address belongs to the object at ffff88807d5f9a80
       which belongs to the cache net_namespace of size 6528
      The buggy address is located 1216 bytes inside of
       6528-byte region [ffff88807d5f9a80, ffff88807d5fb400)
      The buggy address belongs to the page:
      page:ffffea0001f57e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88807d5f9a80 pfn:0x7d5f8
      head:ffffea0001f57e00 order:3 compound_mapcount:0 compound_pincount:0
      memcg:ffff888070023001
      flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
      raw: 00fff00000010200 ffff888010dd4f48 ffffea0001404e08 ffff8880118fd000
      raw: ffff88807d5f9a80 0000000000040002 00000001ffffffff ffff888070023001
      page dumped because: kasan: bad access detected
      page_owner tracks the page as allocated
      page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 3634, ts 119694798460, free_ts 119693556950
       prep_new_page mm/page_alloc.c:2434 [inline]
       get_page_from_freelist+0xa72/0x2f50 mm/page_alloc.c:4165
       __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5389
       alloc_pages+0x1aa/0x310 mm/mempolicy.c:2271
       alloc_slab_page mm/slub.c:1799 [inline]
       allocate_slab mm/slub.c:1944 [inline]
       new_slab+0x28a/0x3b0 mm/slub.c:2004
       ___slab_alloc+0x87c/0xe90 mm/slub.c:3018
       __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3105
       slab_alloc_node mm/slub.c:3196 [inline]
       slab_alloc mm/slub.c:3238 [inline]
       kmem_cache_alloc+0x35c/0x3a0 mm/slub.c:3243
       kmem_cache_zalloc include/linux/slab.h:705 [inline]
       net_alloc net/core/net_namespace.c:407 [inline]
       copy_net_ns+0x125/0x760 net/core/net_namespace.c:462
       create_new_namespaces+0x3f6/0xb20 kernel/nsproxy.c:110
       unshare_nsproxy_namespaces+0xc1/0x1f0 kernel/nsproxy.c:226
       ksys_unshare+0x445/0x920 kernel/fork.c:3048
       __do_sys_unshare kernel/fork.c:3119 [inline]
       __se_sys_unshare kernel/fork.c:3117 [inline]
       __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3117
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      page last free stack trace:
       reset_page_owner include/linux/page_owner.h:24 [inline]
       free_pages_prepare mm/page_alloc.c:1352 [inline]
       free_pcp_prepare+0x374/0x870 mm/page_alloc.c:1404
       free_unref_page_prepare mm/page_alloc.c:3325 [inline]
       free_unref_page+0x19/0x690 mm/page_alloc.c:3404
       skb_free_head net/core/skbuff.c:655 [inline]
       skb_release_data+0x65d/0x790 net/core/skbuff.c:677
       skb_release_all net/core/skbuff.c:742 [inline]
       __kfree_skb net/core/skbuff.c:756 [inline]
       consume_skb net/core/skbuff.c:914 [inline]
       consume_skb+0xc2/0x160 net/core/skbuff.c:908
       skb_free_datagram+0x1b/0x1f0 net/core/datagram.c:325
       netlink_recvmsg+0x636/0xea0 net/netlink/af_netlink.c:1998
       sock_recvmsg_nosec net/socket.c:948 [inline]
       sock_recvmsg net/socket.c:966 [inline]
       sock_recvmsg net/socket.c:962 [inline]
       ____sys_recvmsg+0x2c4/0x600 net/socket.c:2632
       ___sys_recvmsg+0x127/0x200 net/socket.c:2674
       __sys_recvmsg+0xe2/0x1a0 net/socket.c:2704
       do_syscall_x64 arch/x86/entry/common.c:50 [inline]
       do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Memory state around the buggy address:
       ffff88807d5f9e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5f9e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      >ffff88807d5f9f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                 ^
       ffff88807d5f9f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
       ffff88807d5fa000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
      
      Fixes: 0dad4087 ("tcp/dccp: get rid of inet_twsk_purge()")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: Nsyzbot <syzkaller@googlegroups.com>
      Reported-by: NPaolo Abeni <pabeni@redhat.com>
      Tested-by: NPaolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/20220126180714.845362-1-eric.dumazet@gmail.comSigned-off-by: NJakub Kicinski <kuba@kernel.org>
      fbb82952
  2. 02 12月, 2021 1 次提交
  3. 29 11月, 2021 1 次提交
  4. 15 10月, 2021 1 次提交
    • E
      tcp: switch orphan_count to bare per-cpu counters · 19757ceb
      Eric Dumazet 提交于
      Use of percpu_counter structure to track count of orphaned
      sockets is causing problems on modern hosts with 256 cpus
      or more.
      
      Stefan Bach reported a serious spinlock contention in real workloads,
      that I was able to reproduce with a netfilter rule dropping
      incoming FIN packets.
      
          53.56%  server  [kernel.kallsyms]      [k] queued_spin_lock_slowpath
                  |
                  ---queued_spin_lock_slowpath
                     |
                      --53.51%--_raw_spin_lock_irqsave
                                |
                                 --53.51%--__percpu_counter_sum
                                           tcp_check_oom
                                           |
                                           |--39.03%--__tcp_close
                                           |          tcp_close
                                           |          inet_release
                                           |          inet6_release
                                           |          sock_close
                                           |          __fput
                                           |          ____fput
                                           |          task_work_run
                                           |          exit_to_usermode_loop
                                           |          do_syscall_64
                                           |          entry_SYSCALL_64_after_hwframe
                                           |          __GI___libc_close
                                           |
                                            --14.48%--tcp_out_of_resources
                                                      tcp_write_timeout
                                                      tcp_retransmit_timer
                                                      tcp_write_timer_handler
                                                      tcp_write_timer
                                                      call_timer_fn
                                                      expire_timers
                                                      __run_timers
                                                      run_timer_softirq
                                                      __softirqentry_text_start
      
      As explained in commit cf86a086 ("net/dst: use a smaller percpu_counter
      batch for dst entries accounting"), default batch size is too big
      for the default value of tcp_max_orphans (262144).
      
      But even if we reduce batch sizes, there would still be cases
      where the estimated count of orphans is beyond the limit,
      and where tcp_too_many_orphans() has to call the expensive
      percpu_counter_sum_positive().
      
      One solution is to use plain per-cpu counters, and have
      a timer to periodically refresh this cache.
      
      Updating this cache every 100ms seems about right, tcp pressure
      state is not radically changing over shorter periods.
      
      percpu_counter was nice 15 years ago while hosts had less
      than 16 cpus, not anymore by current standards.
      
      v2: Fix the build issue for CONFIG_CRYPTO_DEV_CHELSIO_TLS=m,
          reported by kernel test robot <lkp@intel.com>
          Remove unused socket argument from tcp_too_many_orphans()
      
      Fixes: dd24c001 ("net: Use a percpu_counter for orphan_count")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NStefan Bach <sfb@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      19757ceb
  5. 24 6月, 2021 1 次提交
  6. 30 1月, 2021 1 次提交
  7. 10 11月, 2020 1 次提交
  8. 25 9月, 2020 1 次提交
  9. 18 7月, 2020 1 次提交
  10. 30 3月, 2020 1 次提交
  11. 26 1月, 2020 1 次提交
  12. 16 6月, 2019 1 次提交
  13. 31 5月, 2019 2 次提交
  14. 27 5月, 2019 2 次提交
  15. 01 12月, 2018 1 次提交
    • E
      tcp: implement coalescing on backlog queue · 4f693b55
      Eric Dumazet 提交于
      In case GRO is not as efficient as it should be or disabled,
      we might have a user thread trapped in __release_sock() while
      softirq handler flood packets up to the point we have to drop.
      
      This patch balances work done from user thread and softirq,
      to give more chances to __release_sock() to complete its work
      before new packets are added the the backlog.
      
      This also helps if we receive many ACK packets, since GRO
      does not aggregate them.
      
      This patch brings ~60% throughput increase on a receiver
      without GRO, but the spectacular gain is really on
      1000x release_sock() latency reduction I have measured.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Neal Cardwell <ncardwell@google.com>
      Cc: Yuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4f693b55
  16. 06 8月, 2018 1 次提交
  17. 30 6月, 2018 1 次提交
  18. 26 6月, 2018 1 次提交
  19. 18 5月, 2018 1 次提交
  20. 16 5月, 2018 1 次提交
  21. 20 4月, 2018 1 次提交
  22. 01 4月, 2018 2 次提交
    • E
      inet: frags: break the 2GB limit for frags storage · 3e67f106
      Eric Dumazet 提交于
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonnably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error, if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true :
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3e67f106
    • E
      inet: frags: remove some helpers · 6befe4a7
      Eric Dumazet 提交于
      Remove sum_frag_mem_limit(), ip_frag_mem() & ip6_frag_mem()
      
      Also since we use rhashtable we can bring back the number of fragments
      in "grep FRAG /proc/net/sockstat /proc/net/sockstat6" that was
      removed in commit 434d3054 ("inet: frag: don't account number
      of fragment queues")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6befe4a7
  23. 28 3月, 2018 1 次提交
  24. 27 3月, 2018 1 次提交
  25. 01 3月, 2018 1 次提交
  26. 13 2月, 2018 1 次提交
    • K
      net: Convert pernet_subsys, registered from inet_init() · f84c6821
      Kirill Tkhai 提交于
      arp_net_ops just addr/removes /proc entry.
      
      devinet_ops allocates and frees duplicate of init_net tables
      and (un)registers sysctl entries.
      
      fib_net_ops allocates and frees pernet tables, creates/destroys
      netlink socket and (un)initializes /proc entries. Foreign
      pernet_operations do not touch them.
      
      ip_rt_proc_ops only modifies pernet /proc entries.
      
      xfrm_net_ops creates/destroys /proc entries, allocates/frees
      pernet statistics, hashes and tables, and (un)initializes
      sysctl files. These are not touched by foreigh pernet_operations
      
      xfrm4_net_ops allocates/frees private pernet memory, and
      configures sysctls.
      
      sysctl_route_ops creates/destroys sysctls.
      
      rt_genid_ops only initializes fields of just allocated net.
      
      ipv4_inetpeer_ops allocated/frees net private memory.
      
      igmp_net_ops just creates/destroys /proc files and socket,
      noone else interested in.
      
      tcp_sk_ops seems to be safe, because tcp_sk_init() does not
      depend on any other pernet_operations modifications. Iteration
      over hash table in inet_twsk_purge() is made under RCU lock,
      and it's safe to iterate the table this way. Removing from
      the table happen from inet_twsk_deschedule_put(), but this
      function is safe without any extern locks, as it's synchronized
      inside itself. There are many examples, it's used in different
      context. So, it's safe to leave tcp_sk_exit_batch() unlocked.
      
      tcp_net_metrics_ops is synchronized on tcp_metrics_lock and safe.
      
      udplite4_net_ops only creates/destroys pernet /proc file.
      
      icmp_sk_ops creates percpu sockets, not touched by foreign
      pernet_operations.
      
      ipmr_net_ops creates/destroys pernet fib tables, (un)registers
      fib rules and /proc files. This seem to be safe to execute
      in parallel with foreign pernet_operations.
      
      af_inet_ops just sets up default parameters of newly created net.
      
      ipv4_mib_ops creates and destroys pernet percpu statistics.
      
      raw_net_ops, tcp4_net_ops, udp4_net_ops, ping_v4_net_ops
      and ip_proc_ops only create/destroy pernet /proc files.
      
      ip4_frags_ops creates and destroys sysctl file.
      
      So, it's safe to make the pernet_operations async.
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NAndrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f84c6821
  27. 17 1月, 2018 1 次提交
    • A
      net: delete /proc THIS_MODULE references · 96890d62
      Alexey Dobriyan 提交于
      /proc has been ignoring struct file_operations::owner field for 10 years.
      Specifically, it started with commit 786d7e16
      ("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
      inode->i_fop is initialized with proxy struct file_operations for
      regular files:
      
      	-               if (de->proc_fops)
      	-                       inode->i_fop = de->proc_fops;
      	+               if (de->proc_fops) {
      	+                       if (S_ISREG(inode->i_mode))
      	+                               inode->i_fop = &proc_reg_file_ops;
      	+                       else
      	+                               inode->i_fop = de->proc_fops;
      	+               }
      
      VFS stopped pinning module at this point.
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96890d62
  28. 11 11月, 2017 1 次提交
  29. 31 8月, 2017 1 次提交
  30. 01 8月, 2017 1 次提交
  31. 08 6月, 2017 1 次提交
    • E
      tcp: add TCPMemoryPressuresChrono counter · 06044751
      Eric Dumazet 提交于
      DRAM supply shortage and poor memory pressure tracking in TCP
      stack makes any change in SO_SNDBUF/SO_RCVBUF (or equivalent autotuning
      limits) and tcp_mem[] quite hazardous.
      
      TCPMemoryPressures SNMP counter is an indication of tcp_mem sysctl
      limits being hit, but only tracking number of transitions.
      
      If TCP stack behavior under stress was perfect :
      1) It would maintain memory usage close to the limit.
      2) Memory pressure state would be entered for short times.
      
      We certainly prefer 100 events lasting 10ms compared to one event
      lasting 200 seconds.
      
      This patch adds a new SNMP counter tracking cumulative duration of
      memory pressure events, given in ms units.
      
      $ cat /proc/sys/net/ipv4/tcp_mem
      3088    4117    6176
      $ grep TCP /proc/net/sockstat
      TCP: inuse 180 orphan 0 tw 2 alloc 234 mem 4140
      $ nstat -n ; sleep 10 ; nstat |grep Pressure
      TcpExtTCPMemoryPressures        1700
      TcpExtTCPMemoryPressuresChrono  5209
      
      v2: Used EXPORT_SYMBOL_GPL() instead of EXPORT_SYMBOL() as David
      instructed.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      06044751
  32. 25 4月, 2017 1 次提交
  33. 17 3月, 2017 1 次提交
    • S
      tcp: remove tcp_tw_recycle · 4396e461
      Soheil Hassas Yeganeh 提交于
      The tcp_tw_recycle was already broken for connections
      behind NAT, since the per-destination timestamp is not
      monotonically increasing for multiple machines behind
      a single destination address.
      
      After the randomization of TCP timestamp offsets
      in commit 8a5bd45f6616 (tcp: randomize tcp timestamp offsets
      for each connection), the tcp_tw_recycle is broken for all
      types of connections for the same reason: the timestamps
      received from a single machine is not monotonically increasing,
      anymore.
      
      Remove tcp_tw_recycle, since it is not functional. Also, remove
      the PAWSPassive SNMP counter since it is only used for
      tcp_tw_recycle, and simplify tcp_v4_route_req and tcp_v6_route_req
      since the strict argument is only set when tcp_tw_recycle is
      enabled.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NYuchung Cheng <ycheng@google.com>
      Cc: Lutz Vieweg <lvml@5t9.de>
      Cc: Florian Westphal <fw@strlen.de>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4396e461
  34. 03 2月, 2017 1 次提交
  35. 21 1月, 2017 1 次提交
  36. 30 12月, 2016 1 次提交
  37. 30 9月, 2016 1 次提交