1. 19 Mar, 2021 1 commit
  2. 11 Feb, 2021 1 commit
    • V
      mm, slub: better heuristic for number of cpus when calculating slab order · 3286222f
      Vlastimil Babka committed
      When creating a new kmem cache, SLUB determines how large the slab pages
      will be based on a number of inputs, including the number of CPUs in the
      system.  Larger slab pages mean that more objects can be allocated/freed
      from per-cpu slabs before accessing shared structures, but also
      potentially more memory can be wasted due to low slab usage and
      fragmentation.  The rough idea of using number of CPUs is that larger
      systems will be more likely to benefit from reduced contention, and also
      should have enough memory to spare.
      
      The number of CPUs used to be determined as nr_cpu_ids, which is the
      number of possible CPUs, but on some systems many will never be onlined,
      thus commit 045ab8c9 ("mm/slub: let number of online CPUs determine the
      slub page order") changed it to num_online_cpus().  However, for kmem
      caches created early, before CPUs are onlined, this may lead to
      permanently low slab page sizes.
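
      To see why an early CPU count of 1 matters, here is a minimal userspace
      sketch.  It assumes the minimum objects-per-slab target scales as
      4 * (fls(nr_cpus) + 1), which is how mm/slub.c is recalled to compute it;
      the formula is not stated in this commit message, and fls_model() and
      min_objects_for() are hypothetical names used only for illustration.

```c
/*
 * Position of the most significant set bit, counted from 1; 0 for x == 0.
 * Models the semantics of the kernel's fls() helper.
 */
static int fls_model(unsigned int x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/*
 * Assumed formula (not taken from this commit message):
 * min_objects = 4 * (fls(nr_cpus) + 1).
 */
static unsigned int min_objects_for(unsigned int nr_cpus)
{
	return 4 * (fls_model(nr_cpus) + 1);
}
```

      Under that assumption, a cache created while only one CPU is online gets
      a target of 8 objects per slab, whereas the same cache sized for 224 CPUs
      would get 36, which pushes it toward a higher page order.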
      
      Vincent reports a regression [1] of hackbench on arm64 systems:
      
        "I'm facing significant performances regression on a large arm64
         server system (224 CPUs). Regressions is also present on small arm64
         system (8 CPUs) but in a far smaller order of magnitude
      
         On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
         v5.11-rc4 : 9.135sec (+/- 0.45%)
         v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
         v5.10: 3.136sec (+/- 0.40%)"
      
      Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
      page allocator contention:
      
        "i.e. the patch incurs a 7% to 32% performance penalty. This bisected
         cleanly yesterday when I was looking for the regression and then
         found the thread.
      
         Numerous caches change size. For example, kmalloc-512 goes from
         order-0 (vanilla) to order-2 with the revert.
      
         So mostly this is down to the number of times SLUB calls into the
         page allocator which only caches order-0 pages on a per-cpu basis"
      
      Clearly num_online_cpus() doesn't work too early in bootup.  We could
      change the order dynamically in a memory hotplug callback, but runtime
      order changing for existing kmem caches has already been shown to be
      dangerous, and was removed in 32a6f409 ("mm, slub: remove runtime
      allocation order changes").
      
      It could be resurrected in a safe manner with some effort, but to fix
      the regression we need something simpler.
      
      We could use num_present_cpus() that should be the number of physically
      present CPUs even before they are onlined.  That would work for PowerPC
      [3], which triggered the original commit, but that still doesn't work on
      arm64 [4] as explained in [5].
      
      So this patch tries to determine the best available value without
      specific arch knowledge.
      
       - num_present_cpus() if the number is larger than 1, as that means the
         arch is likely setting it properly
      
       - nr_cpu_ids otherwise
      
      This should fix the reported regressions while also keeping the effect
      of 045ab8c9 for PowerPC systems.  It's possible there are
      configurations where num_present_cpus() is 1 during boot while
      nr_cpu_ids is at the same time bloated, so these (if they exist) would
      keep the large orders based on nr_cpu_ids as was before 045ab8c9.
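
      The selection rule described above can be sketched as follows.  This is
      a userspace model rather than the patch itself: struct cpu_counts and
      slab_order_cpus() are hypothetical names, and the two fields stand in
      for num_present_cpus() and nr_cpu_ids as seen at cache-creation time.

```c
/* Hypothetical stand-in for the kernel's CPU counters at cache-creation time. */
struct cpu_counts {
	unsigned int present;	/* models num_present_cpus() */
	unsigned int possible;	/* models nr_cpu_ids */
};

/*
 * Pick the CPU count used for slab order calculation:
 * trust the present count only when it is larger than 1,
 * otherwise fall back to the number of possible CPUs.
 */
static unsigned int slab_order_cpus(const struct cpu_counts *c)
{
	if (c->present > 1)
		return c->present;
	return c->possible;
}
```

      With present = 1 and possible = 224 (the arm64 case, where the present
      mask is not yet populated) the sketch returns 224; with present = 16 and
      possible = 1024 (the PowerPC-like case) it returns 16.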
      
      [1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
      [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
      [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
      [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
      [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/
      
      Link: https://lkml.kernel.org/r/20210208134108.22286-1-vbabka@suse.cz
      Fixes: 045ab8c9 ("mm/slub: let number of online CPUs determine the slub page order")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 29 Jan, 2021 1 commit
    • W
      Revert "mm/slub: fix a memory leak in sysfs_slab_add()" · 757fed1d
      Wang Hai committed
      This reverts commit dde3c6b7.
      
      syzbot reported a double-free bug.  The following case can cause this bug.
      
       - mm/slab_common.c: create_cache(): if the __kmem_cache_create() fails,
         it does:
      
      	out_free_cache:
      		kmem_cache_free(kmem_cache, s);
      
       - but __kmem_cache_create() - at least for slub() - will have done
      
      	sysfs_slab_add(s)
      		-> sysfs_create_group() .. fails ..
      		-> kobject_del(&s->kobj); .. which frees s ...
      
      We can't remove the kmem_cache_free() in create_cache(), because other
      error cases of __kmem_cache_create() do not free this.
      
      So, revert the commit dde3c6b7 ("mm/slub: fix a memory leak in
      sysfs_slab_add()") to fix this.
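
      The ownership problem can be modelled in userspace by counting releases
      instead of actually freeing memory.  Everything below is a hypothetical
      sketch: struct cache_model, release() and the *_model() functions are
      illustrative names, and the put_on_error flag stands for whether the
      error path added by dde3c6b7 is present.

```c
#include <stdbool.h>

/* Hypothetical model of a kmem_cache: counts how often it is released. */
struct cache_model {
	int releases;
};

static void release(struct cache_model *s)
{
	s->releases++;	/* a real kernel would free the object here */
}

/*
 * Models sysfs_slab_add() failing.  With dde3c6b7 applied, the error
 * path drops the kobject, which releases s.
 */
static int sysfs_slab_add_model(struct cache_model *s, bool put_on_error)
{
	if (put_on_error)
		release(s);
	return -1;	/* always fail, to exercise the error path */
}

/* Models create_cache(): releases s whenever creation fails. */
static int create_cache_model(struct cache_model *s, bool put_on_error)
{
	int err = sysfs_slab_add_model(s, put_on_error);

	if (err)
		release(s);	/* out_free_cache: kmem_cache_free(kmem_cache, s) */
	return s->releases;
}
```

      In this model the cache is released twice when put_on_error is true (the
      double free) and once when it is false (the behaviour after the revert).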
      
      Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com
      Fixes: dde3c6b7 ("mm/slub: fix a memory leak in sysfs_slab_add()")
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Wang Hai <wanghai38@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 25 Jan, 2021 1 commit
  5. 23 Jan, 2021 1 commit
    • P
      mm: Add mem_dump_obj() to print source of memory block · 8e7f37f2
      Paul E. McKenney committed
      There are kernel facilities such as per-CPU reference counts that give
      error messages in generic handlers or callbacks, whose messages are
      unenlightening.  In the case of per-CPU reference-count underflow, this
      is not a problem when creating a new use of this facility because in that
      case the bug is almost certainly in the code implementing that new use.
      However, trouble arises when deploying across many systems, which might
      exercise corner cases that were not seen during development and testing.
      Here, it would be really nice to get some kind of hint as to which of
      several uses the underflow was caused by.
      
      This commit therefore exposes a mem_dump_obj() function that takes
      a pointer to memory (which must still be allocated if it has been
      dynamically allocated) and prints available information on where that
      memory came from.  This pointer can reference the middle of the block as
      well as the beginning of the block, as needed by things like RCU callback
      functions and timer handlers that might not know where the beginning of
      the memory block is.  These functions and handlers can use mem_dump_obj()
      to print out better hints as to where the problem might lie.
      
      The information printed can depend on kernel configuration.  For example,
      the allocation return address can be printed only for slab and slub,
      and even then only when the necessary debugging has been enabled.  For
      slab, build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample
      space to the next power of two or pass SLAB_STORE_USER when creating the
      kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
      boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
      if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
      to enable printing of the allocation-time stack trace.
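
      A pointer into the middle of a block can still be attributed to its
      allocation because slab objects of one cache have a fixed size.  The
      sketch below illustrates that idea in userspace; it is not the
      mem_dump_obj() implementation, and struct slab_model and object_start()
      are hypothetical names.

```c
#include <stddef.h>

/* Hypothetical model of a slab: equal-sized objects packed from base. */
struct slab_model {
	char *base;		/* address of the first object */
	size_t obj_size;	/* size of each object in bytes */
	size_t nr_objs;		/* number of objects in the slab */
};

/*
 * Return the start of the object containing ptr,
 * or NULL if ptr lies outside the slab.
 */
static char *object_start(const struct slab_model *slab, const char *ptr)
{
	size_t off;

	if (ptr < slab->base)
		return NULL;
	off = (size_t)(ptr - slab->base);
	if (off >= slab->obj_size * slab->nr_objs)
		return NULL;
	/* Round the offset down to a multiple of the object size. */
	return slab->base + (off / slab->obj_size) * slab->obj_size;
}
```

      For a slab of four 64-byte objects, a pointer at offset 70 resolves to
      the object starting at offset 64, which is the kind of lookup an RCU
      callback or timer handler needs when it only holds an interior pointer.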
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
      [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
      [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
      [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
      [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
      [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
  6. 13 Jan, 2021 1 commit
  7. 30 Dec, 2020 1 commit
  8. 23 Dec, 2020 1 commit
  9. 16 Dec, 2020 4 commits
  10. 15 Nov, 2020 1 commit
    • L
      mm/slub: fix panic in slab_alloc_node() · 22e4663e
      Laurent Dufour committed
      While doing a memory hot-unplug operation on a PowerPC VM running 1024
      CPUs with 11TB of RAM, I hit the following panic:
      
          BUG: Kernel NULL pointer dereference on read at 0x00000007
          Faulting instruction address: 0xc000000000456048
          Oops: Kernel access of bad area, sig: 11 [#2]
          LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
          Modules linked in: rpadlpar_io rpaphp
          CPU: 160 PID: 1 Comm: systemd Tainted: G      D           5.9.0 #1
          NIP:  c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
          REGS: c00006028d1b77a0 TRAP: 0300   Tainted: G      D            (5.9.0)
          MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004228  XER: 00000000
          CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
          GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
          GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
          GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
          GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
          GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
          GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
          GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
          GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
          NIP [c000000000456048] __kmalloc_node+0x108/0x790
          LR [c000000000455fd4] __kmalloc_node+0x94/0x790
          Call Trace:
            kvmalloc_node+0x58/0x110
            mem_cgroup_css_online+0x10c/0x270
            online_css+0x48/0xd0
            cgroup_apply_control_enable+0x2c4/0x470
            cgroup_mkdir+0x408/0x5f0
            kernfs_iop_mkdir+0x90/0x100
            vfs_mkdir+0x138/0x250
            do_mkdirat+0x154/0x1c0
            system_call_exception+0xf8/0x200
            system_call_common+0xf0/0x27c
          Instruction dump:
          e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
          2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78
      
      This points to the following code:
      
          mm/slub.c:2851
                  if (unlikely(!object || !node_match(page, node))) {
          c000000000456038:       00 00 bc 2f     cmpdi   cr7,r28,0
          c00000000045603c:       18 00 9e 41     beq     cr7,c000000000456054 <__kmalloc_node+0x114>
          node_match():
          mm/slub.c:2491
                  if (node != NUMA_NO_NODE && page_to_nid(page) != node)
          c000000000456040:       30 02 92 41     beq     cr4,c000000000456270 <__kmalloc_node+0x330>
          page_to_nid():
          include/linux/mm.h:1294
          c000000000456044:       10 00 27 e9     ld      r9,16(r7)
          c000000000456048:       07 00 29 89     lbz     r9,7(r9)	<<<< r9 = NULL
          node_match():
          mm/slub.c:2491
          c00000000045604c:       00 48 99 7f     cmpw    cr7,r25,r9
          c000000000456050:       20 02 9e 41     beq     cr7,c000000000456270 <__kmalloc_node+0x330>
      
      The panic occurred in slab_alloc_node() when checking for the page's node:
      
      	object = c->freelist;
      	page = c->page;
      	if (unlikely(!object || !node_match(page, node))) {
      		object = __slab_alloc(s, gfpflags, node, addr, c);
      		stat(s, ALLOC_SLOWPATH);
      
      The issue is that object is not NULL while page is NULL which is odd but
      may happen if the cache flush happened after loading object but before
      loading page.  Thus checking for the page pointer is required too.
      
      The cache flush is done through an inter-processor interrupt when a
      piece of memory is off-lined.  That interrupt is triggered when a memory
      hot-unplug operation is initiated and offline_pages() calls SLUB's
      MEM_GOING_OFFLINE callback, slab_mem_going_offline_callback(), which
      calls flush_cpu_slab().  If that interrupt arrives between the reading
      of c->freelist and the reading of c->page, this could lead to such a
      situation.  That situation is expected, and the later call to
      this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
      the whole operation.
      
      In commit 6159d0f5 ("mm/slub.c: page is always non-NULL in
      node_match()"), the check on the page pointer was removed on the
      assumption that page is always valid when node_match() is called.  That
      is not true in this particular case, so check for page before calling
      node_match() here.
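
      The shape of the fix can be sketched in userspace as follows.  This is a
      model under stated assumptions, not the kernel code: struct page_model,
      node_match_model(), needs_slowpath() and NUMA_NO_NODE_MODEL are
      hypothetical names, and the model only mirrors the order of the checks.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUMA_NO_NODE_MODEL (-1)	/* stands in for NUMA_NO_NODE */

/* Hypothetical model of a slab page: only its NUMA node id matters here. */
struct page_model {
	int nid;
};

/*
 * Like node_match() after 6159d0f5: dereferences page without a
 * NULL check whenever a specific node is requested.
 */
static bool node_match_model(const struct page_model *page, int node)
{
	return node == NUMA_NO_NODE_MODEL || page->nid == node;
}

/*
 * Decide whether the allocation must take the slow path.
 * The !page test runs before node_match_model(), so a page pointer
 * cleared by a concurrent flush is never dereferenced.
 */
static bool needs_slowpath(const void *object, const struct page_model *page,
			   int node)
{
	return !object || !page || !node_match_model(page, node);
}
```

      With a non-NULL object and a NULL page, needs_slowpath() returns true
      instead of faulting, so the allocation is simply retried through the
      slow path.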
      
      Fixes: 6159d0f5 ("mm/slub.c: page is always non-NULL in node_match()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 17 Oct, 2020 1 commit
  12. 14 Oct, 2020 4 commits
  13. 04 Oct, 2020 1 commit
  14. 06 Sep, 2020 1 commit
  15. 08 Aug, 2020 20 commits