1. 23 May 2021, 1 commit
  2. 15 May 2021, 1 commit
    • mm, slub: move slub_debug static key enabling outside slab_mutex · afe0c26d
      Committed by Vlastimil Babka
      Paul E. McKenney reported [1] that commit 1f0723a4 ("mm, slub: enable
      slub_debug static key when creating cache with explicit debug flags")
      results in the lockdep complaint:
      
       ======================================================
       WARNING: possible circular locking dependency detected
       5.12.0+ #15 Not tainted
       ------------------------------------------------------
       rcu_torture_sta/109 is trying to acquire lock:
       ffffffff96063cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20
      
       but task is already holding lock:
       ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (slab_mutex){+.+.}-{3:3}:
              lock_acquire+0xb9/0x3a0
              __mutex_lock+0x8d/0x920
              slub_cpu_dead+0x15/0xf0
              cpuhp_invoke_callback+0x17a/0x7c0
              cpuhp_invoke_callback_range+0x3b/0x80
              _cpu_down+0xdf/0x2a0
              cpu_down+0x2c/0x50
              device_offline+0x82/0xb0
              remove_cpu+0x1a/0x30
              torture_offline+0x80/0x140
              torture_onoff+0x147/0x260
              kthread+0x10a/0x140
              ret_from_fork+0x22/0x30
      
       -> #0 (cpu_hotplug_lock){++++}-{0:0}:
              check_prev_add+0x8f/0xbf0
              __lock_acquire+0x13f0/0x1d80
              lock_acquire+0xb9/0x3a0
              cpus_read_lock+0x21/0xa0
              static_key_enable+0x9/0x20
              __kmem_cache_create+0x38d/0x430
              kmem_cache_create_usercopy+0x146/0x250
              kmem_cache_create+0xd/0x10
              rcu_torture_stats+0x79/0x280
              kthread+0x10a/0x140
              ret_from_fork+0x22/0x30
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slab_mutex);
                                      lock(cpu_hotplug_lock);
                                      lock(slab_mutex);
         lock(cpu_hotplug_lock);
      
        *** DEADLOCK ***
      
       1 lock held by rcu_torture_sta/109:
        #0: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250
      
       stack backtrace:
       CPU: 3 PID: 109 Comm: rcu_torture_sta Not tainted 5.12.0+ #15
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
       Call Trace:
        dump_stack+0x6d/0x89
        check_noncircular+0xfe/0x110
        ? lock_is_held_type+0x98/0x110
        check_prev_add+0x8f/0xbf0
        __lock_acquire+0x13f0/0x1d80
        lock_acquire+0xb9/0x3a0
        ? static_key_enable+0x9/0x20
        ? mark_held_locks+0x49/0x70
        cpus_read_lock+0x21/0xa0
        ? static_key_enable+0x9/0x20
        static_key_enable+0x9/0x20
        __kmem_cache_create+0x38d/0x430
        kmem_cache_create_usercopy+0x146/0x250
        ? rcu_torture_stats_print+0xd0/0xd0
        kmem_cache_create+0xd/0x10
        rcu_torture_stats+0x79/0x280
        ? rcu_torture_stats_print+0xd0/0xd0
        kthread+0x10a/0x140
        ? kthread_park+0x80/0x80
        ret_from_fork+0x22/0x30
      
      This is because there's one order of locking from the hotplug callbacks:
      
      lock(cpu_hotplug_lock); // from hotplug machinery itself
      lock(slab_mutex); // in e.g. slab_mem_going_offline_callback()
      
      And commit 1f0723a4 made the reverse sequence possible:
      lock(slab_mutex); // in kmem_cache_create_usercopy()
      lock(cpu_hotplug_lock); // kmem_cache_open() -> static_key_enable()
      
      The simplest fix is to move static_key_enable() to a place before slab_mutex is
      taken. That means kmem_cache_create_usercopy() in mm/slab_common.c which is not
      ideal for SLUB-specific code, but the #ifdef CONFIG_SLUB_DEBUG makes it
      at least self-contained and obvious.
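
      A minimal sketch of the shape of that fix, assuming the slub_debug_enabled
      static key and the SLAB_DEBUG_FLAGS mask used elsewhere in SLUB (not the
      exact upstream hunk):

      	/* mm/slab_common.c, kmem_cache_create_usercopy(), before slab_mutex: */
      #ifdef CONFIG_SLUB_DEBUG
      	/*
      	 * Enable the static key while slab_mutex is not yet held, so the
      	 * cpu_hotplug_lock taken inside static_branch_enable() can never
      	 * nest inside slab_mutex.
      	 */
      	if (flags & SLAB_DEBUG_FLAGS)
      		static_branch_enable(&slub_debug_enabled);
      #endif

      	mutex_lock(&slab_mutex);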
      
      [1] https://lore.kernel.org/lkml/20210502171827.GA3670492@paulmck-ThinkPad-P17-Gen-1/
      
      Link: https://lkml.kernel.org/r/20210504120019.26791-1-vbabka@suse.cz
      Fixes: 1f0723a4 ("mm, slub: enable slub_debug static key when creating cache with explicit debug flags")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Paul E. McKenney <paulmck@kernel.org>
      Tested-by: Paul E. McKenney <paulmck@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afe0c26d
  3. 07 May 2021, 1 commit
  4. 01 May 2021, 4 commits
  5. 19 March 2021, 2 commits
  6. 11 March 2021, 1 commit
    • Revert "mm, slub: consider rest of partial list if acquire_slab() fails" · 9b1ea29b
      Committed by Linus Torvalds
      This reverts commit 8ff60eb0.
      
      The kernel test robot reports a huge performance regression due to the
      commit, and the reason seems fairly straightforward: when there is
      contention on the page list (which is what causes acquire_slab() to
      fail), we do _not_ want to just loop and try again, because that will
      transfer the contention to the 'n->list_lock' spinlock we hold, and
      just make things even worse.
      
      This is admittedly likely a problem only on big machines - the kernel
      test robot report comes from a 96-thread dual socket Intel Xeon Gold
      6252 setup, but the regression there really is quite noticeable:
      
         -47.9% regression of stress-ng.rawpkt.ops_per_sec
      
      and the commit that was marked as being fixed (7ced3719: "slub:
      Acquire_slab() avoid loop") actually did the loop exit early very
      intentionally (the hint being the "avoid loop" part of that commit
      message), exactly to avoid this issue.
      
      The correct thing to do may be to pick some kind of reasonable middle
      ground: instead of breaking out of the loop on the very first sign of
      contention, or trying over and over and over again, the right thing may
      be to re-try _once_, and then give up on the second failure (or pick
      your favorite value for "once"..).
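
      A rough sketch of that "retry once" idea, assuming it would live in the
      partial-list scan around acquire_slab() (illustrative only; this revert
      does not implement it):

      	int contended = 0;

      	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
      		void *t = acquire_slab(s, n, page, object == NULL, &objects);

      		if (!t) {
      			if (contended++)
      				break;		/* second failure: give up */
      			continue;		/* first failure: try one more slab */
      		}
      		/* ... use the acquired slab ... */
      	}
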
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Link: https://lore.kernel.org/lkml/20210301080404.GF12822@xsang-OptiPlex-9020/
      Cc: Jann Horn <jannh@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9b1ea29b
  7. 09 March 2021, 1 commit
    • mm: Don't build mm_dump_obj() on CONFIG_PRINTK=n kernels · 5bb1bb35
      Committed by Paul E. McKenney
      The mem_dump_obj() functionality adds a few hundred bytes, which is a
      small price to pay.  Except on kernels built with CONFIG_PRINTK=n, in
      which mem_dump_obj() messages will be suppressed.  This commit therefore
      makes mem_dump_obj() be a static inline empty function on kernels built
      with CONFIG_PRINTK=n and excludes all of its support functions as well.
      This avoids kernel bloat on systems that cannot use mem_dump_obj().
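
      A minimal sketch of the resulting header shape (assuming the declaration
      lives in include/linux/mm.h; simplified, not the verbatim diff):

      #ifdef CONFIG_PRINTK
      void mem_dump_obj(void *object);
      #else
      static inline void mem_dump_obj(void *object) {}
      #endif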
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <linux-mm@kvack.org>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      5bb1bb35
  8. 27 February 2021, 2 commits
    • kasan, mm: optimize kmalloc poisoning · e2db1a9a
      Committed by Andrey Konovalov
      For allocations from kmalloc caches, kasan_kmalloc() always follows
      kasan_slab_alloc().  Currently, both of them unpoison the whole object,
      which is unnecessary.
      
      This patch provides separate implementations for both annotations:
      kasan_slab_alloc() unpoisons the whole object, and kasan_kmalloc() only
      poisons the redzone.
      
      For generic KASAN, the redzone start might not be aligned to
      KASAN_GRANULE_SIZE.  Therefore, the poisoning is split in two parts:
      kasan_poison_last_granule() poisons the unaligned part, and then
      kasan_poison() poisons the rest.
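
      A hedged sketch of that split, using the names from the description above
      (simplified; the exact KASAN internals may differ):

      	/* Byte-precise poisoning of the unaligned start of the redzone. */
      	if (IS_ENABLED(CONFIG_KASAN_GENERIC))
      		kasan_poison_last_granule((void *)object, size);

      	/* Granule-aligned poisoning of the rest of the redzone. */
      	redzone_start = round_up((unsigned long)(object + size),
      				 KASAN_GRANULE_SIZE);
      	redzone_end = (unsigned long)object + cache->object_size;
      	kasan_poison((void *)redzone_start, redzone_end - redzone_start,
      		     KASAN_KMALLOC_REDZONE);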
      
      This patch also clarifies alignment guarantees of each of the poisoning
      functions and drops the unnecessary round_up() call for redzone_end.
      
      With this change, the early SLUB cache annotation needs to be changed to
      kasan_slab_alloc(), as kasan_kmalloc() doesn't unpoison objects now.  The
      number of poisoned bytes for objects in this cache stays the same, as
      kmem_cache_node->object_size is equal to sizeof(struct kmem_cache_node).
      
      Link: https://lkml.kernel.org/r/7e3961cb52be380bc412860332063f5f7ce10d13.1612546384.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2db1a9a
    • mm, kfence: insert KFENCE hooks for SLUB · b89fb5ef
      Committed by Alexander Potapenko
      Inserts KFENCE hooks into the SLUB allocator.
      
      To pass the originally requested size to KFENCE, add an argument
      'orig_size' to slab_alloc*(). The additional argument is required to
      preserve the requested original size for kmalloc() allocations, which
      uses size classes (e.g. an allocation of 272 bytes will return an object
      of size 512). Therefore, kmem_cache::size does not represent the
      kmalloc-caller's requested size, and we must introduce the argument
      'orig_size' to propagate the originally requested size to KFENCE.
      
      Without the originally requested size, we would not be able to detect
      out-of-bounds accesses for objects placed at the end of a KFENCE object
      page if that object is not equal to the kmalloc-size class it was
      bucketed into.
      
      When KFENCE is disabled, there is no additional overhead, since
      slab_alloc*() functions are __always_inline.
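
      A simplified sketch of the hook placement (assuming the slab_alloc_node()
      shape of this kernel; not the full upstream hunk):

      	static __always_inline void *slab_alloc_node(struct kmem_cache *s,
      			gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
      	{
      		void *object;
      		...
      		/* Offer the allocation to KFENCE, using the originally requested size. */
      		object = kfence_alloc(s, orig_size, gfpflags);
      		if (unlikely(object))
      			goto out;	/* KFENCE serviced this allocation */

      		/* ... regular SLUB fast and slow paths ... */
      	}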
      
      Link: https://lkml.kernel.org/r/20201103175841.3495947-6-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Jann Horn <jannh@google.com>
      Co-developed-by: Marco Elver <elver@google.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b89fb5ef
  9. 25 February 2021, 11 commits
  10. 11 February 2021, 1 commit
    • mm, slub: better heuristic for number of cpus when calculating slab order · 3286222f
      Committed by Vlastimil Babka
      When creating a new kmem cache, SLUB determines how large the slab pages
      will be based on a number of inputs, including the number of CPUs in the
      system.  Larger slab pages mean that more objects can be allocated/freed
      from per-cpu slabs before accessing shared structures, but also
      potentially more memory can be wasted due to low slab usage and
      fragmentation.  The rough idea of using number of CPUs is that larger
      systems will be more likely to benefit from reduced contention, and also
      should have enough memory to spare.
      
      Number of CPUs used to be determined as nr_cpu_ids, which is the number of
      possible CPUs, but on some systems many will never be onlined, thus
      commit 045ab8c9 ("mm/slub: let number of online CPUs determine the
      slub page order") changed it to num_online_cpus().  However, for kmem
      caches created early before CPUs are onlined, this may lead to
      permanently low slab page sizes.
      
      Vincent reports a regression [1] of hackbench on arm64 systems:
      
        "I'm facing significant performances regression on a large arm64
         server system (224 CPUs). Regressions is also present on small arm64
         system (8 CPUs) but in a far smaller order of magnitude
      
         On 224 CPUs system : 9 iterations of hackbench -l 16000 -g 16
         v5.11-rc4 : 9.135sec (+/- 0.45%)
         v5.11-rc4 + revert this patch: 3.173sec (+/- 0.48%)
         v5.10: 3.136sec (+/- 0.40%)"
      
      Mel reports a regression [2] of hackbench on x86_64, with lockstat suggesting
      page allocator contention:
      
        "i.e. the patch incurs a 7% to 32% performance penalty. This bisected
         cleanly yesterday when I was looking for the regression and then
         found the thread.
      
         Numerous caches change size. For example, kmalloc-512 goes from
         order-0 (vanilla) to order-2 with the revert.
      
         So mostly this is down to the number of times SLUB calls into the
         page allocator which only caches order-0 pages on a per-cpu basis"
      
      Clearly num_online_cpus() doesn't work too early in bootup.  We could
      change the order dynamically in a memory hotplug callback, but runtime
      order changing for existing kmem caches has been already shown as
      dangerous, and removed in 32a6f409 ("mm, slub: remove runtime
      allocation order changes").
      
      It could be resurrected in a safe manner with some effort, but to fix
      the regression we need something simpler.
      
      We could use num_present_cpus() that should be the number of physically
      present CPUs even before they are onlined.  That would work for PowerPC
      [3], which triggered the original commit, but that still doesn't work on
      arm64 [4] as explained in [5].
      
      So this patch tries to determine the best available value without
      specific arch knowledge.
      
       - num_present_cpus() if the number is larger than 1, as that means the
         arch is likely setting it properly
      
       - nr_cpu_ids otherwise
      
      This should fix the reported regressions while also keeping the effect
      of 045ab8c9 for PowerPC systems.  It's possible there are
      configurations where num_present_cpus() is 1 during boot while
      nr_cpu_ids is at the same time bloated, so these (if they exist) would
      keep the large orders based on nr_cpu_ids as was before 045ab8c9.
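
      In code terms, the heuristic amounts to roughly the following (a sketch of
      the calculate_order() change; the min_objects formula is the pre-existing
      one, shown only for context):

      	unsigned int nr_cpus = num_present_cpus();

      	/*
      	 * A value of 1 usually means the arch has not populated the present
      	 * mask yet, so fall back to the count of possible CPUs.
      	 */
      	if (nr_cpus <= 1)
      		nr_cpus = nr_cpu_ids;
      	min_objects = 4 * (fls(nr_cpus) + 1);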
      
      [1] https://lore.kernel.org/linux-mm/CAKfTPtA_JgMf_+zdFbcb_V9rM7JBWNPjAz9irgwFj7Rou=xzZg@mail.gmail.com/
      [2] https://lore.kernel.org/linux-mm/20210128134512.GF3592@techsingularity.net/
      [3] https://lore.kernel.org/linux-mm/20210123051607.GC2587010@in.ibm.com/
      [4] https://lore.kernel.org/linux-mm/CAKfTPtAjyVmS5VYvU6DBxg4-JEo5bdmWbngf-03YsY18cmWv_g@mail.gmail.com/
      [5] https://lore.kernel.org/linux-mm/20210126230305.GD30941@willie-the-truck/
      
      Link: https://lkml.kernel.org/r/20210208134108.22286-1-vbabka@suse.cz
      Fixes: 045ab8c9 ("mm/slub: let number of online CPUs determine the slub page order")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
      Reported-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3286222f
  11. 29 January 2021, 1 commit
    • Revert "mm/slub: fix a memory leak in sysfs_slab_add()" · 757fed1d
      Committed by Wang Hai
      This reverts commit dde3c6b7.
      
      syzbot reported a double-free bug. The following case can cause this bug.
      
       - mm/slab_common.c: create_cache(): if the __kmem_cache_create() fails,
         it does:
      
      	out_free_cache:
      		kmem_cache_free(kmem_cache, s);
      
       - but __kmem_cache_create() - at least for SLUB - will have done
      
      	sysfs_slab_add(s)
      		-> sysfs_create_group() .. fails ..
      		-> kobject_del(&s->kobj); .. which frees s ...
      
      We can't remove the kmem_cache_free() in create_cache(), because other
      error cases of __kmem_cache_create() do not free this.
      
      So, revert the commit dde3c6b7 ("mm/slub: fix a memory leak in
      sysfs_slab_add()") to fix this.
      
      Reported-by: syzbot+d0bd96b4696c1ef67991@syzkaller.appspotmail.com
      Fixes: dde3c6b7 ("mm/slub: fix a memory leak in sysfs_slab_add()")
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Wang Hai <wanghai38@huawei.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      757fed1d
  12. 25 January 2021, 1 commit
  13. 23 January 2021, 1 commit
    • mm: Add mem_dump_obj() to print source of memory block · 8e7f37f2
      Committed by Paul E. McKenney
      There are kernel facilities such as per-CPU reference counts that give
      error messages in generic handlers or callbacks, whose messages are
      unenlightening.  In the case of per-CPU reference-count underflow, this
      is not a problem when creating a new use of this facility because in that
      case the bug is almost certainly in the code implementing that new use.
      However, trouble arises when deploying across many systems, which might
      exercise corner cases that were not seen during development and testing.
      Here, it would be really nice to get some kind of hint as to which of
      several uses the underflow was caused by.
      
      This commit therefore exposes a mem_dump_obj() function that takes
      a pointer to memory (which must still be allocated if it has been
      dynamically allocated) and prints available information on where that
      memory came from.  This pointer can reference the middle of the block as
      well as the beginning of the block, as needed by things like RCU callback
      functions and timer handlers that might not know where the beginning of
      the memory block is.  These functions and handlers can use mem_dump_obj()
      to print out better hints as to where the problem might lie.
      
      The information printed can depend on kernel configuration.  For example,
      the allocation return address can be printed only for slab and slub,
      and even then only when the necessary debug has been enabled.  For slab,
      build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
      to the next power of two or use the SLAB_STORE_USER when creating the
      kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
      boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
      if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
      to enable printing of the allocation-time stack trace.
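
      A hypothetical usage sketch (the cache and structure names below are
      illustrative, not part of this commit): create a SLUB cache with
      SLAB_STORE_USER so the allocation site can be reported, then dump a
      suspicious pointer from a callback.

      	struct kmem_cache *demo_cache = kmem_cache_create("demo_cache",
      			sizeof(struct demo_obj), 0, SLAB_STORE_USER, NULL);
      	struct demo_obj *obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);

      	/* e.g. from an RCU callback or a refcount error handler: */
      	mem_dump_obj(obj);	/* prints the slab cache, offset and, when
      				   available, the allocation return address */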
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
      [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
      [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
      [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
      [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
      [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      8e7f37f2
  14. 13 January 2021, 1 commit
  15. 30 December 2020, 1 commit
  16. 23 December 2020, 1 commit
  17. 16 December 2020, 4 commits
  18. 15 November 2020, 1 commit
    • mm/slub: fix panic in slab_alloc_node() · 22e4663e
      Committed by Laurent Dufour
      While doing a memory hot-unplug operation on a PowerPC VM running 1024 CPUs
      with 11TB of RAM, I hit the following panic:
      
          BUG: Kernel NULL pointer dereference on read at 0x00000007
          Faulting instruction address: 0xc000000000456048
          Oops: Kernel access of bad area, sig: 11 [#2]
          LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS= 2048 NUMA pSeries
          Modules linked in: rpadlpar_io rpaphp
          CPU: 160 PID: 1 Comm: systemd Tainted: G      D           5.9.0 #1
          NIP:  c000000000456048 LR: c000000000455fd4 CTR: c00000000047b350
          REGS: c00006028d1b77a0 TRAP: 0300   Tainted: G      D            (5.9.0)
          MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004228  XER: 00000000
          CFAR: c00000000000f1b0 DAR: 0000000000000007 DSISR: 40000000 IRQMASK: 0
          GPR00: c000000000455fd4 c00006028d1b7a30 c000000001bec800 0000000000000000
          GPR04: 0000000000000dc0 0000000000000000 00000000000374ef c00007c53df99320
          GPR08: 000007c53c980000 0000000000000000 000007c53c980000 0000000000000000
          GPR12: 0000000000004400 c00000001e8e4400 0000000000000000 0000000000000f6a
          GPR16: 0000000000000000 c000000001c25930 c000000001d62528 00000000000000c1
          GPR20: c000000001d62538 c00006be469e9000 0000000fffffffe0 c0000000003c0ff8
          GPR24: 0000000000000018 0000000000000000 0000000000000dc0 0000000000000000
          GPR28: c00007c513755700 c000000001c236a4 c00007bc4001f800 0000000000000001
          NIP [c000000000456048] __kmalloc_node+0x108/0x790
          LR [c000000000455fd4] __kmalloc_node+0x94/0x790
          Call Trace:
            kvmalloc_node+0x58/0x110
            mem_cgroup_css_online+0x10c/0x270
            online_css+0x48/0xd0
            cgroup_apply_control_enable+0x2c4/0x470
            cgroup_mkdir+0x408/0x5f0
            kernfs_iop_mkdir+0x90/0x100
            vfs_mkdir+0x138/0x250
            do_mkdirat+0x154/0x1c0
            system_call_exception+0xf8/0x200
            system_call_common+0xf0/0x27c
          Instruction dump:
          e93e0000 e90d0030 39290008 7cc9402a e94d0030 e93e0000 7ce95214 7f89502a
          2fbc0000 419e0018 41920230 e9270010 <89290007> 7f994800 419e0220 7ee6bb78
      
      This pointing to the following code:
      
          mm/slub.c:2851
                  if (unlikely(!object || !node_match(page, node))) {
          c000000000456038:       00 00 bc 2f     cmpdi   cr7,r28,0
          c00000000045603c:       18 00 9e 41     beq     cr7,c000000000456054 <__kmalloc_node+0x114>
          node_match():
          mm/slub.c:2491
                  if (node != NUMA_NO_NODE && page_to_nid(page) != node)
          c000000000456040:       30 02 92 41     beq     cr4,c000000000456270 <__kmalloc_node+0x330>
          page_to_nid():
          include/linux/mm.h:1294
          c000000000456044:       10 00 27 e9     ld      r9,16(r7)
          c000000000456048:       07 00 29 89     lbz     r9,7(r9)	<<<< r9 = NULL
          node_match():
          mm/slub.c:2491
          c00000000045604c:       00 48 99 7f     cmpw    cr7,r25,r9
          c000000000456050:       20 02 9e 41     beq     cr7,c000000000456270 <__kmalloc_node+0x330>
      
      The panic occurred in slab_alloc_node() when checking for the page's node:
      
      	object = c->freelist;
      	page = c->page;
      	if (unlikely(!object || !node_match(page, node))) {
      		object = __slab_alloc(s, gfpflags, node, addr, c);
      		stat(s, ALLOC_SLOWPATH);
      
      The issue is that object is not NULL while page is NULL which is odd but
      may happen if the cache flush happened after loading object but before
      loading page.  Thus checking for the page pointer is required too.
      
      The cache flush is done through an inter processor interrupt when a
      piece of memory is off-lined.  That interrupt is triggered when a memory
      hot-unplug operation is initiated and offline_pages() is calling the
      slub's MEM_GOING_OFFLINE callback slab_mem_going_offline_callback()
      which is calling flush_cpu_slab().  If that interrupt is caught between
      the reading of c->freelist and the reading of c->page, this could lead
      to such a situation.  That situation is expected and the later call to
      this_cpu_cmpxchg_double() will detect the change to c->freelist and redo
      the whole operation.
      
      In commit 6159d0f5 ("mm/slub.c: page is always non-NULL in
      node_match()") check on the page pointer has been removed assuming that
      page is always valid when it is called.  It happens that this is not
      true in that particular case, so check for page before calling
      node_match() here.
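
      The resulting check looks roughly like this (a sketch of the fixed
      fast-path test, matching the snippet quoted above):

      	object = c->freelist;
      	page = c->page;
      	if (unlikely(!object || !page || !node_match(page, node))) {
      		object = __slab_alloc(s, gfpflags, node, addr, c);
      		stat(s, ALLOC_SLOWPATH);
      	}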
      
      Fixes: 6159d0f5 ("mm/slub.c: page is always non-NULL in node_match()")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201027190406.33283-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      22e4663e
  19. 17 October 2020, 1 commit
  20. 14 October 2020, 3 commits