1. 27 Feb 2021, 2 commits
    • kasan, mm: don't save alloc stacks twice · 92850134
      Andrey Konovalov authored
      Patch series "kasan: optimizations and fixes for HW_TAGS", v4.
      
      This patchset makes the HW_TAGS mode more efficient, mostly by reworking
      poisoning approaches and simplifying/inlining some internal helpers.
      
      With this change, the overhead of HW_TAGS annotations excluding setting
      and checking memory tags is ~3%.  The performance impact caused by tags
      will be unknown until we have hardware that supports MTE.
      
      As a side-effect, this patchset speeds up generic KASAN by ~15%.
      
      This patch (of 13):
      
      Currently KASAN saves allocation stacks in both kasan_slab_alloc() and
      kasan_kmalloc() annotations.  This patch changes KASAN to save allocation
      stacks for slab objects from kmalloc caches in kasan_kmalloc() only, and
      stacks for other slab objects in kasan_slab_alloc() only.
      
      This change requires ____kasan_kmalloc() knowing whether the object
      belongs to a kmalloc cache.  This is implemented by adding a flag field to
      the kasan_info structure.  That flag is only set for kmalloc caches via a
      new kasan_cache_create_kmalloc() annotation.
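
      A minimal sketch of the mechanism described above; the flag name
      (is_kmalloc) and the exact hook shapes are assumptions based on this
      description, not necessarily the literal upstream code:

      /* Per-cache KASAN info gains a flag that marks kmalloc caches. */
      struct kasan_cache {
      	int alloc_meta_offset;
      	int free_meta_offset;
      	bool is_kmalloc;	/* assumed name: set only for kmalloc caches */
      };

      void kasan_cache_create_kmalloc(struct kmem_cache *cache)
      {
      	cache->kasan_info.is_kmalloc = true;
      }

      /* ____kasan_slab_alloc() then records the allocation stack only for
       * !is_kmalloc caches, and ____kasan_kmalloc() only for is_kmalloc
       * caches, so each allocation stack is saved exactly once. */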
      
      Link: https://lkml.kernel.org/r/cover.1612546384.git.andreyknvl@google.com
      Link: https://lkml.kernel.org/r/7c673ebca8d00f40a7ad6f04ab9a2bddeeae2097.1612546384.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      92850134
    • mm, kfence: insert KFENCE hooks for SLAB · d3fb45f3
      Alexander Potapenko authored
      Inserts KFENCE hooks into the SLAB allocator.
      
      To pass the originally requested size to KFENCE, add an argument
      'orig_size' to slab_alloc*(). The additional argument is required to
      preserve the requested original size for kmalloc() allocations, which
      uses size classes (e.g. an allocation of 272 bytes will return an object
      of size 512). Therefore, kmem_cache::size does not represent the
      kmalloc-caller's requested size, and we must introduce the argument
      'orig_size' to propagate the originally requested size to KFENCE.
      
      Without the originally requested size, we would not be able to detect
      out-of-bounds accesses for objects placed at the end of a KFENCE object
      page if that object is not equal to the kmalloc-size class it was
      bucketed into.
      
      When KFENCE is disabled, there is no additional overhead, since
      slab_alloc*() functions are __always_inline.
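
      A simplified sketch of where the hook sits and why orig_size is threaded
      through; the names follow the description above, but this is not the
      literal upstream diff:

      static __always_inline void *
      slab_alloc(struct kmem_cache *cachep, gfp_t flags, size_t orig_size,
      	   unsigned long caller)
      {
      	void *objp;

      	/*
      	 * Give KFENCE a chance to serve the allocation from a guarded page.
      	 * orig_size (e.g. 272) is needed because cachep->size is the
      	 * rounded-up kmalloc size class (e.g. 512), which would otherwise
      	 * hide out-of-bounds accesses between the two sizes.
      	 */
      	objp = kfence_alloc(cachep, orig_size, flags);
      	if (unlikely(objp))
      		return objp;

      	/* ... regular SLAB fast path continues here ... */
      	return ____cache_alloc(cachep, flags);
      }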
      
      Link: https://lkml.kernel.org/r/20201103175841.3495947-5-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Co-developed-by: Marco Elver <elver@google.com>
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3fb45f3
  2. 25 Feb 2021, 5 commits
    • kasan: fix bug detection via ksize for HW_TAGS mode · 611806b4
      Andrey Konovalov authored
      The currently existing kasan_check_read/write() annotations are intended
      to be used for kernel modules that have KASAN compiler instrumentation
      disabled. Thus, they are only relevant for the software KASAN modes that
      rely on compiler instrumentation.
      
      However there's another use case for these annotations: ksize() checks
      that the object passed to it is indeed accessible before unpoisoning the
      whole object. This is currently done via __kasan_check_read(), which is
      compiled away for the hardware tag-based mode that doesn't rely on
      compiler instrumentation. As a result, KASAN misses some memory
      corruptions.
      
      Provide another annotation called kasan_check_byte() that is available
      for all KASAN modes. For the implementation, rename and reuse
      kasan_check_invalid_free(). Use this new annotation in ksize().
      To avoid having ksize() as the top frame in the reported stack trace,
      pass _RET_IP_ to __kasan_check_byte().
      
      Also add a new ksize_uaf() test that checks that a use-after-free is
      detected via ksize() itself, and via plain accesses that happen later.
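
      A sketch of the resulting ksize() flow, simplified from the description
      above (the helpers shown are the ones this message references; the
      surrounding details are assumptions):

      size_t ksize(const void *objp)
      {
      	size_t size;

      	/*
      	 * Check one byte first, so the hardware tag-based mode also catches
      	 * bad pointers, and report from ksize()'s caller via _RET_IP_.
      	 */
      	if (unlikely(ZERO_OR_NULL_PTR(objp)) ||
      	    !__kasan_check_byte(objp, _RET_IP_))
      		return 0;

      	size = __ksize(objp);
      	/* Callers may use the whole allocated area, so unpoison all of it. */
      	kasan_unpoison_range(objp, size);
      	return size;
      }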
      
      Link: https://linux-review.googlesource.com/id/Iaabf771881d0f9ce1b969f2a62938e99d3308ec5
      Link: https://lkml.kernel.org/r/f32ad74a60b28d8402482a38476f02bb7600f620.1610733117.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Alexander Potapenko <glider@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      611806b4
    • mm: memcontrol: fix slub memory accounting · 96403bfe
      Muchun Song authored
      SLUB currently accounts kmalloc() and kmalloc_node() allocations larger
      than an order-1 page per node, but it forgets to update the per-memcg
      vmstats.  This can lead to inaccurate "slab_unreclaimable" statistics in
      memory.stat.  Fix it by using mod_lruvec_page_state() instead of
      mod_node_page_state().
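
      A sketch of the accounting change on the large-kmalloc path that goes
      straight to the page allocator; the counter name shown is an assumption
      based on the v5.12-era code:

      page = alloc_pages_node(node, flags, order);
      if (page) {
      	ptr = page_address(page);
      	/*
      	 * Charge through the page's lruvec so the per-memcg counter behind
      	 * memory.stat's slab_unreclaimable is updated too, not only the
      	 * per-node counter.
      	 */
      	mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B,
      			      PAGE_SIZE << order);
      }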
      
      Link: https://lkml.kernel.org/r/20210223092423.42420-1-songmuchun@bytedance.com
      Fixes: 6a486c0a ("mm, sl[ou]b: improve memory accounting")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96403bfe
    • mm, slab, slub: stop taking cpu hotplug lock · 59450bbc
      Vlastimil Babka authored
      SLAB has been using get/put_online_cpus() around creating, destroying and
      shrinking kmem caches since 95402b38 ("cpu-hotplug: replace
      per-subsystem mutexes with get_online_cpus()") in 2008, which was supposed
      to replace a private mutex (cache_chain_mutex, called slab_mutex
      today) with a system-wide mechanism, but in the case of SLAB it is in fact
      used in addition to the existing mutex, without an explanation why.
      
      SLUB appears to have avoided the cpu hotplug lock initially, but gained it
      due to common code unification, such as 20cea968 ("mm, sl[aou]b: Move
      kmem_cache_create mutex handling to common code").
      
      Regardless of the history, checking if the hotplug lock is actually needed
      today suggests that it's not, and therefore it's better to avoid this
      system-wide lock and the ordering this imposes wrt other locks (such as
      slab_mutex).
      
      Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache()
      protected by slab_mutex, and cpu hotplug callbacks that also take the
      slab_mutex, which is also taken by the common slab functions that
      currently also take the hotplug lock.  Thus the slab_mutex protection
      should be sufficient.  Also, per-cpu array caches are allocated for each
      possible cpu, so they are not affected by the online/offline state.
      
      In SLUB we have for_each_online_cpu() in functions that show statistics
      and are already unprotected today, as racing with hotplug is not harmful.
      Otherwise SLUB relies on percpu allocator.  The slub_cpu_dead() hotplug
      callback takes the slab_mutex.
      
      To sum up, this patch removes get/put_online_cpus() calls from slab as it
      should be safe without further adjustments.
      
      Link: https://lkml.kernel.org/r/20210113131634.3671-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Qian Cai <cai@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59450bbc
    • mm, slab, slub: stop taking memory hotplug lock · 7e1fa93d
      Vlastimil Babka authored
      Since commit 03afc0e2 ("slab: get_online_mems for
      kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
      SLAB and SLUB when creating, destroying or shrinking a cache.  It is quite
      a heavy lock and it's best to avoid it if possible, as we had several
      issues with lockdep complaining about ordering in the past, see e.g.
      e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()").
      
      The problem scenario in 03afc0e2 (solved by the memory hotplug lock)
      can be summarized as follows: while there's slab_mutex synchronizing new
      kmem cache creation and SLUB's MEM_GOING_ONLINE callback
      slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
      for the hotplugged node in the new kmem cache, because the hotplug
      callback doesn't yet see the new cache, and cache creation in
      init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
      N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
      as that happens only later after the MEM_GOING_ONLINE callback.
      
      Instead of using get/put_online_mems(), the problem can be solved by SLUB
      maintaining its own nodemask of nodes for which it has allocated the
      per-node kmem_cache_node structures.  This nodemask would generally mirror
      the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's
      control, in its memory hotplug callbacks, under the slab_mutex.  This patch
      adds such a nodemask and its handling.
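
      A sketch of the idea; the nodemask is shown here as slab_nodes and the
      details around it are assumptions based on the description above:

      /* Nodes for which SLUB has allocated kmem_cache_node structures. */
      static nodemask_t slab_nodes;

      static int slab_mem_going_online_callback(void *arg)
      {
      	struct memory_notify *marg = arg;
      	int nid = marg->status_change_nid;

      	if (nid < 0)
      		return 0;

      	mutex_lock(&slab_mutex);
      	/* ... allocate a kmem_cache_node for nid in every cache ... */
      	node_set(nid, slab_nodes);
      	mutex_unlock(&slab_mutex);
      	return 0;
      }

      /* Cache creation then iterates slab_nodes instead of N_NORMAL_MEMORY,
       * so a node that is concurrently going online cannot be missed. */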
      
      Commit 03afc0e2 mentions "issues like [the one above]", but there
      don't appear to be further issues.  All the paths (shared for SLAB and
      SLUB) taking the memory hotplug locks are also taking the slab_mutex,
      except kmem_cache_shrink() where 03afc0e2 replaced slab_mutex with
      get/put_online_mems().
      
      We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
      SLUB can enter the function from a write to the sysfs 'shrink' file, thus
      holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
      within slab_mutex.  But on closer inspection we don't actually need to
      protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
      __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
      added in parallel hotplug is not fatal, and parallel hotremove does not
      free kmem_cache_node's anymore after the previous patch, so use-after-free
      cannot happen.  The per-node shrinking itself is protected by
      n->list_lock.  Same is true for SLAB, and SLOB is no-op.
      
      SLAB also doesn't need the memory hotplug locking, which it only gained by
      03afc0e2 through the shared paths in slab_common.c.  Its memory
      hotplug callbacks are also protected by slab_mutex against races with
      these paths.  The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
      to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
      node is already set there during the MEM_GOING_ONLINE callback, so no
      special care is needed for SLAB.
      
      As such, this patch removes all get/put_online_mems() usage by the slab
      subsystem.
      
      Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Qian Cai <cai@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e1fa93d
    • mm/sl?b.c: remove ctor argument from kmem_cache_flags · 37540008
      Nikolay Borisov authored
      This argument hasn't been used since e153362a ("slub: Remove objsize
      check in kmem_cache_flags()") so simply remove it.
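
      In sketch form, the resulting prototype change (shown for illustration;
      argument order as in mm/slab.h of that era):

      /* before */
      slab_flags_t kmem_cache_flags(unsigned int object_size, slab_flags_t flags,
      			      const char *name, void (*ctor)(void *));

      /* after: the unused ctor argument is gone */
      slab_flags_t kmem_cache_flags(unsigned int object_size, slab_flags_t flags,
      			      const char *name);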
      
      Link: https://lkml.kernel.org/r/20210126095733.974665-1-nborisov@suse.com
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      37540008
  3. 23 Jan 2021, 1 commit
    • mm: Add mem_dump_obj() to print source of memory block · 8e7f37f2
      Paul E. McKenney authored
      There are kernel facilities such as per-CPU reference counts that give
      error messages in generic handlers or callbacks, whose messages are
      unenlightening.  In the case of per-CPU reference-count underflow, this
      is not a problem when creating a new use of this facility because in that
      case the bug is almost certainly in the code implementing that new use.
      However, trouble arises when deploying across many systems, which might
      exercise corner cases that were not seen during development and testing.
      Here, it would be really nice to get some kind of hint as to which of
      several uses the underflow was caused by.
      
      This commit therefore exposes a mem_dump_obj() function that takes
      a pointer to memory (which must still be allocated if it has been
      dynamically allocated) and prints available information on where that
      memory came from.  This pointer can reference the middle of the block as
      well as the beginning of the block, as needed by things like RCU callback
      functions and timer handlers that might not know where the beginning of
      the memory block is.  These functions and handlers can use mem_dump_obj()
      to print out better hints as to where the problem might lie.
      
      The information printed can depend on kernel configuration.  For example,
      the allocation return address can be printed only for slab and slub,
      and even then only when the necessary debug has been enabled.  For slab,
      build with CONFIG_DEBUG_SLAB=y, and either use sizes with ample space
      to the next power of two or use the SLAB_STORE_USER when creating the
      kmem_cache structure.  For slub, build with CONFIG_SLUB_DEBUG=y and
      boot with slub_debug=U, or pass SLAB_STORE_USER to kmem_cache_create()
      if more focused use is desired.  Also for slub, use CONFIG_STACKTRACE
      to enable printing of the allocation-time stack trace.
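
      A hypothetical caller, to illustrate the intended use (the handler name
      and message here are made up for the example):

      static void report_ref_underflow(void *obj)
      {
      	pr_err("percpu ref underflow for object %px\n", obj);
      	/* Prints slab/vmalloc provenance and, when the debug options above
      	 * allow it, the allocation return address or stack trace. */
      	mem_dump_obj(obj);
      }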
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: <linux-mm@kvack.org>
      Reported-by: Andrii Nakryiko <andrii@kernel.org>
      [ paulmck: Convert to printing and change names per Joonsoo Kim. ]
      [ paulmck: Move slab definition per Stephen Rothwell and kbuild test robot. ]
      [ paulmck: Handle CONFIG_MMU=n case where vmalloc() is kmalloc(). ]
      [ paulmck: Apply Vlastimil Babka feedback on slab.c kmem_provenance(). ]
      [ paulmck: Extract more info from !SLUB_DEBUG per Joonsoo Kim. ]
      [ paulmck: Explicitly check for small pointers per Naresh Kamboju. ]
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
      8e7f37f2
  4. 23 Dec 2020, 2 commits
  5. 16 Dec 2020, 2 commits
    • mm: slab: clarify krealloc()'s behavior with __GFP_ZERO · 15d5de49
      Bartosz Golaszewski authored
      Patch series "slab: provide and use krealloc_array()", v3.
      
      Andy brought to my attention the fact that users allocating an array of
      equally sized elements should check if the size multiplication doesn't
      overflow.  This is why we have helpers like kmalloc_array().
      
      However we don't have a krealloc_array() equivalent, and there are many users
      who do their own multiplication when calling krealloc() for arrays.
      
      This series provides krealloc_array() and uses it in a couple places.
      
      A separate series will follow adding devm_krealloc_array() which is needed
      in the xilinx adc driver.
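
      A sketch of what such a helper looks like, based on the description
      above (the upstream helper may differ in detail):

      static inline void *krealloc_array(void *p, size_t new_n, size_t new_size,
      				   gfp_t flags)
      {
      	size_t bytes;

      	/* Refuse a multiplication that would overflow instead of wrapping. */
      	if (unlikely(check_mul_overflow(new_n, new_size, &bytes)))
      		return NULL;

      	return krealloc(p, bytes, flags);
      }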
      
      This patch (of 9):
      
      __GFP_ZERO is ignored by krealloc() (unless we fall back to the kmalloc()
      path, in which case it is honored).  Point that out in the kerneldoc.
      
      Link: https://lkml.kernel.org/r/20201109110654.12547-1-brgl@bgdev.pl
      Link: https://lkml.kernel.org/r/20201109110654.12547-2-brgl@bgdev.pl
      Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
      Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: Gustavo Padovan <gustavo@padovan.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Robert Richter <rric@kernel.org>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Maxime Ripard <mripard@kernel.org>
      Cc: Thomas Zimmermann <tzimmermann@suse.de>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: "Michael S . Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jaroslav Kysela <perex@perex.cz>
      Cc: Takashi Iwai <tiwai@suse.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15d5de49
    • mm/slab_common.c: use list_for_each_entry in dump_unreclaimable_slab() · 7714304f
      Hui Su authored
      dump_unreclaimable_slab() acquires the slab_mutex first, and it does not
      remove any entry from the slab_caches list while iterating over it.
      
      Thus we do not need list_for_each_entry_safe() here, which exists to
      guard against removal of the current list entry.
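
      A sketch of the simplified iteration (body details omitted; the trylock
      mirrors how this dump path typically behaves under memory pressure and is
      an assumption here):

      void dump_unreclaimable_slab(void)
      {
      	struct kmem_cache *s;

      	if (!mutex_trylock(&slab_mutex))
      		return;

      	/* Nothing removes entries from slab_caches while slab_mutex is held,
      	 * so the non-_safe iterator is sufficient. */
      	list_for_each_entry(s, &slab_caches, list) {
      		/* ... print per-cache usage ... */
      	}
      	mutex_unlock(&slab_mutex);
      }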
      
      Link: https://lkml.kernel.org/r/20200926043440.GA180545@rlk
      Signed-off-by: Hui Su <sh_def@163.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7714304f
  6. 13 Aug 2020, 1 commit
  7. 08 Aug 2020, 11 commits
  8. 25 Jul 2020, 1 commit
    • mm: memcg/slab: fix memory leak at non-root kmem_cache destroy · d38a2b7a
      Muchun Song authored
      If the kmem_cache refcount is greater than one, we should not mark the
      root kmem_cache as dying.  If we mark the root kmem_cache dying
      incorrectly, the non-root kmem_cache can never be destroyed, which
      results in a memory leak when the memcg is destroyed.  The issue can be
      reproduced with the following steps:
      
        1) Use kmem_cache_create() to create a new kmem_cache named A.
        2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
           so the refcount of B is just increased.
        3) Use kmem_cache_destroy() to destroy the kmem_cache A; this only
           decreases B's refcount but marks B as dying.
        4) Create a new memory cgroup and allocate memory from the kmem_cache
           B.  This leads to the creation of a non-root kmem_cache.
        5) When the memory cgroup created in step 4) is destroyed, the
           non-root kmem_cache can never be destroyed.
      
      Repeating steps 4) and 5) leaks more and more memory.  So mark the root
      kmem_cache as dying only when its refcount reaches zero, as sketched
      below.
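
      A simplified sketch of the idea in kmem_cache_destroy(); this is not the
      literal upstream diff and the helper call is an assumption:

      mutex_lock(&slab_mutex);
      s->refcount--;
      if (s->refcount) {
      	/* Other aliases still use this root cache: do not mark it dying. */
      	mutex_unlock(&slab_mutex);
      	return;
      }
      mutex_unlock(&slab_mutex);

      /* Only now, with the last reference gone, start the teardown that marks
       * the root cache dying and flushes the memcg workqueue. */
      flush_memcg_workqueue(s);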
      
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200716165103.83462-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d38a2b7a
  9. 26 Jun 2020, 1 commit
    • mm/slab: use memzero_explicit() in kzfree() · 8982ae52
      Waiman Long authored
      The kzfree() function is normally used to clear sensitive information,
      like encryption keys, in a buffer before freeing it back to the pool.
      memset() is currently used for the clearing.  However unlikely, there is
      still a non-zero probability that the compiler may choose to optimize
      away the memory clearing, especially if LTO is used in the future.
      
      To make sure that this optimization can never happen,
      memzero_explicit(), which was introduced in v3.18, is now used in
      kzfree() to future-proof it.
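
      The resulting kzfree(), in sketch form (close to the description above;
      treat details as illustrative):

      void kzfree(const void *p)
      {
      	size_t ks;
      	void *mem = (void *)p;

      	if (unlikely(ZERO_OR_NULL_PTR(mem)))
      		return;
      	ks = ksize(mem);
      	/* memzero_explicit() cannot be elided by the compiler, unlike memset(). */
      	memzero_explicit(mem, ks);
      	kfree(mem);
      }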
      
      Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
      Fixes: 3ef0e5ba ("slab: introduce kzfree()")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8982ae52
  10. 03 Jun 2020, 1 commit
  11. 11 Apr 2020, 1 commit
  12. 08 Apr 2020, 1 commit
    • proc: faster open/read/close with "permanent" files · d919b33d
      Alexey Dobriyan authored
      Now that "struct proc_ops" exist we can start putting there stuff which
      could not fly with VFS "struct file_operations"...
      
      Most of fs/proc/inode.c file is dedicated to make open/read/.../close
      reliable in the event of disappearing /proc entries which usually happens
      if module is getting removed.  Files like /proc/cpuinfo which never
      disappear simply do not need such protection.
      
      Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
      "permanent" files.
      
      Enable "permanent" flag for
      
      	/proc/cpuinfo
      	/proc/kmsg
      	/proc/modules
      	/proc/slabinfo
      	/proc/stat
      	/proc/sysvipc/*
      	/proc/swaps
      
      More will come once I figure out a foolproof way to prevent module
      authors from marking their stuff "permanent" for performance reasons
      when it is not.
      
      This should help with scalability: the benchmark is "read /proc/cpuinfo R
      times by N threads scattered over the system".
      
      	N	R	t, s (before)	t, s (after)
      	-----------------------------------------------------
      	64	4096	1.582458	1.530502	-3.2%
      	256	4096	6.371926	6.125168	-3.9%
      	1024	4096	25.64888	24.47528	-4.6%
      
      Benchmark source:
      
      #include <chrono>
      #include <iostream>
      #include <thread>
      #include <vector>
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      
      const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
      int N;
      const char *filename;
      int R;
      
      int xxx = 0;
      
      int glue(int n)
      {
      	cpu_set_t m;
      	CPU_ZERO(&m);
      	CPU_SET(n, &m);
      	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
      }
      
      void f(int n)
      {
      	glue(n % NR_CPUS);
      
      	while (*(volatile int *)&xxx == 0) {
      	}
      
      	for (int i = 0; i < R; i++) {
      		int fd = open(filename, O_RDONLY);
      		char buf[4096];
      		ssize_t rv = read(fd, buf, sizeof(buf));
      		asm volatile ("" :: "g" (rv));
      		close(fd);
      	}
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc < 4) {
      		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
      ";
      		return 1;
      	}
      
      	N = atoi(argv[1]);
      	filename = argv[2];
      	R = atoi(argv[3]);
      
      	for (int i = 0; i < NR_CPUS; i++) {
      		if (glue(i) == 0)
      			break;
      	}
      
      	std::vector<std::thread> T;
      	T.reserve(N);
      	for (int i = 0; i < N; i++) {
      		T.emplace_back(f, i);
      	}
      
      	auto t0 = std::chrono::system_clock::now();
      	{
      		*(volatile int *)&xxx = 1;
      		for (auto& t: T) {
      			t.join();
      		}
      	}
      	auto t1 = std::chrono::system_clock::now();
      	std::chrono::duration<double> dt = t1 - t0;
      	std::cout << dt.count() << '\n';
      
      	return 0;
      }
      
      P.S.:
      Explicit randomization marker is added because adding non-function pointer
      will silently disable structure layout randomization.
      
      [akpm@linux-foundation.org: coding style fixes]
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d919b33d
  13. 03 Apr 2020, 1 commit
    • mm, memcg: fix build error around the usage of kmem_caches · a87425a3
      Yafang Shao authored
      When I manually set the default of MEMCG_KMEM to n in init/Kconfig, the
      error below occurs:
      
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_start(&memcg->kmem_caches, *pos);
                                      ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_next(p, &memcg->kmem_caches, pos);
                                        ^
        mm/slab_common.c: In function 'memcg_slab_show':
        mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          if (p == memcg->kmem_caches.next)
                        ^
          CC      arch/x86/xen/smp.o
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1531:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1538:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
      
      That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is set,
      while memcg_slab_start() uses it regardless of whether CONFIG_MEMCG_KMEM
      is defined.
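
      A sketch of the kind of guard that resolves this (the exact shape of the
      fix, and the mem_cgroup_from_seq() call shown here, are assumptions):

      #ifdef CONFIG_MEMCG_KMEM
      void *memcg_slab_start(struct seq_file *m, loff_t *pos)
      {
      	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);

      	mutex_lock(&slab_mutex);
      	return seq_list_start(&memcg->kmem_caches, *pos);
      }
      /* memcg_slab_next() and memcg_slab_show() are guarded the same way, so
       * they only reference memcg->kmem_caches when the field exists. */
      #endif /* CONFIG_MEMCG_KMEM */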
      
      By the way, the reason I manually disabled CONFIG_MEMCG_KMEM was to verify
      whether some of my other code changes are still stable when CONFIG_MEMCG_KMEM
      is not set.  Unfortunately, the existing code has already been unstable since
      v4.11.
      
      Fixes: bc2791f8 ("slab: link memcg kmem_caches on their associated memory cgroup")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a87425a3
  14. 04 Feb 2020, 2 commits
  15. 14 Jan 2020, 1 commit
    • mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210
      Adrian Huang authored
      When booting with amd_iommu=off, the following WARNING message
      appears:
      
        AMD-Vi: AMD IOMMU disabled on kernel command-line
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
        Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
        RIP: 0010:flush_workqueue+0x42e/0x450
        Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
        Call Trace:
         kmem_cache_destroy+0x69/0x260
         iommu_go_to_state+0x40c/0x5ab
         amd_iommu_prepare+0x16/0x2a
         irq_remapping_prepare+0x36/0x5f
         enable_IR_x2apic+0x21/0x172
         default_setup_apic_routing+0x12/0x6f
         apic_intr_mode_init+0x1a1/0x1f1
         x86_late_time_init+0x17/0x1c
         start_kernel+0x480/0x53f
         secondary_startup_64+0xb6/0xc0
        ---[ end trace 30894107c3749449 ]---
        x2apic: IRQ remapping doesn't support X2APIC mode
        x2apic disabled
      
      The warning is caused by the call to 'kmem_cache_destroy()'
      in free_iommu_resources().  Here is the call path:
      
        free_iommu_resources
          kmem_cache_destroy
            flush_memcg_workqueue
              flush_workqueue
      
      The root cause is that the IOMMU subsystem runs before the workqueue
      subsystem, at which point the variable 'wq_online' is still 'false'.
      This makes the check 'if (WARN_ON(!wq_online))' in flush_workqueue()
      trigger.
      
      Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
      time, it is unnecessary to call flush_memcg_workqueue().  Skipping the
      call prevents the WARNING triggered by flush_workqueue().
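
      A minimal sketch of the guard (the exact placement in the real fix may
      differ):

      static void flush_memcg_workqueue(struct kmem_cache *s)
      {
      	/* ... mark the cache dying, wait for pending RCU callbacks ... */

      	/*
      	 * Early boot callers (e.g. the IOMMU init path) can reach
      	 * kmem_cache_destroy() before the memcg workqueue is allocated;
      	 * only flush when it actually exists.
      	 */
      	if (likely(memcg_kmem_cache_wq))
      		flush_workqueue(memcg_kmem_cache_wq);
      }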
      
      Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Reported-by: Xiaochun Lee <lixc17@lenovo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2fe20210
  16. 05 Dec 2019, 1 commit
    • mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction · a264df74
      Roman Gushchin authored
      Christian reported a warning like the following obtained during running
      some KVM-related tests on s390:
      
          WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
          Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
          CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
          Hardware name: IBM 2964 NC9 712 (LPAR)
          Workqueue: events sysfs_slab_remove_workfn
          Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
                     R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
          Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
                     0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
                     0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
                     0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
          Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
                     0000001529746848: 47000700            bc         0,1792
                    #000000152974684c: a7f40001            brc        15,152974684e
                    >0000001529746850: a7f4fff2            brc        15,1529746834
                     0000001529746854: 0707                bcr        0,%r7
                     0000001529746856: 0707                bcr        0,%r7
                     0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
                     000000152974685e: a738ffff            lhi        %r3,-1
          Call Trace:
          ([<000003e009263d00>] 0x3e009263d00)
           [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
           [<0000001529b04882>] kobject_put+0xaa/0xe8
           [<000000152918cf28>] process_one_work+0x1e8/0x428
           [<000000152918d1b0>] worker_thread+0x48/0x460
           [<00000015291942c6>] kthread+0x126/0x160
           [<0000001529b22344>] ret_from_fork+0x28/0x30
           [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
          Last Breaking-Event-Address:
           [<000000152974684c>] percpu_ref_exit+0x4c/0x58
          ---[ end trace b035e7da5788eb09 ]---
      
      The problem occurs because kmem_cache_destroy() is called immediately
      after deleting of a memcg, so it races with the memcg kmem_cache
      deactivation.
      
      flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
      supposed to guarantee that all deactivation processes are finished, but
      failed to do so.  It waits for an rcu grace period, after which all
      children kmem_caches should be deactivated.  During the deactivation
      percpu_ref_kill() is called for non root kmem_cache refcounters, but it
      requires yet another rcu grace period to finish the transition to the
      atomic (dead) state.
      
      So in a rare case when not all children kmem_caches are destroyed at the
      moment when the root kmem_cache is about to be gone, we need to wait
      another rcu grace period before destroying the root kmem_cache.
      
      This issue can be triggered only with dynamically created kmem_caches
      which are used with memcg accounting.  In this case per-memcg child
      kmem_caches are created.  They are deactivated from the cgroup removing
      path.  If the destruction of the root kmem_cache is racing with the
      removal of the cgroup (both are quite complicated multi-stage
      processes), the described issue can occur.  The only known way to
      trigger it in the real life, is to unload some kernel module which
      creates a dedicated kmem_cache, used from different memory cgroups with
      GFP_ACCOUNT flag.  If the unloading happens immediately after calling
      rmdir on the corresponding cgroup, there is some chance to trigger the
      issue.
      
      Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
      Fixes: f0a3a24b ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a264df74
  17. 01 Dec 2019, 3 commits
  18. 19 Oct 2019, 1 commit
    • mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release · b749ecfa
      Roman Gushchin authored
      Karsten reported the following panic in __free_slab() happening on a s390x
      machine:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
        Oops: 0004 ilc:3 [#1] PREEMPT SMP
        Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
        Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
        Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
        Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
                   0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
        Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
                   00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
                  #00000000003cadb2: a7f4fe36            brc     15,3caa1e
                  >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
                   00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
                   00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
                   00000000003cadc6: a7f4fe2c            brc     15,3caa1e
                   00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
        Call Trace:
        (<00000000003cabcc> __free_slab+0x49c/0x6b0)
         <00000000001f5886> rcu_core+0x5a6/0x7e0
         <0000000000ca2dea> __do_softirq+0xf2/0x5c0
         <0000000000152644> irq_exit+0x104/0x130
         <000000000010d222> do_IRQ+0x9a/0xf0
         <0000000000ca2344> ext_int_handler+0x130/0x134
         <0000000000103648> enabled_wait+0x58/0x128
        (<0000000000103634> enabled_wait+0x44/0x128)
         <0000000000103b00> arch_cpu_idle+0x40/0x58
         <0000000000ca0544> default_idle_call+0x3c/0x68
         <000000000018eaa4> do_idle+0xec/0x1c0
         <000000000018ee0e> cpu_startup_entry+0x36/0x40
         <000000000122df34> arch_call_rest_init+0x5c/0x88
         <0000000000000000> 0x0
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         <00000000003ca8f4> __free_slab+0x1c4/0x6b0
        Kernel panic - not syncing: Fatal exception in interrupt
      
      The kernel panics on an attempt to dereference the NULL memcg pointer.
      When shutdown_cache() is called from the kmem_cache_destroy() context, a
      memcg kmem_cache might have empty slab pages in a partial list, which are
      still charged to the memory cgroup.
      
      These pages are released by free_partial() at the beginning of
      shutdown_cache(): either directly or by scheduling a RCU-delayed work
      (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
      is when the reported panic can happen: memcg_unlink_cache() is called
      immediately after shrinking partial lists, without waiting for scheduled
      RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
      and the following attempt to dereference it by __free_slab() from the
      RCU work context causes the panic.
      
      To fix the issue, let's postpone the release of the memcg pointer to
      destroy_memcg_params().  It's called from a separate work context by
      slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
      This guarantees that all scheduled page release RCU works will complete
      before the memcg pointer will be zeroed.
      
      Big thanks to Karsten for the perfect report containing all the necessary
      information, for his help with the analysis of the problem, and for
      testing the fix.
      
      Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
      Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Karsten Graul <kgraul@linux.ibm.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b749ecfa
  19. 08 Oct 2019, 2 commits
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Vlastimil Babka authored
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment, until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocate a larger size and only
      give back part of it.  That would be wasteful, especially with a generic
      alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide to mm users what they need without difficult
      workarounds or own reimplementations, so let's make the kmalloc()
      alignment to size explicitly guaranteed for power-of-two sizes under all
      configurations.  What this means for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting a x86_64 kernel+userspace with virtme, around 16MB memory
        was consumed by slab pages both before and after the patch, with
        difference in the noise.
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59bb4798
    • mm, sl[ou]b: improve memory accounting · 6a486c0a
      Vlastimil Babka authored
      Patch series "guarantee natural alignment for kmalloc()", v2.
      
      This patch (of 2):
      
      SLOB currently doesn't account its pages at all, so in /proc/meminfo the
      Slab field shows zero.  Modifying a counter on page allocation and
      freeing should be acceptable even for the small system scenarios SLOB is
      intended for.  Since reclaimable caches are not separated in SLOB,
      account everything as unreclaimable.
      
      SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
      larger than order-1 page, that are passed directly to the page
      allocator.  As they also don't appear in /proc/slabinfo, it might look
      like a memory leak.  For consistency, account them as well.  (SLAB
      doesn't actually use page allocator directly, so no change there).
      
      Ideally SLOB and SLUB would be handled in separate patches, but due to
      the shared kmalloc_order() function and different kfree()
      implementations, it's easier to patch both at once to prevent
      inconsistencies.
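
      A sketch of the idea for SLUB's large-kmalloc path; the counter name
      reflects the v5.4-era code and is an assumption here:

      page = alloc_pages(flags, order);
      if (page) {
      	ret = page_address(page);
      	/*
      	 * Count page-allocator-backed kmalloc memory as unreclaimable slab,
      	 * so it shows up in /proc/meminfo (Slab / SUnreclaim) and vmstat.
      	 */
      	mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
      			    1 << order);
      }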
      
      Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6a486c0a