1. 21 October 2021 (1 commit)
  2. 14 July 2021 (2 commits)
    • mm, slab, slub: stop taking cpu hotplug lock · 7fc60ad1
      Authored by Vlastimil Babka
      mainline inclusion
      from mainline-5.12-rc1
      commit 59450bbc
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZY8M
      CVE: NA
      
      -------------------------------------------------
      
      SLAB has been using get/put_online_cpus() around creating, destroying and
      shrinking kmem caches since 95402b38 ("cpu-hotplug: replace
      per-subsystem mutexes with get_online_cpus()") in 2008, which is supposed
      to be replacing a private mutex (cache_chain_mutex, called slab_mutex
      today) with system-wide mechanism, but in case of SLAB it's in fact used
      in addition to the existing mutex, without explanation why.
      
      SLUB appears to have avoided the cpu hotplug lock initially, but gained it
      due to common code unification, such as 20cea968 ("mm, sl[aou]b: Move
      kmem_cache_create mutex handling to common code").
      
      Regardless of the history, checking if the hotplug lock is actually needed
      today suggests that it's not, and therefore it's better to avoid this
      system-wide lock and the ordering this imposes wrt other locks (such as
      slab_mutex).
      
      Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache()
      protected by slab_mutex, and cpu hotplug callbacks that also take the
      slab_mutex, which is also taken by the common slab functions that currently
      also take the hotplug lock.  Thus the slab_mutex protection should be
      sufficient.  Also per-cpu array caches are allocated for each possible
      cpu, so not affected by their online/offline state.
      
      In SLUB we have for_each_online_cpu() in functions that show statistics
      and are already unprotected today, as racing with hotplug is not harmful.
      Otherwise SLUB relies on percpu allocator.  The slub_cpu_dead() hotplug
      callback takes the slab_mutex.
      
      To sum up, this patch removes get/put_online_cpus() calls from slab as it
      should be safe without further adjustments.
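
      The resulting locking in the common cache-management paths can be
      sketched as follows (a minimal illustration, not the upstream diff;
      example_cache_update() is a made-up stand-in for the shared
      create/destroy/shrink helpers):

      #include <linux/mutex.h>
      #include <linux/slab.h>

      extern struct mutex slab_mutex;		/* defined in mm/slab.h */

      /* Hypothetical stand-in for the shared create/destroy/shrink helpers. */
      static void example_cache_update(struct kmem_cache *s)
      {
              /* Previously: get_online_cpus(); -- no longer taken. */
              mutex_lock(&slab_mutex);
              /*
               * SLAB's do_tune_cpucache() and the cpu hotplug callbacks already
               * run under slab_mutex, and per-cpu array caches exist for every
               * possible cpu, so slab_mutex alone is sufficient here.
               */
              mutex_unlock(&slab_mutex);
              /* Previously: put_online_cpus(); */
      }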
      
      Link: https://lkml.kernel.org/r/20210113131634.3671-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Qian Cai <cai@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • mm, slab, slub: stop taking memory hotplug lock · 19ede649
      Authored by Vlastimil Babka
      mainline inclusion
      from mainline-5.12-rc1
      commit 7e1fa93d
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZY8M
      CVE: NA
      
      -------------------------------------------------
      
      Since commit 03afc0e2 ("slab: get_online_mems for
      kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for
      SLAB and SLUB when creating, destroying or shrinking a cache.  It is quite
      a heavy lock and it's best to avoid it if possible, as we had several
      issues with lockdep complaining about ordering in the past, see e.g.
      e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()").
      
      The problem scenario in 03afc0e2 (solved by the memory hotplug lock)
      can be summarized as follows: while there's slab_mutex synchronizing new
      kmem cache creation and SLUB's MEM_GOING_ONLINE callback
      slab_mem_going_online_callback(), we may miss creation of kmem_cache_node
      for the hotplugged node in the new kmem cache, because the hotplug
      callback doesn't yet see the new cache, and cache creation in
      init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the
      N_NORMAL_MEMORY nodemask, which however may not yet include the new node,
      as that happens only later after the MEM_GOING_ONLINE callback.
      
      Instead of using get/put_online_mems(), the problem can be solved by SLUB
      maintaining its own nodemask of nodes for which it has allocated the
      per-node kmem_cache_node structures.  This nodemask would generally mirror
      the N_NORMAL_MEMORY nodemask, but would be updated only under SLUB's
      control in its memory hotplug callbacks under the slab_mutex.  This patch
      adds such a nodemask and its handling.
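
      A minimal sketch of such a nodemask and where it gets updated is shown
      below (simplified; the mask added upstream is called slab_nodes, while
      the surrounding helpers here are illustrative):

      #include <linux/nodemask.h>
      #include <linux/mutex.h>

      extern struct mutex slab_mutex;

      /* Nodes for which SLUB has allocated kmem_cache_node structures. */
      static nodemask_t slab_nodes;

      /* Cache creation: only set up per-node data for nodes SLUB knows about. */
      static void example_init_kmem_cache_nodes(void)
      {
              int node;

              for_each_node_mask(node, slab_nodes) {
                      /* allocate and initialize kmem_cache_node for 'node' */
              }
      }

      /* MEM_GOING_ONLINE callback: publish the new node under slab_mutex. */
      static void example_mem_going_online(int nid)
      {
              mutex_lock(&slab_mutex);
              /* ... allocate kmem_cache_node for every existing cache ... */
              node_set(nid, slab_nodes);
              mutex_unlock(&slab_mutex);
      }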
      
      Commit 03afc0e2 mentions "issues like [the one above]", but there
      don't appear to be further issues.  All the paths (shared for SLAB and
      SLUB) taking the memory hotplug locks are also taking the slab_mutex,
      except kmem_cache_shrink() where 03afc0e2 replaced slab_mutex with
      get/put_online_mems().
      
      We however cannot simply restore slab_mutex in kmem_cache_shrink(), as
      SLUB can enter the function from a write to the sysfs 'shrink' file, thus
      holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested
      within slab_mutex.  But on closer inspection we don't actually need to
      protect kmem_cache_shrink() from hotplug callbacks: While SLUB's
      __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node
      added in parallel hotplug is not fatal, and parallel hotremove does not
      free kmem_cache_node's anymore after the previous patch, so use-after-free
      cannot happen.  The per-node shrinking itself is protected by
      n->list_lock.  The same is true for SLAB, and SLOB is a no-op.
      
      SLAB also doesn't need the memory hotplug locking, which it only gained by
      03afc0e2 through the shared paths in slab_common.c.  Its memory
      hotplug callbacks are also protected by slab_mutex against races with
      these paths.  The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply
      to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new
      node is already set there during the MEM_GOING_ONLINE callback, so no
      special care is needed for SLAB.
      
      As such, this patch removes all get/put_online_mems() usage by the slab
      subsystem.
      
      Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Qian Cai <cai@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Chengyang Fan <cy.fan@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  3. 06 July 2021 (1 commit)
    • mm/slub: fix redzoning for small allocations · 0b7316ba
      Authored by Kees Cook
      stable inclusion
      from stable-5.10.46
      commit 4314c8c63bfdd56ac34d10955023dc10886eafd3
      bugzilla: 168323
      CVE: NA
      
      --------------------------------
      
      commit 74c1d3e0 upstream.
      
      The redzone area for SLUB exists between s->object_size and s->inuse
      (which is at least the word-aligned object_size).  If a cache were
      created with an object_size smaller than sizeof(void *), the in-object
      stored freelist pointer would overwrite the redzone (e.g.  with boot
      param "slub_debug=ZF"):
      
        BUG test (Tainted: G    B            ): Right Redzone overwritten
        -----------------------------------------------------------------------------
      
        INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
        INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
        INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620
      
        Redzone  (____ptrval____): bb bb bb bb bb bb bb bb    ........
        Object   (____ptrval____): f6 f4 a5 40 1d e8          ...@..
        Redzone  (____ptrval____): 1a aa                      ..
        Padding  (____ptrval____): 00 00 00 00 00 00 00 00    ........
      
      Store the freelist pointer out of line when object_size is smaller than
      sizeof(void *) and redzoning is enabled.
      
      Additionally remove the "smaller than sizeof(void *)" check under
      CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
      SLAB and SLOB both handle small sizes.
      
      (Note that no caches within this size range are known to exist in the
      kernel currently.)
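
      Conceptually, the fix adds the small-object case to the set of
      conditions that force the freelist pointer out of the object; a
      simplified sketch of that decision (illustrative names, not the literal
      calculate_sizes() hunk):

      #include <linux/types.h>

      /* Sketch: does the cache need its freelist pointer stored out of line? */
      static bool example_freelist_ptr_out_of_line(unsigned int object_size,
                                                   bool red_zone, bool has_ctor,
                                                   bool rcu_safe)
      {
              /* Cases that always required an out-of-line pointer. */
              if (rcu_safe || has_ctor)
                      return true;
              /*
               * New case: an in-object pointer would spill into the right
               * redzone when the object is smaller than a word.
               */
              if (red_zone && object_size < sizeof(void *))
                      return true;
              return false;
      }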
      
      Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
      Fixes: 81819f0f ("SLUB core")
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Lin, Zhenpeng" <zplin@psu.edu>
      Cc: Marco Elver <elver@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  4. 03 June 2021 (1 commit)
  5. 09 April 2021 (1 commit)
  6. 13 August 2020 (1 commit)
  7. 08 August 2020 (11 commits)
  8. 25 July 2020 (1 commit)
    • mm: memcg/slab: fix memory leak at non-root kmem_cache destroy · d38a2b7a
      Authored by Muchun Song
      If the kmem_cache refcount is greater than one, we should not mark the
      root kmem_cache as dying.  If we mark the root kmem_cache dying
      incorrectly, the non-root kmem_cache can never be destroyed.  This
      resulted in a memory leak when the memcg was destroyed.  We can use the
      following steps to reproduce it.
      
        1) Use kmem_cache_create() to create a new kmem_cache named A.
        2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
           so the refcount of B is just increased.
        3) Use kmem_cache_destroy() to destroy the kmem_cache A, just
           decrease the B's refcount but mark the B as dying.
        4) Create a new memory cgroup and alloc memory from the kmem_cache
           B. It leads to create a non-root kmem_cache for allocating memory.
        5) When destroying the memory cgroup created in step 4), the
           non-root kmem_cache can never be destroyed.
      
      If we repeat steps 4) and 5), this will leak a lot of memory.  So mark
      the root kmem_cache as dying only when its refcount reaches zero.
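
      The essence of the fix can be sketched like this (a simplified view of
      the kmem_cache_destroy() path, with the shutdown step left as a
      comment):

      #include <linux/mutex.h>
      #include <linux/slab.h>

      extern struct mutex slab_mutex;

      static void example_kmem_cache_destroy(struct kmem_cache *s)
      {
              mutex_lock(&slab_mutex);
              s->refcount--;
              if (s->refcount) {
                      /*
                       * The cache is an alias that is still referenced: do NOT
                       * mark it dying, or its memcg children can never go away.
                       */
                      mutex_unlock(&slab_mutex);
                      return;
              }
              /* Only a cache whose last reference is gone is marked dying. */
              /* ... mark dying, then shut down the cache and its children ... */
              mutex_unlock(&slab_mutex);
      }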
      
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200716165103.83462-1-songmuchun@bytedance.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 26 June 2020 (1 commit)
    • mm/slab: use memzero_explicit() in kzfree() · 8982ae52
      Authored by Waiman Long
      The kzfree() function is normally used to clear some sensitive
      information, like encryption keys, in the buffer before freeing it back to
      the pool.  Memset() is currently used for buffer clearing.  However
      unlikely, there is still a non-zero probability that the compiler may
      choose to optimize away the memory clearing especially if LTO is being
      used in the future.
      
      To make sure that this optimization will never happen,
      memzero_explicit(), which was introduced in v3.18, is now used in
      kzfree() to future-proof it.
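
      The resulting kzfree() is essentially the following (a sketch close to
      the upstream implementation; the size has to be recovered with ksize()
      because kzfree() only receives the pointer):

      #include <linux/slab.h>
      #include <linux/string.h>

      void example_kzfree(const void *p)
      {
              size_t ks;
              void *mem = (void *)p;

              if (unlikely(ZERO_OR_NULL_PTR(mem)))
                      return;
              ks = ksize(mem);
              /* memzero_explicit() cannot be optimized away, unlike memset(). */
              memzero_explicit(mem, ks);
              kfree(mem);
      }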
      
      Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
      Fixes: 3ef0e5ba ("slab: introduce kzfree()")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 03 June 2020 (1 commit)
  11. 11 April 2020 (1 commit)
  12. 08 April 2020 (1 commit)
    • proc: faster open/read/close with "permanent" files · d919b33d
      Authored by Alexey Dobriyan
      Now that "struct proc_ops" exist we can start putting there stuff which
      could not fly with VFS "struct file_operations"...
      
      Most of the fs/proc/inode.c file is dedicated to making open/read/.../close
      reliable in the event of disappearing /proc entries, which usually happens
      when a module is being removed.  Files like /proc/cpuinfo which never
      disappear simply do not need such protection.
      
      Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
      "permanent" files.
      
      Enable "permanent" flag for
      
      	/proc/cpuinfo
      	/proc/kmsg
      	/proc/modules
      	/proc/slabinfo
      	/proc/stat
      	/proc/sysvipc/*
      	/proc/swaps
      
      More will come once I figure out a foolproof way to prevent module
      authors from marking their stuff "permanent" for performance reasons
      when it is not.
      
      This should help with scalability: benchmark is "read /proc/cpuinfo R times
      by N threads scattered over the system".
      
      	N	R	t, s (before)	t, s (after)
      	-----------------------------------------------------
      	64	4096	1.582458	1.530502	-3.2%
      	256	4096	6.371926	6.125168	-3.9%
      	1024	4096	25.64888	24.47528	-4.6%
      
      Benchmark source:
      
      #include <chrono>
      #include <iostream>
      #include <thread>
      #include <vector>
      
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <sched.h>	/* for cpu_set_t, CPU_SET and sched_setaffinity() */
      
      const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
      int N;
      const char *filename;
      int R;
      
      int xxx = 0;
      
      int glue(int n)
      {
      	cpu_set_t m;
      	CPU_ZERO(&m);
      	CPU_SET(n, &m);
      	return sched_setaffinity(0, sizeof(cpu_set_t), &m);
      }
      
      void f(int n)
      {
      	glue(n % NR_CPUS);
      
      	while (*(volatile int *)&xxx == 0) {
      	}
      
      	for (int i = 0; i < R; i++) {
      		int fd = open(filename, O_RDONLY);
      		char buf[4096];
      		ssize_t rv = read(fd, buf, sizeof(buf));
      		asm volatile ("" :: "g" (rv));
      		close(fd);
      	}
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc < 4) {
      		std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R
      ";
      		return 1;
      	}
      
      	N = atoi(argv[1]);
      	filename = argv[2];
      	R = atoi(argv[3]);
      
      	for (int i = 0; i < NR_CPUS; i++) {
      		if (glue(i) == 0)
      			break;
      	}
      
      	std::vector<std::thread> T;
      	T.reserve(N);
      	for (int i = 0; i < N; i++) {
      		T.emplace_back(f, i);
      	}
      
      	auto t0 = std::chrono::system_clock::now();
      	{
      		*(volatile int *)&xxx = 1;
      		for (auto& t: T) {
      			t.join();
      		}
      	}
      	auto t1 = std::chrono::system_clock::now();
      	std::chrono::duration<double> dt = t1 - t0;
      	std::cout << dt.count() << '\n';
      
      	return 0;
      }
      
      P.S.:
      An explicit randomization marker is added because adding a non-function
      pointer would silently disable structure layout randomization.
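
      Marking an entry permanent then amounts to setting the new flag in its
      struct proc_ops; a hedged sketch for a hypothetical read-only file
      (example names, with PROC_ENTRY_PERMANENT being the flag this series
      adds):

      #include <linux/proc_fs.h>
      #include <linux/seq_file.h>

      static int example_proc_show(struct seq_file *m, void *v)
      {
              seq_puts(m, "example\n");
              return 0;
      }

      static int example_proc_open(struct inode *inode, struct file *file)
      {
              return single_open(file, example_proc_show, NULL);
      }

      static const struct proc_ops example_proc_ops = {
              .proc_flags     = PROC_ENTRY_PERMANENT, /* entry never disappears */
              .proc_open      = example_proc_open,
              .proc_read      = seq_read,
              .proc_lseek     = seq_lseek,
              .proc_release   = single_release,
      };

      /* registered with: proc_create("example", 0444, NULL, &example_proc_ops); */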
      
      [akpm@linux-foundation.org: coding style fixes]
      Reported-by: kbuild test robot <lkp@intel.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 03 April 2020 (1 commit)
    • mm, memcg: fix build error around the usage of kmem_caches · a87425a3
      Authored by Yafang Shao
      When I manually set MEMCG_KMEM to default n in init/Kconfig, the error
      below occurs:
      
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1530:30: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_start(&memcg->kmem_caches, *pos);
                                      ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1537:32: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          return seq_list_next(p, &memcg->kmem_caches, pos);
                                        ^
        mm/slab_common.c: In function 'memcg_slab_show':
        mm/slab_common.c:1551:16: error: 'struct mem_cgroup' has no member named
        'kmem_caches'
          if (p == memcg->kmem_caches.next)
                        ^
          CC      arch/x86/xen/smp.o
        mm/slab_common.c: In function 'memcg_slab_start':
        mm/slab_common.c:1531:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
        mm/slab_common.c: In function 'memcg_slab_next':
        mm/slab_common.c:1538:1: warning: control reaches end of non-void function
        [-Wreturn-type]
         }
         ^
      
      That's because kmem_caches is defined only when CONFIG_MEMCG_KMEM is set,
      while memcg_slab_start() uses it whether or not CONFIG_MEMCG_KMEM is
      defined.
      
      By the way, the reason I manually undefined CONFIG_MEMCG_KMEM was to verify
      whether some other code change of mine is still stable when CONFIG_MEMCG_KMEM
      is not set.  Unfortunately, the existing code has already been unstable since
      v4.11.
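
      The fix is essentially to compile those seq_file helpers only when the
      field they touch exists; roughly (a sketch of the guard, not the
      literal patch):

      #ifdef CONFIG_MEMCG_KMEM
      void *memcg_slab_start(struct seq_file *m, loff_t *pos)
      {
              struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));

              mutex_lock(&slab_mutex);
              return seq_list_start(&memcg->kmem_caches, *pos);
      }
      /* ... memcg_slab_next() and memcg_slab_show() guarded the same way ... */
      #endif /* CONFIG_MEMCG_KMEM */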
      
      Fixes: bc2791f8 ("slab: link memcg kmem_caches on their associated memory cgroup")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/1580970260-2045-1-git-send-email-laoar.shao@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 04 February 2020 (2 commits)
  15. 14 January 2020 (1 commit)
    • mm: memcg/slab: call flush_memcg_workqueue() only if memcg workqueue is valid · 2fe20210
      Authored by Adrian Huang
      When booting with amd_iommu=off, the following WARNING message
      appears:
      
        AMD-Vi: AMD IOMMU disabled on kernel command-line
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 0 at kernel/workqueue.c:2772 flush_workqueue+0x42e/0x450
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc3-amd-iommu #6
        Hardware name: Lenovo ThinkSystem SR655-2S/7D2WRCZ000, BIOS D8E101L-1.00 12/05/2019
        RIP: 0010:flush_workqueue+0x42e/0x450
        Code: ff 0f 0b e9 7a fd ff ff 4d 89 ef e9 33 fe ff ff 0f 0b e9 7f fd ff ff 0f 0b e9 bc fd ff ff 0f 0b e9 a8 fd ff ff e8 52 2c fe ff <0f> 0b 31 d2 48 c7 c6 e0 88 c5 95 48 c7 c7 d8 ad f0 95 e8 19 f5 04
        Call Trace:
         kmem_cache_destroy+0x69/0x260
         iommu_go_to_state+0x40c/0x5ab
         amd_iommu_prepare+0x16/0x2a
         irq_remapping_prepare+0x36/0x5f
         enable_IR_x2apic+0x21/0x172
         default_setup_apic_routing+0x12/0x6f
         apic_intr_mode_init+0x1a1/0x1f1
         x86_late_time_init+0x17/0x1c
         start_kernel+0x480/0x53f
         secondary_startup_64+0xb6/0xc0
        ---[ end trace 30894107c3749449 ]---
        x2apic: IRQ remapping doesn't support X2APIC mode
        x2apic disabled
      
      The warning is caused by the call to 'kmem_cache_destroy()'
      in free_iommu_resources().  Here is the call path:
      
        free_iommu_resources
          kmem_cache_destroy
            flush_memcg_workqueue
              flush_workqueue
      
      The root cause is that the IOMMU subsystem runs before the workqueue
      subsystem, so the variable 'wq_online' is still 'false'.  This causes
      the check 'if (WARN_ON(!wq_online))' in flush_workqueue() to trigger.
      
      Since the variable 'memcg_kmem_cache_wq' has not been allocated at that
      point, it is unnecessary to call flush_memcg_workqueue().  Skipping the
      call prevents the WARNING triggered by flush_workqueue().
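
      The resulting guard can be sketched as follows (simplified;
      memcg_kmem_cache_wq is the static workqueue pointer in mm/slab_common.c
      mentioned above):

      static void example_flush_memcg_workqueue(struct kmem_cache *s)
      {
              /*
               * Nothing to flush if the workqueue was never allocated, e.g.
               * when kmem_cache_destroy() runs before the workqueue subsystem
               * has been brought up.
               */
              if (!memcg_kmem_cache_wq)
                      return;

              /* ... existing logic ending in flush_workqueue(memcg_kmem_cache_wq) ... */
      }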
      
      Link: http://lkml.kernel.org/r/20200103085503.1665-1-ahuang12@lenovo.com
      Fixes: 92ee383f ("mm: fix race between kmem_cache destroy, create and deactivate")
      Signed-off-by: Adrian Huang <ahuang12@lenovo.com>
      Reported-by: Xiaochun Lee <lixc17@lenovo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 05 December 2019 (1 commit)
    • mm: memcg/slab: wait for !root kmem_cache refcnt killing on root kmem_cache destruction · a264df74
      Authored by Roman Gushchin
      Christian reported a warning like the following obtained during running
      some KVM-related tests on s390:
      
          WARNING: CPU: 8 PID: 208 at lib/percpu-refcount.c:108 percpu_ref_exit+0x50/0x58
          Modules linked in: kvm(-) xt_CHECKSUM xt_MASQUERADE bonding xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_na>
          CPU: 8 PID: 208 Comm: kworker/8:1 Not tainted 5.2.0+ #66
          Hardware name: IBM 2964 NC9 712 (LPAR)
          Workqueue: events sysfs_slab_remove_workfn
          Krnl PSW : 0704e00180000000 0000001529746850 (percpu_ref_exit+0x50/0x58)
                     R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
          Krnl GPRS: 00000000ffff8808 0000001529746740 000003f4e30e8e18 0036008100000000
                     0000001f00000000 0035008100000000 0000001fb3573ab8 0000000000000000
                     0000001fbdb6de00 0000000000000000 0000001529f01328 0000001fb3573b00
                     0000001fbb27e000 0000001fbdb69300 000003e009263d00 000003e009263cd0
          Krnl Code: 0000001529746842: f0a0000407fe        srp        4(11,%r0),2046,0
                     0000001529746848: 47000700            bc         0,1792
                    #000000152974684c: a7f40001            brc        15,152974684e
                    >0000001529746850: a7f4fff2            brc        15,1529746834
                     0000001529746854: 0707                bcr        0,%r7
                     0000001529746856: 0707                bcr        0,%r7
                     0000001529746858: eb8ff0580024        stmg       %r8,%r15,88(%r15)
                     000000152974685e: a738ffff            lhi        %r3,-1
          Call Trace:
          ([<000003e009263d00>] 0x3e009263d00)
           [<00000015293252ea>] slab_kmem_cache_release+0x3a/0x70
           [<0000001529b04882>] kobject_put+0xaa/0xe8
           [<000000152918cf28>] process_one_work+0x1e8/0x428
           [<000000152918d1b0>] worker_thread+0x48/0x460
           [<00000015291942c6>] kthread+0x126/0x160
           [<0000001529b22344>] ret_from_fork+0x28/0x30
           [<0000001529b2234c>] kernel_thread_starter+0x0/0x10
          Last Breaking-Event-Address:
           [<000000152974684c>] percpu_ref_exit+0x4c/0x58
          ---[ end trace b035e7da5788eb09 ]---
      
      The problem occurs because kmem_cache_destroy() is called immediately
      after deleting a memcg, so it races with the memcg kmem_cache
      deactivation.
      
      flush_memcg_workqueue() at the beginning of kmem_cache_destroy() is
      supposed to guarantee that all deactivation processes are finished, but
      failed to do so.  It waits for an rcu grace period, after which all
      children kmem_caches should be deactivated.  During the deactivation
      percpu_ref_kill() is called for non root kmem_cache refcounters, but it
      requires yet another rcu grace period to finish the transition to the
      atomic (dead) state.
      
      So in a rare case when not all children kmem_caches are destroyed at the
      moment when the root kmem_cache is about to be gone, we need to wait
      another rcu grace period before destroying the root kmem_cache.
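
      Conceptually, the destroy path now has to wait out that second grace
      period whenever children are still in flight; a hedged sketch of the
      idea (not the literal fix; field names as in that kernel's
      memcg_cache_params):

      static void example_wait_for_children(struct kmem_cache *root)
      {
              /*
               * percpu_ref_kill() on a child needs one more RCU grace period
               * to reach the atomic (dead) state; if any child is still
               * around, wait it out before tearing down the root cache.
               */
              if (!list_empty(&root->memcg_params.children))
                      rcu_barrier();
      }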
      
      This issue can be triggered only with dynamically created kmem_caches
      which are used with memcg accounting.  In this case per-memcg child
      kmem_caches are created.  They are deactivated from the cgroup removing
      path.  If the destruction of the root kmem_cache is racing with the
      removal of the cgroup (both are quite complicated multi-stage
      processes), the described issue can occur.  The only known way to
      trigger it in real life is to unload some kernel module which
      creates a dedicated kmem_cache, used from different memory cgroups with
      GFP_ACCOUNT flag.  If the unloading happens immediately after calling
      rmdir on the corresponding cgroup, there is some chance to trigger the
      issue.
      
      Link: http://lkml.kernel.org/r/20191129025011.3076017-1-guro@fb.com
      Fixes: f0a3a24b ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 01 December 2019 (3 commits)
  18. 19 October 2019 (1 commit)
    • mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release · b749ecfa
      Authored by Roman Gushchin
      Karsten reported the following panic in __free_slab() happening on a s390x
      machine:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
        Oops: 0004 ilc:3 Ý#1¨ PREEMPT SMP
        Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
        Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
        Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
        Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
                   0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
        Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
                   00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
                  #00000000003cadb2: a7f4fe36            brc     15,3caa1e
                  >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
                   00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
                   00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
                   00000000003cadc6: a7f4fe2c            brc     15,3caa1e
                   00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
        Call Trace:
        (<00000000003cabcc> __free_slab+0x49c/0x6b0)
         <00000000001f5886> rcu_core+0x5a6/0x7e0
         <0000000000ca2dea> __do_softirq+0xf2/0x5c0
         <0000000000152644> irq_exit+0x104/0x130
         <000000000010d222> do_IRQ+0x9a/0xf0
         <0000000000ca2344> ext_int_handler+0x130/0x134
         <0000000000103648> enabled_wait+0x58/0x128
        (<0000000000103634> enabled_wait+0x44/0x128)
         <0000000000103b00> arch_cpu_idle+0x40/0x58
         <0000000000ca0544> default_idle_call+0x3c/0x68
         <000000000018eaa4> do_idle+0xec/0x1c0
         <000000000018ee0e> cpu_startup_entry+0x36/0x40
         <000000000122df34> arch_call_rest_init+0x5c/0x88
         <0000000000000000> 0x0
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         <00000000003ca8f4> __free_slab+0x1c4/0x6b0
        Kernel panic - not syncing: Fatal exception in interrupt
      
      The kernel panics on an attempt to dereference the NULL memcg pointer.
      When shutdown_cache() is called from the kmem_cache_destroy() context, a
      memcg kmem_cache might have empty slab pages in a partial list, which are
      still charged to the memory cgroup.
      
      These pages are released by free_partial() at the beginning of
      shutdown_cache(): either directly or by scheduling a RCU-delayed work
      (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
      is when the reported panic can happen: memcg_unlink_cache() is called
      immediately after shrinking partial lists, without waiting for scheduled
      RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
      and the following attempt to dereference it by __free_slab() from the
      RCU work context causes the panic.
      
      To fix the issue, let's postpone the release of the memcg pointer to
      destroy_memcg_params().  It's called from a separate work context by
      slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
      This guarantees that all scheduled page release RCU works will complete
      before the memcg pointer will be zeroed.
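
      In other words, the memcg reference is now dropped only after all
      RCU-deferred slab releases are done; roughly (a sketch of the relocated
      release, using the helper names from the description above):

      static void example_destroy_memcg_params(struct kmem_cache *s)
      {
              /*
               * Runs from slab_caches_to_rcu_destroy_workfn(), i.e. after a
               * full RCU barrier, so no __free_slab() RCU work can still look
               * at s->memcg_params.memcg.
               */
              if (!is_root_cache(s)) {
                      mem_cgroup_put(s->memcg_params.memcg);
                      WRITE_ONCE(s->memcg_params.memcg, NULL);
              }
      }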
      
      Big thanks to Karsten for the perfect report containing all necessary
      information, for his help with the analysis of the problem, and for
      testing the fix.
      
      Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
      Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Karsten Graul <kgraul@linux.ibm.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 08 October 2019 (2 commits)
    • mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Authored by Vlastimil Babka
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment, until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocate a larger size and only
      give back part of it.  That would be wasteful, especially with a generic
      alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide to mm users what they need without difficult
      workarounds or own reimplementations, so let's make the kmalloc()
      alignment to size explicitly guaranteed for power-of-two sizes under all
      configurations.  What does this mean for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting a x86_64 kernel+userspace with virtme, around 16MB memory
        was consumed by slab pages both before and after the patch, with
        difference in the noise.
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
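
      For SLAB and SLUB the guarantee essentially boils down to bumping the
      cache alignment for power-of-two kmalloc sizes when the kmalloc caches
      are created (a simplified sketch; SLOB instead aligns each such
      allocation individually):

      #include <linux/kernel.h>
      #include <linux/log2.h>

      /* Sketch: pick the alignment for a kmalloc-<size> cache. */
      static unsigned int example_kmalloc_align(unsigned int size,
                                                unsigned int arch_align)
      {
              unsigned int align = arch_align;

              /* Power-of-two kmalloc sizes become naturally aligned. */
              if (is_power_of_2(size))
                      align = max(align, size);

              return align;
      }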
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, sl[ou]b: improve memory accounting · 6a486c0a
      Authored by Vlastimil Babka
      Patch series "guarantee natural alignment for kmalloc()", v2.
      
      This patch (of 2):
      
      SLOB currently doesn't account its pages at all, so in /proc/meminfo the
      Slab field shows zero.  Modifying a counter on page allocation and
      freeing should be acceptable even for the small system scenarios SLOB is
      intended for.  Since reclaimable caches are not separated in SLOB,
      account everything as unreclaimable.
      
      SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
      larger than order-1 page, that are passed directly to the page
      allocator.  As they also don't appear in /proc/slabinfo, it might look
      like a memory leak.  For consistency, account them as well.  (SLAB
      doesn't actually use page allocator directly, so no change there).
      
      Ideally SLOB and SLUB would be handled in separate patches, but due to
      the shared kmalloc_order() function and different kfree()
      implementations, it's easier to patch both at once to prevent
      inconsistencies.
      
      Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 25 September 2019 (1 commit)
    • mm, slab: extend slab/shrink to shrink all memcg caches · 04f768a3
      Authored by Waiman Long
      Currently, a value of '1" is written to /sys/kernel/slab/<slab>/shrink
      file to shrink the slab by flushing out all the per-cpu slabs and free
      slabs in partial lists.  This can be useful to squeeze out a bit more
      memory under extreme condition as well as making the active object counts
      in /proc/slabinfo more accurate.
      
      This usually applies only to the root caches, as the SLUB_MEMCG_SYSFS_ON
      option is usually not enabled and "slub_memcg_sysfs=1" not set.  Even if
      memcg sysfs is turned on, it is too cumbersome and impractical to manage
      all those per-memcg sysfs files in a real production system.
      
      So there is no practical way to shrink memcg caches.  Fix this by enabling
      a proper write to the shrink sysfs file of the root cache to scan all the
      available memcg caches and shrink them as well.  For a non-root memcg
      cache (when SLUB_MEMCG_SYSFS_ON or slub_memcg_sysfs is on), only that
      cache will be shrunk when written.
      
      On a 2-socket 64-core 256-thread arm64 system with 64k pages, after
      a parallel kernel build the amount of memory occupied by slabs
      before shrinking was:
      
       # grep task_struct /proc/slabinfo
       task_struct        53137  53192   4288   61    4 : tunables    0    0
       0 : slabdata    872    872      0
       # grep "^S[lRU]" /proc/meminfo
       Slab:            3936832 kB
       SReclaimable:     399104 kB
       SUnreclaim:      3537728 kB
      
      After shrinking slabs (by echoing "1" to all shrink files):
      
       # grep "^S[lRU]" /proc/meminfo
       Slab:            1356288 kB
       SReclaimable:     263296 kB
       SUnreclaim:      1092992 kB
       # grep task_struct /proc/slabinfo
       task_struct         2764   6832   4288   61    4 : tunables    0    0
       0 : slabdata    112    112      0
      
      Link: http://lkml.kernel.org/r/20190723151445.7385-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 17 July 2019 (1 commit)
  22. 13 July 2019 (4 commits)
    • mm, memcg: add a memcg_slabinfo debugfs file · fcf8a1e4
      Authored by Waiman Long
      There are concerns about memory leaks from extensive use of memory cgroups,
      as each memory cgroup creates its own set of kmem caches.  There is a
      possibility that the memcg kmem caches may remain even after the memory
      cgroups have been offlined.  Therefore, it will be useful to show the
      status of each memcg kmem cache.
      
      This patch introduces a new <debugfs>/memcg_slabinfo file which is
      somewhat similar to /proc/slabinfo in format, but lists only information
      about kmem caches that have child memcg kmem caches.  Information
      available in /proc/slabinfo are not repeated in memcg_slabinfo.
      
      A portion of a sample output of the file was:
      
        # <name> <css_id[:dead]> <active_objs> <num_objs> <active_slabs> <num_slabs>
        rpc_inode_cache   root          13     51      1      1
        rpc_inode_cache     48           0      0      0      0
        fat_inode_cache   root           1     45      1      1
        fat_inode_cache     41           2     45      1      1
        xfs_inode         root         770    816     24     24
        xfs_inode           92          22     34      1      1
        xfs_inode           88:dead      1     34      1      1
        xfs_inode           89:dead     23     34      1      1
        xfs_inode           85           4     34      1      1
        xfs_inode           84           9     34      1      1
      
      The css id of the memcg is also listed. If a memcg is not online,
      the tag ":dead" will be attached as shown above.
      
      [longman@redhat.com: memcg: add ":deact" tag for reparented kmem caches in memcg_slabinfo]
        Link: http://lkml.kernel.org/r/20190621173005.31514-1-longman@redhat.com
      [longman@redhat.com: set the flag in the common code as suggested by Roman]
        Link: http://lkml.kernel.org/r/20190627184324.5875-1-longman@redhat.com
      Link: http://lkml.kernel.org/r/20190619171621.26209-1-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Suggested-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: reparent memcg kmem_caches on cgroup removal · fb2f2b0a
      Authored by Roman Gushchin
      Let's reparent non-root kmem_caches on memcg offlining.  This allows us to
      release the memory cgroup without waiting for the last outstanding kernel
      object (e.g.  dentry used by another application).
      
      Since the parent cgroup is already charged, everything we need to do is to
      splice the list of kmem_caches to the parent's kmem_caches list, swap the
      memcg pointer, drop the css refcounter for each kmem_cache and adjust the
      parent's css refcounter.
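
      A condensed sketch of that reparenting step (simplified: locking,
      statistics and the root_mem_cgroup special case are omitted; field
      names follow that kernel's memcg_cache_params):

      static void example_reparent_kmem_caches(struct mem_cgroup *memcg,
                                               struct mem_cgroup *parent)
      {
              struct kmem_cache *s;

              list_for_each_entry(s, &memcg->kmem_caches,
                                  memcg_params.kmem_caches_node) {
                      /* Re-point the cache at the (already charged) parent. */
                      WRITE_ONCE(s->memcg_params.memcg, parent);
                      css_put(&memcg->css);
                      css_get(&parent->css);
              }
              /* Move the whole list over in one go. */
              list_splice_init(&memcg->kmem_caches, &parent->kmem_caches);
      }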
      
      Please, note that kmem_cache->memcg_params.memcg isn't a stable pointer
      anymore.  It's safe to read it under rcu_read_lock(), cgroup_mutex held,
      or any other way that protects the memory cgroup from being released.
      
      We can race with the slab allocation and deallocation paths.  It's not a
      big problem: parent's charge and slab global stats are always correct, and
      we don't care anymore about the child usage and global stats.  The child
      cgroup is already offline, so we don't use or show it anywhere.
      
      Local slab stats (NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE) aren't
      used anywhere except count_shadow_nodes().  But even there it won't break
      anything: after reparenting "nodes" will be 0 on child level (because
      we're already reparenting shrinker lists), and on parent level page stats
      always were 0, and this patch won't change anything.
      
      [guro@fb.com: properly handle kmem_caches reparented to root_mem_cgroup]
        Link: http://lkml.kernel.org/r/20190620213427.1691847-1-guro@fb.com
      Link: http://lkml.kernel.org/r/20190611231813.3148843-11-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: rework non-root kmem_cache lifecycle management · f0a3a24b
      Authored by Roman Gushchin
      Currently each charged slab page holds a reference to the cgroup to which
      it's charged.  Kmem_caches are held by the memcg and are released all
      together with the memory cgroup.  It means that none of kmem_caches are
      released unless at least one reference to the memcg exists, which is very
      far from optimal.
      
      Let's rework it in a way that allows releasing individual kmem_caches as
      soon as the cgroup is offline, the kmem_cache is empty and there are no
      pending allocations.
      
      To make it possible, let's introduce a new percpu refcounter for non-root
      kmem caches.  The counter is initialized to the percpu mode, and is
      switched to the atomic mode during kmem_cache deactivation.  The counter
      is bumped for every charged page and also for every running allocation.
      So the kmem_cache can't be released unless all allocations complete.
      
      To shutdown non-active empty kmem_caches, let's reuse the work queue,
      previously used for the kmem_cache deactivation.  Once the reference
      counter reaches 0, let's schedule an asynchronous kmem_cache release.
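
      The lifecycle maps fairly directly onto the percpu_ref API; a hedged
      sketch (illustrative names; in the real patch the release callback
      schedules the actual kmem_cache shutdown):

      #include <linux/percpu-refcount.h>
      #include <linux/gfp.h>

      /* Runs once the last reference is gone: schedule the async release. */
      static void example_cache_release(struct percpu_ref *ref)
      {
              /* schedule_work(...) to tear the kmem_cache down asynchronously */
      }

      static int example_cache_ref_init(struct percpu_ref *ref)
      {
              /* Starts in the fast percpu mode while the memcg is online. */
              return percpu_ref_init(ref, example_cache_release, 0, GFP_KERNEL);
      }

      static void example_cache_deactivate(struct percpu_ref *ref)
      {
              /* Switch to atomic mode and drop the base reference. */
              percpu_ref_kill(ref);
      }

      /*
       * Allocation paths then pair percpu_ref_get()/percpu_ref_put() around
       * every charged page and every in-flight allocation, so the cache
       * cannot be released while any of them is outstanding.
       */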
      
      * I used the following simple approach to test the performance
      (stolen from another patchset by T. Harding):
      
          time find / -name fname-no-exist
          echo 2 > /proc/sys/vm/drop_caches
          repeat 10 times
      
      Results:
      
              orig		patched
      
      real	0m1.455s	real	0m1.355s
      user	0m0.206s	user	0m0.219s
      sys	0m0.855s	sys	0m0.807s
      
      real	0m1.487s	real	0m1.699s
      user	0m0.221s	user	0m0.256s
      sys	0m0.806s	sys	0m0.948s
      
      real	0m1.515s	real	0m1.505s
      user	0m0.183s	user	0m0.215s
      sys	0m0.876s	sys	0m0.858s
      
      real	0m1.291s	real	0m1.380s
      user	0m0.193s	user	0m0.198s
      sys	0m0.843s	sys	0m0.786s
      
      real	0m1.364s	real	0m1.374s
      user	0m0.180s	user	0m0.182s
      sys	0m0.868s	sys	0m0.806s
      
      real	0m1.352s	real	0m1.312s
      user	0m0.201s	user	0m0.212s
      sys	0m0.820s	sys	0m0.761s
      
      real	0m1.302s	real	0m1.349s
      user	0m0.205s	user	0m0.203s
      sys	0m0.803s	sys	0m0.792s
      
      real	0m1.334s	real	0m1.301s
      user	0m0.194s	user	0m0.201s
      sys	0m0.806s	sys	0m0.779s
      
      real	0m1.426s	real	0m1.434s
      user	0m0.216s	user	0m0.181s
      sys	0m0.824s	sys	0m0.864s
      
      real	0m1.350s	real	0m1.295s
      user	0m0.200s	user	0m0.190s
      sys	0m0.842s	sys	0m0.811s
      
      So it looks like the difference is not noticeable in this test.
      
      [cai@lca.pw: fix an use-after-free in kmemcg_workfn()]
        Link: http://lkml.kernel.org/r/1560977573-10715-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190611231813.3148843-9-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/slab: synchronize access to kmem_cache dying flag using a spinlock · 63b02ef7
      Authored by Roman Gushchin
      Currently the memcg_params.dying flag and the corresponding workqueue used
      for the asynchronous deactivation of kmem_caches are synchronized using the
      slab_mutex.
      
      This makes it impossible to check the flag from irq context, which will be
      required in order to implement asynchronous release of kmem_caches.
      
      So let's switch over to the irq-save flavor of the spinlock-based
      synchronization.
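
      The switched-over synchronization looks roughly like this (a sketch;
      upstream the lock is a static spinlock in mm/slab_common.c next to the
      workqueue and flag it protects):

      #include <linux/spinlock.h>
      #include <linux/slab.h>

      static DEFINE_SPINLOCK(example_kmem_wq_lock);

      /* Unlike a mutex-protected check, this is safe from irq context. */
      static bool example_cache_is_dying(struct kmem_cache *s)
      {
              unsigned long flags;
              bool dying;

              spin_lock_irqsave(&example_kmem_wq_lock, flags);
              dying = s->memcg_params.dying;
              spin_unlock_irqrestore(&example_kmem_wq_lock, flags);

              return dying;
      }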
      
      Link: http://lkml.kernel.org/r/20190611231813.3148843-8-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>