1. 14 2月, 2015 11 次提交
    • A
      mm: slub: add kernel address sanitizer support for slub allocator · 0316bec2
      Andrey Ryabinin 提交于
      With this patch kasan will be able to catch bugs in memory allocated by
      slub.  Initially all objects in newly allocated slab page, marked as
      redzone.  Later, when allocation of slub object happens, requested by
      caller number of bytes marked as accessible, and the rest of the object
      (including slub's metadata) marked as redzone (inaccessible).
      
      We also mark object as accessible if ksize was called for this object.
      There is some places in kernel where ksize function is called to inquire
      size of really allocated area.  Such callers could validly access whole
      allocated memory, so it should be marked as accessible.
      
      Code in slub.c and slab_common.c files could validly access to object's
      metadata, so instrumentation for this files are disabled.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: NDmitry Chernenkov <dmitryc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0316bec2
    • A
      mm: slub: introduce metadata_access_enable()/metadata_access_disable() · a79316c6
      Andrey Ryabinin 提交于
      It's ok for slub to access memory that marked by kasan as inaccessible
      (object's metadata).  Kasan shouldn't print report in that case because
      these accesses are valid.  Disabling instrumentation of slub.c code is not
      enough to achieve this because slub passes pointer to object's metadata
      into external functions like memchr_inv().
      
      We don't want to disable instrumentation for memchr_inv() because this is
      quite generic function, and we don't want to miss bugs.
      
      metadata_access_enable/metadata_access_disable used to tell KASan where
      accesses to metadata starts/end, so we could temporarily disable KASan
      reports.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a79316c6
    • A
      mm: slub: share object_err function · 75c66def
      Andrey Ryabinin 提交于
      Remove static and add function declarations to linux/slub_def.h so it
      could be used by kernel address sanitizer.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75c66def
    • A
      mm: page_alloc: add kasan hooks on alloc and free paths · b8c73fc2
      Andrey Ryabinin 提交于
      Add kernel address sanitizer hooks to mark allocated page's addresses as
      accessible in corresponding shadow region.  Mark freed pages as
      inaccessible.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8c73fc2
    • A
      kasan: disable memory hotplug · 786a8959
      Andrey Ryabinin 提交于
      Currently memory hotplug won't work with KASan.  As we don't have shadow
      for hotplugged memory, kernel will crash on the first access to it.  To
      make this work we will need to allocate shadow for new memory.
      
      At some future point proper memory hotplug support will be implemented.
      Until then, print a warning at startup and disable memory hot-add.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      786a8959
    • A
      kasan: add kernel address sanitizer infrastructure · 0b24becc
      Andrey Ryabinin 提交于
      Kernel Address sanitizer (KASan) is a dynamic memory error detector.  It
      provides fast and comprehensive solution for finding use-after-free and
      out-of-bounds bugs.
      
      KASAN uses compile-time instrumentation for checking every memory access,
      therefore GCC > v4.9.2 required.  v4.9.2 almost works, but has issues with
      putting symbol aliases into the wrong section, which breaks kasan
      instrumentation of globals.
      
      This patch only adds infrastructure for kernel address sanitizer.  It's
      not available for use yet.  The idea and some code was borrowed from [1].
      
      Basic idea:
      
      The main idea of KASAN is to use shadow memory to record whether each byte
      of memory is safe to access or not, and use compiler's instrumentation to
      check the shadow memory on each memory access.
      
      Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
      memory and uses direct mapping with a scale and offset to translate a
      memory address to its corresponding shadow address.
      
      Here is function to translate address to corresponding shadow address:
      
           unsigned long kasan_mem_to_shadow(unsigned long addr)
           {
                      return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
           }
      
      where KASAN_SHADOW_SCALE_SHIFT = 3.
      
      So for every 8 bytes there is one corresponding byte of shadow memory.
      The following encoding used for each shadow byte: 0 means that all 8 bytes
      of the corresponding memory region are valid for access; k (1 <= k <= 7)
      means that the first k bytes are valid for access, and other (8 - k) bytes
      are not; Any negative value indicates that the entire 8-bytes are
      inaccessible.  Different negative values used to distinguish between
      different kinds of inaccessible memory (redzones, freed memory) (see
      mm/kasan/kasan.h).
      
      To be able to detect accesses to bad memory we need a special compiler.
      Such compiler inserts a specific function calls (__asan_load*(addr),
      __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.
      
      These functions check whether memory region is valid to access or not by
      checking corresponding shadow memory.  If access is not valid an error
      printed.
      
      Historical background of the address sanitizer from Dmitry Vyukov:
      
      	"We've developed the set of tools, AddressSanitizer (Asan),
      	ThreadSanitizer and MemorySanitizer, for user space. We actively use
      	them for testing inside of Google (continuous testing, fuzzing,
      	running prod services). To date the tools have found more than 10'000
      	scary bugs in Chromium, Google internal codebase and various
      	open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
      	lots of others): [2] [3] [4].
      	The tools are part of both gcc and clang compilers.
      
      	We have not yet done massive testing under the Kernel AddressSanitizer
      	(it's kind of chicken and egg problem, you need it to be upstream to
      	start applying it extensively). To date it has found about 50 bugs.
      	Bugs that we've found in upstream kernel are listed in [5].
      	We've also found ~20 bugs in out internal version of the kernel. Also
      	people from Samsung and Oracle have found some.
      
      	[...]
      
      	As others noted, the main feature of AddressSanitizer is its
      	performance due to inline compiler instrumentation and simple linear
      	shadow memory. User-space Asan has ~2x slowdown on computational
      	programs and ~2x memory consumption increase. Taking into account that
      	kernel usually consumes only small fraction of CPU and memory when
      	running real user-space programs, I would expect that kernel Asan will
      	have ~10-30% slowdown and similar memory consumption increase (when we
      	finish all tuning).
      
      	I agree that Asan can well replace kmemcheck. We have plans to start
      	working on Kernel MemorySanitizer that finds uses of unitialized
      	memory. Asan+Msan will provide feature-parity with kmemcheck. As
      	others noted, Asan will unlikely replace debug slab and pagealloc that
      	can be enabled at runtime. Asan uses compiler instrumentation, so even
      	if it is disabled, it still incurs visible overheads.
      
      	Asan technology is easily portable to other architectures. Compiler
      	instrumentation is fully portable. Runtime has some arch-dependent
      	parts like shadow mapping and atomic operation interception. They are
      	relatively easy to port."
      
      Comparison with other debugging features:
      ========================================
      
      KMEMCHECK:
      
        - KASan can do almost everything that kmemcheck can.  KASan uses
          compile-time instrumentation, which makes it significantly faster than
          kmemcheck.  The only advantage of kmemcheck over KASan is detection of
          uninitialized memory reads.
      
          Some brief performance testing showed that kasan could be
          x500-x600 times faster than kmemcheck:
      
      $ netperf -l 30
      		MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
      		Recv   Send    Send
      		Socket Socket  Message  Elapsed
      		Size   Size    Size     Time     Throughput
      		bytes  bytes   bytes    secs.    10^6bits/sec
      
      no debug:	87380  16384  16384    30.00    41624.72
      
      kasan inline:	87380  16384  16384    30.00    12870.54
      
      kasan outline:	87380  16384  16384    30.00    10586.39
      
      kmemcheck: 	87380  16384  16384    30.03      20.23
      
        - Also kmemcheck couldn't work on several CPUs.  It always sets
          number of CPUs to 1.  KASan doesn't have such limitation.
      
      DEBUG_PAGEALLOC:
      	- KASan is slower than DEBUG_PAGEALLOC, but KASan works on sub-page
      	  granularity level, so it able to find more bugs.
      
      SLUB_DEBUG (poisoning, redzones):
      	- SLUB_DEBUG has lower overhead than KASan.
      
      	- SLUB_DEBUG in most cases are not able to detect bad reads,
      	  KASan able to detect both reads and writes.
      
      	- In some cases (e.g. redzone overwritten) SLUB_DEBUG detect
      	  bugs only on allocation/freeing of object. KASan catch
      	  bugs right before it will happen, so we always know exact
      	  place of first bad read/write.
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
      [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
      [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
      [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies
      
      Based on work by Andrey Konovalov.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: NMichal Marek <mmarek@suse.cz>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b24becc
    • T
      mm: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 9e763e0f
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9e763e0f
    • T
      slub: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 5024c1d7
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * This is an equivalent conversion but the whole function should be
        converted to use scnprinf famiily of functions rather than
        performing custom output length predictions in multiple places.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5024c1d7
    • T
      percpu: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 807de073
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      807de073
    • A
      mm/slab: convert cache name allocations to kstrdup_const · 3dec16ea
      Andrzej Hajda 提交于
      slab frequently performs duplication of strings located in read-only
      memory section.  Replacing kstrdup by kstrdup_const allows to avoid such
      operations.
      
      [akpm@linux-foundation.org: make the handling of kmem_cache.name const-correct]
      Signed-off-by: NAndrzej Hajda <a.hajda@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mike Turquette <mturquette@linaro.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3dec16ea
    • A
      mm/util: add kstrdup_const · a4bb1e43
      Andrzej Hajda 提交于
      kstrdup() is often used to duplicate strings where neither source neither
      destination will be ever modified.  In such case we can just reuse the
      source instead of duplicating it.  The problem is that we must be sure
      that the source is non-modifiable and its life-time is long enough.
      
      I suspect the good candidates for such strings are strings located in
      kernel .rodata section, they cannot be modifed because the section is
      read-only and their life-time is equal to kernel life-time.
      
      This small patchset proposes alternative version of kstrdup -
      kstrdup_const, which returns source string if it is located in .rodata
      otherwise it fallbacks to kstrdup.  To verify if the source is in
      .rodata function checks if the address is between sentinels
      __start_rodata, __end_rodata.  I guess it should work with all
      architectures.
      
      The main patch is accompanied by four patches constifying kstrdup for
      cases where situtation described above happens frequently.
      
      I have tested the patchset on mobile platform (exynos4210-trats) and it
      saves 3272 string allocations.  Since minimal allocation is 32 or 64
      bytes depending on Kconfig options the patchset saves respectively about
      100KB or 200KB of memory.
      
      Stats from tested platform show that the main offender is sysfs:
      
      By caller:
        2260 __kernfs_new_node
          631 clk_register+0xc8/0x1b8
          318 clk_register+0x34/0x1b8
            51 kmem_cache_create
            12 alloc_vfsmnt
      
      By string (with count >= 5):
          883 power
          876 subsystem
          135 parameters
          132 device
           61 iommu_group
          ...
      
      This patch (of 5):
      
      Add an alternative version of kstrdup which returns pointer to constant
      char array.  The function checks if input string is in persistent and
      read-only memory section, if yes it returns the input string, otherwise it
      fallbacks to kstrdup.
      
      kstrdup_const is accompanied by kfree_const performing conditional memory
      deallocation of the string.
      Signed-off-by: NAndrzej Hajda <a.hajda@samsung.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Kyungmin Park <kyungmin.park@samsung.com>
      Cc: Mike Turquette <mturquette@linaro.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Greg KH <greg@kroah.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a4bb1e43
  2. 13 2月, 2015 29 次提交
    • G
      mm/zsmalloc: add statistics support · 0f050d99
      Ganesh Mahendran 提交于
      Keeping fragmentation of zsmalloc in a low level is our target.  But now
      we still need to add the debug code in zsmalloc to get the quantitative
      data.
      
      This patch adds a new configuration CONFIG_ZSMALLOC_STAT to enable the
      statistics collection for developers.  Currently only the objects
      statatitics in each class are collected.  User can get the information via
      debugfs.
      
           cat /sys/kernel/debug/zsmalloc/zram0/...
      
      For example:
      
      After I copied "jdk-8u25-linux-x64.tar.gz" to zram with ext4 filesystem:
       class  size obj_allocated   obj_used pages_used
           0    32             0          0          0
           1    48           256         12          3
           2    64            64         14          1
           3    80            51          7          1
           4    96           128          5          3
           5   112            73          5          2
           6   128            32          4          1
           7   144             0          0          0
           8   160             0          0          0
           9   176             0          0          0
          10   192             0          0          0
          11   208             0          0          0
          12   224             0          0          0
          13   240             0          0          0
          14   256            16          1          1
          15   272            15          9          1
          16   288             0          0          0
          17   304             0          0          0
          18   320             0          0          0
          19   336             0          0          0
          20   352             0          0          0
          21   368             0          0          0
          22   384             0          0          0
          23   400             0          0          0
          24   416             0          0          0
          25   432             0          0          0
          26   448             0          0          0
          27   464             0          0          0
          28   480             0          0          0
          29   496            33          1          4
          30   512             0          0          0
          31   528             0          0          0
          32   544             0          0          0
          33   560             0          0          0
          34   576             0          0          0
          35   592             0          0          0
          36   608             0          0          0
          37   624             0          0          0
          38   640             0          0          0
          40   672             0          0          0
          42   704             0          0          0
          43   720            17          1          3
          44   736             0          0          0
          46   768             0          0          0
          49   816             0          0          0
          51   848             0          0          0
          52   864            14          1          3
          54   896             0          0          0
          57   944            13          1          3
          58   960             0          0          0
          62  1024             4          1          1
          66  1088            15          2          4
          67  1104             0          0          0
          71  1168             0          0          0
          74  1216             0          0          0
          76  1248             0          0          0
          83  1360             3          1          1
          91  1488            11          1          4
          94  1536             0          0          0
         100  1632             5          1          2
         107  1744             0          0          0
         111  1808             9          1          4
         126  2048             4          4          2
         144  2336             7          3          4
         151  2448             0          0          0
         168  2720            15         15         10
         190  3072            28         27         21
         202  3264             0          0          0
         254  4096         36209      36209      36209
      
       Total               37022      36326      36288
      
      We can calculate the overall fragentation by the last line:
          Total               37022      36326      36288
          (37022 - 36326) / 37022 = 1.87%
      
      Also by analysing objects alocated in every class we know why we got so
      low fragmentation: Most of the allocated objects is in <class 254>.  And
      there is only 1 page in class 254 zspage.  So, No fragmentation will be
      introduced by allocating objs in class 254.
      
      And in future, we can collect other zsmalloc statistics as we need and
      analyse them.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Suggested-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f050d99
    • G
      mm/zpool: add name argument to create zpool · 3eba0c6a
      Ganesh Mahendran 提交于
      Currently the underlay of zpool: zsmalloc/zbud, do not know who creates
      them.  There is not a method to let zsmalloc/zbud find which caller they
      belong to.
      
      Now we want to add statistics collection in zsmalloc.  We need to name the
      debugfs dir for each pool created.  The way suggested by Minchan Kim is to
      use a name passed by caller(such as zram) to create the zsmalloc pool.
      
          /sys/kernel/debug/zsmalloc/zram0
      
      This patch adds an argument `name' to zs_create_pool() and other related
      functions.
      Signed-off-by: NGanesh Mahendran <opensource.ganesh@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3eba0c6a
    • H
      mm: fix negative nr_isolated counts · ff59909a
      Hugh Dickins 提交于
      The vmstat interfaces are good at hiding negative counts (at least when
      CONFIG_SMP); but if you peer behind the curtain, you find that
      nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
      more negative: so they can absorb larger and larger numbers of isolated
      pages, yet still appear to be zero.
      
      I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
      but I guess it's there for a good reason, in which case we ought to get
      too_many_isolated() working again.
      
      The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
      putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
      to call acct_isolated() to increment them.
      
      It is possible that the bug whcih this patch fixes could cause OOM kills
      when the system still has a lot of reclaimable page cache.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff59909a
    • N
      mm: hwpoison: drop lru_add_drain_all() in __soft_offline_page() · 9ab3b598
      Naoya Horiguchi 提交于
      A race condition starts to be visible in recent mmotm, where a PG_hwpoison
      flag is set on a migration source page *before* it's back in buddy page
      poo= l.
      
      This is problematic because no page flag is supposed to be set when
      freeing (see __free_one_page().) So the user-visible effect of this race
      is that it could trigger the BUG_ON() when soft-offlining is called.
      
      The root cause is that we call lru_add_drain_all() to make sure that the
      page is in buddy, but that doesn't work because this function just
      schedule= s a work item and doesn't wait its completion.
      drain_all_pages() does drainin= g directly, so simply dropping
      lru_add_drain_all() solves this problem.
      
      Fixes: f15bdfa8 ("mm/memory-failure.c: fix memory leak in successful soft offlining")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Chen Gong <gong.chen@linux.intel.com>
      Cc: <stable@vger.kernel.org>	[3.11+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ab3b598
    • Y
      mm/page_alloc: fix comment · 84109e15
      Yaowei Bai 提交于
      Add a necessary 'leave'.
      Signed-off-by: NYaowei Bai <bywxiaobai@163.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84109e15
    • G
      mm/memory.c: actually remap enough memory · 9cb12d7b
      Grazvydas Ignotas 提交于
      For whatever reason, generic_access_phys() only remaps one page, but
      actually allows to access arbitrary size.  It's quite easy to trigger
      large reads, like printing out large structure with gdb, which leads to a
      crash.  Fix it by remapping correct size.
      
      Fixes: 28b2ee20 ("access_process_vm device memory infrastructure")
      Signed-off-by: NGrazvydas Ignotas <notasas@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9cb12d7b
    • R
      mm/mm_init.c: mark mminit_loglevel __meminitdata · 194e8151
      Rasmus Villemoes 提交于
      mminit_loglevel is only referenced from __init and __meminit functions, so
      we can mark it __meminitdata.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      194e8151
    • R
      mm/mm_init.c: park mminit_verify_zonelist as __init · 0e2342c7
      Rasmus Villemoes 提交于
      The only caller of mminit_verify_zonelist is build_all_zonelists_init,
      which is annotated with __init, so it should be safe to also mark the
      former as __init, saving ~400 bytes of .text.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e2342c7
    • R
      mm/page_alloc.c: pull out init code from build_all_zonelists · 061f67bc
      Rasmus Villemoes 提交于
      Pulling the code protected by if (system_state == SYSTEM_BOOTING) into
      its own helper allows us to shrink .text a little. This relies on
      build_all_zonelists already having a __ref annotation. Add a comment
      explaining why so one doesn't have to track it down through git log.
      
      The real saving comes in 3/5, ("mm/mm_init.c: Mark mminit_verify_zonelist
      as __init"), where we save about 400 bytes
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      061f67bc
    • R
      mm/internal.h: don't split printk call in two · fc5199d1
      Rasmus Villemoes 提交于
      All users of mminit_dprintk pass a compile-time constant as level, so this
      just makes gcc emit a single printk call instead of two.
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc5199d1
    • V
      memcg: cleanup static keys decrement · f48b80a5
      Vladimir Davydov 提交于
      Move memcg_socket_limit_enabled decrement to tcp_destroy_cgroup (called
      from memcg_destroy_kmem -> mem_cgroup_sockets_destroy) and zap a bunch of
      wrapper functions.
      
      Although this patch moves static keys decrement from __mem_cgroup_free to
      mem_cgroup_css_free, it does not introduce any functional changes, because
      the keys are incremented on setting the limit (tcp or kmem), which can
      only happen after successful mem_cgroup_css_online.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f48b80a5
    • J
      mm/compaction: stop the isolation when we isolate enough freepage · 932ff6bb
      Joonsoo Kim 提交于
      Currently, freepage isolation in one pageblock doesn't consider how many
      freepages we isolate. When I traced flow of compaction, compaction
      sometimes isolates more than 256 freepages to migrate just 32 pages.
      
      In this patch, freepage isolation is stopped at the point that we
      have more isolated freepage than isolated page for migration. This
      results in slowing down free page scanner and make compaction success
      rate higher.
      
      stress-highalloc test in mmtests with non movable order 7 allocation shows
      increase of compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      27.13 : 31.82
      
      pfn where both scanners meets on compaction complete
      (separate test due to enormous tracepoint buffer)
      (zone_start=4096, zone_end=1048576)
      586034 : 654378
      
      In fact, I didn't fully understand why this patch results in such good
      result. There was a guess that not used freepages are released to pcp list
      and on next compaction trial we won't isolate them again so compaction
      success rate would decrease. To prevent this effect, I tested with adding
      pcp drain code on release_freepages(), but, it has no good effect.
      
      Anyway, this patch reduces waste time to isolate unneeded freepages so
      seems reasonable.
      
      Vlastimil said:
      
      : I briefly tried it on top of the pivot-changing series and with order-9
      : allocations it reduced free page scanned counter by almost 10%.  No effect
      : on success rates (maybe because pivot changing already took care of the
      : scanners meeting problem) but the scanning reduction is good on its own.
      :
      : It also explains why e14c720e ("mm, compaction: remember position
      : within pageblock in free pages scanner") had less than expected
      : improvements.  It would only actually stop within pageblock in case of
      : async compaction detecting contention.  I guess that's also why the
      : infinite loop problem fixed by 1d5bfe1f affected so relatively few
      : people.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      932ff6bb
    • J
      mm/compaction: fix wrong order check in compact_finished() · 372549c2
      Joonsoo Kim 提交于
      What we want to check here is whether there is highorder freepage in buddy
      list of other migratetype in order to steal it without fragmentation.
      But, current code just checks cc->order which means allocation request
      order.  So, this is wrong.
      
      Without this fix, non-movable synchronous compaction below pageblock order
      would not stopped until compaction is complete, because migratetype of
      most pageblocks are movable and high order freepage made by compaction is
      usually on movable type buddy list.
      
      There is some report related to this bug. See below link.
      
        http://www.spinics.net/lists/linux-mm/msg81666.html
      
      Although the issued system still has load spike comes from compaction,
      this makes that system completely stable and responsive according to his
      report.
      
      stress-highalloc test in mmtests with non movable order 7 allocation
      doesn't show any notable difference in allocation success rate, but, it
      shows more compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      18.47 : 28.94
      
      Fixes: 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available")
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      372549c2
    • V
      slub: make dead caches discard free slabs immediately · d6e0b7fa
      Vladimir Davydov 提交于
      To speed up further allocations SLUB may store empty slabs in per cpu/node
      partial lists instead of freeing them immediately.  This prevents per
      memcg caches destruction, because kmem caches created for a memory cgroup
      are only destroyed after the last page charged to the cgroup is freed.
      
      To fix this issue, this patch resurrects approach first proposed in [1].
      It forbids SLUB to cache empty slabs after the memory cgroup that the
      cache belongs to was destroyed.  It is achieved by setting kmem_cache's
      cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
      that it would drop frozen empty slabs immediately if cpu_partial = 0.
      
      The runtime overhead is minimal.  From all the hot functions, we only
      touch relatively cold put_cpu_partial(): we make it call
      unfreeze_partials() after freezing a slab that belongs to an offline
      memory cgroup.  Since slab freezing exists to avoid moving slabs from/to a
      partial list on free/alloc, and there can't be allocations from dead
      caches, it shouldn't cause any overhead.  We do have to disable preemption
      for put_cpu_partial() to achieve that though.
      
      The original patch was accepted well and even merged to the mm tree.
      However, I decided to withdraw it due to changes happening to the memcg
      core at that time.  I had an idea of introducing per-memcg shrinkers for
      kmem caches, but now, as memcg has finally settled down, I do not see it
      as an option, because SLUB shrinker would be too costly to call since SLUB
      does not keep free slabs on a separate list.  Besides, we currently do not
      even call per-memcg shrinkers for offline memcgs.  Overall, it would
      introduce much more complexity to both SLUB and memcg than this small
      patch.
      
      Regarding to SLAB, there's no problem with it, because it shrinks
      per-cpu/node caches periodically.  Thanks to list_lru reparenting, we no
      longer keep entries for offline cgroups in per-memcg arrays (such as
      memcg_cache_params->memcg_caches), so we do not have to bother if a
      per-memcg cache will be shrunk a bit later than it could be.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6e0b7fa
    • V
      slub: fix kmem_cache_shrink return value · ce3712d7
      Vladimir Davydov 提交于
      It is supposed to return 0 if the cache has no remaining objects and 1
      otherwise, while currently it always returns 0.  Fix it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce3712d7
    • V
      slub: never fail to shrink cache · 832f37f5
      Vladimir Davydov 提交于
      SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but
      also tries to rearrange the partial lists to place slabs filled up most to
      the head to cope with fragmentation.  To achieve that, it allocates a
      temporary array of lists used to sort slabs by the number of objects in
      use.  If the allocation fails, the whole procedure is aborted.
      
      This is unacceptable for the kernel memory accounting extension of the
      memory cgroup, where we want to make sure that kmem_cache_shrink()
      successfully discarded empty slabs.  Although the allocation failure is
      utterly unlikely with the current page allocator implementation, which
      retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not
      to rely on that.
      
      This patch therefore makes __kmem_cache_shrink() allocate the array on
      stack instead of calling kmalloc, which may fail.  The array size is
      chosen to be equal to 32, because most SLUB caches store not more than 32
      objects per slab page.  Slab pages with <= 32 free objects are sorted
      using the array by the number of objects in use and promoted to the head
      of the partial list, while slab pages with > 32 free objects are left in
      the end of the list without any ordering imposed on them.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      832f37f5
    • V
      memcg: reparent list_lrus and free kmemcg_id on css offline · 2788cf0c
      Vladimir Davydov 提交于
      Now, the only reason to keep kmemcg_id till css free is list_lru, which
      uses it to distribute elements between per-memcg lists.  However, it can
      be easily sorted out - we only need to change kmemcg_id of an offline
      cgroup to its parent's id, making further list_lru_add()'s add elements to
      the parent's list, and then move all elements from the offline cgroup's
      list to the one of its parent.  It will work, because a racing
      list_lru_del() does not need to know the list it is deleting the element
      from.  It can decrement the wrong nr_items counter though, but the ongoing
      reparenting will fix it.  After list_lru reparenting is done we are free
      to release kmemcg_id saving a valuable slot in a per-memcg array for new
      cgroups.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2788cf0c
    • V
      list_lru: add helpers to isolate items · 3f97b163
      Vladimir Davydov 提交于
      Currently, the isolate callback passed to the list_lru_walk family of
      functions is supposed to just delete an item from the list upon returning
      LRU_REMOVED or LRU_REMOVED_RETRY, while nr_items counter is fixed by
      __list_lru_walk_one after the callback returns.  Since the callback is
      allowed to drop the lock after removing an item (it has to return
      LRU_REMOVED_RETRY then), the nr_items can be less than the actual number
      of elements on the list even if we check them under the lock.  This makes
      it difficult to move items from one list_lru_one to another, which is
      required for per-memcg list_lru reparenting - we can't just splice the
      lists, we have to move entries one by one.
      
      This patch therefore introduces helpers that must be used by callback
      functions to isolate items instead of raw list_del/list_move.  These are
      list_lru_isolate and list_lru_isolate_move.  They not only remove the
      entry from the list, but also fix the nr_items counter, making sure
      nr_items always reflects the actual number of elements on the list if
      checked under the appropriate lock.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f97b163
    • V
      memcg: free memcg_caches slot on css offline · 2a4db7eb
      Vladimir Davydov 提交于
      We need to look up a kmem_cache in ->memcg_params.memcg_caches arrays only
      on allocations, so there is no need to have the array entries set until
      css free - we can clear them on css offline.  This will allow us to reuse
      array entries more efficiently and avoid costly array relocations.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a4db7eb
    • V
      slab: use css id for naming per memcg caches · f1008365
      Vladimir Davydov 提交于
      Currently, we use mem_cgroup->kmemcg_id to guarantee kmem_cache->name
      uniqueness.  This is correct, because kmemcg_id is only released on css
      free after destroying all per memcg caches.
      
      However, I am going to change that and release kmemcg_id on css offline,
      because it is not wise to keep it for so long, wasting valuable entries of
      memcg_cache_params->memcg_caches arrays.  Therefore, to preserve cache
      name uniqueness, let us switch to css->id.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1008365
    • V
      slab: link memcg caches of the same kind into a list · 426589f5
      Vladimir Davydov 提交于
      Sometimes, we need to iterate over all memcg copies of a particular root
      kmem cache.  Currently, we use memcg_cache_params->memcg_caches array for
      that, because it contains all existing memcg caches.
      
      However, it's a bad practice to keep all caches, including those that
      belong to offline cgroups, in this array, because it will be growing
      beyond any bounds then.  I'm going to wipe away dead caches from it to
      save space.  To still be able to perform iterations over all memcg caches
      of the same kind, let us link them into a list.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      426589f5
    • V
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov 提交于
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
    • V
      list_lru: introduce per-memcg lists · 60d3fd32
      Vladimir Davydov 提交于
      There are several FS shrinkers, including super_block::s_shrink, that
      keep reclaimable objects in the list_lru structure.  Hence to turn them
      to memcg-aware shrinkers, it is enough to make list_lru per-memcg.
      
      This patch does the trick.  It adds an array of lru lists to the
      list_lru_node structure (per-node part of the list_lru), one for each
      kmem-active memcg, and dispatches every item addition or removal to the
      list corresponding to the memcg which the item is accounted to.  So now
      the list_lru structure is not just per node, but per node and per memcg.
      
      Not all list_lrus need this feature, so this patch also adds a new
      method, list_lru_init_memcg, which initializes a list_lru as memcg
      aware.  Otherwise (i.e.  if initialized with old list_lru_init), the
      list_lru won't have per memcg lists.
      
      Just like per memcg caches arrays, the arrays of per-memcg lists are
      indexed by memcg_cache_id, so we must grow them whenever
      memcg_nr_cache_ids is increased.  So we introduce a callback,
      memcg_update_all_list_lrus, invoked by memcg_alloc_cache_id if the id
      space is full.
      
      The locking is implemented in a manner similar to lruvecs, i.e.  we have
      one lock per node that protects all lists (both global and per cgroup) on
      the node.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60d3fd32
    • V
      list_lru: organize all list_lrus to list · c0a5b560
      Vladimir Davydov 提交于
      To make list_lru memcg aware, we need all list_lrus to be kept on a list
      protected by a mutex, so that we could sleep while walking over the
      list.
      
      Therefore after this change list_lru_destroy may sleep.  Fortunately,
      there is only one user that calls it from an atomic context - it's
      put_super - and we can easily fix it by calling list_lru_destroy before
      put_super in destroy_locked_super - anyway we don't longer need lrus by
      that time.
      
      Another point that should be noted is that list_lru_destroy is allowed
      to be called on an uninitialized zeroed-out object, in which case it is
      a no-op.  Before this patch this was guaranteed by kfree, but now we
      need an explicit check there.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a5b560
    • V
      list_lru: get rid of ->active_nodes · ff0b67ef
      Vladimir Davydov 提交于
      The active_nodes mask allows us to skip empty nodes when walking over
      list_lru items from all nodes in list_lru_count/walk.  However, these
      functions are never called from hot paths, so it doesn't seem we need
      such kind of optimization there.  OTOH, removing the mask will make it
      easier to make list_lru per-memcg.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff0b67ef
    • V
      memcg: add rwsem to synchronize against memcg_caches arrays relocation · 05257a1a
      Vladimir Davydov 提交于
      We need a stable value of memcg_nr_cache_ids in kmem_cache_create()
      (memcg_alloc_cache_params() wants it for root caches), where we only
      hold the slab_mutex and no memcg-related locks.  As a result, we have to
      update memcg_nr_cache_ids under the slab_mutex, which we can only take
      on the slab's side (see memcg_update_array_size).  This looks awkward
      and will become even worse when per-memcg list_lru is introduced, which
      also wants stable access to memcg_nr_cache_ids.
      
      To get rid of this dependency between the memcg_nr_cache_ids and the
      slab_mutex, this patch introduces a special rwsem.  The rwsem is held
      for writing during memcg_caches arrays relocation and memcg_nr_cache_ids
      updates.  Therefore one can take it for reading to get a stable access
      to memcg_caches arrays and/or memcg_nr_cache_ids.
      
      Currently the semaphore is taken for reading only from
      kmem_cache_create, right before taking the slab_mutex, so right now
      there's no much point in using rwsem instead of mutex.  However, once
      list_lru is made per-memcg it will allow list_lru initializations to
      proceed concurrently.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05257a1a
    • V
      memcg: rename some cache id related variables · dbcf73e2
      Vladimir Davydov 提交于
      memcg_limited_groups_array_size, which defines the size of memcg_caches
      arrays, sounds rather cumbersome.  Also it doesn't point anyhow that
      it's related to kmem/caches stuff.  So let's rename it to
      memcg_nr_cache_ids.  It's concise and points us directly to
      memcg_cache_id.
      
      Also, rename kmem_limited_groups to memcg_cache_ida.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dbcf73e2
    • V
      vmscan: per memory cgroup slab shrinkers · cb731d6c
      Vladimir Davydov 提交于
      This patch adds SHRINKER_MEMCG_AWARE flag.  If a shrinker has this flag
      set, it will be called per memory cgroup.  The memory cgroup to scan
      objects from is passed in shrink_control->memcg.  If the memory cgroup
      is NULL, a memcg aware shrinker is supposed to scan objects from the
      global list.  Unaware shrinkers are only called on global pressure with
      memcg=NULL.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb731d6c
    • V
      list_lru: introduce list_lru_shrink_{count,walk} · 503c358c
      Vladimir Davydov 提交于
      Kmem accounting of memcg is unusable now, because it lacks slab shrinker
      support.  That means when we hit the limit we will get ENOMEM w/o any
      chance to recover.  What we should do then is to call shrink_slab, which
      would reclaim old inode/dentry caches from this cgroup.  This is what
      this patch set is intended to do.
      
      Basically, it does two things.  First, it introduces the notion of
      per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
      cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
      passed the memory cgroup to scan from in shrink_control->memcg.  For
      such shrinkers shrink_slab iterates over the whole cgroup subtree under
      the target cgroup and calls the shrinker for each kmem-active memory
      cgroup.
      
      Secondly, this patch set makes the list_lru structure per-memcg.  It's
      done transparently to list_lru users - everything they have to do is to
      tell list_lru_init that they want memcg-aware list_lru.  Then the
      list_lru will automatically distribute objects among per-memcg lists
      basing on which cgroup the object is accounted to.  This way to make FS
      shrinkers (icache, dcache) memcg-aware we only need to make them use
      memcg-aware list_lru, and this is what this patch set does.
      
      As before, this patch set only enables per-memcg kmem reclaim when the
      pressure goes from memory.limit, not from memory.kmem.limit.  Handling
      memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
      it is still unclear whether we will have this knob in the unified
      hierarchy.
      
      This patch (of 9):
      
      NUMA aware slab shrinkers use the list_lru structure to distribute
      objects coming from different NUMA nodes to different lists.  Whenever
      such a shrinker needs to count or scan objects from a particular node,
      it issues commands like this:
      
              count = list_lru_count_node(lru, sc->nid);
              freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                         isolate_arg, &sc->nr_to_scan);
      
      where sc is an instance of the shrink_control structure passed to it
      from vmscan.
      
      To simplify this, let's add special list_lru functions to be used by
      shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
      consolidate the nid and nr_to_scan arguments in the shrink_control
      structure.
      
      This will also allow us to avoid patching shrinkers that use list_lru
      when we make shrink_slab() per-memcg - all we will have to do is extend
      the shrink_control structure to include the target memcg and make
      list_lru_shrink_{count,walk} handle this appropriately.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Suggested-by: NDave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      503c358c