1. 13 Dec, 2016 1 commit
  2. 28 Oct, 2016 2 commits
  3. 07 Sep, 2016 1 commit
  4. 03 Aug, 2016 2 commits
  5. 27 Jul, 2016 5 commits
    • W
      mm/slab: use list_move instead of list_del/list_add · de24baec
      Wei Yongjun committed
      Use list_move() instead of list_del() + list_add() to avoid needlessly
      poisoning the next and prev pointers.
      
      Link: http://lkml.kernel.org/r/1468929772-9174-1-git-send-email-weiyj_lk@163.com
      Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de24baec
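
The difference between the two approaches can be sketched in userspace C. This is a minimal illustration, not the kernel's list.h: the poison constants are stand-in values, and list_unlink() is a hypothetical helper name introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled loosely on the kernel's. */
struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in poison values (assumed for illustration only). */
#define DEMO_POISON1 ((struct list_head *)0x100)
#define DEMO_POISON2 ((struct list_head *)0x200)

static inline void list_init(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Hypothetical helper: detach an entry without touching its own pointers. */
static inline void list_unlink(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
}

/* Insert an entry right after head. */
static inline void list_add(struct list_head *e, struct list_head *head)
{
	e->next = head->next;
	e->prev = head;
	head->next->prev = e;
	head->next = e;
}

/* Detach and poison: two stores that list_add() overwrites immediately. */
static inline void list_del(struct list_head *e)
{
	list_unlink(e);
	e->next = DEMO_POISON1;
	e->prev = DEMO_POISON2;
}

/* Detach and re-insert in one step, skipping the poisoning stores. */
static inline void list_move(struct list_head *e, struct list_head *head)
{
	assert(e != head);
	list_unlink(e);
	list_add(e, head);
}
```

Calling list_del(&e) followed by list_add(&e, &b) ends in the same list state as list_move(&e, &b); the latter simply avoids writing poison values that are overwritten right away.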
    • M
      slab: do not panic on invalid gfp_mask · 72baeef0
      Michal Hocko committed
      Both SLAB and SLUB call BUG() when a caller provides an invalid gfp_mask.
      This is a rather harsh way to announce a non-critical issue.  The
      allocator is free to ignore invalid flags.  Simply replace BUG() with
      dump_stack() to point out the offender, and fix up the mask so the
      allocation request can proceed.
      
      This is an example for kmalloc(GFP_KERNEL|__GFP_HIGHMEM) from a test
      module:
      
        Unexpected gfp: 0x2 (__GFP_HIGHMEM). Fixing up to gfp: 0x24000c0 (GFP_KERNEL). Fix your code!
        CPU: 0 PID: 2916 Comm: insmod Tainted: G           O    4.6.0-slabgfp2-00002-g4cdfc2ef4892-dirty #936
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
          dump_stack+0x67/0x90
          cache_alloc_refill+0x201/0x617
          kmem_cache_alloc_trace+0xa7/0x24a
          ? 0xffffffffa0005000
          mymodule_init+0x20/0x1000 [test_slab]
          do_one_initcall+0xe7/0x16c
          ? rcu_read_lock_sched_held+0x61/0x69
          ? kmem_cache_alloc_trace+0x197/0x24a
          do_init_module+0x5f/0x1d9
          load_module+0x1a3d/0x1f21
          ? retint_kernel+0x2d/0x2d
          SyS_init_module+0xe8/0x10e
          ? SyS_init_module+0xe8/0x10e
          do_syscall_64+0x68/0x13f
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/1465548200-11384-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72baeef0
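
The warn-and-fix-up behaviour described above can be sketched in userspace C. The two flag values are taken from the log in the commit message; the DEMO_ names, the mask contents and demo_fixup_gfp() are assumptions for illustration, not the kernel's code.

```c
#include <assert.h>
#include <stdio.h>

typedef unsigned int gfp_t;

/* Values as printed in the log above; names are illustrative stand-ins. */
#define DEMO_GFP_HIGHMEM   0x2u
#define DEMO_GFP_KERNEL    0x24000c0u
/* Assumed set of flags that the slab allocator cannot honour. */
#define DEMO_SLAB_BUG_MASK DEMO_GFP_HIGHMEM

/*
 * Instead of crashing on invalid flags, report them and strip them so the
 * allocation can continue. A kernel would also dump the stack here.
 */
static inline gfp_t demo_fixup_gfp(gfp_t flags)
{
	gfp_t invalid = flags & DEMO_SLAB_BUG_MASK;

	if (invalid) {
		flags &= ~DEMO_SLAB_BUG_MASK;
		fprintf(stderr,
			"Unexpected gfp: %#x. Fixing up to gfp: %#x. Fix your code!\n",
			invalid, flags);
	}
	assert((flags & DEMO_SLAB_BUG_MASK) == 0);
	return flags;
}
```

For example, demo_fixup_gfp(DEMO_GFP_KERNEL | DEMO_GFP_HIGHMEM) prints a warning and returns DEMO_GFP_KERNEL, while a valid mask passes through unchanged.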
    • M
      slab: make GFP_SLAB_BUG_MASK information more human readable · bacdcb34
      Michal Hocko committed
      printk has offered %pGg for quite some time, so use it to get a
      human-readable list of invalid flags.
      
      The original output would be
        [  429.191962] gfp: 2
      
      after the change
        [  429.191962] Unexpected gfp: 0x2 (__GFP_HIGHMEM)
      
      Link: http://lkml.kernel.org/r/1465548200-11384-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bacdcb34
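
To show what a flag-decoding format specifier buys, here is a rough userspace sketch of turning a mask into names. It is not the kernel's %pGg implementation: the table covers only three low bits chosen for illustration, and demo_format_gfp() is a hypothetical helper.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
	unsigned int mask;
	const char *name;
};

/* Illustrative subset only; a real table would cover every gfp bit. */
static const struct flag_name demo_gfp_names[] = {
	{ 0x1u, "__GFP_DMA" },
	{ 0x2u, "__GFP_HIGHMEM" },
	{ 0x4u, "__GFP_DMA32" },
};

/* Writes "<hex> (<NAME>|<NAME>...)" into buf; unknown bits are not named. */
static inline void demo_format_gfp(unsigned int flags, char *buf, size_t len)
{
	size_t used;
	const char *sep = "";

	assert(buf != NULL && len > 0);
	used = (size_t)snprintf(buf, len, "%#x (", flags);
	for (size_t i = 0; i < sizeof(demo_gfp_names) / sizeof(demo_gfp_names[0]); i++) {
		if (!(flags & demo_gfp_names[i].mask) || used >= len)
			continue;
		used += (size_t)snprintf(buf + used, len - used, "%s%s",
					 sep, demo_gfp_names[i].name);
		sep = "|";
	}
	if (used < len)
		snprintf(buf + used, len - used, ")");
}
```

With a 64-byte buffer, formatting 0x2 yields the same shape as the "after" line in the commit message: a hex value followed by the decoded flag name in parentheses.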
    • T
      mm: reorganize SLAB freelist randomization · 7c00fce9
      Thomas Garnier committed
      The kernel heap allocators use a sequential freelist, which makes their
      allocations predictable.  This predictability makes kernel heap overflows
      easier to exploit.  An attacker can carefully prepare the kernel heap to
      control the chunk that the overflow spills into.
      
      For example these attacks exploit the predictability of the heap:
       - Linux Kernel CAN SLUB overflow (https://goo.gl/oMNWkU)
       - Exploiting Linux Kernel Heap corruptions (http://goo.gl/EXLn95)
      
      ***Problems that needed solving:
       - Randomize the freelist (singly linked) used in the SLUB allocator.
       - Ensure good performance to encourage usage.
       - Get the best entropy in the early boot stage.
      
      ***Parts:
       - 01/02 Reorganize the SLAB freelist randomization to share elements
         with the SLUB implementation.
       - 02/02 The SLUB freelist randomization implementation.  The approach is
         similar to SLAB's but tailored to the singly linked freelist used in SLUB.
      
      ***Performance data:
      
      The slab_test impact is between 3% and 4% on average for 100000 attempts
      without SMP.  This is a narrowly focused test; kernbench shows that the
      overall impact on the system is much lower.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 49 cycles kfree -> 77 cycles
        100000 times kmalloc(16) -> 51 cycles kfree -> 79 cycles
        100000 times kmalloc(32) -> 53 cycles kfree -> 83 cycles
        100000 times kmalloc(64) -> 62 cycles kfree -> 90 cycles
        100000 times kmalloc(128) -> 81 cycles kfree -> 97 cycles
        100000 times kmalloc(256) -> 98 cycles kfree -> 121 cycles
        100000 times kmalloc(512) -> 95 cycles kfree -> 122 cycles
        100000 times kmalloc(1024) -> 96 cycles kfree -> 126 cycles
        100000 times kmalloc(2048) -> 115 cycles kfree -> 140 cycles
        100000 times kmalloc(4096) -> 149 cycles kfree -> 171 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 70 cycles
        100000 times kmalloc(16)/kfree -> 70 cycles
        100000 times kmalloc(32)/kfree -> 70 cycles
        100000 times kmalloc(64)/kfree -> 70 cycles
        100000 times kmalloc(128)/kfree -> 70 cycles
        100000 times kmalloc(256)/kfree -> 69 cycles
        100000 times kmalloc(512)/kfree -> 70 cycles
        100000 times kmalloc(1024)/kfree -> 73 cycles
        100000 times kmalloc(2048)/kfree -> 72 cycles
        100000 times kmalloc(4096)/kfree -> 71 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        100000 times kmalloc(8) -> 57 cycles kfree -> 78 cycles
        100000 times kmalloc(16) -> 61 cycles kfree -> 81 cycles
        100000 times kmalloc(32) -> 76 cycles kfree -> 93 cycles
        100000 times kmalloc(64) -> 83 cycles kfree -> 94 cycles
        100000 times kmalloc(128) -> 106 cycles kfree -> 107 cycles
        100000 times kmalloc(256) -> 118 cycles kfree -> 117 cycles
        100000 times kmalloc(512) -> 114 cycles kfree -> 116 cycles
        100000 times kmalloc(1024) -> 115 cycles kfree -> 118 cycles
        100000 times kmalloc(2048) -> 147 cycles kfree -> 131 cycles
        100000 times kmalloc(4096) -> 214 cycles kfree -> 161 cycles
        2. Kmalloc: alloc/free test
        100000 times kmalloc(8)/kfree -> 66 cycles
        100000 times kmalloc(16)/kfree -> 66 cycles
        100000 times kmalloc(32)/kfree -> 66 cycles
        100000 times kmalloc(64)/kfree -> 66 cycles
        100000 times kmalloc(128)/kfree -> 65 cycles
        100000 times kmalloc(256)/kfree -> 67 cycles
        100000 times kmalloc(512)/kfree -> 67 cycles
        100000 times kmalloc(1024)/kfree -> 64 cycles
        100000 times kmalloc(2048)/kfree -> 67 cycles
        100000 times kmalloc(4096)/kfree -> 67 cycles
      
      Kernbench, before:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 101.873 (1.16069)
        User Time 1045.22 (1.60447)
        System Time 88.969 (0.559195)
        Percent CPU 1112.9 (13.8279)
        Context Switches 189140 (2282.15)
        Sleeps 99008.6 (768.091)
      
      After:
      
        Average Optimal load -j 12 Run (std deviation):
        Elapsed Time 102.47 (0.562732)
        User Time 1045.3 (1.34263)
        System Time 88.311 (0.342554)
        Percent CPU 1105.8 (6.49444)
        Context Switches 189081 (2355.78)
        Sleeps 99231.5 (800.358)
      
      This patch (of 2):
      
      This commit reorganizes the previous SLAB freelist randomization to
      prepare for the SLUB implementation.  It moves functions that will be
      shared to slab_common.
      
      The entropy functions are changed to align with the SLUB implementation
      and now use the get_random_(int|long) functions.  These functions were
      chosen because they provide a bit more entropy early in boot and better
      performance when specific arch instructions are not available.
      
      [akpm@linux-foundation.org: fix build]
      Link: http://lkml.kernel.org/r/1464295031-26375-2-git-send-email-thgarnie@google.com
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c00fce9
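
The shared piece both allocators need is a randomized index sequence. A minimal sketch, assuming a Fisher-Yates shuffle driven by a toy xorshift generator; the kernel seeds from its own random API, and all demo_ names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy PRNG standing in for the kernel's entropy source (assumption). */
struct demo_rng {
	uint64_t state; /* must be non-zero */
};

static inline uint32_t demo_rng_next(struct demo_rng *r)
{
	r->state ^= r->state << 13;
	r->state ^= r->state >> 7;
	r->state ^= r->state << 17;
	return (uint32_t)(r->state >> 32);
}

/*
 * Fill seq[0..count-1] with a random permutation of 0..count-1.
 * The modulo introduces a slight bias, acceptable for a sketch.
 */
static inline void demo_random_seq_create(unsigned int *seq, unsigned int count,
					  uint64_t seed)
{
	struct demo_rng rng = { seed ? seed : 1 };

	assert(seq != NULL);
	for (unsigned int i = 0; i < count; i++)
		seq[i] = i;
	if (count < 2)
		return;
	for (unsigned int i = count - 1; i > 0; i--) {
		unsigned int j = demo_rng_next(&rng) % (i + 1);
		unsigned int tmp = seq[i];

		seq[i] = seq[j];
		seq[j] = tmp;
	}
}
```

Each cache would hold one such sequence sized to its objects-per-slab count, so that the order in which objects are handed out no longer follows their addresses.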
    • K
      mm: SLAB hardened usercopy support · 04385fc5
      Kees Cook committed
      Under CONFIG_HARDENED_USERCOPY, this adds object size checking to the
      SLAB allocator to catch any copies that may span objects.
      
      Based on code from PaX and grsecurity.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Tested-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      04385fc5
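
The object size check can be illustrated with plain arithmetic. This sketch assumes a simplified slab in which objects are packed back to back with no padding or metadata; struct demo_slab and demo_copy_within_object() are hypothetical and do not mirror the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified slab: num_objects objects of object_size bytes from base. */
struct demo_slab {
	size_t base;
	size_t object_size;
	size_t num_objects;
};

/* True if the n-byte range starting at addr stays inside a single object. */
static inline bool demo_copy_within_object(const struct demo_slab *s,
					   size_t addr, size_t n)
{
	size_t offset;

	assert(s->object_size > 0);
	if (addr < s->base ||
	    addr >= s->base + s->object_size * s->num_objects)
		return false;
	offset = (addr - s->base) % s->object_size;
	return n <= s->object_size - offset;
}
```

With 64-byte objects, a 64-byte copy starting at an object boundary passes, while the same copy starting 32 bytes into an object is rejected because it would run into the neighbouring object.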
  6. 21 May, 2016 2 commits
    • A
      mm, kasan: don't call kasan_krealloc() from ksize(). · 4ebb31a4
      Alexander Potapenko committed
      Instead of calling kasan_krealloc(), which replaces the memory
      allocation stack ID (if stack depot is used), just unpoison the whole
      memory chunk.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4ebb31a4
    • A
      mm: kasan: initial memory quarantine implementation · 55834c59
      Alexander Potapenko committed
      Quarantine isolates freed objects in a separate queue.  The objects are
      returned to the allocator later, which helps to detect use-after-free
      errors.
      
      When the object is freed, its state changes from KASAN_STATE_ALLOC to
      KASAN_STATE_QUARANTINE.  The object is poisoned and put into quarantine
      instead of being returned to the allocator, therefore every subsequent
      access to that object triggers a KASAN error, and the error handler is
      able to say where the object has been allocated and deallocated.
      
      When it's time for the object to leave quarantine, its state becomes
      KASAN_STATE_FREE and it's returned to the allocator.  From now on the
      allocator may reuse it for another allocation.  Before that happens,
      it's still possible to detect a use-after-free on that object (it
      retains the allocation/deallocation stacks).
      
      When the allocator reuses this object, the shadow is unpoisoned and old
      allocation/deallocation stacks are wiped.  Therefore a use of this
      object, even an incorrect one, won't trigger an ASan warning.
      
      Without the quarantine, there is no guarantee that objects aren't
      reused immediately, so the probability of catching a use-after-free is
      lower than with the quarantine in place.
      
      Freed objects are first added to per-cpu quarantine queues.  When a
      cache is destroyed or memory shrinking is requested, the objects are
      moved into the global quarantine queue.  Whenever a kmalloc call allows
      memory reclaiming, the oldest objects are popped out of the global queue
      until the total size of objects in quarantine is less than 3/4 of the
      maximum quarantine size (which is a fraction of installed physical
      memory).
      
      As long as an object remains in the quarantine, KASAN is able to report
      accesses to it, so the chance of reporting a use-after-free is
      increased.  Once the object leaves quarantine, the allocator may reuse
      it, in which case the object is unpoisoned and KASAN can't detect
      incorrect accesses to it.
      
      Right now quarantine support is only enabled in the SLAB allocator.
      Unification of KASAN features in SLAB and SLUB will be done later.
      
      This patch is based on the "mm: kasan: quarantine" patch originally
      prepared by Dmitry Chernenkov.  A number of improvements have been
      suggested by Andrey Ryabinin.
      
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/1462987130-144092-1-git-send-email-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55834c59
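
The global queue's eviction rule (pop the oldest objects until the total drops below 3/4 of the maximum) can be sketched as a small FIFO. This leaves out the per-cpu queues, poisoning and state tracking; the fixed-size ring and all demo_ names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_Q_SLOTS 64

struct demo_quarantine {
	size_t sizes[DEMO_Q_SLOTS]; /* object sizes, oldest at head */
	unsigned int head;
	unsigned int count;
	size_t total_bytes;         /* bytes currently quarantined */
	size_t max_bytes;           /* maximum quarantine size */
};

/* Put a freed object of the given size at the tail of the queue. */
static inline bool demo_quarantine_put(struct demo_quarantine *q, size_t size)
{
	if (q->count == DEMO_Q_SLOTS)
		return false;
	q->sizes[(q->head + q->count) % DEMO_Q_SLOTS] = size;
	q->count++;
	q->total_bytes += size;
	return true;
}

/* Release oldest objects until the total is below 3/4 of max_bytes. */
static inline unsigned int demo_quarantine_reduce(struct demo_quarantine *q)
{
	size_t target = q->max_bytes - q->max_bytes / 4;
	unsigned int released = 0;

	while (q->count > 0 && q->total_bytes >= target) {
		assert(q->total_bytes >= q->sizes[q->head]);
		q->total_bytes -= q->sizes[q->head];
		q->head = (q->head + 1) % DEMO_Q_SLOTS;
		q->count--;
		released++;
	}
	return released;
}
```

With max_bytes = 400 and four quarantined 100-byte objects, a reduce pass releases the two oldest and leaves 200 bytes in quarantine, which is the first total below the 300-byte threshold.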
  7. 20 May, 2016 14 commits
    • A
      include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Andrew Morton committed
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0edaf86c
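
The wrap-around pattern that the helper folds into one call can be sketched over a plain 64-bit mask. The kernel operates on nodemask_t; the uint64_t mask and the demo_ function names below are stand-ins for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MAX_NUMNODES 64

/* Lowest set bit in mask, or DEMO_MAX_NUMNODES if the mask is empty. */
static inline int demo_first_node(uint64_t mask)
{
	for (int n = 0; n < DEMO_MAX_NUMNODES; n++)
		if ((mask >> n) & 1)
			return n;
	return DEMO_MAX_NUMNODES;
}

/* Next set bit strictly after node, or DEMO_MAX_NUMNODES if none. */
static inline int demo_next_node(int node, uint64_t mask)
{
	assert(node >= -1 && node < DEMO_MAX_NUMNODES);
	for (int n = node + 1; n < DEMO_MAX_NUMNODES; n++)
		if ((mask >> n) & 1)
			return n;
	return DEMO_MAX_NUMNODES;
}

/* Like demo_next_node(), but wraps around to the first set bit. */
static inline int demo_next_node_in(int node, uint64_t mask)
{
	int next = demo_next_node(node, mask);

	if (next == DEMO_MAX_NUMNODES)
		next = demo_first_node(mask);
	return next;
}
```

For a mask with nodes 1, 4 and 9 set, repeated calls starting from node 1 cycle through 4, 9, 1, and so on; an empty mask still yields DEMO_MAX_NUMNODES.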
    • Y
      mm: slab: remove ZONE_DMA_FLAG · a3187e43
      Yang Shi committed
      Now that we have the IS_ENABLED helper to check whether a Kconfig option
      is enabled, ZONE_DMA_FLAG no longer seems useful.
      
      Moreover, the use of ZONE_DMA_FLAG in slab looks pointless according to
      the comment [1] from Johannes Weiner, so remove it.  ORing the passed-in
      flags with the cache gfp flags is already done in kmem_getpages().
      
      [1] https://lkml.org/lkml/2014/9/25/553
      
      Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3187e43
    • T
      mm: SLAB freelist randomization · c7ce4f60
      Thomas Garnier committed
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to randomize
      the SLAB freelist.  The list is randomized during initialization of a
      new set of pages.  The order for different freelist sizes is pre-computed
      at boot for performance.  Each kmem_cache has its own randomized
      freelist.  Until the pre-computed lists are available, freelists are
      generated dynamically.  This security feature reduces the predictability
      of the kernel SLAB allocator against heap overflows, rendering attacks
      much less stable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      Also, since v4.6 the freelist has been located at the end of the slab.
      This means a controllable heap is open to new attacks not yet publicly
      discussed.  A kernel heap overflow can be transformed into multiple
      use-after-frees.  This feature makes this type of attack harder too.
      
      To generate entropy, we use get_random_bytes_arch because 0 bits of
      entropy are available in the boot stage.  In the worst case this function
      will fall back to the get_random_bytes sub API.  We also generate a random
      shift to offset the pre-computed freelist for each new set of pages.
      
      The config option name is not specific to the SLAB as this approach will
      be extended to other allocators like SLUB.
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test, 1 run on boot.  A difference is only seen on the 2048 size
      test, which is the worst-case scenario covered by freelist randomization:
      new slab pages are constantly being created over the 10000 allocations.
      The variance should be mainly due to getting new pages every few
      allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c7ce4f60
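
The combination of a pre-computed order and a per-page random shift might look like the following sketch. It assumes the shift is applied as a rotation of the pre-computed sequence; demo_shifted_freelist() is a hypothetical helper, and the shift value would come from the kernel's random source rather than a caller-supplied number.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Derive the object order for one new set of pages by rotating the
 * pre-computed per-cache sequence by a random shift. The rotation costs
 * one random number per slab instead of a full shuffle.
 */
static inline void demo_shifted_freelist(const unsigned int *seq,
					 unsigned int count,
					 unsigned int shift,
					 unsigned int *out)
{
	assert(seq != NULL && out != NULL && count > 0);
	for (unsigned int i = 0; i < count; i++)
		out[i] = seq[(shift % count + i) % count];
}
```

Given the pre-computed sequence {2, 0, 3, 1} and a shift of 3, the slab's freelist order becomes {1, 2, 0, 3}: the same relative order, entered at a different point.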
    • J
      mm/slab: lockless decision to grow cache · 801faf0d
      Joonsoo Kim committed
      To check precisely whether free objects exist, we need to grab a lock.
      But accuracy isn't that important here: the race window is small, and
      if there are too many free objects, the cache reaper will reap them.
      So this patch makes the check for free object existence lockless.  This
      reduces lock contention under heavy allocation load.
      
      Note that until now n->shared could be freed during processing by
      writing slabinfo, but with a trick in this patch we can access it
      freely while interrupts are disabled.
      
      Below is the result of concurrent allocation/free in the slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the
      output.  The numbers show cycle counts for alloc/free respectively, so
      lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      It shows that allocation performance decreases for object sizes up to
      128, which may be due to extra checks in cache_alloc_refill().  But
      considering the improvement in free performance, the net result looks
      the same.  Results for the other size classes look very promising:
      roughly a 50% performance improvement.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      801faf0d
    • J
      mm/slab: refill cpu cache through a new slab without holding a node lock · 213b4695
      Joonsoo Kim committed
      Until now, growing a cache put a free slab on the node's slab list, and
      only then could we allocate free objects from it.  This necessarily
      requires holding the node lock, which is heavily contended.  If we refill
      the cpu cache before attaching the slab to the node's slab list, we can
      avoid holding the node lock as much as possible, because the newly
      allocated slab is only visible to the current task.  This reduces lock
      contention.
      
      Below is the result of concurrent allocation/free in the slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the
      output.  The numbers show cycle counts for alloc/free respectively, so
      lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
        * After
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
      It shows that contention is reduced for all object sizes and
      performance increases by 30 to 40%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      213b4695
    • J
      mm/slab: separate cache_grow() to two parts · 76b342bd
      Joonsoo Kim committed
      This is a preparation step to implement a lockless allocation path for
      when there are no free objects in the kmem_cache.
      
      What we'd like to do here is refill the cpu cache without holding the
      node lock.  To accomplish this, the refill should be done after the new
      slab is allocated but before the slab is attached to the management
      list.  So this patch separates cache_grow() into two parts, allocation
      and attaching to the list, in order to add some code between them in
      the following patch.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76b342bd
    • J
      mm/slab: make cache_grow() handle the page allocated on arbitrary node · 511e3a05
      Joonsoo Kim committed
      Currently, cache_grow() assumes that the allocated page's nodeid is the
      same as the nodeid parameter used for the allocation request.  If we
      discard this assumption, we can handle the fallback_alloc() case
      gracefully.  So this patch makes cache_grow() handle a page allocated on
      an arbitrary node and cleans up the relevant code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      511e3a05
    • J
      mm/slab: racy access/modify the slab color · 03d1d43a
      Joonsoo Kim committed
      The slab color doesn't need to be updated strictly.  Because locking
      to change the slab color could cause more lock contention, this patch
      implements racy access to and modification of the slab color.  This is a
      preparation step to implement a lockless allocation path for when there
      are no free objects in the kmem_cache.
      
      Below is the result of concurrent allocation/free in the slab allocation
      benchmark made by Christoph a long time ago.  I have simplified the
      output.  The numbers show cycle counts for alloc/free respectively, so
      lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
      It shows that contention is reduced for object size >= 1024 and
      performance increases by roughly 15%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03d1d43a
    • J
      mm/slab: don't keep free slabs if free_objects exceeds free_limit · 6052b788
      Joonsoo Kim committed
      Currently, the decision to free a slab is made each time a freed
      object is put into the slab.  This has the following problem.
      
      Assume free_limit = 10 and nr_free = 9.
      
      Frees happen in the following sequence, and nr_free changes as follows:
      
      free (slab becomes a free slab), free (slab does not become a free slab)
      nr_free: 9 -> 10 (at first free) -> 11 (at second free)
      
      If we check whether the current slab can be freed on each object
      free, we can't free any slab in this situation, because the current slab
      isn't a free slab when nr_free exceeds free_limit (at the second free),
      even though a free slab exists.
      
      However, if we check at the end, we can free 1 free slab.
      
      This problem causes too much memory to be kept in the slab subsystem.
      This patch tries to fix it by checking the number of free objects after
      all free work is done.  If there is a free slab at that time, we can free
      as many slabs as possible, so the number of free slabs kept is minimal.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6052b788
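The free_limit example in the commit message above can be replayed with a small user-space model. This is only a sketch under stated assumptions: `struct slab`, `struct node`, the helper names and the slab capacity of 4 objects are hypothetical and are not the kernel's data structures. Only free_limit = 10 and nr_free = 9 come from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model; not the kernel's data structures. */
struct slab {
    int inuse;      /* objects currently allocated from this slab */
    int total;      /* objects the slab can hold (assumed: 4) */
    bool released;  /* true once the slab has been given back */
};

struct node {
    int nr_free;    /* free objects on the node */
    int free_limit; /* threshold above which free slabs may be released */
};

static bool slab_is_free(const struct slab *s)
{
    return !s->released && s->inuse == 0;
}

static void release_slab(struct node *n, struct slab *s)
{
    n->nr_free -= s->total;
    s->released = true;
}

/* Old behaviour: decide right after each object is put back. */
int free_check_each(struct node *n, struct slab **order, size_t nfrees)
{
    int released = 0;

    for (size_t i = 0; i < nfrees; i++) {
        struct slab *s = order[i];

        s->inuse--;
        n->nr_free++;
        if (n->nr_free > n->free_limit && slab_is_free(s)) {
            release_slab(n, s);
            released++;
        }
    }
    return released;
}

/* New behaviour: put all objects back first, then release free slabs. */
int free_check_last(struct node *n, struct slab **order, size_t nfrees,
                    struct slab *slabs, size_t nslabs)
{
    int released = 0;

    for (size_t i = 0; i < nfrees; i++) {
        order[i]->inuse--;
        n->nr_free++;
    }
    for (size_t j = 0; j < nslabs && n->nr_free > n->free_limit; j++) {
        if (slab_is_free(&slabs[j])) {
            release_slab(n, &slabs[j]);
            released++;
        }
    }
    return released;
}

/* Scenario from the commit message: free_limit = 10, nr_free = 9. */
int run_scenario(bool check_last)
{
    struct slab slabs[2] = {
        { .inuse = 1, .total = 4, .released = false }, /* becomes free */
        { .inuse = 3, .total = 4, .released = false }, /* stays partial */
    };
    struct slab *order[2] = { &slabs[0], &slabs[1] };
    struct node n = { .nr_free = 9, .free_limit = 10 };

    return check_last ? free_check_last(&n, order, 2, slabs, 2)
                      : free_check_each(&n, order, 2);
}
```

In this model `run_scenario(false)` releases no slab, because the limit is only exceeded while the partially used slab is being handled, while `run_scenario(true)` releases the one free slab.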
    • J
      mm/slab: clean-up kmem_cache_node setup · c3d332b6
      Joonsoo Kim committed
      The code for setting up kmem_cache_node is mostly the same in
      cpuup_prepare() and alloc_kmem_cache_node().  Factor it out and clean
      it up.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Nishanth Menon <nm@ti.com>
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3d332b6
    • J
      mm/slab: factor out kmem_cache_node initialization code · ded0ecf6
      Joonsoo Kim committed
      It can be reused in other places, so factor it out.  The following
      patch will use it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ded0ecf6
    • J
      mm/slab: drain the free slab as much as possible · a5aa63a5
      Joonsoo Kim committed
      slabs_tofree() implies freeing all free slabs.  We can do the same by
      simply passing INT_MAX.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5aa63a5
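The idea in the commit above, that "free everything" is just a bounded drain with an effectively unlimited bound, can be sketched as follows. This is a user-space illustration only: `drain_free_slabs` and its counter argument are hypothetical stand-ins, not the kernel's drain routine.

```c
#include <limits.h>

/*
 * Hypothetical model: *free_slabs is the number of completely free slabs
 * on a node.  Release at most `tofree` of them and report how many were
 * actually released.
 */
int drain_free_slabs(int *free_slabs, int tofree)
{
    int nr_freed = 0;

    while (nr_freed < tofree && *free_slabs > 0) {
        (*free_slabs)--;   /* stands in for destroying one free slab */
        nr_freed++;
    }
    return nr_freed;
}
```

Calling `drain_free_slabs(&count, INT_MAX)` drains every free slab without first computing how many there are, which is the simplification the commit message describes.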
    • J
      mm/slab: remove BAD_ALIEN_MAGIC again · 8888177e
      Joonsoo Kim committed
      The initial attempt to remove BAD_ALIEN_MAGIC was reverted by 'commit
      edcad250 ("Revert "slab: remove BAD_ALIEN_MAGIC"")' because it
      caused a problem on m68k, which has many nodes but !CONFIG_NUMA.  In
      this case, although the alien cache isn't used at all, a garbage value
      is used to cope with some initialization paths, and that value is
      BAD_ALIEN_MAGIC.  Now this patch sets use_alien_caches to 0 when
      !CONFIG_NUMA, so there is no initialization path problem and we don't
      need BAD_ALIEN_MAGIC at all.  So remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8888177e
    • J
      mm/slab: fix the theoretical race by holding proper lock · 18726ca8
      Joonsoo Kim committed
      While processing concurrent allocations, SLAB could be heavily
      contended because it did a lot of work while holding a lock.  This
      patchset tries to reduce the number of critical sections to reduce
      lock contention.  The major changes are a lockless decision to
      allocate more slabs and a lockless cpu cache refill from the newly
      allocated slab.
      
      Below is the result of concurrent allocation/free in the slab
      allocation benchmark made by Christoph a long time ago.  I made the
      output simpler.  The numbers show the cycle count during alloc/free
      respectively, so less is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      It shows that performance improves greatly (roughly more than 50%) for
      the object class whose size is more than 128 bytes.
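The "more than 50%" figure can be checked directly against the alloc cycle counts in the two tables. The sketch below copies those counts; the helper names `improvement_pct` and `min_improvement_above` are made up for this illustration.

```c
#include <stddef.h>

/* Alloc cycle counts copied from the Before/After tables above. */
static const int obj_size[]     = {32, 64, 128, 256, 512, 1024, 2048, 4096};
static const int alloc_before[] = {365, 452, 736, 1167, 2088, 4115, 8451, 16024};
static const int alloc_after[]  = {344, 347, 390, 393, 683, 1295, 2513, 4742};

/* Relative reduction in cycles, in percent (hypothetical helper). */
double improvement_pct(int before, int after)
{
    return 100.0 * (double)(before - after) / (double)before;
}

/* Smallest improvement among object sizes strictly larger than min_size. */
double min_improvement_above(int min_size)
{
    double min = 100.0;
    size_t n = sizeof(obj_size) / sizeof(obj_size[0]);

    for (size_t i = 0; i < n; i++) {
        if (obj_size[i] <= min_size)
            continue;
        double pct = improvement_pct(alloc_before[i], alloc_after[i]);
        if (pct < min)
            min = pct;
    }
    return min;
}
```

With these numbers, every size class above 128 bytes improves by roughly 66% to 70% on the alloc side, while the 128-byte class improves by about 47%, which matches the commit's wording.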
      
      This patch (of 11):
      
      If we hold neither the slab_mutex nor the node lock, the node's shared
      array cache could be freed and re-populated.  If __kmem_cache_shrink()
      is called at the same time, it will call drain_array() with n->shared
      without holding the node lock, so a problem can happen.  This patch
      fixes the situation by holding the node lock before trying to drain
      the shared array.
      
      In addition, add a debug check to confirm that the n->shared access
      race doesn't exist.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18726ca8
  8. 26 Mar, 2016 2 commits
    • A
      mm, kasan: add GFP flags to KASAN API · 505f5dcb
      Alexander Potapenko committed
      Add GFP flags to KASAN hooks for future patches to use.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      505f5dcb
    • A
      mm, kasan: SLAB support · 7ed2f9e6
      Alexander Potapenko committed
      Add KASAN hooks to SLAB allocator.
      
      This patch is based on the "mm: kasan: unified support for SLUB and SLAB
      allocators" patch originally prepared by Dmitry Chernenkov.
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ed2f9e6
  9. 18 Mar, 2016 4 commits
    • J
      mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b
      Joe Perches committed
      Most of the mm subsystem uses pr_<level> so make it consistent.
      
      Miscellanea:
      
       - Realign arguments
       - Add missing newline to format
       - kmemleak-test.c has a "kmemleak: " prefix added to the
         "Kmemleak testing" logging message via pr_fmt
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1170532b
    • J
      mm: coalesce split strings · 756a025f
      Joe Perches committed
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: Joe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • M
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman committed
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
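
The four defrag settings listed above can be read as a small decision table for what happens when a THP allocation fails in the fault path. The following is a user-space sketch of that table only: the enum and function names are hypothetical, this is not the kernel's implementation, and the separate khugepaged behaviour is not modelled.

```c
#include <stdbool.h>

/* Hypothetical names; the kernel does not define these identifiers. */
enum thp_defrag_mode {
    DEFRAG_ALWAYS,
    DEFRAG_DEFER,
    DEFRAG_MADVISE,
    DEFRAG_NEVER,
};

enum thp_fail_action {
    ACT_FALLBACK_ONLY,          /* neither reclaim/compact nor a wakeup */
    ACT_WAKE_BACKGROUND,        /* wake kswapd/kcompactd, do not stall */
    ACT_DIRECT_RECLAIM_COMPACT, /* stall in direct reclaim/compaction */
};

/*
 * What a failed THP allocation does under each setting, as described in
 * the list above.  `madvised` says whether the application requested
 * huge pages for the faulting region.
 */
enum thp_fail_action thp_fail_action(enum thp_defrag_mode mode, bool madvised)
{
    switch (mode) {
    case DEFRAG_ALWAYS:
        return ACT_DIRECT_RECLAIM_COMPACT;
    case DEFRAG_DEFER:
        return ACT_WAKE_BACKGROUND;
    case DEFRAG_MADVISE:
        return madvised ? ACT_DIRECT_RECLAIM_COMPACT : ACT_FALLBACK_ONLY;
    case DEFRAG_NEVER:
    default:
        return ACT_FALLBACK_ONLY;
    }
}
```

Under the new default, "madvise", only regions the application explicitly asked about can stall, which is the behaviour change the rest of the message measures.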
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really care are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default, which is ideal for most users: they get THP only if huge
      pages are immediately available.  There are still options for users
      that prefer a stall at startup of a new application, by either
      restoring historical behaviour with "always" or picking a half-way
      point with "defer" where kswapd does some of the work in the
      background and wakes kcompactd if necessary.  THP defrag for
      khugepaged remains enabled and will enter direct/reclaim but not wake
      up kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fall back and
      rely on khugepaged to recover the situation at some time in the
      future.  In some cases, this will reduce THP usage but the benefit of
      THP is hard to measure and not a universal win, whereas a stall to
      reclaim/compaction is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied, with smaller benefits as the thread count increases.
      Similarly, notice the large reduction in most cases in system CPU
      usage.  The overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of cases but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct reclaim activity and allocation stalls are substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                          kcompactd-v1r1 nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so let's take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • V
      mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov committed
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
  10. 16 Mar, 2016 7 commits
    • V
      mm, sl[au]b: print gfp_flags as strings in slab_out_of_memory() · 5b3810e5
      Vlastimil Babka committed
      We can now print gfp_flags in a more human-readable form.  Make use of
      this in slab_out_of_memory() for SLUB and SLAB.  Also convert the SLAB
      variant to pr_warn() along the way.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b3810e5
    • J
      mm/slab: re-implement pfmemalloc support · f68f8ddd
      Joonsoo Kim committed
      Current implementation of pfmemalloc handling in SLAB has some problems.
      
      1) pfmemalloc_active is set to true when there is just one or more
         pfmemalloc slabs in the system, but it is cleared when there is no
         pfmemalloc slab in one arbitrary kmem_cache.  So, pfmemalloc_active
         could be wrongly cleared.
      
      2) The partial and free lists are not searched when no non-pfmemalloc
         object is found in the cpu cache.  Instead, a new slab is
         allocated, which is not optimal.
      
      3) Even after sk_memalloc_socks() is disabled, cpu cache would keep
         pfmemalloc objects tagged with SLAB_OBJ_PFMEMALLOC.  It isn't cleared
         if sk_memalloc_socks() is disabled so it could cause problem.
      
      4) If cpu cache is filled with pfmemalloc objects, it would cause slow
         down non-pfmemalloc allocation.
      
      To me, the current pointer tagging approach looks complex and fragile,
      so this patch re-implements the whole thing instead of fixing the
      problems one by one.
      
      The design principles for the new implementation are:
      
      1) Don't disrupt non-pfmemalloc allocation in fast path even if
         sk_memalloc_socks() is enabled.  It's more likely case than pfmemalloc
         allocation.
      
      2) Ensure that pfmemalloc slab is used only for pfmemalloc allocation.
      
      3) Don't consider performance of pfmemalloc allocation in memory
         deficiency state.
      
      As a result, all pfmemalloc alloc/free in memory tight state will be
      handled in slow-path.  If there is non-pfmemalloc free object, it will be
      returned first even for pfmemalloc user in fast-path so that performance
      of pfmemalloc user isn't affected in normal case and pfmemalloc objects
      will be kept as long as possible.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f68f8ddd
    • J
      mm/slab: avoid returning values by reference · 70f75067
      Joonsoo Kim committed
      Returning values by reference is bad practice.  Instead, just use
      the function return value.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Christoph Lameter <cl@linux.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70f75067
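The style point in the commit above can be illustrated with a generic pair of helpers. Both functions are hypothetical and are not the functions the commit touched; they only contrast an out-parameter with a plain return value.

```c
/* Discouraged style: the result is delivered through a pointer argument. */
void count_objs_by_ref(unsigned long slab_bytes, unsigned long obj_bytes,
                       unsigned int *num)
{
    *num = (unsigned int)(slab_bytes / obj_bytes);
}

/* Preferred style: the result is simply the function's return value. */
unsigned int count_objs(unsigned long slab_bytes, unsigned long obj_bytes)
{
    return (unsigned int)(slab_bytes / obj_bytes);
}
```

The second form needs no temporary at the call site and makes it obvious which value the function produces.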
    • J
      mm/slab: introduce new slab management type, OBJFREELIST_SLAB · b03a017b
      Joonsoo Kim committed
      SLAB needs an array to manage freed objects in a slab.  It is only used
      if some objects are freed, so we can use a free object itself as this
      array.  This requires an additional branch in a somewhat critical lock
      path to check whether it is the first freed object or not, but that's
      all we need.  The benefits are that we save extra memory usage and
      reduce some of the computational overhead of allocating a management
      array when a new slab is created.
      
      The code change is rather more complex than the idea suggests, in
      order to handle the debugging features efficiently.  If you want to see
      the core idea only, please remove the '#if DEBUG' block in the patch.
      
      Although this idea can apply to all caches whose object size is larger
      than the management array size, it isn't applied to caches which have a
      constructor.  If such a cache's object is used for the management array,
      the constructor should be called for it before that object is returned
      to the user.  I guess that the overhead overwhelms the benefit in that
      case, so this idea isn't applied to them, at least for now.
      
      In summary, from now on, the slab management type is determined by
      the following logic.
      
      1) if the management array size is smaller than the object size and
         there is no ctor, it becomes OBJFREELIST_SLAB.
      
      2) if the management array size is smaller than the leftover, it
         becomes NORMAL_SLAB, which uses the leftover as an array.
      
      3) if OFF_SLAB saves more memory than way 4), it becomes OFF_SLAB.
         It allocates a management array from another cache, so memory
         waste happens.
      
      4) others become NORMAL_SLAB.  It uses dedicated internal memory in a
         slab as a management array, so it causes memory waste.
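
The four rules above form a simple priority chain, which can be sketched as one function. This is an illustration that follows the wording of the list literally ("smaller than" is taken as a strict comparison). The enum, the function name and the `off_slab_saves_memory` input are hypothetical, and the patch's actual conditions may differ in detail.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the four outcomes in the list above. */
enum slab_mgmt {
    MGMT_OBJFREELIST,     /* 1) a free object holds the management array */
    MGMT_NORMAL_LEFTOVER, /* 2) array fits into the slab's leftover space */
    MGMT_OFF_SLAB,        /* 3) array is allocated from another cache */
    MGMT_NORMAL,          /* 4) array takes dedicated space in the slab */
};

/*
 * Pick a management type in the order given by the commit message.
 * How "OFF_SLAB saves more memory than way 4)" is computed is not
 * described in the text, so it is taken as an input here.
 */
enum slab_mgmt choose_slab_mgmt(size_t array_size, size_t object_size,
                                size_t leftover, bool has_ctor,
                                bool off_slab_saves_memory)
{
    if (array_size < object_size && !has_ctor)
        return MGMT_OBJFREELIST;
    if (array_size < leftover)
        return MGMT_NORMAL_LEFTOVER;
    if (off_slab_saves_memory)
        return MGMT_OFF_SLAB;
    return MGMT_NORMAL;
}
```

For example, a cache with a constructor never takes rule 1 even when its objects are large enough, so it falls through to the leftover check.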
      
      On my system, without CONFIG_DEBUG_SLAB enabled, almost all caches
      become OBJFREELIST_SLAB or NORMAL_SLAB (using leftover), which don't
      waste memory.  The following is the number of caches with each slab
      management type.
      
      TOTAL = OBJFREELIST + NORMAL(leftover) + NORMAL + OFF
      
      /Before/
      126 = 0 + 60 + 25 + 41
      
      /After/
      126 = 97 + 12 + 15 + 2
      
      The result shows that the number of caches that don't waste memory
      increases from 60 to 109.
      
      I did some benchmarking and it looks like the benefits outweigh the
      losses.
      
      Kmalloc: Repeatedly allocate then free test
      
      /Before/
      [    0.286809] 1. Kmalloc: Repeatedly allocate then free test
      [    1.143674] 100000 times kmalloc(32) -> 116 cycles kfree -> 78 cycles
      [    1.441726] 100000 times kmalloc(64) -> 121 cycles kfree -> 80 cycles
      [    1.815734] 100000 times kmalloc(128) -> 168 cycles kfree -> 85 cycles
      [    2.380709] 100000 times kmalloc(256) -> 287 cycles kfree -> 95 cycles
      [    3.101153] 100000 times kmalloc(512) -> 370 cycles kfree -> 117 cycles
      [    3.942432] 100000 times kmalloc(1024) -> 413 cycles kfree -> 156 cycles
      [    5.227396] 100000 times kmalloc(2048) -> 622 cycles kfree -> 248 cycles
      [    7.519793] 100000 times kmalloc(4096) -> 1102 cycles kfree -> 452 cycles
      
      /After/
      [    1.205313] 100000 times kmalloc(32) -> 117 cycles kfree -> 78 cycles
      [    1.510526] 100000 times kmalloc(64) -> 124 cycles kfree -> 81 cycles
      [    1.827382] 100000 times kmalloc(128) -> 130 cycles kfree -> 84 cycles
      [    2.226073] 100000 times kmalloc(256) -> 177 cycles kfree -> 92 cycles
      [    2.814747] 100000 times kmalloc(512) -> 286 cycles kfree -> 112 cycles
      [    3.532952] 100000 times kmalloc(1024) -> 344 cycles kfree -> 141 cycles
      [    4.608777] 100000 times kmalloc(2048) -> 519 cycles kfree -> 210 cycles
      [    6.350105] 100000 times kmalloc(4096) -> 789 cycles kfree -> 391 cycles
      
      In fact, I tested another idea: implementing OBJFREELIST_SLAB with an
      extendable linked array through another freed object.  It can remove
      memory waste completely but it causes more computational overhead in
      the critical lock path, and it seems that the overhead outweighs the
      benefit.  So, this patch doesn't include it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b03a017b
    • J
      mm/slab: factor out debugging initialization in cache_init_objs() · 10b2e9e8
      Joonsoo Kim committed
      cache_init_objs() will be changed in the following patch, and its current
      form doesn't fit that change well.  So, before doing it, this patch
      separates out the debugging initialization.  This causes two passes over
      the objects when debugging is enabled, but that overhead is light
      compared to the debug feature itself, so the effect may not be visible.
      This patch will greatly simplify the changes to cache_init_objs() in the
      following patch.
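
The shape of the split can be sketched as follows. This is a simplified userspace model with hypothetical names (toy_cache, toy_init_objs_debug, TOY_POISON), not the kernel's cache_init_objs(); it only illustrates a debug-only pass running before the common pass.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_POISON 0x6b	/* illustrative poison byte for debug builds */

struct toy_cache {
	size_t obj_size;	/* bytes per object */
	unsigned int num;	/* objects per slab */
	int debug;		/* non-zero when debugging is enabled */
	void (*ctor)(void *);	/* optional object constructor */
};

/* First pass: runs only when debugging is enabled. */
static void toy_init_objs_debug(const struct toy_cache *c, unsigned char *slab)
{
	unsigned int i;

	if (!c->debug)
		return;
	for (i = 0; i < c->num; i++)
		memset(slab + i * c->obj_size, TOY_POISON, c->obj_size);
}

/* Second pass: always runs and stays free of debug details. */
static void toy_init_objs(const struct toy_cache *c, unsigned char *slab,
			  unsigned char *freelist)
{
	unsigned int i;

	assert(c && slab && freelist);
	toy_init_objs_debug(c, slab);
	for (i = 0; i < c->num; i++) {
		if (c->ctor)
			c->ctor(slab + i * c->obj_size);
		freelist[i] = (unsigned char)i;	/* object i starts out free */
	}
}
```

With debugging off, the first function returns immediately and only one loop runs.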
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/slab: factor out slab list fixup code · d8410234
      Joonsoo Kim committed
      The slab list should be fixed up after an object is detached from the
      slab, and this happens in two places that do exactly the same thing.
      They will be changed in the following patch, so, to reduce code
      duplication, this patch factors them out into a common function.
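
A minimal sketch of such a common helper, assuming the fixup means re-filing the slab on a full or partial list depending on how many of its objects are in use. The list primitives and the toy_* names are hypothetical userspace stand-ins, not the kernel's list API or structures.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the spirit of the kernel's. */
struct toy_list {
	struct toy_list *prev, *next;
};

static void toy_list_init(struct toy_list *head)
{
	head->prev = head->next = head;
}

static void toy_list_del(struct toy_list *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

static void toy_list_add(struct toy_list *entry, struct toy_list *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

struct toy_page {
	struct toy_list link;	/* membership in one of the node lists */
	unsigned int active;	/* objects currently allocated */
	unsigned int num;	/* objects the slab can hold */
};

struct toy_node {
	struct toy_list full;
	struct toy_list partial;
};

/* Shared by every caller that has just detached an object from a slab. */
static void toy_fixup_slab_list(struct toy_node *n, struct toy_page *page)
{
	assert(page->active <= page->num);
	toy_list_del(&page->link);
	if (page->active == page->num)
		toy_list_add(&page->link, &n->full);
	else
		toy_list_add(&page->link, &n->partial);
}
```

Each call site then reduces to one call after updating the active count, so a later change to the list policy touches a single function.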
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • J
      mm/slab: make criteria for off slab determination robust and simple · 3217fd9b
      Joonsoo Kim committed
      For a cache to become off slab, there are some constraints that avoid a
      bootstrapping problem and a recursive call.  These can be handled
      differently, by simply checking that the corresponding kmalloc cache is
      ready and is not itself off slab.  This is more robust because static
      size checking can be affected by a cache size change or the architecture
      type, but dynamic checking isn't.
      
      One check, 'freelist_cache->size > cachep->size / 2', is added to check
      the benefit of choosing off slab, because there is now no size constraint
      that ensures enough advantage when selecting off slab.
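
The three dynamic checks can be sketched as one predicate. This is an illustration under assumptions: the struct and function names are hypothetical, a NULL freelist_cache stands in for "the kmalloc cache is not ready yet", and the size condition is read as rejecting off slab when it is true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_cache {
	size_t size;	/* object size of this cache in bytes */
	bool off_slab;	/* true if this cache keeps its freelist off slab */
};

/*
 * Decide whether cachep may keep its freelist off slab, given the
 * kmalloc cache (freelist_cache) that would hold that freelist.
 */
static bool toy_can_use_off_slab(const struct toy_cache *cachep,
				 const struct toy_cache *freelist_cache)
{
	assert(cachep != NULL);

	/* Bootstrapping: the kmalloc cache does not exist yet. */
	if (!freelist_cache)
		return false;

	/* Recursion: the freelist cache must not be off slab itself. */
	if (freelist_cache->off_slab)
		return false;

	/* Benefit: a freelist costing more than half an object is not worth it. */
	if (freelist_cache->size > cachep->size / 2)
		return false;

	return true;
}
```

None of the checks depends on a compile-time size table, which is the robustness the message refers to.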
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>