1. 06 11月, 2015 4 次提交
  2. 09 9月, 2015 1 次提交
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka 提交于
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
  3. 05 9月, 2015 9 次提交
    • J
      mm/slub: don't wait for high-order page allocation · 45eb00cd
      Joonsoo Kim 提交于
      Description is almost copied from commit fb05e7a8 ("net: don't wait
      for order-3 page allocation").
      
      I saw excessive direct memory reclaim/compaction triggered by slub.  This
      causes performance issues and add latency.  Slub uses high-order
      allocation to reduce internal fragmentation and management overhead.  But,
      direct memory reclaim/compaction has high overhead and the benefit of
      high-order allocation can't compensate the overhead of both work.
      
      This patch makes auxiliary high-order allocation atomic.  If there is no
      memory pressure and memory isn't fragmented, the alloction will still
      success, so we don't sacrifice high-order allocation's benefit here.  If
      the atomic allocation fails, direct memory reclaim/compaction will not be
      triggered, allocation fallback to low-order immediately, hence the direct
      memory reclaim/compaction overhead is avoided.  In the allocation failure
      case, kswapd is waken up and trying to make high-order freepages, so
      allocation could success next time.
      
      Following is the test to measure effect of this patch.
      
      System: QEMU, CPU 8, 512 MB
      Mem: 25% memory is allocated at random position to make fragmentation.
       Memory-hogger occupies 150 MB memory.
      Workload: hackbench -g 20 -l 1000
      
      Average result by 10 runs (Base va Patched)
      
      elapsed_time(s): 4.3468 vs 2.9838
      compact_stall: 461.7 vs 73.6
      pgmigrate_success: 28315.9 vs 7256.1
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45eb00cd
    • K
      mm/slub: fix slab double-free in case of duplicate sysfs filename · 80da026a
      Konstantin Khlebnikov 提交于
      sysfs_slab_add() shouldn't call kobject_put at error path: this puts last
      reference of kmem-cache kobject and frees it.  Kmem cache will be freed
      second time at error path in kmem_cache_create().
      
      For example this happens when slub debug was enabled in runtime and
      somebody creates new kmem cache:
      
      # echo 1 | tee /sys/kernel/slab/*/sanity_checks
      # modprobe configfs
      
      "configfs_dir_cache" cannot be merged because existing slab have debug and
      cannot create new slab because unique name ":t-0000096" already taken.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      80da026a
    • T
      mm/slub: move slab initialization into irq enabled region · 588f8ba9
      Thomas Gleixner 提交于
      Initializing a new slab can introduce rather large latencies because most
      of the initialization runs always with interrupts disabled.
      
      There is no point in doing so.  The newly allocated slab is not visible
      yet, so there is no reason to protect it against concurrent alloc/free.
      
      Move the expensive parts of the initialization into allocate_slab(), so
      for all allocations with GFP_WAIT set, interrupts are enabled.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      588f8ba9
    • J
      slub: add support for kmem_cache_debug in bulk calls · 3eed034d
      Jesper Dangaard Brouer 提交于
      Per request of Joonsoo Kim adding kmem debug support.
      
      I've tested that when debugging is disabled, then there is almost no
      performance impact as this code basically gets removed by the compiler.
      
      Need some guidance in enabling and testing this.
      
      bulk- PREVIOUS                  - THIS-PATCH
        1 -  43 cycles(tsc) 10.811 ns -  44 cycles(tsc) 11.236 ns  improved  -2.3%
        2 -  27 cycles(tsc)  6.867 ns -  28 cycles(tsc)  7.019 ns  improved  -3.7%
        3 -  21 cycles(tsc)  5.496 ns -  22 cycles(tsc)  5.526 ns  improved  -4.8%
        4 -  24 cycles(tsc)  6.038 ns -  19 cycles(tsc)  4.786 ns  improved  20.8%
        8 -  17 cycles(tsc)  4.280 ns -  18 cycles(tsc)  4.572 ns  improved  -5.9%
       16 -  17 cycles(tsc)  4.483 ns -  18 cycles(tsc)  4.658 ns  improved  -5.9%
       30 -  18 cycles(tsc)  4.531 ns -  18 cycles(tsc)  4.568 ns  improved   0.0%
       32 -  58 cycles(tsc) 14.586 ns -  65 cycles(tsc) 16.454 ns  improved -12.1%
       34 -  53 cycles(tsc) 13.391 ns -  63 cycles(tsc) 15.932 ns  improved -18.9%
       48 -  65 cycles(tsc) 16.268 ns -  50 cycles(tsc) 12.506 ns  improved  23.1%
       64 -  53 cycles(tsc) 13.440 ns -  63 cycles(tsc) 15.929 ns  improved -18.9%
      128 -  79 cycles(tsc) 19.899 ns -  86 cycles(tsc) 21.583 ns  improved  -8.9%
      158 -  90 cycles(tsc) 22.732 ns -  90 cycles(tsc) 22.552 ns  improved   0.0%
      250 -  95 cycles(tsc) 23.916 ns -  98 cycles(tsc) 24.589 ns  improved  -3.2%
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3eed034d
    • J
      slub: initial bulk free implementation · fbd02630
      Jesper Dangaard Brouer 提交于
      This implements SLUB specific kmem_cache_free_bulk().  SLUB allocator now
      both have bulk alloc and free implemented.
      
      Choose to reenable local IRQs while calling slowpath __slab_free().  In
      worst case, where all objects hit slowpath call, the performance should
      still be faster than fallback function __kmem_cache_free_bulk(), because
      local_irq_{disable+enable} is very fast (7-cycles), while the fallback
      invokes this_cpu_cmpxchg() which is slightly slower (9-cycles).
      Nitpicking, this should be faster for N>=4, due to the entry cost of
      local_irq_{disable+enable}.
      
      Do notice that the save+restore variant is very expensive, this is key to
      why this optimization works.
      
      CPU: i7-4790K CPU @ 4.00GHz
       * local_irq_{disable,enable}:  7 cycles(tsc) - 1.821 ns
       * local_irq_{save,restore}  : 37 cycles(tsc) - 9.443 ns
      
      Measurements on CPU CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 43 cycles(tsc) 10.834 ns
      
      Bulk- fallback                   - this-patch
        1 -  58 cycles(tsc) 14.542 ns  -  43 cycles(tsc) 10.811 ns  improved 25.9%
        2 -  50 cycles(tsc) 12.659 ns  -  27 cycles(tsc)  6.867 ns  improved 46.0%
        3 -  48 cycles(tsc) 12.168 ns  -  21 cycles(tsc)  5.496 ns  improved 56.2%
        4 -  47 cycles(tsc) 11.987 ns  -  24 cycles(tsc)  6.038 ns  improved 48.9%
        8 -  46 cycles(tsc) 11.518 ns  -  17 cycles(tsc)  4.280 ns  improved 63.0%
       16 -  45 cycles(tsc) 11.366 ns  -  17 cycles(tsc)  4.483 ns  improved 62.2%
       30 -  45 cycles(tsc) 11.433 ns  -  18 cycles(tsc)  4.531 ns  improved 60.0%
       32 -  75 cycles(tsc) 18.983 ns  -  58 cycles(tsc) 14.586 ns  improved 22.7%
       34 -  71 cycles(tsc) 17.940 ns  -  53 cycles(tsc) 13.391 ns  improved 25.4%
       48 -  80 cycles(tsc) 20.077 ns  -  65 cycles(tsc) 16.268 ns  improved 18.8%
       64 -  71 cycles(tsc) 17.799 ns  -  53 cycles(tsc) 13.440 ns  improved 25.4%
      128 -  91 cycles(tsc) 22.980 ns  -  79 cycles(tsc) 19.899 ns  improved 13.2%
      158 - 100 cycles(tsc) 25.241 ns  -  90 cycles(tsc) 22.732 ns  improved 10.0%
      250 - 102 cycles(tsc) 25.583 ns  -  95 cycles(tsc) 23.916 ns  improved  6.9%
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fbd02630
    • J
      slub: improve bulk alloc strategy · ebe909e0
      Jesper Dangaard Brouer 提交于
      Call slowpath __slab_alloc() from within the bulk loop, as the side-effect
      of this call likely repopulates c->freelist.
      
      Choose to reenable local IRQs while calling slowpath.
      
      Saving some optimizations for later.  E.g.  it is possible to extract
      parts of __slab_alloc() and avoid the unnecessary and expensive (37
      cycles) local_irq_{save,restore}.  For now, be happy calling
      __slab_alloc() this lower icache impact of this func and I don't have to
      worry about correctness.
      
      Measurements on CPU CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
      
      Bulk- fallback                   - this-patch
        1 -  58 cycles(tsc) 14.516 ns  -  49 cycles(tsc) 12.459 ns  improved 15.5%
        2 -  51 cycles(tsc) 12.930 ns  -  38 cycles(tsc)  9.605 ns  improved 25.5%
        3 -  49 cycles(tsc) 12.274 ns  -  34 cycles(tsc)  8.525 ns  improved 30.6%
        4 -  48 cycles(tsc) 12.058 ns  -  32 cycles(tsc)  8.036 ns  improved 33.3%
        8 -  46 cycles(tsc) 11.609 ns  -  31 cycles(tsc)  7.756 ns  improved 32.6%
       16 -  45 cycles(tsc) 11.451 ns  -  32 cycles(tsc)  8.148 ns  improved 28.9%
       30 -  79 cycles(tsc) 19.865 ns  -  68 cycles(tsc) 17.164 ns  improved 13.9%
       32 -  76 cycles(tsc) 19.212 ns  -  66 cycles(tsc) 16.584 ns  improved 13.2%
       34 -  74 cycles(tsc) 18.600 ns  -  63 cycles(tsc) 15.954 ns  improved 14.9%
       48 -  88 cycles(tsc) 22.092 ns  -  77 cycles(tsc) 19.373 ns  improved 12.5%
       64 -  80 cycles(tsc) 20.043 ns  -  68 cycles(tsc) 17.188 ns  improved 15.0%
      128 -  99 cycles(tsc) 24.818 ns  -  89 cycles(tsc) 22.404 ns  improved 10.1%
      158 -  99 cycles(tsc) 24.977 ns  -  92 cycles(tsc) 23.089 ns  improved  7.1%
      250 - 106 cycles(tsc) 26.552 ns  -  99 cycles(tsc) 24.785 ns  improved  6.6%
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebe909e0
    • J
      slub bulk alloc: extract objects from the per cpu slab · 994eb764
      Jesper Dangaard Brouer 提交于
      First piece: acceleration of retrieval of per cpu objects
      
      If we are allocating lots of objects then it is advantageous to disable
      interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
      faster.
      
      Note that we cannot do the fast operation if debugging is enabled, because
      we would have to add extra code to do all the debugging checks.  And it
      would not be fast anyway.
      
      Note also that the requirement of having interrupts disabled avoids having
      to do processor flag operations.
      
      Allocate as many objects as possible in the fast way and then fall back to
      the generic implementation for the rest of the objects.
      
      Measurements on CPU CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns
      
      Bulk- fallback                   - this-patch
        1 -  57 cycles(tsc) 14.432 ns  -  48 cycles(tsc) 12.155 ns  improved 15.8%
        2 -  50 cycles(tsc) 12.746 ns  -  37 cycles(tsc)  9.390 ns  improved 26.0%
        3 -  48 cycles(tsc) 12.180 ns  -  33 cycles(tsc)  8.417 ns  improved 31.2%
        4 -  48 cycles(tsc) 12.015 ns  -  32 cycles(tsc)  8.045 ns  improved 33.3%
        8 -  46 cycles(tsc) 11.526 ns  -  30 cycles(tsc)  7.699 ns  improved 34.8%
       16 -  45 cycles(tsc) 11.418 ns  -  32 cycles(tsc)  8.205 ns  improved 28.9%
       30 -  80 cycles(tsc) 20.246 ns  -  73 cycles(tsc) 18.328 ns  improved  8.8%
       32 -  79 cycles(tsc) 19.946 ns  -  72 cycles(tsc) 18.208 ns  improved  8.9%
       34 -  78 cycles(tsc) 19.659 ns  -  71 cycles(tsc) 17.987 ns  improved  9.0%
       48 -  86 cycles(tsc) 21.516 ns  -  82 cycles(tsc) 20.566 ns  improved  4.7%
       64 -  93 cycles(tsc) 23.423 ns  -  89 cycles(tsc) 22.480 ns  improved  4.3%
      128 - 100 cycles(tsc) 25.170 ns  -  99 cycles(tsc) 24.871 ns  improved  1.0%
      158 - 102 cycles(tsc) 25.549 ns  - 101 cycles(tsc) 25.375 ns  improved  1.0%
      250 - 101 cycles(tsc) 25.344 ns  - 100 cycles(tsc) 25.182 ns  improved  1.0%
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      994eb764
    • C
      slab: infrastructure for bulk object allocation and freeing · 484748f0
      Christoph Lameter 提交于
      Add the basic infrastructure for alloc/free operations on pointer arrays.
      It includes a generic function in the common slab code that is used in
      this infrastructure patch to create the unoptimized functionality for slab
      bulk operations.
      
      Allocators can then provide optimized allocation functions for situations
      in which large numbers of objects are needed.  These optimization may
      avoid taking locks repeatedly and bypass metadata creation if all objects
      in slab pages can be used to provide the objects required.
      
      Allocators can extend the skeletons provided and add their own code to the
      bulk alloc and free functions.  They can keep the generic allocation and
      freeing and just fall back to those if optimizations would not work (like
      for example when debugging is on).
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      484748f0
    • J
      slub: fix spelling succedd to succeed · 2ae44005
      Jesper Dangaard Brouer 提交于
      With this patchset the SLUB allocator now has both bulk alloc and free
      implemented.
      
      This patchset mostly optimizes the "fastpath" where objects are available
      on the per CPU fastpath page.  This mostly amortize the less-heavy
      none-locked cmpxchg_double used on fastpath.
      
      The "fallback" bulking (e.g __kmem_cache_free_bulk) provides a good basis
      for comparison.  Measurements[1] of the fallback functions
      __kmem_cache_{free,alloc}_bulk have been copied from slab_common.c and
      forced "noinline" to force a function call like slab_common.c.
      
      Measurements on CPU CPU i7-4790K @ 4.00GHz
      Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.601 ns
      
      Measurements last-patch with disabled debugging:
      
      Bulk- fallback                   - this-patch
        1 -  57 cycles(tsc) 14.448 ns  -  44 cycles(tsc) 11.236 ns  improved 22.8%
        2 -  51 cycles(tsc) 12.768 ns  -  28 cycles(tsc)  7.019 ns  improved 45.1%
        3 -  48 cycles(tsc) 12.232 ns  -  22 cycles(tsc)  5.526 ns  improved 54.2%
        4 -  48 cycles(tsc) 12.025 ns  -  19 cycles(tsc)  4.786 ns  improved 60.4%
        8 -  46 cycles(tsc) 11.558 ns  -  18 cycles(tsc)  4.572 ns  improved 60.9%
       16 -  45 cycles(tsc) 11.458 ns  -  18 cycles(tsc)  4.658 ns  improved 60.0%
       30 -  45 cycles(tsc) 11.499 ns  -  18 cycles(tsc)  4.568 ns  improved 60.0%
       32 -  79 cycles(tsc) 19.917 ns  -  65 cycles(tsc) 16.454 ns  improved 17.7%
       34 -  78 cycles(tsc) 19.655 ns  -  63 cycles(tsc) 15.932 ns  improved 19.2%
       48 -  68 cycles(tsc) 17.049 ns  -  50 cycles(tsc) 12.506 ns  improved 26.5%
       64 -  80 cycles(tsc) 20.009 ns  -  63 cycles(tsc) 15.929 ns  improved 21.3%
      128 -  94 cycles(tsc) 23.749 ns  -  86 cycles(tsc) 21.583 ns  improved  8.5%
      158 -  97 cycles(tsc) 24.299 ns  -  90 cycles(tsc) 22.552 ns  improved  7.2%
      250 - 102 cycles(tsc) 25.681 ns  -  98 cycles(tsc) 24.589 ns  improved  3.9%
      
      Benchmarking shows impressive improvements in the "fastpath" with a small
      number of objects in the working set.  Once the working set increases,
      resulting in activating the "slowpath" (that contains the heavier locked
      cmpxchg_double) the improvement decreases.
      
      I'm currently working on also optimizing the "slowpath" (as network stack
      use-case hits this), but this patchset should provide a good foundation
      for further improvements.  Rest of my patch queue in this area needs some
      more work, but preliminary results are good.  I'm attending Netfilter
      Workshop[2] next week, and I'll hopefully return working on further
      improvements in this area.
      
      This patch (of 6):
      
      s/succedd/succeed/
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ae44005
  4. 22 8月, 2015 1 次提交
    • M
      mm: make page pfmemalloc check more robust · 2f064f34
      Michal Hocko 提交于
      Commit c48a11c7 ("netvm: propagate page->pfmemalloc to skb") added
      checks for page->pfmemalloc to __skb_fill_page_desc():
      
              if (page->pfmemalloc && !page->mapping)
                      skb->pfmemalloc = true;
      
      It assumes page->mapping == NULL implies that page->pfmemalloc can be
      trusted.  However, __delete_from_page_cache() can set set page->mapping
      to NULL and leave page->index value alone.  Due to being in union, a
      non-zero page->index will be interpreted as true page->pfmemalloc.
      
      So the assumption is invalid if the networking code can see such a page.
      And it seems it can.  We have encountered this with a NFS over loopback
      setup when such a page is attached to a new skbuf.  There is no copying
      going on in this case so the page confuses __skb_fill_page_desc which
      interprets the index as pfmemalloc flag and the network stack drops
      packets that have been allocated using the reserves unless they are to
      be queued on sockets handling the swapping which is the case here and
      that leads to hangs when the nfs client waits for a response from the
      server which has been dropped and thus never arrive.
      
      The struct page is already heavily packed so rather than finding another
      hole to put it in, let's do a trick instead.  We can reuse the index
      again but define it to an impossible value (-1UL).  This is the page
      index so it should never see the value that large.  Replace all direct
      users of page->pfmemalloc by page_is_pfmemalloc which will hide this
      nastiness from unspoiled eyes.
      
      The information will get lost if somebody wants to use page->index
      obviously but that was the case before and the original code expected
      that the information should be persisted somewhere else if that is
      really needed (e.g.  what SLAB and SLUB do).
      
      [akpm@linux-foundation.org: fix blooper in slub]
      Fixes: c48a11c7 ("netvm: propagate page->pfmemalloc to skb")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Debugged-by: NVlastimil Babka <vbabka@suse.com>
      Debugged-by: NJiri Bohac <jbohac@suse.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[3.6+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2f064f34
  5. 25 6月, 2015 1 次提交
    • D
      slab: correct size_index table before replacing the bootstrap kmem_cache_node · 34cc6990
      Daniel Sanders 提交于
      This patch moves the initialization of the size_index table slightly
      earlier so that the first few kmem_cache_node's can be safely allocated
      when KMALLOC_MIN_SIZE is large.
      
      There are currently two ways to generate indices into kmalloc_caches (via
      kmalloc_index() and via the size_index table in slab_common.c) and on some
      arches (possibly only MIPS) they potentially disagree with each other
      until create_kmalloc_caches() has been called.  It seems that the
      intention is that the size_index table is a fast equivalent to
      kmalloc_index() and that create_kmalloc_caches() patches the table to
      return the correct value for the cases where kmalloc_index()'s
      if-statements apply.
      
      The failing sequence was:
      * kmalloc_caches contains NULL elements
      * kmem_cache_init initialises the element that 'struct
        kmem_cache_node' will be allocated to. For 32-bit Mips, this is a
        56-byte struct and kmalloc_index returns KMALLOC_SHIFT_LOW (7).
      * init_list is called which calls kmalloc_node to allocate a 'struct
        kmem_cache_node'.
      * kmalloc_slab selects the kmem_caches element using
        size_index[size_index_elem(size)]. For MIPS, size is 56, and the
        expression returns 6.
      * This element of kmalloc_caches is NULL and allocation fails.
      * If it had not already failed, it would have called
        create_kmalloc_caches() at this point which would have changed
        size_index[size_index_elem(size)] to 7.
      
      I don't believe the bug to be LLVM specific but GCC doesn't normally
      encounter the problem.  I haven't been able to identify exactly what GCC
      is doing better (probably inlining) but it seems that GCC is managing to
      optimize to the point that it eliminates the problematic allocations.
      This theory is supported by the fact that GCC can be made to fail in the
      same way by changing inline, __inline, __inline__, and __always_inline in
      include/linux/compiler-gcc.h such that they don't actually inline things.
      Signed-off-by: NDaniel Sanders <daniel.sanders@imgtec.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34cc6990
  6. 16 4月, 2015 1 次提交
  7. 15 4月, 2015 2 次提交
  8. 26 3月, 2015 1 次提交
    • M
      mm/slub: fix lockups on PREEMPT && !SMP kernels · 859b7a0e
      Mark Rutland 提交于
      Commit 9aabf810 ("mm/slub: optimize alloc/free fastpath by removing
      preemption on/off") introduced an occasional hang for kernels built with
      CONFIG_PREEMPT && !CONFIG_SMP.
      
      The problem is the following loop the patch introduced to
      slab_alloc_node and slab_free:
      
          do {
              tid = this_cpu_read(s->cpu_slab->tid);
              c = raw_cpu_ptr(s->cpu_slab);
          } while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
      
      GCC 4.9 has been observed to hoist the load of c and c->tid above the
      loop for !SMP kernels (as in this case raw_cpu_ptr(x) is compile-time
      constant and does not force a reload).  On arm64 the generated assembly
      looks like:
      
               ldr     x4, [x0,#8]
        loop:
               ldr     x1, [x0,#8]
               cmp     x1, x4
               b.ne    loop
      
      If the thread is preempted between the load of c->tid (into x1) and tid
      (into x4), and an allocation or free occurs in another thread (bumping
      the cpu_slab's tid), the thread will be stuck in the loop until
      s->cpu_slab->tid wraps, which may be forever in the absence of
      allocations/frees on the same CPU.
      
      This patch changes the loop condition to access c->tid with READ_ONCE.
      This ensures that the value is reloaded even when the compiler would
      otherwise assume it could cache the value, and also ensures that the
      load will not be torn.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      859b7a0e
  9. 14 2月, 2015 4 次提交
    • A
      mm: slub: add kernel address sanitizer support for slub allocator · 0316bec2
      Andrey Ryabinin 提交于
      With this patch kasan will be able to catch bugs in memory allocated by
      slub.  Initially all objects in newly allocated slab page, marked as
      redzone.  Later, when allocation of slub object happens, requested by
      caller number of bytes marked as accessible, and the rest of the object
      (including slub's metadata) marked as redzone (inaccessible).
      
      We also mark object as accessible if ksize was called for this object.
      There is some places in kernel where ksize function is called to inquire
      size of really allocated area.  Such callers could validly access whole
      allocated memory, so it should be marked as accessible.
      
      Code in slub.c and slab_common.c files could validly access to object's
      metadata, so instrumentation for this files are disabled.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Signed-off-by: NDmitry Chernenkov <dmitryc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0316bec2
    • A
      mm: slub: introduce metadata_access_enable()/metadata_access_disable() · a79316c6
      Andrey Ryabinin 提交于
      It's ok for slub to access memory that marked by kasan as inaccessible
      (object's metadata).  Kasan shouldn't print report in that case because
      these accesses are valid.  Disabling instrumentation of slub.c code is not
      enough to achieve this because slub passes pointer to object's metadata
      into external functions like memchr_inv().
      
      We don't want to disable instrumentation for memchr_inv() because this is
      quite generic function, and we don't want to miss bugs.
      
      metadata_access_enable/metadata_access_disable used to tell KASan where
      accesses to metadata starts/end, so we could temporarily disable KASan
      reports.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a79316c6
    • A
      mm: slub: share object_err function · 75c66def
      Andrey Ryabinin 提交于
      Remove static and add function declarations to linux/slub_def.h so it
      could be used by kernel address sanitizer.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75c66def
    • T
      slub: use %*pb[l] to print bitmaps including cpumasks and nodemasks · 5024c1d7
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * This is an equivalent conversion but the whole function should be
        converted to use scnprinf famiily of functions rather than
        performing custom output length predictions in multiple places.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5024c1d7
  10. 13 2月, 2015 5 次提交
    • V
      slub: make dead caches discard free slabs immediately · d6e0b7fa
      Vladimir Davydov 提交于
      To speed up further allocations SLUB may store empty slabs in per cpu/node
      partial lists instead of freeing them immediately.  This prevents per
      memcg caches destruction, because kmem caches created for a memory cgroup
      are only destroyed after the last page charged to the cgroup is freed.
      
      To fix this issue, this patch resurrects approach first proposed in [1].
      It forbids SLUB to cache empty slabs after the memory cgroup that the
      cache belongs to was destroyed.  It is achieved by setting kmem_cache's
      cpu_partial and min_partial constants to 0 and tuning put_cpu_partial() so
      that it would drop frozen empty slabs immediately if cpu_partial = 0.
      
      The runtime overhead is minimal.  From all the hot functions, we only
      touch relatively cold put_cpu_partial(): we make it call
      unfreeze_partials() after freezing a slab that belongs to an offline
      memory cgroup.  Since slab freezing exists to avoid moving slabs from/to a
      partial list on free/alloc, and there can't be allocations from dead
      caches, it shouldn't cause any overhead.  We do have to disable preemption
      for put_cpu_partial() to achieve that though.
      
      The original patch was accepted well and even merged to the mm tree.
      However, I decided to withdraw it due to changes happening to the memcg
      core at that time.  I had an idea of introducing per-memcg shrinkers for
      kmem caches, but now, as memcg has finally settled down, I do not see it
      as an option, because SLUB shrinker would be too costly to call since SLUB
      does not keep free slabs on a separate list.  Besides, we currently do not
      even call per-memcg shrinkers for offline memcgs.  Overall, it would
      introduce much more complexity to both SLUB and memcg than this small
      patch.
      
      Regarding to SLAB, there's no problem with it, because it shrinks
      per-cpu/node caches periodically.  Thanks to list_lru reparenting, we no
      longer keep entries for offline cgroups in per-memcg arrays (such as
      memcg_cache_params->memcg_caches), so we do not have to bother if a
      per-memcg cache will be shrunk a bit later than it could be.
      
      [1] http://thread.gmane.org/gmane.linux.kernel.mm/118649/focus=118650Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6e0b7fa
    • V
      slub: fix kmem_cache_shrink return value · ce3712d7
      Vladimir Davydov 提交于
      It is supposed to return 0 if the cache has no remaining objects and 1
      otherwise, while currently it always returns 0.  Fix it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ce3712d7
    • V
      slub: never fail to shrink cache · 832f37f5
      Vladimir Davydov 提交于
      SLUB's version of __kmem_cache_shrink() not only removes empty slabs, but
      also tries to rearrange the partial lists to place slabs filled up most to
      the head to cope with fragmentation.  To achieve that, it allocates a
      temporary array of lists used to sort slabs by the number of objects in
      use.  If the allocation fails, the whole procedure is aborted.
      
      This is unacceptable for the kernel memory accounting extension of the
      memory cgroup, where we want to make sure that kmem_cache_shrink()
      successfully discarded empty slabs.  Although the allocation failure is
      utterly unlikely with the current page allocator implementation, which
      retries GFP_KERNEL allocations of order <= 2 infinitely, it is better not
      to rely on that.
      
      This patch therefore makes __kmem_cache_shrink() allocate the array on
      stack instead of calling kmalloc, which may fail.  The array size is
      chosen to be equal to 32, because most SLUB caches store not more than 32
      objects per slab page.  Slab pages with <= 32 free objects are sorted
      using the array by the number of objects in use and promoted to the head
      of the partial list, while slab pages with > 32 free objects are left in
      the end of the list without any ordering imposed on them.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      832f37f5
    • V
      slab: link memcg caches of the same kind into a list · 426589f5
      Vladimir Davydov 提交于
      Sometimes, we need to iterate over all memcg copies of a particular root
      kmem cache.  Currently, we use memcg_cache_params->memcg_caches array for
      that, because it contains all existing memcg caches.
      
      However, it's a bad practice to keep all caches, including those that
      belong to offline cgroups, in this array, because it will be growing
      beyond any bounds then.  I'm going to wipe away dead caches from it to
      save space.  To still be able to perform iterations over all memcg caches
      of the same kind, let us link them into a list.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      426589f5
    • V
      slab: embed memcg_cache_params to kmem_cache · f7ce3190
      Vladimir Davydov 提交于
      Currently, kmem_cache stores a pointer to struct memcg_cache_params
      instead of embedding it.  The rationale is to save memory when kmem
      accounting is disabled.  However, the memcg_cache_params has shrivelled
      drastically since it was first introduced:
      
      * Initially:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct kmem_cache *memcg_caches[0];
      		struct {
      			struct mem_cgroup *memcg;
      			struct list_head list;
      			struct kmem_cache *root_cache;
      			bool dead;
      			atomic_t nr_pages;
      			struct work_struct destroy;
      		};
      	};
      };
      
      * Now:
      
      struct memcg_cache_params {
      	bool is_root_cache;
      	union {
      		struct {
      			struct rcu_head rcu_head;
      			struct kmem_cache *memcg_caches[0];
      		};
      		struct {
      			struct mem_cgroup *memcg;
      			struct kmem_cache *root_cache;
      		};
      	};
      };
      
      So the memory saving does not seem to be a clear win anymore.
      
      OTOH, keeping a pointer to memcg_cache_params struct instead of embedding
      it results in touching one more cache line on kmem alloc/free hot paths.
      Besides, it makes linking kmem caches in a list chained by a field of
      struct memcg_cache_params really painful due to a level of indirection,
      while I want to make them linked in the following patch.  That said, let
      us embed it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f7ce3190
  11. 11 2月, 2015 2 次提交
    • K
    • J
      mm/slub: optimize alloc/free fastpath by removing preemption on/off · 9aabf810
      Joonsoo Kim 提交于
      We had to insert a preempt enable/disable in the fastpath a while ago in
      order to guarantee that tid and kmem_cache_cpu are retrieved on the same
      cpu.  It is the problem only for CONFIG_PREEMPT in which scheduler can
      move the process to other cpu during retrieving data.
      
      Now, I reach the solution to remove preempt enable/disable in the
      fastpath.  If tid is matched with kmem_cache_cpu's tid after tid and
      kmem_cache_cpu are retrieved by separate this_cpu operation, it means
      that they are retrieved on the same cpu.  If not matched, we just have
      to retry it.
      
      With this guarantee, preemption enable/disable isn't need at all even if
      CONFIG_PREEMPT, so this patch removes it.
      
      I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free in
      CONFIG_PREEMPT.  (14.821 ns -> 14.049 ns)
      
      Below is the result of Christoph's slab_test reported by Jesper Dangaard
      Brouer.
      
      * Before
      
       Single thread testing
       =====================
       1. Kmalloc: Repeatedly allocate then free test
       10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
       10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
       10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
       10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
       10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
       10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
       10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
       10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
       10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
       10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
       10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
       10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
       2. Kmalloc: alloc/free test
       10000 times kmalloc(8)/kfree -> 68 cycles
       10000 times kmalloc(16)/kfree -> 68 cycles
       10000 times kmalloc(32)/kfree -> 69 cycles
       10000 times kmalloc(64)/kfree -> 68 cycles
       10000 times kmalloc(128)/kfree -> 68 cycles
       10000 times kmalloc(256)/kfree -> 68 cycles
       10000 times kmalloc(512)/kfree -> 74 cycles
       10000 times kmalloc(1024)/kfree -> 75 cycles
       10000 times kmalloc(2048)/kfree -> 74 cycles
       10000 times kmalloc(4096)/kfree -> 74 cycles
       10000 times kmalloc(8192)/kfree -> 75 cycles
       10000 times kmalloc(16384)/kfree -> 510 cycles
      
      * After
      
       Single thread testing
       =====================
       1. Kmalloc: Repeatedly allocate then free test
       10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
       10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
       10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
       10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
       10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
       10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
       10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
       10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
       10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
       10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
       10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
       10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
       2. Kmalloc: alloc/free test
       10000 times kmalloc(8)/kfree -> 65 cycles
       10000 times kmalloc(16)/kfree -> 66 cycles
       10000 times kmalloc(32)/kfree -> 65 cycles
       10000 times kmalloc(64)/kfree -> 66 cycles
       10000 times kmalloc(128)/kfree -> 66 cycles
       10000 times kmalloc(256)/kfree -> 71 cycles
       10000 times kmalloc(512)/kfree -> 72 cycles
       10000 times kmalloc(1024)/kfree -> 71 cycles
       10000 times kmalloc(2048)/kfree -> 71 cycles
       10000 times kmalloc(4096)/kfree -> 71 cycles
       10000 times kmalloc(8192)/kfree -> 65 cycles
       10000 times kmalloc(16384)/kfree -> 511 cycles
      
      Most of the results are better than before.
      
      Note that this change slightly worses performance in !CONFIG_PREEMPT,
      roughly 0.3%.  Implementing each case separately would help performance,
      but, since it's so marginal, I didn't do that.  This would help
      maintanance since we have same code for all cases.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9aabf810
  12. 14 12月, 2014 2 次提交
    • V
      slub: fix cpuset check in get_any_partial · dee2f8aa
      Vladimir Davydov 提交于
      If we fail to allocate from the current node's stock, we look for free
      objects on other nodes before calling the page allocator (see
      get_any_partial).  While checking other nodes we respect cpuset
      constraints by calling cpuset_zone_allowed.  We enforce hardwall check.
      As a result, we will fallback to the page allocator even if there are some
      pages cached on other nodes, but the current cpuset doesn't have them set.
       However, the page allocator uses softwall check for kernel allocations,
      so it may allocate from one of the other nodes in this case.
      
      Therefore we should use softwall cpuset check in get_any_partial to
      conform with the cpuset check in the page allocator.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dee2f8aa
    • V
      memcg: fix possible use-after-free in memcg_kmem_get_cache() · 8135be5a
      Vladimir Davydov 提交于
      Suppose task @t that belongs to a memory cgroup @memcg is going to
      allocate an object from a kmem cache @c.  The copy of @c corresponding to
      @memcg, @mc, is empty.  Then if kmem_cache_alloc races with the memory
      cgroup destruction we can access the memory cgroup's copy of the cache
      after it was destroyed:
      
      CPU0				CPU1
      ----				----
      [ current=@t
        @mc->memcg_params->nr_pages=0 ]
      
      kmem_cache_alloc(@c):
        call memcg_kmem_get_cache(@c);
        proceed to allocation from @mc:
          alloc a page for @mc:
            ...
      
      				move @t from @memcg
      				destroy @memcg:
      				  mem_cgroup_css_offline(@memcg):
      				    memcg_unregister_all_caches(@memcg):
      				      kmem_cache_destroy(@mc)
      
          add page to @mc
      
      We could fix this issue by taking a reference to a per-memcg cache, but
      that would require adding a per-cpu reference counter to per-memcg caches,
      which would look cumbersome.
      
      Instead, let's take a reference to a memory cgroup, which already has a
      per-cpu reference counter, in the beginning of kmem_cache_alloc to be
      dropped in the end, and move per memcg caches destruction from css offline
      to css free.  As a side effect, per-memcg caches will be destroyed not one
      by one, but all at once when the last page accounted to the memory cgroup
      is freed.  This doesn't sound as a high price for code readability though.
      
      Note, this patch does add some overhead to the kmem_cache_alloc hot path,
      but it is pretty negligible - it's just a function call plus a per cpu
      counter decrement, which is comparable to what we already have in
      memcg_kmem_get_cache.  Besides, it's only relevant if there are memory
      cgroups with kmem accounting enabled.  I don't think we can find a way to
      handle this race w/o it, because alloc_page called from kmem_cache_alloc
      may sleep so we can't flush all pending kmallocs w/o reference counting.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8135be5a
  13. 11 12月, 2014 3 次提交
  14. 27 10月, 2014 1 次提交
    • V
      cpuset: simplify cpuset_node_allowed API · 344736f2
      Vladimir Davydov 提交于
      Current cpuset API for checking if a zone/node is allowed to allocate
      from looks rather awkward. We have hardwall and softwall versions of
      cpuset_node_allowed with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which it needs to take the callback lock to
      do.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api"). Before, we had the only version with
      the __GFP_HARDWALL flag determining its behavior. The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NZefan Li <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      344736f2
  15. 10 10月, 2014 3 次提交