1. 16 3月, 2016 19 次提交
    • J
      mm/slab: remove object status buffer for DEBUG_SLAB_LEAK · 249247b6
      Joonsoo Kim 提交于
      Now, we don't use object status buffer in any setup. Remove it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      249247b6
    • J
      mm/slab: alternative implementation for DEBUG_SLAB_LEAK · d31676df
      Joonsoo Kim 提交于
      DEBUG_SLAB_LEAK is a debug option.  It's current implementation requires
      status buffer so we need more memory to use it.  And, it cause
      kmem_cache initialization step more complex.
      
      To remove this extra memory usage and to simplify initialization step,
      this patch implement this feature with another way.
      
      When user requests to get slab object owner information, it marks that
      getting information is started.  And then, all free objects in caches
      are flushed to corresponding slab page.  Now, we can distinguish all
      freed object so we can know all allocated objects, too.  After
      collecting slab object owner information on allocated objects, mark is
      checked that there is no free during the processing.  If true, we can be
      sure that our information is correct so information is returned to user.
      
      Although this way is rather complex, it has two important benefits
      mentioned above.  So, I think it is worth changing.
      
      There is one drawback that it takes more time to get slab object owner
      information but it is just a debug option so it doesn't matter at all.
      
      To help review, this patch implements new way only.  Following patch
      will remove useless code.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d31676df
    • J
      mm/slab: clean up DEBUG_PAGEALLOC processing code · 40b44137
      Joonsoo Kim 提交于
      Currently, open code for checking DEBUG_PAGEALLOC cache is spread to
      some sites.  It makes code unreadable and hard to change.
      
      This patch cleans up this code.  The following patch will change the
      criteria for DEBUG_PAGEALLOC cache so this clean-up will help it, too.
      
      [akpm@linux-foundation.org: fix build with CONFIG_DEBUG_PAGEALLOC=n]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40b44137
    • J
      mm/slab: use more appropriate condition check for debug_pagealloc · 40323278
      Joonsoo Kim 提交于
      debug_pagealloc debugging is related to SLAB_POISON flag rather than
      FORCED_DEBUG option, although FORCED_DEBUG option will enable
      SLAB_POISON.  Fix it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      40323278
    • J
      mm/slab: activate debug_pagealloc in SLAB when it is actually enabled · a307ebd4
      Joonsoo Kim 提交于
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a307ebd4
    • J
      mm/slab: remove the checks for slab implementation bug · 260b61dd
      Joonsoo Kim 提交于
      Some of "#if DEBUG" are for reporting slab implementation bug rather
      than user usecase bug.  It's not really needed because slab is stable
      for a quite long time and it makes code too dirty.  This patch remove
      it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      260b61dd
    • J
      mm/slab: remove useless structure define · 6fb92430
      Joonsoo Kim 提交于
      It is obsolete so remove it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6fb92430
    • J
      mm/slab: fix stale code comment · 12c61fe9
      Joonsoo Kim 提交于
      This patchset implements a new freed object management way, that is,
      OBJFREELIST_SLAB.  Purpose of it is to reduce memory overhead in SLAB.
      
      SLAB needs a array to manage freed objects in a slab.  If there is
      leftover after objects are packed into a slab, we can use it as a
      management array, and, in this case, there is no memory waste.  But, in
      the other cases, we need to allocate extra memory for a management array
      or utilize dedicated internal memory in a slab for it.  Both cases
      causes memory waste so it's not good.
      
      With this patchset, freed object itself can be used for a management
      array.  So, memory waste could be reduced.  Detailed idea and numbers
      are described in last patch's commit description.  Please refer it.
      
      In fact, I tested another idea implementing OBJFREELIST_SLAB with
      extendable linked array through another freed object.  It can remove
      memory waste completely but it causes more computational overhead in
      critical lock path and it seems that overhead outweigh benefit.  So,
      this patchset doesn't include it.  I will attach prototype just for a
      reference.
      
      This patch (of 16):
      
      We use freelist_idx_t type for free object management whose size would be
      smaller than size of unsigned int.  Fix it.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12c61fe9
    • J
      mm: fix some spelling · 9f706d68
      Jesper Dangaard Brouer 提交于
      Fix up trivial spelling errors, noticed while reading the code.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f706d68
    • J
      mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer 提交于
      This patch introduce a new API call kfree_bulk() for bulk freeing memory
      objects not bound to a single kmem_cache.
      
      Christoph pointed out that it is possible to implement freeing of
      objects, without knowing the kmem_cache pointer as that information is
      available from the object's page->slab_cache.  Proposing to remove the
      kmem_cache argument from the bulk free API.
      
      Jesper demonstrated that these extra steps per object comes at a
      performance cost.  It is only in the case CONFIG_MEMCG_KMEM is compiled
      in and activated runtime that these steps are done anyhow.  The extra
      cost is most visible for SLAB allocator, because the SLUB allocator does
      the page lookup (virt_to_head_page()) anyhow.
      
      Thus, the conclusion was to keep the kmem_cache free bulk API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily.  Simply by handling if kmem_cache_free_bulk() gets called with a
      kmem_cache NULL pointer.
      
      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase code size even more.
      
      Below benchmarks cost of alloc+free (obj size 256 bytes) on CPU i7-4790K
      @ 4.00GHz, no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
       1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
       2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
       3 - 140 cycles 35.017 ns -  37 cycles 9.326 ns - 33 cycles 8.474 ns
       4 - 137 cycles 34.507 ns -  31 cycles 7.888 ns - 33 cycles 8.300 ns
       8 - 140 cycles 35.069 ns -  25 cycles 6.461 ns - 25 cycles 6.436 ns
       16 - 138 cycles 34.542 ns -  23 cycles 5.945 ns - 22 cycles 5.670 ns
       30 - 136 cycles 34.227 ns -  22 cycles 5.502 ns - 22 cycles 5.587 ns
       32 - 136 cycles 34.253 ns -  21 cycles 5.475 ns - 21 cycles 5.324 ns
       34 - 136 cycles 34.254 ns -  21 cycles 5.448 ns - 20 cycles 5.194 ns
       48 - 136 cycles 34.075 ns -  21 cycles 5.458 ns - 21 cycles 5.367 ns
       64 - 135 cycles 33.994 ns -  21 cycles 5.350 ns - 21 cycles 5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles 5.816 ns - 22 cycles 5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles 5.727 ns - 22 cycles 5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles 6.093 ns - 23 cycles 5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
       1 - 85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
       2 - 81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
       3 - 78 cycles 19.709 ns -  33 cycles 8.354 ns - 32 cycles 8.044 ns
       4 - 77 cycles 19.430 ns -  28 cycles 7.216 ns - 28 cycles 7.003 ns
       8 - 101 cycles 25.288 ns -  23 cycles 5.849 ns - 23 cycles 5.787 ns
       16 - 76 cycles 19.148 ns -  20 cycles 5.162 ns - 20 cycles 5.081 ns
       30 - 76 cycles 19.067 ns -  19 cycles 4.868 ns - 19 cycles 4.821 ns
       32 - 76 cycles 19.052 ns -  19 cycles 4.857 ns - 19 cycles 4.815 ns
       34 - 121 cycles 30.291 ns -  25 cycles 6.333 ns - 25 cycles 6.268 ns
       48 - 108 cycles 27.111 ns -  21 cycles 5.498 ns - 21 cycles 5.458 ns
       64 - 100 cycles 25.164 ns -  20 cycles 5.242 ns - 20 cycles 5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles 6.886 ns - 27 cycles 6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles 7.711 ns - 30 cycles 7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles 9.560 ns - 38 cycles 9.549 ns
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ca257195
    • J
      slab: implement bulk free in SLAB allocator · e6cdb58d
      Jesper Dangaard Brouer 提交于
      This patch implements the free side of bulk API for the SLAB allocator
      kmem_cache_free_bulk(), and concludes the implementation of optimized
      bulk API for SLAB allocator.
      
      Benchmarked[1] cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @
      4.00GHz, with no debug options, no PREEMPT and CONFIG_MEMCG_KMEM=y but
      no active user of kmemcg.
      
      SLAB single alloc+free cost: 87 cycles(tsc) 21.814 ns with this
      optimized config.
      
      bulk- Current fallback          - optimized SLAB bulk
        1 - 102 cycles(tsc) 25.747 ns - 41 cycles(tsc) 10.490 ns - improved 59.8%
        2 -  94 cycles(tsc) 23.546 ns - 26 cycles(tsc)  6.567 ns - improved 72.3%
        3 -  92 cycles(tsc) 23.127 ns - 20 cycles(tsc)  5.244 ns - improved 78.3%
        4 -  90 cycles(tsc) 22.663 ns - 18 cycles(tsc)  4.588 ns - improved 80.0%
        8 -  88 cycles(tsc) 22.242 ns - 14 cycles(tsc)  3.656 ns - improved 84.1%
       16 -  88 cycles(tsc) 22.010 ns - 13 cycles(tsc)  3.480 ns - improved 85.2%
       30 -  89 cycles(tsc) 22.305 ns - 13 cycles(tsc)  3.303 ns - improved 85.4%
       32 -  89 cycles(tsc) 22.277 ns - 13 cycles(tsc)  3.309 ns - improved 85.4%
       34 -  88 cycles(tsc) 22.246 ns - 13 cycles(tsc)  3.294 ns - improved 85.2%
       48 -  88 cycles(tsc) 22.121 ns - 13 cycles(tsc)  3.492 ns - improved 85.2%
       64 -  88 cycles(tsc) 22.052 ns - 13 cycles(tsc)  3.411 ns - improved 85.2%
      128 -  89 cycles(tsc) 22.452 ns - 15 cycles(tsc)  3.841 ns - improved 83.1%
      158 -  89 cycles(tsc) 22.403 ns - 14 cycles(tsc)  3.746 ns - improved 84.3%
      250 -  91 cycles(tsc) 22.775 ns - 16 cycles(tsc)  4.111 ns - improved 82.4%
      
      Notice it is not recommended to do very large bulk operation with
      this bulk API, because local IRQs are disabled in this period.
      
      [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.cSigned-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6cdb58d
    • J
      slab: avoid running debug SLAB code with IRQs disabled for alloc_bulk · 7b0501dd
      Jesper Dangaard Brouer 提交于
      Move the call to cache_alloc_debugcheck_after() outside the IRQ disabled
      section in kmem_cache_alloc_bulk().
      
      When CONFIG_DEBUG_SLAB is disabled the compiler should remove this code.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b0501dd
    • J
      slab: implement bulk alloc in SLAB allocator · 2a777eac
      Jesper Dangaard Brouer 提交于
      This patch implements the alloc side of bulk API for the SLAB allocator.
      
      Further optimization are still possible by changing the call to
      __do_cache_alloc() into something that can return multiple objects.
      This optimization is left for later, given end results already show in
      the area of 80% speedup.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a777eac
    • J
      slab: use slab_post_alloc_hook in SLAB allocator shared with SLUB · d5e3ed66
      Jesper Dangaard Brouer 提交于
      Reviewers notice that the order in slab_post_alloc_hook() of
      kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
      compared to slab.c / SLAB allocator.
      
      Also notice memset now occurs before calling kmemcheck_slab_alloc() and
      kmemleak_alloc_recursive().
      
      I assume this reordering of kmemcheck, kmemleak and memset is okay
      because this is the order they are used by the SLUB allocator.
      
      This patch completes the sharing of alloc_hook's between SLUB and SLAB.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5e3ed66
    • J
      mm: kmemcheck skip object if slab allocation failed · 0142eae3
      Jesper Dangaard Brouer 提交于
      In the SLAB allocator kmemcheck_slab_alloc() is guarded against being
      called in case the object is NULL.  In SLUB allocator this NULL pointer
      invocation can happen, which seems like an oversight.
      
      Move the NULL pointer check into kmemcheck code (kmemcheck_slab_alloc)
      so the check gets moved out of the fastpath, when not compiled with
      CONFIG_KMEMCHECK.
      
      This is a step towards sharing post_alloc_hook between SLUB and SLAB,
      because slab_post_alloc_hook() does not perform this check before
      calling kmemcheck_slab_alloc().
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0142eae3
    • J
      slab: use slab_pre_alloc_hook in SLAB allocator shared with SLUB · 011eceaf
      Jesper Dangaard Brouer 提交于
      Deduplicate code in SLAB allocator functions slab_alloc() and
      slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
      shared between SLUB and SLAB.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      011eceaf
    • J
      mm: fault-inject take over bootstrap kmem_cache check · fab9963a
      Jesper Dangaard Brouer 提交于
      Remove the SLAB specific function slab_should_failslab(), by moving the
      check against fault-injection for the bootstrap slab, into the shared
      function should_failslab() (used by both SLAB and SLUB).
      
      This is a step towards sharing alloc_hook's between SLUB and SLAB.
      
      This bootstrap slab "kmem_cache" is used for allocating struct
      kmem_cache objects to the allocator itself.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fab9963a
    • J
      mm/slab: move SLUB alloc hooks to common mm/slab.h · 11c7aec2
      Jesper Dangaard Brouer 提交于
      First step towards sharing alloc_hook's between SLUB and SLAB
      allocators.  Move the SLUB allocators *_alloc_hook to the common
      mm/slab.h for internal slab definitions.
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11c7aec2
    • J
      slub: clean up code for kmem cgroup support to kmem_cache_free_bulk · 376bf125
      Jesper Dangaard Brouer 提交于
      This change is primarily an attempt to make it easier to realize the
      optimizations the compiler performs in-case CONFIG_MEMCG_KMEM is not
      enabled.
      
      Performance wise, even when CONFIG_MEMCG_KMEM is compiled in, the
      overhead is zero.  This is because, as long as no process have enabled
      kmem cgroups accounting, the assignment is replaced by asm-NOP
      operations.  This is possible because memcg_kmem_enabled() uses a
      static_key_false() construct.
      
      It also helps readability as it avoid accessing the p[] array like:
      p[size - 1] which "expose" that the array is processed backwards inside
      helper function build_detached_freelist().
      
      Lastly this also makes the code more robust, in error case like passing
      NULL pointers in the array.  Which were previously handled before commit
      03374518 ("slub: add missing kmem cgroup support to
      kmem_cache_free_bulk").
      
      Fixes: 03374518 ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
      Signed-off-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      376bf125
  2. 12 3月, 2016 1 次提交
  3. 10 3月, 2016 5 次提交
    • J
      mm/hugetlb: use EOPNOTSUPP in hugetlb sysctl handlers · 86613628
      Jan Stancek 提交于
      Replace ENOTSUPP with EOPNOTSUPP.  If hugepages are not supported, this
      value is propagated to userspace.  EOPNOTSUPP is part of uapi and is
      widely supported by libc libraries.
      
      It gives nicer message to user, rather than:
      
        # cat /proc/sys/vm/nr_hugepages
        cat: /proc/sys/vm/nr_hugepages: Unknown error 524
      
      And also LTP's proc01 test was failing because this ret code (524)
      was unexpected:
      
        proc01      1  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_hugepages: errno=???(524): Unknown error 524
        proc01      2  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_hugepages_mempolicy: errno=???(524): Unknown error 524
        proc01      3  TFAIL  :  proc01.c:396: read failed: /proc/sys/vm/nr_overcommit_hugepages: errno=???(524): Unknown error 524
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86613628
    • K
      mm, thp: fix migration of PTE-mapped transparent huge pages · 0a2e280b
      Kirill A. Shutemov 提交于
      We don't have native support of THP migration, so we have to split huge
      page into small pages in order to migrate it to different node.  This
      includes PTE-mapped huge pages.
      
      I made mistake in refcounting patchset: we don't actually split
      PTE-mapped huge page in queue_pages_pte_range(), if we step on head
      page.
      
      The result is that the head page is queued for migration, but none of
      tail pages: putting head page on queue takes pin on the page and any
      subsequent attempts of split_huge_pages() would fail and we skip queuing
      tail pages.
      
      unmap_and_move_huge_page() will eventually split the huge pages, but
      only one of 512 pages would get migrated.
      
      Let's fix the situation.
      
      Fixes: 248db92d ("migrate_pages: try to split pages on queuing")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a2e280b
    • M
      kasan: add functions to clear stack poison · e3ae1163
      Mark Rutland 提交于
      Functions which the compiler has instrumented for ASAN place poison on
      the stack shadow upon entry and remove this poison prior to returning.
      
      In some cases (e.g. hotplug and idle), CPUs may exit the kernel a
      number of levels deep in C code.  If there are any instrumented
      functions on this critical path, these will leave portions of the idle
      thread stack shadow poisoned.
      
      If a CPU returns to the kernel via a different path (e.g. a cold
      entry), then depending on stack frame layout subsequent calls to
      instrumented functions may use regions of the stack with stale poison,
      resulting in (spurious) KASAN splats to the console.
      
      Contemporary GCCs always add stack shadow poisoning when ASAN is
      enabled, even when asked to not instrument a function [1], so we can't
      simply annotate functions on the critical path to avoid poisoning.
      
      Instead, this series explicitly removes any stale poison before it can
      be hit.  In the common hotplug case we clear the entire stack shadow in
      common code, before a CPU is brought online.
      
      On architectures which perform a cold return as part of cpu idle may
      retain an architecture-specific amount of stack contents.  To retain the
      poison for this retained context, the arch code must call the core KASAN
      code, passing a "watermark" stack pointer value beyond which shadow will
      be cleared.  Architectures which don't perform a cold return as part of
      idle do not need any additional code.
      
      This patch (of 3):
      
      Functions which the compiler has instrumented for KASAN place poison on
      the stack shadow upon entry and remove this poision prior to returning.
      
      In some cases (e.g.  hotplug and idle), CPUs may exit the kernel a number
      of levels deep in C code.  If there are any instrumented functions on this
      critical path, these will leave portions of the stack shadow poisoned.
      
      If a CPU returns to the kernel via a different path (e.g.  a cold entry),
      then depending on stack frame layout subsequent calls to instrumented
      functions may use regions of the stack with stale poison, resulting in
      (spurious) KASAN splats to the console.
      
      To avoid this, we must clear stale poison from the stack prior to
      instrumented functions being called.  This patch adds functions to the
      KASAN core for removing poison from (portions of) a task's stack.  These
      will be used by subsequent patches to avoid problems with hotplug and
      idle.
      Signed-off-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3ae1163
    • H
      mm: __delete_from_page_cache show Bad page if mapped · 06b241f3
      Hugh Dickins 提交于
      Commit e1534ae9 ("mm: differentiate page_mapped() from
      page_mapcount() for compound pages") changed the famous
      BUG_ON(page_mapped(page)) in __delete_from_page_cache() to
      VM_BUG_ON_PAGE(page_mapped(page)): which gives us more info when
      CONFIG_DEBUG_VM=y, but nothing at all when not.
      
      Although it has not usually been very helpul, being hit long after the
      error in question, we do need to know if it actually happens on users'
      systems; but reinstating a crash there is likely to be opposed :)
      
      In the non-debug case, pr_alert("BUG: Bad page cache") plus dump_page(),
      dump_stack(), add_taint() - I don't really believe LOCKDEP_NOW_UNRELIABLE,
      but that seems to be the standard procedure now.  Move that, or the
      VM_BUG_ON_PAGE(), up before the deletion from tree: so that the
      unNULLified page->mapping gives a little more information.
      
      If the inode is being evicted (rather than truncated), it won't have any
      vmas left, so it's safe(ish) to assume that the raised mapcount is
      erroneous, and we can discount it from page_count to avoid leaking the
      page (I'm less worried by leaking the occasional 4kB, than losing a
      potential 2MB page with each 4kB page leaked).
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06b241f3
    • G
      mm/hugetlb: hugetlb_no_page: rate-limit warning message · 910154d5
      Geoffrey Thomas 提交于
      The warning message "killed due to inadequate hugepage pool" simply
      indicates that SIGBUS was sent, not that the process was forcibly killed.
      If the process has a signal handler installed does not fix the problem,
      this message can rapidly spam the kernel log.
      
      On my amd64 dev machine that does not have hugepages configured, I can
      reproduce the repeated warnings easily by setting vm.nr_hugepages=2 (i.e.,
      4 megabytes of huge pages) and running something that sets a signal
      handler and forks, like
      
        #include <sys/mman.h>
        #include <signal.h>
        #include <stdlib.h>
        #include <unistd.h>
      
        sig_atomic_t counter = 10;
        void handler(int signal)
        {
            if (counter-- == 0)
               exit(0);
        }
      
        int main(void)
        {
            int status;
            char *addr = mmap(NULL, 4 * 1048576, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
            if (addr == MAP_FAILED) {perror("mmap"); return 1;}
            *addr = 'x';
            switch (fork()) {
               case -1:
                  perror("fork"); return 1;
               case 0:
                  signal(SIGBUS, handler);
                  *addr = 'x';
                  break;
               default:
                  *addr = 'x';
                  wait(&status);
                  if (WIFSIGNALED(status)) {
                     psignal(WTERMSIG(status), "child");
                  }
                  break;
            }
        }
      Signed-off-by: NGeoffrey Thomas <geofft@ldpreload.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      910154d5
  4. 28 2月, 2016 3 次提交
    • R
      dax: move writeback calls into the filesystems · 7f6d5b52
      Ross Zwisler 提交于
      Previously calls to dax_writeback_mapping_range() for all DAX filesystems
      (ext2, ext4 & xfs) were centralized in filemap_write_and_wait_range().
      
      dax_writeback_mapping_range() needs a struct block_device, and it used
      to get that from inode->i_sb->s_bdev.  This is correct for normal inodes
      mounted on ext2, ext4 and XFS filesystems, but is incorrect for DAX raw
      block devices and for XFS real-time files.
      
      Instead, call dax_writeback_mapping_range() directly from the filesystem
      ->writepages function so that it can supply us with a valid block
      device.  This also fixes DAX code to properly flush caches in response
      to sync(2).
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Cc: Al Viro <viro@ftp.linux.org.uk>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f6d5b52
    • M
      mm: numa: quickly fail allocations for NUMA balancing on full nodes · 8479eba7
      Mel Gorman 提交于
      Commit 4167e9b2 ("mm: remove GFP_THISNODE") removed the GFP_THISNODE
      flag combination due to confusing semantics.  It noted that
      alloc_misplaced_dst_page() was one such user after changes made by
      commit e97ca8e5 ("mm: fix GFP_THISNODE callers and clarify").
      
      Unfortunately when GFP_THISNODE was removed, users of
      alloc_misplaced_dst_page() started waking kswapd and entering direct
      reclaim because the wrong GFP flags are cleared.  The consequence is
      that workloads that used to fit into memory now get reclaimed which is
      addressed by this patch.
      
      The problem can be demonstrated with "mutilate" that exercises memcached
      which is software dedicated to memory object caching.  The configuration
      uses 80% of memory and is run 3 times for varying numbers of clients.
      The results on a 4-socket NUMA box are
      
      mutilate
                                  4.4.0                 4.4.0
                                vanilla           numaswap-v1
      Hmean    1      8394.71 (  0.00%)     8395.32 (  0.01%)
      Hmean    4     30024.62 (  0.00%)    34513.54 ( 14.95%)
      Hmean    7     32821.08 (  0.00%)    70542.96 (114.93%)
      Hmean    12    55229.67 (  0.00%)    93866.34 ( 69.96%)
      Hmean    21    39438.96 (  0.00%)    85749.21 (117.42%)
      Hmean    30    37796.10 (  0.00%)    50231.49 ( 32.90%)
      Hmean    47    18070.91 (  0.00%)    38530.13 (113.22%)
      
      The metric is queries/second with the more the better.  The results are
      way outside of the noise and the reason for the improvement is obvious
      from some of the vmstats
      
                                       4.4.0       4.4.0
                                     vanillanumaswap-v1r1
      Minor Faults                1929399272  2146148218
      Major Faults                  19746529        3567
      Swap Ins                      57307366        9913
      Swap Outs                     50623229       17094
      Allocation stalls                35909         443
      DMA allocs                           0           0
      DMA32 allocs                  72976349   170567396
      Normal allocs               5306640898  5310651252
      Movable allocs                       0           0
      Direct pages scanned         404130893      799577
      Kswapd pages scanned         160230174           0
      Kswapd pages reclaimed        55928786           0
      Direct pages reclaimed         1843936       41921
      Page writes file                  2391           0
      Page writes anon              50623229       17094
      
      The vanilla kernel is swapping like crazy with large amounts of direct
      reclaim and kswapd activity.  The figures are aggregate but it's known
      that the bad activity is throughout the entire test.
      
      Note that simple streaming anon/file memory consumers also see this
      problem but it's not as obvious.  In those cases, kswapd is awake when
      it should not be.
      
      As there are at least two reclaim-related bugs out there, it's worth
      spelling out the user-visible impact.  This patch only addresses bugs
      related to excessive reclaim on NUMA hardware when the working set is
      larger than a NUMA node.  There is a bug related to high kswapd CPU
      usage but the reports are against laptops and other UMA hardware and is
      not addressed by this patch.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8479eba7
    • A
      mm: thp: fix SMP race condition between THP page fault and MADV_DONTNEED · ad33bb04
      Andrea Arcangeli 提交于
      pmd_trans_unstable()/pmd_none_or_trans_huge_or_clear_bad() were
      introduced to locklessy (but atomically) detect when a pmd is a regular
      (stable) pmd or when the pmd is unstable and can infinitely transition
      from pmd_none() and pmd_trans_huge() from under us, while only holding
      the mmap_sem for reading (for writing not).
      
      While holding the mmap_sem only for reading, MADV_DONTNEED can run from
      under us and so before we can assume the pmd to be a regular stable pmd
      we need to compare it against pmd_none() and pmd_trans_huge() in an
      atomic way, with pmd_trans_unstable().  The old pmd_trans_huge() left a
      tiny window for a race.
      
      Useful applications are unlikely to notice the difference as doing
      MADV_DONTNEED concurrently with a page fault would lead to undefined
      behavior.
      
      [akpm@linux-foundation.org: tidy up comment grammar/layout]
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad33bb04
  5. 25 2月, 2016 1 次提交
  6. 19 2月, 2016 4 次提交
    • D
      mm: slab: free kmem_cache_node after destroy sysfs file · 52b4b950
      Dmitry Safonov 提交于
      When slub_debug alloc_calls_show is enabled we will try to track
      location and user of slab object on each online node, kmem_cache_node
      structure and cpu_cache/cpu_slub shouldn't be freed till there is the
      last reference to sysfs file.
      
      This fixes the following panic:
      
         BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
         IP:  list_locations+0x169/0x4e0
         PGD 257304067 PUD 438456067 PMD 0
         Oops: 0000 [#1] SMP
         CPU: 3 PID: 973074 Comm: cat ve: 0 Not tainted 3.10.0-229.7.2.ovz.9.30-00007-japdoll-dirty #2 9.30
         Hardware name: DEPO Computers To Be Filled By O.E.M./H67DE3, BIOS L1.60c 07/14/2011
         task: ffff88042a5dc5b0 ti: ffff88037f8d8000 task.ti: ffff88037f8d8000
         RIP: list_locations+0x169/0x4e0
         Call Trace:
           alloc_calls_show+0x1d/0x30
           slab_attr_show+0x1b/0x30
           sysfs_read_file+0x9a/0x1a0
           vfs_read+0x9c/0x170
           SyS_read+0x58/0xb0
           system_call_fastpath+0x16/0x1b
         Code: 5e 07 12 00 b9 00 04 00 00 3d 00 04 00 00 0f 4f c1 3d 00 04 00 00 89 45 b0 0f 84 c3 00 00 00 48 63 45 b0 49 8b 9c c4 f8 00 00 00 <48> 8b 43 20 48 85 c0 74 b6 48 89 df e8 46 37 44 00 48 8b 53 10
         CR2: 0000000000000020
      
      Separated __kmem_cache_release from __kmem_cache_shutdown which now
      called on slab_kmem_cache_release (after the last reference to sysfs
      file object has dropped).
      
      Reintroduced locking in free_partial as sysfs file might access cache's
      partial list after shutdowning - partial revert of the commit
      69cb8e6b ("slub: free slabs without holding locks").  Zap
      __remove_partial and use remove_partial (w/o underscores) as
      free_partial now takes list_lock which s partial revert for commit
      1e4dd946 ("slub: do not assert not having lock in removing freed
      partial")
      Signed-off-by: NDmitry Safonov <dsafonov@virtuozzo.com>
      Suggested-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52b4b950
    • V
      mm/hugetlb.c: fix incorrect proc nr_hugepages value · f8b74815
      Vaishali Thakkar 提交于
      Currently incorrect default hugepage pool size is reported by proc
      nr_hugepages when number of pages for the default huge page size is
      specified twice.
      
      When multiple huge page sizes are supported, /proc/sys/vm/nr_hugepages
      indicates the current number of pre-allocated huge pages of the default
      size.  Basically /proc/sys/vm/nr_hugepages displays default_hstate->
      max_huge_pages and after boot time pre-allocation, max_huge_pages should
      equal the number of pre-allocated pages (nr_hugepages).
      
      Test case:
      
      Note that this is specific to x86 architecture.
      
      Boot the kernel with command line option 'default_hugepagesz=1G
      hugepages=X hugepagesz=2M hugepages=Y hugepagesz=1G hugepages=Z'.  After
      boot, 'cat /proc/sys/vm/nr_hugepages' and 'sysctl -a | grep hugepages'
      returns the value X.  However, dmesg output shows that Z huge pages were
      pre-allocated.
      
      So, the root cause of the problem here is that the global variable
      default_hstate_max_huge_pages is set if a default huge page size is
      specified (directly or indirectly) on the command line.  After the command
      line processing in hugetlb_init, if default_hstate_max_huge_pages is set,
      the value is assigned to default_hstae.max_huge_pages.  However,
      default_hstate.max_huge_pages may have already been set based on the
      number of pre-allocated huge pages of default_hstate size.
      
      The solution to this problem is if hstate->max_huge_pages is already set
      then it should not set as a result of global max_huge_pages value.
      Basically if the value of the variable hugepages is set multiple times on
      a command line for a specific supported hugepagesize then proc layer
      should consider the last specified value.
      Signed-off-by: NVaishali Thakkar <vaishali.thakkar@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8b74815
    • K
      mm: fix regression in remap_file_pages() emulation · 48f7df32
      Kirill A. Shutemov 提交于
      Grazvydas Ignotas has reported a regression in remap_file_pages()
      emulation.
      
      Testcase:
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	#define SIZE    (4096 * 3)
      
      	int main(int argc, char **argv)
      	{
      		unsigned long *p;
      		long i;
      
      		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return -1;
      		}
      
      		for (i = 0; i < SIZE / 4096; i++)
      			p[i * 4096 / sizeof(*p)] = i;
      
      		if (remap_file_pages(p, 4096, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		assert(p[0] == 1);
      
      		munmap(p, SIZE);
      
      		return 0;
      	}
      
      The second remap_file_pages() fails with -EINVAL.
      
      The reason is that remap_file_pages() emulation assumes that the target
      vma covers whole area we want to over map.  That assumption is broken by
      first remap_file_pages() call: it split the area into two vma.
      
      The solution is to check next adjacent vmas, if they map the same file
      with the same flags.
      
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NGrazvydas Ignotas <notasas@gmail.com>
      Tested-by: NGrazvydas Ignotas <notasas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48f7df32
    • K
      thp, dax: do not try to withdraw pgtable from non-anon VMA · 69a8ec2d
      Kirill A. Shutemov 提交于
      DAX doesn't deposit pgtables when it maps huge pages: nothing to
      withdraw. It can lead to crash.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69a8ec2d
  7. 15 2月, 2016 1 次提交
    • A
      powerpc/mm: Fix Multi hit ERAT cause by recent THP update · c777e2a8
      Aneesh Kumar K.V 提交于
      With ppc64 we use the deposited pgtable_t to store the hash pte slot
      information. We should not withdraw the deposited pgtable_t without
      marking the pmd none. This ensure that low level hash fault handling
      will skip this huge pte and we will handle them at upper levels.
      
      Recent change to pmd splitting changed the above in order to handle the
      race between pmd split and exit_mmap. The race is explained below.
      
      Consider following race:
      
      		CPU0				CPU1
      shrink_page_list()
        add_to_swap()
          split_huge_page_to_list()
            __split_huge_pmd_locked()
              pmdp_huge_clear_flush_notify()
      	// pmd_none() == true
      					exit_mmap()
      					  unmap_vmas()
      					    zap_pmd_range()
      					      // no action on pmd since pmd_none() == true
      	pmd_populate()
      
      As result the THP will not be freed. The leak is detected by check_mm():
      
      	BUG: Bad rss-counter state mm:ffff880058d2e580 idx:1 val:512
      
      The above required us to not mark pmd none during a pmd split.
      
      The fix for ppc is to clear the huge pte of _PAGE_USER, so that low
      level fault handling code skip this pte. At higher level we do take ptl
      lock. That should serialze us against the pmd split. Once the lock is
      acquired we do check the pmd again using pmd_same. That should always
      return false for us and hence we should retry the access. We do the
      pmd_same check in all case after taking plt with
      THP (do_huge_pmd_wp_page, do_huge_pmd_numa_page and
      huge_pmd_set_accessed)
      
      Also make sure we wait for irq disable section in other cpus to finish
      before flipping a huge pte entry with a regular pmd entry. Code paths
      like find_linux_pte_or_hugepte depend on irq disable to get
      a stable pte_t pointer. A parallel thp split need to make sure we
      don't convert a pmd pte to a regular pmd entry without waiting for the
      irq disable section to finish.
      
      Fixes: eef1b3ba ("thp: implement split_huge_pmd()")
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      c777e2a8
  8. 12 2月, 2016 5 次提交
  9. 06 2月, 2016 1 次提交