1. 16 Mar, 2016 29 commits
    • mm/slab: remove the checks for slab implementation bug · 260b61dd
      Joonsoo Kim committed
      Some of the "#if DEBUG" blocks report slab implementation bugs rather
      than user usage bugs.  They are not really needed because slab has been
      stable for quite a long time, and they make the code too dirty.  This
      patch removes them.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: remove useless structure define · 6fb92430
      Joonsoo Kim committed
      It is obsolete so remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: fix stale code comment · 12c61fe9
      Joonsoo Kim committed
      This patchset implements a new way of managing freed objects,
      OBJFREELIST_SLAB.  Its purpose is to reduce memory overhead in SLAB.

      SLAB needs an array to manage freed objects in a slab.  If there is
      leftover space after objects are packed into a slab, we can use it as
      the management array, and in this case there is no memory waste.  In
      the other cases, we need to allocate extra memory for the management
      array or set aside dedicated internal memory in the slab for it.  Both
      cases cause memory waste, which is not good.

      With this patchset, a freed object itself can be used for the management
      array, so memory waste can be reduced.  The detailed idea and numbers
      are described in the last patch's commit description.  Please refer to it.
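
The idea of hosting the management array inside a freed object can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel implementation: `toy_slab`, `OBJ_SIZE`, `NUM_OBJS` and the `toy_*` helpers are hypothetical names, and the one-byte `freelist_idx_t` is assumed here for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* Index type for the freelist; one byte is assumed to be enough here. */
typedef unsigned char freelist_idx_t;

#define OBJ_SIZE 64u   /* hypothetical object size in bytes */
#define NUM_OBJS 16u   /* hypothetical number of objects per slab */

struct toy_slab {
    unsigned char mem[NUM_OBJS * OBJ_SIZE]; /* object storage only */
    freelist_idx_t *freelist;               /* lives inside a free object */
    unsigned int active;                    /* objects currently allocated */
};

static void *toy_index_to_obj(struct toy_slab *s, unsigned int idx)
{
    return s->mem + (size_t)idx * OBJ_SIZE;
}

static void toy_slab_init(struct toy_slab *s)
{
    unsigned int i;

    /* The index array must fit inside a single object. */
    assert(NUM_OBJS * sizeof(freelist_idx_t) <= OBJ_SIZE);

    /* Host the freelist in the last object.  Its own index is the last
     * entry, so it is handed out only after every other object. */
    s->freelist = toy_index_to_obj(s, NUM_OBJS - 1);
    for (i = 0; i < NUM_OBJS; i++)
        s->freelist[i] = (freelist_idx_t)i;
    s->active = 0;
}

static void *toy_slab_alloc(struct toy_slab *s)
{
    unsigned int idx;

    if (s->active == NUM_OBJS)
        return NULL;            /* slab is full */
    idx = s->freelist[s->active++];
    if (s->active == NUM_OBJS)
        s->freelist = NULL;     /* the host object was just handed out */
    return toy_index_to_obj(s, idx);
}

static void toy_slab_free(struct toy_slab *s, void *obj)
{
    unsigned int idx =
        (unsigned int)(((unsigned char *)obj - s->mem) / OBJ_SIZE);

    if (s->active == NUM_OBJS)
        s->freelist = obj;      /* first freed object hosts the freelist */
    s->freelist[--s->active] = (freelist_idx_t)idx;
}
```

In this sketch no memory outside the objects themselves is needed for the index array, which is the saving the text describes; the price is that the hosting object must be the last one allocated and the first one freed must take over as host.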
      
      In fact, I tested another idea: implementing OBJFREELIST_SLAB with an
      extendable linked array through another freed object.  It can remove
      memory waste completely, but it causes more computational overhead in
      the critical lock path, and that overhead seems to outweigh the benefit.
      So this patchset doesn't include it.  I will attach the prototype just
      for reference.

      This patch (of 16):

      We use the freelist_idx_t type for free object management, and its size
      can be smaller than that of unsigned int.  Fix the stale comment.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix some spelling · 9f706d68
      Jesper Dangaard Brouer committed
      Fix up trivial spelling errors, noticed while reading the code.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: new API kfree_bulk() for SLAB+SLUB allocators · ca257195
      Jesper Dangaard Brouer committed
      This patch introduces a new API call, kfree_bulk(), for bulk freeing of
      memory objects not bound to a single kmem_cache.

      Christoph pointed out that it is possible to free objects without
      knowing the kmem_cache pointer, as that information is available from
      the object's page->slab_cache, and proposed removing the kmem_cache
      argument from the bulk free API.

      Jesper demonstrated that these extra steps per object come at a
      performance cost.  Only when CONFIG_MEMCG_KMEM is compiled in and
      activated at runtime are these steps done anyhow.  The extra cost is
      most visible for the SLAB allocator, because the SLUB allocator does
      the page lookup (virt_to_head_page()) anyhow.

      Thus, the conclusion was to keep the kmem_cache free bulk API with a
      kmem_cache pointer, but we can still implement a kfree_bulk() API fairly
      easily, simply by handling the case where kmem_cache_free_bulk() gets
      called with a NULL kmem_cache pointer.

      This does increase the code size a bit, but implementing a separate
      kfree_bulk() call would likely increase code size even more.
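
The "NULL cache pointer means look up the owner per object" design can be sketched in userspace C. Everything here is a hypothetical stand-in: `toy_cache` plays the role of a kmem_cache, and the `toy_hdr` header placed before each object plays the role of the page->slab_cache lookup; none of this is kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for a cache of fixed-size objects (hypothetical). */
struct toy_cache {
    size_t obj_size;
    unsigned long nr_freed;
};

/* Header stored in front of each object; it stands in for the
 * per-object owner lookup described in the text. */
struct toy_hdr {
    struct toy_cache *owner;
};

static void *toy_cache_alloc(struct toy_cache *c)
{
    struct toy_hdr *h = malloc(sizeof(*h) + c->obj_size);

    if (!h)
        return NULL;
    h->owner = c;
    return h + 1;
}

static struct toy_cache *toy_obj_to_cache(void *obj)
{
    return ((struct toy_hdr *)obj - 1)->owner;
}

/* Bulk free.  A NULL cache means "derive the owner from each object". */
static void toy_cache_free_bulk(struct toy_cache *c, size_t n, void **p)
{
    size_t i;

    for (i = 0; i < n; i++) {
        struct toy_cache *owner = c ? c : toy_obj_to_cache(p[i]);

        /* A caller-supplied cache must match the real owner. */
        assert(owner == toy_obj_to_cache(p[i]));
        owner->nr_freed++;
        free((struct toy_hdr *)p[i] - 1);
    }
}

/* The cache-less variant is a thin wrapper over the same function. */
static void toy_free_bulk(size_t n, void **p)
{
    toy_cache_free_bulk(NULL, n, p);
}
```

Passing an array that mixes objects from two caches to `toy_free_bulk()` frees each object against its own cache. The per-object lookup is the extra step whose cost the benchmarks in this commit message measure.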
      
      The benchmarks below measure the cost of alloc+free (obj size 256 bytes)
      on CPU i7-4790K @ 4.00GHz, with no PREEMPT and CONFIG_MEMCG_KMEM=y.
      
      Code size increase for SLAB:
      
       add/remove: 0/0 grow/shrink: 1/0 up/down: 74/0 (74)
       function                                     old     new   delta
       kmem_cache_free_bulk                         660     734     +74
      
      SLAB fastpath: 87 cycles(tsc) 21.814 ns
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 - 103 cycles 25.878 ns -  41 cycles 10.498 ns - 81 cycles 20.312 ns
         2 -  94 cycles 23.673 ns -  26 cycles  6.682 ns - 42 cycles 10.649 ns
         3 -  92 cycles 23.181 ns -  21 cycles  5.325 ns - 39 cycles 9.950 ns
         4 -  90 cycles 22.727 ns -  18 cycles  4.673 ns - 26 cycles 6.693 ns
         8 -  89 cycles 22.270 ns -  14 cycles  3.664 ns - 23 cycles 5.835 ns
        16 -  88 cycles 22.038 ns -  14 cycles  3.503 ns - 22 cycles 5.543 ns
        30 -  89 cycles 22.284 ns -  13 cycles  3.310 ns - 20 cycles 5.197 ns
        32 -  88 cycles 22.249 ns -  13 cycles  3.420 ns - 20 cycles 5.166 ns
        34 -  88 cycles 22.224 ns -  14 cycles  3.643 ns - 20 cycles 5.170 ns
        48 -  88 cycles 22.088 ns -  14 cycles  3.507 ns - 20 cycles 5.203 ns
        64 -  88 cycles 22.063 ns -  13 cycles  3.428 ns - 20 cycles 5.152 ns
       128 -  89 cycles 22.483 ns -  15 cycles  3.891 ns - 23 cycles 5.885 ns
       158 -  89 cycles 22.381 ns -  15 cycles  3.779 ns - 22 cycles 5.548 ns
       250 -  91 cycles 22.798 ns -  16 cycles  4.152 ns - 23 cycles 5.967 ns
      
      SLAB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 130 cycles(tsc) 32.684 ns (step:0)
         1 - 148 cycles 37.220 ns -  66 cycles 16.622 ns - 66 cycles 16.583 ns
         2 - 141 cycles 35.510 ns -  51 cycles 12.820 ns - 58 cycles 14.625 ns
         3 - 140 cycles 35.017 ns -  37 cycles  9.326 ns - 33 cycles  8.474 ns
         4 - 137 cycles 34.507 ns -  31 cycles  7.888 ns - 33 cycles  8.300 ns
         8 - 140 cycles 35.069 ns -  25 cycles  6.461 ns - 25 cycles  6.436 ns
        16 - 138 cycles 34.542 ns -  23 cycles  5.945 ns - 22 cycles  5.670 ns
        30 - 136 cycles 34.227 ns -  22 cycles  5.502 ns - 22 cycles  5.587 ns
        32 - 136 cycles 34.253 ns -  21 cycles  5.475 ns - 21 cycles  5.324 ns
        34 - 136 cycles 34.254 ns -  21 cycles  5.448 ns - 20 cycles  5.194 ns
        48 - 136 cycles 34.075 ns -  21 cycles  5.458 ns - 21 cycles  5.367 ns
        64 - 135 cycles 33.994 ns -  21 cycles  5.350 ns - 21 cycles  5.259 ns
       128 - 137 cycles 34.446 ns -  23 cycles  5.816 ns - 22 cycles  5.688 ns
       158 - 137 cycles 34.379 ns -  22 cycles  5.727 ns - 22 cycles  5.602 ns
       250 - 138 cycles 34.755 ns -  24 cycles  6.093 ns - 23 cycles  5.986 ns
      
      Code size increase for SLUB:
       function                                     old     new   delta
       kmem_cache_free_bulk                         717     799     +82
      
      SLUB benchmark:
       SLUB fastpath: 46 cycles(tsc) 11.691 ns (step:0)
        sz - fallback             - kmem_cache_free_bulk - kfree_bulk
         1 -  61 cycles 15.486 ns -  53 cycles 13.364 ns - 57 cycles 14.464 ns
         2 -  54 cycles 13.703 ns -  32 cycles  8.110 ns - 33 cycles 8.482 ns
         3 -  53 cycles 13.272 ns -  25 cycles  6.362 ns - 27 cycles 6.947 ns
         4 -  51 cycles 12.994 ns -  24 cycles  6.087 ns - 24 cycles 6.078 ns
         8 -  50 cycles 12.576 ns -  21 cycles  5.354 ns - 22 cycles 5.513 ns
        16 -  49 cycles 12.368 ns -  20 cycles  5.054 ns - 20 cycles 5.042 ns
        30 -  49 cycles 12.273 ns -  18 cycles  4.748 ns - 19 cycles 4.758 ns
        32 -  49 cycles 12.401 ns -  19 cycles  4.821 ns - 19 cycles 4.810 ns
        34 -  98 cycles 24.519 ns -  24 cycles  6.154 ns - 24 cycles 6.157 ns
        48 -  83 cycles 20.833 ns -  21 cycles  5.446 ns - 21 cycles 5.429 ns
        64 -  75 cycles 18.891 ns -  20 cycles  5.247 ns - 20 cycles 5.238 ns
       128 -  93 cycles 23.271 ns -  27 cycles  6.856 ns - 27 cycles 6.823 ns
       158 - 102 cycles 25.581 ns -  30 cycles  7.714 ns - 30 cycles 7.695 ns
       250 - 107 cycles 26.917 ns -  38 cycles  9.514 ns - 38 cycles 9.506 ns
      
      SLUB when enabling MEMCG_KMEM runtime:
       - kmemcg fastpath: 71 cycles(tsc) 17.897 ns (step:0)
         1 -  85 cycles 21.484 ns -  78 cycles 19.569 ns - 75 cycles 18.938 ns
         2 -  81 cycles 20.363 ns -  45 cycles 11.258 ns - 44 cycles 11.076 ns
         3 -  78 cycles 19.709 ns -  33 cycles  8.354 ns - 32 cycles  8.044 ns
         4 -  77 cycles 19.430 ns -  28 cycles  7.216 ns - 28 cycles  7.003 ns
         8 - 101 cycles 25.288 ns -  23 cycles  5.849 ns - 23 cycles  5.787 ns
        16 -  76 cycles 19.148 ns -  20 cycles  5.162 ns - 20 cycles  5.081 ns
        30 -  76 cycles 19.067 ns -  19 cycles  4.868 ns - 19 cycles  4.821 ns
        32 -  76 cycles 19.052 ns -  19 cycles  4.857 ns - 19 cycles  4.815 ns
        34 - 121 cycles 30.291 ns -  25 cycles  6.333 ns - 25 cycles  6.268 ns
        48 - 108 cycles 27.111 ns -  21 cycles  5.498 ns - 21 cycles  5.458 ns
        64 - 100 cycles 25.164 ns -  20 cycles  5.242 ns - 20 cycles  5.229 ns
       128 - 155 cycles 38.976 ns -  27 cycles  6.886 ns - 27 cycles  6.892 ns
       158 - 132 cycles 33.034 ns -  30 cycles  7.711 ns - 30 cycles  7.728 ns
       250 - 130 cycles 32.612 ns -  38 cycles  9.560 ns - 38 cycles  9.549 ns
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: implement bulk free in SLAB allocator · e6cdb58d
      Jesper Dangaard Brouer committed
      This patch implements the free side of bulk API for the SLAB allocator
      kmem_cache_free_bulk(), and concludes the implementation of optimized
      bulk API for SLAB allocator.
      
      Benchmarked[1] cost of alloc+free (obj size 256 bytes) on CPU i7-4790K @
      4.00GHz, with no debug options, no PREEMPT and CONFIG_MEMCG_KMEM=y but
      no active user of kmemcg.
      
      SLAB single alloc+free cost: 87 cycles(tsc) 21.814 ns with this
      optimized config.
      
      bulk- Current fallback          - optimized SLAB bulk
        1 - 102 cycles(tsc) 25.747 ns - 41 cycles(tsc) 10.490 ns - improved 59.8%
        2 -  94 cycles(tsc) 23.546 ns - 26 cycles(tsc)  6.567 ns - improved 72.3%
        3 -  92 cycles(tsc) 23.127 ns - 20 cycles(tsc)  5.244 ns - improved 78.3%
        4 -  90 cycles(tsc) 22.663 ns - 18 cycles(tsc)  4.588 ns - improved 80.0%
        8 -  88 cycles(tsc) 22.242 ns - 14 cycles(tsc)  3.656 ns - improved 84.1%
       16 -  88 cycles(tsc) 22.010 ns - 13 cycles(tsc)  3.480 ns - improved 85.2%
       30 -  89 cycles(tsc) 22.305 ns - 13 cycles(tsc)  3.303 ns - improved 85.4%
       32 -  89 cycles(tsc) 22.277 ns - 13 cycles(tsc)  3.309 ns - improved 85.4%
       34 -  88 cycles(tsc) 22.246 ns - 13 cycles(tsc)  3.294 ns - improved 85.2%
       48 -  88 cycles(tsc) 22.121 ns - 13 cycles(tsc)  3.492 ns - improved 85.2%
       64 -  88 cycles(tsc) 22.052 ns - 13 cycles(tsc)  3.411 ns - improved 85.2%
      128 -  89 cycles(tsc) 22.452 ns - 15 cycles(tsc)  3.841 ns - improved 83.1%
      158 -  89 cycles(tsc) 22.403 ns - 14 cycles(tsc)  3.746 ns - improved 84.3%
      250 -  91 cycles(tsc) 22.775 ns - 16 cycles(tsc)  4.111 ns - improved 82.4%
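
The "improved" column appears to be the relative reduction in cycles versus the fallback path; this reading is an assumption, checked here only against the rows of the table. A minimal C sketch, with `improvement_pct` as a hypothetical helper name:

```c
#include <assert.h>

/* Percentage of cycles saved relative to the fallback path. */
static double improvement_pct(unsigned int fallback_cycles,
                              unsigned int bulk_cycles)
{
    assert(fallback_cycles > 0 && bulk_cycles <= fallback_cycles);
    return 100.0 * (double)(fallback_cycles - bulk_cycles)
                 / (double)fallback_cycles;
}
```

For the bulk size 1 row, (102 - 41) / 102 is roughly 59.8%, and for bulk size 16, (88 - 13) / 88 is roughly 85.2%, which matches the table.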
      
      Note that very large bulk operations are not recommended with this
      bulk API, because local IRQs are disabled for the whole operation.
      
      [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: avoid running debug SLAB code with IRQs disabled for alloc_bulk · 7b0501dd
      Jesper Dangaard Brouer committed
      Move the call to cache_alloc_debugcheck_after() outside the IRQ disabled
      section in kmem_cache_alloc_bulk().
      
      When CONFIG_DEBUG_SLAB is disabled the compiler should remove this code.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: implement bulk alloc in SLAB allocator · 2a777eac
      Jesper Dangaard Brouer committed
      This patch implements the alloc side of bulk API for the SLAB allocator.

      Further optimizations are still possible by changing the call to
      __do_cache_alloc() into something that can return multiple objects.
      This optimization is left for later, given that the end results already
      show a speedup in the area of 80%.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: use slab_post_alloc_hook in SLAB allocator shared with SLUB · d5e3ed66
      Jesper Dangaard Brouer committed
      Reviewers notice that the order in slab_post_alloc_hook() of
      kmemcheck_slab_alloc() and kmemleak_alloc_recursive() gets swapped
      compared to slab.c / SLAB allocator.
      
      Also notice memset now occurs before calling kmemcheck_slab_alloc() and
      kmemleak_alloc_recursive().
      
      I assume this reordering of kmemcheck, kmemleak and memset is okay
      because this is the order they are used by the SLUB allocator.
      
      This patch completes the sharing of alloc_hook's between SLUB and SLAB.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: kmemcheck skip object if slab allocation failed · 0142eae3
      Jesper Dangaard Brouer committed
      In the SLAB allocator kmemcheck_slab_alloc() is guarded against being
      called in case the object is NULL.  In SLUB allocator this NULL pointer
      invocation can happen, which seems like an oversight.
      
      Move the NULL pointer check into kmemcheck code (kmemcheck_slab_alloc)
      so the check gets moved out of the fastpath, when not compiled with
      CONFIG_KMEMCHECK.
      
      This is a step towards sharing post_alloc_hook between SLUB and SLAB,
      because slab_post_alloc_hook() does not perform this check before
      calling kmemcheck_slab_alloc().
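
The pattern of moving the NULL check into the checker, so that the shared hook can call it unconditionally, can be sketched in userspace C. The names `TOY_CHECKER`, `toy_check_alloc` and `toy_post_alloc_hook` are hypothetical stand-ins and do not reproduce the kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Define TOY_CHECKER to mimic a build with the checker compiled in
 * (a hypothetical stand-in for a config option). */
#define TOY_CHECKER 1

static unsigned long toy_tracked;   /* objects seen by the checker */

#ifdef TOY_CHECKER
static void toy_check_alloc(void *obj)
{
    if (!obj)           /* allocation failed: nothing to track */
        return;
    toy_tracked++;
}
#else
/* Without the checker the call compiles away, NULL test included. */
static inline void toy_check_alloc(void *obj) { (void)obj; }
#endif

/* Shared post-allocation hook: no NULL test of its own. */
static void *toy_post_alloc_hook(void *obj)
{
    toy_check_alloc(obj);
    return obj;
}
```

With the guard inside the checker, the hook stays branch-free when the checker is compiled out, and a failed allocation is still handled safely when it is compiled in.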
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: use slab_pre_alloc_hook in SLAB allocator shared with SLUB · 011eceaf
      Jesper Dangaard Brouer committed
      Deduplicate code in SLAB allocator functions slab_alloc() and
      slab_alloc_node() by using the slab_pre_alloc_hook() call, which is now
      shared between SLUB and SLAB.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fault-inject take over bootstrap kmem_cache check · fab9963a
      Jesper Dangaard Brouer committed
      Remove the SLAB specific function slab_should_failslab(), by moving the
      check against fault-injection for the bootstrap slab, into the shared
      function should_failslab() (used by both SLAB and SLUB).
      
      This is a step towards sharing alloc_hook's between SLUB and SLAB.
      
      This bootstrap slab "kmem_cache" is used for allocating struct
      kmem_cache objects for the allocator itself.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: move SLUB alloc hooks to common mm/slab.h · 11c7aec2
      Jesper Dangaard Brouer committed
      First step towards sharing alloc_hook's between SLUB and SLAB
      allocators.  Move the SLUB allocator's *_alloc_hook functions to the
      common mm/slab.h for internal slab definitions.
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slub: clean up code for kmem cgroup support to kmem_cache_free_bulk · 376bf125
      Jesper Dangaard Brouer committed
      This change is primarily an attempt to make it easier to see the
      optimizations the compiler performs when CONFIG_MEMCG_KMEM is not
      enabled.

      Performance-wise, even when CONFIG_MEMCG_KMEM is compiled in, the
      overhead is zero.  This is because, as long as no process has enabled
      kmem cgroup accounting, the assignment is replaced by asm-NOP
      operations.  This is possible because memcg_kmem_enabled() uses a
      static_key_false() construct.

      It also helps readability, as it avoids accessing the p[] array like
      p[size - 1], which "exposes" that the array is processed backwards
      inside the helper function build_detached_freelist().

      Lastly, this also makes the code more robust in error cases such as
      passing NULL pointers in the array, which were previously handled
      before commit 03374518 ("slub: add missing kmem cgroup support to
      kmem_cache_free_bulk").
      
      Fixes: 03374518 ("slub: add missing kmem cgroup support to kmem_cache_free_bulk")
      Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • paride: make 'verbose' parameter an 'int' again · dec63a4d
      Arnd Bergmann committed
      gcc-6.0 found an ancient bug in the paride driver, which had a
      "module_param(verbose, bool, 0);" since before 2.6.12, but actually uses
      it to accept '0', '1' or '2' as arguments:
      
        drivers/block/paride/pd.c: In function 'pd_init_dev_parms':
        drivers/block/paride/pd.c:298:29: warning: comparison of constant '1' with boolean expression is always false [-Wbool-compare]
         #define DBMSG(msg) ((verbose>1)?(msg):NULL)
      
      In 2012, Rusty did a cleanup patch that also changed the type of the
      variable to 'bool', which introduced what is now a gcc warning.
      
      This changes the type back to 'int' and adapts the module_param() line
      instead, so it should work as documented in case anyone ever cares about
      running the ancient driver with debugging.
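
The underlying C type issue can be illustrated in userspace. This sketch only shows why a bool-typed variable can never compare greater than 1; it does not model how module_param() parses its argument, and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A bool-typed setting, as after the 2012 cleanup. */
static int debug_enabled_bool(int requested_level)
{
    bool verbose = requested_level;  /* any nonzero value becomes 1 */
    int stored = verbose;

    return stored > 1;               /* never true: level 2 is lost */
}

/* An int-typed setting, as restored by this patch. */
static int debug_enabled_int(int requested_level)
{
    int verbose = requested_level;

    return verbose > 1;
}
```

With a requested level of 2, the bool variant reports debugging as disabled while the int variant reports it as enabled, which is the always-false comparison gcc warns about.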
      
      Fixes: 90ab5ee9 ("module_param: make bool parameters really bool (drivers & misc)")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Rusty Russell <rusty@rustcorp.com.au>
      Cc: Tim Waugh <tim@cyberelk.net>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • block: partition: add partition specific uevent callbacks for partition info · 0d9c51a6
      San Mehat committed
      This patch has been carried in the Android tree for quite some time and
      is one of the few patches required to get a mainline kernel up and
      running with an existing Android userspace.  So I wanted to submit it
      for review and for consideration of whether it should be merged.

      For partitions, add new uevent parameters 'PARTN', which specifies the
      partition's index in the table, and 'PARTNAME', which specifies the
      partition name of a partition device.

      Android's userspace uses this for creating device node links from the
      partition name and number, i.e.:
      
          /dev/block/platform/soc/by-name/system
      or
          /dev/block/platform/soc/by-num/p1
      
      One can see its usage here:
          https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#355
      and
          https://android.googlesource.com/platform/system/core/+/master/init/devices.cpp#494
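
How a userspace consumer might turn the two uevent variables into the link paths shown above can be sketched in C. This is not Android's init code; `uevent_value` and `partition_links` are hypothetical helpers, and the "KEY=value" line format of the environment is assumed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the value of KEY from a "KEY=value" line, or NULL if the line
 * is for a different key. */
static const char *uevent_value(const char *line, const char *key)
{
    size_t len = strlen(key);

    if (strncmp(line, key, len) == 0 && line[len] == '=')
        return line + len + 1;
    return NULL;
}

/* Build the by-name and by-num link paths for one partition.
 * Returns 0 on success, -1 if a variable is missing or a path is
 * too long for the buffers of size cap. */
static int partition_links(const char *const *env, size_t n,
                           const char *base, char *by_name,
                           char *by_num, size_t cap)
{
    const char *partn = NULL, *partname = NULL, *v;
    size_t i;
    int len;

    assert(cap > 0);
    for (i = 0; i < n; i++) {
        if ((v = uevent_value(env[i], "PARTN")) != NULL)
            partn = v;
        else if ((v = uevent_value(env[i], "PARTNAME")) != NULL)
            partname = v;
    }
    if (!partn || !partname)
        return -1;

    len = snprintf(by_name, cap, "%s/by-name/%s", base, partname);
    if (len < 0 || (size_t)len >= cap)
        return -1;
    len = snprintf(by_num, cap, "%s/by-num/p%s", base, partn);
    if (len < 0 || (size_t)len >= cap)
        return -1;
    return 0;
}
```

Given an environment containing `PARTN=1` and `PARTNAME=system` and the base `/dev/block/platform/soc`, the helper produces the two example paths from the commit message.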
      
      [john.stultz@linaro.org: dropped NPARTS and reworded commit message for context]
      Signed-off-by: Dima Zavin <dima@android.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Rom Lemarchand <romlem@google.com>
      Cc: Android Kernel Team <kernel-team@android.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: <harald@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kay Sievers <kay@vrfy.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: fix a variable overflow problem in dlmdomain.c · 8d67d3c2
      Jun Piao committed
      In dlm_send_join_cancels(), node is defined with type unsigned int but
      initialized with -1, which leads to variable overflow.  Although this
      won't cause any runtime problem, the code looks a little inconsistent.
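
What happens when an unsigned int is initialized with -1 can be shown in a few lines of C. The helper names are hypothetical. The suggestion that a later `node + 1` wraps back to 0 is an assumption about why the kernel code is harmless at runtime, not something stated in the commit message.

```c
#include <assert.h>
#include <limits.h>

/* The value an unsigned int holds after being initialized with -1. */
static unsigned int unsigned_minus_one(void)
{
    unsigned int node = -1;   /* converted modulo 2^N: UINT_MAX */

    return node;
}

/* First index a "search from node + 1" loop would start at (assumed
 * usage pattern, for illustration only). */
static unsigned int first_search_index(unsigned int node)
{
    return node + 1;          /* unsigned wraparound is well defined */
}
```

The conversion is well defined in C, so the result is UINT_MAX rather than undefined behavior, and adding 1 to it yields 0.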
      Signed-off-by: Jun Piao <piaojun@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2: fix a tiny race that leads file system read-only · 814ce694
      Jiufei Xue committed
      When o2hb detects that a node is down, it first adds the dead node to
      the recovery map and creates ocfs2rec, which will replay the journal for
      the dead node.  The o2hb thread then calls
      dlm_do_local_recovery_cleanup() to delete the locks of the dead node.
      After the locks of the dead node are gone, locks for other nodes can be
      granted, and those nodes may modify the metadata without replaying the
      journal of the dead node.  The details are as follows.
      
           N1                         N2                   N3(master)
      modify the extent tree of
      inode, and commit
      dirty metadata to journal,
      then goes down.
                                                       o2hb thread detects
                                                       N1 goes down, set
                                                       recovery map and
                                                       delete the lock of N1.
      
                                                       dlm_thread flush ast
                                                       for the lock of N2.
                              do not detect the death
                              of N1, so recovery map is
                              empty.
      
                              read inode from disk
                              without replaying
                              the journal of N1 and
                              modify the extent tree
                              of the inode that N1
                              had modified.
                                                       ocfs2rec recover the
                                                       journal of N1.
                                                       The modification of N2
                                                       is lost.
      
      The modifications of N1 and N2 are not serialized, and this leads to a
      read-only file system.  We can set the recovery_waiting flag on the lock
      resource after deleting the lock of the dead node, to prevent other
      nodes from getting the lock before dlm recovery.  After dlm recovery,
      the recovery map on N2 is not empty, so ocfs2_inode_lock_full_nested()
      will wait for ocfs2 recovery.
      Signed-off-by: Jiufei Xue <xuejiufei@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: return EINVAL when the lockres on migration target is in DROPPING_REF state · d277f33e
      xuejiufei committed
      If the master migrates a lock resource to a node while that node happens
      to be purging it, a new lock resource is created and inserted into the
      hash list.  If the master then goes down, the lock resource being purged
      is recovered, so two lock resources with different owners exist.  So
      return an error to the master if the lock resource is in the
      DROPPING_REF state; the master will then retry migrating this lock
      resource.
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: clear DROPPING_REF flag when the master goes down · 8c034396
      xuejiufei committed
      If the master goes down after returning in-progress for a deref message,
      the lock resource on the non-master node cannot be purged.  Clear the
      DROPPING_REF flag and recover it.
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: return in progress if master can not clear the refmap bit right now · 842b90b6
      xuejiufei committed
      The master returns in-progress to a non-master node when it cannot clear
      the refmap bit right away.  The non-master node will then not purge the
      lock resource until it receives the deref done message.
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ocfs2/dlm: add DEREF_DONE message · 60d663cb
      xuejiufei committed
      This series of patches fixes the ordering issue of setting/clearing
      the refmap bit described below.
      
      Node 1                               Node 2(master)
      dlmlock
      dlm_do_master_request
                                      dlm_master_request_handler
                                      -> dlm_lockres_set_refmap_bit
      dlmlock succeed
      dlmunlock succeed
      
      dlm_purge_lockres
                                      dlm_deref_handler
                                      -> find lock resource is in
                                         DLM_LOCK_RES_SETREF_INPROG state,
                                         so dispatch a deref work
      dlm_purge_lockres succeed.
      
      call dlmlock again
      dlm_do_master_request
                                      dlm_master_request_handler
                                      -> dlm_lockres_set_refmap_bit
      
                                      deref work trigger, call
                                      dlm_lockres_clear_refmap_bit
                                      to clear Node 1 from refmap
      
                                      dlm_purge_lockres succeed
      
      dlm_send_remote_lock_request
                                      return DLM_IVLOCKID because
                                      the lockres is not exist
      BUG if the lockres is $RECOVERY
      
      This series of patches adds a new message to keep the order of set and
      clear.  Other nodes can purge the lock resource only after the refmap
      bit on the master is cleared.

      This patch adds the DEREF_DONE message and the corresponding handler.  A
      node can purge the lock resource after receiving this message.  As a new
      message is added, the minor number of the dlm protocol version is
      increased.
      Signed-off-by: xuejiufei <xuejiufei@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60d663cb
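      The ordering rule described in this commit (a node purges only after
      the master confirms the refmap bit is cleared) can be sketched as a
      small state model.  This is an illustrative userspace sketch, not the
      ocfs2/dlm code: the structures and function names below are
      hypothetical, and only DLM_LOCK_RES_SETREF_INPROG and DEREF_DONE come
      from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of one lock resource; these are not the
 * real ocfs2/dlm structures. */
struct master_res {
    bool refmap_bit;     /* master believes the node holds a reference */
    bool setref_inprog;  /* models DLM_LOCK_RES_SETREF_INPROG */
    bool deref_queued;   /* a deferred deref work item is pending */
};

struct node_res {
    bool deref_sent;     /* deref sent, DEREF_DONE not yet received */
    bool purged;
};

/* Node: ask the master to drop the reference; the purge is NOT done yet. */
static void node_send_deref(struct node_res *n)
{
    n->deref_sent = true;
}

/* Master: handle a deref. Returns true if DEREF_DONE can be sent now. */
static bool master_handle_deref(struct master_res *m)
{
    if (m->setref_inprog) {
        m->deref_queued = true;  /* clear the bit later, reply later */
        return false;
    }
    m->refmap_bit = false;
    return true;
}

/* Master: run the deferred work. Returns true if DEREF_DONE is now due. */
static bool master_run_deref_work(struct master_res *m)
{
    if (!m->deref_queued)
        return false;
    m->deref_queued = false;
    m->refmap_bit = false;
    return true;
}

/* Node: only the DEREF_DONE message completes the purge. */
static void node_handle_deref_done(struct node_res *n)
{
    assert(n->deref_sent);
    n->deref_sent = false;
    n->purged = true;
}
```

      In this model the node cannot finish purging, and so cannot start a
      fresh master request for the same resource, while a deferred clear is
      still pending on the master.  That is the window the diagram above
      shows being hit.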
    • J
      ocfs2/dlm: fix a typo in dlmcommon.h · 39b29af0
      Joseph Qi committed
      Refer to cluster/tcp.h, NET_MAX_PAYLOAD_BYTES is a typo for
      O2NET_MAX_PAYLOAD_BYTES.
      
      Since DLM_MIG_LOCKRES_RESERVED is not currently used, the typo causes no
      problem, but it should be corrected for future use.
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39b29af0
    • J
      ocfs2: use spinlock_irqsave() to downconvert lock in ocfs2_osb_dump() · bfd97a03
      jiangyiwen committed
      Commit a75e9cca ("ocfs2: use spinlock irqsave for downconvert lock")
      missed one call site in ocfs2_osb_dump(), so a deadlock scenario still
      exists.
      
          ocfs2_wake_downconvert_thread
          ocfs2_rw_unlock
          ocfs2_dio_end_io
          dio_complete
          .....
          bio_endio
          req_bio_endio
          ....
          scsi_io_completion
          blk_done_softirq
          __do_softirq
          do_softirq
          irq_exit
          do_IRQ
          ocfs2_osb_dump
          cat /sys/kernel/debug/ocfs2/${uuid}/fs_state
      
      This patch replaces the remaining spin_lock() with spin_lock_irqsave()
      to resolve it.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfd97a03
    • J
      ocfs2/cluster: replace the interrupt safe spinlocks with common ones · 4d548f61
      jiangyiwen committed
      No hardware or software interrupt context uses o2hb_live_lock, so there
      is no need to worry about races with irq/softirq handlers while the
      spinlock is held.  Disabling irqs also hurts system performance, so
      replace the interrupt-safe lock calls with the plain, non-interrupt-safe
      variants.
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d548f61
    • S
      blackfin: define dummy pgprot_writecombine for !MMU · e928f350
      Sudip Mukherjee committed
      blackfin allmodconfig build fails with the error:
      
        ../sound/core/pcm_native.c: In function 'snd_pcm_lib_default_mmap':
        ../sound/core/pcm_native.c:3386:24: error: implicit declaration of function 'pgprot_writecombine' [-Werror=implicit-function-declaration]
           area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
                                ^
        ../sound/core/pcm_native.c:3386:22: error: incompatible types when assigning to type 'pgprot_t {aka struct <anonymous>}' from type 'int'
           area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
                              ^
      
      When !MMU, asm-generic does not define a default pgprot_writecombine, so
      blackfin needs to define it itself.
      
      The patch idea is from commit 65b9ab88 ("arch/c6x/include/asm/pgtable.h:
      define dummy pgprot_writecombine for !MMU")
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Cc: Steven Miao <realmz6@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e928f350
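      The idea of a dummy pgprot_writecombine for a no-MMU build can be
      sketched in userspace as below.  This is an assumption-laden sketch:
      the pgprot_t stand-in, struct vm_area and default_mmap() are
      hypothetical, and the actual patch may define the macro differently
      (for example in terms of another pgprot helper).

```c
#include <assert.h>

/* Stand-in for the kernel's pgprot_t; the real type is arch-specific. */
typedef struct { unsigned long pgprot; } pgprot_t;

/* Without an MMU there are no page-level caching attributes to change,
 * so a dummy definition can hand the protection value back unchanged. */
#ifndef pgprot_writecombine
#define pgprot_writecombine(prot) (prot)
#endif

/* Hypothetical, trimmed-down VMA with only the field used here. */
struct vm_area { pgprot_t vm_page_prot; };

/* Mirrors the shape of the assignment that failed in the build log. */
static void default_mmap(struct vm_area *area)
{
    area->vm_page_prot = pgprot_writecombine(area->vm_page_prot);
}
```

      With the macro defined, the assignment type-checks (pgprot_t on both
      sides) instead of falling back to an implicit int-returning function.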
    • S
      m32r: mm: fix build warning · 3701dc81
      Sudip Mukherjee committed
      While building we are getting warnings:
      
        arch/m32r/mm/init.c:63:17: warning: unused variable 'low'
        arch/m32r/mm/init.c:62:17: warning: unused variable 'max_dma'
      
      max_dma and low are only used if CONFIG_MMU is defined.  Let's declare
      the variables inside the #ifdef.
      Signed-off-by: Sudip Mukherjee <sudip@vectorindia.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3701dc81
    • P
      tags: Fix DEFINE_PER_CPU expansions · 25528213
      Peter Zijlstra committed
      $ make tags
        GEN     tags
      ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
      ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
      ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
      ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
      ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
      ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
      ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
      ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"
      
      Which are all the result of the DEFINE_PER_CPU pattern:
      
        scripts/tags.sh:200:	'/\<DEFINE_PER_CPU([^,]*, *\([[:alnum:]_]*\)/\1/v/'
        scripts/tags.sh:201:	'/\<DEFINE_PER_CPU_SHARED_ALIGNED([^,]*, *\([[:alnum:]_]*\)/\1/v/'
      
      The below cures them. All except the workqueue one are within reasonable
      distance of the 80 char limit. TJ do you have any preference on how to
      fix the wq one, or shall we just not care that it's too long?
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25528213
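      One plausible reading of the "null expansion" warnings is that the
      capture group in the DEFINE_PER_CPU pattern matches an empty string
      when the variable name sits on the following source line.  The sketch
      below reproduces that with POSIX regcomp(); it is an illustration, not
      what ctags does internally.  The helper name per_cpu_tag_name() is
      hypothetical, and the leading "\<" word anchor from tags.sh is dropped
      because POSIX does not guarantee it.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Copies the first capture group of the DEFINE_PER_CPU pattern into out.
 * Returns 0 when the pattern matches, -1 otherwise. */
static int per_cpu_tag_name(const char *line, char *out, size_t outlen)
{
    /* POSIX basic regex: "(" is a literal, "\(...\)" is the group. */
    static const char pat[] = "DEFINE_PER_CPU([^,]*, *\\([[:alnum:]_]*\\)";
    regex_t re;
    regmatch_t m[2];
    int rc = -1;

    if (outlen == 0 || regcomp(&re, pat, 0) != 0)
        return -1;
    if (regexec(&re, line, 2, m, 0) == 0) {
        size_t len = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (len >= outlen)
            len = outlen - 1;
        memcpy(out, line + m[1].rm_so, len);
        out[len] = '\0';   /* empty string == "null expansion" of \1 */
        rc = 0;
    }
    regfree(&re);
    return rc;
}
```

      A single-line declaration such as "static DEFINE_PER_CPU(int,
      hit_count);" yields the name, while a line that ends right after the
      comma still matches but leaves the group empty.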
    • G
      init/main.c: use list_for_each_entry() · e6fd1fb3
      Geliang Tang committed
      Use list_for_each_entry() instead of list_for_each() to simplify the code.
      Signed-off-by: Geliang Tang <geliangtang@163.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6fd1fb3
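      To show why list_for_each_entry() is the simpler form, here is a
      self-contained userspace sketch with minimal re-implementations of the
      list macros.  These are simplified stand-ins, not the definitions from
      include/linux/list.h, and struct item is a hypothetical payload; the
      entry macro relies on the GNU __typeof__ extension.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Walks raw nodes; the caller converts each one with list_entry(). */
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* Walks the containing structures directly. */
#define list_for_each_entry(pos, head, member) \
    for (pos = list_entry((head)->next, __typeof__(*pos), member); \
         &pos->member != (head); \
         pos = list_entry(pos->member.next, __typeof__(*pos), member))

static void list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

struct item {              /* hypothetical payload */
    int value;
    struct list_head node;
};

/* Before: iterate nodes, then convert by hand. */
static int sum_with_list_for_each(struct list_head *head)
{
    struct list_head *tmp;
    int sum = 0;

    list_for_each(tmp, head) {
        struct item *it = list_entry(tmp, struct item, node);
        sum += it->value;
    }
    return sum;
}

/* After: the macro does the conversion, so the temporary disappears. */
static int sum_with_list_for_each_entry(struct list_head *head)
{
    struct item *it;
    int sum = 0;

    list_for_each_entry(it, head, node)
        sum += it->value;
    return sum;
}
```

      Both loops visit the same elements; the second needs neither the extra
      cursor variable nor the explicit list_entry() call.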
  2. 15 Mar, 2016 8 commits
    • L
      Merge branch 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e23604ed
      Linus Torvalds committed
      Pull NOHZ updates from Ingo Molnar:
       "NOHZ enhancements, by Frederic Weisbecker, which reorganizes/refactors
        the NOHZ 'can the tick be stopped?' infrastructure and related code to
        be data driven, and harmonizes the naming and handling of all the
        various properties"
      
      [ This makes the ugly "fetch_or()" macro that the scheduler used
        internally a new generic helper, and does a bad job at it.
      
        I'm pulling it, but I've asked Ingo and Frederic to get this
        fixed up ]
      
      * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        sched-clock: Migrate to use new tick dependency mask model
        posix-cpu-timers: Migrate to use new tick dependency mask model
        sched: Migrate sched to use new tick dependency mask model
        sched: Account rr tasks
        perf: Migrate perf to use new tick dependency mask model
        nohz: Use enum code for tick stop failure tracing message
        nohz: New tick dependency mask
        nohz: Implement wide kick on top of irq work
        atomic: Export fetch_or()
      e23604ed
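      The merge text mentions a generic fetch_or() helper and a tick
      dependency mask.  A rough userspace sketch of how the two could fit
      together is shown below, assuming C11 atomics stand in for the kernel's
      atomic helpers.  The enum values and the tick_dep_*() function names
      are hypothetical; only the fetch_or() name comes from the text above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* fetch_or built from a compare-and-swap loop: OR mask into *ptr and
 * return the value seen before the update. */
static unsigned int fetch_or(_Atomic unsigned int *ptr, unsigned int mask)
{
    unsigned int old = atomic_load(ptr);

    /* On failure, old is refreshed with the current value and we retry. */
    while (!atomic_compare_exchange_weak(ptr, &old, old | mask))
        ;
    return old;
}

/* Hypothetical dependency bits: each subsystem that needs the tick
 * owns one bit in the mask. */
enum {
    DEP_POSIX_TIMER = 1u << 0,
    DEP_PERF_EVENTS = 1u << 1,
    DEP_SCHED       = 1u << 2,
};

/* Returns true when this call set the first dependency, i.e. the point
 * at which a stopped tick would have to be restarted. */
static bool tick_dep_set(_Atomic unsigned int *dep, unsigned int bit)
{
    return fetch_or(dep, bit) == 0;
}

static void tick_dep_clear(_Atomic unsigned int *dep, unsigned int bit)
{
    atomic_fetch_and(dep, ~bit);
}

/* Data-driven check: the tick can stop only if nobody depends on it. */
static bool can_stop_tick(_Atomic unsigned int *dep)
{
    return atomic_load(dep) == 0;
}
```

      Returning the previous mask is what lets the setter tell "first
      dependency" apart from "one more dependency" without a separate lock.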
    • L
      Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d4e79615
      Linus Torvalds committed
      Pull scheduler updates from Ingo Molnar:
       "The main changes in this cycle are:
      
         - Make schedstats a runtime tunable (disabled by default) and
           optimize it via static keys.
      
           As most distributions enable CONFIG_SCHEDSTATS=y due to its
           instrumentation value, this is a nice performance enhancement.
           (Mel Gorman)
      
         - Implement 'simple waitqueues' (swait): these are just pure
           waitqueues without any of the more complex features of full-blown
           waitqueues (callbacks, wake flags, wake keys, etc.).  Simple
           waitqueues have less memory overhead and are faster.
      
           Use simple waitqueues in the RCU code (in 4 different places) and
           for handling KVM vCPU wakeups.
      
           (Peter Zijlstra, Daniel Wagner, Thomas Gleixner, Paul Gortmaker,
           Marcelo Tosatti)
      
         - sched/numa enhancements (Rik van Riel)
      
         - NOHZ performance enhancements (Rik van Riel)
      
         - Various sched/deadline enhancements (Steven Rostedt)
      
         - Various fixes (Peter Zijlstra)
      
         - ... and a number of other fixes, cleanups and smaller enhancements"
      
      * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
        sched/cputime: Fix steal_account_process_tick() to always return jiffies
        sched/deadline: Remove dl_new from struct sched_dl_entity
        Revert "kbuild: Add option to turn incompatible pointer check into error"
        sched/deadline: Remove superfluous call to switched_to_dl()
        sched/debug: Fix preempt_disable_ip recording for preempt_disable()
        sched, time: Switch VIRT_CPU_ACCOUNTING_GEN to jiffy granularity
        time, acct: Drop irq save & restore from __acct_update_integrals()
        acct, time: Change indentation in __acct_update_integrals()
        sched, time: Remove non-power-of-two divides from __acct_update_integrals()
        sched/rt: Kick RT bandwidth timer immediately on start up
        sched/debug: Add deadline scheduler bandwidth ratio to /proc/sched_debug
        sched/debug: Move sched_domain_sysctl to debug.c
        sched/debug: Move the /sys/kernel/debug/sched_features file setup into debug.c
        sched/rt: Fix PI handling vs. sched_setscheduler()
        sched/core: Remove duplicated sched_group_set_shares() prototype
        sched/fair: Consolidate nohz CPU load update code
        sched/fair: Avoid using decay_load_missed() with a negative value
        sched/deadline: Always calculate end of period on sched_yield()
        sched/cgroup: Fix cgroup entity load tracking tear-down
        rcu: Use simple wait queues where possible in rcutree
        ...
      d4e79615
    • L
      Merge branch 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d88bfe1d
      Linus Torvalds committed
      Pull RAS updates from Ingo Molnar:
       "Various RAS updates:
      
         - AMD MCE support updates for future CPUs, fixes and 'SMCA' (Scalable
           MCA) error decoding support (Aravind Gopalakrishnan)
      
         - x86 memcpy_mcsafe() support, to enable smart(er) hardware error
           recovery in NVDIMM drivers, based on an extension of the x86
           exception handling code.  (Tony Luck)"
      
      * 'ras-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        EDAC/sb_edac: Fix computation of channel address
        x86/mm, x86/mce: Add memcpy_mcsafe()
        x86/mce/AMD: Document some functionality
        x86/mce: Clarify comments regarding deferred error
        x86/mce/AMD: Fix logic to obtain block address
        x86/mce/AMD, EDAC: Enable error decoding of Scalable MCA errors
        x86/mce: Move MCx_CONFIG MSR definitions
        x86/mce: Check for faults tagged in EXTABLE_CLASS_FAULT exception table entries
        x86/mm: Expand the exception table logic to allow new handling options
        x86/mce/AMD: Set MCAX Enable bit
        x86/mce/AMD: Carve out threshold block preparation
        x86/mce/AMD: Fix LVT offset configuration for thresholding
        x86/mce/AMD: Reduce number of blocks scanned per bank
        x86/mce/AMD: Do not perform shared bank check for future processors
        x86/mce: Fix order of AMD MCE init function call
      d88bfe1d
    • L
      Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · e71c2c1e
      Linus Torvalds committed
      Pull perf updates from Ingo Molnar:
       "Main kernel side changes:
      
         - Big reorganization of the x86 perf support code.  The old code grew
           organically deep inside arch/x86/kernel/cpu/perf* and its naming
           became somewhat messy.
      
           The new location is under arch/x86/events/, using the following
           cleaner hierarchy of source code files:
      
             perf/x86: Move perf_event.c .................. => x86/events/core.c
             perf/x86: Move perf_event_amd.c .............. => x86/events/amd/core.c
             perf/x86: Move perf_event_amd_ibs.c .......... => x86/events/amd/ibs.c
             perf/x86: Move perf_event_amd_iommu.[ch] ..... => x86/events/amd/iommu.[ch]
             perf/x86: Move perf_event_amd_uncore.c ....... => x86/events/amd/uncore.c
             perf/x86: Move perf_event_intel_bts.c ........ => x86/events/intel/bts.c
             perf/x86: Move perf_event_intel.c ............ => x86/events/intel/core.c
             perf/x86: Move perf_event_intel_cqm.c ........ => x86/events/intel/cqm.c
             perf/x86: Move perf_event_intel_cstate.c ..... => x86/events/intel/cstate.c
             perf/x86: Move perf_event_intel_ds.c ......... => x86/events/intel/ds.c
             perf/x86: Move perf_event_intel_lbr.c ........ => x86/events/intel/lbr.c
             perf/x86: Move perf_event_intel_pt.[ch] ...... => x86/events/intel/pt.[ch]
             perf/x86: Move perf_event_intel_rapl.c ....... => x86/events/intel/rapl.c
             perf/x86: Move perf_event_intel_uncore.[ch] .. => x86/events/intel/uncore.[ch]
             perf/x86: Move perf_event_intel_uncore_nhmex.c => x86/events/intel/uncore_nhmex.c
             perf/x86: Move perf_event_intel_uncore_snb.c   => x86/events/intel/uncore_snb.c
             perf/x86: Move perf_event_intel_uncore_snbep.c => x86/events/intel/uncore_snbep.c
             perf/x86: Move perf_event_knc.c .............. => x86/events/intel/knc.c
             perf/x86: Move perf_event_p4.c ............... => x86/events/intel/p4.c
             perf/x86: Move perf_event_p6.c ............... => x86/events/intel/p6.c
             perf/x86: Move perf_event_msr.c .............. => x86/events/msr.c
      
           (Borislav Petkov)
      
         - Update various x86 PMU constraint and hw support details (Stephane
           Eranian)
      
         - Optimize kprobes for BPF execution (Martin KaFai Lau)
      
         - Rewrite, refactor and fix the Intel uncore PMU driver code (Thomas
           Gleixner)
      
         - Rewrite, refactor and fix the Intel RAPL PMU code (Thomas Gleixner)
      
         - Various fixes and smaller cleanups.
      
        There are lots of perf tooling updates as well.  A few highlights:
      
        perf report/top:
      
           - Hierarchy histogram mode for 'perf top' and 'perf report',
             showing multiple levels, one per --sort entry: (Namhyung Kim)
      
             On a mostly idle system:
      
               # perf top --hierarchy -s comm,dso
      
             Then expand some levels and use 'P' to take a snapshot:
      
               # cat perf.hist.0
               -  92.32%         perf
                     58.20%         perf
                     22.29%         libc-2.22.so
                      5.97%         [kernel]
                      4.18%         libelf-0.165.so
                      1.69%         [unknown]
               -   4.71%         qemu-system-x86
                      3.10%         [kernel]
                      1.60%         qemu-system-x86_64 (deleted)
               +   2.97%         swapper
               #
      
           - Add 'L' hotkey to dynamically set the percent threshold for
             histogram entries and callchains, i.e.  dynamically do what the
             --percent-limit command line option to 'top' and 'report' does.
             (Namhyung Kim)
      
        perf mem:
      
           - Allow specifying events via -e in 'perf mem record', also listing
             what events can be specified via 'perf mem record -e list' (Jiri
             Olsa)
      
        perf record:
      
           - Add 'perf record' --all-user/--all-kernel options, so that one
             can tell that all the events in the command line should be
             restricted to the user or kernel levels (Jiri Olsa), i.e.:
      
               perf record -e cycles:u,instructions:u
      
             is equivalent to:
      
               perf record --all-user -e cycles,instructions
      
           - Make 'perf record' collect CPU cache info in the perf.data file header:
      
               $ perf record usleep 1
               [ perf record: Woken up 1 times to write data ]
               [ perf record: Captured and wrote 0.017 MB perf.data (7 samples) ]
               $ perf report --header-only -I | tail -10 | head -8
               # CPU cache info:
               #  L1 Data                 32K [0-1]
               #  L1 Instruction          32K [0-1]
               #  L1 Data                 32K [2-3]
               #  L1 Instruction          32K [2-3]
               #  L2 Unified             256K [0-1]
               #  L2 Unified             256K [2-3]
               #  L3 Unified            4096K [0-3]
      
             Will be used in 'perf c2c' and eventually in 'perf diff' to
             allow, for instance, running the same workload on multiple
             machines and then having 'diff' show the hardware differences.
             (Jiri Olsa)
      
           - Improved support for Java, using the JVMTI agent library to do
             jitdumps that then will be inserted in synthesized
             PERF_RECORD_MMAP2 events via 'perf inject' pointed to synthesized
             ELF files stored in ~/.debug and keyed with build-ids, to allow
             symbol resolution and even annotation with source line info, see
             the changeset comments to see how to use it (Stephane Eranian)
      
        perf script/trace:
      
           - Decode data_src values (e.g.  perf.data files generated by 'perf
             mem record') in 'perf script': (Jiri Olsa)
      
               # perf script
                 perf 693 [1] 4.088652: 1 cpu/mem-loads,ldlat=30/P: ffff88007d0b0f40 68100142 L1 hit|SNP None|TLB L1 or L2 hit|LCK No <SNIP>
                                                                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
           - Improve support to 'data_src', 'weight' and 'addr' fields in
             'perf script' (Jiri Olsa)
      
           - Handle empty print fmts in 'perf script -s' i.e. when running
             python or perl scripts (Taeung Song)
      
        perf stat:
      
           - 'perf stat' now shows shadow metrics (insn per cycle, etc) in
             interval mode too.  E.g:
      
               # perf stat -I 1000 -e instructions,cycles sleep 1
               #         time   counts unit events
                  1.000215928  519,620      instructions     #  0.69 insn per cycle
                  1.000215928  752,003      cycles
               <SNIP>
      
           - Port 'perf kvm stat' to PowerPC (Hemant Kumar)
      
           - Implement CSV metrics output in 'perf stat' (Andi Kleen)
      
        perf BPF support:
      
           - Support converting data from bpf events in 'perf data' (Wang Nan)
      
           - Print bpf-output events in 'perf script': (Wang Nan).
      
               # perf record -e bpf-output/no-inherit,name=evt/ -e ./test_bpf_output_3.c/map:channel.event=evt/ usleep 1000
               # perf script
                  usleep  4882 21384.532523:   evt:  ffffffff810e97d1 sys_nanosleep ([kernel.kallsyms])
                   BPF output: 0000: 52 61 69 73 65 20 61 20  Raise a
                               0008: 42 50 46 20 65 76 65 6e  BPF even
                               0010: 74 21 00 00              t!..
                   BPF string: "Raise a BPF event!"
               #
      
           - Add API to set values of map entries in a BPF object, be it
             individual map slots or ranges (Wang Nan)
      
           - Introduce support for the 'bpf-output' event (Wang Nan)
      
           - Add glue to read perf events in a BPF program (Wang Nan)
      
           - Improve support for bpf-output events in 'perf trace' (Wang Nan)
      
        ... and tons of other changes as well - see the shortlog and git log
        for details!"
      
      * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (342 commits)
        perf stat: Add --metric-only support for -A
        perf stat: Implement --metric-only mode
        perf stat: Document CSV format in manpage
        perf hists browser: Check sort keys before hot key actions
        perf hists browser: Allow thread filtering for comm sort key
        perf tools: Add sort__has_comm variable
        perf tools: Recalc total periods using top-level entries in hierarchy
        perf tools: Remove nr_sort_keys field
        perf hists browser: Cleanup hist_browser__fprintf_hierarchy_entry()
        perf tools: Remove hist_entry->fmt field
        perf tools: Fix command line filters in hierarchy mode
        perf tools: Add more sort entry check functions
        perf tools: Fix hist_entry__filter() for hierarchy
        perf jitdump: Build only on supported archs
        tools lib traceevent: Add '~' operation within arg_num_eval()
        perf tools: Omit unnecessary cast in perf_pmu__parse_scale
        perf tools: Pass perf_hpp_list all the way through setup_sort_list
        perf tools: Fix perf script python database export crash
        perf jitdump: DWARF is also needed
        perf bench mem: Prepare the x86-64 build for upstream memcpy_mcsafe() changes
        ...
      e71c2c1e
    • L
      Merge branch 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d09e356a
      Linus Torvalds committed
      Pull read-only kernel memory updates from Ingo Molnar:
       "This tree adds two (security related) enhancements to the kernel's
        handling of read-only kernel memory:
      
         - extend read-only kernel memory to a new class of formerly writable
           kernel data: 'post-init read-only memory' via the __ro_after_init
           attribute, and mark the ARM and x86 vDSO as such read-only memory.
      
           This kind of attribute can be used for data that requires a once
           per bootup initialization sequence, but is otherwise never modified
           after that point.
      
           This feature was based on the work by PaX Team and Brad Spengler.
      
           (by Kees Cook, the ARM vDSO bits by David Brown.)
      
         - make CONFIG_DEBUG_RODATA always enabled on x86 and remove the
           Kconfig option.  This simplifies the kernel and also signals that
           read-only memory is the default model and a first-class citizen.
           (Kees Cook)"
      
      * 'mm-readonly-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ARM/vdso: Mark the vDSO code read-only after init
        x86/vdso: Mark the vDSO code read-only after init
        lkdtm: Verify that '__ro_after_init' works correctly
        arch: Introduce post-init read-only memory
        x86/mm: Always enable CONFIG_DEBUG_RODATA and remove the Kconfig option
        mm/init: Add 'rodata=off' boot cmdline parameter to disable read-only kernel mappings
        asm-generic: Consolidate mark_rodata_ro()
      d09e356a
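      The "initialise once, then read-only" model behind __ro_after_init can
      be approximated in userspace with mmap() and mprotect().  This is only
      an analogy under stated assumptions: the kernel feature works on a
      dedicated data section, not on individual mappings, and struct
      boot_config and boot_config_init() are hypothetical names.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on strict glibc builds */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical data that is written once during start-up. */
struct boot_config {
    int nr_cpus;
};

/* Allocates one page, fills it in, then drops write permission.
 * Returns NULL on failure. Later writes through the pointer would fault. */
static struct boot_config *boot_config_init(int nr_cpus)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    struct boot_config *cfg;

    cfg = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cfg == MAP_FAILED)
        return NULL;

    cfg->nr_cpus = nr_cpus;              /* the "init" phase */

    if (mprotect(cfg, len, PROT_READ) != 0) {  /* the "after init" phase */
        munmap(cfg, len);
        return NULL;
    }
    return cfg;
}
```

      Reads keep working after the permission change; an accidental or
      malicious write is turned into a fault instead of silent corruption.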
    • L
      Merge branch 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · 5ec94246
      Linus Torvalds committed
      Pull dma_*_writecombine rename from Ingo Molnar:
       "Rename dma_*_writecombine() to dma_*_wc()
      
        This is a tree-wide API rename, to move the dma_*() write-combining
        APIs closer in name to their usual API families.  (The old API names
        are kept as compatibility wrappers to not introduce extra breakage.)
      
        The patch was Coccinelle generated"
      
      * 'mm-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        dma, mm/pat: Rename dma_*_writecombine() to dma_*_wc()
      5ec94246
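      Keeping old API names as compatibility wrappers during a tree-wide
      rename can be as small as a pair of aliases.  The sketch below uses
      hypothetical buf_*() functions backed by calloc()/free() purely to
      show the pattern; it does not reproduce the dma_*_wc() signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* New, shorter names (stand-ins for the renamed API). */
static void *buf_alloc_wc(size_t size)
{
    return calloc(1, size);
}

static void buf_free_wc(void *buf)
{
    free(buf);
}

/* Old names kept as thin aliases so existing callers still build. */
#define buf_alloc_writecombine buf_alloc_wc
#define buf_free_writecombine  buf_free_wc
```

      Callers can then be converted gradually, and the aliases removed once
      no user of the old names remains.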
    • L
      Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · fbed0bc0
      Linus Torvalds committed
      Pull locking changes from Ingo Molnar:
       "Various updates:
      
         - Futex scalability improvements: remove page lock use for shared
           futex get_futex_key(), which speeds up 'perf bench futex hash'
           benchmarks by over 40% on a 60-core Westmere.  This makes anon-mem
           shared futexes perform close to private futexes.  (Mel Gorman)
      
         - lockdep hash collision detection and fix (Alfredo Alvarez
           Fernandez)
      
         - lockdep testing enhancements (Alfredo Alvarez Fernandez)
      
         - robustify lockdep init by using hlists (Andrew Morton, Andrey
           Ryabinin)
      
         - mutex and csd_lock micro-optimizations (Davidlohr Bueso)
      
         - small x86 barriers tweaks (Michael S Tsirkin)
      
         - qspinlock updates (Waiman Long)"
      
      * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
        locking/csd_lock: Use smp_cond_acquire() in csd_lock_wait()
        locking/csd_lock: Explicitly inline csd_lock*() helpers
        futex: Replace barrier() in unqueue_me() with READ_ONCE()
        locking/lockdep: Detect chain_key collisions
        locking/lockdep: Prevent chain_key collisions
        tools/lib/lockdep: Fix link creation warning
        tools/lib/lockdep: Add tests for AA and ABBA locking
        tools/lib/lockdep: Add userspace version of READ_ONCE()
        tools/lib/lockdep: Fix the build on recent kernels
        locking/qspinlock: Move __ARCH_SPIN_LOCK_UNLOCKED to qspinlock_types.h
        locking/mutex: Allow next waiter lockless wakeup
        locking/pvqspinlock: Enable slowpath locking count tracking
        locking/qspinlock: Use smp_cond_acquire() in pending code
        locking/pvqspinlock: Move lock stealing count tracking code into pv_queued_spin_steal_lock()
        locking/mcs: Fix mcs_spin_lock() ordering
        futex: Remove requirement for lock_page() in get_futex_key()
        futex: Rename barrier references in ordering guarantees
        locking/atomics: Update comment about READ_ONCE() and structures
        locking/lockdep: Eliminate lockdep_init()
        locking/lockdep: Convert hash tables to hlists
        ...
      fbed0bc0
    • L
      Merge branch 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip · d37a14bb
      Linus Torvalds committed
      Pull ram resource handling changes from Ingo Molnar:
       "Core kernel resource handling changes to support NVDIMM error
        injection.
      
        This tree introduces a new I/O resource type, IORESOURCE_SYSTEM_RAM,
        for System RAM while keeping the current IORESOURCE_MEM type bit set
        for all memory-mapped ranges (including System RAM) for backward
        compatibility.
      
        With this resource flag it no longer takes a strcmp() loop through the
        resource tree to find "System RAM" resources.
      
        The new resource type is then used to extend ACPI/APEI error injection
        facility to also support NVDIMM"
      
      * 'core-resources-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
        ACPI/EINJ: Allow memory error injection to NVDIMM
        resource: Kill walk_iomem_res()
        x86/kexec: Remove walk_iomem_res() call with GART type
        x86, kexec, nvdimm: Use walk_iomem_res_desc() for iomem search
        resource: Add walk_iomem_res_desc()
        memremap: Change region_intersects() to take @flags and @desc
        arm/samsung: Change s3c_pm_run_res() to use System RAM type
        resource: Change walk_system_ram() to use System RAM type
        drivers: Initialize resource entry to zero
        xen, mm: Set IORESOURCE_SYSTEM_RAM to System RAM
        kexec: Set IORESOURCE_SYSTEM_RAM for System RAM
        arch: Set IORESOURCE_SYSTEM_RAM flag for System RAM
        ia64: Set System RAM type and descriptor
        x86/e820: Set System RAM type and descriptor
        resource: Add I/O resource descriptor
        resource: Handle resource flags properly
        resource: Add System RAM resource type
      d37a14bb
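      The difference between finding System RAM by name and by flag can be
      sketched as follows.  This is a simplified illustration: struct
      resource is cut down to two fields, the flag values are believed to
      match include/linux/ioport.h but should be treated as assumptions, and
      the is_*() helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Assumed flag values; the modifier bit is set on top of IORESOURCE_MEM
 * so that existing IORESOURCE_MEM checks keep matching System RAM. */
#define IORESOURCE_MEM        0x00000200UL
#define IORESOURCE_SYSRAM     0x01000000UL
#define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM | IORESOURCE_SYSRAM)

/* Trimmed-down resource entry. */
struct resource {
    const char *name;
    unsigned long flags;
};

/* Old style: identify System RAM by comparing the name string. */
static bool is_system_ram_by_name(const struct resource *res)
{
    return strcmp(res->name, "System RAM") == 0;
}

/* New style: a single mask test on the flags. */
static bool is_system_ram_by_flag(const struct resource *res)
{
    return (res->flags & IORESOURCE_SYSTEM_RAM) == IORESOURCE_SYSTEM_RAM;
}

/* Backward compatibility: code that only asks "is this memory?" */
static bool is_mem(const struct resource *res)
{
    return (res->flags & IORESOURCE_MEM) != 0;
}
```

      The flag test avoids a string comparison per resource while walking
      the tree, and older callers that only check IORESOURCE_MEM behave as
      before.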
  3. 14 Mar, 2016 2 commits
  4. 13 Mar, 2016 1 commit
    • J
      MIPS: smp.c: Fix uninitialised temp_foreign_map · d825c06b
      James Hogan committed
      When calculate_cpu_foreign_map() recalculates the cpu_foreign_map
      cpumask it uses the local variable temp_foreign_map without initialising
      it to zero. Since the calculation only ever sets bits in this cpumask
      any existing bits at that memory location will remain set and find their
      way into cpu_foreign_map too. This could potentially lead to cache
      operations suboptimally doing smp calls to multiple VPEs in the same
      core, even though the VPEs share primary caches.
      
      Therefore initialise temp_foreign_map using cpumask_clear() before use.
      
      Fixes: cccf34e9 ("MIPS: c-r4k: Fix cache flushing for MT cores")
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Cc: Paul Burton <paul.burton@imgtec.com>
      Cc: linux-mips@linux-mips.org
      Patchwork: https://patchwork.linux-mips.org/patch/12759/
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      d825c06b
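      The bug class described here (a set-only computation on an
      uninitialised local mask) and its fix can be sketched in userspace.
      The cpumask helpers below are minimal stand-ins for the kernel ones,
      and calculate_foreign_map() with its cpu_core[] argument is a
      hypothetical simplification of the real calculation: it keeps one CPU
      per core.

```c
#include <assert.h>
#include <string.h>

#define NR_CPUS 8

typedef struct { unsigned char bits[(NR_CPUS + 7) / 8]; } cpumask_t;

static void cpumask_clear(cpumask_t *m)
{
    memset(m, 0, sizeof(*m));
}

static void cpumask_set_cpu(int cpu, cpumask_t *m)
{
    m->bits[cpu / 8] |= (unsigned char)(1u << (cpu % 8));
}

static int cpumask_test_cpu(int cpu, const cpumask_t *m)
{
    return (m->bits[cpu / 8] >> (cpu % 8)) & 1;
}

/* Picks the first CPU of every core; cpu_core[i] is the core of CPU i. */
static void calculate_foreign_map(const int *cpu_core, int ncpus,
                                  cpumask_t *out)
{
    cpumask_t temp;
    int i, j;

    /* The fix: the loop below only ever sets bits, so stale stack
     * contents would leak into the result without this clear. */
    cpumask_clear(&temp);

    for (i = 0; i < ncpus; i++) {
        int core_present = 0;

        for (j = 0; j < i; j++)
            if (cpu_core[j] == cpu_core[i])
                core_present = 1;
        if (!core_present)
            cpumask_set_cpu(i, &temp);
    }
    *out = temp;
}
```

      With the clear in place the result depends only on the topology passed
      in, not on whatever happened to be on the stack.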