1. 07 Aug 2014 (6 commits)
  2. 24 Jun 2014 (1 commit)
    • slab: fix oops when reading /proc/slab_allocators · 03787301
      Authored by Joonsoo Kim
      Commit b1cb0982 ("change the management method of free objects of
      the slab") introduced a bug in the slab leak detector
      ('/proc/slab_allocators').  The detector works as follows:
      
       1. traverse all objects on all the slabs.
       2. determine whether each object is active or not.
       3. if active, print who allocated this object.
      
      But that commit changed how free objects are managed, so the logic
      that determines whether an object is active also changed.  Before,
      an object in the cpu caches was regarded as inactive; with that
      commit, an object in the cpu caches is mistakenly regarded as active.
      
      This introduces a kernel oops if DEBUG_PAGEALLOC is enabled.  With
      DEBUG_PAGEALLOC, kernel_map_pages() is used to detect who corrupts
      free memory in the slab: it unmaps the page table mapping when an
      object is freed and maps it again when the object becomes active.
      When the slab leak detector checks an object in the cpu caches, it
      mistakenly thinks the object is active and tries to access the
      object's memory to retrieve the caller of the allocation.  At that
      point no page table mapping for the object exists, so an oops occurs.
      
      Following is the oops message reported by Dave.
      
      It blew up when something tried to read /proc/slab_allocators
      (Just cat it, and you should see the oops below)
      
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
        Modules linked in:
        [snip...]
        CPU: 1 PID: 9386 Comm: trinity-c33 Not tainted 3.14.0-rc5+ #131
        task: ffff8801aa46e890 ti: ffff880076924000 task.ti: ffff880076924000
        RIP: 0010:[<ffffffffaa1a8f4a>]  [<ffffffffaa1a8f4a>] handle_slab+0x8a/0x180
        RSP: 0018:ffff880076925de0  EFLAGS: 00010002
        RAX: 0000000000001000 RBX: 0000000000000000 RCX: 000000005ce85ce7
        RDX: ffffea00079be100 RSI: 0000000000001000 RDI: ffff880107458000
        RBP: ffff880076925e18 R08: 0000000000000001 R09: 0000000000000000
        R10: 0000000000000000 R11: 000000000000000f R12: ffff8801e6f84000
        R13: ffffea00079be100 R14: ffff880107458000 R15: ffff88022bb8d2c0
        FS:  00007fb769e45740(0000) GS:ffff88024d040000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: ffff8801e6f84ff8 CR3: 00000000a22db000 CR4: 00000000001407e0
        DR0: 0000000002695000 DR1: 0000000002695000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000070602
        Call Trace:
          leaks_show+0xce/0x240
          seq_read+0x28e/0x490
          proc_reg_read+0x3d/0x80
          vfs_read+0x9b/0x160
          SyS_read+0x58/0xb0
          tracesys+0xd4/0xd9
        Code: f5 00 00 00 0f 1f 44 00 00 48 63 c8 44 3b 0c 8a 0f 84 e3 00 00 00 83 c0 01 44 39 c0 72 eb 41 f6 47 1a 01 0f 84 e9 00 00 00 89 f0 <4d> 8b 4c 04 f8 4d 85 c9 0f 84 88 00 00 00 49 8b 7e 08 4d 8d 46
        RIP   handle_slab+0x8a/0x180
      
      To fix the problem, I introduce an object status buffer on each slab.
      With this we can track object status precisely, so the slab leak
      detector no longer mistakes free objects in the cpu caches for active
      ones, and no kernel oops occurs.  The memory overhead of this fix is
      only incurred with CONFIG_DEBUG_SLAB_LEAK, which is mainly used for
      debugging, so the overhead is not a big problem.
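
      The sketch below is only an illustration of that idea, not the kernel
      code: a per-slab status array is consulted before the object's memory
      is touched, so free (possibly unmapped) objects are skipped.  All
      names in it are hypothetical.

        #include <stdio.h>

        enum obj_status { OBJ_STATUS_FREE = 0, OBJ_STATUS_ACTIVE = 1 };

        #define NR_OBJS 4

        struct fake_slab {
                enum obj_status status[NR_OBJS];  /* one entry per object */
                unsigned long caller[NR_OBJS];    /* who allocated it, if active */
        };

        static void handle_slab_sim(const struct fake_slab *s)
        {
                for (int i = 0; i < NR_OBJS; i++) {
                        /* Skip free objects: under DEBUG_PAGEALLOC their memory
                         * may be unmapped, so it must not be dereferenced. */
                        if (s->status[i] != OBJ_STATUS_ACTIVE)
                                continue;
                        printf("obj %d allocated by caller %#lx\n", i, s->caller[i]);
                }
        }

        int main(void)
        {
                struct fake_slab s = {
                        .status = { OBJ_STATUS_ACTIVE, OBJ_STATUS_FREE,
                                    OBJ_STATUS_ACTIVE, OBJ_STATUS_FREE },
                        .caller = { 0xabc123UL, 0, 0xdef456UL, 0 },
                };

                handle_slab_sim(&s);
                return 0;
        }
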
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: Dave Jones <davej@redhat.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 05 Jun 2014 (4 commits)
    • memcg, slab: merge memcg_{bind,release}_pages to memcg_{un}charge_slab · c67a8a68
      Authored by Vladimir Davydov
      Currently we have two pairs of kmemcg-related functions that are called
      on slab alloc/free.  The first is memcg_{bind,release}_pages, which
      counts the total number of pages allocated on a kmem cache.  The second
      is memcg_{un}charge_slab, which {un}charges slab pages against the
      kmemcg resource counter.  Let's just merge them to keep the code clean.
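
      A minimal sketch of the idea, with made-up names rather than the kernel
      API: the page accounting that used to live in a separate bind/release
      pair is folded into the charge/uncharge helpers, so one call per
      alloc/free does both jobs.

        #include <stdio.h>

        /* Toy stand-ins for a kmem cache's per-memcg bookkeeping. */
        struct toy_cache {
                long nr_pages;        /* was updated by the bind/release pair */
                long charged_bytes;   /* was updated by the charge/uncharge pair */
        };

        static int toy_charge_slab(struct toy_cache *c, int order)
        {
                long pages = 1L << order;

                c->charged_bytes += pages * 4096;   /* charge to the "memcg" */
                c->nr_pages += pages;               /* and count the pages   */
                return 0;
        }

        static void toy_uncharge_slab(struct toy_cache *c, int order)
        {
                long pages = 1L << order;

                c->charged_bytes -= pages * 4096;
                c->nr_pages -= pages;
        }

        int main(void)
        {
                struct toy_cache c = { 0, 0 };

                toy_charge_slab(&c, 1);
                printf("pages=%ld bytes=%ld\n", c.nr_pages, c.charged_bytes);
                toy_uncharge_slab(&c, 1);
                return 0;
        }
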
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: get_online_mems for kmem_cache_{create,destroy,shrink} · 03afc0e2
      Authored by Vladimir Davydov
      When we create a sl[au]b cache, we allocate kmem_cache_node structures
      for each online NUMA node.  To handle nodes being taken online/offline,
      we register a memory hotplug notifier and, for each kmem cache,
      allocate/free the kmem_cache_node corresponding to the node that
      changes its state.

      To synchronize the two paths we hold the slab_mutex both during the
      cache creation/destruction path and while tuning the per-node parts of
      kmem caches in the memory hotplug handler, but that is not quite right,
      because it does not guarantee that a newly created cache will have all
      its kmem_cache_nodes initialized if it races with memory hotplug.  For
      instance, in the case of slub:
      
          CPU0                            CPU1
          ----                            ----
          kmem_cache_create:              online_pages:
           __kmem_cache_create:            slab_memory_callback:
                                            slab_mem_going_online_callback:
                                             lock slab_mutex
                                             for each slab_caches list entry
                                                 allocate kmem_cache node
                                             unlock slab_mutex
            lock slab_mutex
            init_kmem_cache_nodes:
             for_each_node_state(node, N_NORMAL_MEMORY)
                 allocate kmem_cache node
            add kmem_cache to slab_caches list
            unlock slab_mutex
                                          online_pages (continued):
                                           node_states_set_node
      
      As a result we will get a kmem cache with some of its kmem_cache_nodes
      left unallocated.

      To avoid issues like that we should hold get/put_online_mems() across
      the whole kmem cache creation/destruction/shrink paths, just as we do
      for cpu hotplug.  This patch does the trick.

      Note that after it is applied there is no need to take the slab_mutex
      for kmem_cache_shrink any more, so it is removed from there.
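
      As an illustration of the locking pattern only (a userspace analogue,
      not the kernel implementation), the hotplug path takes a "mems" lock
      exclusively while the create/destroy/shrink paths hold it shared for
      their whole duration, so a half-initialized cache can never race with
      a node state change:

        #include <pthread.h>
        #include <stdio.h>

        static pthread_rwlock_t mems_lock = PTHREAD_RWLOCK_INITIALIZER;

        /* Analogue of get/put_online_mems(): shared hold across the whole path. */
        static void get_online_mems_sim(void) { pthread_rwlock_rdlock(&mems_lock); }
        static void put_online_mems_sim(void) { pthread_rwlock_unlock(&mems_lock); }

        static void cache_create_sim(const char *name)
        {
                get_online_mems_sim();
                /* ... allocate per-node structures, add the cache to the list ... */
                printf("created %s against a stable set of online nodes\n", name);
                put_online_mems_sim();
        }

        static void memory_hotplug_sim(void)
        {
                pthread_rwlock_wrlock(&mems_lock);  /* excludes create/destroy/shrink */
                /* ... bring a node online and update every existing cache ... */
                pthread_rwlock_unlock(&mems_lock);
        }

        int main(void)
        {
                cache_create_sim("toy_cache");
                memory_hotplug_sim();
                return 0;
        }
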
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sl[au]b: charge slabs to kmemcg explicitly · 5dfb4175
      Authored by Vladimir Davydov
      We have only a few places where we actually want to charge kmem, so
      instead of intruding into the general page allocation path with
      __GFP_KMEMCG it is better to charge kmem there explicitly.  All kmem
      charges will be easier to follow that way.

      This is a step towards removing __GFP_KMEMCG.  It removes __GFP_KMEMCG
      from memcg caches' allocflags.  Instead it makes the slab allocation
      path call memcg_charge_kmem directly, getting the memcg to charge from
      the cache's memcg params.

      This also eliminates any possibility of misaccounting an allocation
      going from one memcg's cache to another memcg, because now we always
      charge slabs against the memcg the cache belongs to.  That is why this
      patch removes the big comment attached to memcg_kmem_get_cache.
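
      A toy model of the change, not the kernel API: the slab path charges
      the cache's owning memcg explicitly before allocating pages, instead
      of tagging the page allocation with a special GFP flag and charging
      implicitly inside the page allocator.  All names below are made up.

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct toy_memcg { long charged, limit; };
        struct toy_cache { struct toy_memcg *memcg; };

        static bool toy_charge_kmem(struct toy_memcg *m, long bytes)
        {
                if (m->charged + bytes > m->limit)
                        return false;               /* over the kmem limit */
                m->charged += bytes;
                return true;
        }

        static void *toy_alloc_slab_page(struct toy_cache *c, long bytes)
        {
                /* Always charge the memcg the cache belongs to, never another one. */
                if (c->memcg && !toy_charge_kmem(c->memcg, bytes))
                        return NULL;
                return malloc(bytes);
        }

        int main(void)
        {
                struct toy_memcg cg = { .charged = 0, .limit = 1 << 20 };
                struct toy_cache cache = { .memcg = &cg };
                void *page = toy_alloc_slab_page(&cache, 4096);

                printf("page=%p charged=%ld\n", page, cg.charged);
                free(page);
                return 0;
        }
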
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Greg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slab: suppress out of memory warning unless debug is enabled · 9a02d699
      Authored by David Rientjes
      When the slab or slub allocators cannot allocate additional slab pages,
      they emit diagnostic information to the kernel log such as current
      number of slabs, number of objects, active objects, etc.  This is always
      coupled with a page allocation failure warning since it is controlled by
      !__GFP_NOWARN.
      
      Suppress this out of memory warning if the allocator is configured
      without debug support.  The page allocation failure warning will still
      indicate that it was a failed slab allocation, the order, and the gfp
      mask, so the extra output is only useful for diagnosing allocator
      issues.
      
      Since CONFIG_SLUB_DEBUG is already enabled by default for the slub
      allocator, there is no functional change with this patch.  If debug is
      disabled, however, the warnings are now suppressed.
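
      A rough sketch of the gating, with illustrative names only (the real
      code keys off the slab debug Kconfig option and __GFP_NOWARN):

        #include <stdio.h>

        #define TOY_GFP_NOWARN 0x1u
        #define TOY_SLAB_DEBUG 0        /* pretend slab debug Kconfig; flip to 1 */

        static void toy_slab_out_of_memory(unsigned int gfpflags, int order)
        {
                /* Stay silent unless debugging is compiled in and warnings allowed. */
                if (!TOY_SLAB_DEBUG || (gfpflags & TOY_GFP_NOWARN))
                        return;
                fprintf(stderr, "slab: cannot allocate order-%d page (gfp=%#x)\n",
                        order, gfpflags);
                /* ... dump slab counts, object counts, active objects, etc. ... */
        }

        int main(void)
        {
                toy_slab_out_of_memory(0, 3);  /* prints only if TOY_SLAB_DEBUG is 1 */
                return 0;
        }
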
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 06 May 2014 (2 commits)
    • slab: Fix off by one in object max number tests. · 30321c7b
      Authored by David Miller
      If freelist_idx_t is a byte, SLAB_OBJ_MAX_NUM should be 255 not 256, and
      likewise if freelist_idx_t is a short, then it should be 65535 not
      65536.
      
      This was leading to all kinds of random crashes on sparc64 where
      PAGE_SIZE is 8192.  One problem shown was that if spinlock debugging was
      enabled, we'd get deadlocks in copy_pte_range() or do_wp_page() with the
      same cpu already holding a lock it shouldn't hold, or the lock belonging
      to a completely unrelated process.
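
      The correct bound is the largest value the index type can hold,
      2^(8 * sizeof(type)) - 1, not 2^(8 * sizeof(type)).  A minimal check
      of that arithmetic (illustrative macro names, not the kernel ones):

        #include <assert.h>
        #include <limits.h>
        #include <stdint.h>
        #include <stdio.h>

        typedef uint8_t toy_freelist_idx_t;  /* 1-byte index, as on 4K-page systems */

        /* Largest object count a freelist index of this type can address. */
        #define TOY_SLAB_OBJ_MAX_NUM \
                ((1U << (sizeof(toy_freelist_idx_t) * CHAR_BIT)) - 1U)

        int main(void)
        {
                assert(TOY_SLAB_OBJ_MAX_NUM == 255);  /* 255, not 256 */
                printf("max objects per slab: %u\n", TOY_SLAB_OBJ_MAX_NUM);
                return 0;
        }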
      
      Fixes: a41adfaa ("slab: introduce byte sized index for the freelist of a slab")
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: fix the type of the index on freelist index accessor · 7cc68973
      Authored by Joonsoo Kim
      Commit a41adfaa ("slab: introduce byte sized index for the freelist
      of a slab") changed the size of the freelist index and also changed
      the prototype of the freelist index accessor functions, and there was
      a mistake in doing so.

      The mistake is that although it changed the size of the freelist index
      correctly, it changed the type used to index into the freelist
      incorrectly.  With that patch the freelist index can be 1 byte or
      2 bytes, which means the number of objects on a slab can be more than
      255.  So we need more than 1 byte for the position used to look up a
      free object's index in the freelist.  But the patch made this position
      type 1 byte as well, so slabs with more than 255 objects cannot work
      properly and, as a consequence, the system cannot boot.
      
      This issue was reported by Steven King on m68knommu, which would use a
      2-byte freelist index:

        https://lkml.org/lkml/2014/4/16/433

      The fix is easy: changing the type of the position argument in the
      accessor functions is enough.  Although 2 bytes would be enough, I use
      4 bytes since it has no bad effect and makes things easier.  This fix
      was suggested and tested by Steven in his original report.
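
      A minimal sketch of the accessor shape (illustrative names, not the
      kernel functions): the freelist entry type may be 1 or 2 bytes, but
      the position used to walk the freelist must be wide enough to go past
      255.

        #include <stdint.h>
        #include <stdio.h>

        typedef uint16_t toy_freelist_idx_t;   /* entry type: 1 or 2 bytes */

        struct toy_page { toy_freelist_idx_t freelist[512]; };

        /* 'idx' is a plain unsigned int: with 2-byte entries a slab can hold
         * more than 255 objects, so a 1-byte position would wrap around. */
        static toy_freelist_idx_t toy_get_free_obj(const struct toy_page *p,
                                                   unsigned int idx)
        {
                return p->freelist[idx];
        }

        static void toy_set_free_obj(struct toy_page *p, unsigned int idx,
                                     toy_freelist_idx_t val)
        {
                p->freelist[idx] = val;
        }

        int main(void)
        {
                struct toy_page page = { { 0 } };

                toy_set_free_obj(&page, 300, 300);  /* would break with a u8 position */
                printf("%u\n", (unsigned int)toy_get_free_obj(&page, 300));
                return 0;
        }
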
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-and-acked-by: Steven King <sfking@fdwdc.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 11 Apr 2014 (1 commit)
  6. 08 Apr 2014 (2 commits)
    • mm, mempolicy: remove per-process flag · f0432d15
      Authored by David Rientjes
      PF_MEMPOLICY is an unnecessary optimization for CONFIG_SLAB users.
      There's no significant performance degradation to checking
      current->mempolicy rather than current->flags & PF_MEMPOLICY in the
      allocation path, especially since this is considered unlikely().
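
      The shape of the change, in a toy form (a hedged sketch with made-up
      names, not the actual diff): test the mempolicy pointer directly
      instead of keeping a dedicated per-process flag bit in sync with it.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdio.h>

        struct toy_task {
                unsigned long flags;   /* per-process flag bits (a scarce resource) */
                void *mempolicy;       /* NULL in the common case */
        };

        /* Before: a dedicated PF_MEMPOLICY bit mirrored the pointer.
         * After: just test the pointer; the common case stays cheap. */
        static bool toy_wants_mempolicy_node(const struct toy_task *current_task)
        {
                return current_task->mempolicy != NULL;  /* treated as unlikely() */
        }

        int main(void)
        {
                struct toy_task t = { .flags = 0, .mempolicy = NULL };

                printf("use mempolicy: %d\n", toy_wants_mempolicy_node(&t));
                return 0;
        }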
      
      Running TCP_RR with netperf-2.4.5 through localhost on a 16 cpu
      machine with 64GB of memory and without a mempolicy:
      
      	threads		before		after
      	16		1249409		1244487
      	32		1281786		1246783
      	48		1239175		1239138
      	64		1244642		1241841
      	80		1244346		1248918
      	96		1266436		1254316
      	112		1307398		1312135
      	128		1327607		1326502
      
      Per-process flags are a scarce resource, so we should free them up
      whenever possible and make them available.  We'll be using this one
      shortly for memcg oom reserves.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, mempolicy: rename slab_node for clarity · 2a389610
      Authored by David Rientjes
      slab_node() is actually a mempolicy function, so rename it to
      mempolicy_slab_node() to make it clearer that it is used for processes
      with mempolicies.
      
      At the same time, clean up its code by saving numa_mem_id() in a local
      variable (since we require a node with memory, not just any node) and
      remove an obsolete comment that assumes the mempolicy is actually
      passed into the function.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jianguo Wu <wujianguo@huawei.com>
      Cc: Tim Hockin <thockin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 04 Apr 2014 (1 commit)
  8. 01 Apr 2014 (1 commit)
  9. 19 Feb 2014 (1 commit)
  10. 08 Feb 2014 (6 commits)
    • slab: Make allocations with GFP_ZERO slightly more efficient · 5087c822
      Authored by Joe Perches
      Use the likely() mechanism already used around the valid-pointer tests
      to better choose when to memset allocations to 0 for __GFP_ZERO.
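
      Roughly the shape of the check (a hedged sketch, not the exact kernel
      diff): the __GFP_ZERO test stays unlikely while the pointer test is
      marked likely.

        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>

        #define TOY_GFP_ZERO 0x1u

        /* Branch hints in the style of the kernel's likely()/unlikely(). */
        #define likely(x)   __builtin_expect(!!(x), 1)
        #define unlikely(x) __builtin_expect(!!(x), 0)

        static void *toy_slab_alloc(size_t size, unsigned int flags)
        {
                void *objp = malloc(size);

                /* Zeroing is the rare case; a valid pointer is the common case. */
                if (unlikely(flags & TOY_GFP_ZERO) && likely(objp))
                        memset(objp, 0, size);
                return objp;
        }

        int main(void)
        {
                void *p = toy_slab_alloc(64, TOY_GFP_ZERO);

                printf("%p\n", p);
                free(p);
                return 0;
        }
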
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slab: make more slab management structure off the slab · 8fc9cf42
      Authored by Joonsoo Kim
      Now that the size of the freelist used for slab management has shrunk,
      the on-slab management structure can waste a lot of space if the
      slab's objects are large.

      Consider a slab with 128-byte objects.  If on-slab management is used,
      31 objects fit in the slab.  The freelist in this case takes 31 bytes,
      so 97 bytes, that is, more than 75% of the object size, are wasted.

      For a slab with 64-byte objects, no space is wasted with on-slab
      management.  So set the constraint that determines off-slab management
      to 128 bytes.
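
      The arithmetic behind those numbers, assuming a 4096-byte page, a
      1-byte freelist index, and no alignment padding:

        #include <stdio.h>

        /* On-slab management: each object costs obj_size plus one freelist
         * byte; whatever does not fit another object is wasted. */
        static void report(unsigned int page_size, unsigned int obj_size)
        {
                unsigned int nr   = page_size / (obj_size + 1);
                unsigned int used = nr * obj_size + nr;   /* objects + freelist */

                printf("obj %3u: %u objects, %u bytes left over\n",
                       obj_size, nr, page_size - used);
        }

        int main(void)
        {
                report(4096, 128);   /* 31 objects, 97 bytes left over */
                report(4096, 64);    /* 63 objects, 1 byte left over   */
                return 0;
        }
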
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slab: introduce byte sized index for the freelist of a slab · a41adfaa
      Authored by Joonsoo Kim
      Currently, the freelist of a slab consists of unsigned int sized
      indexes.  Since most slabs have fewer than 256 objects, such large
      indexes are needless.  For example, consider the minimum kmalloc slab:
      its object size is 32 bytes and it consists of one page, so the 256
      values of a byte-sized index are enough to cover all possible indexes.

      There can be slabs whose object size is 8 bytes.  We cannot handle
      that case with a byte-sized index, so we need to restrict the minimum
      object size.  Since these slabs are not common, the memory wasted in
      them would be negligible.

      Some architectures have a page size larger than 4096 bytes (one
      example is the 64KB page size on PPC or IA64), so a byte-sized index
      does not fit them.  In that case we will use a two-byte index.
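
      A back-of-the-envelope sketch of why one byte is enough on 4KB pages
      and two bytes are needed on larger ones, assuming the 16-byte minimum
      object size introduced for this series:

        #include <stdio.h>

        /* A slab holds at most page_size / min_obj_size objects, and the
         * freelist index type only needs to be able to name that many. */
        static unsigned int index_bytes(unsigned long page_size,
                                        unsigned long min_obj_size)
        {
                unsigned long max_objs = page_size / min_obj_size;

                return max_objs <= 256 ? 1 : 2;
        }

        int main(void)
        {
                printf("4KB pages : %u-byte index\n", index_bytes(4096, 16));
                printf("64KB pages: %u-byte index\n", index_bytes(65536, 16));
                return 0;
        }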
      
      Below are some numbers for this patch.
      
      * Before *
      kmalloc-512          525    640    512    8    1 : tunables   54   27    0 : slabdata     80     80      0
      kmalloc-256          210    210    256   15    1 : tunables  120   60    0 : slabdata     14     14      0
      kmalloc-192         1016   1040    192   20    1 : tunables  120   60    0 : slabdata     52     52      0
      kmalloc-96           560    620    128   31    1 : tunables  120   60    0 : slabdata     20     20      0
      kmalloc-64          2148   2280     64   60    1 : tunables  120   60    0 : slabdata     38     38      0
      kmalloc-128          647    682    128   31    1 : tunables  120   60    0 : slabdata     22     22      0
      kmalloc-32         11360  11413     32  113    1 : tunables  120   60    0 : slabdata    101    101      0
      kmem_cache           197    200    192   20    1 : tunables  120   60    0 : slabdata     10     10      0
      
      * After *
      kmalloc-512          521    648    512    8    1 : tunables   54   27    0 : slabdata     81     81      0
      kmalloc-256          208    208    256   16    1 : tunables  120   60    0 : slabdata     13     13      0
      kmalloc-192         1029   1029    192   21    1 : tunables  120   60    0 : slabdata     49     49      0
      kmalloc-96           529    589    128   31    1 : tunables  120   60    0 : slabdata     19     19      0
      kmalloc-64          2142   2142     64   63    1 : tunables  120   60    0 : slabdata     34     34      0
      kmalloc-128          660    682    128   31    1 : tunables  120   60    0 : slabdata     22     22      0
      kmalloc-32         11716  11780     32  124    1 : tunables  120   60    0 : slabdata     95     95      0
      kmem_cache           197    210    192   21    1 : tunables  120   60    0 : slabdata     10     10      0
      
      kmem_caches consisting of objects less than or equal to 256 bytes have
      one or more extra objects per slab compared to before.  In the case of
      kmalloc-32, we get 11 more objects per slab, so 352 bytes (11 * 32)
      are saved, which is roughly a 9% memory saving.  Of course, this
      percentage decreases as the number of objects in a slab decreases.
      
      Here are the performance results on my 4 cpu machine.
      
      * Before *
      
       Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):
      
             229,945,138 cache-misses                                                  ( +-  0.23% )
      
            11.627897174 seconds time elapsed                                          ( +-  0.14% )
      
      * After *
      
       Performance counter stats for 'perf bench sched messaging -g 50 -l 1000' (10 runs):
      
             218,640,472 cache-misses                                                  ( +-  0.42% )
      
            11.504999837 seconds time elapsed                                          ( +-  0.21% )
      
      Cache misses are reduced by this patchset by roughly 5%, and elapsed
      time improves by about 1%.
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slab: restrict the number of objects in a slab · f315e3fa
      Authored by Joonsoo Kim
      To prepare for implementing a byte-sized index for managing the
      freelist of a slab, we should restrict the number of objects in a slab
      to at most 256, since a byte can only represent 256 different values.
      Requiring the object size to be at least the newly introduced
      SLAB_OBJ_MIN_SIZE ensures that the number of objects in a one-page
      slab is at most 256.

      If the page size is larger than 4096, the above assumption no longer
      holds.  In that case we fall back to a 2-byte index.

      If the minimum kmalloc size is less than 16, we use it as the minimum
      object size and give up this optimization.
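
      A sketch of the constraint with illustrative constants only: the
      minimum object size is chosen so that a one-page slab never holds more
      objects than a byte-sized index can name.

        #include <assert.h>
        #include <stdio.h>

        #define TOY_PAGE_SIZE        4096UL
        #define TOY_SLAB_OBJ_MAX_NUM 256UL   /* a byte index can name 256 objects */

        /* Smallest object size that keeps a one-page slab within the limit. */
        #define TOY_SLAB_OBJ_MIN_SIZE (TOY_PAGE_SIZE / TOY_SLAB_OBJ_MAX_NUM)

        int main(void)
        {
                assert(TOY_PAGE_SIZE / TOY_SLAB_OBJ_MIN_SIZE <= TOY_SLAB_OBJ_MAX_NUM);
                printf("minimum object size: %lu bytes\n", TOY_SLAB_OBJ_MIN_SIZE);
                return 0;
        }
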
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slab: introduce helper functions to get/set free object · e5c58dfd
      Authored by Joonsoo Kim
      In the following patches, the way free objects are got from and set in
      the freelist changes so that a simple cast no longer works.  Therefore,
      introduce helper functions.
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
    • slab: factor out calculate nr objects in cache_estimate · 9cef2e2b
      Authored by Joonsoo Kim
      This logic is not simple to understand, so factor it out into a
      separate function to help readability.  Additionally, we can use this
      change in the following patch, which makes the freelist use a
      differently sized index according to the number of objects.
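
      A rough sketch of the factored-out calculation, ignoring alignment and
      the off-slab case (illustrative names, not the kernel function):

        #include <stdio.h>

        /* How many objects fit in slab_size when each object also needs
         * idx_size bytes of freelist entry for its index? */
        static unsigned int toy_calculate_nr_objs(unsigned long slab_size,
                                                  unsigned long obj_size,
                                                  unsigned long idx_size)
        {
                return slab_size / (obj_size + idx_size);
        }

        int main(void)
        {
                /* Matches the kmalloc-32 numbers above: 113 objects with a
                 * 4-byte index, 124 objects with a 1-byte index. */
                printf("%u\n", toy_calculate_nr_objs(4096, 32, 4));
                printf("%u\n", toy_calculate_nr_objs(4096, 32, 1));
                return 0;
        }
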
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Pekka Enberg <penberg@kernel.org>
  11. 31 Jan 2014 (1 commit)
  12. 13 Nov 2013 (1 commit)
  13. 30 Oct 2013 (2 commits)
  14. 25 Oct 2013 (11 commits)