1. May 20, 2016 (18 commits)
    • include/linux/nodemask.h: create next_node_in() helper · 0edaf86c
      Committed by Andrew Morton
      Lots of code does
      
      	node = next_node(node, XXX);
      	if (node == MAX_NUMNODES)
      		node = first_node(XXX);
      
      so create next_node_in() to do this and use it in various places.
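
      A minimal sketch of what the helper boils down to (the in-tree
      version wraps this in a macro; treat the exact signature as
      illustrative):

      	/* Find the next set node in *srcp after @node, wrapping
      	 * around to the first node instead of stopping at
      	 * MAX_NUMNODES. */
      	static int next_node_in(int node, const nodemask_t *srcp)
      	{
      		int ret = next_node(node, *srcp);

      		if (ret == MAX_NUMNODES)
      			ret = first_node(*srcp);
      		return ret;
      	}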
      
      [mhocko@suse.com: use next_node_in() helper]
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Hui Zhu <zhuhui@xiaomi.com>
      Cc: Wang Xiaoqiang <wangxq10@lzu.edu.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Committed by Joonsoo Kim
      Many developers already know that the reference-count field of
      struct page is _count and that it is an atomic type.  They may be
      tempted to manipulate it directly, which would defeat the purpose of
      the page reference count tracepoints.  To prevent direct modification
      of _count, this patch renames it to _refcount and adds a warning
      comment to the code.  Developers who need to manipulate the reference
      count will then see that the field must not be accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_ref: use page_ref helper instead of direct modification of _count · 6d061f9f
      Committed by Joonsoo Kim
      The page_ref manipulation functions were introduced to track changes
      to the page's reference count.  Use them instead of modifying _count
      directly.
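
      For illustration, a hedged sketch of the conversion this implies;
      page_ref_inc() and page_ref_dec_and_test() are the helpers from
      include/linux/page_ref.h, while the surrounding free_the_page() call
      is a hypothetical stand-in:

      	/* Before: direct _count manipulation, invisible to the
      	 * page_ref tracepoints. */
      	atomic_inc(&page->_count);
      	if (atomic_dec_and_test(&page->_count))
      		free_the_page(page);

      	/* After: the page_ref wrappers keep the tracepoints working. */
      	page_ref_inc(page);
      	if (page_ref_dec_and_test(page))
      		free_the_page(page);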
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: fix sysfs filename in comment · 43efd3ea
      Committed by Li Peng
      The comment says /sys/kernel/slab/xx/defrag_ratio; the actual sysfs
      file is remote_node_defrag_ratio.
      
      Link: http://lkml.kernel.org/r/1463449242-5366-1-git-send-email-lip@dtdream.com
      Signed-off-by: Li Peng <lip@dtdream.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slab: remove ZONE_DMA_FLAG · a3187e43
      Committed by Yang Shi
      Now that we have the IS_ENABLED() helper to check whether a Kconfig
      option is enabled, ZONE_DMA_FLAG is no longer useful.

      Moreover, the use of ZONE_DMA_FLAG in slab looks pointless according
      to comment [1] from Johannes Weiner, so remove it; ORing the
      passed-in flags with the cache gfp flags is already done in
      kmem_getpages().
      
      [1] https://lkml.org/lkml/2014/9/25/553
      
      Link: http://lkml.kernel.org/r/1462381297-11009-1-git-send-email-yang.shi@linaro.org
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: SLAB freelist randomization · c7ce4f60
      Committed by Thomas Garnier
      Provides an optional config (CONFIG_SLAB_FREELIST_RANDOM) to
      randomize the SLAB freelist.  The list is randomized during
      initialization of a new set of pages.  The order for the different
      freelist sizes is pre-computed at boot for performance.  Each
      kmem_cache has its own randomized freelist.  Before the pre-computed
      lists are available, freelists are generated dynamically.  This
      security feature reduces the predictability of the kernel SLAB
      allocator against heap overflows, rendering attacks much less
      reliable.
      
      For example this attack against SLUB (also applicable against SLAB)
      would be affected:
      
        https://jon.oberheide.org/blog/2010/09/10/linux-kernel-can-slub-overflow/
      
      Also, since v4.6 the freelist has been moved to the end of the SLAB,
      which opens a controllable heap to new attacks not yet publicly
      discussed: a kernel heap overflow can be transformed into multiple
      use-after-frees.  This feature makes that type of attack harder too.

      To generate entropy, we use get_random_bytes_arch() because no
      entropy is available from the pool at that early boot stage.  In the
      worst case this function falls back to the get_random_bytes sub-API.
      We also generate a random shift used to offset the pre-computed
      freelist for each new set of pages.
      
      The config option name is not specific to SLAB, as this approach
      will be extended to other allocators such as SLUB.
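
      A hedged sketch of the pre-computation: a Fisher-Yates shuffle of
      the object indices seeded from the architectural RNG (the function
      name freelist_shuffle() here is illustrative, not the in-tree one):

      	/* Fill list[0..count) with 0..count-1, then shuffle it using
      	 * entropy from get_random_bytes_arch(), which may itself fall
      	 * back to get_random_bytes(). */
      	static void freelist_shuffle(unsigned int *list, unsigned int count)
      	{
      		unsigned int i, pos, tmp;

      		for (i = 0; i < count; i++)
      			list[i] = i;

      		for (i = count - 1; i > 0; i--) {
      			get_random_bytes_arch(&pos, sizeof(pos));
      			pos %= i + 1;	/* pick a slot in [0, i] */
      			tmp = list[i];
      			list[i] = list[pos];
      			list[pos] = tmp;
      		}
      	}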
      
      Performance results highlighted no major changes:
      
      Hackbench (running 90 10 times):
      
        Before average: 0.0698
        After average: 0.0663 (-5.01%)
      
      slab_test, 1 run on boot.  A difference is only seen on the
      2048-byte size test, the worst-case scenario covered by freelist
      randomization.  New slab pages are constantly being created over the
      10000 allocations.  Variance should be mainly due to getting new
      pages every few allocations.
      
      Before:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 99 cycles kfree -> 112 cycles
        10000 times kmalloc(16) -> 109 cycles kfree -> 140 cycles
        10000 times kmalloc(32) -> 129 cycles kfree -> 137 cycles
        10000 times kmalloc(64) -> 141 cycles kfree -> 141 cycles
        10000 times kmalloc(128) -> 152 cycles kfree -> 148 cycles
        10000 times kmalloc(256) -> 195 cycles kfree -> 167 cycles
        10000 times kmalloc(512) -> 257 cycles kfree -> 199 cycles
        10000 times kmalloc(1024) -> 393 cycles kfree -> 251 cycles
        10000 times kmalloc(2048) -> 649 cycles kfree -> 228 cycles
        10000 times kmalloc(4096) -> 806 cycles kfree -> 370 cycles
        10000 times kmalloc(8192) -> 814 cycles kfree -> 411 cycles
        10000 times kmalloc(16384) -> 892 cycles kfree -> 455 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 121 cycles
        10000 times kmalloc(64)/kfree -> 121 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 121 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      After:
      
        Single thread testing
        =====================
        1. Kmalloc: Repeatedly allocate then free test
        10000 times kmalloc(8) -> 130 cycles kfree -> 86 cycles
        10000 times kmalloc(16) -> 118 cycles kfree -> 86 cycles
        10000 times kmalloc(32) -> 121 cycles kfree -> 85 cycles
        10000 times kmalloc(64) -> 176 cycles kfree -> 102 cycles
        10000 times kmalloc(128) -> 178 cycles kfree -> 100 cycles
        10000 times kmalloc(256) -> 205 cycles kfree -> 109 cycles
        10000 times kmalloc(512) -> 262 cycles kfree -> 136 cycles
        10000 times kmalloc(1024) -> 342 cycles kfree -> 157 cycles
        10000 times kmalloc(2048) -> 701 cycles kfree -> 238 cycles
        10000 times kmalloc(4096) -> 803 cycles kfree -> 364 cycles
        10000 times kmalloc(8192) -> 835 cycles kfree -> 404 cycles
        10000 times kmalloc(16384) -> 896 cycles kfree -> 441 cycles
        2. Kmalloc: alloc/free test
        10000 times kmalloc(8)/kfree -> 121 cycles
        10000 times kmalloc(16)/kfree -> 121 cycles
        10000 times kmalloc(32)/kfree -> 123 cycles
        10000 times kmalloc(64)/kfree -> 142 cycles
        10000 times kmalloc(128)/kfree -> 121 cycles
        10000 times kmalloc(256)/kfree -> 119 cycles
        10000 times kmalloc(512)/kfree -> 119 cycles
        10000 times kmalloc(1024)/kfree -> 119 cycles
        10000 times kmalloc(2048)/kfree -> 119 cycles
        10000 times kmalloc(4096)/kfree -> 119 cycles
        10000 times kmalloc(8192)/kfree -> 119 cycles
        10000 times kmalloc(16384)/kfree -> 119 cycles
      
      [akpm@linux-foundation.org: propagate gfp_t into cache_random_seq_create()]
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slub.c: replace kick_all_cpus_sync() with synchronize_sched() in kmem_cache_shrink() · 81ae6d03
      Committed by Vladimir Davydov
      When we call __kmem_cache_shrink on memory cgroup removal, we need
      to synchronize the kmem_cache->cpu_partial update with
      put_cpu_partial, which might be running on other cpus.  Currently,
      we achieve that by using kick_all_cpus_sync, which works as a
      system-wide memory barrier.  Fast though it is, this method has a
      flaw: it issues a lot of IPIs, which might hurt high-performance or
      real-time workloads.
      
      To fix this, let's replace kick_all_cpus_sync with synchronize_sched.
      Although the latter one may take much longer to finish, it shouldn't be
      a problem in this particular case, because memory cgroups are destroyed
      asynchronously from a workqueue so that no user visible effects should
      be introduced.  OTOH, it will save us from excessive IPIs when someone
      removes a cgroup.
      
      Anyway, even if using synchronize_sched turns out to take too long,
      we can always introduce some form of __kmem_cache_shrink batching so
      that this method is called only once per cgroup destruction (not
      once per memcg kmem cache, as it is now).
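
      A hedged sketch of the change in __kmem_cache_shrink() (heavily
      simplified; the SLUB flushing details are elided):

      	static int __kmem_cache_shrink(struct kmem_cache *s, bool deactivate)
      	{
      		if (deactivate) {
      			/* Disable per-cpu partial lists for this cache. */
      			s->cpu_partial = 0;
      			s->min_partial = 0;
      			/*
      			 * Was: kick_all_cpus_sync(), an IPI to every CPU.
      			 * Now: wait for every preempt-disabled section to
      			 * finish, so concurrent put_cpu_partial() callers
      			 * are guaranteed to see cpu_partial == 0.
      			 */
      			synchronize_sched();
      		}
      		/* ... flush and drain the partial lists ... */
      		return 0;
      	}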
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: lockless decision to grow cache · 801faf0d
      Committed by Joonsoo Kim
      To check precisely whether free objects exist, we need to grab a
      lock.  But accuracy isn't that important here: the race window is
      small, and if there are too many free objects the cache reaper will
      reap them.  So this patch makes the check for free-object existence
      lockless.  This reduces lock contention in the heavy-allocation
      case.
      
      Note that, until now, n->shared could be freed during processing by
      a write to slabinfo, but with a trick in this patch we can access it
      safely within the interrupt-disabled period.
      
      Below are the results of concurrent allocation/free in the slab
      allocation benchmark Christoph made a long time ago; I have
      simplified the output.  The numbers are cycle counts for alloc/free
      respectively, so lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      Allocation performance decreases for object sizes up to 128 bytes,
      which may be due to the extra checks in cache_alloc_refill().  But
      taking the improvement in free performance into account, the net
      result looks about the same.  Results for the other size classes
      look very promising: roughly a 50% performance improvement.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: refill cpu cache through a new slab without holding a node lock · 213b4695
      Committed by Joonsoo Kim
      Until now, growing the cache put a free slab on the node's slab
      list, and free objects could then be allocated from it.  This
      necessarily requires holding a node lock, which is heavily
      contended.  If we refill the cpu cache before attaching the slab to
      the node's slab list, we can avoid holding the node lock as much as
      possible, because the newly allocated slab is visible only to the
      current task.  This reduces lock contention.
      
      Below are the results of concurrent allocation/free in the slab
      allocation benchmark Christoph made a long time ago; I have
      simplified the output.  The numbers are cycle counts for alloc/free
      respectively, so lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
        * After
        Kmalloc N*alloc N*free(32): Average=248/966
        Kmalloc N*alloc N*free(64): Average=261/949
        Kmalloc N*alloc N*free(128): Average=314/1016
        Kmalloc N*alloc N*free(256): Average=741/1061
        Kmalloc N*alloc N*free(512): Average=1246/1152
        Kmalloc N*alloc N*free(1024): Average=2437/1259
        Kmalloc N*alloc N*free(2048): Average=4980/1800
        Kmalloc N*alloc N*free(4096): Average=9000/2078
      
      It shows that contention is reduced for all object sizes and that
      performance increases by 30-40%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: separate cache_grow() to two parts · 76b342bd
      Committed by Joonsoo Kim
      This is a preparation step for implementing a lockless allocation
      path for when there are no free objects in the kmem_cache.

      What we'd like to do here is refill the cpu cache without holding
      the node lock.  To accomplish this, the refill should be done after
      the new slab is allocated but before the slab is attached to the
      management list.  So this patch separates cache_grow() into two
      parts, allocation and attachment to the list, in order to add some
      code between them in the following patch.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: make cache_grow() handle the page allocated on arbitrary node · 511e3a05
      Committed by Joonsoo Kim
      Currently, cache_grow() assumes that the allocated page's nodeid is
      the same as the nodeid parameter used for the allocation request.
      If we discard this assumption, we can handle the fallback_alloc()
      case gracefully.  So this patch makes cache_grow() handle a page
      allocated on an arbitrary node, and cleans up the relevant code.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: racy access/modify the slab color · 03d1d43a
      Committed by Joonsoo Kim
      The slab color doesn't need to be updated with strict accuracy, and
      locking around the color change would cause more lock contention, so
      this patch makes access to and modification of the slab color racy.
      This is a preparation step for implementing a lockless allocation
      path for when there are no free objects in the kmem_cache.
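
      A hedged sketch of the racy color update this makes in cache_grow()
      (illustrative, close to the shape of the in-tree code):

      	/*
      	 * Racy read-modify-write of the next slab color: a lost update
      	 * merely repeats a color, which is harmless for correctness.
      	 */
      	n = get_node(cachep, nodeid);
      	offset = n->colour_next;
      	n->colour_next++;
      	if (n->colour_next >= cachep->colour)
      		n->colour_next = 0;
      	offset *= cachep->colour_off;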
      
      Below are the results of concurrent allocation/free in the slab
      allocation benchmark Christoph made a long time ago; I have
      simplified the output.  The numbers are cycle counts for alloc/free
      respectively, so lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=355/750
        Kmalloc N*alloc N*free(64): Average=452/812
        Kmalloc N*alloc N*free(128): Average=559/1070
        Kmalloc N*alloc N*free(256): Average=1176/980
        Kmalloc N*alloc N*free(512): Average=1939/1189
        Kmalloc N*alloc N*free(1024): Average=3521/1278
        Kmalloc N*alloc N*free(2048): Average=7152/1838
        Kmalloc N*alloc N*free(4096): Average=13438/2013
      
      It shows that contention is reduced for object sizes >= 1024 bytes
      and that performance increases by roughly 15%.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: don't keep free slabs if free_objects exceeds free_limit · 6052b788
      Committed by Joonsoo Kim
      Currently, the decision to free a slab is made every time a freed
      object is put back into the slab.  This has the following problem.

      Assume free_limit = 10 and nr_free = 9.

      Frees happen in the following sequence, and nr_free changes
      accordingly:

        free #1 (the slab becomes a free slab):      nr_free: 9 -> 10
        free #2 (the slab does not become free):     nr_free: 10 -> 11

      If we check on each object free whether the current slab can be
      freed, we cannot free any slab in this situation: when nr_free
      exceeds free_limit (at the second free), the current slab is not a
      free slab, even though a free slab exists.

      However, if we do the check last, we can free one free slab.

      This problem can cause the slab subsystem to keep too much memory.
      This patch tries to fix it by checking the number of free objects
      after all the free work is done.  If there is a free slab at that
      point, we can free as many slabs as possible, so we keep the number
      of free slabs minimal.
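
      A hedged sketch of the idea: trim free slabs once, after the whole
      batch of objects has been returned (names follow mm/slab.c
      conventions, but the code is illustrative):

      	/* After returning the whole batch, shrink the free list in one
      	 * pass instead of deciding per freed object. */
      	while (n->free_objects > n->free_limit &&
      	       !list_empty(&n->slabs_free)) {
      		struct page *page;

      		page = list_last_entry(&n->slabs_free, struct page, lru);
      		list_del(&page->lru);
      		n->free_objects -= cachep->num;
      		slab_destroy(cachep, page);
      	}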
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: clean-up kmem_cache_node setup · c3d332b6
      Committed by Joonsoo Kim
      The code for setting up a kmem_cache_node is mostly the same in
      cpuup_prepare() and alloc_kmem_cache_node().  Factor it out and
      clean it up.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Nishanth Menon <nm@ti.com>
      Tested-by: Jon Hunter <jonathanh@nvidia.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: factor out kmem_cache_node initialization code · ded0ecf6
      Committed by Joonsoo Kim
      This code can be reused elsewhere, so factor it out.  The following
      patch will use it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: drain the free slab as much as possible · a5aa63a5
      Committed by Joonsoo Kim
      slabs_tofree() is meant to free all free slabs.  We can do that by
      simply passing INT_MAX.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: remove BAD_ALIEN_MAGIC again · 8888177e
      Committed by Joonsoo Kim
      The initial attempt to remove BAD_ALIEN_MAGIC was reverted by commit
      edcad250 ("Revert "slab: remove BAD_ALIEN_MAGIC"") because it caused
      a problem on m68k, which has many nodes but !CONFIG_NUMA.  In that
      case the alien cache isn't used at all, but to cope with some
      initialization paths a garbage value is used, and that value is
      BAD_ALIEN_MAGIC.  Now this patch sets use_alien_caches to 0 when
      !CONFIG_NUMA, so there is no initialization-path problem and we
      don't need BAD_ALIEN_MAGIC at all.  So remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/slab: fix the theoretical race by holding proper lock · 18726ca8
      Committed by Joonsoo Kim
      While processing concurrent allocations, SLAB can be contended a lot
      because it does a lot of work while holding a lock.  This patchset
      tries to shrink the critical sections in order to reduce lock
      contention.  The major changes are a lockless decision to allocate
      more slabs and a lockless cpu-cache refill from the newly allocated
      slab.
      
      Below are the results of concurrent allocation/free in the slab
      allocation benchmark Christoph made a long time ago; I have
      simplified the output.  The numbers are cycle counts for alloc/free
      respectively, so lower is better.
      
        * Before
        Kmalloc N*alloc N*free(32): Average=365/806
        Kmalloc N*alloc N*free(64): Average=452/690
        Kmalloc N*alloc N*free(128): Average=736/886
        Kmalloc N*alloc N*free(256): Average=1167/985
        Kmalloc N*alloc N*free(512): Average=2088/1125
        Kmalloc N*alloc N*free(1024): Average=4115/1184
        Kmalloc N*alloc N*free(2048): Average=8451/1748
        Kmalloc N*alloc N*free(4096): Average=16024/2048
      
        * After
        Kmalloc N*alloc N*free(32): Average=344/792
        Kmalloc N*alloc N*free(64): Average=347/882
        Kmalloc N*alloc N*free(128): Average=390/959
        Kmalloc N*alloc N*free(256): Average=393/1067
        Kmalloc N*alloc N*free(512): Average=683/1229
        Kmalloc N*alloc N*free(1024): Average=1295/1325
        Kmalloc N*alloc N*free(2048): Average=2513/1664
        Kmalloc N*alloc N*free(4096): Average=4742/2172
      
      It shows that performance improves greatly (roughly more than 50%)
      for object classes whose size is more than 128 bytes.
      
      This patch (of 11):
      
      If we hold neither the slab_mutex nor the node lock, the node's
      shared array cache can be freed and re-populated.  If
      __kmem_cache_shrink() is called at the same time, it will call
      drain_array() with n->shared without holding the node lock, so a
      problem can occur.  This patch fixes the situation by taking the
      node lock before trying to drain the shared array.

      In addition, add a debug check to confirm that no race on n->shared
      access remains.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. May 13, 2016 (2 commits)
    • mm: thp: calculate the mapcount correctly for THP pages during WP faults · 6d0a07ed
      Committed by Andrea Arcangeli
      This provides full accuracy in the mapcount calculation for
      write-protect faults, so page pinning will not be broken by
      false-positive copy-on-writes.

      total_mapcount() isn't the right calculation for reuse_swap_page(),
      so this introduces a page_trans_huge_mapcount() that is effectively
      the fully accurate return value of page_mapcount() when dealing
      with Transparent Hugepages; however, we only use
      page_trans_huge_mapcount() during COW faults, where it is strictly
      needed, because of its higher runtime cost.

      This also provides, at practically zero cost, the total_mapcount
      information, which is needed to know whether we can still relocate
      the page anon_vma to the local vma.  If page_trans_huge_mapcount()
      returns 1 we can reuse the page no matter whether it is a pte or a
      pmd_trans_huge triggering the fault, but we can only relocate the
      page anon_vma to the local vma->anon_vma if we're sure it's only
      this "vma" mapping the whole THP physical range.
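
      A hedged sketch of how the helper is used in the COW path
      (simplified from the description above; the call site and the
      reuse_page_for_cow() helper are hypothetical stand-ins):

      	int total_mapcount;

      	/* Full-accuracy mapcount for a THP: reuse the page for COW
      	 * only if this is the sole mapping. */
      	if (page_trans_huge_mapcount(page, &total_mapcount) == 1) {
      		/* Relocating page->anon_vma to vma->anon_vma further
      		 * requires total_mapcount == 1, i.e. this vma maps the
      		 * whole THP physical range. */
      		reuse_page_for_cow(page);
      	}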
      
      Kirill A. Shutemov discovered the problem with moving the page
      anon_vma to the local vma->anon_vma in a previous version of this
      patch and another problem in the way page_move_anon_rmap() was called.
      
      Andrew Morton discovered that CONFIG_SWAP=n wouldn't build in a
      previous version, because reuse_swap_page must be a macro to call
      page_trans_huge_mapcount from swap.h, so this uses a macro again
      instead of an inline function. With this change at least it's a less
      dangerous usage than it was before, because "page" is used only once
      now, while with the previous code reuse_swap_page(page++) would have
      called page_mapcount on page+1 and it would have increased page twice
      instead of just once.
      
      Dean Luick noticed an uninitialized variable that could result in a
      rmap inefficiency for the non-THP case in a previous version.
      
      Mike Marciniszyn said:
      
      : Our RDMA tests are seeing an issue with memory locking that bisects to
      : commit 61f5d698 ("mm: re-enable THP")
      :
      : The test program registers two rather large MRs (512M) and RDMA
      : writes data to a passive peer using the first and RDMA reads it back
      : into the second MR and compares that data.  The sizes are chosen randomly
      : between 0 and 1024 bytes.
      :
      : The test will get through a few (<= 4 iterations) and then gets a
      : compare error.
      :
      : Tracing indicates the kernel logical addresses associated with the individual
      : pages at registration ARE correct , the data in the "RDMA read response only"
      : packets ARE correct.
      :
      : The "corruption" occurs when the packet crosse two pages that are not physically
      : contiguous.   The second page reads back as zero in the program.
      :
      : It looks like the user VA at the point of the compare error no longer points to
      : the same physical address as was registered.
      :
      : This patch totally resolves the issue!
      
      Link: http://lkml.kernel.org/r/1462547040-1737-2-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Reviewed-by: Dean Luick <dean.luick@intel.com>
      Tested-by: Alex Williamson <alex.williamson@redhat.com>
      Tested-by: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Tested-by: Josh Collier <josh.d.collier@intel.com>
      Cc: Marc Haber <mh+linux-kernel@zugschlus.de>
      Cc: <stable@vger.kernel.org>	[4.5]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ksm: fix conflict between mmput and scan_get_next_rmap_item · 7496fea9
      Committed by Zhou Chengming
      There is a concurrency issue in KSM, in the function
      scan_get_next_rmap_item.
      
      task A (ksmd):				|task B (the mm's task):
      					|
      mm = slot->mm;				|
      down_read(&mm->mmap_sem);		|
      					|
      ...					|
      					|
      spin_lock(&ksm_mmlist_lock);		|
      					|
      ksm_scan.mm_slot go to the next slot;	|
      					|
      spin_unlock(&ksm_mmlist_lock);		|
      					|mmput() ->
      					|	ksm_exit():
      					|
      					|spin_lock(&ksm_mmlist_lock);
      					|if (mm_slot && ksm_scan.mm_slot != mm_slot) {
      					|	if (!mm_slot->rmap_list) {
      					|		easy_to_free = 1;
      					|		...
      					|
      					|if (easy_to_free) {
      					|	mmdrop(mm);
      					|	...
      					|
      					|So this mm_struct may be freed in the mmput().
      					|
      up_read(&mm->mmap_sem);			|
      
      As we can see above, the ksmd thread may access a mm_struct that has
      already been freed back to the kmem_cache.  Suppose a fork then gets
      this mm_struct from the kmem_cache; when the ksmd thread calls
      up_read(&mm->mmap_sem), it will cause mmap_sem.count to become -1.
      
      As suggested by Andrea Arcangeli, unmerge_and_remove_all_rmap_items
      has the same SMP race condition, so fix it too.  My previous fix in
      scan_get_next_rmap_item would have introduced a different SMP race
      condition, so just invert the up_read/spin_unlock order, as Andrea
      Arcangeli suggested.
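
      A hedged sketch of the inverted ordering in
      scan_get_next_rmap_item() (simplified):

      	spin_lock(&ksm_mmlist_lock);
      	/* ... advance ksm_scan.mm_slot to the next slot ... */

      	/*
      	 * Drop mmap_sem *before* releasing ksm_mmlist_lock: ksm_exit()
      	 * takes ksm_mmlist_lock before it can free the mm, so the mm
      	 * cannot be freed while we still hold its mmap_sem here.
      	 */
      	up_read(&mm->mmap_sem);
      	spin_unlock(&ksm_mmlist_lock);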
      
      Link: http://lkml.kernel.org/r/1462708815-31301-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Ding Tianhong <dingtianhong@huawei.com>
      Cc: Li Bin <huawei.libin@huawei.com>
      Cc: Zhen Lei <thunder.leizhen@huawei.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. May 10, 2016 (1 commit)
    • zsmalloc: fix zs_can_compact() integer overflow · 44f43e99
      Committed by Sergey Senozhatsky
      zs_can_compact() has two race conditions in its core calculation:
      
      unsigned long obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
      				zs_stat_get(class, OBJ_USED);
      
      1) classes are not locked, so the numbers of allocated and used
         objects can be changed by concurrent ops happening on other CPUs
      2) the shrinker invokes it from preemptible context

      Thus, depending on the circumstances, OBJ_ALLOCATED can become less
      than OBJ_USED, which can result in either a very large or a negative
      `total_scan' value calculated later in do_shrink_slab().
      
      do_shrink_slab() has some logic to prevent those cases:
      
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-64
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
       vmscan: shrink_slab: zs_shrinker_scan+0x0/0x28 [zsmalloc] negative objects to delete nr=-62
      
      However, due to the way `total_scan' is calculated, not every
      shrinker->count_objects() overflow can be spotted and handled.
      To demonstrate the latter, I added some debugging code to do_shrink_slab()
      (x86_64) and the results were:
      
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 92679974445502
       vmscan: resulting total_scan: 92679974445502
      [..]
       vmscan: OVERFLOW: shrinker->count_objects() == -1 [18446744073709551615]
       vmscan: but total_scan > 0: 22634041808232578
       vmscan: resulting total_scan: 22634041808232578
      
      Even though shrinker->count_objects() has returned an overflowed
      value, the resulting `total_scan' is positive and, what is more
      worrisome, insanely huge.  This value is used later on in the
      shrinker->scan_objects() loop:
      
              while (total_scan >= batch_size ||
                     total_scan >= freeable) {
                      unsigned long ret;
                      unsigned long nr_to_scan = min(batch_size, total_scan);
      
                      shrinkctl->nr_to_scan = nr_to_scan;
                      ret = shrinker->scan_objects(shrinker, shrinkctl);
                      if (ret == SHRINK_STOP)
                              break;
                      freed += ret;
      
                      count_vm_events(SLABS_SCANNED, nr_to_scan);
                      total_scan -= nr_to_scan;
      
                      cond_resched();
              }
      
      `total_scan >= batch_size' is true for a very, very long time, and
      `total_scan >= freeable' is also true for quite some time, because
      `freeable < 0' and `total_scan' is large enough, for example,
      22634041808232578.  The only break condition, in the given scheme of
      things, is the shrinker->scan_objects() == SHRINK_STOP test, which
      is a bit too weak to rely on, especially in heavy zsmalloc-usage
      scenarios.
      
      To fix the issue, take a pool stat snapshot and use it instead of
      racy zs_stat_get() calls.
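
      A hedged sketch of the fixed calculation: snapshot both counters
      once, then guard the subtraction (the conversion of wasted objects
      into freeable zspages at the end is elided):

      	static unsigned long zs_can_compact(struct size_class *class)
      	{
      		unsigned long obj_wasted;
      		/* Snapshot once; the counters may still move underneath
      		 * us, but the subtraction below can no longer wrap. */
      		unsigned long obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
      		unsigned long obj_used = zs_stat_get(class, OBJ_USED);

      		if (obj_allocated <= obj_used)
      			return 0;

      		obj_wasted = obj_allocated - obj_used;
      		/* ... convert wasted objects into freeable zspages ... */
      		return obj_wasted;
      	}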
      
      Link: http://lkml.kernel.org/r/20160509140052.3389-1-sergey.senozhatsky@gmail.com
      Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>        [4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. May 06, 2016 (7 commits)
    • mm: fix kcompactd hang during memory offlining · 172400c6
      Committed by Vlastimil Babka
      Assume memory47 is the last online block left in node1.  This will hang:
      
        # echo offline > /sys/devices/system/node/node1/memory47/state
      
      After a couple of minutes, the following pops up in dmesg:
      
        INFO: task bash:957 blocked for more than 120 seconds.
               Not tainted 4.6.0-rc6+ #6
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        bash            D ffff8800b7adbaf8     0   957    951 0x00000000
        Call Trace:
          schedule+0x35/0x80
          schedule_timeout+0x1ac/0x270
          wait_for_completion+0xe1/0x120
          kthread_stop+0x4f/0x110
          kcompactd_stop+0x26/0x40
          __offline_pages.constprop.28+0x7e6/0x840
          offline_pages+0x11/0x20
          memory_block_action+0x73/0x1d0
          memory_subsys_offline+0x47/0x60
          device_offline+0x86/0xb0
          store_mem_state+0xda/0xf0
          dev_attr_store+0x18/0x30
          sysfs_kf_write+0x37/0x40
          kernfs_fop_write+0x11d/0x170
          __vfs_write+0x37/0x120
          vfs_write+0xa9/0x1a0
          SyS_write+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
      actually exit.  Check kthread_should_stop() to break out of the wait.
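
      A hedged sketch of the fixed wait condition in kcompactd's main
      loop (simplified):

      	static bool kcompactd_work_requested(pg_data_t *pgdat)
      	{
      		/* Wake up for kthread_stop(), not only for real work. */
      		return pgdat->kcompactd_max_order > 0 ||
      		       kthread_should_stop();
      	}

      	/* in kcompactd(): */
      	while (!kthread_should_stop()) {
      		wait_event_freezable(pgdat->kcompactd_wait,
      				     kcompactd_work_requested(pgdat));
      		kcompactd_do_work(pgdat);
      	}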
      
      Fixes: 698b1b30 ("mm, compaction: introduce kcompactd")
      Reported-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap: provide unique zpool name · 32a4e169
      Committed by Dan Streetman
      Instead of using "zswap" as the name for all zpools created, add an
      atomic counter and use "zswap%x" with the counter number for each zpool
      created, to provide a unique name for each new zpool.
      
      As zsmalloc, one of the zpool implementations, requires/expects a unique
      name for each pool created, zswap should provide a unique name.  The
      zsmalloc pool creation does not fail if a new pool with a conflicting
      name is created, unless CONFIG_ZSMALLOC_STAT is enabled; in that case,
      zsmalloc pool creation fails with -ENOMEM.  Then zswap will be unable to
      change its compressor parameter if its zpool is zsmalloc; it also will
      be unable to change its zpool parameter back to zsmalloc, if it has any
      existing old zpool using zsmalloc with page(s) in it.  Attempts to
      change the parameters will result in failure to create the zpool.  This
      changes zswap to provide a unique name for each zpool creation.
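
      A hedged sketch of the naming change in zswap's pool creation
      (close to the fix described above; surrounding code elided):

      	static atomic_t zswap_pools_count = ATOMIC_INIT(0);

      	/* in zswap_pool_create(): */
      	char name[38]; /* "zswap" + 32-char (max) hex counter + '\0' */

      	/* Give every pool a unique name, since zsmalloc expects one. */
      	snprintf(name, sizeof(name), "zswap%x",
      		 atomic_inc_return(&zswap_pools_count));
      	pool->zpool = zpool_create_pool(type, name, gfp, &zswap_zpool_ops);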
      
      Fixes: f1c54846 ("zswap: dynamic pool creation")
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, cma: prevent nr_isolated_* counters from going negative · 14af4a5e
      Committed by Hugh Dickins
      /proc/sys/vm/stat_refresh warns that nr_isolated_anon and
      nr_isolated_file go increasingly negative under compaction: this
      would add delay when there should be none, or no delay when there
      should be some.  The bug in compaction was due to a recent mmotm
      patch, but a much older instance of the bug was also noticed in
      isolate_migratepages_range(), which is used for CMA and gigantic
      hugepage allocations.
      
      The bug is caused by putback_movable_pages() in an error path
      decrementing the isolated counters without them being previously
      incremented by acct_isolated().  Fix isolate_migratepages_range() by
      removing the error-path putback, thus reaching acct_isolated() with
      migratepages still isolated, and leaving putback to caller like most
      other places do.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      [vbabka@suse.cz: expanded the changelog]
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update min_free_kbytes from khugepaged after core initialization · bc22af74
      Committed by Jason Baron
      Khugepaged attempts to raise min_free_kbytes if it is set too low.
      However, on boot khugepaged sets min_free_kbytes first from
      subsys_initcall(), and then the mm 'core' overrides min_free_kbytes
      afterwards from init_per_zone_wmark_min(), via a module_init() call.
      
      Khugepaged used to use a late_initcall() to set min_free_kbytes (such
      that it occurred after the core initialization), however this was
      removed when the initialization of min_free_kbytes was integrated into
      the starting of the khugepaged thread.
      
      The fix here is simply to invoke the core initialization using a
      core_initcall() instead of module_init(), such that the previous
      initialization ordering is restored.  I didn't restore the
      late_initcall() since start_stop_khugepaged() already sets
      min_free_kbytes via set_recommended_min_free_kbytes().
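
      In patch form, the fix described here amounts to a one-line change
      in mm/page_alloc.c (sketched):

      	-module_init(init_per_zone_wmark_min)
      	+core_initcall(init_per_zone_wmark_min)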
      
      This was noticed when we had a number of page allocation failures when
      moving a workload to a kernel with this new initialization ordering.  On
      an 8GB system this restores min_free_kbytes back to 67584 from 11365
      when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
      CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
      CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
      
      Fixes: 79553da2 ("thp: cleanup khugepaged startup")
      Signed-off-by: Jason Baron <jbaron@akamai.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • huge pagecache: mmap_sem is unlocked when truncation splits pmd · 68428398
      Committed by Hugh Dickins
      zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG() will
      be invalid with huge pagecache, in whatever way it is implemented:
      truncation of a hugely-mapped file to an unhugely-aligned size would
      easily hit it.
      
      (Although anon THP could in principle apply khugepaged to private file
      mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
      practice there's a vm_ops check which excludes them, so it never hits
      this BUG() - there's no interface to "truncate" an anonymous mapping.)
      
      We could complicate the test, to check i_mmap_rwsem also when there's a
      vm_file; but my inclination was to make zap_pmd_range() more readable by
      simply deleting this check.  A search has shown no report of the issue
      in the years since commit e0897d75 ("mm, thp: print useful
      information when mmap_sem is unlocked in zap_pmd_range") expanded it
      from VM_BUG_ON() - though I cannot point to what commit I would say then
      fixed the issue.
      
      But there are a couple of other patches now floating around, neither yet
      in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(), as
      Matthew Wilcox has done; but subject to a vma_is_anonymous() check, as
      Kirill Shutemov has done.  And let's get this in, without waiting for
      any particular huge pagecache implementation to reach the tree.
      
      Matthew said "We can reproduce this BUG() in the current Linus tree with
      DAX PMDs".
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Matthew Wilcox <willy@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: correct split_huge_pages file permission · 145bdaa1
      Committed by Yang Shi
      split_huge_pages doesn't support a get method at all, so the read
      permission is confusing; change the permission to write-only.

      Also, add "\n" to the output of the set method to make it more
      readable.
      Signed-off-by: Yang Shi <yang.shi@linaro.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • writeback: Fix performance regression in wb_over_bg_thresh() · 74d36944
      Committed by Howard Cochran
      Commit 947e9762 ("writeback: update wb_over_bg_thresh() to use
      wb_domain aware operations") unintentionally changed this function's
      meaning from "are there more dirty pages than the background writeback
      threshold" to "are there more dirty pages than the writeback threshold".
      The background writeback threshold is typically half of the writeback
      threshold, so this had the effect of raising the number of dirty pages
      required to cause a writeback worker to perform background writeout.
      
      This can cause a very severe performance regression when a BDI uses
      BDI_CAP_STRICTLIMIT because balance_dirty_pages() and the writeback worker
      can now disagree on whether writeback should be initiated.
      
      For example, in a system having 1GB of RAM, a single spinning disk, and a
      "pass-through" FUSE filesystem mounted over the disk, application code
      mmapped a 128MB file on the disk and was randomly dirtying pages in that
      mapping.
      
      Because FUSE uses strictlimit and has a default max_ratio of only 1%, in
      balance_dirty_pages, thresh is ~200, bg_thresh is ~100, and the
      dirty_freerun_ceiling is the average of those, ~150. So, it pauses the
      dirtying processes when we have 151 dirty pages and wakes up a background
      writeback worker. But the worker tests the wrong threshold (200 instead of
      100), so it does not initiate writeback and just returns.
      
      Thus, balance_dirty_pages keeps looping, sleeping and then waking up the
      worker who will do nothing. It remains stuck in this state until the few
      dirty pages that we have finally expire and we write them back for that
      reason. Then the whole process repeats, resulting in near-zero throughput
      through the FUSE BDI.
      
      The fix is to call the parameterized variant of wb_calc_thresh, so that the
      worker will do writeback if the bg_thresh is exceeded which was the
      behavior before the referenced commit.
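
      A hedged sketch of the corrected check in wb_over_bg_thresh()
      (simplified):

      	/* Compare reclaimable pages against the *background* threshold,
      	 * not the roughly 2x higher dirty threshold. */
      	if (wb_stat(wb, WB_RECLAIMABLE) >
      	    wb_calc_thresh(gdtc->wb, gdtc->bg_thresh))
      		return true;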
      
      Fixes: 947e9762 ("writeback: update wb_over_bg_thresh() to use wb_domain aware operations")
      Signed-off-by: Howard Cochran <hcochran@kernelspring.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Cc: <stable@vger.kernel.org> # v4.2+
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
  5. May 03, 2016 (1 commit)
    • parallel lookups machinery, part 2 · 84e710da
      Committed by Al Viro
      We'll need to verify that there's neither a hashed nor in-lookup
      dentry with desired parent/name before adding to in-lookup set.
      
      One possible solution would be to hold the parent's ->d_lock through
      both checks, but while the in-lookup set is relatively small at any
      time, dcache is not.  And holding the parent's ->d_lock through
      something like __d_lookup_rcu() would suck too badly.
      
      So we leave the parent's ->d_lock alone, which means that we watch
      out for the following scenario:
      	* we verify that there's no hashed match
      	* existing in-lookup match gets hashed by another process
      	* we verify that there's no in-lookup matches and decide
      that everything's fine.
      
      Solution: per-directory kinda-sorta seqlock, bumped around the times
      we hash something that used to be in-lookup or move (and hash)
      something in place of in-lookup.  Then the above would turn into
      	* read the counter
      	* do dcache lookup
      	* if no matches found, check for in-lookup matches
      	* if there had been none of those either, check if the
      counter has changed; repeat if it has.
      
      The "kinda-sorta" part is due to the fact that we don't have much spare
      space in inode.  There is a spare word (shared with i_bdev/i_cdev/i_pipe),
      so the counter part is not a problem, but spinlock is a different story.
      
      We could use the parent's ->d_lock, and it would be less painful in
      terms of contention, for __d_add() it would be rather inconvenient to
      grab; we could do that (using lock_parent()), but...
      
      Fortunately, we can get serialization on the counter itself, and it
      might be a good idea in general; we can use cmpxchg() in a loop to
      get from even to odd and smp_store_release() from odd to even.
      
      This commit adds the counter and the update logic; the readers will
      be added in the next commit.
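
      A sketch of that even/odd protocol (the i_dir_seq field name follows
      the eventual in-tree naming; treat the details as illustrative):

      	static unsigned start_dir_add(struct inode *dir)
      	{
      		for (;;) {
      			unsigned n = dir->i_dir_seq;
      			/* even -> odd: we are the writer now */
      			if (!(n & 1) &&
      			    cmpxchg(&dir->i_dir_seq, n, n + 1) == n)
      				return n;
      			cpu_relax();
      		}
      	}

      	static void end_dir_add(struct inode *dir, unsigned n)
      	{
      		/* odd -> even, with release semantics for readers */
      		smp_store_release(&dir->i_dir_seq, n + 2);
      	}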
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  6. May 02, 2016 (5 commits)
  7. April 29, 2016 (6 commits)
    • mm/memory-failure: fix race with compound page split/merge · c2e7e00b
      Committed by Konstantin Khlebnikov
      get_hwpoison_page() must recheck the relation between head and tail
      pages.

      n-horiguchi said: without this recheck, the race causes the kernel
      to pin an irrelevant page, and finally makes the kernel crash due to
      a refcount mismatch.
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: wake kcompactd before kswapd's short sleep · fd901c95
      Committed by Vlastimil Babka
      When kswapd goes to sleep it checks if the node is balanced and at first
      it sleeps only for HZ/10 time, then rechecks if the node is still
      balanced and nobody has woken it during the initial sleep.  Only then it
      goes fully sleep until an allocation slowpath wakes it up again.
      
      For higher-order allocations, waking up kcompactd is done only before
      the full sleep.  This turns out to be an issue in case another
      high-order allocation fails during the initial sleep.  It will wake
      kswapd up, however kswapd considers the zone balanced from the order-0
      perspective, and will just quickly try to sleep again.  So if there's a
      longer stream of high-order allocations hitting the slowpath and waking
      up kswapd, it might never actually wake up kcompactd, which may be
      considered a regression from kswapd-based compaction.  In the worst
      case, it might be that a single allocation that cannot direct
      reclaim/compact itself is waking kswapd in the retry loop and preventing
      kcompactd from being woken up and unblocking it.
      
      This patch makes sure kcompactd is woken up in such situations by simply
      moving the wakeup before the short initial sleep.  More efficient
      solution would be to wake kcompactd immediately instead of kswapd if the
      node is already order-0 balanced, but in that case we should also move
      reset_isolation_suitable() call to kcompactd so it's not adding to the
      allocator's latency.  Since it's late in the 4.6 cycle, let's go with
      the simpler change for now.
      
      Fixes: accf6242 ("mm, kswapd: replace kswapd compaction with waking up kcompactd")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hwpoison: fix wrong num_poisoned_pages accounting · d7e69488
      Committed by Minchan Kim
      Currently, the migration code increases num_poisoned_pages for a
      *failed* migration page as well as for a successfully migrated one
      during a memory-failure attempt.  This makes the stat wrong.  It
      also marks the page PG_HWPoison even if the migration attempt
      failed, which would mean we cannot recover the corrupted page using
      the memory-failure facility.

      This patch fixes it.
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: call swap_slot_free_notify() with page lock held · b06bad17
      Committed by Minchan Kim
      Kyeongdon reported the error below, which is the
      BUG_ON(!PageSwapCache(page)) in page_swap_info.  The reason is that
      page_endio() in rw_page unlocks the page once read I/O has
      completed, so we need to take the page lock again to check
      PageSwapCache; otherwise, the page can be removed from the swap
      cache.
      
        Kernel BUG at c00f9040 [verbose debug info unavailable]
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
        Modules linked in:
        CPU: 4 PID: 13446 Comm: RenderThread Tainted: G        W 3.10.84-g9f14aec-dirty #73
        task: c3b73200 ti: dd192000 task.ti: dd192000
        PC is at page_swap_info+0x10/0x2c
        LR is at swap_slot_free_notify+0x18/0x6c
        pc : [<c00f9040>]    lr : [<c00f5560>]    psr: 400f0113
        sp : dd193d78  ip : c2deb1e4  fp : da015180
        r10: 00000000  r9 : 000200da  r8 : c120fe08
        r7 : 00000000  r6 : 00000000  r5 : c249a6c0  r4 : = c249a6c0
        r3 : 00000000  r2 : 40080009  r1 : 200f0113  r0 : = c249a6c0
        ..<snip> ..
        Call Trace:
          page_swap_info+0x10/0x2c
          swap_slot_free_notify+0x18/0x6c
          swap_readpage+0x90/0x11c
          read_swap_cache_async+0x134/0x1ac
          swapin_readahead+0x70/0xb0
          handle_pte_fault+0x320/0x6fc
          handle_mm_fault+0xc0/0xf0
          do_page_fault+0x11c/0x36c
          do_DataAbort+0x34/0x118
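
      A hedged sketch of the fix in swap_readpage()'s rw_page path (the
      in-tree fix may use a trylock instead; this is illustrative):

      	ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
      	if (!ret) {
      		/*
      		 * bdev_read_page() completed synchronously and
      		 * page_endio() has already unlocked the page; retake
      		 * the page lock so swap_slot_free_notify() can safely
      		 * test PageSwapCache.
      		 */
      		lock_page(page);
      		swap_slot_free_notify(page);
      		unlock_page(page);
      		goto out;
      	}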
      
      Fixes: 3f2b1a04 ("zram: revive swap_slot_free_notify")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Kyeongdon Kim <kyeongdon.kim@lge.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: reclaim highmem zone if buffer_heads is over limit · 7bf52fb8
      Committed by Minchan Kim
      We used to reclaim the highmem zone if buffer_heads was over the
      limit, but commit 6b4f7799 ("mm: vmscan: invoke slab shrinkers from
      shrink_zone()") changed the behavior so that the highmem zone is not
      reclaimed even though buffer_heads is over the limit.  This patch
      restores the logic.
      
      Fixes: 6b4f7799 ("mm: vmscan: invoke slab shrinkers from shrink_zone()")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • numa: fix /proc/<pid>/numa_maps for THP · 28093f9f
      Committed by Gerald Schaefer
      In gather_pte_stats() a THP pmd is cast into a pte, which is wrong
      because the layouts may differ depending on the architecture.  On s390
      this will lead to inaccurate numa_maps accounting in /proc because of
      misguided pte_present() and pte_dirty() checks on the fake pte.
      
      On other architectures pte_present() and pte_dirty() may work by chance,
      but there may be an issue with direct-access (dax) mappings w/o
      underlying struct pages when HAVE_PTE_SPECIAL is set and THP is
      available.  In vm_normal_page() the fake pte will be checked with
      pte_special() and because there is no "special" bit in a pmd, this will
      always return false and the VM_PFNMAP | VM_MIXEDMAP checking will be
      skipped.  On dax mappings w/o struct pages, an invalid struct page
      pointer would then be returned that can crash the kernel.
      
      This patch fixes the numa_maps THP handling by introducing new "_pmd"
      variants of the can_gather_numa_stats() and vm_normal_page() functions.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>