1. 24 January 2014, 17 commits
    • mm/nobootmem.c: add return value check in __alloc_memory_core_early() · 87379ec8
      Authored by Philipp Hachtmann
      When memblock_reserve() fails because memblock.reserved.regions cannot
      be resized, the caller (e.g.  alloc_bootmem()) is not informed of the
      failed allocation.  Therefore alloc_bootmem() silently returns the same
      pointer again and again.
      
      This patch adds a check for the return value of memblock_reserve() in
      __alloc_memory_core_early().
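
      A minimal sketch of the resulting flow (abbreviated from the
      description above; exact signatures vary across kernel versions):

              static void * __init __alloc_memory_core_early(int nid, u64 size,
                                                             u64 align, u64 goal,
                                                             u64 limit)
              {
                      void *ptr;
                      u64 addr;

                      addr = memblock_find_in_range_node(size, align, goal,
                                                         limit, nid);
                      if (!addr)
                              return NULL;

                      /* The added check: memblock_reserve() fails when the
                       * reserved-regions array cannot be resized. */
                      if (memblock_reserve(addr, size))
                              return NULL;

                      ptr = phys_to_virt(addr);
                      memset(ptr, 0, size);
                      return ptr;
              }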
      Signed-off-by: Philipp Hachtmann <phacht@linux.vnet.ibm.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: rework memcg_update_kmem_limit synchronization · d6441637
      Authored by Vladimir Davydov
      Currently we take both the memcg_create_mutex and the set_limit_mutex
      when we enable kmem accounting for a memory cgroup, which makes kmem
      activation events serialize with both memcg creations and other memcg
      limit updates (memory.limit, memory.memsw.limit).  However, there is no
      point in such strict synchronization rules there.
      
      First, the set_limit_mutex was introduced to keep the memory.limit and
      memory.memsw.limit values in sync.  Since memory.kmem.limit can be set
      independently of them, it is better to introduce a separate mutex to
      synchronize against concurrent kmem limit updates.
      
      Second, we take the memcg_create_mutex in order to make sure all
      children of this memcg will be kmem-active as well.  To achieve that,
      it is enough to hold this mutex only around the memcg_has_children()
      check.  This guarantees that if a child is added
      after we checked that the memcg has no children, the newly added cgroup
      will see its parent kmem-active (of course if the latter succeeded), and
      call kmem activation for itself.
      
      This patch simplifies the locking rules of memcg_update_kmem_limit()
      according to these considerations.
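
      In outline, the reworked activation path looks like this (a sketch
      following the description above; names such as activate_kmem_mutex and
      memcg_activate_kmem() are taken from the logic described here, not
      necessarily verbatim from the patch):

              static DEFINE_MUTEX(activate_kmem_mutex);   /* kmem limits only */

              /* Sketch of the first-time activation; a plain limit update of
               * an already-active memcg takes only the new mutex. */
              static int memcg_update_kmem_limit(struct mem_cgroup *memcg, u64 val)
              {
                      int ret = 0;

                      mutex_lock(&activate_kmem_mutex);
                      /*
                       * memcg_create_mutex is needed only around the children
                       * check: a child created after the check sees its parent
                       * kmem-active and activates accounting for itself.
                       */
                      mutex_lock(&memcg_create_mutex);
                      if (memcg_has_children(memcg))
                              ret = -EBUSY;
                      mutex_unlock(&memcg_create_mutex);

                      if (!ret)
                              ret = memcg_activate_kmem(memcg, val);
                      mutex_unlock(&activate_kmem_mutex);
                      return ret;
              }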
      
      [vdavydov@parallels.com: fix uninitialized var warning]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: remove KMEM_ACCOUNTED_ACTIVATED flag · 6de64beb
      Authored by Vladimir Davydov
      Currently we have two state bits in mem_cgroup::kmem_account_flags
      regarding kmem accounting activation, ACTIVATED and ACTIVE.  We start
      kmem accounting only if both flags are set (memcg_can_account_kmem()),
      plus throughout the code there are several places where we check only
      the ACTIVE flag, but we never check the ACTIVATED flag alone.  These
      flags are both set from memcg_update_kmem_limit() under the
      set_limit_mutex, the ACTIVE flag always being set after ACTIVATED, and
      they never get cleared.  That said, checking whether both flags are set
      is equivalent to checking only the ACTIVE flag, and since there are no
      standalone ACTIVATED checks, we can safely remove the ACTIVATED flag
      without changing anything.
      
      Let's try to understand what was the reason for introducing these flags.
      The purpose of the ACTIVE flag is clear - it states that kmem should be
      accounted to the cgroup.  The only requirement for it is that it should
      be set after we have fully initialized kmem accounting bits for the
      cgroup and patched all static branches relating to kmem accounting.
      Since we always check whether the static branch is enabled before
      actually considering whether we should account (otherwise we wouldn't
      benefit from static branching), this guarantees that we won't skip a
      commit or uncharge after a charge due to an unpatched static branch.
      
      Now let's move on to the ACTIVATED bit.  As I proved in the beginning of
      this message, it is absolutely useless, and removing it will change
      nothing.  So what was the reason for introducing it?
      
      The ACTIVATED flag was introduced by commit a8964b9b ("memcg: use
      static branches when code not in use") in order to guarantee that
      static_key_slow_inc(&memcg_kmem_enabled_key) would be called only once
      for each memory cgroup when its kmem accounting was activated.  The
      point was that at that time the memcg_update_kmem_limit() function's
      work-flow looked like this:
      
              bool must_inc_static_branch = false;
      
              cgroup_lock();
              mutex_lock(&set_limit_mutex);
              if (!memcg->kmem_account_flags && val != RESOURCE_MAX) {
                      /* The kmem limit is set for the first time */
                      ret = res_counter_set_limit(&memcg->kmem, val);
      
                      memcg_kmem_set_activated(memcg);
                      must_inc_static_branch = true;
              } else
                      ret = res_counter_set_limit(&memcg->kmem, val);
              mutex_unlock(&set_limit_mutex);
              cgroup_unlock();
      
              if (must_inc_static_branch) {
                      /* We can't do this under cgroup_lock */
                      static_key_slow_inc(&memcg_kmem_enabled_key);
                      memcg_kmem_set_active(memcg);
              }
      
      Thus, without the ACTIVATED flag we could race with other threads
      trying to set the limit and increment the static-branching ref-counter
      more than once.  Today we call the whole memcg_update_kmem_limit()
      function under the set_limit_mutex, so this race is impossible.
      
      Now that we understand why the ACTIVATED bit was introduced, why we no
      longer need it, and that removing it changes nothing anyway, let's get
      rid of it.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: RCU protect memcg_params for root caches · f8570263
      Authored by Vladimir Davydov
      We relocate root cache's memcg_params whenever we need to grow the
      memcg_caches array to accommodate all kmem-active memory cgroups.
      Currently on relocation we free the old version immediately, which can
      lead to use-after-free, because the memcg_caches array is accessed
      lock-free (see cache_from_memcg_idx()).  This patch fixes this by making
      memcg_params RCU-protected for root caches.
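
      The idea, sketched with simplified accessors (the in-tree code is
      spread across mm/memcontrol.c and mm/slab_common.c):

              /* Writer: publish the grown array, then free the old one only
               * after a grace period, since readers are lock-free. */
              static void memcg_swap_cache_params(struct kmem_cache *s,
                                                  struct memcg_cache_params *new)
              {
                      struct memcg_cache_params *old = s->memcg_params;

                      rcu_assign_pointer(s->memcg_params, new);
                      if (old)
                              kfree_rcu(old, rcu_head);   /* not an immediate kfree() */
              }

              /* Reader fragment (cf. cache_from_memcg_idx()): */
              struct memcg_cache_params *params;
              struct kmem_cache *cachep;

              rcu_read_lock();
              params = rcu_dereference(s->memcg_params);
              cachep = params ? params->memcg_caches[idx] : NULL;
              rcu_read_unlock();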
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: do not panic if we fail to create memcg cache · f717eb3a
      Authored by Vladimir Davydov
      There is no point in flooding the logs with warnings, much less
      crashing the system, if we fail to create a cache for a memcg.  In this
      case we will account the memcg's allocations to the root cgroup until
      we succeed in creating its own cache, which is not that critical.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: get rid of kmem_cache_dup() · 842e2873
      Authored by Vladimir Davydov
      kmem_cache_dup() is only called from memcg_create_kmem_cache().  The
      latter, in fact, does nothing besides this, so let's fold
      kmem_cache_dup() into memcg_create_kmem_cache().
      
      This patch also makes the memcg_cache_mutex private to
      memcg_create_kmem_cache(), because it is not used anywhere else.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: fix races in per-memcg cache creation/destruction · 2edefe11
      Authored by Vladimir Davydov
      We obtain a per-memcg cache from a root kmem_cache by dereferencing an
      entry of the root cache's memcg_params::memcg_caches array.  If we find
      no cache for a memcg there on allocation, we initiate the memcg cache
      creation (see memcg_kmem_get_cache()).  The cache creation proceeds
      asynchronously in memcg_create_kmem_cache() in order to avoid lock
      clashes, so there can be several threads trying to create the same
      kmem_cache concurrently, but only one of them may succeed.  However, due
      to a race in the code, this is not always true.  The point is that the
      memcg_caches array can be relocated when we activate kmem accounting for
      a memcg (see memcg_update_all_caches(), memcg_update_cache_size()).  If
      memcg_update_cache_size() and memcg_create_kmem_cache() proceed
      concurrently as described below, we can leak a kmem_cache.
      
      Assume two threads schedule creation of the same kmem_cache.  One of them
      successfully creates it.  Another one should fail then, but if
      memcg_create_kmem_cache() interleaves with memcg_update_cache_size() as
      follows, it won't:
      
        memcg_create_kmem_cache()             memcg_update_cache_size()
        (called w/o mutexes held)             (called with slab_mutex,
                                               set_limit_mutex held)
        -------------------------             -------------------------
      
        mutex_lock(&memcg_cache_mutex)
      
                                              s->memcg_params=kzalloc(...)
      
        new_cachep=cache_from_memcg_idx(cachep,idx)
        // new_cachep==NULL => proceed to creation
      
                                              s->memcg_params->memcg_caches[i]
                                                  =cur_params->memcg_caches[i]
      
        // kmem_cache_create_memcg takes slab_mutex
        // so we will hang around until
        // memcg_update_cache_size finishes, but
        // nothing will prevent it from succeeding so
        // memcg_caches[idx] will be overwritten in
        // memcg_register_cache!
      
        new_cachep = kmem_cache_create_memcg(...)
        mutex_unlock(&memcg_cache_mutex)
      
      Let's fix this by moving the existence check for the memcg cache into
      kmem_cache_create_memcg(), where it is performed under the slab_mutex,
      and making it return NULL if the cache already exists.
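
      Sketched (parameter list shortened; do_kmem_cache_create() is a
      hypothetical stand-in for the actual creation code):

              struct kmem_cache *
              kmem_cache_create_memcg(struct mem_cgroup *memcg, struct kmem_cache *root)
              {
                      struct kmem_cache *s = NULL;

                      mutex_lock(&slab_mutex);
                      /*
                       * Check under slab_mutex: memcg_update_cache_size() also
                       * runs under it, so the result can no longer be undone
                       * by a concurrent memcg_caches relocation.
                       */
                      if (memcg && cache_from_memcg_idx(root, memcg_cache_id(memcg)))
                              goto out_unlock;    /* lost the race: return NULL */

                      s = do_kmem_cache_create(memcg, root);
              out_unlock:
                      mutex_unlock(&slab_mutex);
                      return s;
              }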
      
      A similar race is possible when destroying a memcg cache (see
      kmem_cache_destroy()).  Since memcg_unregister_cache(), which clears the
      pointer in the memcg_caches array, is called w/o protection, we can race
      with memcg_update_cache_size() and omit clearing the pointer.  Therefore
      memcg_unregister_cache() should be moved before we release the
      slab_mutex.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix possible NULL deref while traversing memcg_slab_caches list · 96403da2
      Authored by Vladimir Davydov
      All caches of the same memory cgroup are linked in the memcg_slab_caches
      list via kmem_cache::memcg_params::list.  This list is traversed, for
      example, when we read memory.kmem.slabinfo.
      
      Since the list actually consists of memcg_cache_params objects, we have
      to convert an element of the list to a kmem_cache object using
      memcg_params_to_cache(), which obtains the pointer to the cache from the
      memcg_params::memcg_caches array of the corresponding root cache.  That
      said, the pointer to a kmem_cache in its parent's memcg_params must be
      initialized before adding the cache to the list, and cleared only after
      it has been unlinked.  Currently it is the other way around, which can
      result in a NULL pointer dereference while traversing the
      memcg_slab_caches list.  This patch restores the correct order.
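
      A sketch of the restored ordering (structure layout simplified):

              static void memcg_register_cache(struct mem_cgroup *memcg,
                                               struct kmem_cache *root,
                                               struct kmem_cache *s, int id)
              {
                      /* Publish the pointer first... */
                      root->memcg_params->memcg_caches[id] = s;
                      /* ...then link the cache, so that a list walker can
                       * always resolve a listed entry to a kmem_cache. */
                      list_add(&s->memcg_params->list, &memcg->memcg_slab_caches);
              }

              static void memcg_unregister_cache(struct mem_cgroup *memcg,
                                                 struct kmem_cache *root,
                                                 struct kmem_cache *s, int id)
              {
                      /* Mirror image: unlink first, clear the pointer last. */
                      list_del(&s->memcg_params->list);
                      root->memcg_params->memcg_caches[id] = NULL;
              }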
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: fix barrier usage when accessing memcg_caches · 959c8963
      Authored by Vladimir Davydov
      Each root kmem_cache has pointers to per-memcg caches stored in its
      memcg_params::memcg_caches array.  Whenever we want to allocate a slab
      for a memcg, we access this array to get per-memcg cache to allocate
      from (see memcg_kmem_get_cache()).  The access must be lock-free for
      performance reasons, so we should use barriers to ensure that the
      kmem_cache we see is fully up to date.
      
      First, we should place a write barrier immediately before setting the
      pointer to it in the memcg_caches array in order to make sure nobody
      will see a partially initialized object.  Second, we should issue a read
      barrier before dereferencing the pointer, to pair with the write
      barrier.
      
      However, currently the barrier usage looks rather strange.  We have a
      write barrier *after* setting the pointer and a read barrier *before*
      reading the pointer, which is incorrect.  This patch fixes this.
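
      The corrected placement, sketched (the helpers are hypothetical
      wrappers; the real code sits in memcg_register_cache() and
      cache_from_memcg_idx()):

              static void memcg_publish_cache(struct kmem_cache *root,
                                              struct kmem_cache *s, int idx)
              {
                      /* s is fully initialized at this point. */
                      smp_wmb();      /* order the init *before* the publish */
                      root->memcg_params->memcg_caches[idx] = s;
              }

              static struct kmem_cache *memcg_read_cache(struct kmem_cache *root,
                                                         int idx)
              {
                      struct kmem_cache *cachep;

                      cachep = root->memcg_params->memcg_caches[idx];
                      smp_rmb();      /* pairs with the smp_wmb() above */
                      return cachep;  /* safe to dereference after the barrier */
              }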
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: clean up memcg cache initialization/destruction · 1aa13254
      Authored by Vladimir Davydov
      Currently we have a rather messy set of functions relating to per-memcg
      kmem cache initialization/destruction.
      
      Per-memcg caches are created in memcg_create_kmem_cache().  This
      function calls kmem_cache_create_memcg() to allocate and initialize a
      kmem cache and then "registers" the new cache in the
      memcg_params::memcg_caches array of the parent cache.
      
      During its work-flow, kmem_cache_create_memcg() executes the following
      memcg-related functions:
      
       - memcg_alloc_cache_params(), to initialize memcg_params of the newly
         created cache;
       - memcg_cache_list_add(), to add the new cache to the memcg_slab_caches
         list.
      
      On the other hand, kmem_cache_destroy() called on a cache destruction
      only calls memcg_release_cache(), which does all the work: it cleans the
      reference to the cache in its parent's memcg_params::memcg_caches,
      removes the cache from the memcg_slab_caches list, and frees
      memcg_params.
      
      Such an inconsistency between the destruction and initialization paths
      makes the code difficult to read, so let's clean this up a bit.
      
      This patch moves all the code relating to registration of per-memcg
      caches (adding to memcg list, setting the pointer to a cache from its
      parent) to the newly created memcg_register_cache() and
      memcg_unregister_cache() functions, making the initialization and
      destruction paths look symmetrical.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg, slab: kmem_cache_create_memcg(): fix memleak on fail path · 363a044f
      Authored by Vladimir Davydov
      We do not free the cache's memcg_params if __kmem_cache_create fails.
      Fix this.
      
      Plus, rename memcg_register_cache() to memcg_alloc_cache_params(),
      because it actually does not register the cache anywhere, but simply
      initializes kmem_cache::memcg_params.
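
      A sketch of the fixed error path in kmem_cache_create_memcg()
      (abbreviated fragment; memcg_free_cache_params() is the cleanup
      counterpart implied by the rename, and the labels follow the
      goto-based rollback from the next changelog below):

              err = memcg_alloc_cache_params(memcg, s, parent_cache);
              if (err)
                      goto out_free_cache;

              err = __kmem_cache_create(s, flags);
              if (err)
                      goto out_free_cache;    /* this path used to leak memcg_params */

              s->refcount = 1;
              list_add(&s->list, &slab_caches);
              goto out_unlock;

        out_free_cache:
              memcg_free_cache_params(s);     /* the added cleanup */
              kmem_cache_free(kmem_cache, s);
        out_unlock:
              mutex_unlock(&slab_mutex);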
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • slab: clean up kmem_cache_create_memcg() error handling · 3965fc36
      Authored by Vladimir Davydov
      Currently kmem_cache_create_memcg() backs off on failure inside
      conditionals, without using gotos.  This results in duplicated rollback
      code, which makes the function look cumbersome even though on error we
      should only free the allocated cache.  Since in the next patch I am
      going to add yet another rollback function call on the error path
      there, let's employ labels instead of conditionals for undoing any
      changes on failure, to keep things clean.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Pekka Enberg <penberg@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: dump page when hitting a VM_BUG_ON using VM_BUG_ON_PAGE · 309381fe
      Authored by Sasha Levin
      Most of the VM_BUG_ON assertions are performed on a page.  Usually, when
      one of these assertions fails we'll get a BUG_ON with a call stack and
      the registers.
      
      Based on recent requests to add a small piece of code that dumps the
      page at various VM_BUG_ON sites, I've noticed that the page dump is
      quite useful to people debugging issues in mm.
      
      This patch adds VM_BUG_ON_PAGE(cond, page) which, beyond doing what
      VM_BUG_ON() does, also dumps the page before triggering the actual
      BUG().
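
      Roughly (the in-tree definition may differ, e.g. in how the reason
      string passed to dump_page() is built):

              #ifdef CONFIG_DEBUG_VM
              #define VM_BUG_ON_PAGE(cond, page)                              \
                      do {                                                    \
                              if (unlikely(cond)) {                           \
                                      dump_page(page, "VM_BUG_ON_PAGE");      \
                                      BUG();                                  \
                              }                                               \
                      } while (0)
              #else
              #define VM_BUG_ON_PAGE(cond, page) do { } while (0)
              #endif

      A typical conversion then reads VM_BUG_ON_PAGE(PageTail(page), page)
      instead of VM_BUG_ON(PageTail(page)).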
      
      [akpm@linux-foundation.org: fix up includes]
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: do not use vmalloc for mem_cgroup allocations · 8ff69e2c
      Authored by Vladimir Davydov
      The vmalloc was introduced by 33327948 ("memcgroup: use vmalloc for
      mem_cgroup allocation"), because at that time MAX_NUMNODES was used for
      defining the per-node array in the mem_cgroup structure, so that the
      structure could be huge even if the system had only one NUMA node.
      
      The situation was significantly improved by commit 45cf7ebd ("memcg:
      reduce the size of struct memcg 244-fold"), which made the size of the
      mem_cgroup structure calculated dynamically depending on the real number
      of NUMA nodes installed on the system (nr_node_ids), so now there is no
      point in using vmalloc here: the structure is allocated rarely and on
      most systems its size is about 1K.
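
      A sketch of the allocation after the change (field accounting
      simplified):

              static struct mem_cgroup *mem_cgroup_alloc(void)
              {
                      /* sizeof(struct mem_cgroup) plus nr_node_ids per-node
                       * pointers: about 1K on most systems, so a plain
                       * kzalloc() is fine. */
                      size_t size = sizeof(struct mem_cgroup) +
                                    nr_node_ids * sizeof(struct mem_cgroup_per_node *);

                      return kzalloc(size, GFP_KERNEL);   /* was vzalloc()-based */
              }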
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Glauber Costa <glommer@openvz.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: munlock: fix potential race with THP page split · 01cc2e58
      Authored by Vlastimil Babka
      Since commit ff6a6da6 ("mm: accelerate munlock() treatment of THP
      pages") munlock skips tail pages of a munlocked THP page.  There is some
      attempt to prevent bad consequences of racing with a THP page split, but
      code inspection indicates that there are two problems that may lead to a
      non-fatal, yet wrong outcome.
      
      First, __split_huge_page_refcount() copies flags, including PageMlocked,
      from the head page to the tail pages.  Clearing PageMlocked via
      munlock_vma_page() in the middle of this operation might leave some of
      the tail pages with the PageMlocked flag set.  As the head page still
      appears to be a THP page until all tail pages are processed,
      munlock_vma_page() might think it munlocked the whole THP page and skip
      all the former tail pages.  Before ff6a6da6, those pages would be
      cleared in further iterations of munlock_vma_pages_range(), but NR_MLOCK
      would still become undercounted (related to the next point).
      
      Second, NR_MLOCK accounting is based on a call to hpage_nr_pages()
      after the PageMlocked is cleared.  The accounting might also become
      inconsistent due to a race with __split_huge_page_refcount():
      
      - undercount when HUGE_PMD_NR is subtracted, but some tail pages are
        left with PageMlocked set and counted again (only possible before
        ff6a6da6)
      
      - overcount when hpage_nr_pages() sees a normal page (split has already
        finished), but the parallel split has meanwhile cleared PageMlocked from
        additional tail pages
      
      This patch prevents both problems by extending the scope of lru_lock in
      munlock_vma_page().  This is convenient because:
      
      - __split_huge_page_refcount() takes lru_lock for its whole operation
      
      - munlock_vma_page() typically takes lru_lock anyway for page isolation
      
      As this becomes the second function where page isolation is done with
      lru_lock already held, factor this out into a new
      __munlock_isolate_lru_page() function and clean up the surrounding code.
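
      A sketch of the factored-out helper (close to, but not necessarily
      identical to, the patch; the caller holds lru_lock):

              static bool __munlock_isolate_lru_page(struct page *page, bool getpage)
              {
                      if (PageLRU(page)) {
                              struct lruvec *lruvec;

                              lruvec = mem_cgroup_page_lruvec(page, page_zone(page));
                              if (getpage)
                                      get_page(page);
                              ClearPageLRU(page);
                              del_page_from_lru_list(page, lruvec, page_lru(page));
                              return true;
                      }
                      return false;
              }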
      
      [akpm@linux-foundation.org: avoid a coding-style ugly]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: print more details for bad_page() · f0b791a3
      Authored by Dave Hansen
      bad_page() is cool in that it prints out a bunch of data about the page.
      But, I can never remember which page flags are good and which are bad,
      or whether ->index or ->mapping is required to be NULL.
      
      This patch allows bad/dump_page() callers to specify a string about why
      they are dumping the page and adds explanation strings to a number of
      places.  It also adds a 'bad_flags' argument to bad_page(), which it
      then dumps out separately from the flags which are actually set.
      
      This way, the messages will show why the page was bad and *specifically*
      which flags it is complaining about, if a bad page-flag combination was
      the problem.
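
      The resulting interface, roughly:

              void dump_page(struct page *page, const char *reason);
              void dump_page_badflags(struct page *page, const char *reason,
                                      unsigned long bad_flags);

              /* An illustrative call site inside the page allocator: */
              bad_page(page, "nonzero mapcount", 0);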
      
      [akpm@linux-foundation.org: switch to pr_alert]
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/zswap.c: change params from hidden to ro · 12ab028b
      Authored by Dan Streetman
      The "compressor" and "enabled" params are currently hidden, this changes
      them to read-only, so userspace can tell if zswap is enabled or not and
      see what compressor is in use.
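
      In module-parameter terms the change is just the permission bits
      (a sketch; the default compressor string shown is an assumption):

              /* perm 0 created no sysfs entry at all; 0444 exposes the values
               * read-only under /sys/module/zswap/parameters/. */
              static bool zswap_enabled;
              module_param_named(enabled, zswap_enabled, bool, 0444);

              static char *zswap_compressor = "lzo";
              module_param_named(compressor, zswap_compressor, charp, 0444);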
      Signed-off-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Vladimir Murzin <murzin.v@gmail.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Weijie Yang <weijie.yang@samsung.com>
      Acked-by: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 22 January 2014, 23 commits
    • mm/migrate: remove unused function, fail_migrate_page() · 78d5506e
      Authored by Joonsoo Kim
      fail_migrate_page() isn't used anywhere, so remove it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: remove putback_lru_pages, fix comment on putback_movable_pages · 59c82b70
      Authored by Joonsoo Kim
      Parts of putback_lru_pages() and putback_movable_pages() are
      duplicated, which makes it confusing which of the two should be used.
      We can remove putback_lru_pages() since it is not really needed now.
      This makes the code easier to understand and maintain.
      
      Also, the comment on putback_movable_pages() is stale now, so fix it.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: correct failure handling if !hugepage_migration_support() · 32665f2b
      Authored by Joonsoo Kim
      We should remove the page from the list if we fail with -ENOSYS, since
      migrate_pages() considers error codes other than -ENOMEM and -EAGAIN as
      permanent failures and assumes that the page has been removed from the
      list.  Without this patch, we could overcount the number of failures.
      
      In addition, we should put back the new hugepage if
      !hugepage_migration_support(); otherwise we would leak hugepage memory.
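
      A fragment of the fix as described (a sketch; per the changelog it is
      the freshly allocated destination hugepage that must be put back):

              /* unmap_and_move_huge_page(), after new_hpage was allocated: */
              if (!hugepage_migration_support(page_hstate(hpage))) {
                      putback_active_hugepage(new_hpage); /* don't leak it */
                      /* -ENOSYS is a permanent failure; the caller now also
                       * drops the source page from the migration list so it
                       * is not counted as failed again. */
                      return -ENOSYS;
              }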
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: add comment about permanent failure path · 354a3363
      Authored by Naoya Horiguchi
      Let's add a comment about where the failed page goes, which makes the
      code more readable.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: warn for non-blockable __GFP_NOFAIL allocation failure · aed0a0e3
      Authored by David Rientjes
      __GFP_NOFAIL may return NULL when coupled with GFP_NOWAIT or GFP_ATOMIC.
      
      Luckily, nothing currently does such craziness.  So instead of causing
      such allocations to loop (potentially forever), we maintain the current
      behavior and also warn about the new users of the deprecated flag.
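
      A fragment of the check in __alloc_pages_slowpath() (a sketch;
      __GFP_WAIT was the "blockable" bit at the time):

              const gfp_t wait = gfp_mask & __GFP_WAIT;

              if (!wait) {
                      /*
                       * All existing users of the deprecated __GFP_NOFAIL are
                       * blockable; warn about new, non-blockable users rather
                       * than looping on the allocation forever.
                       */
                      WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL);
                      goto nopage;    /* unchanged behavior: may return NULL */
              }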
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: reset scanner positions immediately when they meet · 55b7c4c9
      Authored by Vlastimil Babka
      Compaction used to start its migrate and free page scanners at the zone's
      lowest and highest pfn, respectively.  Later, caching was introduced to
      remember the scanners' progress across compaction attempts so that
      pageblocks are not re-scanned uselessly.  Additionally, pageblocks where
      isolation failed are marked to be quickly skipped when encountered again
      in future compactions.
      
      Currently, both the reset of the cached pfn's and the clearing of the
      pageblock skip information for a zone are done in __reset_isolation_suitable().
      This function gets called when:
      
       - compaction is restarting after being deferred
       - compact_blockskip_flush flag is set in compact_finished() when the scanners
         meet (and not again cleared when direct compaction succeeds in allocation)
         and kswapd acts upon this flag before going to sleep
      
      This behavior is suboptimal for several reasons:
      
       - when direct sync compaction is called after async compaction fails (in the
         allocation slowpath), it will effectively do nothing, unless kswapd
         happens to process the compact_blockskip_flush flag meanwhile. This is racy
         and goes against the purpose of sync compaction to more thoroughly retry
         the compaction of a zone where async compaction has failed.
         The restart-after-deferring path cannot help here as deferring happens only
         after the sync compaction fails. It is also done only for the preferred
         zone, while the compaction might be done for a fallback zone.
      
       - the mechanism of marking pageblock to be skipped has little value since the
         cached pfn's are reset only together with the pageblock skip flags. This
         effectively limits pageblock skip usage to parallel compactions.
      
      This patch changes compact_finished() so that cached pfn's are reset
      immediately when the scanners meet.  Clearing pageblock skip flags is
      unchanged, as well as the other situations where cached pfn's are reset.
      This allows the sync-after-async compaction to retry pageblocks not
      marked as skipped, such as !MIGRATE_MOVABLE blocks that async compaction
      now skips without marking them.
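
      Sketched, the compact_finished() change (close to the patch):

              /* Compaction run completes if the two scanners meet */
              if (cc->free_pfn <= cc->migrate_pfn) {
                      /* Let the next compaction start anew. */
                      zone->compact_cached_migrate_pfn = zone->zone_start_pfn;
                      zone->compact_cached_free_pfn = zone_end_pfn(zone);

                      /* Pageblock skip flags are still flushed only later,
                       * by kswapd acting on compact_blockskip_flush. */
                      if (!current_is_kswapd())
                              zone->compact_blockskip_flush = true;

                      return COMPACT_COMPLETE;
              }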
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: do not mark unmovable pageblocks as skipped in async compaction · 50b5b094
      Authored by Vlastimil Babka
      Compaction temporarily marks pageblocks where it fails to isolate pages
      as to-be-skipped in further compactions, in order to improve efficiency.
      One of the reasons to fail isolating pages is that isolation is not
      attempted in pageblocks that are not of MIGRATE_MOVABLE (or CMA) type.
      
      The problem is that blocks skipped due to not being MIGRATE_MOVABLE in
      async compaction become skipped due to the temporary mark also in future
      sync compaction.  Moreover, this may happen again quite soon, during
      __alloc_pages_slowpath(), without much time for kswapd to clear the
      pageblock skip marks.  This goes against the idea that sync compaction
      should try to scan these blocks more thoroughly than the async
      compaction.
      
      The fix is to ensure in async compaction that these !MIGRATE_MOVABLE
      blocks are not marked to be skipped.  Note this should not affect
      performance or locking impact of further async compactions, as skipping
      a block due to being !MIGRATE_MOVABLE is done soon after skipping a
      block marked to be skipped, both without locking.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: detect when scanners meet in isolate_freepages · 7ed695e0
      Authored by Vlastimil Babka
      Compaction of a zone is finished when the migrate scanner (which begins
      at the zone's lowest pfn) meets the free page scanner (which begins at
      the zone's highest pfn).  This is detected in compact_zone() and in the
      case of direct compaction, the compact_blockskip_flush flag is set so
      that kswapd later resets the cached scanner pfn's, and a new compaction
      may again start at the zone's borders.
      
      The meeting of the scanners can happen during either scanner's activity.
      However, it may currently fail to be detected when it occurs in the free
      page scanner, due to two problems.  First, isolate_freepages() keeps
      free_pfn at the highest block where it isolated pages from, for the
      purposes of not missing the pages that are returned back to allocator
      when migration fails.  Second, failing to isolate enough free pages due
      to scanners meeting results in -ENOMEM being returned by
      migrate_pages(), which makes compact_zone() bail out immediately without
      calling compact_finished() that would detect scanners meeting.
      
      This failure to detect scanners meeting might result in repeated
      attempts at compaction of a zone that keep starting from the cached
      pfn's close to the meeting point, and quickly failing through the
      -ENOMEM path, without the cached pfns being reset, over and over.  This
      has been observed (through additional tracepoints) in the third phase of
      the mmtests stress-highalloc benchmark, where the allocator runs on an
      otherwise idle system.  The problem was observed in the DMA32 zone,
      which was used as a fallback to the preferred Normal zone, but on the
      4GB system it was actually the largest zone.  The problem is even
      amplified for such fallback zone - the deferred compaction logic, which
      could (after being fixed by a previous patch) reset the cached scanner
      pfn's, is only applied to the preferred zone and not for the fallbacks.
      
      The problem in the third phase of the benchmark was further amplified by
      commit 81c0a2bb ("mm: page_alloc: fair zone allocator policy") which
      resulted in a non-deterministic regression of the allocation success
      rate from ~85% to ~65%.  This occurs in about half of benchmark runs,
      making bisection problematic.  It is unlikely that the commit itself is
      buggy, but it should put more pressure on the DMA32 zone during phases 1
      and 2, which may leave it more fragmented in phase 3 and expose the bugs
      that this patch fixes.
      
      The fix is to make scanners meeting in isolate_freepages() stay that way,
      and to check in compact_zone() for scanners meeting when migrate_pages()
      returns -ENOMEM.  The result is that compact_finished() also detects
      scanners meeting and sets the compact_blockskip_flush flag to make
      kswapd reset the scanner pfn's.
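
      The compact_zone() half of the fix, sketched (isolate_freepages() is
      additionally changed so that free_pfn no longer jumps back above
      migrate_pfn once the scanners have met; the migrate_pages() call is
      shown as assumed for this kernel version):

              err = migrate_pages(&cc->migratepages, compaction_alloc,
                                  (unsigned long)cc,
                                  cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
                                  MR_COMPACTION);
              if (err) {
                      putback_movable_pages(&cc->migratepages);
                      cc->nr_migratepages = 0;
                      /*
                       * migrate_pages() may return -ENOMEM when the free
                       * scanner failed because the scanners met; make sure
                       * compact_finished() still runs and detects that.
                       */
                      if (err == -ENOMEM && cc->free_pfn > cc->migrate_pfn) {
                              ret = COMPACT_PARTIAL;
                              goto out;
                      }
              }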
      
      The results in stress-highalloc benchmark show that the "regression" by
      commit 81c0a2bb in phase 3 no longer occurs, and phase 1 and 2
      allocation success rates are also significantly improved.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: reset cached scanner pfn's before reading them · d3132e4b
      Authored by Vlastimil Babka
      Compaction caches pfn's for its migrate and free scanners to avoid
      scanning the whole zone each time.  In compact_zone(), the cached values
      are read to set up initial values for the scanners.  There are several
      situations when these cached pfn's are reset to the first and last pfn
      of the zone, respectively.  One of these situations is when a compaction
      has been deferred for a zone and is now being restarted during a direct
      compaction, which is also done in compact_zone().
      
      However, compact_zone() currently reads the cached pfn's *before*
      resetting them.  This means the reset doesn't affect the compaction that
      performs it, and with good chance also subsequent compactions, as
      update_pageblock_skip() is likely to be called and update the cached
      pfn's to those being processed.  Another chance for a successful reset
      is when a direct compaction detects that the migration and free scanners
      meet (which has its own problems, addressed by another patch) and sets
      the compact_blockskip_flush flag, which kswapd uses to do the reset when
      it goes to sleep.
      
      This is clearly a bug that results in non-deterministic behavior, so
      this patch moves the cached pfn reset to be performed *before* the
      values are read.
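
      Sketched, the corrected order in compact_zone():

              /*
               * Clear the pageblock skip info and the cached scanner pfn's
               * if compaction is being restarted after being deferred,
               * *before* the cached values are read below.
               */
              if (compaction_restarting(zone, cc->order) && !current_is_kswapd())
                      __reset_isolation_suitable(zone);

              /* Set up the scanners from the (possibly just reset) cache: */
              cc->migrate_pfn = zone->compact_cached_migrate_pfn;
              cc->free_pfn = zone->compact_cached_free_pfn;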
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: encapsulate defer reset logic · de6c60a6
      Authored by Vlastimil Babka
      Currently there are several functions to manipulate the deferred
      compaction state variables.  The remaining case where the variables are
      touched directly is when a successful allocation occurs in direct
      compaction, or is expected to be successful in the future by kswapd.
      Here, the lowest order that is expected to fail is updated, and in the
      case of successful allocation, the deferred status and counter are reset
      completely.
      
      Create a new function compaction_defer_reset() to encapsulate this
      functionality and make it easier to understand the code.  No functional
      change.
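
      The new helper, approximately:

              /* Called when an allocation of this order succeeded, or is
               * expected by kswapd to succeed soon. */
              void compaction_defer_reset(struct zone *zone, int order,
                                          bool alloc_success)
              {
                      if (alloc_success) {
                              zone->compact_considered = 0;
                              zone->compact_defer_shift = 0;
                      }
                      if (order >= zone->compact_order_failed)
                              zone->compact_order_failed = order + 1;
              }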
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: compaction: trace compaction begin and end · 0eb927c0
      Authored by Mel Gorman
      The broad goal of the series is to improve allocation success rates for
      huge pages through memory compaction, while trying not to increase the
      compaction overhead.  The original objective was to reintroduce
      capturing of high-order pages freed by the compaction, before they are
      split by concurrent activity.  However, several bugs and opportunities
      for simple improvements were found in the current implementation, mostly
      through extra tracepoints (which are however too ugly for now to be
      considered for sending).
      
      The patches mostly deal with two mechanisms that reduce compaction
      overhead, which is caching the progress of migrate and free scanners,
      and marking pageblocks where isolation failed to be skipped during
      further scans.
      
      Patch 1 (from mgorman) adds tracepoints that allow calculating the time
              spent in compaction and potentially debugging scanner pfn values.
      
      Patch 2 encapsulates some of the functionality for handling deferred
              compaction, for better maintainability.  No functional change.
      
      Patch 3 fixes a bug where cached scanner pfn's are sometimes reset only after
              they have been read to initialize a compaction run.
      
      Patch 4 fixes a bug where scanners meeting is sometimes not properly detected
              and can lead to multiple compaction attempts quitting early without
              doing any work.
      
      Patch 5 improves the chances of sync compaction to process pageblocks that
              async compaction has skipped due to being !MIGRATE_MOVABLE.
      
      Patch 6 improves the chances of sync direct compaction to actually do anything
              when called after async compaction fails during allocation slowpath.
      
      The impact of the patches was validated using mmtests's stress-highalloc
      benchmark on an x86_64 machine with 4GB of memory.
      
      Due to instability of the results (mostly related to the bugs fixed by
      patches 2 and 3), 10 iterations were performed, taking min, mean and max
      values for success rates and mean values for time and vmstat-based
      metrics.
      
      First, the default GFP_HIGHUSER_MOVABLE allocations were tested with the
      patches stacked on top of v3.13-rc2.  Patch 2 is OK to serve as baseline
      due to no functional changes in 1 and 2.  Comments below.
      
      stress-highalloc
                                   3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                    2-nothp               3-nothp               4-nothp               5-nothp               6-nothp
      Success 1 Min          9.00 (  0.00%)       10.00 (-11.11%)       43.00 (-377.78%)       43.00 (-377.78%)       33.00 (-266.67%)
      Success 1 Mean        27.50 (  0.00%)       25.30 (  8.00%)       45.50 (-65.45%)       45.90 (-66.91%)       46.30 (-68.36%)
      Success 1 Max         36.00 (  0.00%)       36.00 (  0.00%)       47.00 (-30.56%)       48.00 (-33.33%)       52.00 (-44.44%)
      Success 2 Min         10.00 (  0.00%)        8.00 ( 20.00%)       46.00 (-360.00%)       45.00 (-350.00%)       35.00 (-250.00%)
      Success 2 Mean        26.40 (  0.00%)       23.50 ( 10.98%)       47.30 (-79.17%)       47.60 (-80.30%)       48.10 (-82.20%)
      Success 2 Max         34.00 (  0.00%)       33.00 (  2.94%)       48.00 (-41.18%)       50.00 (-47.06%)       54.00 (-58.82%)
      Success 3 Min         65.00 (  0.00%)       63.00 (  3.08%)       85.00 (-30.77%)       84.00 (-29.23%)       85.00 (-30.77%)
      Success 3 Mean        76.70 (  0.00%)       70.50 (  8.08%)       86.20 (-12.39%)       85.50 (-11.47%)       86.00 (-12.13%)
      Success 3 Max         87.00 (  0.00%)       86.00 (  1.15%)       88.00 ( -1.15%)       87.00 (  0.00%)       87.00 (  0.00%)
      
                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                   2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
      User         6437.72     6459.76     5960.32     5974.55     6019.67
      System       1049.65     1049.09     1029.32     1031.47     1032.31
      Elapsed      1856.77     1874.48     1949.97     1994.22     1983.15
      
                                    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                     2-nothp     3-nothp     4-nothp     5-nothp     6-nothp
      Minor Faults                 253952267   254581900   250030122   250507333   250157829
      Major Faults                       420         407         506         530         530
      Swap Ins                             4           9           9           6           6
      Swap Outs                          398         375         345         346         333
      Direct pages scanned            197538      189017      298574      287019      299063
      Kswapd pages scanned           1809843     1801308     1846674     1873184     1861089
      Kswapd pages reclaimed         1806972     1798684     1844219     1870509     1858622
      Direct pages reclaimed          197227      188829      298380      286822      298835
      Kswapd efficiency                  99%         99%         99%         99%         99%
      Kswapd velocity                953.382     970.449     952.243     934.569     922.286
      Direct efficiency                  99%         99%         99%         99%         99%
      Direct velocity                104.058     101.832     153.961     143.200     148.205
      Percentage direct scans             9%          9%         13%         13%         13%
      Zone normal velocity           347.289     359.676     348.063     339.933     332.983
      Zone dma32 velocity            710.151     712.605     758.140     737.835     737.507
      Zone dma velocity                0.000       0.000       0.000       0.000       0.000
      Page writes by reclaim         557.600     429.000     353.600     426.400     381.800
      Page writes file                   159          53           7          79          48
      Page writes anon                   398         375         345         346         333
      Page reclaim immediate             825         644         411         575         420
      Sector Reads                   2781750     2769780     2878547     2939128     2910483
      Sector Writes                 12080843    12083351    12012892    12002132    12010745
      Page rescued immediate               0           0           0           0           0
      Slabs scanned                  1575654     1545344     1778406     1786700     1794073
      Direct inode steals               9657       10037       15795       14104       14645
      Kswapd inode steals              46857       46335       50543       50716       51796
      Kswapd skipped wait                  0           0           0           0           0
      THP fault alloc                     97          91          81          71          77
      THP collapse alloc                 456         506         546         544         565
      THP splits                           6           5           5           4           4
      THP fault fallback                   0           1           0           0           0
      THP collapse fail                   14          14          12          13          12
      Compaction stalls                 1006         980        1537        1536        1548
      Compaction success                 303         284         562         559         578
      Compaction failures                702         696         974         976         969
      Page migrate success           1177325     1070077     3927538     3781870     3877057
      Page migrate failure                 0           0           0           0           0
      Compaction pages isolated      2547248     2306457     8301218     8008500     8200674
      Compaction migrate scanned    42290478    38832618   153961130   154143900   159141197
      Compaction free scanned       89199429    79189151   356529027   351943166   356326727
      Compaction cost                   1566        1426        5312        5156        5294
      NUMA PTE updates                     0           0           0           0           0
      NUMA hint faults                     0           0           0           0           0
      NUMA hint local faults               0           0           0           0           0
      NUMA hint local percent            100         100         100         100         100
      NUMA pages migrated                  0           0           0           0           0
      AutoNUMA cost                        0           0           0           0           0
      
      Observations:
      
      - The "Success 3" line is allocation success rate with system idle
        (phases 1 and 2 are with background interference).  I used to get stable
        values around 85% with vanilla 3.11.  The lower min and mean values came
        with 3.12.  This was bisected to commit 81c0a2bb ("mm: page_alloc: fair
        zone allocator policy").  As explained in the comment for patch 3, I
        don't think the commit is wrong, but that it makes the effect of the
        compaction bugs worse.  From patch 3 onwards, the results are OK and
        match the 3.11 results.
      
      - Patch 4 also clearly helps phases 1 and 2, and exceeds any results
        I've seen with 3.11 (I didn't measure it that thoroughly then, but it
        was never above 40%).
      
      - Compaction cost and number of scanned pages is higher, especially due
        to patch 4.  However, keep in mind that patches 3 and 4 fix existing
        bugs in the current design of compaction overhead mitigation, they do
        not change it.  If overhead is found unacceptable, then it should be
        decreased differently (and consistently, not due to random conditions)
        than the current implementation does.  In contrast, patches 5 and 6
        (which are not strictly bug fixes) do not increase the overhead (but
        also not success rates).  This might be a limitation of the
        stress-highalloc benchmark as it's quite uniform.
      
      Another set of results is for configuring stress-highalloc to allocate
      with flags similar to those THP uses:
       (GFP_HIGHUSER_MOVABLE|__GFP_NOMEMALLOC|__GFP_NORETRY|__GFP_NO_KSWAPD)
      
      stress-highalloc
                                   3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2              3.13-rc2
                                      2-thp                 3-thp                 4-thp                 5-thp                 6-thp
      Success 1 Min          2.00 (  0.00%)        7.00 (-250.00%)       18.00 (-800.00%)       19.00 (-850.00%)       26.00 (-1200.00%)
      Success 1 Mean        19.20 (  0.00%)       17.80 (  7.29%)       29.20 (-52.08%)       29.90 (-55.73%)       32.80 (-70.83%)
      Success 1 Max         27.00 (  0.00%)       29.00 ( -7.41%)       35.00 (-29.63%)       36.00 (-33.33%)       37.00 (-37.04%)
      Success 2 Min          3.00 (  0.00%)        8.00 (-166.67%)       21.00 (-600.00%)       21.00 (-600.00%)       32.00 (-966.67%)
      Success 2 Mean        19.30 (  0.00%)       17.90 (  7.25%)       32.20 (-66.84%)       32.60 (-68.91%)       35.70 (-84.97%)
      Success 2 Max         27.00 (  0.00%)       30.00 (-11.11%)       36.00 (-33.33%)       37.00 (-37.04%)       39.00 (-44.44%)
      Success 3 Min         62.00 (  0.00%)       62.00 (  0.00%)       85.00 (-37.10%)       75.00 (-20.97%)       64.00 ( -3.23%)
      Success 3 Mean        66.30 (  0.00%)       65.50 (  1.21%)       85.60 (-29.11%)       83.40 (-25.79%)       83.50 (-25.94%)
      Success 3 Max         70.00 (  0.00%)       69.00 (  1.43%)       87.00 (-24.29%)       86.00 (-22.86%)       87.00 (-24.29%)
      
                  3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                     2-thp       3-thp       4-thp       5-thp       6-thp
      User         6547.93     6475.85     6265.54     6289.46     6189.96
      System       1053.42     1047.28     1043.23     1042.73     1038.73
      Elapsed      1835.43     1821.96     1908.67     1912.74     1956.38
      
                                    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2    3.13-rc2
                                       2-thp       3-thp       4-thp       5-thp       6-thp
      Minor Faults                 256805673   253106328   253222299   249830289   251184418
      Major Faults                       395         375         423         434         448
      Swap Ins                            12          10          10          12           9
      Swap Outs                          530         537         487         455         415
      Direct pages scanned             71859       86046      153244      152764      190713
      Kswapd pages scanned           1900994     1870240     1898012     1892864     1880520
      Kswapd pages reclaimed         1897814     1867428     1894939     1890125     1877924
      Direct pages reclaimed           71766       85908      153167      152643      190600
      Kswapd efficiency                  99%         99%         99%         99%         99%
      Kswapd velocity               1029.000    1067.782    1000.091     991.049     951.218
      Direct efficiency                  99%         99%         99%         99%         99%
      Direct velocity                 38.897      49.127      80.747      79.983      96.468
      Percentage direct scans             3%          4%          7%          7%          9%
      Zone normal velocity           351.377     372.494     348.910     341.689     335.310
      Zone dma32 velocity            716.520     744.414     731.928     729.343     712.377
      Zone dma velocity                0.000       0.000       0.000       0.000       0.000
      Page writes by reclaim         669.300     604.000     545.700     538.900     429.900
      Page writes file                   138          66          58          83          14
      Page writes anon                   530         537         487         455         415
      Page reclaim immediate             806         655         772         548         517
      Sector Reads                   2711956     2703239     2811602     2818248     2839459
      Sector Writes                 12163238    12018662    12038248    11954736    11994892
      Page rescued immediate               0           0           0           0           0
      Slabs scanned                  1385088     1388364     1507968     1513292     1558656
      Direct inode steals               1739        2564        4622        5496        6007
      Kswapd inode steals              47461       46406       47804       48013       48466
      Kswapd skipped wait                  0           0           0           0           0
      THP fault alloc                    110          82          84          69          70
      THP collapse alloc                 445         482         467         462         539
      THP splits                           6           5           4           5           3
      THP fault fallback                   3           0           0           0           0
      THP collapse fail                   15          14          14          14          13
      Compaction stalls                  659         685        1033        1073        1111
      Compaction success                 222         225         410         427         456
      Compaction failures                436         460         622         646         655
      Page migrate success            446594      439978     1085640     1095062     1131716
      Page migrate failure                 0           0           0           0           0
      Compaction pages isolated      1029475     1013490     2453074     2482698     2565400
      Compaction migrate scanned     9955461    11344259    24375202    27978356    30494204
      Compaction free scanned       27715272    28544654    80150615    82898631    85756132
      Compaction cost                    552         555        1344        1379        1436
      NUMA PTE updates                     0           0           0           0           0
      NUMA hint faults                     0           0           0           0           0
      NUMA hint local faults               0           0           0           0           0
      NUMA hint local percent            100         100         100         100         100
      NUMA pages migrated                  0           0           0           0           0
      AutoNUMA cost                        0           0           0           0           0
      
      There are some differences from the previous results for THP-like allocations:
      
       - Here, the bad result for the unpatched kernel in phase 3 is much more
         consistent, staying between 65-70%, and is not related to the
         "regression" in 3.12.  There is still the improvement from patch 4
         onwards, which brings it on par with simple GFP_HIGHUSER_MOVABLE
         allocations.
      
       - Compaction costs have increased, but nowhere near as much as in the
         non-THP case.  Again, the patches should be worth the gained
         determinism.
      
       - Patches 5 and 6 somewhat increase the number of migrate-scanned pages.
         This is most likely due to the __GFP_NO_KSWAPD flag, which means the
         cached pfns and pageblock skip bits are not reset by kswapd as often
         (at least in phase 3, where no concurrent activity would wake up
         kswapd), and the patches thus help the sync-after-async compaction.
         The results do not, however, show that sync compaction helps success
         rates much, which can again be seen as a limitation of the benchmark
         scenario.
      
      This patch (of 6):
      
       Add two tracepoints marking the begin and end of compaction of a zone.
       Using these it is possible to calculate how much time a workload spends
       in compaction and potentially debug problems related to the cached pfns
       for scanning.  In combination with the direct reclaim and slab
       tracepoints it should be possible to estimate most allocation-related
       overhead for a workload.
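       As a sketch of the intended usage (the argument lists here are
       assumptions based on the description, not copied from the patch), the
       pair would bracket the scanner loop in compact_zone(), recording the
       zone bounds and the cached scanner positions:

           static int compact_zone(struct zone *zone, struct compact_control *cc)
           {
                   unsigned long start_pfn = zone->zone_start_pfn;
                   unsigned long end_pfn = zone_end_pfn(zone);
                   int ret;

                   /* record zone bounds and cached scanner starting positions */
                   trace_mm_compaction_begin(start_pfn, cc->migrate_pfn,
                                             cc->free_pfn, end_pfn);

                   /* ... migrate and free scanners run until they meet ... */

                   /* the delta to the begin event is the time spent compacting */
                   trace_mm_compaction_end(ret);
                   return ret;
           }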
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0eb927c0
    • M
      memcg, oom: lock mem_cgroup_print_oom_info · 947b3dd1
       Committed by Michal Hocko
       mem_cgroup_print_oom_info uses a static buffer (memcg_name) to store
       the name of the cgroup.  This is not safe, as pointed out by David
       Rientjes, because the memcg OOM path is locked only within its own
       hierarchy, and nothing prevents a parallel OOM in another hierarchy
       from overwriting the buffer while it is in use.

       This patch introduces oom_info_lock, hidden inside
       mem_cgroup_print_oom_info, which is held throughout the function.  It
       makes access to memcg_name safe and, as a bonus, also prevents
       parallel memcg OOMs from interleaving their statistics, which would
       otherwise make the printed data hard to analyze.
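       A minimal sketch of the fix (the function signature matches
       mm/memcontrol.c; the exact placement of the lock is an assumption):

           /* serializes use of the static memcg_name buffer and keeps
            * concurrent OOM reports from interleaving */
           static DEFINE_MUTEX(oom_info_lock);

           void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
                                          struct task_struct *p)
           {
                   mutex_lock(&oom_info_lock);
                   /* ... format memcg_name and dump usage statistics ... */
                   mutex_unlock(&oom_info_lock);
           }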
       Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
       Acked-by: David Rientjes <rientjes@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      947b3dd1
    • M
      mm: numa: do not automatically migrate KSM pages · 64a9a34e
       Committed by Mel Gorman
      KSM pages can be shared between tasks that are not necessarily related
      to each other from a NUMA perspective.  This patch causes those pages to
      be ignored by automatic NUMA balancing so they do not migrate and do not
      cause unrelated tasks to be grouped together.
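       A minimal sketch of the idea, assuming the check lands in the NUMA
       hinting fault path in mm/memory.c (the exact context is not taken
       from the patch):

           struct page *page;

           page = vm_normal_page(vma, addr, pte);
           if (!page || PageKsm(page))
                   return 0;       /* KSM pages are not migration candidates */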
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
       Cc: Alex Thorlton <athorlton@sgi.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64a9a34e
    • M
      mm: numa: trace tasks that fail migration due to rate limiting · af1839d7
       Committed by Mel Gorman
      A low local/remote numa hinting fault ratio is potentially explained by
      failed migrations.  This patch adds a tracepoint that fires when
      migration fails due to migration rate limitation.
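       As a sketch, the tracepoint would fire where the rate limiter rejects
       a migration; the name and arguments below are assumptions based on
       the description:

           if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
                   /* report who hit the limit and how many pages were
                    * denied, so a low local/remote fault ratio can be
                    * correlated with these events */
                   trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
                                                   nr_pages);
                   return true;    /* caller skips the migration */
           }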
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
       Cc: Alex Thorlton <athorlton@sgi.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af1839d7
    • M
      mm: numa: limit scope of lock for NUMA migrate rate limiting · 1c5e9c27
       Committed by Mel Gorman
       NUMA migrate rate limiting protects a migration counter and window
       using a lock, but in some cases this lock can be contended.  It is not
       critical that the number of pages be exact; lost updates are
       acceptable.  Reduce the scope of this lock accordingly.
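       A minimal sketch of the narrowed locking, assuming the rate-limiting
       helper in mm/migrate.c (the field and variable names are assumptions):

           /* only the window reset is serialized; the double check avoids
            * two racing resets */
           if (time_after(jiffies, pgdat->numabalancing_migrate_next_window)) {
                   spin_lock(&pgdat->numabalancing_migrate_lock);
                   if (time_after(jiffies,
                                  pgdat->numabalancing_migrate_next_window)) {
                           pgdat->numabalancing_migrate_nr_pages = 0;
                           pgdat->numabalancing_migrate_next_window = jiffies +
                                   msecs_to_jiffies(migrate_interval_millisecs);
                   }
                   spin_unlock(&pgdat->numabalancing_migrate_lock);
           }
           if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages)
                   return true;    /* rate limited */

           /* unlocked update: an occasional lost increment is acceptable */
           pgdat->numabalancing_migrate_nr_pages += nr_pages;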
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
       Cc: Alex Thorlton <athorlton@sgi.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c5e9c27
    • M
      mm: numa: make NUMA-migrate related functions static · 1c30e017
       Committed by Mel Gorman
      numamigrate_update_ratelimit and numamigrate_isolate_page only have
      callers in mm/migrate.c.  This patch makes them static.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
       Cc: Alex Thorlton <athorlton@sgi.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1c30e017
    • W
      mm/hwpoison: add '#' to hwpoison_inject · 4883e997
       Committed by Wanpeng Li
      Add '#' to hwpoison_inject just as done in madvise_hwpoison.
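       Assuming this mirrors the madvise_hwpoison message, the change amounts
       to adding the '#' alternate-form flag to the pfn format specifier so
       the value is printed with a leading 0x:

           /* before (assumed): pfn printed as bare hex digits */
           pr_info("Injecting memory failure at pfn %lx\n", pfn);

           /* after: '#' makes printk emit a 0x prefix, as madvise_hwpoison does */
           pr_info("Injecting memory failure at pfn %#lx\n", pfn);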
       Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Reviewed-by: Vladimir Murzin <murzin.v@gmail.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4883e997
    • G
      mm/memblock: use WARN_ONCE when MAX_NUMNODES passed as input parameter · 560dca27
       Committed by Grygorii Strashko
       Check the nid parameter and produce a warning if it has the deprecated
       MAX_NUMNODES value.  Also reassign NUMA_NO_NODE to the nid parameter
       in this case.

       This helps identify wrong API usage (the caller) and makes the code
       simpler.
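       The check is small enough to sketch; assuming it sits at the top of
       the affected memblock entry points, it could look like this:

           /* catch callers still passing the deprecated MAX_NUMNODES and
            * fix the value up so the rest of the function only ever sees
            * NUMA_NO_NODE */
           if (WARN_ONCE(nid == MAX_NUMNODES,
                         "Usage of MAX_NUMNODES is deprecated. Use NUMA_NO_NODE instead\n"))
                   nid = NUMA_NO_NODE;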
       Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      560dca27
    • S
      mm/memory_hotplug.c: use memblock apis for early memory allocations · 9e43aa2b
       Committed by Santosh Shilimkar
       Correct the ensure_zone_is_initialized() function description to match
       the newly introduced memblock APIs for early memory allocations.
       Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9e43aa2b
    • S
      mm/percpu.c: use memblock apis for early memory allocations · 999c17e3
       Committed by Santosh Shilimkar
       Switch to memblock interfaces for the early memory allocator instead
       of the bootmem allocator.  There is no functional change in behavior
       from the bootmem users' point of view.

       Archs already converted to NO_BOOTMEM now directly use the memblock
       interfaces instead of bootmem wrappers built on top of memblock.  On
       archs that still use bootmem, these new APIs simply fall back to the
       existing bootmem APIs.
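       A minimal illustration of the conversion pattern (the call site and
       size below are placeholders, not taken from the patch;
       memblock_virt_alloc_nopanic() is one of the new interfaces added
       earlier in this series):

           /* before: bootmem wrapper */
           ptr = alloc_bootmem_nopanic(size);

           /* after: memblock interface -- allocates straight from memblock
            * on NO_BOOTMEM archs and falls back to bootmem elsewhere */
           ptr = memblock_virt_alloc_nopanic(size, 0);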
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      999c17e3
    • G
      mm/page_cgroup.c: use memblock apis for early memory allocations · 0d036e9e
       Committed by Grygorii Strashko
       Switch to memblock interfaces for the early memory allocator instead
       of the bootmem allocator.  There is no functional change in behavior
       from the bootmem users' point of view.

       Archs already converted to NO_BOOTMEM now directly use the memblock
       interfaces instead of bootmem wrappers built on top of memblock.  On
       archs that still use bootmem, these new APIs simply fall back to the
       existing bootmem APIs.
       Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d036e9e
    • G
      mm/hugetlb.c: use memblock apis for early memory allocations · 8b89a116
       Committed by Grygorii Strashko
       Switch to memblock interfaces for the early memory allocator instead
       of the bootmem allocator.  There is no functional change in behavior
       from the bootmem users' point of view.

       Archs already converted to NO_BOOTMEM now directly use the memblock
       interfaces instead of bootmem wrappers built on top of memblock.  On
       archs that still use bootmem, these new APIs simply fall back to the
       existing bootmem APIs.
       Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8b89a116
    • S
      mm/sparse: use memblock apis for early memory allocations · bb016b84
       Committed by Santosh Shilimkar
       Switch to memblock interfaces for the early memory allocator instead
       of the bootmem allocator.  There is no functional change in behavior
       from the bootmem users' point of view.

       Archs already converted to NO_BOOTMEM now directly use the memblock
       interfaces instead of bootmem wrappers built on top of memblock.  On
       archs that still use bootmem, these new APIs simply fall back to the
       existing bootmem APIs.
       Signed-off-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Grygorii Strashko <grygorii.strashko@ti.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Paul Walmsley <paul@pwsan.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb016b84