1. 21 May 2016, 20 commits
    • userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Committed by Oleg Nesterov
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  The handle_userfault() path is the
      exception: it doesn't need this, because its caller must already hold a
      reference.
      
      The patch adds a new trivial helper, mmget_not_zero(); it may gain
      more users.
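      
      A minimal sketch of the new helper as described above, assuming the
      mm_users reference count is a plain atomic_t as it was at the time:
      
        /* pin mm->mm_users unless it has already dropped to zero */
        static inline bool mmget_not_zero(struct mm_struct *mm)
        {
                return atomic_inc_not_zero(&mm->mm_users);
        }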
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.com
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thp: microoptimize compound_mapcount() · 5f527c2b
      Committed by Andrea Arcangeli
      compound_mapcount() is only called after PageCompound() has already been
      checked by the caller, so there's no point in checking it again.  Gcc may
      optimize the check away anyway because the function is inline, but this
      removes the runtime check for sure and adds an assert instead.
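      
      A hedged sketch of the resulting helper, with the runtime check
      replaced by a debug-only assertion:
      
        static inline int compound_mapcount(struct page *page)
        {
                /* caller has already verified this */
                VM_BUG_ON_PAGE(!PageCompound(page), page);
                page = compound_head(page);
                return atomic_read(compound_mapcount_ptr(page)) + 1;
        }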
      
      Link: http://lkml.kernel.org/r/1462547040-1737-3-git-send-email-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Alex Williamson <alex.williamson@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use unsigned long constant for page flags · d2a1a1f0
      Committed by Yu Zhao
      struct page->flags is unsigned long, so when shifting bits we should use
      the UL suffix to match it.
      
      Found this problem after I added 64-bit CPU specific page flags and
      failed to compile the kernel:
      
        mm/page_alloc.c: In function '__free_one_page':
        mm/page_alloc.c:672:2: error: integer overflow in expression [-Werror=overflow]
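      
      An illustrative before/after, using PAGE_FLAGS_CHECK_AT_PREP as the
      example (the exact set of converted constants follows the patch):
      
        /* before: int arithmetic overflows once NR_PAGEFLAGS grows past 31 */
        #define PAGE_FLAGS_CHECK_AT_PREP  ((1 << NR_PAGEFLAGS) - 1)
        
        /* after: unsigned long matches the type of page->flags */
        #define PAGE_FLAGS_CHECK_AT_PREP  ((1UL << NR_PAGEFLAGS) - 1)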
      
      Link: http://lkml.kernel.org/r/1461971723-16187-1-git-send-email-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix comments: if SPARSEMEM, pgdata doesn't have page_ext · 0c9ad804
      Committed by Weijie Yang
      If SPARSEMEM, use the page_ext in mem_section;
      if !SPARSEMEM, use the page_ext in pgdata.
      Signed-off-by: Weijie Yang <weijie.yang@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/hugetlb.h: use bool instead of int for hugepage_migration_supported() · d70c17d4
      Committed by Chen Gang
      It is used purely as a bool function throughout the kernel source.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • include/linux/hugetlb*.h: clean up code · 7fab358d
      Committed by Chen Gang
      The HUGETLBFS_SB macro is clear enough, so a single statement is clearer
      than three lines of statements.
      
      Also remove redundant return statements from functions which return
      nothing, which saves a few lines at least.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: tighten fault_in_pages_writeable() · b8ca9e3a
      Committed by Eric Dumazet
      copy_page_to_iter_iovec() is currently the only user of
      fault_in_pages_writeable(), and it definitely can use fragments from
      high order pages.
      
      Make sure fault_in_pages_writeable() is only touching two adjacent pages
      at most, as claimed.
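      
      A hedged sketch of the tightened helper: probe the first byte, and the
      last byte only if it lands on a different page:
      
        static inline int fault_in_pages_writeable(char __user *uaddr, int size)
        {
                int ret;
        
                if (unlikely(size == 0))
                        return 0;
                /* writing a zero faults the page in if necessary */
                ret = __put_user(0, uaddr);
                if (ret == 0) {
                        char __user *end = uaddr + size - 1;
        
                        /* touch the second page only if it differs */
                        if (((unsigned long)uaddr & PAGE_MASK) !=
                                        ((unsigned long)end & PAGE_MASK))
                                ret = __put_user(0, end);
                }
                return ret;
        }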
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc: keep a separate lazy-free list · 80c4bd7a
      Committed by Chris Wilson
      When mixing lots of vmallocs and set_memory_*() (which calls
      vm_unmap_aliases()) I encountered situations where the performance
      degraded severely due to walking the entire vmap_area list on each
      invocation.
      
      One simple improvement is to add the lazily freed vmap_area to a
      separate lockless free list, such that we then avoid having to walk the
      full list on each purge.
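      
      A hedged sketch of the approach, using the kernel's lockless llist for
      the lazy-free list:
      
        /* lazily freed vmap_areas collect here without taking a lock */
        static LLIST_HEAD(vmap_purge_list);
        
        static void free_vmap_area_noflush(struct vmap_area *va)
        {
                llist_add(&va->purge_list, &vmap_purge_list);
                /* the purger then walks only this list, not every vmap_area */
        }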
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reviewed-by: Roman Pen <r.peniaev@gmail.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Pen <r.peniaev@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Shawn Lin <shawn.lin@rock-chips.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,oom: speed up select_bad_process() loop · f44666b0
      Committed by Tetsuo Handa
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct, it was quite inefficient because
      each thread group was scanned num_threads times, which can be a lot
      especially for processes with many threads.  Even though the OOM path is
      extremely cold, it is always good to be as efficient as possible while
      inside rcu_read_lock() - i.e. a non-preemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and checks it in
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
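      
      A hedged sketch of the counter (the field name is assumed from the
      description above):
      
        struct signal_struct {
                /* ...existing members... */
                atomic_t oom_victims;   /* # of TIF_MEMDIE threads in group */
        };
        
        /* mark_oom_victim() then does atomic_inc(&tsk->signal->oom_victims),
         * and oom_scan_process_thread() tests the counter instead of walking
         * every thread looking for TIF_MEMDIE. */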
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: consider multi-threaded tasks in task_will_free_mem · 98748bd7
      Committed by Michal Hocko
      task_will_free_mem is a misnomer for a more complex PF_EXITING test for
      early break out from the oom killer because it is believed that such a
      task would release its memory shortly and so we do not have to select an
      oom victim and perform a disruptive action.
      
      Currently we make sure that the given task is not participating in the
      core dumping because it might get blocked for a long time - see commit
      d003f371 ("oom: don't assume that a coredumping thread will exit
      soon").
      
      The check can still do better though.  We shouldn't consider the task
      unless the whole thread group is going down.  This is rather unlikely
      but not impossible.  A single exiting thread would surely leave all the
      address space behind.  If we are really unlucky it might get stuck on
      the exit path and keep its TIF_MEMDIE and so block the oom killer.
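      
      A hedged sketch of the strengthened check; the thread_group_empty() /
      SIGNAL_GROUP_EXIT combination is the new part:
      
        static inline bool task_will_free_mem(struct task_struct *task)
        {
                struct signal_struct *sig = task->signal;
        
                /* coredumping may block for a long time - see d003f371 */
                if (sig->flags & SIGNAL_GROUP_COREDUMP)
                        return false;
        
                if (!(task->flags & PF_EXITING))
                        return false;
        
                /* make sure the whole thread group is going down */
                if (!thread_group_empty(task) &&
                                !(sig->flags & SIGNAL_GROUP_EXIT))
                        return false;
        
                return true;
        }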
      
      Link: http://lkml.kernel.org/r/1460452756-15491-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Committed by Michal Hocko
      Tetsuo has properly noted that the mmput slow path might get blocked
      waiting for another party (e.g.  exit_aio waits for an IO).  If that
      happens the oom_reaper would be blocked and unable to process the next
      oom victim.  We should strive to make this context as reliable and as
      independent of other subsystems as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because in most cases the oom_reaper has already reclaimed as
      much of the victim's address space as possible and the remaining context
      shouldn't pin too much memory anymore.  The only exception is when the
      mmap_sem trylock has failed, which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
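      
      A hedged sketch of the new helper: the final mm_users drop defers the
      slow path to a workqueue:
      
        static void mmput_async_fn(struct work_struct *work)
        {
                struct mm_struct *mm = container_of(work, struct mm_struct,
                                                    async_put_work);
                __mmput(mm);    /* the ordinary, possibly blocking, slow path */
        }
        
        void mmput_async(struct mm_struct *mm)
        {
                if (atomic_dec_and_test(&mm->mm_users)) {
                        INIT_WORK(&mm->async_put_work, mmput_async_fn);
                        schedule_work(&mm->async_put_work);
                }
        }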
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Committed by Michal Hocko
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for the oom reaped
      task but also sets OOM_SCORE_ADJ_MIN for the target task to hide it from
      the oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share the signal struct).  If the mm had only a small
      amount of memory which could be reaped then another task sharing the mm
      could be selected and that wouldn't help to move out of the oom
      situation.
      
      Introduce an MMF_OOM_REAPED mm flag which is checked in oom_badness
      (same as OOM_SCORE_ADJ_MIN); the task is skipped if the flag is set.
      Set the flag after __oom_reap_task is done with a task.  This forces
      select_bad_process() to ignore all already oom reaped tasks and ensures
      that no such task is sacrificed on behalf of its parent.
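      
      A hedged sketch of the oom_badness() short-circuit:
      
        /* in oom_badness(), next to the existing OOM_SCORE_ADJ_MIN test */
        p = find_lock_task_mm(p);
        if (!p)
                return 0;
        if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN ||
                        test_bit(MMF_OOM_REAPED, &p->mm->flags)) {
                task_unlock(p);
                return 0;       /* already reaped: not worth killing again */
        }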
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Committed by Michal Hocko
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch missed an interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over the min
      watermark while compaction backs off and returns COMPACT_SKIPPED when
      all zones are below the low watermark + 1<<order gap.  If we are getting
      really close to OOM then __compaction_suitable can keep returning
      COMPACT_SKIPPED for a high order request (e.g.  hugetlb order-9) while
      the reclaim is not able to release enough pages to get us over the low
      watermark.  The reclaim is still able to make some progress (usually
      thrashing over the few remaining pages) so we are not able to break out
      of the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizon.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce a compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      unblock __compaction_suitable if more memory got reclaimed.  In this
      implementation it checks __compaction_suitable with NR_FREE_PAGES plus
      part of the reclaimable memory as the target for the watermark check.
      The reclaimable memory is reduced linearly by the allocation order.  The
      idea is that we do not want to reclaim all the remaining memory for a
      single allocation request just to unblock __compaction_suitable, which
      doesn't guarantee we will make further progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided, so we do not retry if there is no outlook for further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require at least 64kB on the reclaimable LRUs while order-9
      would need at least 32M, which should be enough to not lock up.
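      
      A hedged sketch of the new helper; the iterator and the classzone_idx
      plumbing are approximated from the surrounding allocator code:
      
        bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
                                          int alloc_flags)
        {
                struct zone *zone;
                struct zoneref *z;
        
                for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
                                ac->high_zoneidx, ac->nodemask) {
                        unsigned long available;
        
                        /* reclaimable target shrinks linearly with order */
                        available = zone_reclaimable_pages(zone) / order;
                        available += zone_page_state_snapshot(zone,
                                                              NR_FREE_PAGES);
                        if (__compaction_suitable(zone, order, alloc_flags,
                                        ac->classzone_idx, available)
                                        == COMPACT_CONTINUE)
                                return true;
                }
                return false;
        }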
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: rework oom detection · 0a0337e0
      Committed by Michal Hocko
      __alloc_pages_slowpath has traditionally relied on the direct reclaim
      and did_some_progress as an indicator that it makes sense to retry
      allocation rather than declaring OOM.  shrink_zones had to rely on
      zone_reclaimable if shrink_zone didn't make any progress, to prevent a
      premature OOM killer invocation - the LRU might be full of dirty or
      writeback pages and direct reclaim cannot clean those up.
      
      zone_reclaimable allows rescanning the reclaimable lists several times
      and restarting if a page is freed.  This is really subtle behavior and
      it might lead to a livelock when a single freed page keeps the allocator
      looping but the current task will not be able to allocate that single
      page.  The OOM killer would be more appropriate than looping without any
      progress for an unbounded amount of time.
      
      This patch changes the OOM detection logic and pulls it out of
      shrink_zone, which is at too low a level for any high level decision
      such as OOM, which is a per-zonelist property.  It is
      __alloc_pages_slowpath which knows how many attempts have been made and
      what progress was achieved so far, so it is more appropriate to
      implement this logic there.
      
      The new heuristic is implemented in should_reclaim_retry helper called
      from __alloc_pages_slowpath.  It tries to be more deterministic and
      easier to follow.  It builds on an assumption that retrying makes sense
      only if the currently reclaimable memory + free pages would allow the
      current allocation request to succeed (as per __zone_watermark_ok) at
      least for one zone in the usable zonelist.
      
      This alone wouldn't be sufficient, though, because the writeback might
      get stuck and reclaimable pages might be pinned for a really long time
      or even depend on the current allocation context.  Therefore there is a
      backoff mechanism implemented which reduces the reclaim target after
      each reclaim round without any progress.  This means that we should
      eventually converge to only NR_FREE_PAGES as the target and fail on the
      wmark check and proceed to OOM.  The backoff is simple and linear with
      1/16 of the reclaimable pages for each round without any progress.  We
      are optimistic and reset counter for successful reclaim rounds.
      
      Costly high order pages mostly preserve their semantics: those without
      __GFP_REPEAT fail right away while those which have the flag set will
      back off after the amount of reclaimable pages reaches the equivalent of
      the requested order.  The only difference is that if there was no
      progress during the reclaim we rely on the zone watermark check.  This
      is a more logical thing to do than the previous 1<<order attempts, which
      were a result of zone_reclaimable faking the progress.
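      
      A hedged sketch of the core of should_reclaim_retry(), assuming
      MAX_RECLAIM_RETRIES is 16; this is a fragment run per usable zone:
      
        /* for each usable zone in the zonelist: */
        available = zone_reclaimable_pages(zone);
        /* back off linearly: 1/16 of the target per no-progress round */
        available -= DIV_ROUND_UP(no_progress_loops * available,
                                  MAX_RECLAIM_RETRIES);
        available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
        
        /* retry only if reclaiming that much could satisfy the request */
        if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
                                classzone_idx, alloc_flags, available))
                return true;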
      
      [vdavydov@virtuozzo.com: check classzone_idx for shrink_zone]
      [hannes@cmpxchg.org: separate the heuristic into should_reclaim_retry]
      [rientjes@google.com: use zone_page_state_snapshot for NR_FREE_PAGES]
      [rientjes@google.com: shrink_zones doesn't need to return anything]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: abstract compaction feedback to helpers · cab1802b
      Committed by Michal Hocko
      Compaction can provide a wide variety of feedback to the caller.  Much
      of it is implementation specific, and the caller of the compaction
      (especially the page allocator) shouldn't be bound to specifics of the
      current implementation.
      
      This patch abstracts the feedback into three basic types (see the
      sketch after this list):
      	- compaction_made_progress - compaction was active and made some
      	  progress.
      	- compaction_failed - compaction failed and further attempts to
      	  invoke it would most probably fail and therefore it is not
      	  worth retrying
      	- compaction_withdrawn - compaction wasn't invoked for an
                implementation specific reasons. In the current implementation
                it means that the compaction was deferred, contended or the
                page scanners met too early without any progress. Retrying is
                still worthwhile.
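      
      A hedged sketch of the three helpers over the resulting compact_result
      enum:
      
        static inline bool compaction_made_progress(enum compact_result result)
        {
                /* some pageblocks were successfully isolated and migrated */
                return result == COMPACT_PARTIAL;
        }
        
        static inline bool compaction_failed(enum compact_result result)
        {
                /* the full zone was scanned and still nothing came out */
                return result == COMPACT_COMPLETE;
        }
        
        static inline bool compaction_withdrawn(enum compact_result result)
        {
                /* deferred, contended, skipped or a partial scan: back off,
                 * but retrying later may still be worthwhile */
                return result == COMPACT_SKIPPED ||
                       result == COMPACT_DEFERRED ||
                       result == COMPACT_CONTENDED ||
                       result == COMPACT_PARTIAL_SKIPPED;
        }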
      
      [vbabka@suse.cz: do not change thp back off behavior]
      [akpm@linux-foundation.org: fix typo in comment, per Hillf]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: update compaction_result ordering · 4f9a358c
      Committed by Michal Hocko
      compaction_result will be used as the primary feedback channel for
      compaction users.  At the same time try_to_compact_pages (and
      potentially others) assume a certain ordering where a more specific
      feedback takes precedence.
      
      This gets a bit awkward when we have conflicting feedback from different
      zones.  E.g. one zone returning COMPACT_COMPLETE, meaning the full zone
      has been scanned without any outcome, while another returns
      COMPACT_PARTIAL, i.e. some progress was made.  The caller should get
      COMPACT_PARTIAL because that means that the compaction can still make
      some progress.  The same applies for COMPACT_PARTIAL vs
      COMPACT_PARTIAL_SKIPPED.
      
      Reorder PARTIAL to be the largest value, so that the larger the value,
      the more progress has been made.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: distinguish between full and partial COMPACT_COMPLETE · c8f7de0b
      Committed by Michal Hocko
      COMPACT_COMPLETE now means that compaction and the free scanner met.
      This is not very useful information if somebody just wants to use this
      feedback and make decisions based on it.  The current caller might be a
      poor guy who just happened to scan a tiny portion of the zone, and that
      could be the reason no suitable pages were compacted.  Make sure we
      distinguish full and partial zone walks.
      
      Consumers should treat COMPACT_PARTIAL_SKIPPED as a potential success
      and be optimistic in retrying.
      
      The existing users of COMPACT_COMPLETE are conservatively changed to use
      COMPACT_PARTIAL_SKIPPED as well, but some of them should probably be
      reconsidered to defer the compaction only for COMPACT_COMPLETE under the
      new semantics.
      
      This patch shouldn't introduce any functional changes.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: distinguish COMPACT_DEFERRED from COMPACT_SKIPPED · 1d4746d3
      Committed by Michal Hocko
      try_to_compact_pages() can currently return COMPACT_SKIPPED even when
      the compaction is deferred for some zone, just because zone DMA is
      skipped in 99% of cases due to watermark checks.  This makes
      COMPACT_DEFERRED basically unusable for the page allocator as a feedback
      mechanism.
      
      Make sure we distinguish those two states properly and switch their
      ordering in the enum.  This would mean that the COMPACT_SKIPPED will be
      returned only when all eligible zones are skipped.
      
      As a result COMPACT_DEFERRED handling for THP in __alloc_pages_slowpath
      will be more precise and we would bail out rather than reclaim.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: change COMPACT_ constants into enum · ea7ab982
      Committed by Michal Hocko
      The compaction code is doing weird dances between COMPACT_FOO -> int ->
      unsigned long, but there doesn't seem to be any reason for that.  All
      functions which return/use one of those constants expect no other
      values, so it really makes sense to define an enum for them and make it
      clear that no other values are expected.
      
      This is a pure cleanup and shouldn't introduce any functional changes.
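      
      An approximate sketch of the enum's shape after this series; the exact
      members and their order follow the ordering patches above:
      
        enum compact_result {
                COMPACT_SKIPPED,        /* not possible / reclaim first */
                COMPACT_DEFERRED,       /* deferred due to recent failures */
                COMPACT_CONTINUE,       /* keep compacting more pageblocks */
                COMPACT_COMPLETE,       /* full zone scanned, no luck */
                COMPACT_PARTIAL_SKIPPED,/* partial zone scanned, no luck */
                COMPACT_CONTENDED,      /* aborted on lock contention */
                COMPACT_PARTIAL,        /* made progress: the highest value */
        };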
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: reduce size of inactive file list · 59dc76b0
      Committed by Rik van Riel
      The inactive file list should still be large enough to contain readahead
      windows and freshly written file data, but it no longer is the only
      source for detecting multiple accesses to file pages.  The workingset
      refault measurement code causes recently evicted file pages that get
      accessed again after a shorter interval to be promoted directly to the
      active list.
      
      With that mechanism in place, we can afford to (on a larger system)
      dedicate more memory to the active file list, so we can actually cache
      more of the frequently used file pages in memory, and not have them
      pushed out by streaming writes, once-used streaming file reads, etc.
      
      This can help things like database workloads, where only half the page
      cache can currently be used to cache the database working set.  This
      patch automatically increases that fraction on larger systems, using the
      same ratio that has already been used for anonymous memory.
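      
      A hedged sketch of the sizing rule, reusing the square-root ratio
      already used for anonymous memory:
      
        /* target: inactive * ratio >= active, with ratio ~ sqrt(10 * GB) */
        gb = (inactive + active) >> (30 - PAGE_SHIFT);
        if (gb)
                inactive_ratio = int_sqrt(10 * gb);
        else
                inactive_ratio = 1;
        
        return inactive * inactive_ratio < active;  /* "inactive is low" */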
      
      [hannes@cmpxchg.org: cgroup-awareness]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Andres Freund <andres@anarazel.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 20 May 2016, 20 commits
    • cpuset: use static key better and convert to new API · 002f2906
      Committed by Vlastimil Babka
      An important function for cpusets is cpuset_node_allowed(), which
      optimizes on the fact that if there is just the single root cpuset, the
      allocation must be trivially allowed.  But the "nr_cpusets() <= 1" check
      doesn't use the cpusets_enabled_key static key the right way: static
      keys can eliminate the branching overhead with jump labels.
      
      This patch converts it so that the static key is used properly.  It's
      also switched to the new static key API and the checking functions are
      converted to return bool instead of int.  We also provide a new variant
      __cpuset_zone_allowed() which expects that the static key check was
      already done and the key was enabled.  This is needed for
      get_page_from_freelist() where we want to also avoid the relatively
      slower check when ALLOC_CPUSET is not set in alloc_flags.
      
      The impact on the page allocator microbenchmark is less than expected
      but the cleanup in itself is worthwhile.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                             multcheck-v1r20               cpuset-v1r20
        Min      alloc-odr0-1               348.00 (  0.00%)           348.00 (  0.00%)
        Min      alloc-odr0-2               254.00 (  0.00%)           254.00 (  0.00%)
        Min      alloc-odr0-4               213.00 (  0.00%)           213.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           183.00 (  1.61%)
        Min      alloc-odr0-16              173.00 (  0.00%)           171.00 (  1.16%)
        Min      alloc-odr0-32              166.00 (  0.00%)           163.00 (  1.81%)
        Min      alloc-odr0-64              162.00 (  0.00%)           159.00 (  1.85%)
        Min      alloc-odr0-128             160.00 (  0.00%)           157.00 (  1.88%)
        Min      alloc-odr0-256             169.00 (  0.00%)           166.00 (  1.78%)
        Min      alloc-odr0-512             180.00 (  0.00%)           180.00 (  0.00%)
        Min      alloc-odr0-1024            188.00 (  0.00%)           187.00 (  0.53%)
        Min      alloc-odr0-2048            194.00 (  0.00%)           193.00 (  0.52%)
        Min      alloc-odr0-4096            199.00 (  0.00%)           198.00 (  0.50%)
        Min      alloc-odr0-8192            202.00 (  0.00%)           201.00 (  0.50%)
        Min      alloc-odr0-16384           203.00 (  0.00%)           202.00 (  0.49%)
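      
      A hedged sketch of the converted check on the new static key API:
      
        DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
        
        static inline bool cpusets_enabled(void)
        {
                /* compiles to a patched jump, not a load-and-branch */
                return static_branch_unlikely(&cpusets_enabled_key);
        }
        
        static inline bool cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
        {
                if (cpusets_enabled())
                        return __cpuset_zone_allowed(z, gfp_mask);
                return true;    /* single root cpuset: trivially allowed */
        }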
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: inline pageblock lookup in page free fast paths · 0b423ca2
      Committed by Mel Gorman
      The function call overhead of get_pfnblock_flags_mask() is measurable in
      the page free paths.  This patch uses an inlined version that is faster.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: avoid looking up the first zone in a zonelist twice · c33d6c06
      Committed by Mel Gorman
      The allocator fast path looks up the first usable zone in a zonelist and
      then get_page_from_freelist does the same job in the zonelist iterator.
      This patch preserves the necessary information.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                              fastmark-v1r20             initonce-v1r20
        Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
        Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
        Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
        Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
        Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
        Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
        Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
        Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
        Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
        Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
        Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
        Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
        Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
        Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
      
      The benefit is negligible and the results are within the noise but each
      cycle counts.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: simplify last cpupid reset · 09940a4f
      Committed by Mel Gorman
      The current reset unnecessarily clears flags and makes pointless
      calculations.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: convert alloc_flags to unsigned · c603844b
      Committed by Mel Gorman
      alloc_flags is a bitmask of flags but it is signed which does not
      necessarily generate the best code depending on the compiler.  Even
      without an impact, it makes more sense that this be unsigned.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: inline the fast path of the zonelist iterator · 682a3385
      Committed by Mel Gorman
      The page allocator iterates through a zonelist for zones that match the
      addressing limitations and nodemask of the caller, but many allocations
      will not be restricted.  Despite this, there is always function call
      overhead which builds up.
      
      This patch inlines the optimistic basic case and only calls the iterator
      function for the complex case.  A hindrance was the fact that
      cpuset_current_mems_allowed is used in the fastpath as the allowed
      nodemask even though all nodes are allowed on most systems.  The patch
      handles this by only considering cpuset_current_mems_allowed if a cpuset
      exists.  As well as being faster in the fast-path, this removes some
      junk in the slowpath.
      
      The performance difference on a page allocator microbenchmark is:
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statinline-v1r20              optiter-v1r20
        Min      alloc-odr0-1               412.00 (  0.00%)           382.00 (  7.28%)
        Min      alloc-odr0-2               301.00 (  0.00%)           282.00 (  6.31%)
        Min      alloc-odr0-4               247.00 (  0.00%)           233.00 (  5.67%)
        Min      alloc-odr0-8               215.00 (  0.00%)           203.00 (  5.58%)
        Min      alloc-odr0-16              199.00 (  0.00%)           188.00 (  5.53%)
        Min      alloc-odr0-32              191.00 (  0.00%)           182.00 (  4.71%)
        Min      alloc-odr0-64              187.00 (  0.00%)           177.00 (  5.35%)
        Min      alloc-odr0-128             185.00 (  0.00%)           175.00 (  5.41%)
        Min      alloc-odr0-256             193.00 (  0.00%)           184.00 (  4.66%)
        Min      alloc-odr0-512             207.00 (  0.00%)           197.00 (  4.83%)
        Min      alloc-odr0-1024            213.00 (  0.00%)           203.00 (  4.69%)
        Min      alloc-odr0-2048            220.00 (  0.00%)           209.00 (  5.00%)
        Min      alloc-odr0-4096            226.00 (  0.00%)           214.00 (  5.31%)
        Min      alloc-odr0-8192            229.00 (  0.00%)           218.00 (  4.80%)
        Min      alloc-odr0-16384           229.00 (  0.00%)           219.00 (  4.37%)
      
      perf indicated that next_zones_zonelist disappeared in the profile and
      __next_zones_zonelist did not appear.  This is expected as the
      micro-benchmark would hit the inlined fast-path every time.
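      
      A hedged sketch of the inlined fast path; the complex cases still go
      through the out-of-line __next_zones_zonelist():
      
        static __always_inline struct zoneref *
        next_zones_zonelist(struct zoneref *z, enum zone_type highest_zoneidx,
                            nodemask_t *nodes)
        {
                /* common case: no nodemask and the zone index already fits */
                if (likely(!nodes && zonelist_zone_idx(z) <= highest_zoneidx))
                        return z;
                return __next_zones_zonelist(z, highest_zoneidx, nodes);
        }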
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: inline zone_statistics · 060e7417
      Committed by Mel Gorman
      zone_statistics has one call-site but it's a public function.  Make it
      static and inline.
      
      The performance difference on a page allocator microbenchmark is:
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statbranch-v1r20           statinline-v1r20
        Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
        Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
        Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
        Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
        Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
        Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
        Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
        Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
        Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
        Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
        Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
        Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
        Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
        Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
        Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: use new PageAnonHead helper in the free page fast path · 17514574
      Committed by Mel Gorman
      The PageAnon check always checks for compound_head but this is a
      relatively expensive check if the caller already knows the page is a
      head page.  This patch creates a helper and uses it in the page free
      path which only operates on head pages.
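      
      A hedged sketch of the helper: the same test as PageAnon, minus the
      compound_head() lookup:
      
        static __always_inline int PageAnonHead(struct page *page)
        {
                /* caller guarantees this is a head page */
                return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
        }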
      
      With this patch and "Only check PageCompound for high-order pages", the
      performance difference on a page allocator microbenchmark is:
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                                     vanilla           nocompound-v1r20
        Min      alloc-odr0-1               425.00 (  0.00%)           417.00 (  1.88%)
        Min      alloc-odr0-2               313.00 (  0.00%)           308.00 (  1.60%)
        Min      alloc-odr0-4               257.00 (  0.00%)           253.00 (  1.56%)
        Min      alloc-odr0-8               224.00 (  0.00%)           221.00 (  1.34%)
        Min      alloc-odr0-16              208.00 (  0.00%)           205.00 (  1.44%)
        Min      alloc-odr0-32              199.00 (  0.00%)           199.00 (  0.00%)
        Min      alloc-odr0-64              195.00 (  0.00%)           193.00 (  1.03%)
        Min      alloc-odr0-128             192.00 (  0.00%)           191.00 (  0.52%)
        Min      alloc-odr0-256             204.00 (  0.00%)           200.00 (  1.96%)
        Min      alloc-odr0-512             213.00 (  0.00%)           212.00 (  0.47%)
        Min      alloc-odr0-1024            219.00 (  0.00%)           219.00 (  0.00%)
        Min      alloc-odr0-2048            225.00 (  0.00%)           225.00 (  0.00%)
        Min      alloc-odr0-4096            230.00 (  0.00%)           231.00 ( -0.43%)
        Min      alloc-odr0-8192            235.00 (  0.00%)           234.00 (  0.43%)
        Min      alloc-odr0-16384           235.00 (  0.00%)           234.00 (  0.43%)
        Min      free-odr0-1                215.00 (  0.00%)           191.00 ( 11.16%)
        Min      free-odr0-2                152.00 (  0.00%)           136.00 ( 10.53%)
        Min      free-odr0-4                119.00 (  0.00%)           107.00 ( 10.08%)
        Min      free-odr0-8                106.00 (  0.00%)            96.00 (  9.43%)
        Min      free-odr0-16                97.00 (  0.00%)            87.00 ( 10.31%)
        Min      free-odr0-32                91.00 (  0.00%)            83.00 (  8.79%)
        Min      free-odr0-64                89.00 (  0.00%)            81.00 (  8.99%)
        Min      free-odr0-128               88.00 (  0.00%)            80.00 (  9.09%)
        Min      free-odr0-256              106.00 (  0.00%)            95.00 ( 10.38%)
        Min      free-odr0-512              116.00 (  0.00%)           111.00 (  4.31%)
        Min      free-odr0-1024             125.00 (  0.00%)           118.00 (  5.60%)
        Min      free-odr0-2048             133.00 (  0.00%)           126.00 (  5.26%)
        Min      free-odr0-4096             136.00 (  0.00%)           130.00 (  4.41%)
        Min      free-odr0-8192             138.00 (  0.00%)           130.00 (  5.80%)
        Min      free-odr0-16384            137.00 (  0.00%)           130.00 (  5.11%)
      
      There is a sizable boost to free path performance.  While there is an
      apparent boost on the allocation side, it's likely a coincidence or due
      to the patches slightly reducing cache footprint.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom, oom_reaper: try to reap tasks which skip regular OOM killer path · 3ef22dff
      Committed by Michal Hocko
      If either the current task is already killed or PF_EXITING, or a
      selected task is PF_EXITING, then the oom killer is suppressed - and so
      is the oom reaper.  This patch adds try_oom_reaper, which checks the
      given task and queues it for the oom reaper if that is safe, meaning
      that the task doesn't share the mm with a live process.
      
      This might help to release the memory pressure while the task tries to
      exit.
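      
      A hedged sketch of the safety check; process_shares_mm() as the helper
      name for the mm-sharing test is an assumption here:
      
        static void try_oom_reaper(struct task_struct *tsk)
        {
                struct mm_struct *mm = tsk->mm;
                struct task_struct *p;
        
                if (!mm)
                        return;
        
                rcu_read_lock();
                for_each_process(p) {
                        if (!process_shares_mm(p, mm))
                                continue;
                        if (same_thread_group(p, tsk))
                                continue;
                        /* mm is shared with a live, unkilled process */
                        if (!fatal_signal_pending(p)) {
                                rcu_read_unlock();
                                return;
                        }
                }
                rcu_read_unlock();
        
                wake_oom_reaper(tsk);
        }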
      
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • huge mm: move_huge_pmd does not need new_vma · bf8616d5
      Committed by Hugh Dickins
      Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
      a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
      the old vma, so a trans_huge_pmd in the new_vma will be as acceptable as
      it was in the old vma, alignment and size permitting.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Committed by Hugh Dickins
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      often go negative some of the time.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
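      
      A hedged sketch of the handler's core: fold the per-cpu deltas on every
      cpu, then sanity-check the resulting globals:
      
        long val;
        int err, i;
        
        err = schedule_on_each_cpu(refresh_vm_stats);
        if (err)
                return err;
        
        for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
                val = atomic_long_read(&vm_stat[i]);
                if (val < 0) {
                        switch (i) {
                        case NR_ALLOC_BATCH:
                        case NR_PAGES_SCANNED:
                                /* known to go transiently negative: punt */
                                break;
                        default:
                                pr_warn("%s: %s %ld\n",
                                        __func__, vmstat_text[i], val);
                                err = -EINVAL;
                                break;
                        }
                }
        }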
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tmpfs: preliminary minor tidyups · 75edd345
      Committed by Hugh Dickins
      Make a few cleanups in mm/shmem.c, before going on to complicate it.
      
      shmem_alloc_page() will become more complicated: we can't afford to
      have that complication duplicated between a CONFIG_NUMA version and a
      !CONFIG_NUMA version, so rearrange the #ifdef'ery there to yield a
      single shmem_swapin() and a single shmem_alloc_page().
      
      Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
      configurations, but eliminating it is a larger cleanup: I have an
      alloc_pages_mpol() patchset not yet ready - mpol handling is subtle and
      bug-prone, and changed yet again since my last version.
      
      Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp() to
      shmem_alloc_page(): that SwapBacked flag will be useful in future, to
      help to distinguish different cases appropriately.
      
      And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
      little use (IIRC it dates back to when shmem_getpage() returned the page
      unlocked): kill it and do the necessary in shmem_file_read_iter().
      
      But an arm64 build then complained that info may be uninitialized (where
      shmem_getpage_gfp() deletes a freshly alloced page beyond eof), and
      advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update_lru_size do the __mod_zone_page_state · 9d5e6a9f
      Committed by Hugh Dickins
      Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
      reclaim was removed) that lru_size can be updated by -nr_taken once per
      call to isolate_lru_pages(), instead of page by page.
      
      Update it inside isolate_lru_pages(), or at its two callsites? I chose
      to update it at the callsites, rearranging and grouping the updates by
      nr_taken and nr_scanned together in both.
      
      With one exception, mem_cgroup_update_lru_size(,lru,) is then used where
      __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall be adding
      some more calls in a future commit.  Make the code a little smaller and
      simpler by incorporating stat update in lru_size update.
      
      The exception was move_active_pages_to_lru(), which aggregated the
      pgmoved stat update separately from the individual lru_size updates; but
      I still think this a simplification worth making.
      
      However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
      better to use the name update_lru_size, which calls
      mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled.
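      
      A hedged sketch of the combined helper:
      
        static __always_inline void update_lru_size(struct lruvec *lruvec,
                                enum lru_list lru, int nr_pages)
        {
                /* the zone stat is updated unconditionally */
                __mod_zone_page_state(lruvec_zone(lruvec),
                                      NR_LRU_BASE + lru, nr_pages);
        #ifdef CONFIG_MEMCG
                mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
        #endif
        }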
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: update_lru_size warn and reset bad lru_size · ca707239
      Committed by Hugh Dickins
      Though debug kernels have a VM_BUG_ON to help protect from misaccounting
      lru_size, non-debug kernels are liable to wrap it around: and then the
      vast unsigned long size draws page reclaim into a loop of repeatedly
      doing nothing on an empty list, without even a cond_resched().
      
      That soft lockup looks confusingly like an over-busy reclaim scenario,
      with lots of contention on the lru_lock in shrink_inactive_list(): yet
      has a totally different origin.
      
      Help differentiate with a custom warning in
      mem_cgroup_update_lru_size(), even in non-debug kernels; and reset the
      size to avoid the lockup.  But the particular bug which suggested this
      change was mine alone, and since fixed.
      
      Make it a WARN_ONCE: the first occurrence is the most informative, a
      flurry may follow, yet even when rate-limited little more is learnt.
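      
      A hedged sketch of the warn-and-reset inside
      mem_cgroup_update_lru_size() (the surrounding bookkeeping is
      simplified):
      
        size = *lru_size + nr_pages;
        if (WARN_ONCE(size < 0, "%s(%p, %d, %d): lru_size %ld\n",
                      __func__, lruvec, lru, nr_pages, size)) {
                VM_BUG_ON(1);   /* debug kernels still BUG here */
                *lru_size = 0;  /* reset so reclaim cannot loop on it */
        } else {
                *lru_size = size;
        }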
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: uninline page_mapped() · 1aa8aea5
      Committed by Andrew Morton
      It's huge.  Uninlining it saves 206 bytes per callsite.  Shaves 4924
      bytes from the x86_64 allmodconfig vmlinux.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/highmem: simplify is_highmem() · 29f9cb53
      Committed by Chanho Min
      is_highmem() can be simplified by use of is_highmem_idx().  This patch
      removes redundant code and will make it easier to maintain if the zone
      policy is changed or a new zone is added.
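      
      A hedged sketch of the simplified function:
      
        static inline int is_highmem(struct zone *zone)
        {
        #ifdef CONFIG_HIGHMEM
                return is_highmem_idx(zone_idx(zone));
        #else
                return 0;
        #endif
        }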
      
      (akpm: saves me 25 bytes of text per is_highmem() callsite)
      Signed-off-by: Chanho Min <chanho.min@lge.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mempolicy.c: vma_migratable() can return bool · 4ee815be
      Committed by Yaowei Bai
      Make vma_migratable() return bool due to this particular function only
      using either one or zero as its return value.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmalloc.c: is_vmalloc_addr() can return bool · bb00a789
      Committed by Yaowei Bai
      Make is_vmalloc_addr() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: is_mem_section_removable() can return bool · c98940f6
      Committed by Yaowei Bai
      Make is_mem_section_removable() return bool to improve readability due
      to this particular function only using either one or zero as its return
      value.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb: is_vm_hugetlb_page() can return bool · 32f6271d
      Committed by Yaowei Bai
      Make is_vm_hugetlb_page() return bool to improve readability due to this
      particular function only using either one or zero as its return value.
      Signed-off-by: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>