1. 09 May 2017 (1 commit)
    • mm, page_alloc: split smallest stolen page in fallback · 3bc48f96
      Committed by Vlastimil Babka
      The __rmqueue_fallback() function is called when there's no free page of
      requested migratetype, and we need to steal from a different one.
      
      There are various heuristics to make this event infrequent and reduce
      permanent fragmentation.  The main one is to try stealing from a
      pageblock that has the most free pages, and possibly steal them all at
      once and convert the whole pageblock.  Precise searching for such
      pageblock would be expensive, so instead the heuristic walks the free
      lists from MAX_ORDER down to the requested order and assumes that the
      block with the highest-order free page is likely to also have the most
      free pages in total.
      
      Chances are that together with the highest-order page, we also steal
      pages of lower orders from the same block.  But then we still split the
      highest order page.  This is wasteful and can contribute to
      fragmentation instead of avoiding it.
      
      This patch thus changes __rmqueue_fallback() to just steal the page(s)
      and put them on the freelist of the requested migratetype, and only
      report whether it was successful.  Then we pick (and eventually split)
      the smallest page with __rmqueue_smallest().  This all happens under
      zone lock, so nobody can steal it from us in the process.  This should
      reduce fragmentation due to fallbacks.  At worst we are only stealing a
      single highest-order page and waste some cycles by moving it between
      lists and then removing it, but fallback is not exactly hot path so that
      should not be a concern.  As a side benefit the patch removes some
      duplicate code by reusing __rmqueue_smallest().
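
      As a hedged sketch, the resulting flow in __rmqueue() looks roughly like
      this (helper names as in mm/page_alloc.c of this era; details simplified):

      	static struct page *__rmqueue(struct zone *zone, unsigned int order,
      				      int migratetype)
      	{
      		struct page *page;

      	retry:
      		/* always pick (and eventually split) the smallest suitable page */
      		page = __rmqueue_smallest(zone, order, migratetype);
      		if (unlikely(!page)) {
      			if (migratetype == MIGRATE_MOVABLE)
      				page = __rmqueue_cma_fallback(zone, order);

      			/*
      			 * __rmqueue_fallback() now only moves the stolen page(s) to
      			 * the requested migratetype's free lists and reports success,
      			 * so retry the smallest-page search under the same zone lock.
      			 */
      			if (!page && __rmqueue_fallback(zone, order, migratetype))
      				goto retry;
      		}
      		return page;
      	}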
      
      [vbabka@suse.cz: fix endless loop in the modified __rmqueue()]
        Link: http://lkml.kernel.org/r/59d71b35-d556-4fc9-ee2e-1574259282fd@suse.cz
      Link: http://lkml.kernel.org/r/20170307131545.28577-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3bc48f96
  2. 04 May 2017 (8 commits)
    • mm, page_alloc: remove debug_guardpage_minorder() test in warn_alloc() · 0f7896f1
      Committed by Tetsuo Handa
      Commit c0a32fc5 ("mm: more intensive memory corruption debugging")
      made the allocation-failure reporting path check
      debug_guardpage_minorder() > 0.  The reasoning was
      
        When we use guard page to debug memory corruption, it shrinks
        available pages to 1/2, 1/4, 1/8 and so on, depending on parameter
        value. In such a case memory allocation failures can be common and
        printing errors can flood dmesg. If somebody is debugging corruption,
        allocation failures are not what he/she is interested in.
      
      but this is misguided.
      
      Allocation requests with __GFP_NOWARN flag by definition do not cause
      flooding of allocation failure messages.  Allocation requests with
      __GFP_NORETRY flag likely also have __GFP_NOWARN flag.  Costly
      allocation requests likely also have __GFP_NOWARN flag.
      
      Allocation requests without __GFP_DIRECT_RECLAIM flag likely also have
      __GFP_NOWARN flag or __GFP_HIGH flag.  Non-costly allocation requests
      with __GFP_DIRECT_RECLAIM flag basically retry forever due to the "too
      small to fail" memory-allocation rule.
      
      Therefore, as a whole, shrinking available pages via the
      debug_guardpage_minorder= kernel boot parameter might cause flooding of
      OOM killer messages, but is unlikely to cause flooding of allocation
      failure messages.  Let's remove the debug_guardpage_minorder() > 0
      check, which is likely pointless.
      
      Link: http://lkml.kernel.org/r/1491910035-4231-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J . Wysocki" <rafael.j.wysocki@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f7896f1
    • mm: enable page poisoning early at boot · bd33ef36
      Committed by Vinayak Menon
      On SPARSEMEM systems page poisoning is enabled after buddy is up,
      because of the dependency on page extension init.  This causes the pages
      released by free_all_bootmem not to be poisoned.  This either delays or
      misses the identification of some issues because the pages have to
      undergo another cycle of alloc-free-alloc for any corruption to be
      detected.
      
      Enable page poisoning early by getting rid of the PAGE_EXT_DEBUG_POISON
      flag.  Since all the free pages will now be poisoned, the flag need not
      be verified before checking the poison during an alloc.
      
      [vinmenon@codeaurora.org: fix Kconfig]
        Link: http://lkml.kernel.org/r/1490878002-14423-1-git-send-email-vinmenon@codeaurora.org
      Link: http://lkml.kernel.org/r/1490358246-11001-1-git-send-email-vinmenon@codeaurora.org
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Acked-by: Laura Abbott <labbott@redhat.com>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd33ef36
    • mm: page_alloc: __GFP_NOWARN shouldn't suppress stall warnings · 82251963
      Committed by Johannes Weiner
      __GFP_NOWARN, which is usually added to avoid warnings from callsites
      that expect to fail and have fallbacks, currently also suppresses
      allocation stall warnings.  These trigger when an allocation is stuck
      inside the allocator for 10 seconds or longer.
      
      But there is no class of allocations that can get legitimately stuck in
      the allocator for this long.  This always indicates a problem.
      
      Always emit stall warnings.  Restrict __GFP_NOWARN to alloc failures.
      
      Link: http://lkml.kernel.org/r/20170125181150.GA16398@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82251963
    • mm: introduce memalloc_nofs_{save,restore} API · 7dea19f9
      Committed by Michal Hocko
      GFP_NOFS context is used for the following 5 reasons currently:
      
       - to prevent deadlocks when a lock held by the allocation
         context would be needed during memory reclaim

       - to prevent stack overflows during reclaim, because the
         allocation is performed from an already deep context

       - to prevent lockups when the allocation context indirectly depends
         on other reclaimers to make forward progress

       - just in case, because this would be safe from the fs POV

       - to silence lockdep false positives
      
      Unfortunately, overuse of this allocation context brings some problems
      to the MM.  Memory reclaim is much weaker (especially during heavy FS
      metadata workloads), and the OOM killer cannot be invoked because the MM
      layer doesn't have enough information about how much memory is freeable
      by the FS layer.
      
      In many cases it is far from clear why the weaker context is even used
      and so it might be used unnecessarily.  We would like to get rid of
      those as much as possible.  One way to do that is to use the flag in
      scopes rather than isolated cases.  Such a scope is declared when really
      necessary, tracked per task and all the allocation requests from within
      the context will simply inherit the GFP_NOFS semantic.
      
      Not only is this easier to understand and maintain, because there are
      far fewer problematic contexts than specific allocation requests, it
      also helps code paths where the FS layer interacts with other layers
      (e.g.  crypto, security modules, MM etc...) and there is no easy way to
      convey the allocation context between the layers.
      
      Introduce memalloc_nofs_{save,restore} API to control the scope of
      GFP_NOFS allocation context.  This is basically copying
      memalloc_noio_{save,restore} API we have for other restricted allocation
      context GFP_NOIO.  The PF_MEMALLOC_NOFS flag already exists and it is
      just an alias for PF_FSTRANS which has been xfs specific until recently.
      There are no PF_FSTRANS users anymore, so let's just drop it.
      
      PF_MEMALLOC_NOFS is now checked in the MM layer and drops __GFP_FS
      implicitly same as PF_MEMALLOC_NOIO drops __GFP_IO.  memalloc_noio_flags
      is renamed to current_gfp_context because it now cares about both
      PF_MEMALLOC_NOFS and PF_MEMALLOC_NOIO contexts.  Xfs code paths preserve
      their semantic.  kmem_flags_convert() doesn't need to evaluate the flag
      anymore.
      
      This patch shouldn't introduce any functional changes.
      
      Let's hope that filesystems will drop direct GFP_NOFS (resp.  ~__GFP_FS)
      usage as much as possible and only use properly documented
      memalloc_nofs_{save,restore} checkpoints where they are appropriate.
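
      For illustration, a typical use of the new scope API looks like the
      following minimal sketch (the GFP_KERNEL allocation in the middle is
      made up for the example):

      	struct page *page;
      	unsigned int nofs_flags;

      	nofs_flags = memalloc_nofs_save();
      	/*
      	 * Any allocation in this scope, even GFP_KERNEL, is implicitly
      	 * treated as GFP_NOFS because PF_MEMALLOC_NOFS is set on the task.
      	 */
      	page = alloc_page(GFP_KERNEL);
      	memalloc_nofs_restore(nofs_flags);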
      
      [akpm@linux-foundation.org: fix comment typo, reflow comment]
      Link: http://lkml.kernel.org/r/20170306131408.9828-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Brian Foster <bfoster@redhat.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Nikolay Borisov <nborisov@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7dea19f9
    • mm: use is_migrate_highatomic() to simplify the code · a6ffdc07
      Committed by Xishi Qiu
      Introduce two helpers, is_migrate_highatomic() and is_migrate_highatomic_page().
      
      Simplify the code, no functional changes.
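
      A sketch of what such a helper looks like (shown only to illustrate the
      cleanup; the real definitions were added to mm/internal.h):

      	static inline bool is_migrate_highatomic(enum migratetype migratetype)
      	{
      		return migratetype == MIGRATE_HIGHATOMIC;
      	}

      Call sites then read "if (is_migrate_highatomic(mt))" instead of the
      open-coded "mt == MIGRATE_HIGHATOMIC" comparison.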
      
      [akpm@linux-foundation.org: use static inlines rather than macros, per mhocko]
      Link: http://lkml.kernel.org/r/58B94F15.6060606@huawei.com
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a6ffdc07
    • mm: remove unnecessary back-off function when retrying page reclaim · 491d79ae
      Committed by Johannes Weiner
      The backoff mechanism is not needed.  If we have MAX_RECLAIM_RETRIES
      loops without progress, we'll OOM anyway; backing off might cut one or
      two iterations off that in the rare OOM case.  If we have intermittent
      success reclaiming a few pages, the backoff function gets reset also,
      and so is of little help in these scenarios.
      
      We might want a backoff function for when there IS progress, but not
      enough to be satisfactory.  But this isn't that.  Remove it.
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-10-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      491d79ae
    • mm: delete NR_PAGES_SCANNED and pgdat_reclaimable() · c822f622
      Committed by Johannes Weiner
      NR_PAGES_SCANNED counts number of pages scanned since the last page free
      event in the allocator.  This was used primarily to measure the
      reclaimability of zones and nodes, and determine when reclaim should
      give up on them.  In that role, it has been replaced in the preceding
      patches by a different mechanism.
      
      Being implemented as an efficient vmstat counter, it was automatically
      exported to userspace as well.  It's however unlikely that anyone
      outside the kernel is using this counter in any meaningful way.
      
      Remove the counter and the unused pgdat_reclaimable().
      
      Link: http://lkml.kernel.org/r/20170228214007.5621-8-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jia He <hejianet@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c822f622
    • mm: fix 100% CPU kswapd busyloop on unreclaimable nodes · c73322d0
      Committed by Johannes Weiner
      Patch series "mm: kswapd spinning on unreclaimable nodes - fixes and
      cleanups".
      
      Jia reported a scenario in which the kswapd of a node indefinitely spins
      at 100% CPU usage.  We have seen similar cases at Facebook.
      
      The kernel's current method of judging its ability to reclaim a node (or
      whether to back off and sleep) is based on the amount of scanned pages
      in proportion to the amount of reclaimable pages.  In Jia's and our
      scenarios, there are no reclaimable pages in the node, however, and the
      condition for backing off is never met.  Kswapd busyloops in an attempt
      to restore the watermarks while having nothing to work with.
      
      This series reworks the definition of an unreclaimable node based not on
      scanning but on whether kswapd is able to actually reclaim pages in
      MAX_RECLAIM_RETRIES (16) consecutive runs.  This is the same criterion
      the page allocator uses for giving up on direct reclaim and invoking the
      OOM killer.  If it cannot free any pages, kswapd will go to sleep and
      leave further attempts to direct reclaim invocations, which will either
      make progress and re-enable kswapd, or invoke the OOM killer.
      
      Patch #1 fixes the immediate problem Jia reported, the remainder are
      smaller fixlets, cleanups, and overall phasing out of the old method.
      
      Patch #6 is the odd one out.  It's a nice cleanup to get_scan_count(),
      and directly related to #5, but in itself not relevant to the series.
      
      If the whole series is too ambitious for 4.11, I would consider the
      first three patches fixes, the rest cleanups.
      
      This patch (of 9):
      
      Jia He reports a problem with kswapd spinning at 100% CPU when
      requesting more hugepages than memory available in the system:
      
      $ echo 4000 >/proc/sys/vm/nr_hugepages
      
      top - 13:42:59 up  3:37,  1 user,  load average: 1.09, 1.03, 1.01
      Tasks:   1 total,   1 running,   0 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  0.0 us, 12.5 sy,  0.0 ni, 85.5 id,  2.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  31371520 total, 30915136 used,   456384 free,      320 buffers
      KiB Swap:  6284224 total,   115712 used,  6168512 free.    48192 cached Mem
      
        PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
         76 root      20   0       0      0      0 R 100.0 0.000 217:17.29 kswapd3
      
      At that time, there are no reclaimable pages left in the node, but as
      kswapd fails to restore the high watermarks it refuses to go to sleep.
      
      Kswapd needs to back away from nodes that fail to balance.  Up until
      commit 1d82de61 ("mm, vmscan: make kswapd reclaim in terms of
      nodes") kswapd had such a mechanism.  It considered zones whose
      theoretically reclaimable pages it had reclaimed six times over as
      unreclaimable and backed away from them.  This guard was erroneously
      removed as the patch changed the definition of a balanced node.
      
      However, simply restoring this code wouldn't help in the case reported
      here: there *are* no reclaimable pages that could be scanned until the
      threshold is met.  Kswapd would stay awake anyway.
      
      Introduce a new and much simpler way of backing off.  If kswapd runs
      through MAX_RECLAIM_RETRIES (16) cycles without reclaiming a single
      page, make it back off from the node.  This is the same number of shots
      direct reclaim takes before declaring OOM.  Kswapd will go to sleep on
      that node until a direct reclaimer manages to reclaim some pages, thus
      proving the node reclaimable again.
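
      A hedged sketch of the mechanism (field placement and exact call sites
      are assumptions; the constant is the one named above):

      	/* same number of shots direct reclaim takes before declaring OOM */
      	#define MAX_RECLAIM_RETRIES 16

      	/* after a reclaim run against the node: */
      	if (nr_reclaimed)
      		pgdat->kswapd_failures = 0;	/* progress proves the node reclaimable */
      	else if (current_is_kswapd())
      		pgdat->kswapd_failures++;

      	/* kswapd checks the counter before another balancing attempt: */
      	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
      		goto sleep;	/* back off until direct reclaim makes progress */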
      
      [hannes@cmpxchg.org: check kswapd failure against the cumulative nr_reclaimed count]
        Link: http://lkml.kernel.org/r/20170306162410.GB2090@cmpxchg.org
      [shakeelb@google.com: fix condition for throttle_direct_reclaim]
        Link: http://lkml.kernel.org/r/20170314183228.20152-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20170228214007.5621-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: Jia He <hejianet@gmail.com>
      Tested-by: Jia He <hejianet@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c73322d0
  3. 21 Apr 2017 (1 commit)
  4. 08 Apr 2017 (2 commits)
  5. 04 Apr 2017 (1 commit)
    • ftrace: Have init/main.c call ftrace directly to free init memory · b80f0f6c
      Committed by Steven Rostedt (VMware)
      Relying on free_reserved_area() to call ftrace to free init memory proved
      not to be sufficient. The issue is that on x86, when debug_pagealloc is
      enabled, the init memory is not freed, but simply marked as not present.
      Since ftrace was uninformed of this, starting function tracing still tries
      to update pages that are not present according to the page tables, causing
      ftrace to BUG, as well as killing the kernel itself.
      
      Instead of relying on free_reserved_area(), have init/main.c call ftrace
      directly just before it frees the init memory. Then it needs to use
      __init_begin and __init_end to know where the init memory location is.
      Looking at all archs (and testing what I can), it appears that this should
      work for each of them.
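
      A sketch of the resulting ordering in init/main.c (surrounding code
      abbreviated; only the ftrace call is the point of the patch):

      	static int __ref kernel_init(void *unused)
      	{
      		/* ... */
      		/* let ftrace drop its records covering __init_begin..__init_end */
      		ftrace_free_init_mem();
      		free_initmem();
      		/* ... */
      	}
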
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      b80f0f6c
  6. 03 Apr 2017 (1 commit)
    • kernel-api.rst: fix a series of errors when parsing C files · 0e056eb5
      Committed by mchehab@s-opensource.com
      ./lib/string.c:134: WARNING: Inline emphasis start-string without end-string.
      ./mm/filemap.c:522: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/filemap.c:1283: ERROR: Unexpected indentation.
      ./mm/filemap.c:3003: WARNING: Inline interpreted text or phrase reference start-string without end-string.
      ./mm/vmalloc.c:1544: WARNING: Inline emphasis start-string without end-string.
      ./mm/page_alloc.c:4245: ERROR: Unexpected indentation.
      ./ipc/util.c:676: ERROR: Unexpected indentation.
      ./drivers/pci/irq.c:35: WARNING: Block quote ends without a blank line; unexpected unindent.
      ./security/security.c:109: ERROR: Unexpected indentation.
      ./security/security.c:110: WARNING: Definition list ends without a blank line; unexpected unindent.
      ./block/genhd.c:275: WARNING: Inline strong start-string without end-string.
      ./block/genhd.c:283: WARNING: Inline strong start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./include/linux/clk.h:134: WARNING: Inline emphasis start-string without end-string.
      ./ipc/util.c:477: ERROR: Unknown target name: "s".
      Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
      Acked-by: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Jonathan Corbet <corbet@lwn.net>
      0e056eb5
  7. 25 Mar 2017 (1 commit)
  8. 09 Mar 2017 (1 commit)
    • mm, page_alloc: Add missing check for memory holes · b4fb8f66
      Committed by Tony Luck
      Commit 13ad59df ("mm, page_alloc: avoid page_to_pfn() when merging
      buddies") moved the check for memory holes out of page_is_buddy() and
      had the callers do the check.
      
      But this wasn't done correctly in one place which caused ia64 to crash
      very early in boot.
      
      Update to fix that and make ia64 boot again.
      
      [ v2: Vlastimil pointed out we don't need to call page_to_pfn()
            since we already have the result of that in "buddy_pfn" ]
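
      A hedged sketch of the added check in the merge loop of __free_one_page()
      (based on the description above; the exact code may differ):

      	buddy_pfn = __find_buddy_pfn(pfn, order);
      	buddy = page + (buddy_pfn - pfn);

      	/* the buddy pfn may sit in a memory hole, e.g. on ia64 */
      	if (!pfn_valid_within(buddy_pfn))
      		goto done_merging;
      	if (!page_is_buddy(page, buddy, order))
      		goto done_merging;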
      
      Fixes: 13ad59df ("avoid page_to_pfn() when merging buddies")
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b4fb8f66
  9. 02 Mar 2017 (1 commit)
  10. 28 Feb 2017 (1 commit)
  11. 25 Feb 2017 (13 commits)
    • mm/page_alloc.c: remove redundant init code for ZONE_MOVABLE · ad69444e
      Committed by Wei Yang
      arch_zone_lowest_possible_pfn[] and arch_zone_highest_possible_pfn[] are
      initialized to 0, and the [ZONE_MOVABLE] entry is skipped in the loop,
      so there is no need to reset it to 0 again.

      This patch just removes the redundant code.
      
      Link: http://lkml.kernel.org/r/20170209141731.60208-1-richard.weiyang@gmail.com
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad69444e
    • mm/page_alloc: fix nodes for reclaim in fast path · e02dc017
      Committed by Gavin Shan
      When @node_reclaim_mode isn't 0, the page allocator tries to reclaim
      pages if the amount of free memory in the zones is below the low
      watermark.  On the Power platform, none of the NUMA nodes is scanned for
      page reclaim because no node matches the condition in
      zone_allows_reclaim().  On Power, RECLAIM_DISTANCE is set to 10, which is
      the distance from Node-A to Node-A.  So even the preferred node won't be
      scanned for page reclaim.
      
         __alloc_pages_nodemask()
         get_page_from_freelist()
            zone_allows_reclaim()
      
      Anton proposed the test code as below:
      
         # cat alloc.c
         #include <assert.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <time.h>
         #include <unistd.h>
         int main(int argc, char *argv[])
         {
      	void *p;
      	unsigned long size;
      	unsigned long start, end;
      
      	start = time(NULL);
      	size = strtoul(argv[1], NULL, 0);
      	printf("To allocate %ldGB memory\n", size);
      
      	size <<= 30;
      	p = malloc(size);
      	assert(p);
      	memset(p, 0, size);
      
      	end = time(NULL);
      	printf("Used time: %ld seconds\n", end - start);
      	sleep(3600);
      	return 0;
         }
      
      The system I use for testing has two NUMA nodes.  Both have 128GB of
      memory.  In the scenario below, the page cache on node#0 should be
      reclaimed when the node comes under pressure to accommodate an
      allocation request.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches; \
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619712 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33619840 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:          186816 kB
      
      With the patch applied, the pagecache on node-0 is reclaimed when its
      free memory is running out.  It's the expected behaviour.
      
         # echo 2 > /proc/sys/vm/zone_reclaim_mode; \
           sync; \
           echo 3 > /proc/sys/vm/drop_caches
         # taskset -c 0 cat file.32G > /dev/null; \
           grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:       33605568 kB
         # taskset -c 0 ./alloc 128
         # grep FilePages /sys/devices/system/node/node0/meminfo
           Node 0 FilePages:        1379520 kB
         # grep MemFree /sys/devices/system/node/node0/meminfo
           Node 0 MemFree:           317120 kB
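
      The kind of change involved, as a hedged sketch (not the literal diff):
      make the distance comparison in zone_allows_reclaim() inclusive, so that
      a node at exactly RECLAIM_DISTANCE, including the local node on Power,
      is allowed to be reclaimed from:

      	static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
      	{
      		return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
      					RECLAIM_DISTANCE;
      	}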
      
      Fixes: 5f7a75ac ("mm: page_alloc: do not cache reclaim distances")
      Link: http://lkml.kernel.org/r/1486532455-29613-1-git-send-email-gwshan@linux.vnet.ibm.com
      Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: <stable@vger.kernel.org>	[3.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e02dc017
    • mm: alloc_contig_range: allow to specify GFP mask · ca96b625
      Committed by Lucas Stach
      Currently alloc_contig_range assumes that the compaction should be done
      with the default GFP_KERNEL flags.  This is probably right for all
      current uses of this interface, but may change as CMA is used in more
      use-cases (including being the default DMA memory allocator on some
      platforms).
      
      Change the function prototype, to allow for passing through the GFP mask
      set by upper layers.
      
      Also respect global restrictions by applying memalloc_noio_flags to the
      passed in flags.
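
      The interface change amounts to the following sketch (the gfp_t argument
      is the new part; existing callers keep the old behaviour by passing
      GFP_KERNEL):

      	int alloc_contig_range(unsigned long start, unsigned long end,
      			       unsigned migratetype, gfp_t gfp_mask);

      Internally, the compaction control then uses
      memalloc_noio_flags(gfp_mask) rather than a hard-coded GFP_KERNEL.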
      
      Link: http://lkml.kernel.org/r/20170127172328.18574-1-l.stach@pengutronix.de
      Signed-off-by: Lucas Stach <l.stach@pengutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alexander Graf <agraf@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ca96b625
    • mm/hotplug: enable memory hotplug for non-lru movable pages · 0efadf48
      Committed by Yisheng Xie
      We had considered all of the non-lru pages as unmovable before commit
      bda807d4 ("mm: migrate: support non-lru movable page migration").
      But now some non-lru pages, like zsmalloc and virtio-balloon pages, have
      also become movable.  So we can offline such blocks by using non-lru
      page migration.
      
      This patch straightforwardly adds non-lru migration code, which means
      adding non-lru related code to the functions which scan over pfn and
      collect pages to be migrated and isolate them before migration.
      Signed-off-by: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0efadf48
    • mm, page_alloc: use static global work_struct for draining per-cpu pages · bd233f53
      Committed by Mel Gorman
      As suggested by Vlastimil Babka and Tejun Heo, this patch uses a static
      work_struct to co-ordinate the draining of per-cpu pages on the
      workqueue.  Only one task can drain at a time but this is better than
      the previous scheme that allowed multiple tasks to send IPIs at a time.
      
      One consideration is whether parallel requests should synchronise
      against each other.  This patch does not synchronise for a global drain
      as the common case for such callers is expected to be multiple parallel
      direct reclaimers competing for pages when the watermark is close to
      min.  Draining the per-cpu list is unlikely to make much progress and
      serialising the drain is of dubious merit.  Drains are synchronised for
      callers such as memory hotplug and CMA that care about the drain being
      complete when the function returns.
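
      A hedged sketch of the coordination described above (names assumed; the
      real code also tracks which CPUs actually have pages on their lists):

      	static DEFINE_MUTEX(pcpu_drain_mutex);
      	static DEFINE_PER_CPU(struct work_struct, pcpu_drain);

      	void drain_all_pages(struct zone *zone)
      	{
      		int cpu;

      		/* only one task drains at a time */
      		if (!mutex_trylock(&pcpu_drain_mutex)) {
      			if (!zone)
      				return;	/* a global drain is already in flight */
      			mutex_lock(&pcpu_drain_mutex);
      		}

      		for_each_online_cpu(cpu) {
      			struct work_struct *work = per_cpu_ptr(&pcpu_drain, cpu);

      			INIT_WORK(work, drain_local_pages_wq);
      			schedule_work_on(cpu, work);
      		}
      		for_each_online_cpu(cpu)
      			flush_work(per_cpu_ptr(&pcpu_drain, cpu));

      		mutex_unlock(&pcpu_drain_mutex);
      	}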
      
      Link: http://lkml.kernel.org/r/20170125083038.rzb5f43nptmk7aed@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd233f53
    • mm, page_alloc: don't check cpuset allowed twice in fast-path · 51047820
      Committed by Vlastimil Babka
      Since commit 682a3385 ("mm, page_alloc: inline the fast path of the
      zonelist iterator") we replace a NULL nodemask with
      cpuset_current_mems_allowed in the fast path, so that
      get_page_from_freelist() filters nodes allowed by the cpuset via
      for_next_zone_zonelist_nodemask().
      
      In that case it's pointless to additionally check __cpuset_zone_allowed()
      in each iteration, which we can avoid by not adding ALLOC_CPUSET to
      alloc_flags in that scenario.
      
      This saves some cycles in the allocator fast path on systems with one or
      more non-root cpuset configured.  In the slow path, ALLOC_CPUSET is
      reset according to __alloc_pages_slowpath().  Without configured
      cpusets, this code is disabled by a static key.
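
      A sketch of the fast-path logic this results in (simplified from the
      description above rather than taken from the diff):

      	if (cpusets_enabled()) {
      		if (!ac->nodemask) {
      			/*
      			 * Walking the zonelist with cpuset_current_mems_allowed
      			 * already enforces the cpuset, so the per-zone
      			 * __cpuset_zone_allowed() check (ALLOC_CPUSET) is skipped.
      			 */
      			ac->nodemask = &cpuset_current_mems_allowed;
      		} else {
      			*alloc_flags |= ALLOC_CPUSET;
      		}
      	}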
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51047820
    • mm, page_alloc: remove redundant checks from alloc fastpath · df76cee6
      Committed by Vlastimil Babka
      The allocation fast path contains two similar checks for zoneref->zone
      being NULL, where zoneref points either to the first zone in the
      zonelist, or to the preferred zone.  These can be NULL either due to
      empty zonelist, or no zone being compatible with given nodemask or
      task's cpuset.
      
      These checks are unnecessary, because the zonelist walks in
      first_zones_zonelist() and get_page_from_freelist() handle a NULL
      starting zoneref->zone or preferred_zoneref->zone safely.  It's safe to
      fallback to __alloc_pages_slowpath() where we also have the check early
      enough.
      
      Link: http://lkml.kernel.org/r/20170124150511.5710-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df76cee6
    • mm, page_alloc: only use per-cpu allocator for irq-safe requests · 374ad05a
      Committed by Mel Gorman
      Many workloads that allocate pages are not handling an interrupt at the
      time.  As allocation requests may be from IRQ context, it's necessary to
      disable/enable IRQs for every page allocation.  This cost is the bulk of
      the free path but also a significant percentage of the allocation path.
      
      This patch alters the locking and checks such that only irq-safe
      allocation requests use the per-cpu allocator.  All others acquire the
      irq-safe zone->lock and allocate from the buddy allocator.  It relies on
      disabling preemption to safely access the per-cpu structures.  It could
      be slightly modified to avoid soft IRQs using it but it's not clear it's
      worthwhile.
      
      This modification may slow allocations from IRQ context slightly but the
      main gain from the per-cpu allocator is that it scales better for
      allocations from multiple contexts.  There is an implicit assumption
      that intensive allocations from IRQ contexts on multiple CPUs from a
      single NUMA node are rare and that the vast majority of scaling issues
      are encountered in !IRQ contexts such as page faulting.  It's worth
      noting that this patch is not required for a bulk page allocator but it
      significantly reduces the overhead.
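
      A hedged sketch of the resulting gate in the allocation path (helper
      names assumed for the illustration):

      	if (likely(order == 0) && !in_interrupt()) {
      		/* lockless per-cpu lists, protected only by preempt_disable() */
      		page = rmqueue_pcplist(preferred_zone, zone, order,
      				       gfp_flags, migratetype);
      	} else {
      		/* everything else takes the irq-safe zone lock and uses buddy lists */
      		spin_lock_irqsave(&zone->lock, flags);
      		page = __rmqueue(zone, order, migratetype);
      		spin_unlock_irqrestore(&zone->lock, flags);
      	}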
      
      The following is results from a page allocator micro-benchmark.  Only
      order-0 is interesting as higher orders do not use the per-cpu allocator
      
                                                4.10.0-rc2                 4.10.0-rc2
                                                   vanilla               irqsafe-v1r5
      Amean    alloc-odr0-1               287.15 (  0.00%)           219.00 ( 23.73%)
      Amean    alloc-odr0-2               221.23 (  0.00%)           183.23 ( 17.18%)
      Amean    alloc-odr0-4               187.00 (  0.00%)           151.38 ( 19.05%)
      Amean    alloc-odr0-8               167.54 (  0.00%)           132.77 ( 20.75%)
      Amean    alloc-odr0-16              156.00 (  0.00%)           123.00 ( 21.15%)
      Amean    alloc-odr0-32              149.00 (  0.00%)           118.31 ( 20.60%)
      Amean    alloc-odr0-64              138.77 (  0.00%)           116.00 ( 16.41%)
      Amean    alloc-odr0-128             145.00 (  0.00%)           118.00 ( 18.62%)
      Amean    alloc-odr0-256             136.15 (  0.00%)           125.00 (  8.19%)
      Amean    alloc-odr0-512             147.92 (  0.00%)           121.77 ( 17.68%)
      Amean    alloc-odr0-1024            147.23 (  0.00%)           126.15 ( 14.32%)
      Amean    alloc-odr0-2048            155.15 (  0.00%)           129.92 ( 16.26%)
      Amean    alloc-odr0-4096            164.00 (  0.00%)           136.77 ( 16.60%)
      Amean    alloc-odr0-8192            166.92 (  0.00%)           138.08 ( 17.28%)
      Amean    alloc-odr0-16384           159.00 (  0.00%)           138.00 ( 13.21%)
      Amean    free-odr0-1                165.00 (  0.00%)            89.00 ( 46.06%)
      Amean    free-odr0-2                113.00 (  0.00%)            63.00 ( 44.25%)
      Amean    free-odr0-4                 99.00 (  0.00%)            54.00 ( 45.45%)
      Amean    free-odr0-8                 88.00 (  0.00%)            47.38 ( 46.15%)
      Amean    free-odr0-16                83.00 (  0.00%)            46.00 ( 44.58%)
      Amean    free-odr0-32                80.00 (  0.00%)            44.38 ( 44.52%)
      Amean    free-odr0-64                72.62 (  0.00%)            43.00 ( 40.78%)
      Amean    free-odr0-128               78.00 (  0.00%)            42.00 ( 46.15%)
      Amean    free-odr0-256               80.46 (  0.00%)            57.00 ( 29.16%)
      Amean    free-odr0-512               96.38 (  0.00%)            64.69 ( 32.88%)
      Amean    free-odr0-1024             107.31 (  0.00%)            72.54 ( 32.40%)
      Amean    free-odr0-2048             108.92 (  0.00%)            78.08 ( 28.32%)
      Amean    free-odr0-4096             113.38 (  0.00%)            82.23 ( 27.48%)
      Amean    free-odr0-8192             112.08 (  0.00%)            82.85 ( 26.08%)
      Amean    free-odr0-16384            110.38 (  0.00%)            81.92 ( 25.78%)
      Amean    total-odr0-1               452.15 (  0.00%)           308.00 ( 31.88%)
      Amean    total-odr0-2               334.23 (  0.00%)           246.23 ( 26.33%)
      Amean    total-odr0-4               286.00 (  0.00%)           205.38 ( 28.19%)
      Amean    total-odr0-8               255.54 (  0.00%)           180.15 ( 29.50%)
      Amean    total-odr0-16              239.00 (  0.00%)           169.00 ( 29.29%)
      Amean    total-odr0-32              229.00 (  0.00%)           162.69 ( 28.96%)
      Amean    total-odr0-64              211.38 (  0.00%)           159.00 ( 24.78%)
      Amean    total-odr0-128             223.00 (  0.00%)           160.00 ( 28.25%)
      Amean    total-odr0-256             216.62 (  0.00%)           182.00 ( 15.98%)
      Amean    total-odr0-512             244.31 (  0.00%)           186.46 ( 23.68%)
      Amean    total-odr0-1024            254.54 (  0.00%)           198.69 ( 21.94%)
      Amean    total-odr0-2048            264.08 (  0.00%)           208.00 ( 21.24%)
      Amean    total-odr0-4096            277.38 (  0.00%)           219.00 ( 21.05%)
      Amean    total-odr0-8192            279.00 (  0.00%)           220.92 ( 20.82%)
      Amean    total-odr0-16384           269.38 (  0.00%)           219.92 ( 18.36%)
      
      This is the alloc, free and total overhead of allocating order-0 pages
      in batches of 1 page up to 16384 pages.  Avoiding the IRQ
      disable/enable overhead massively reduces the cost.  Alloc overhead is
      roughly reduced by 14-20% in most cases.  The free path is reduced by
      26-46% and the total reduction is significant.
      
      Many users require zeroing of pages from the page allocator, which is
      the dominant cost of allocation.  Hence, the impact on a basic page
      faulting benchmark is not that significant
      
                                    4.10.0-rc2            4.10.0-rc2
                                       vanilla          irqsafe-v1r5
      Hmean    page_test   656632.98 (  0.00%)   675536.13 (  2.88%)
      Hmean    brk_test   3845502.67 (  0.00%)  3867186.94 (  0.56%)
      Stddev   page_test    10543.29 (  0.00%)     4104.07 ( 61.07%)
      Stddev   brk_test     33472.36 (  0.00%)    15538.39 ( 53.58%)
      CoeffVar page_test        1.61 (  0.00%)        0.61 ( 62.15%)
      CoeffVar brk_test         0.87 (  0.00%)        0.40 ( 53.84%)
      Max      page_test   666513.33 (  0.00%)   678640.00 (  1.82%)
      Max      brk_test   3882800.00 (  0.00%)  3887008.66 (  0.11%)
      
      This is from aim9 and the most notable outcome is that fault variability
      is reduced by the patch.  The headline improvement is small as the
      overall fault cost, zeroing, page table insertion etc dominate relative
      to disabling/enabling IRQs in the per-cpu allocator.
      
      Similarly, little benefit was seen on networking benchmarks both
      localhost and between physical server/clients where other costs
      dominate.  It's possible that this will only be noticeable on very high
      speed networks.
      
      Jesper Dangaard Brouer independently tested this with a separate
      microbenchmark from
        https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm/bench
      
      Micro-benchmarked with [1] page_bench02:
       modprobe page_bench02 page_order=0 run_flags=$((2#010)) loops=$((10**8)); \
        rmmod page_bench02 ; dmesg --notime | tail -n 4
      
      Compared to baseline: 213 cycles(tsc) 53.417 ns
       - against this     : 184 cycles(tsc) 46.056 ns
       - Saving           : -29 cycles
       - Very close to expected 27 cycles saving [see below [2]]
      
      Micro benchmarking via time_bench_sample[3], we get the cost of these
      operations:
      
       time_bench: Type:for_loop                 Per elem: 0 cycles(tsc) 0.232 ns (step:0)
       time_bench: Type:spin_lock_unlock         Per elem: 33 cycles(tsc) 8.334 ns (step:0)
       time_bench: Type:spin_lock_unlock_irqsave Per elem: 62 cycles(tsc) 15.607 ns (step:0)
       time_bench: Type:irqsave_before_lock      Per elem: 57 cycles(tsc) 14.344 ns (step:0)
       time_bench: Type:spin_lock_unlock_irq     Per elem: 34 cycles(tsc) 8.560 ns (step:0)
       time_bench: Type:simple_irq_disable_before_lock Per elem: 37 cycles(tsc) 9.289 ns (step:0)
       time_bench: Type:local_BH_disable_enable  Per elem: 19 cycles(tsc) 4.920 ns (step:0)
       time_bench: Type:local_IRQ_disable_enable Per elem: 7 cycles(tsc) 1.864 ns (step:0)
       time_bench: Type:local_irq_save_restore   Per elem: 38 cycles(tsc) 9.665 ns (step:0)
       [Mel's patch removes a ^^^^^^^^^^^^^^^^]            ^^^^^^^^^ expected saving - preempt cost
       time_bench: Type:preempt_disable_enable   Per elem: 11 cycles(tsc) 2.794 ns (step:0)
       [adds a preempt  ^^^^^^^^^^^^^^^^^^^^^^]            ^^^^^^^^^ adds this cost
       time_bench: Type:funcion_call_cost        Per elem: 6 cycles(tsc) 1.689 ns (step:0)
       time_bench: Type:func_ptr_call_cost       Per elem: 11 cycles(tsc) 2.767 ns (step:0)
       time_bench: Type:page_alloc_put           Per elem: 211 cycles(tsc) 52.803 ns (step:0)
      
      Thus, expected improvement is: 38-11 = 27 cycles.
      
      [mgorman@techsingularity.net: s/preempt_enable_no_resched/preempt_enable/]
        Link: http://lkml.kernel.org/r/20170208143128.25ahymqlyspjcixu@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      374ad05a
    • mm, page_alloc: do not depend on cpu hotplug locks inside the allocator · a459eeb7
      Committed by Michal Hocko
      Dmitry has reported the following lockdep splat
        lock_acquire+0x2a1/0x630 kernel/locking/lockdep.c:3753
        __mutex_lock_common kernel/locking/mutex.c:521 [inline]
        mutex_lock_nested+0x24e/0xff0 kernel/locking/mutex.c:621
        pcpu_alloc+0xbda/0x1280 mm/percpu.c:896
        __alloc_percpu+0x24/0x30 mm/percpu.c:1075
        smpcfd_prepare_cpu+0x73/0xd0 kernel/smp.c:44
        cpuhp_invoke_callback+0x254/0x1480 kernel/cpu.c:136
        cpuhp_up_callbacks+0x81/0x2a0 kernel/cpu.c:493
        _cpu_up+0x1e3/0x2a0 kernel/cpu.c:1057
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      pcpu_alloc
        pcpu_alloc_mutex
      
        get_online_cpus+0x62/0x90 kernel/cpu.c:248
        drain_all_pages+0xf8/0x710 mm/page_alloc.c:2385
        __alloc_pages_direct_reclaim mm/page_alloc.c:3440 [inline]
        __alloc_pages_slowpath+0x8fd/0x2370 mm/page_alloc.c:3778
        __alloc_pages_nodemask+0x8f5/0xc60 mm/page_alloc.c:3980
        __alloc_pages include/linux/gfp.h:426 [inline]
        __alloc_pages_node include/linux/gfp.h:439 [inline]
        alloc_pages_node include/linux/gfp.h:453 [inline]
        pcpu_alloc_pages mm/percpu-vm.c:93 [inline]
        pcpu_populate_chunk+0x1e1/0x900 mm/percpu-vm.c:282
        pcpu_alloc+0xe01/0x1280 mm/percpu.c:998
        __alloc_percpu_gfp+0x27/0x30 mm/percpu.c:1062
        bpf_array_alloc_percpu kernel/bpf/arraymap.c:34 [inline]
        array_map_alloc+0x532/0x710 kernel/bpf/arraymap.c:99
        find_and_alloc_map kernel/bpf/syscall.c:34 [inline]
        map_create kernel/bpf/syscall.c:188 [inline]
        SYSC_bpf kernel/bpf/syscall.c:870 [inline]
        SyS_bpf+0xd64/0x2500 kernel/bpf/syscall.c:827
        entry_SYSCALL_64_fastpath+0x1f/0xc2
      
      pcpu_alloc
        pcpu_alloc_mutex
      drain_all_pages
        get_online_cpus
          cpu_hotplug.lock
      
        cpu_hotplug_begin+0x206/0x2e0 kernel/cpu.c:304
        _cpu_up+0xca/0x2a0 kernel/cpu.c:1011
        do_cpu_up+0x73/0xa0 kernel/cpu.c:1087
        cpu_up+0x18/0x20 kernel/cpu.c:1095
        smp_init+0xe9/0xee kernel/smp.c:564
        kernel_init_freeable+0x439/0x690 init/main.c:1010
        kernel_init+0x13/0x180 init/main.c:941
        ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433
      
      cpu_hotplug_begin
        cpu_hotplug.lock
      
      Pulling cpu hotplug locks inside the page allocator is just too
      dangerous.  Let's remove the dependency by dropping get_online_cpus()
      from drain_all_pages.  This is not so simple though, because we then no
      longer have protection against cpu hotplug, which means 2 things:

        - the work item might be executed on a different cpu by a worker from
          an unbound pool, so it doesn't run pinned on the intended cpu

        - we have to make sure that we do not race with page_alloc_cpu_dead
          calling drain_pages_zone
      
      Disabling preemption in drain_local_pages_wq solves the first problem:
      drain_local_pages will determine its local CPU from the WQ context,
      which will be stable after that point; page_alloc_cpu_dead is pinned to
      its CPU already.  The latter condition is achieved by disabling IRQs in
      drain_pages_zone.
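
      The first point translates into something like the following sketch of
      drain_local_pages_wq() (simplified):

      	static void drain_local_pages_wq(struct work_struct *work)
      	{
      		/*
      		 * Keep preemption disabled so the CPU seen by drain_local_pages()
      		 * cannot change, even if the work item ends up being executed by
      		 * a worker from an unbound pool after a CPU hotplug event.
      		 */
      		preempt_disable();
      		drain_local_pages(NULL);
      		preempt_enable();
      	}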
      
      Fixes: mm, page_alloc: drain per-cpu pages from workqueue context
      Link: http://lkml.kernel.org/r/20170207201950.20482-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a459eeb7
    • mm, page_alloc: drain per-cpu pages from workqueue context · 0ccce3b9
      Committed by Mel Gorman
      The per-cpu page allocator can be drained immediately via
      drain_all_pages() which sends IPIs to every CPU.  In the next patch, the
      per-cpu allocator will only be used for interrupt-safe allocations which
      prevents draining it from IPI context.  This patch uses workqueues to
      drain the per-cpu lists instead.
      
      This is slower but no slowdown during intensive reclaim was measured and
      the paths that use drain_all_pages() are not that sensitive to
      performance.  This is particularly true as the path would only be
      triggered when reclaim is failing.  It also makes some sense to avoid
      storming a machine with IPIs when it's under memory pressure.  Arguably,
      it should be further adjusted so that only one caller at a time is
      draining pages but it's beyond the scope of the current patch.
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ccce3b9
    • mm, page_alloc: split alloc_pages_nodemask() · 9cd75558
      Committed by Mel Gorman
      __alloc_pages_nodemask() does a number of preparation steps that
      determine what zones can be used for the allocation depending on a
      variety of factors.  This is fine, but a hypothetical caller that wanted
      multiple order-0 pages would have to do the preparation steps multiple
      times.  This patch structures __alloc_pages_nodemask() such that it's
      relatively easy to build a bulk order-0 page allocator.  There is no
      functional change.
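
      A hedged sketch of the resulting shape (the helper names are assumptions
      for the illustration, not necessarily the ones used in the patch):

      	struct page *
      	__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
      			       struct zonelist *zonelist, nodemask_t *nodemask)
      	{
      		struct page *page;
      		unsigned int alloc_flags = ALLOC_WMARK_LOW;
      		gfp_t alloc_mask;
      		struct alloc_context ac = { };

      		/* one-off preparation that a future bulk allocator could share */
      		if (!prepare_alloc_pages(gfp_mask, order, zonelist, nodemask,
      					 &ac, &alloc_mask, &alloc_flags))
      			return NULL;
      		finalise_ac(gfp_mask, order, &ac);

      		/* fast path, then slow path, exactly as before */
      		page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
      		if (!page)
      			page = __alloc_pages_slowpath(alloc_mask, order, &ac);

      		return page;
      	}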
      
      Link: http://lkml.kernel.org/r/20170123153906.3122-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9cd75558
    • mm, page_alloc: split buffered_rmqueue() · 066b2393
      Committed by Mel Gorman
      Patch series "Use per-cpu allocator for !irq requests and prepare for a
      bulk allocator", v5.
      
      This series is motivated by a conversation led by Jesper Dangaard Brouer
      at the last LSF/MM proposing a generic page pool for DMA-coherent pages.
      Part of his motivation was the overhead of allocating multiple order-0
      pages, which led some drivers to use high-order allocations and split
      them.  This is very slow in some cases.
      
      The first two patches in this series restructure the page allocator such
      that it is relatively easy to introduce an order-0 bulk page allocator.
      A patch exists to do that and has been handed over to Jesper until an
      in-kernel user is created.  The third patch prevents the per-cpu
      allocator being drained from IPI context as that can potentially corrupt
      the list after patch four is merged.  The final patch alters the per-cpu
      allocator to make it exclusive to !irq requests.  This cuts
      allocation/free overhead by roughly 30%.
      
      Performance tests from both Jesper and me are included in the patch.
      
      This patch (of 4):
      
      buffered_rmqueue() removes a page from a given zone and uses the per-cpu
      list for order-0.  This is fine, but a hypothetical caller that wanted
      multiple order-0 pages would have to disable/re-enable interrupts
      multiple times.  This patch structures buffered_rmqueue() such that it's
      relatively easy to build a bulk order-0 page allocator.  There is no
      functional change.
      
      [mgorman@techsingularity.net: failed per-cpu refill may blow up]
        Link: http://lkml.kernel.org/r/20170124112723.mshmgwq2ihxku2um@techsingularity.net
      Link: http://lkml.kernel.org/r/20170123153906.3122-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      066b2393
  12. 23 Feb 2017 (9 commits)
    • mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled · 685dbf6f
      Committed by David Rientjes
      The patch "mm, page_alloc: warn_alloc print nodemask" implicitly sets
      the allocation nodemask to cpuset_current_mems_allowed when there is no
      effective mempolicy.  cpuset_current_mems_allowed is only effective when
      cpusets are enabled, which is also printed by warn_alloc(), so setting
      the nodemask to cpuset_current_mems_allowed is redundant and prevents
      debugging issues where ac->nodemask is not set properly in the page
      allocator.
      
      This provides better debugging output since
      cpuset_print_current_mems_allowed() is already provided.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701181347320.142399@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      685dbf6f
    • mm: help __GFP_NOFAIL allocations which do not trigger OOM killer · 6c18ba7a
      Committed by Michal Hocko
      Now that __GFP_NOFAIL doesn't override decisions to skip the oom killer,
      we are left with requests which need to loop inside the allocator
      without invoking the oom killer (e.g.  GFP_NOFS|__GFP_NOFAIL used by fs
      code) and so they might, in very unlikely situations, loop forever -
      e.g.  other parallel requests could starve them.
      
      This patch tries to limit the likelihood of such a lockup by giving
      these __GFP_NOFAIL requests a chance to move on by consuming a small
      part of memory reserves.  We are using ALLOC_HARDER, which should be
      enough to prevent starvation by regular allocation requests, yet it
      shouldn't consume enough of the reserves to disrupt high priority
      requests (ALLOC_HIGH).
      
      While we are at it, let's introduce a helper,
      __alloc_pages_cpuset_fallback, which enforces the cpusets but allows
      falling back to ignoring them if the first attempt fails.  __GFP_NOFAIL
      requests can be considered important enough to allow a cpuset runaway in
      order for the system to move on.  It is highly unlikely that any of
      these will be GFP_USER anyway.
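
      A sketch of such a helper as described above (assumed to wrap
      get_page_from_freelist(); details may differ from the real code):

      	static inline struct page *
      	__alloc_pages_cpuset_fallback(gfp_t gfp_mask, unsigned int order,
      				      unsigned int alloc_flags,
      				      const struct alloc_context *ac)
      	{
      		struct page *page;

      		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
      		/* fall back to ignoring the cpuset if our nodes are depleted */
      		if (!page)
      			page = get_page_from_freelist(gfp_mask, order,
      						      alloc_flags & ~ALLOC_CPUSET, ac);
      		return page;
      	}

      __GFP_NOFAIL requests then call it with ALLOC_HARDER as the base flags.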
      
      Link: http://lkml.kernel.org/r/20161220134904.21023-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c18ba7a
    • mm, oom: do not enforce OOM killer for __GFP_NOFAIL automatically · 06ad276a
      Committed by Michal Hocko
      __alloc_pages_may_oom makes sure to skip the OOM killer depending on the
      allocation request.  This includes lowmem requests, costly high order
      requests and others.  For a long time __GFP_NOFAIL acted as an override
      for all those rules.  This is not documented and it can be quite
      surprising as well.  E.g.  GFP_NOFS requests are not invoking the OOM
      killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert some of
      the existing open-coded loops around the allocator to nofail requests
      (and we have done that in the past) then such a change would have a
      non-trivial side effect which is far from obvious.  Note that the
      primary motivation for skipping the OOM killer is to prevent premature
      invocation.
      
      The exception was added by commit 82553a93 ("oom: invoke oom
      killer for __GFP_NOFAIL").  The changelog points out that the OOM
      killer has to be invoked, otherwise the request would be looping
      forever.  But this argument is rather weak because the OOM killer
      doesn't really guarantee forward progress for those exceptional cases:
      
      - it will hardly help to form a costly-order page, and repeated
        invocations can instead end in a system panic because no OOM-killable
        task is left - I believe we certainly do not want to put the system
        down just because there is a nasty driver asking for an order-9 page
        with __GFP_NOFAIL without realizing all the consequences.  It is much
        better for such a request to loop forever than to cause massive
        system disruption
      
      - lowmem is also highly unlikely to be freed by the OOM killer

      - a GFP_NOFS request could trigger the OOM killer while there is still
        a lot of memory pinned by filesystems.
      
      This patch simply removes the __GFP_NOFAIL special case in order to
      have clearer semantics without surprising side effects.
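
      A hedged sketch of the resulting decision (the constants, helper name
      and lowmem flag below are illustrative stand-ins, not the kernel's
      code): a __GFP_NOFAIL request is now subject to the same
      skip-the-OOM-killer checks as any other request.

          #include <stdbool.h>
          #include <stdio.h>

          #define PAGE_ALLOC_COSTLY_ORDER 3
          #define __GFP_FS                0x1u
          #define __GFP_NOFAIL            0x2u

          /* Decide whether invoking the OOM killer is worthwhile.  There is
           * no "__GFP_NOFAIL overrides everything" branch any more. */
          static bool should_invoke_oom(unsigned gfp_mask, unsigned order,
                                        bool lowmem_request)
          {
              if (order > PAGE_ALLOC_COSTLY_ORDER)
                  return false;   /* killing tasks rarely forms huge blocks */
              if (lowmem_request)
                  return false;   /* lowmem is unlikely to be freed by OOM  */
              if (!(gfp_mask & __GFP_FS))
                  return false;   /* fs-pinned memory may still be freeable */
              return true;
          }

          int main(void)
          {
              /* GFP_NOFS|__GFP_NOFAIL now keeps looping in the allocator
               * instead of triggering the OOM killer. */
              printf("invoke OOM for GFP_NOFS|__GFP_NOFAIL? %d\n",
                     should_invoke_oom(__GFP_NOFAIL, 0, false));
              return 0;
          }
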
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Nils Holland <nholland@tisys.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06ad276a
    • M
      mm: consolidate GFP_NOFAIL checks in the allocator slowpath · 9a67f648
      Committed by Michal Hocko
      Tetsuo Handa has pointed out that commit 0a0337e0 ("mm, oom: rework
      oom detection") subtly changed the semantics for costly high order
      requests with __GFP_NOFAIL and without __GFP_REPEAT, and those can fail
      right now.  My code inspection didn't reveal any such users in the
      tree, but it is true that this might lead to unexpected allocation
      failures and subsequent oopses.
      
      __alloc_pages_slowpath is currently hard to follow with respect to
      __GFP_NOFAIL.  There are a few special cases but we are lacking a
      catch-all place to be sure we will not miss any case where the
      non-failing allocation might fail.  This patch reorganizes the code a
      bit and puts all those special cases under the nopage label, which is
      the generic go-to-fail path.  Non-failing allocations are retried,
      while those that cannot retry, like non-sleeping allocations, go to the
      failure point directly.  This should make the code flow much easier to
      follow and make it less error prone for future changes.
      
      While we are at it, the stall check has to be moved up to catch
      potentially looping non-failing allocations.
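      The shape of the resulting control flow, as a simplified, self-contained
      sketch (try_once, the flag values and the succeed-on-the-third-attempt
      rule are invented for illustration; the real slowpath is far more
      involved):

          #include <stddef.h>
          #include <stdio.h>

          #define __GFP_NOFAIL          0x1u
          #define __GFP_DIRECT_RECLAIM  0x2u

          struct page { int order; };

          /* Stand-in for a single allocation attempt: succeeds on the 3rd try. */
          static struct page *try_once(void)
          {
              static int calls;
              static struct page pg;

              return (++calls >= 3) ? &pg : NULL;
          }

          /* Shape of the consolidated slowpath: every failure path funnels
           * into the nopage label, where the non-failing case is handled
           * before anything is allowed to return NULL. */
          static struct page *slowpath(unsigned gfp_mask)
          {
              struct page *page;

          retry:
              page = try_once();
              if (page)
                  return page;

              /* Non-sleeping allocations cannot retry: fail directly. */
              if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
                  goto nopage;

              /* ... reclaim / compaction / OOM decisions elided ... */

          nopage:
              /* Catch-all: a __GFP_NOFAIL allocation never returns NULL. */
              if (gfp_mask & __GFP_NOFAIL)
                  goto retry;

              return NULL;
          }

          int main(void)
          {
              printf("%s\n", slowpath(__GFP_NOFAIL | __GFP_DIRECT_RECLAIM) ?
                     "allocated" : "failed");
              return 0;
          }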
      
      [akpm@linux-foundation.org: fix alloc_flags may-be-used-uninitialized]
      Link: http://lkml.kernel.org/r/20161220134904.21023-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a67f648
    • M
      lib/show_mem.c: teach show_mem to work with the given nodemask · 9af744d7
      Committed by Michal Hocko
      show_mem() allows filtering out node-specific data which is irrelevant
      to the allocation request via SHOW_MEM_FILTER_NODES.  The filtering is
      done in skip_free_areas_node, which skips all nodes that are not in the
      mems_allowed of the current process.  This works as expected most of
      the time because the nodemask shouldn't be outside of the allocating
      task's mems_allowed, but there are some exceptions.  E.g.  memory
      hotplug might want to request allocations from outside of the allowed
      nodes (see new_node_page).
      
      Get rid of this hardcoded behavior by pushing the allocation nodemask
      down the show_mem path and using it instead of
      cpuset_current_mems_allowed.  A NULL nodemask is interpreted as
      cpuset_current_mems_allowed.
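      A compact sketch of the new contract, with toy types and masks (the
      real helper also takes the SHOW_MEM_FILTER_* flags, which are omitted
      here): NULL keeps the old cpuset-based filtering, while an explicit
      mask - e.g. the one a memory-hotplug allocation used - takes
      precedence.

          #include <stdbool.h>
          #include <stddef.h>
          #include <stdio.h>

          typedef struct { unsigned long bits; } nodemask_t;  /* one bit per node */

          static nodemask_t cpuset_current_mems_allowed = { 0x1 };  /* node 0 only */

          /* The skip decision now takes the nodemask of the failing allocation;
           * NULL keeps the old behaviour of using the current task's cpuset. */
          static bool skip_free_areas_node(const nodemask_t *nodemask, int nid)
          {
              if (!nodemask)
                  nodemask = &cpuset_current_mems_allowed;
              return !(nodemask->bits & (1ul << nid));
          }

          int main(void)
          {
              nodemask_t hotplug_mask = { 0x2 };   /* node 1, outside the cpuset */

              /* Old behaviour: node 1 would always be skipped ... */
              printf("NULL mask, node 1 skipped: %d\n",
                     skip_free_areas_node(NULL, 1));
              /* ... but a hotplug allocation targeting node 1 can now show it. */
              printf("hotplug mask, node 1 skipped: %d\n",
                     skip_free_areas_node(&hotplug_mask, 1));
              return 0;
          }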
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9af744d7
    • M
      mm, page_alloc: warn_alloc print nodemask · a8e99259
      Committed by Michal Hocko
      warn_alloc is currently used to report an allocation failure or an
      allocation stall.  We print some details of the allocation request,
      like the gfp mask and the request order.  We do not print the
      allocation nodemask, which is also important when debugging the reason
      for the allocation failure.  We already print the nodemask in the OOM
      report.
      
      Add a nodemask parameter to warn_alloc() and print it as well.
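      A self-contained sketch of the idea (the variadic signature mirrors the
      description above; the toy nodemask type, the zero gfp value and the
      plain comma-separated node list are assumptions - the kernel prints the
      mask in its own format):

          #include <stdarg.h>
          #include <stdio.h>

          #define MAX_NUMNODES 8

          typedef struct { unsigned long bits; } nodemask_t;

          /* Print the nodemask next to the gfp mask and the caller's message so
           * a failure or stall report shows which nodes the request could use. */
          static void warn_alloc(unsigned gfp_mask, const nodemask_t *nodemask,
                                 const char *fmt, ...)
          {
              va_list args;

              va_start(args, fmt);
              vprintf(fmt, args);
              va_end(args);

              printf(", mode:%#x, nodemask=", gfp_mask);
              if (!nodemask) {
                  printf("(null)\n");
                  return;
              }
              for (int nid = 0, first = 1; nid < MAX_NUMNODES; nid++) {
                  if (nodemask->bits & (1ul << nid)) {
                      printf(first ? "%d" : ",%d", nid);
                      first = 0;
                  }
              }
              printf("\n");
          }

          int main(void)
          {
              nodemask_t allowed = { 0x3 };   /* nodes 0 and 1 */

              warn_alloc(0u /* gfp bits elided */, &allowed,
                         "page allocation stalls for %ums", 10000u);
              return 0;
          }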
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8e99259
    • M
      mm, page_alloc: do not report all nodes in show_mem · c02e50bb
      Committed by Michal Hocko
      Patch series "show_mem updates", v2.
      
      This is a mixture of one bug fix (patch 1), an enhancement (patch 2)
      and cleanups (the rest of the series).  The first two patches should be
      really straightforward.  Patch 3 removes some arch-specific show_mem
      implementations because I think they are quite outdated and do not
      really serve any useful purpose anymore.  I think we should really
      strive to have a consistent show_mem output regardless of the
      architecture.  If some architecture is really special and wants to dump
      something additional, we should do that via an arch-specific hook.
      
      The last patch adds a nodemask parameter so that we do not rely on the
      hardcoded mems_allowed of the current task when doing the node
      filtering.  I consider this more of a cleanup than a fix because
      basically all users use a nodemask which is a subset of mems_allowed.
      There is only one call path, in memory hotplug, which doesn't comply
      with this, but that is hardly something to worry about.
      
      This patch (of 4):
      
      Commit 599d0c95 ("mm, vmscan: move LRU lists to node") added
      per-NUMA-node statistics to show_mem but forgot to add
      skip_free_areas_node to filter out nodes which are outside of the
      allocating task's NUMA policy.  Add this check so the output is not
      polluted with pointless information.
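      In outline (a toy user-space model; the allowed-nodes constant and the
      printed statistics are placeholders), the fix amounts to guarding the
      new per-node loop with the same skip check the rest of show_mem already
      applies:

          #include <stdbool.h>
          #include <stdio.h>

          #define MAX_NUMNODES  4
          #define ALLOWED_NODES 0x1ul        /* current task: node 0 only */

          /* Same filtering the rest of show_mem() already applies
           * (SHOW_MEM_FILTER_* flags omitted in this sketch). */
          static bool skip_free_areas_node(int nid)
          {
              return !(ALLOWED_NODES & (1ul << nid));
          }

          static void show_per_node_stats(void)
          {
              for (int nid = 0; nid < MAX_NUMNODES; nid++) {
                  if (skip_free_areas_node(nid))
                      continue;              /* the check the fix adds */
                  printf("Node %d active_anon:... inactive_anon:...\n", nid);
              }
          }

          int main(void)
          {
              show_per_node_stats();   /* prints node 0 only, not all 4 */
              return 0;
          }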
      
      Link: http://lkml.kernel.org/r/20170117091543.25850-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c02e50bb
    • P
      mm: page_alloc: skip over regions of invalid pfns where possible · b92df1de
      Committed by Paul Burton
      When using a sparse memory model, memmap_init_zone() invoked with the
      MEMMAP_EARLY context will skip over pages which aren't valid - i.e.
      which aren't in a populated region of the sparse memory map.  However,
      if the memory map is extremely sparse then it can spend a long time
      linearly checking each PFN in a large non-populated region of the
      memory map and skipping each one in turn.
      
      When CONFIG_HAVE_MEMBLOCK_NODE_MAP is enabled, we have sufficient
      information to quickly discover the next valid PFN given an invalid
      one, by searching through the list of memory regions and skipping
      forwards to the first PFN covered by the memory region to the right of
      the non-populated region.  Implement this in order to speed up
      memmap_init_zone() for systems with extremely sparse memory maps.
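      The lookup can be sketched on a toy, sorted region table (the region
      layout, sizes and helper name below are invented; the kernel works
      against the memblock region list rather than a fixed array):

          #include <stdio.h>

          /* A populated region of the (very sparse) memory map: [base, base+size). */
          struct mem_region {
              unsigned long base_pfn;
              unsigned long size;
          };

          /* Sorted, non-overlapping regions, with huge holes in between. */
          static const struct mem_region regions[] = {
              { 0x00000, 0x08000 },
              { 0x80000, 0x04000 },
              { 0xf0000, 0x10000 },
          };

          /* Given an invalid pfn, jump straight to the first pfn of the next
           * populated region instead of testing every pfn in the hole. */
          static unsigned long next_valid_pfn(unsigned long pfn)
          {
              for (unsigned i = 0; i < sizeof(regions) / sizeof(regions[0]); i++) {
                  unsigned long start = regions[i].base_pfn;
                  unsigned long end = start + regions[i].size;

                  if (pfn < start)
                      return start;          /* skip the whole hole */
                  if (pfn < end)
                      return pfn;            /* already valid       */
              }
              return ~0ul;                   /* past the last region */
          }

          int main(void)
          {
              /* pfn 0x8000 is the first pfn of a large hole; rather than walking
               * every pfn up to 0x80000 one by one, we land there directly. */
              printf("next valid pfn after %#lx is %#lx\n",
                     0x8000ul, next_valid_pfn(0x8000ul));
              return 0;
          }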
      
      James said "I have tested this patch on a virtual model of a Samurai
      CPU with a sparse memory map.  The kernel boot time drops from 109 to
      62 seconds."
      
      Link: http://lkml.kernel.org/r/20161125185518.29885-1-paul.burton@imgtec.com
      Signed-off-by: Paul Burton <paul.burton@imgtec.com>
      Tested-by: James Hartley <james.hartley@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b92df1de
    • M
      oom, trace: add compaction retry tracepoint · 65190cff
      Committed by Michal Hocko
      OOM debugging for higher-order requests is currently quite hard.  We
      do have some compaction tracepoints which can tell us how compaction is
      operating, but there is no tracepoint to tell us about the compaction
      retry logic.  This patch adds one which has the following format:
      
                  bash-3126  [001] ....  1498.220001: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=withdrawn retries=0 max_retries=16 should_retry=0
      
      We can see that the order-9 request is not retried even though we are
      in the highest compaction priority mode, because the last compaction
      attempt was withdrawn.  This means that compaction_zonelist_suitable
      must have returned false and there is no suitable zone to compact for
      this request, so there is no need to retry further.
      
      Another example would be:
                 <...>-3137  [001] ....    81.501689: compact_retry: order=9 priority=COMPACT_PRIO_SYNC_LIGHT compaction_result=failed retries=0 max_retries=16 should_retry=0
      
      In this case the order-9 compaction failed to find any suitable block.
      We do not retry any more because this is a costly request and those do
      not go below the COMPACT_PRIO_SYNC_LIGHT priority.
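      As a rough sketch of where such an event is typically emitted (the
      printf stand-in replaces the real tracepoint machinery, the decision
      logic is elided, and only the fields visible in the output above are
      used):

          #include <stdbool.h>
          #include <stdio.h>

          /* Stand-in for the real tracepoint: just print the same fields the
           * trace output above shows. */
          static void trace_compact_retry(int order, const char *priority,
                                          const char *result, int retries,
                                          int max_retries, bool should_retry)
          {
              printf("compact_retry: order=%d priority=%s compaction_result=%s "
                     "retries=%d max_retries=%d should_retry=%d\n",
                     order, priority, result, retries, max_retries, should_retry);
          }

          /* Emitting the event at the single exit point of the retry decision
           * means every outcome (withdrawn, failed, ...) shows up in the trace. */
          static bool should_compact_retry(int order, const char *priority,
                                           const char *result, int retries,
                                           int max_retries)
          {
              bool ret = false;   /* decision logic elided in this sketch */

              trace_compact_retry(order, priority, result, retries,
                                  max_retries, ret);
              return ret;
          }

          int main(void)
          {
              should_compact_retry(9, "COMPACT_PRIO_SYNC_LIGHT", "withdrawn", 0, 16);
              return 0;
          }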
      
      Link: http://lkml.kernel.org/r/20161220130135.15719-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65190cff