1. 18 3月, 2016 32 次提交
    • C
      mm, memory hotplug: print debug message in the proper way for online_pages · e33e33b4
      Chen Yucong 提交于
      online_pages() simply returns an error value if
      memory_notify(MEM_GOING_ONLINE, &arg) return a value that is not what we
      want for successfully onlining target pages.  This patch arms to print
      more failure information like offline_pages() in online_pages.
      
      This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves
      __offline_pages() to not print failure information with KERN_INFO
      according to David Rientjes's suggestion[1].
      
      [1] https://lkml.org/lkml/2016/2/24/1094Signed-off-by: NChen Yucong <slaoub@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e33e33b4
    • M
      mm: remove __GFP_NOFAIL is deprecated comment · 0f352e53
      Michal Hocko 提交于
      Commit 64775719 ("mm: clarify __GFP_NOFAIL deprecation status") was
      incomplete and didn't remove the comment about __GFP_NOFAIL being
      deprecated in buffered_rmqueue.
      
      Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
      because we should really discourage from using __GFP_NOFAIL with higher
      order allocations because those are just too subtle.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f352e53
    • J
      mm/page_ref: add tracepoint to track down page reference manipulation · 95813b8f
      Joonsoo Kim 提交于
      CMA allocation should be guaranteed to succeed by definition, but,
      unfortunately, it would be failed sometimes.  It is hard to track down
      the problem, because it is related to page reference manipulation and we
      don't have any facility to analyze it.
      
      This patch adds tracepoints to track down page reference manipulation.
      With it, we can find exact reason of failure and can fix the problem.
      Following is an example of tracepoint output.  (note: this example is
      stale version that printing flags as the number.  Recent version will
      print it as human readable string.)
      
      <...>-9018  [004]    92.678375: page_ref_set:         pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
      <...>-9018  [004]    92.678378: kernel_stack:
       => get_page_from_freelist (ffffffff81176659)
       => __alloc_pages_nodemask (ffffffff81176d22)
       => alloc_pages_vma (ffffffff811bf675)
       => handle_mm_fault (ffffffff8119e693)
       => __do_page_fault (ffffffff810631ea)
       => trace_do_page_fault (ffffffff81063543)
       => do_async_page_fault (ffffffff8105c40a)
       => async_page_fault (ffffffff817581d8)
      [snip]
      <...>-9018  [004]    92.678379: page_ref_mod:         pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
      [snip]
      ...
      ...
      <...>-9131  [001]    93.174468: test_pages_isolated:  start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
      [snip]
      <...>-9018  [004]    93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
       => release_pages (ffffffff8117c9e4)
       => free_pages_and_swap_cache (ffffffff811b0697)
       => tlb_flush_mmu_free (ffffffff81199616)
       => tlb_finish_mmu (ffffffff8119a62c)
       => exit_mmap (ffffffff811a53f7)
       => mmput (ffffffff81073f47)
       => do_exit (ffffffff810794e9)
       => do_group_exit (ffffffff81079def)
       => SyS_exit_group (ffffffff81079e74)
       => entry_SYSCALL_64_fastpath (ffffffff817560b6)
      
      This output shows that problem comes from exit path.  In exit path, to
      improve performance, pages are not freed immediately.  They are gathered
      and processed by batch.  During this process, migration cannot be
      possible and CMA allocation is failed.  This problem is hard to find
      without this page reference tracepoint facility.
      
      Enabling this feature bloat kernel text 30 KB in my configuration.
      
         text    data     bss     dec     hex filename
      12127327        2243616 1507328 15878271         f2487f vmlinux_disabled
      12157208        2258880 1507328 15923416         f2f8d8 vmlinux_enabled
      
      Note that, due to header file dependency problem between mm.h and
      tracepoint.h, this feature has to open code the static key functions for
      tracepoints.  Proposed by Steven Rostedt in following link.
      
      https://lkml.org/lkml/2015/12/9/699
      
      [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
      [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
      [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95813b8f
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim 提交于
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
    • M
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman 提交于
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really cares are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default and the ideal for most users that get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or pick a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct/reclaim but no wakeup kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fallback and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win where as a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread increases.  Similar,
      notice the large reduction in most cases in system CPU usage.  The
      overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of caseds but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct relcim activity and allocation stalls is substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so lets take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • D
      mm, mempool: only set __GFP_NOMEMALLOC if there are free elements · f9054c70
      David Rientjes 提交于
      If an oom killed thread calls mempool_alloc(), it is possible that it'll
      loop forever if there are no elements on the freelist since
      __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
      oom conditions.
      
      Only set __GFP_NOMEMALLOC if there are elements on the freelist.  If
      there are no free elements, allow allocations without the bit set so
      that memory reserves can be accessed if needed.
      
      Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not
      supported since the implementation can loop forever without accessing
      memory reserves when needed.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9054c70
    • J
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner 提交于
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • K
      mm: cleanup *pte_alloc* interfaces · 3ed3a4f0
      Kirill A. Shutemov 提交于
      There are few things about *pte_alloc*() helpers worth cleaning up:
      
       - 'vma' argument is unused, let's drop it;
      
       - most __pte_alloc() callers do speculative check for pmd_none(),
         before taking ptl: let's introduce pte_alloc() macro which does
         the check.
      
         The only direct user of __pte_alloc left is userfaultfd, which has
         different expectation about atomicity wrt pmd.
      
       - pte_alloc_map() and pte_alloc_map_lock() are redefined using
         pte_alloc().
      
      [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
      [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NSudeep Holla <sudeep.holla@arm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ed3a4f0
    • I
      mm/page_alloc.c: calculate 'available' memory in a separate function · d02bd27b
      Igor Redko 提交于
      Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
      statistics protocol, corresponding to 'Available' in /proc/meminfo.
      
      It indicates to the hypervisor how big the balloon can be inflated
      without pushing the guest system to swap.  This metric would be very
      useful in VM orchestration software to improve memory management of
      different VMs under overcommit.
      
      This patch (of 2):
      
      Factor out calculation of the available memory counter into a separate
      exportable function, in order to be able to use it in other parts of the
      kernel.
      
      In particular, it appears a relevant metric to report to the hypervisor
      via virtio-balloon statistics interface (in a followup patch).
      Signed-off-by: NIgor Redko <redkoi@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d02bd27b
    • Y
      mm/Kconfig: remove redundant arch depend for memory hotplug · 7eb50292
      Yang Shi 提交于
      MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG which is
      selected by the supported architectures, so the following arch depend is
      unnecessary.
      Signed-off-by: NYang Shi <yang.shi@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7eb50292
    • A
      mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range · 458aa76d
      Aneesh Kumar K.V 提交于
      We remove one instace of flush_tlb_range here.  That was added by commit
      f714f4f2 ("mm: numa: call MMU notifiers on THP migration").  But the
      pmdp_huge_clear_flush_notify should have done the require flush for us.
      Hence remove the extra flush.
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vineet Gupta <Vineet.Gupta1@synopsys.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      458aa76d
    • A
      mm: deduplicate memory overcommitment code · 39a1aa8e
      Andrey Ryabinin 提交于
      Currently we have two copies of the same code which implements memory
      overcommitment logic.  Let's move it into mm/util.c and hence avoid
      duplication.  No functional changes here.
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      39a1aa8e
    • A
      mm: move max_map_count bits into mm.h · ea606cf5
      Andrey Ryabinin 提交于
      max_map_count sysctl unrelated to scheduler. Move its bits from
      include/linux/sched/sysctl.h to include/linux/mm.h.
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea606cf5
    • K
      thp, vmstats: count deferred split events · f9719a03
      Kirill A. Shutemov 提交于
      Count how many times we put a THP in split queue.  Currently, it happens
      on partial unmap of a THP.
      
      Rapidly growing value can indicate that an application behaves
      unfriendly wrt THP: often fault in huge page and then unmap part of it.
      This leads to unnecessary memory fragmentation and the application may
      require tuning.
      
      The event also can help with debugging kernel [mis-]behaviour.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9719a03
    • V
      mm: workingset: make shadow node shrinker memcg aware · 0a6b76dd
      Vladimir Davydov 提交于
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a6b76dd
    • V
      mm: workingset: size shadow nodes lru basing on file cache size · cdcbb72e
      Vladimir Davydov 提交于
      A page is activated on refault if the refault distance stored in the
      corresponding shadow entry is less than the number of active file pages.
      Since active file pages can't occupy more than half memory, we assume
      that the maximal effective refault distance can't be greater than half
      the number of present pages and size the shadow nodes lru list
      appropriately.  Generally speaking, this assumption is correct, but it
      can result in wasting a considerable chunk of memory on stale shadow
      nodes in case the portion of file pages is small, e.g.  if a workload
      mostly uses anonymous memory.
      
      To sort this out, we need to compute the size of shadow nodes lru basing
      not on the maximal possible, but the current size of file cache.  We
      could take the size of active file lru for the maximal refault distance,
      but active lru is pretty unstable - it can shrink dramatically at
      runtime possibly disrupting workingset detection logic.
      
      Instead we assume that the maximal refault distance equals half the
      total number of file cache pages.  This will protect us against active
      file lru size fluctuations while still being correct, because size of
      active lru is normally maintained lower than size of inactive lru.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cdcbb72e
    • V
      mm: memcontrol: zap memcg_kmem_online helper · b6ecd2de
      Vladimir Davydov 提交于
      As kmem accounting is now either enabled for all cgroups or disabled
      system-wide, there's no point in having memcg_kmem_online() helper -
      instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
      shrink_slab() now does.
      
      There are only two places left where this helper is used -
      __memcg_kmem_charge() and memcg_create_kmem_cache().  The former can
      only be called if memcg_kmem_enabled() returned true.  Since the cgroup
      it operates on is online, mem_cgroup_is_root() check will be enough.
      
      memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
      of memcg_kmem_online(), because it relies on the fact that in
      memcg_offline_kmem() memcg->kmem_state is changed before
      memcg_deactivate_kmem_caches() is called, but there we can just
      open-code the check.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b6ecd2de
    • V
      mm: vmscan: pass root_mem_cgroup instead of NULL to memcg aware shrinker · 0fc9f58a
      Vladimir Davydov 提交于
      It's just convenient to implement a memcg aware shrinker when you know
      that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
      false.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fc9f58a
    • V
      mm: memcontrol: enable kmem accounting for all cgroups in the legacy hierarchy · b313aeee
      Vladimir Davydov 提交于
      Workingset code was recently made memcg aware, but shadow node shrinker
      is still global.  As a result, one small cgroup can consume all memory
      available for shadow nodes, possibly hurting other cgroups by reclaiming
      their shadow nodes, even though reclaim distances stored in its shadow
      nodes have no effect.  To avoid this, we need to make shadow node
      shrinker memcg aware.
      
      The actual work is done in patch 6 of the series.  Patches 1 and 2
      prepare memcg/shrinker infrastructure for the change.  Patch 3 is just a
      collateral cleanup.  Patch 4 makes radix_tree_node accounted, which is
      necessary for making shadow node shrinker memcg aware.  Patch 5 reduces
      shadow nodes overhead in case workload mostly uses anonymous pages.
      
      This patch:
      
      Currently, in the legacy hierarchy kmem accounting is off for all
      cgroups by default and must be enabled explicitly by writing something
      to memory.kmem.limit_in_bytes.  Since we don't support reclaim on
      hitting kmem limit, nor do we have any plans to implement it, this is
      likely to be -1, just to enable kmem accounting and limit kernel memory
      consumption by the memory.limit_in_bytes along with user memory.
      
      This user API was introduced when the implementation of kmem accounting
      lacked slab shrinker support and hence was useless in practice.  Things
      have changed since then - slab shrinkers were made memcg aware, the
      accounting overhead seems to be negligible, and a failure to charge a
      kmem allocation should not have critical consequences, because we only
      account those kernel objects that should be safe to fail.  That's why
      kmem accounting is enabled by default for all cgroups in the default
      hierarchy, which will eventually replace the legacy one.
      
      The ability to enable kmem accounting for some cgroups while keeping it
      disabled for others is getting difficult to maintain.  E.g.  to make
      shadow node shrinker memcg aware (see mm/workingset.c), we need to know
      the relationship between the number of shadow nodes allocated for a
      cgroup and the size of its lru list.  If kmem accounting is enabled for
      all cgroups there is no problem, but what should we do if kmem
      accounting is enabled only for half of cgroups? We've no other choice
      but use global lru stats while scanning root cgroup's shadow nodes, but
      that would be wrong if kmem accounting was enabled for all cgroups
      (which is the case if the unified hierarchy is used), in which case we
      should use lru stats of the root cgroup's lruvec.
      
      That being said, let's enable kmem accounting for all memory cgroups by
      default.  If one finds it unstable or too costly, it can always be
      disabled system-wide by passing cgroup.memory=nokmem to the kernel at
      boot time.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b313aeee
    • V
      mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Vlastimil Babka 提交于
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to attempt making memory allocation of given
      order available.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make high-order page available) or
      efficiency (don't reclaim too much).  The simplification relieas of
      doing all compaction in kcompactd, which is simply woken up when high
      watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0 pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark), the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset by
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time isn't so useful for this benchmark,
      because of the background interference being unpredictable.  It's just
      to quickly spot some major unexpected differences.  System time is
      somewhat more useful and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease of overal time spent compacting appears to not match the
      increased compaction stats.  I suspect the tasks get rescheduled and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether busy compacting or waiting for CPU doesn't
      matter.  And that latency seems to almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There's however significant
      reduction in direct compaction stalls (that is, the number of
      allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accf6242
    • V
      mm, memory hotplug: small cleanup in online_pages() · e888ca35
      Vlastimil Babka 提交于
      We can reuse the nid we've determined instead of repeated pfn_to_nid()
      usages.  Also zone_to_nid() should be a bit cheaper in general than
      pfn_to_nid().
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e888ca35
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
    • V
      mm, kswapd: remove bogus check of balance_classzone_idx · 81c5857b
      Vlastimil Babka 提交于
      During work on kcompactd integration I have spotted a confusing check of
      balance_classzone_idx, which I believe is bogus.
      
      The balanced_classzone_idx is filled by balance_pgdat() as the highest
      zone it attempted to balance.  This was introduced by commit dc83edd9
      ("mm: kswapd: use the classzone idx that kswapd was using for
      sleeping_prematurely()").
      
      The intention is that (as expressed in today's function names), the
      value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
      as for the decisions in kswapd_try_to_sleep().
      
      An unwanted side-effect of that commit was breaking the checks in
      kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
      classzone_idx.  Commits 215ddd66 ("mm: vmscan: only read
      new_classzone_idx from pgdat when reclaiming successfully") and
      d2ebd0f6 ("kswapd: avoid unnecessary rebalance after an unsuccessful
      balancing") tried to fixed, but apparently introduced a bogus check that
      this patch removes.
      
      Consider zone indexes X < Y < Z, where:
      - Z is the value used for the first kswapd wakeup.
      - Y is returned as balanced_classzone_idx, which means zones with index higher
        than Y (including Z) were found to be unreclaimable.
      - X is the value used for the second kswapd wakeup
      
      The new wakeup with value X means that kswapd is now supposed to balance
      harder all zones with index <= X.  But instead, due to Y < Z, it will go
      sleep and won't read the new value X.  This is subtly wrong.
      
      The effect of this patch is that kswapd will react better in some
      situations, where e.g.  the first wakeup is for ZONE_DMA32, the second is
      for ZONE_DMA, and due to unreclaimable ZONE_NORMAL.  Before this patch,
      kswapd would go sleep instead of reclaiming ZONE_DMA harder.  I expect
      these situations are very rare, and more value is in better
      maintainability due to the removal of confusing and bogus check.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81c5857b
    • J
      sound: query dynamic DEBUG_PAGEALLOC setting · 505f6d22
      Joonsoo Kim 提交于
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NTakashi Iwai <tiwai@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      505f6d22
    • J
      mm/slub: query dynamic DEBUG_PAGEALLOC setting · 922d566c
      Joonsoo Kim 提交于
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: clean up code, per Christian]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      922d566c
    • J
      mm/vmalloc: query dynamic DEBUG_PAGEALLOC setting · f48d97f3
      Joonsoo Kim 提交于
      As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel parameters
      we can optimize some cases by checking the enablement state.
      
      This is follow-up work for Christian's Optimize CONFIG_DEBUG_PAGEALLOC:
      
        https://lkml.org/lkml/2016/1/27/194
      
      Remaining work is to make sparc to be aware of this but it looks not
      easy for me so I skip that in this series.
      
      This patch (of 5):
      
      We can disable debug_pagealloc processing even if the code is complied
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: update comment, per David.  Adjust comment to use 80 cols]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Takashi Iwai <tiwai@suse.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f48d97f3
    • N
      /proc/kpageflags: return KPF_BUDDY for "tail" buddy pages · 832fc1de
      Naoya Horiguchi 提交于
      Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
      is inconvenient when grasping how free pages are distributed.  This
      patch sets KPF_BUDDY for such pages.
      
      With this patch:
      
        $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
        MemFree:         3134992 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000400          779272     3044  __________B_______________________________ buddy
        0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
                     total          783657     3061
      
      783657 pages is 3134628 kB (roughly consistent with the global counter,)
      so it's OK.
      
      [akpm@linux-foundation.org: update comment, per Naoya]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com&gt;>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      832fc1de
    • V
      mm: memcontrol: report kernel stack usage in cgroup2 memory.stat · 12580e4b
      Vladimir Davydov 提交于
      Show how much memory is allocated to kernel stacks.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12580e4b
    • V
      mm: memcontrol: report slab usage in cgroup2 memory.stat · 27ee57c9
      Vladimir Davydov 提交于
      Show how much memory is used for storing reclaimable and unreclaimable
      in-kernel data structures allocated from slab caches.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27ee57c9
    • V
      mm: memcontrol: make tree_{stat,events} fetch all stats · 72b54e73
      Vladimir Davydov 提交于
      Currently, tree_{stat,events} helpers can only get one stat index at a
      time, so when there are a lot of stats to be reported one has to call it
      over and over again (see memory_stat_show).  This is neither effective,
      nor does it look good.  Instead, let's make these helpers take a
      snapshot of all available counters.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b54e73
    • V
      mm: memcontrol: do not bypass slab charge if memcg is offline · fcff7d7e
      Vladimir Davydov 提交于
      Slab pages are charged in two steps.  First, an appropriate per memcg
      cache is selected (see memcg_kmem_get_cache) basing on the current
      context, then the new slab page is charged to the memory cgroup which
      the selected cache was created for (see memcg_charge_slab ->
      __memcg_kmem_charge_memcg).  It is OK to bypass kmemcg charge at step 1,
      but if step 1 succeeded and we successfully allocated a new slab page,
      step 2 must be performed, otherwise we would get a per memcg kmem cache
      which contains a slab that does not hold a reference to the memory
      cgroup owning the cache.  Since per memcg kmem caches are destroyed on
      memcg css free, this could result in freeing a cache while there are
      still active objects in it.
      
      However, currently we will bypass slab page charge if the memory cgroup
      owning the cache is offline (see __memcg_kmem_charge_memcg).  This is
      very unlikely to occur in practice, because for this to happen a process
      must be migrated to a different cgroup and the old cgroup must be
      removed while the process is in kmalloc somewhere between steps 1 and 2
      (e.g.  trying to allocate a new page).  Nevertheless, it's still better
      to eliminate such a possibility.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcff7d7e
    • J
      mm: oom_kill: don't ignore oom score on exiting tasks · 6a618957
      Johannes Weiner 提交于
      When the OOM killer scans tasks and encounters a PF_EXITING one, it
      force-selects that task regardless of the score.  The problem is that if
      that task got stuck waiting for some state the allocation site is
      holding, the OOM reaper can not move on to the next best victim.
      
      Frankly, I don't even know why we check for exiting tasks in the OOM
      killer.  We've tried direct reclaim at least 15 times by the time we
      decide the system is OOM, there was plenty of time to exit and free
      memory; and a task might exit voluntarily right after we issue a kill.
      This is testing pure noise.  Remove it.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a618957
  2. 16 3月, 2016 8 次提交