1. 20 5月, 2016 4 次提交
    • M
      mm, page_alloc: convert alloc_flags to unsigned · c603844b
      Mel Gorman 提交于
      alloc_flags is a bitmask of flags but it is signed which does not
      necessarily generate the best code depending on the compiler.  Even
      without an impact, it makes more sense that this be unsigned.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c603844b
    • V
      mm, compaction: skip blocks where isolation fails in async direct compaction · fdd048e1
      Vlastimil Babka 提交于
      The goal of direct compaction is to quickly make a high-order page
      available for the pending allocation.  Within an aligned block of pages
      of desired order, a single allocated page that cannot be isolated for
      migration means that the block cannot fully merge to a buddy page that
      would satisfy the allocation request.  Therefore we can reduce the
      allocation stall by skipping the rest of the block immediately on
      isolation failure.  For async compaction, this also means a higher
      chance of succeeding until it detects contention.
      
      We however shouldn't completely sacrifice the second objective of
      compaction, which is to reduce overal long-term memory fragmentation.
      As a compromise, perform the eager skipping only in direct async
      compaction, while sync compaction (including kcompactd) remains
      thorough.
      
      Testing was done using stress-highalloc from mmtests, configured for
      order-4 GFP_KERNEL allocations:
      
                                       4.6-rc1               4.6-rc1
                                        before                 after
        Success 1 Min         24.00 (  0.00%)       27.00 (-12.50%)
        Success 1 Mean        30.20 (  0.00%)       31.60 ( -4.64%)
        Success 1 Max         37.00 (  0.00%)       35.00 (  5.41%)
        Success 2 Min         42.00 (  0.00%)       32.00 ( 23.81%)
        Success 2 Mean        44.00 (  0.00%)       44.80 ( -1.82%)
        Success 2 Max         48.00 (  0.00%)       52.00 ( -8.33%)
        Success 3 Min         91.00 (  0.00%)       92.00 ( -1.10%)
        Success 3 Mean        92.20 (  0.00%)       92.80 ( -0.65%)
        Success 3 Max         94.00 (  0.00%)       93.00 (  1.06%)
      
      We can see that success rates are unaffected by the skipping.
      
                      4.6-rc1     4.6-rc1
                       before       after
        User         2587.42     2566.53
        System        482.89      471.20
        Elapsed      1395.68     1382.00
      
      Times are not so useful metric for this benchmark as main portion is the
      interfering kernel builds, but results do hint at reduced system times.
      
                                            4.6-rc1     4.6-rc1
                                             before       after
        Direct pages scanned                163614      159608
        Kswapd pages scanned               2070139     2078790
        Kswapd pages reclaimed             2061707     2069757
        Direct pages reclaimed              163354      159505
      
      Reduced direct reclaim was unintended, but could be explained by more
      successful first attempt at (async) direct compaction, which is
      attempted before the first reclaim attempt in __alloc_pages_slowpath().
      
        Compaction stalls                    33052       39853
        Compaction success                   12121       19773
        Compaction failures                  20931       20079
      
      Compaction is indeed more successful, and thus less likely to get
      deferred, so there are also more direct compaction stalls.
      
        Page migrate success               3781876     3326819
        Page migrate failure                 45817       41774
        Compaction pages isolated          7868232     6941457
        Compaction migrate scanned       168160492   127269354
        Compaction migrate prescanned            0           0
        Compaction free scanned         2522142582  2326342620
        Compaction free direct alloc             0           0
        Compaction free dir. all. miss           0           0
        Compaction cost                       5252        4476
      
      The patch reduces migration scanned pages by 25% thanks to the eager
      skipping.
      
      [hughd@google.com: prevent nr_isolated_* from going negative]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fdd048e1
    • V
      mm, compaction: reduce spurious pcplist drains · a34753d2
      Vlastimil Babka 提交于
      Compaction drains the local pcplists each time migration scanner moves
      away from a cc->order aligned block where it isolated pages for
      migration, so that the pages freed by migrations can merge into higher
      orders.
      
      The detection is currently coarser than it could be.  The
      cc->last_migrated_pfn variable should track the lowest pfn that was
      isolated for migration.  But it is set to the pfn where
      isolate_migratepages_block() starts scanning, which is typically the
      first pfn of the pageblock.  There, the scanner might fail to isolate
      several order-aligned blocks, and then isolate COMPACT_CLUSTER_MAX in
      another block.  This would cause the pcplists drain to be performed,
      although the scanner didn't yet finish the block where it isolated from.
      
      This patch thus makes cc->last_migrated_pfn handling more accurate by
      setting it to the pfn of an actually isolated page in
      isolate_migratepages_block().  Although practical effects of this patch
      are likely low, it arguably makes the intent of the code more obvious.
      Also the next patch will make async direct compaction skip blocks more
      aggressively, and draining pcplists due to skipped blocks is wasteful.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a34753d2
    • V
      mm, compaction: wrap calculating first and last pfn of pageblock · 06b6640a
      Vlastimil Babka 提交于
      Compaction code has accumulated numerous instances of manual
      calculations of the first (inclusive) and last (exclusive) pfn of a
      pageblock (or a smaller block of given order), given a pfn within the
      pageblock.
      
      Wrap these calculations by introducing pageblock_start_pfn(pfn) and
      pageblock_end_pfn(pfn) macros.
      
      [vbabka@suse.cz: fix crash in get_pfnblock_flags_mask() from isolate_freepages():]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06b6640a
  2. 06 5月, 2016 2 次提交
    • V
      mm: fix kcompactd hang during memory offlining · 172400c6
      Vlastimil Babka 提交于
      Assume memory47 is the last online block left in node1.  This will hang:
      
        # echo offline > /sys/devices/system/node/node1/memory47/state
      
      After a couple of minutes, the following pops up in dmesg:
      
        INFO: task bash:957 blocked for more than 120 seconds.
               Not tainted 4.6.0-rc6+ #6
        "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
        bash            D ffff8800b7adbaf8     0   957    951 0x00000000
        Call Trace:
          schedule+0x35/0x80
          schedule_timeout+0x1ac/0x270
          wait_for_completion+0xe1/0x120
          kthread_stop+0x4f/0x110
          kcompactd_stop+0x26/0x40
          __offline_pages.constprop.28+0x7e6/0x840
          offline_pages+0x11/0x20
          memory_block_action+0x73/0x1d0
          memory_subsys_offline+0x47/0x60
          device_offline+0x86/0xb0
          store_mem_state+0xda/0xf0
          dev_attr_store+0x18/0x30
          sysfs_kf_write+0x37/0x40
          kernfs_fop_write+0x11d/0x170
          __vfs_write+0x37/0x120
          vfs_write+0xa9/0x1a0
          SyS_write+0x55/0xc0
          entry_SYSCALL_64_fastpath+0x1a/0xa4
      
      kcompactd is waiting for kcompactd_max_order > 0 when it's woken up to
      actually exit.  Check kthread_should_stop() to break out of the wait.
      
      Fixes: 698b1b30 ("mm, compaction: introduce kcompactd").
      Reported-by: NReza Arbab <arbab@linux.vnet.ibm.com>
      Tested-by: NReza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      172400c6
    • H
      mm, cma: prevent nr_isolated_* counters from going negative · 14af4a5e
      Hugh Dickins 提交于
      /proc/sys/vm/stat_refresh warns nr_isolated_anon and nr_isolated_file go
      increasingly negative under compaction: which would add delay when
      should be none, or no delay when should delay.  The bug in compaction
      was due to a recent mmotm patch, but much older instance of the bug was
      also noticed in isolate_migratepages_range() which is used for CMA and
      gigantic hugepage allocations.
      
      The bug is caused by putback_movable_pages() in an error path
      decrementing the isolated counters without them being previously
      incremented by acct_isolated().  Fix isolate_migratepages_range() by
      removing the error-path putback, thus reaching acct_isolated() with
      migratepages still isolated, and leaving putback to caller like most
      other places do.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      [vbabka@suse.cz: expanded the changelog]
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14af4a5e
  3. 18 3月, 2016 2 次提交
    • V
      mm, kswapd: replace kswapd compaction with waking up kcompactd · accf6242
      Vlastimil Babka 提交于
      Similarly to direct reclaim/compaction, kswapd attempts to combine
      reclaim and compaction to attempt making memory allocation of given
      order available.
      
      The details differ from direct reclaim e.g. in having high watermark as
      a goal.  The code involved in kswapd's reclaim/compaction decisions has
      evolved to be quite complex.
      
      Testing reveals that it doesn't actually work in at least one scenario,
      and closer inspection suggests that it could be greatly simplified
      without compromising on the goal (make high-order page available) or
      efficiency (don't reclaim too much).  The simplification relieas of
      doing all compaction in kcompactd, which is simply woken up when high
      watermarks are reached by kswapd's reclaim.
      
      The scenario where kswapd compaction doesn't work was found with mmtests
      test stress-highalloc configured to attempt order-9 allocations without
      direct reclaim, just waking up kswapd.  There was no compaction attempt
      from kswapd during the whole test.  Some added instrumentation shows
      what happens:
      
       - balance_pgdat() sets end_zone to Normal, as it's not balanced
       - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
         it cannot reclaim anything, so sc.nr_reclaimed is 0
       - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
         it merely checks if high watermarks were reached for base pages.
         This is true, so no reclaim is attempted.  For DMA, testorder=0
         wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
       - even though the pgdat_needs_compaction flag wasn't set to false, no
         compaction happens due to the condition sc.nr_reclaimed >
         nr_attempted being false (as 0 < 99)
       - priority-- due to nr_reclaimed being 0, repeat until priority reaches
         0 pgdat_balanced() is false as only the small zone DMA appears
         balanced (curiously in that check, watermark appears OK and
         compaction_suitable() returns COMPACT_PARTIAL, because a lower
         classzone_idx is used there)
      
      Now, even if it was decided that reclaim shouldn't be attempted on the
      DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
      nr_attempted=0) is also false.  The condition really should use >= as
      the comment suggests.  Then there is a mismatch in the check for setting
      pgdat_needs_compaction to false using low watermark, while the rest uses
      high watermark, and who knows what other subtlety.  Hopefully this
      demonstrates that this is unsustainable.
      
      Luckily we can simplify this a lot.  The reclaim/compaction decisions
      make sense for direct reclaim scenario, but in kswapd, our primary goal
      is to reach high watermark in order-0 pages.  Afterwards we can attempt
      compaction just once.  Unlike direct reclaim, we don't reclaim extra
      pages (over the high watermark), the current code already disallows it
      for good reasons.
      
      After this patch, we simply wake up kcompactd to process the pgdat,
      after we have either succeeded or failed to reach the high watermarks in
      kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
      so kcompactd can apply the same criteria to determine which zones are
      worth compacting.  Note that we use the classzone_idx from
      wakeup_kswapd(), not balanced_classzone_idx which can include higher
      zones that kswapd tried to balance too, but didn't consider them in
      pgdat_balanced().
      
      Since kswapd now cannot create high-order pages itself, we need to
      adjust how it determines the zones to be balanced.  The key element here
      is adding a "highorder" parameter to zone_balanced, which, when set to
      false, makes it consider only order-0 watermark instead of the desired
      higher order (this was done previously by kswapd_shrink_zone(), but not
      elsewhere).  This false is passed for example in pgdat_balanced().
      Importantly, wakeup_kswapd() uses true to make sure kswapd and thus
      kcompactd are woken up for a high-order allocation failure.
      
      The last thing is to decide what to do with pageblock_skip bitmap
      handling.  Compaction maintains a pageblock_skip bitmap to record
      pageblocks where isolation recently failed.  This bitmap can be reset by
      three ways:
      
      1) direct compaction is restarting after going through the full deferred cycle
      
      2) kswapd goes to sleep, and some other direct compaction has previously
         finished scanning the whole zone and set zone->compact_blockskip_flush.
         Note that a successful direct compaction clears this flag.
      
      3) compaction was invoked manually via trigger in /proc
      
      The case 2) is somewhat fuzzy to begin with, but after introducing
      kcompactd we should update it.  The check for direct compaction in 1),
      and to set the flush flag in 2) use current_is_kswapd(), which doesn't
      work for kcompactd.  Thus, this patch adds bool direct_compaction to
      compact_control to use in 2).  For the case 1) we remove the check
      completely - unlike the former kswapd compaction, kcompactd does use the
      deferred compaction functionality, so flushing tied to restarting from
      deferred compaction makes sense here.
      
      Note that when kswapd goes to sleep, kcompactd is woken up, so it will
      see the flushed pageblock_skip bits.  This is different from when the
      former kswapd compaction observed the bits and I believe it makes more
      sense.  Kcompactd can afford to be more thorough than a direct
      compaction trying to limit allocation latency, or kswapd whose primary
      goal is to reclaim.
      
      For testing, I used stress-highalloc configured to do order-9
      allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
      on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
      phases 1 and 2 work as usual):
      
      stress-highalloc
                              4.5-rc1+before          4.5-rc1+after
                                   -nodirect              -nodirect
      Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
      Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
      Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
      Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
      Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
      Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
      Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
      Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)
      
      User                          3166.67        3181.09
      System                        1153.37        1158.25
      Elapsed                       1768.53        1799.37
      
                                  4.5-rc1+before   4.5-rc1+after
                                       -nodirect    -nodirect
      Direct pages scanned                32938        32797
      Kswapd pages scanned              2183166      2202613
      Kswapd pages reclaimed            2152359      2143524
      Direct pages reclaimed              32735        32545
      Percentage direct scans                1%           1%
      THP fault alloc                       579          612
      THP collapse alloc                    304          316
      THP splits                              0            0
      THP fault fallback                    793          778
      THP collapse fail                      11           16
      Compaction stalls                    1013         1007
      Compaction success                     92           67
      Compaction failures                   920          939
      Page migrate success               238457       721374
      Page migrate failure                23021        23469
      Compaction pages isolated          504695      1479924
      Compaction migrate scanned         661390      8812554
      Compaction free scanned          13476658     84327916
      Compaction cost                       262          838
      
      After this patch we see improvements in allocation success rate
      (especially for phase 3) along with increased compaction activity.  The
      compaction stalls (direct compaction) in the interfering kernel builds
      (probably THP's) also decreased somewhat thanks to kcompactd activity,
      yet THP alloc successes improved a bit.
      
      Note that elapsed and user time isn't so useful for this benchmark,
      because of the background interference being unpredictable.  It's just
      to quickly spot some major unexpected differences.  System time is
      somewhat more useful and that didn't increase.
      
      Also (after adjusting mmtests' ftrace monitor):
      
      Time kswapd awake               2547781     2269241
      Time kcompactd awake                  0      119253
      Time direct compacting           939937      557649
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      119099
      
      The decrease of overal time spent compacting appears to not match the
      increased compaction stats.  I suspect the tasks get rescheduled and
      since the ftrace monitor doesn't see that, the reported time is wall
      time, not CPU time.  But arguably direct compactors care about overall
      latency anyway, whether busy compacting or waiting for CPU doesn't
      matter.  And that latency seems to almost halved.
      
      It's also interesting how much time kswapd spent awake just going
      through all the priorities and failing to even try compacting, over and
      over.
      
      We can also configure stress-highalloc to perform both direct
      reclaim/compaction and wakeup kswapd/kcompactd, by using
      GFP_KERNEL|__GFP_HIGH|__GFP_COMP:
      
      stress-highalloc
                              4.5-rc1+before         4.5-rc1+after
                                     -direct               -direct
      Success 1 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 1 Mean         8.00 (  0.00%)       10.00 (-19.05%)
      Success 1 Max         12.00 (  0.00%)       11.00 ( 15.38%)
      Success 2 Min          4.00 (  0.00%)        9.00 (-50.00%)
      Success 2 Mean         8.20 (  0.00%)       10.00 (-16.28%)
      Success 2 Max         13.00 (  0.00%)       11.00 (  8.33%)
      Success 3 Min         75.00 (  0.00%)       74.00 (  1.33%)
      Success 3 Mean        75.60 (  0.00%)       75.20 (  0.53%)
      Success 3 Max         77.00 (  0.00%)       76.00 (  0.00%)
      
      User                          3344.73       3246.04
      System                        1194.24       1172.29
      Elapsed                       1838.04       1836.76
      
                                  4.5-rc1+before  4.5-rc1+after
                                         -direct     -direct
      Direct pages scanned               125146      120966
      Kswapd pages scanned              2119757     2135012
      Kswapd pages reclaimed            2073183     2108388
      Direct pages reclaimed             124909      120577
      Percentage direct scans                5%          5%
      THP fault alloc                       599         652
      THP collapse alloc                    323         354
      THP splits                              0           0
      THP fault fallback                    806         793
      THP collapse fail                      17          16
      Compaction stalls                    2457        2025
      Compaction success                    906         518
      Compaction failures                  1551        1507
      Page migrate success              2031423     2360608
      Page migrate failure                32845       40852
      Compaction pages isolated         4129761     4802025
      Compaction migrate scanned       11996712    21750613
      Compaction free scanned         214970969   344372001
      Compaction cost                      2271        2694
      
      In this scenario, this patch doesn't change the overall success rate as
      direct compaction already tries all it can.  There's however significant
      reduction in direct compaction stalls (that is, the number of
      allocations that went into direct compaction).  The number of successes
      (i.e.  direct compaction stalls that ended up with successful
      allocation) is reduced by the same number.  This means the offload to
      kcompactd is working as expected, and direct compaction is reduced
      either due to detecting contention, or compaction deferred by kcompactd.
      In the previous version of this patchset there was some apparent
      reduction of success rate, but the changes in this version (such as
      using sync compaction only), new baseline kernel, and/or averaging
      results from 5 executions (my bet), made this go away.
      
      Ftrace-based stats seem to roughly agree:
      
      Time kswapd awake               2532984     2326824
      Time kcompactd awake                  0      257916
      Time direct compacting           864839      735130
      Time kswapd compacting                0           0
      Time kcompactd compacting             0      257585
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accf6242
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  4. 16 3月, 2016 3 次提交
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim 提交于
      There is a performance drop report due to hugepage allocation and in
      there half of cpu time are spent on pageblock_pfn_to_page() in
      compaction [1].
      
      In that workload, compaction is triggered to make hugepage but most of
      pageblocks are un-available for compaction due to pageblock type and
      skip bit so compaction usually fails.  Most costly operations in this
      case is to find valid pageblock while scanning whole zone range.  To
      check if pageblock is valid to compact, valid pfn within pageblock is
      required and we can obtain it by calling pageblock_pfn_to_page().  This
      function checks whether pageblock is in a single zone and return valid
      pfn if possible.  Problem is that we need to check it every time before
      scanning pageblock even if we re-visit it and this turns out to be very
      expensive in this workload.
      
      Although we have no way to skip this pageblock check in the system where
      hole exists at arbitrary position, we can use cached value for zone
      continuity and just do pfn_to_page() in the system where hole doesn't
      exist.  This optimization considerably speeds up in above workload.
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min: 635 MB/s 1015 MB/s
        Avg: 899 MB/s 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • J
      mm/compaction: pass only pageblock aligned range to pageblock_pfn_to_page · e1409c32
      Joonsoo Kim 提交于
      pageblock_pfn_to_page() is used to check there is valid pfn and all
      pages in the pageblock is in a single zone.  If there is a hole in the
      pageblock, passing arbitrary position to pageblock_pfn_to_page() could
      cause to skip whole pageblock scanning, instead of just skipping the
      hole page.  For deterministic behaviour, it's better to always pass
      pageblock aligned range to pageblock_pfn_to_page().  It will also help
      further optimization on pageblock_pfn_to_page() in the following patch.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1409c32
    • J
      mm/compaction: fix invalid free_pfn and compact_cached_free_pfn · 623446e4
      Joonsoo Kim 提交于
      free_pfn and compact_cached_free_pfn are the pointer that remember
      restart position of freepage scanner.  When they are reset or invalid,
      we set them to zone_end_pfn because freepage scanner works in reverse
      direction.  But, because zone range is defined as [zone_start_pfn,
      zone_end_pfn), zone_end_pfn is invalid to access.  Therefore, we should
      not store it to free_pfn and compact_cached_free_pfn.  Instead, we need
      to store zone_end_pfn - 1 to them.  There is one more thing we should
      consider.  Freepage scanner scan reversely by pageblock unit.  If
      free_pfn and compact_cached_free_pfn are set to middle of pageblock, it
      regards that sitiation as that it already scans front part of pageblock
      so we lose opportunity to scan there.  To fix-up, this patch do
      round_down() to guarantee that reset position will be pageblock aligned.
      
      Note that thanks to the current pageblock_pfn_to_page() implementation,
      actual access to zone_end_pfn doesn't happen until now.  But, following
      patch will change pageblock_pfn_to_page() so this patch is needed from
      now on.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      623446e4
  5. 15 1月, 2016 2 次提交
  6. 06 11月, 2015 3 次提交
  7. 09 9月, 2015 6 次提交
    • J
      mm/compaction: correct to flush migrated pages if pageblock skip happens · 1a16718c
      Joonsoo Kim 提交于
      We cache isolate_start_pfn before entering isolate_migratepages().  If
      pageblock is skipped in isolate_migratepages() due to whatever reason,
      cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages
      that were freed.  For example, the following scenario can be possible:
      
      - assume order-9 compaction, pageblock order is 9
      - start_isolate_pfn is 0x200
      - isolate_migratepages()
        - skip a number of pageblocks
        - start to isolate from pfn 0x600
        - cc->migrate_pfn = 0x620
        - return
      - last_migrated_pfn is set to 0x200
      - check flushing condition
        - current_block_start is set to 0x600
        - last_migrated_pfn < current_block_start then do useless flush
      
      This wrong flush would not help the performance and success rate so this
      patch tries to fix it.  One simple way to know the exact position where
      we start to isolate migratable pages is that we cache it in
      isolate_migratepages() before entering actual isolation.  This patch
      implements that and fixes the problem.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a16718c
    • V
      mm, compaction: skip compound pages by order in free scanner · 9fcd6d2e
      Vlastimil Babka 提交于
      The compaction free scanner is looking for PageBuddy() pages and
      skipping all others.  For large compound pages such as THP or hugetlbfs,
      we can save a lot of iterations if we skip them at once using their
      compound_order().  This is generally unsafe and we can read a bogus
      value of order due to a race, but if we are careful, the only danger is
      skipping too much.
      
      When tested with stress-highalloc from mmtests on 4GB system with 1GB
      hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
      least 15%.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9fcd6d2e
    • V
      mm, compaction: always skip all compound pages by order in migrate scanner · 29c0dde8
      Vlastimil Babka 提交于
      The compaction migrate scanner tries to skip THP pages by their order,
      to reduce number of iterations for pages it cannot isolate.  The check
      is only done if PageLRU() is true, which means it applies to THP pages,
      but not e.g.  hugetlbfs pages or any other non-LRU compound pages, which
      we have to iterate by base pages.
      
      This limitation comes from the assumption that it's only safe to read
      compound_order() when we have the zone's lru_lock and THP cannot be
      split under us.  But the only danger (after filtering out order values
      that are not below MAX_ORDER, to prevent overflows) is that we skip too
      much or too little after reading a bogus compound_order() due to a rare
      race.  This is the same reasoning as patch 99c0fd5e ("mm,
      compaction: skip buddy pages by their order in the migrate scanner")
      introduced for unsafely reading PageBuddy() order.
      
      After this patch, all pages are tested for PageCompound() and we skip
      them by compound_order().  The test is done after the test for
      balloon_page_movable() as we don't want to assume if balloon pages (or
      other pages with own isolation and migration implementation if a generic
      API gets implemented) are compound or not.
      
      When tested with stress-highalloc from mmtests on 4GB system with 1GB
      hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by
      15%.
      
      [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29c0dde8
    • V
      mm, compaction: encapsulate resetting cached scanner positions · 02333641
      Vlastimil Babka 提交于
      Reseting the cached compaction scanner positions is now open-coded in
      __reset_isolation_suitable() and compact_finished().  Encapsulate the
      functionality in a new function reset_cached_positions().
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02333641
    • V
      mm, compaction: simplify handling restart position in free pages scanner · f5f61a32
      Vlastimil Babka 提交于
      Handling the position where compaction free scanner should restart
      (stored in cc->free_pfn) got more complex with commit e14c720e ("mm,
      compaction: remember position within pageblock in free pages scanner").
      Currently the position is updated in each loop iteration of
      isolate_freepages(), although it should be enough to update it only when
      breaking from the loop.  There's also an extra check outside the loop
      updates the position in case we have met the migration scanner.
      
      This can be simplified if we move the test for having isolated enough
      from the for-loop header next to the test for contention, and
      determining the restart position only in these cases.  We can reuse the
      isolate_start_pfn variable for this instead of setting cc->free_pfn
      directly.  Outside the loop, we can simply set cc->free_pfn to current
      value of isolate_start_pfn without any extra check.
      
      Also add a VM_BUG_ON to catch possible mistake in the future, in case we
      later add a new condition that terminates isolate_freepages_block()
      prematurely without also considering the condition in
      isolate_freepages().
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5f61a32
    • V
      mm, compaction: more robust check for scanners meeting · f2849aa0
      Vlastimil Babka 提交于
      Assorted compaction cleanups and optimizations.  The interesting patches
      are 4 and 5.  In 4, skipping of compound pages in single iteration is
      improved for migration scanner, so it works also for !PageLRU compound
      pages such as hugetlbfs, slab etc.  Patch 5 introduces this kind of
      skipping in the free scanner.  The trick is that we can read
      compound_order() without any protection, if we are careful to filter out
      values larger than MAX_ORDER.  The only danger is that we skip too much.
      The same trick was already used for reading the freepage order in the
      migrate scanner.
      
      To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc
      from mmtests, set to simulate THP allocations (including __GFP_COMP) on
      a 4GB system where 1GB was occupied by hugetlbfs pages.  I'll include
      just the relevant stats:
      
                                     Patch 3     Patch 4     Patch 5
      
      Compaction stalls                 7523        7529        7515
      Compaction success                 323         304         322
      Compaction failures               7200        7224        7192
      Page migrate success            247778      264395      240737
      Page migrate failure             15358       33184       21621
      Compaction pages isolated       906928      980192      909983
      Compaction migrate scanned     2005277     1692805     1498800
      Compaction free scanned       13255284    11539986     9011276
      Compaction cost                    288         305         277
      
      With 5 iterations per patch, the results are still noisy, but we can see
      that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the
      hugetlbfs pages at once.  Interestingly, free_scanned is also reduced
      and I have no idea why.  Patch 5 further reduces free_scanned as
      expected, by 15%.  Other stats are unaffected modulo noise.
      
      [1] https://lkml.org/lkml/2015/1/19/158
      
      This patch (of 5):
      
      Compaction should finish when the migration and free scanner meet, i.e.
      they reach the same pageblock.  Currently however, the test in
      compact_finished() simply just compares the exact pfns, which may yield
      a false negative when the free scanner position is in the middle of a
      pageblock and the migration scanner reaches the begining of the same
      pageblock.
      
      This hasn't been a problem until commit e14c720e ("mm, compaction:
      remember position within pageblock in free pages scanner") allowed the
      free scanner position to be in the middle of a pageblock between
      invocations.  The hot-fix 1d5bfe1f ("mm, compaction: prevent
      infinite loop in compact_zone") prevented the issue by adding a special
      check in the migration scanner to satisfy the current detection of
      scanners meeting.
      
      However, the proper fix is to make the detection more robust.  This
      patch introduces the compact_scanners_met() function that returns true
      when the free scanner position is in the same or lower pageblock than
      the migration scanner.  The special case in isolate_migratepages()
      introduced by 1d5bfe1f is removed.
      Suggested-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f2849aa0
  8. 16 4月, 2015 3 次提交
  9. 15 4月, 2015 1 次提交
    • J
      mm/compaction: enhance compaction finish condition · 2149cdae
      Joonsoo Kim 提交于
      Compaction has anti fragmentation algorithm.  It is that freepage should
      be more than pageblock order to finish the compaction if we don't find any
      freepage in requested migratetype buddy list.  This is for mitigating
      fragmentation, but, there is a lack of migratetype consideration and it is
      too excessive compared to page allocator's anti fragmentation algorithm.
      
      Not considering migratetype would cause premature finish of compaction.
      For example, if allocation request is for unmovable migratetype, freepage
      with CMA migratetype doesn't help that allocation and compaction should
      not be stopped.  But, current logic regards this situation as compaction
      is no longer needed, so finish the compaction.
      
      Secondly, condition is too excessive compared to page allocator's logic.
      We can steal freepage from other migratetype and change pageblock
      migratetype on more relaxed conditions in page allocator.  This is
      designed to prevent fragmentation and we can use it here.  Imposing hard
      constraint only to the compaction doesn't help much in this case since
      page allocator would cause fragmentation again.
      
      To solve these problems, this patch borrows anti fragmentation logic from
      page allocator.  It will reduce premature compaction finish in some cases
      and reduce excessive compaction work.
      
      stress-highalloc test in mmtests with non movable order 7 allocation shows
      considerable increase of compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      31.82 : 42.20
      
      I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
      there is no more degradation on allocation success rate than before.  That
      roughly means that this patch doesn't result in more fragmentations.
      
      Vlastimil suggests additional idea that we only test for fallbacks when
      migration scanner has scanned a whole pageblock.  It looked good for
      fragmentation because chance of stealing increase due to making more free
      pages in certain pageblock.  So, I tested it, but, it results in decreased
      compaction success rate, roughly 38.00.  I guess the reason that if system
      is low memory condition, watermark check could be failed due to not enough
      order 0 free page and so, sometimes, we can't reach a fallback check
      although migrate_pfn is aligned to pageblock_nr_pages.  I can insert code
      to cope with this situation but it makes code more complicated so I don't
      include his idea at this patch.
      
      [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2149cdae
  10. 14 2月, 2015 1 次提交
    • A
      mm: page_alloc: add kasan hooks on alloc and free paths · b8c73fc2
      Andrey Ryabinin 提交于
      Add kernel address sanitizer hooks to mark allocated page's addresses as
      accessible in corresponding shadow region.  Mark freed pages as
      inaccessible.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8c73fc2
  11. 13 2月, 2015 3 次提交
    • H
      mm: fix negative nr_isolated counts · ff59909a
      Hugh Dickins 提交于
      The vmstat interfaces are good at hiding negative counts (at least when
      CONFIG_SMP); but if you peer behind the curtain, you find that
      nr_isolated_anon and nr_isolated_file soon go negative, and grow ever
      more negative: so they can absorb larger and larger numbers of isolated
      pages, yet still appear to be zero.
      
      I'm happy to avoid a congestion_wait() when too_many_isolated() myself;
      but I guess it's there for a good reason, in which case we ought to get
      too_many_isolated() working again.
      
      The imbalance comes from isolate_migratepages()'s ISOLATE_ABORT case:
      putback_movable_pages() decrements the NR_ISOLATED counts, but we forgot
      to call acct_isolated() to increment them.
      
      It is possible that the bug whcih this patch fixes could cause OOM kills
      when the system still has a lot of reclaimable page cache.
      
      Fixes: edc2ca61 ("mm, compaction: move pageblock checks up from isolate_migratepages_range()")
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff59909a
    • J
      mm/compaction: stop the isolation when we isolate enough freepage · 932ff6bb
      Joonsoo Kim 提交于
      Currently, freepage isolation in one pageblock doesn't consider how many
      freepages we isolate. When I traced flow of compaction, compaction
      sometimes isolates more than 256 freepages to migrate just 32 pages.
      
      In this patch, freepage isolation is stopped at the point that we
      have more isolated freepage than isolated page for migration. This
      results in slowing down free page scanner and make compaction success
      rate higher.
      
      stress-highalloc test in mmtests with non movable order 7 allocation shows
      increase of compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      27.13 : 31.82
      
      pfn where both scanners meets on compaction complete
      (separate test due to enormous tracepoint buffer)
      (zone_start=4096, zone_end=1048576)
      586034 : 654378
      
      In fact, I didn't fully understand why this patch results in such good
      result. There was a guess that not used freepages are released to pcp list
      and on next compaction trial we won't isolate them again so compaction
      success rate would decrease. To prevent this effect, I tested with adding
      pcp drain code on release_freepages(), but, it has no good effect.
      
      Anyway, this patch reduces waste time to isolate unneeded freepages so
      seems reasonable.
      
      Vlastimil said:
      
      : I briefly tried it on top of the pivot-changing series and with order-9
      : allocations it reduced free page scanned counter by almost 10%.  No effect
      : on success rates (maybe because pivot changing already took care of the
      : scanners meeting problem) but the scanning reduction is good on its own.
      :
      : It also explains why e14c720e ("mm, compaction: remember position
      : within pageblock in free pages scanner") had less than expected
      : improvements.  It would only actually stop within pageblock in case of
      : async compaction detecting contention.  I guess that's also why the
      : infinite loop problem fixed by 1d5bfe1f affected so relatively few
      : people.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      932ff6bb
    • J
      mm/compaction: fix wrong order check in compact_finished() · 372549c2
      Joonsoo Kim 提交于
      What we want to check here is whether there is highorder freepage in buddy
      list of other migratetype in order to steal it without fragmentation.
      But, current code just checks cc->order which means allocation request
      order.  So, this is wrong.
      
      Without this fix, non-movable synchronous compaction below pageblock order
      would not stopped until compaction is complete, because migratetype of
      most pageblocks are movable and high order freepage made by compaction is
      usually on movable type buddy list.
      
      There is some report related to this bug. See below link.
      
        http://www.spinics.net/lists/linux-mm/msg81666.html
      
      Although the issued system still has load spike comes from compaction,
      this makes that system completely stable and responsive according to his
      report.
      
      stress-highalloc test in mmtests with non movable order 7 allocation
      doesn't show any notable difference in allocation success rate, but, it
      shows more compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      18.47 : 28.94
      
      Fixes: 1fb3f8ca ("mm: compaction: capture a suitable high-order page immediately when it is made available")
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.7+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      372549c2
  12. 12 2月, 2015 5 次提交
  13. 11 12月, 2014 5 次提交
    • V
      mm, compaction: more focused lru and pcplists draining · fdaf7f5c
      Vlastimil Babka 提交于
      The goal of memory compaction is to create high-order freepages through
      page migration.  Page migration however puts pages on the per-cpu lru_add
      cache, which is later flushed to per-cpu pcplists, and only after pcplists
      are drained the pages can actually merge.  This can happen due to the
      per-cpu caches becoming full through further freeing, or explicitly.
      
      During direct compaction, it is useful to do the draining explicitly so
      that pages merge as soon as possible and compaction can detect success
      immediately and keep the latency impact at minimum.  However the current
      implementation is far from ideal.  Draining is done only in
      __alloc_pages_direct_compact(), after all zones were already compacted,
      and the decisions to continue or stop compaction in individual zones was
      done without the last batch of migrations being merged.  It is also
      missing the draining of lru_add cache before the pcplists.
      
      This patch moves the draining for direct compaction into compact_zone().
      It adds the missing lru_cache draining and uses the newly introduced
      single zone pcplists draining to reduce overhead and avoid impact on
      unrelated zones.  Draining is only performed when it can actually lead to
      merging of a page of desired order (passed by cc->order).  This means it
      is only done when migration occurred in the previously scanned cc->order
      aligned block(s) and the migration scanner is now pointing to the next
      cc->order aligned block.
      
      The patch has been tested with stress-highalloc benchmark from mmtests.
      Although overal allocation success rates of the benchmark were not
      affected, the number of detected compaction successes has doubled.  This
      suggests that allocations were previously successful due to implicit
      merging caused by background activity, making a later allocation attempt
      succeed immediately, but not attributing the success to compaction.  Since
      stress-highalloc always tries to allocate almost the whole memory, it
      cannot show the improvement in its reported success rate metric.  However
      after this patch, compaction should detect success and terminate earlier,
      reducing the direct compaction latencies in a real scenario.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fdaf7f5c
    • V
      mm, compaction: always update cached scanner positions · 6bace090
      Vlastimil Babka 提交于
      Compaction caches the migration and free scanner positions between
      compaction invocations, so that the whole zone gets eventually scanned and
      there is no bias towards the initial scanner positions at the
      beginning/end of the zone.
      
      The cached positions are continuously updated as scanners progress and the
      updating stops as soon as a page is successfully isolated.  The reasoning
      behind this is that a pageblock where isolation succeeded is likely to
      succeed again in near future and it should be worth revisiting it.
      
      However, the downside is that potentially many pages are rescanned without
      successful isolation.  At worst, there might be a page where isolation
      from LRU succeeds but migration fails (potentially always).  So upon
      encountering this page, cached position would always stop being updated
      for no good reason.  It might have been useful to let such page be
      rescanned with sync compaction after async one failed, but this is now
      handled by caching scanner position for async and sync mode separately
      since commit 35979ef3 ("mm, compaction: add per-zone migration pfn
      cache for async compaction").
      
      After this patch, cached positions are updated unconditionally.  In
      stress-highalloc benchmark, this has decreased the numbers of scanned
      pages by few percent, without affecting allocation success rates.
      
      To prevent free scanner from leaving free pages behind after they are
      returned due to page migration failure, the cached scanner pfn is changed
      to point to the pageblock of the returned free page with the highest pfn,
      before leaving compact_zone().
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bace090
    • V
      mm, compaction: defer only on COMPACT_COMPLETE · f8669795
      Vlastimil Babka 提交于
      Deferred compaction is employed to avoid compacting zone where sync direct
      compaction has recently failed.  As such, it makes sense to only defer
      when a full zone was scanned, which is when compact_zone returns with
      COMPACT_COMPLETE.  It's less useful to defer when compact_zone returns
      with apparent success (COMPACT_PARTIAL), followed by a watermark check
      failure, which can happen due to parallel allocation activity.  It also
      does not make much sense to defer compaction which was completely skipped
      (COMPACT_SKIP) for being unsuitable in the first place.
      
      This patch therefore makes deferred compaction trigger only when
      COMPACT_COMPLETE is returned from compact_zone().  Results of
      stress-highalloc becnmark show the difference is within measurement error,
      so the issue is rather cosmetic.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8669795
    • V
      mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka 提交于
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for DMA zone compaction often appeared to succeed
      while subsequent allocation attempt would not, due to different outcome
      of watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The large source of mismatch between watermark check in compaction and
      allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • V
      mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka 提交于
      Compaction relies on zone watermark checks for decisions such as if it's
      worth to start compacting in compaction_suitable() or whether compaction
      should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, including the direct compaction which is invoked to satisfy
      the allocation request, and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue compaction_suitable() is that the check for "does the zone
      need compaction at all?" comes only after the check "does the zone have
      enough free free pages to succeed compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could.  This can be fixed later, if it's shown to be an issue.
      
      Additionaly the checks in compact_suitable() are reordered to address the
      second issue described above.
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebff3980