1. 12 5月, 2015 1 次提交
    • A
      mm/net: Rename and move page fragment handling from net/ to mm/ · b63ae8ca
      Alexander Duyck 提交于
      This change moves the __alloc_page_frag functionality out of the networking
      stack and into the page allocation portion of mm.  The idea it so help make
      this maintainable by placing it with other page allocation functions.
      
      Since we are moving it from skbuff.c to page_alloc.c I have also renamed
      the basic defines and structure from netdev_alloc_cache to page_frag_cache
      to reflect that this is now part of a different kernel subsystem.
      
      I have also added a simple __free_page_frag function which can handle
      freeing the frags based on the skb->head pointer.  The model for this is
      based off of __free_pages since we don't actually need to deal with all of
      the cases that put_page handles.  I incorporated the virt_to_head_page call
      and compound_order into the function as it actually allows for a signficant
      size reduction by reducing code duplication.
      Signed-off-by: NAlexander Duyck <alexander.h.duyck@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b63ae8ca
  2. 16 4月, 2015 1 次提交
  3. 15 4月, 2015 7 次提交
    • Y
      42ff2703
    • D
      mm: remove GFP_THISNODE · 4167e9b2
      David Rientjes 提交于
      NOTE: this is not about __GFP_THISNODE, this is only about GFP_THISNODE.
      
      GFP_THISNODE is a secret combination of gfp bits that have different
      behavior than expected.  It is a combination of __GFP_THISNODE,
      __GFP_NORETRY, and __GFP_NOWARN and is special-cased in the page
      allocator slowpath to fail without trying reclaim even though it may be
      used in combination with __GFP_WAIT.
      
      An example of the problem this creates: commit e97ca8e5 ("mm: fix
      GFP_THISNODE callers and clarify") fixed up many users of GFP_THISNODE
      that really just wanted __GFP_THISNODE.  The problem doesn't end there,
      however, because even it was a no-op for alloc_misplaced_dst_page(),
      which also sets __GFP_NORETRY and __GFP_NOWARN, and
      migrate_misplaced_transhuge_page(), where __GFP_NORETRY and __GFP_NOWAIT
      is set in GFP_TRANSHUGE.  Converting GFP_THISNODE to __GFP_THISNODE is a
      no-op in these cases since the page allocator special-cases
      __GFP_THISNODE && __GFP_NORETRY && __GFP_NOWARN.
      
      It's time to just remove GFP_THISNODE entirely.  We leave __GFP_THISNODE
      to restrict an allocation to a local node, but remove GFP_THISNODE and
      its obscurity.  Instead, we require that a caller clear __GFP_WAIT if it
      wants to avoid reclaim.
      
      This allows the aforementioned functions to actually reclaim as they
      should.  It also enables any future callers that want to do
      __GFP_THISNODE but also __GFP_NORETRY && __GFP_NOWARN to reclaim.  The
      rule is simple: if you don't want to reclaim, then don't set __GFP_WAIT.
      
      Aside: ovs_flow_stats_update() really wants to avoid reclaim as well, so
      it is unchanged.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NPekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Pravin Shelar <pshelar@nicira.com>
      Cc: Jarno Rajahalme <jrajahalme@nicira.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4167e9b2
    • K
      mm: completely remove dumping per-cpu lists from show_mem() · 761b0677
      Konstantin Khlebnikov 提交于
      It seems nobody needs this.
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      761b0677
    • K
      mm: hide per-cpu lists in output of show_mem() · d1bfcdb8
      Konstantin Khlebnikov 提交于
      This makes show_mem() much less verbose on huge machines.  Instead of huge
      and almost useless dump of counters for each per-zone per-cpu lists this
      patch prints the sum of these counters for each zone (free_pcp) and size
      of per-cpu list for current cpu (local_pcp).
      
      The filter flag SHOW_MEM_PERCPU_LISTS reverts to the old verbose mode.
      
      [akpm@linux-foundation.org: update show_free_areas comment]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d1bfcdb8
    • J
      mm/compaction: enhance compaction finish condition · 2149cdae
      Joonsoo Kim 提交于
      Compaction has anti fragmentation algorithm.  It is that freepage should
      be more than pageblock order to finish the compaction if we don't find any
      freepage in requested migratetype buddy list.  This is for mitigating
      fragmentation, but, there is a lack of migratetype consideration and it is
      too excessive compared to page allocator's anti fragmentation algorithm.
      
      Not considering migratetype would cause premature finish of compaction.
      For example, if allocation request is for unmovable migratetype, freepage
      with CMA migratetype doesn't help that allocation and compaction should
      not be stopped.  But, current logic regards this situation as compaction
      is no longer needed, so finish the compaction.
      
      Secondly, condition is too excessive compared to page allocator's logic.
      We can steal freepage from other migratetype and change pageblock
      migratetype on more relaxed conditions in page allocator.  This is
      designed to prevent fragmentation and we can use it here.  Imposing hard
      constraint only to the compaction doesn't help much in this case since
      page allocator would cause fragmentation again.
      
      To solve these problems, this patch borrows anti fragmentation logic from
      page allocator.  It will reduce premature compaction finish in some cases
      and reduce excessive compaction work.
      
      stress-highalloc test in mmtests with non movable order 7 allocation shows
      considerable increase of compaction success rate.
      
      Compaction success rate (Compaction success * 100 / Compaction stalls, %)
      31.82 : 42.20
      
      I tested it on non-reboot 5 runs stress-highalloc benchmark and found that
      there is no more degradation on allocation success rate than before.  That
      roughly means that this patch doesn't result in more fragmentations.
      
      Vlastimil suggests additional idea that we only test for fallbacks when
      migration scanner has scanned a whole pageblock.  It looked good for
      fragmentation because chance of stealing increase due to making more free
      pages in certain pageblock.  So, I tested it, but, it results in decreased
      compaction success rate, roughly 38.00.  I guess the reason that if system
      is low memory condition, watermark check could be failed due to not enough
      order 0 free page and so, sometimes, we can't reach a fallback check
      although migrate_pfn is aligned to pageblock_nr_pages.  I can insert code
      to cope with this situation but it makes code more complicated so I don't
      include his idea at this patch.
      
      [akpm@linux-foundation.org: fix CONFIG_CMA=n build]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2149cdae
    • J
      mm/page_alloc: factor out fallback freepage checking · 4eb7dce6
      Joonsoo Kim 提交于
      This is preparation step to use page allocator's anti fragmentation logic
      in compaction.  This patch just separates fallback freepage checking part
      from fallback freepage management part.  Therefore, there is no functional
      change.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eb7dce6
    • J
      mm/cma: change fallback behaviour for CMA freepage · dc67647b
      Joonsoo Kim 提交于
      Freepage with MIGRATE_CMA can be used only for MIGRATE_MOVABLE and they
      should not be expanded to other migratetype buddy list to protect them
      from unmovable/reclaimable allocation.  Implementing these requirements in
      __rmqueue_fallback(), that is, finding largest possible block of freepage
      has bad effect that high order freepage with MIGRATE_CMA are broken
      continually although there are suitable order CMA freepage.  Reason is
      that they are not be expanded to other migratetype buddy list and next
      __rmqueue_fallback() invocation try to finds another largest block of
      freepage and break it again.  So, MIGRATE_CMA fallback should be handled
      separately.  This patch introduces __rmqueue_cma_fallback(), that just
      wrapper of __rmqueue_smallest() and call it before __rmqueue_fallback() if
      migratetype == MIGRATE_MOVABLE.
      
      This results in unintended behaviour change that MIGRATE_CMA freepage is
      always used first rather than other migratetype as movable allocation's
      fallback.  But, as already mentioned above, MIGRATE_CMA can be used only
      for MIGRATE_MOVABLE, so it is better to use MIGRATE_CMA freepage first as
      much as possible.  Otherwise, we needlessly take up precious freepages
      with other migratetype and increase chance of fragmentation.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc67647b
  4. 13 3月, 2015 1 次提交
    • M
      mm, oom: do not fail __GFP_NOFAIL allocation if oom killer is disabled · e009d5dc
      Michal Hocko 提交于
      Tetsuo Handa has pointed out that __GFP_NOFAIL allocations might fail
      after OOM killer is disabled if the allocation is performed by a kernel
      thread.  This behavior was introduced from the very beginning by
      7f33d49a ("mm, PM/Freezer: Disable OOM killer when tasks are frozen").
       This means that the basic contract for the allocation request is broken
      and the context requesting such an allocation might blow up unexpectedly.
      
      There are basically two ways forward.
      
      1) move oom_killer_disable after kernel threads are frozen.  This has a
         risk that the OOM victim wouldn't be able to finish because it would
         depend on an already frozen kernel thread.  This would be really tricky
         to debug.
      
      2) do not fail GFP_NOFAIL allocation no matter what and risk a
         potential Freezable kernel threads will loop and fail the suspend.
         Incidental allocations after kernel threads are frozen will at least
         dump a warning - if we are lucky and the serial console is still active
         of course...
      
      This patch implements the later option because it is safer.  We would see
      warning rather than allocation failures for the kernel threads which would
      blow up otherwise and have a higher chances to identify __GFP_NOFAIL users
      from deeper pm code.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@gooogle.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e009d5dc
  5. 01 3月, 2015 1 次提交
  6. 14 2月, 2015 1 次提交
    • A
      mm: page_alloc: add kasan hooks on alloc and free paths · b8c73fc2
      Andrey Ryabinin 提交于
      Add kernel address sanitizer hooks to mark allocated page's addresses as
      accessible in corresponding shadow region.  Mark freed pages as
      inaccessible.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8c73fc2
  7. 13 2月, 2015 2 次提交
  8. 12 2月, 2015 12 次提交
    • V
      mm: more aggressive page stealing for UNMOVABLE allocations · 9c0415eb
      Vlastimil Babka 提交于
      When allocation falls back to stealing free pages of another migratetype,
      it can decide to steal extra pages, or even the whole pageblock in order
      to reduce fragmentation, which could happen if further allocation
      fallbacks pick a different pageblock.  In try_to_steal_freepages(), one of
      the situations where extra pages are stolen happens when we are trying to
      allocate a MIGRATE_RECLAIMABLE page.
      
      However, MIGRATE_UNMOVABLE allocations are not treated the same way,
      although spreading such allocation over multiple fallback pageblocks is
      arguably even worse than it is for RECLAIMABLE allocations.  To minimize
      fragmentation, we should minimize the number of such fallbacks, and thus
      steal as much as is possible from each fallback pageblock.
      
      Note that in theory this might put more pressure on movable pageblocks and
      cause movable allocations to steal back from unmovable pageblocks.
      However, movable allocations are not as aggressive with stealing, and do
      not cause permanent fragmentation, so the tradeoff is reasonable, and
      evaluation seems to support the change.
      
      This patch thus adds a check for MIGRATE_UNMOVABLE to the decision to
      steal extra free pages.  When evaluating with stress-highalloc from
      mmtests, this has reduced the number of MIGRATE_UNMOVABLE fallbacks to
      roughly 1/6.  The number of these fallbacks stealing from MIGRATE_MOVABLE
      block is reduced to 1/3.  There was no observation of growing number of
      unmovable pageblocks over time, and also not of increased movable
      allocation fallbacks.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c0415eb
    • V
      mm: always steal split buddies in fallback allocations · 3a1086fb
      Vlastimil Babka 提交于
      When allocation falls back to another migratetype, it will steal a page
      with highest available order, and (depending on this order and desired
      migratetype), it might also steal the rest of free pages from the same
      pageblock.
      
      Given the preference of highest available order, it is likely that it will
      be higher than the desired order, and result in the stolen buddy page
      being split.  The remaining pages after split are currently stolen only
      when the rest of the free pages are stolen.  This can however lead to
      situations where for MOVABLE allocations we split e.g.  order-4 fallback
      UNMOVABLE page, but steal only order-0 page.  Then on the next MOVABLE
      allocation (which may be batched to fill the pcplists) we split another
      order-3 or higher page, etc.  By stealing all pages that we have split, we
      can avoid further stealing.
      
      This patch therefore adjusts the page stealing so that buddy pages created
      by split are always stolen.  This has effect only on MOVABLE allocations,
      as RECLAIMABLE and UNMOVABLE allocations already always do that in
      addition to stealing the rest of free pages from the pageblock.  The
      change also allows to simplify try_to_steal_freepages() and factor out CMA
      handling.
      
      According to Mel, it has been intended since the beginning that buddy
      pages after split would be stolen always, but it doesn't seem like it was
      ever the case until commit 47118af0 ("mm: mmzone: MIGRATE_CMA
      migration type added").  The commit has unintentionally introduced this
      behavior, but was reverted by commit 0cbef29a ("mm:
      __rmqueue_fallback() should respect pageblock type").  Neither included
      evaluation.
      
      My evaluation with stress-highalloc from mmtests shows about 2.5x
      reduction of page stealing events for MOVABLE allocations, without
      affecting the page stealing events for other allocation migratetypes.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a1086fb
    • V
      mm: when stealing freepages, also take pages created by splitting buddy page · 99592d59
      Vlastimil Babka 提交于
      When studying page stealing, I noticed some weird looking decisions in
      try_to_steal_freepages().  The first I assume is a bug (Patch 1), the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when UNMOVABLE allocation steals from MOVABLE pageblock.
      RECLAIMABLE allocation stealing from MOVABLE allocation is also not ideal,
      so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always clear win and Joonsoo questioned the
      results.  Here I used different baseline which includes RFC compaction
      improvements from [1].  I found that the compaction improvements reduce
      variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  First
      column is after fresh reboot, other two are reiterations of test without
      reboot.  That was all accumulater over 5 re-iterations (so the benchmark
      was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we've seen apparent regression with Patch 1 for unmovable events,
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      patch 1 and then either the same modulo noise, or perhaps sligtly worse -
      a small price for unmovable improvements, IMHO.  The number of movable
      allocations falling back to other migratetypes is most noisy, but it's
      reduced to half at Patch 2 nevertheless.  These are least critical as
      compaction can move them around.
      
      If we look at success rates, the patches don't affect them, that didn't change.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider reduced fragmentation events
      to be worth on its own.  Patch 2 also seems to reduce scanning for free
      pages, and migrations in compaction, suggesting it has somewhat less work
      to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given how [1] offers lower improvements
      in this scenario due to less restarts after deferred compaction which
      would change compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for first iteration are clear, the rest is much noisier
      and can appear like regression for Patch 1.  Anyway, patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, decision for 1) has implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it makes thus sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.13+ containing 0cbef29a]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99592d59
    • M
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko 提交于
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely and partial solution deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disable all the further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reacheable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Suggested-by: NTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • J
      mm: use correct format specifiers when printing address ranges · 8d29e18a
      Juergen Gross 提交于
      Especially on 32 bit kernels memory node ranges are printed with 32 bit
      wide addresses only.  Use u64 types and %llx specifiers to print full
      width of addresses.
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d29e18a
    • K
      mm: more checks on free_pages_prepare() for tail pages · 81422f29
      Kirill A. Shutemov 提交于
      Although it was not called, destroy_compound_page() did some potentially
      useful checks.  Let's re-introduce them in free_pages_prepare(), where
      they can be actually triggered when CONFIG_DEBUG_VM=y.
      
      compound_order() assert is already in free_pages_prepare().  We have few
      checks for tail pages left.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81422f29
    • K
      mm/page_alloc.c: drop dead destroy_compound_page() · 6e9f0d58
      Kirill A. Shutemov 提交于
      The only caller is __free_one_page(). By the time we should have
      page->flags to be cleared already:
      
       - for 0-order pages though PCP list:
      	free_hot_cold_page()
      		free_pages_prepare()
      			free_pages_check()
      				page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
      		<put the page to PCP list>
      
      	free_pcppages_bulk()
      		page = <withdraw pages from PCP list>
      		__free_one_page(page)
      
       - for non-0-order pages:
      	__free_pages_ok()
      		free_pages_prepare()
      			free_pages_check()
      				page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
      		free_one_page()
      			__free_one_page()
      
      So there's no way PageCompound() will return true in __free_one_page().
      Let's remove dead destroy_compound_page() and put assert for page->flags
      there instead.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e9f0d58
    • V
      mm: reduce try_to_compact_pages parameters · 1a6d53a1
      Vlastimil Babka 提交于
      Expand the usage of the struct alloc_context introduced in the previous
      patch also for calling try_to_compact_pages(), to reduce the number of its
      parameters.  Since the function is in different compilation unit, we need
      to move alloc_context definition in the shared mm/internal.h header.
      
      With this change we get simpler code and small savings of code size and stack
      usage:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
      function                                     old     new   delta
      __alloc_pages_direct_compact                 283     256     -27
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
      function                                     old     new   delta
      try_to_compact_pages                         582     569     -13
      
      Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
      scripts/checkstack.pl).
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a6d53a1
    • V
      mm, page_alloc: reduce number of alloc_pages* functions' parameters · a9263751
      Vlastimil Babka 提交于
      Introduce struct alloc_context to accumulate the numerous parameters
      passed between the alloc_pages* family of functions and
      get_page_from_freelist().  This excludes gfp_flags and alloc_info, which
      mutate too much along the way, and allocation order, which is conceptually
      different.
      
      The result is shorter function signatures, as well as overal code size and
      stack usage reductions.
      
      bloat-o-meter:
      
      add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
      function                                     old     new   delta
      get_page_from_freelist                      2525    2652    +127
      __alloc_pages_direct_compact                 329     283     -46
      __alloc_pages_nodemask                      2564    2300    -264
      
      checkstack.pl:
      
      function                            old    new
      __alloc_pages_nodemask              248    200
      get_page_from_freelist              168    184
      __alloc_pages_direct_compact         40     24
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9263751
    • V
      mm: set page->pfmemalloc in prep_new_page() · 75379191
      Vlastimil Babka 提交于
      The possibility of replacing the numerous parameters of alloc_pages*
      functions with a single structure has been discussed when Minchan proposed
      to expand the x86 kernel stack [1].  This series implements the change,
      along with few more cleanups/microoptimizations.
      
      The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
      openSUSE 13.2 for compiling.  Config includess NUMA and COMPACTION.
      
      The core change is the introduction of a new struct alloc_context, which looks
      like this:
      
      struct alloc_context {
              struct zonelist *zonelist;
              nodemask_t *nodemask;
              struct zone *preferred_zone;
              int classzone_idx;
              int migratetype;
              enum zone_type high_zoneidx;
      };
      
      All the contents is mostly constant, except that __alloc_pages_slowpath()
      changes preferred_zone, classzone_idx and potentially zonelist.  But
      that's not a problem in case control returns to retry_cpuset: in
      __alloc_pages_nodemask(), those will be reset to initial values again
      (although it's a bit subtle).  On the other hand, gfp_flags and alloc_info
      mutate so much that it doesn't make sense to put them into alloc_context.
      Still, the result is one parameter instead of up to 7.  This is all in
      Patch 2.
      
      Patch 3 is a step to expand alloc_context usage out of page_alloc.c
      itself.  The function try_to_compact_pages() can also much benefit from
      the parameter reduction, but it means the struct definition has to be
      moved to a shared header.
      
      Patch 1 should IMHO be included even if the rest is deemed not useful
      enough.  It improves maintainability and also has some code/stack
      reduction.  Patch 4 is OTOH a tiny optimization.
      
      Overall bloat-o-meter results:
      
      add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_direct_compact                 329     256     -73
      get_page_from_freelist                      2670    2576     -94
      __alloc_pages_nodemask                      2564    2285    -279
      try_to_compact_pages                         582     579      -3
      
      Overall stack sizes per ./scripts/checkstack.pl:
      
                                old   new delta
      get_page_from_freelist:   184   184     0
      __alloc_pages_nodemask    248   200   -48
      __alloc_pages_direct_c     40     -   -40
      try_to_compact_pages       72    72     0
                                            -88
      
      [1] http://marc.info/?l=linux-mm&m=140142462528257&w=2
      
      This patch (of 4):
      
      prep_new_page() sets almost everything in the struct page of the page
      being allocated, except page->pfmemalloc.  This is not obvious and has at
      least once led to a bug where page->pfmemalloc was forgotten to be set
      correctly, see commit 8fb74b9f ("mm: compaction: partially revert
      capture of suitable high-order page").
      
      This patch moves the pfmemalloc setting to prep_new_page(), which means it
      needs to gain alloc_flags parameter.  The call to prep_new_page is moved
      from buffered_rmqueue() to get_page_from_freelist(), which also leads to
      simpler code.  An obsolete comment for buffered_rmqueue() is replaced.
      
      In addition to better maintainability there is a small reduction of code
      and stack usage for get_page_from_freelist(), which inlines the other
      functions involved.
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
      function                                     old     new   delta
      get_page_from_freelist                      2670    2525    -145
      
      Stack usage is reduced from 184 to 168 bytes.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75379191
    • X
      kmemcheck: move hook into __alloc_pages_nodemask() for the page allocator · 23f086f9
      Xishi Qiu 提交于
      Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
      __alloc_pages_nodemask()
      	__alloc_pages_slowpath()
      		kmemcheck_pagealloc_alloc()
      
      And the page will not be tracked by kmemcheck in the following path.
      __alloc_pages_nodemask()
      	get_page_from_freelist()
      
      So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
      like this:
      __alloc_pages_nodemask()
      	...
      	get_page_from_freelist()
      	if (!page)
      		__alloc_pages_slowpath()
      	kmemcheck_pagealloc_alloc()
      	...
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23f086f9
    • A
      mm/page_alloc.c:__alloc_pages_nodemask(): don't alter arg gfp_mask · 91fbdc0f
      Andrew Morton 提交于
      __alloc_pages_nodemask() strips __GFP_IO when retrying the page
      allocation.  But it does this by altering the function-wide variable
      gfp_mask.  This will cause subsequent allocation attempts to inadvertently
      use the modified gfp_mask.
      
      Also, pass the correct mask (the mask we actually used) into
      trace_mm_page_alloc().
      
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      91fbdc0f
  9. 11 2月, 2015 1 次提交
    • W
      mm/page_alloc.c: place zone_id check before VM_BUG_ON_PAGE check · 4c5018ce
      Weijie Yang 提交于
      If the freeing page and its buddy page are not at the same zone, the
      current holding zone->lock for the freeing page cann't prevent buddy page
      getting allocated, this could trigger VM_BUG_ON_PAGE in page_is_buddy() at
      a very tiny chance, such as:
      
      cpu 0:						cpu 1:
      hold zone_1 lock
      check page and it buddy
      PageBuddy(buddy) is true			hold zone_2 lock
      page_order(buddy) == order is true		alloc buddy
      trigger VM_BUG_ON_PAGE(page_count(buddy) != 0)
      
      zone_1->lock prevents the freeing page getting allocated
      zone_2->lock prevents the buddy page getting allocated
      they are not the same zone->lock.
      
      If we can't remove the zone_id check statement, it's better handle this
      rare race.  This patch fixes this by placing the zone_id check before the
      VM_BUG_ON_PAGE check.
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c5018ce
  10. 27 1月, 2015 1 次提交
    • J
      mm: page_alloc: embed OOM killing naturally into allocation slowpath · 9879de73
      Johannes Weiner 提交于
      The OOM killing invocation does a lot of duplicative checks against the
      task's allocation context.  Rework it to take advantage of the existing
      checks in the allocator slowpath.
      
      The OOM killer is invoked when the allocator is unable to reclaim any
      pages but the allocation has to keep looping.  Instead of having a check
      for __GFP_NORETRY hidden in oom_gfp_allowed(), just move the OOM
      invocation to the true branch of should_alloc_retry().  The __GFP_FS
      check from oom_gfp_allowed() can then be moved into the OOM avoidance
      branch in __alloc_pages_may_oom(), along with the PF_DUMPCORE test.
      
      __alloc_pages_may_oom() can then signal to the caller whether the OOM
      killer was invoked, instead of requiring it to duplicate the order and
      high_zoneidx checks to guess this when deciding whether to continue.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9879de73
  11. 19 12月, 2014 1 次提交
    • P
      mm: cma: split cma-reserved in dmesg log · e48322ab
      Pintu Kumar 提交于
      When the system boots up, in the dmesg logs we can see the memory
      statistics along with total reserved as below.  Memory: 458840k/458840k
      available, 65448k reserved, 0K highmem
      
      When CMA is enabled, still the total reserved memory remains the same.
      However, the CMA memory is not considered as reserved.  But, when we see
      /proc/meminfo, the CMA memory is part of free memory.  This creates
      confusion.  This patch corrects the problem by properly subtracting the
      CMA reserved memory from the total reserved memory in dmesg logs.
      
      Below is the dmesg snapshot from an arm based device with 512MB RAM and
      12MB single CMA region.
      
      Before this change:
        Memory: 458840k/458840k available, 65448k reserved, 0K highmem
      
      After this change:
        Memory: 458840k/458840k available, 53160k reserved, 12288k cma-reserved, 0K highmem
      Signed-off-by: NPintu Kumar <pintu.k@samsung.com>
      Signed-off-by: NVishnu Pratap Singh <vishnu.ps@samsung.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e48322ab
  12. 14 12月, 2014 7 次提交
    • Z
      mm: remove the highmem zones' memmap in the highmem zone · ba914f48
      Zhong Hongbo 提交于
      Since 01cefaef ("mm: provide more accurate estimation
      of pages occupied by memmap") allocate the pages from lowmem for the
      highmem zones' memmap. So It is not need to reserver the memmap's for
      the highmem.
      
      A 2G DDR3 for the arm platform:
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80ccd380, node_mem_map 80d38000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 528 pages used for memmap
        HighMem zone: 67584 pages, LIFO batch:15
      
      On node 0 totalpages: 524288
      free_area_init_node: node 0, pgdat 80cd6f40, node_mem_map 80d42000
        DMA zone: 3568 pages used for memmap
        DMA zone: 0 pages reserved
        DMA zone: 456704 pages, LIFO batch:31
        HighMem zone: 67584 pages, LIFO batch:15
      Signed-off-by: NHongbo Zhong <hongbo.zhong@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba914f48
    • J
      mm: vmscan: invoke slab shrinkers from shrink_zone() · 6b4f7799
      Johannes Weiner 提交于
      The slab shrinkers are currently invoked from the zonelist walkers in
      kswapd, direct reclaim, and zone reclaim, all of which roughly gauge the
      eligible LRU pages and assemble a nodemask to pass to NUMA-aware
      shrinkers, which then again have to walk over the nodemask.  This is
      redundant code, extra runtime work, and fairly inaccurate when it comes to
      the estimation of actually scannable LRU pages.  The code duplication will
      only get worse when making the shrinkers cgroup-aware and requiring them
      to have out-of-band cgroup hierarchy walks as well.
      
      Instead, invoke the shrinkers from shrink_zone(), which is where all
      reclaimers end up, to avoid this duplication.
      
      Take the count for eligible LRU pages out of get_scan_count(), which
      considers many more factors than just the availability of swap space, like
      zone_reclaimable_pages() currently does.  Accumulate the number over all
      visited lruvecs to get the per-zone value.
      
      Some nodes have multiple zones due to memory addressing restrictions.  To
      avoid putting too much pressure on the shrinkers, only invoke them once
      for each such node, using the class zone of the allocation as the pivot
      zone.
      
      For now, this integrates the slab shrinking better into the reclaim logic
      and gets rid of duplicative invocations from kswapd, direct reclaim, and
      zone reclaim.  It also prepares for cgroup-awareness, allowing
      memcg-capable shrinkers to be added at the lruvec level without much
      duplication of both code and runtime work.
      
      This changes kswapd behavior, which used to invoke the shrinkers for each
      zone, but with scan ratios gathered from the entire node, resulting in
      meaningless pressure quantities on multi-zone nodes.
      
      Zone reclaim behavior also changes.  It used to shrink slabs until the
      same amount of pages were shrunk as were reclaimed from the LRUs.  Now it
      merely invokes the shrinkers once with the zone's scan ratio, which makes
      the shrinkers go easier on caches that implement aging and would prefer
      feeding back pressure from recently used slab objects to unused LRU pages.
      
      [vdavydov@parallels.com: assure class zone is populated]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b4f7799
    • J
      mm/page_owner: keep track of page owners · 48c96a36
      Joonsoo Kim 提交于
      This is the page owner tracking code which is introduced so far ago.  It
      is resident on Andrew's tree, though, nobody tried to upstream so it
      remain as is.  Our company uses this feature actively to debug memory leak
      or to find a memory hogger so I decide to upstream this feature.
      
      This functionality help us to know who allocates the page.  When
      allocating a page, we store some information about allocation in extra
      memory.  Later, if we need to know status of all pages, we can get and
      analyze it from this stored information.
      
      In previous version of this feature, extra memory is statically defined in
      struct page, but, in this version, extra memory is allocated outside of
      struct page.  It enables us to turn on/off this feature at boottime
      without considerable memory waste.
      
      Although we already have tracepoint for tracing page allocation/free,
      using it to analyze page owner is rather complex.  We need to enlarge the
      trace buffer for preventing overlapping until userspace program launched.
      And, launched program continually dump out the trace buffer for later
      analysis and it would change system behaviour with more possibility rather
      than just keeping it in memory, so bad for debug.
      
      Moreover, we can use page_owner feature further for various purposes.  For
      example, we can use it for fragmentation statistics implemented in this
      patch.  And, I also plan to implement some CMA failure debugging feature
      using this interface.
      
      I'd like to give the credit for all developers contributed this feature,
      but, it's not easy because I don't know exact history.  Sorry about that.
      Below is people who has "Signed-off-by" in the patches in Andrew's tree.
      
      Contributor:
      Alexander Nyberg <alexn@dsv.su.se>
      Mel Gorman <mgorman@suse.de>
      Dave Hansen <dave@linux.vnet.ibm.com>
      Minchan Kim <minchan@kernel.org>
      Michal Nazarewicz <mina86@mina86.com>
      Andrew Morton <akpm@linux-foundation.org>
      Jungsoo Son <jungsoo.son@lge.com>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48c96a36
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim 提交于
      Now, we have prepared to avoid using debug-pagealloc in boottime.  So
      introduce new kernel-parameter to disable debug-pagealloc in boottime, and
      makes related functions to be disabled in this case.
      
      Only non-intuitive part is change of guard page functions.  Because guard
      page is effective only if debug-pagealloc is enabled, turning off
      according to debug-pagealloc is reasonable thing to do.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      031bc574
    • J
      mm/debug-pagealloc: prepare boottime configurable on/off · e30825f1
      Joonsoo Kim 提交于
      Until now, debug-pagealloc needs extra flags in struct page, so we need to
      recompile whole source code when we decide to use it.  This is really
      painful, because it takes some time to recompile and sometimes rebuild is
      not possible due to third party module depending on struct page.  So, we
      can't use this good feature in many cases.
      
      Now, we have the page extension feature that allows us to insert extra
      flags to outside of struct page.  This gets rid of third party module
      issue mentioned above.  And, this allows us to determine if we need extra
      memory for this page extension in boottime.  With these property, we can
      avoid using debug-pagealloc in boottime with low computational overhead in
      the kernel built with CONFIG_DEBUG_PAGEALLOC.  This will help our
      development process greatly.
      
      This patch is the preparation step to achive above goal.  debug-pagealloc
      originally uses extra field of struct page, but, after this patch, it will
      use field of struct page_ext.  Because memory for page_ext is allocated
      later than initialization of page allocator in CONFIG_SPARSEMEM, we should
      disable debug-pagealloc feature temporarily until initialization of
      page_ext.  This patch implements this.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e30825f1
    • J
      mm/page_ext: resurrect struct page extending code for debugging · eefa864b
      Joonsoo Kim 提交于
      When we debug something, we'd like to insert some information to every
      page.  For this purpose, we sometimes modify struct page itself.  But,
      this has drawbacks.  First, it requires re-compile.  This makes us
      hesitate to use the powerful debug feature so development process is
      slowed down.  And, second, sometimes it is impossible to rebuild the
      kernel due to third party module dependency.  At third, system behaviour
      would be largely different after re-compile, because it changes size of
      struct page greatly and this structure is accessed by every part of
      kernel.  Keeping this as it is would be better to reproduce errornous
      situation.
      
      This feature is intended to overcome above mentioned problems.  This
      feature allocates memory for extended data per page in certain place
      rather than the struct page itself.  This memory can be accessed by the
      accessor functions provided by this code.  During the boot process, it
      checks whether allocation of huge chunk of memory is needed or not.  If
      not, it avoids allocating memory at all.  With this advantage, we can
      include this feature into the kernel in default and can avoid rebuild and
      solve related problems.
      
      Until now, memcg uses this technique.  But, now, memcg decides to embed
      their variable to struct page itself and it's code to extend struct page
      has been removed.  I'd like to use this code to develop debug feature, so
      this patch resurrect it.
      
      To help these things to work well, this patch introduces two callbacks for
      clients.  One is the need callback which is mandatory if user wants to
      avoid useless memory allocation at boot-time.  The other is optional, init
      callback, which is used to do proper initialization after memory is
      allocated.  Detailed explanation about purpose of these functions is in
      code comment.  Please refer it.
      
      Others are completely same with previous extension code in memcg.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eefa864b
    • J
      mm/debug-pagealloc: cleanup page guard code · 2847cf95
      Joonsoo Kim 提交于
      Page guard is used by debug-pagealloc feature.  Currently, it is
      open-coded, but, I think that more abstraction of it makes core page
      allocator code more readable.
      
      There is no functional difference.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Gioh Kim <gioh.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2847cf95
  13. 11 12月, 2014 4 次提交
    • J
      mm: move page->mem_cgroup bad page handling into generic code · 9edad6ea
      Johannes Weiner 提交于
      Now that the external page_cgroup data structure and its lookup is
      gone, let the generic bad_page() check for page->mem_cgroup sanity.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9edad6ea
    • J
      mm: embed the memcg pointer directly into struct page · 1306a85a
      Johannes Weiner 提交于
      Memory cgroups used to have 5 per-page pointers.  To allow users to
      disable that amount of overhead during runtime, those pointers were
      allocated in a separate array, with a translation layer between them and
      struct page.
      
      There is now only one page pointer remaining: the memcg pointer, that
      indicates which cgroup the page is associated with when charged.  The
      complexity of runtime allocation and the runtime translation overhead is
      no longer justified to save that *potential* 0.19% of memory.  With
      CONFIG_SLUB, page->mem_cgroup actually sits in the doubleword padding
      after the page->private member and doesn't even increase struct page,
      and then this patch actually saves space.  Remaining users that care can
      still compile their kernels without CONFIG_MEMCG.
      
           text    data     bss     dec     hex     filename
        8828345 1725264  983040 11536649 b00909  vmlinux.old
        8827425 1725264  966656 11519345 afc571  vmlinux.new
      
      [mhocko@suse.cz: update Documentation/cgroups/memory.txt]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1306a85a
    • W
      mm: fix a spelling mistake · 26086de3
      Wei Yuan 提交于
      Signed-off-by Wei Yuan <weiyuan.wei@huawei.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26086de3
    • V
      mm, compaction: more focused lru and pcplists draining · fdaf7f5c
      Vlastimil Babka 提交于
      The goal of memory compaction is to create high-order freepages through
      page migration.  Page migration however puts pages on the per-cpu lru_add
      cache, which is later flushed to per-cpu pcplists, and only after pcplists
      are drained the pages can actually merge.  This can happen due to the
      per-cpu caches becoming full through further freeing, or explicitly.
      
      During direct compaction, it is useful to do the draining explicitly so
      that pages merge as soon as possible and compaction can detect success
      immediately and keep the latency impact at minimum.  However the current
      implementation is far from ideal.  Draining is done only in
      __alloc_pages_direct_compact(), after all zones were already compacted,
      and the decisions to continue or stop compaction in individual zones was
      done without the last batch of migrations being merged.  It is also
      missing the draining of lru_add cache before the pcplists.
      
      This patch moves the draining for direct compaction into compact_zone().
      It adds the missing lru_cache draining and uses the newly introduced
      single zone pcplists draining to reduce overhead and avoid impact on
      unrelated zones.  Draining is only performed when it can actually lead to
      merging of a page of desired order (passed by cc->order).  This means it
      is only done when migration occurred in the previously scanned cc->order
      aligned block(s) and the migration scanner is now pointing to the next
      cc->order aligned block.
      
      The patch has been tested with stress-highalloc benchmark from mmtests.
      Although overal allocation success rates of the benchmark were not
      affected, the number of detected compaction successes has doubled.  This
      suggests that allocations were previously successful due to implicit
      merging caused by background activity, making a later allocation attempt
      succeed immediately, but not attributing the success to compaction.  Since
      stress-highalloc always tries to allocate almost the whole memory, it
      cannot show the improvement in its reported success rate metric.  However
      after this patch, compaction should detect success and terminate earlier,
      reducing the direct compaction latencies in a real scenario.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fdaf7f5c