1. 12 February 2015, 40 commits
    • mm: always steal split buddies in fallback allocations · 3a1086fb
      Vlastimil Babka committed
      When allocation falls back to another migratetype, it will steal a page
      with highest available order, and (depending on this order and desired
      migratetype), it might also steal the rest of free pages from the same
      pageblock.
      
      Given the preference of highest available order, it is likely that it will
      be higher than the desired order, and result in the stolen buddy page
      being split.  The remaining pages after split are currently stolen only
      when the rest of the free pages are stolen.  This can however lead to
      situations where for MOVABLE allocations we split e.g.  order-4 fallback
      UNMOVABLE page, but steal only order-0 page.  Then on the next MOVABLE
      allocation (which may be batched to fill the pcplists) we split another
      order-3 or higher page, etc.  By stealing all pages that we have split, we
      can avoid further stealing.
      
      This patch therefore adjusts the page stealing so that buddy pages created
      by the split are always stolen.  This has an effect only on MOVABLE
      allocations, as RECLAIMABLE and UNMOVABLE allocations already always do
      that in addition to stealing the rest of the free pages from the pageblock.
      The change also makes it possible to simplify try_to_steal_freepages() and
      factor out the CMA handling.
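      
      For illustration only (this is not code from the patch; all identifiers
      below are made up), the resulting policy can be summarized as: stealing the
      whole pageblock stays conditional, while the buddies left over from
      splitting the stolen page are now always converted:
      
          /* Illustrative sketch, not kernel code: identifiers are hypothetical. */
          struct steal_decision {
              int steal_whole_pageblock;  /* convert all free pages in the pageblock */
              int steal_split_buddies;    /* convert buddies left over from the split */
          };
      
          static struct steal_decision decide_steal(int desired_is_movable,
                                                    int fallback_order,
                                                    int pageblock_order)
          {
              struct steal_decision d;
      
              /* unchanged: stealing the whole pageblock remains conditional */
              d.steal_whole_pageblock = !desired_is_movable ||
                                        fallback_order >= pageblock_order / 2;
      
              /* new with this patch: split buddies are always stolen */
              d.steal_split_buddies = 1;
      
              return d;
          }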
      
      According to Mel, it has been intended since the beginning that buddy
      pages after split would be stolen always, but it doesn't seem like it was
      ever the case until commit 47118af0 ("mm: mmzone: MIGRATE_CMA
      migration type added").  The commit has unintentionally introduced this
      behavior, but was reverted by commit 0cbef29a ("mm:
      __rmqueue_fallback() should respect pageblock type").  Neither included
      evaluation.
      
      My evaluation with stress-highalloc from mmtests shows about 2.5x
      reduction of page stealing events for MOVABLE allocations, without
      affecting the page stealing events for other allocation migratetypes.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3a1086fb
    • mm: when stealing freepages, also take pages created by splitting buddy page · 99592d59
      Vlastimil Babka committed
      When studying page stealing, I noticed some weird-looking decisions in
      try_to_steal_freepages().  The first, I assume, is a bug (Patch 1); the
      following two patches were driven by evaluation.
      
      Testing was done with stress-highalloc of mmtests, using the
      mm_page_alloc_extfrag tracepoint and postprocessing to get counts of how
      often page stealing occurs for individual migratetypes, and what
      migratetypes are used for fallbacks.  Arguably, the worst case of page
      stealing is when an UNMOVABLE allocation steals from a MOVABLE pageblock.
      A RECLAIMABLE allocation stealing from a MOVABLE pageblock is also not
      ideal, so the goal is to minimize these two cases.
      
      The evaluation of v2 wasn't always a clear win and Joonsoo questioned the
      results.  Here I used a different baseline which includes the RFC compaction
      improvements from [1].  I found that the compaction improvements reduce the
      variability of stress-highalloc, so there's less noise in the data.
      
      First, let's look at stress-highalloc configured to do sync compaction,
      and how these patches reduce page stealing events during the test.  The
      first column is after a fresh reboot, the other two are reiterations of the
      test without reboot.  All of that was accumulated over 5 re-iterations (so
      the benchmark was run 5x3 times with 5 fresh restarts).
      
      Baseline:
      
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        5-nothp-1       5-nothp-2       5-nothp-3
      Page alloc extfrag event                               10264225     8702233    10244125
      Extfrag fragmenting                                    10263271     8701552    10243473
      Extfrag fragmenting for unmovable                         13595       17616       15960
      Extfrag fragmenting unmovable placed with movable          7989       12193        8447
      Extfrag fragmenting for reclaimable                         658        1840        1817
      Extfrag fragmenting reclaimable placed with movable         558        1677        1679
      Extfrag fragmenting for movable                        10249018     8682096    10225696
      
      With Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        6-nothp-1       6-nothp-2       6-nothp-3
      Page alloc extfrag event                               11834954     9877523     9774860
      Extfrag fragmenting                                    11833993     9876880     9774245
      Extfrag fragmenting for unmovable                          7342       16129       11712
      Extfrag fragmenting unmovable placed with movable          4191       10547        6270
      Extfrag fragmenting for reclaimable                         373        1130         923
      Extfrag fragmenting reclaimable placed with movable         302         906         738
      Extfrag fragmenting for movable                        11826278     9859621     9761610
      
      With Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        7-nothp-1       7-nothp-2       7-nothp-3
      Page alloc extfrag event                                4725990     3668793     3807436
      Extfrag fragmenting                                     4725104     3668252     3806898
      Extfrag fragmenting for unmovable                          6678        7974        7281
      Extfrag fragmenting unmovable placed with movable          2051        3829        4017
      Extfrag fragmenting for reclaimable                         429        1208        1278
      Extfrag fragmenting reclaimable placed with movable         369         976        1034
      Extfrag fragmenting for movable                         4717997     3659070     3798339
      
      With Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                        8-nothp-1       8-nothp-2       8-nothp-3
      Page alloc extfrag event                                5016183     4700142     3850633
      Extfrag fragmenting                                     5015325     4699613     3850072
      Extfrag fragmenting for unmovable                          1312        3154        3088
      Extfrag fragmenting unmovable placed with movable          1115        2777        2714
      Extfrag fragmenting for reclaimable                         437        1193        1097
      Extfrag fragmenting reclaimable placed with movable         330         969         879
      Extfrag fragmenting for movable                         5013576     4695266     3845887
      
      In v2 we've seen an apparent regression with Patch 1 for unmovable events;
      this is now gone, suggesting it was indeed noise.  Here, each patch
      improves the situation for unmovable events.  Reclaimable is improved by
      patch 1 and then either stays the same modulo noise, or is perhaps slightly
      worse - a small price for the unmovable improvements, IMHO.  The number of
      movable allocations falling back to other migratetypes is the most noisy,
      but it's nevertheless reduced to half by Patch 2.  These are the least
      critical as compaction can move them around.
      
      If we look at success rates, the patches don't affect them; that hasn't changed.
      
      Baseline:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  5-nothp-1             5-nothp-2             5-nothp-3
      Success 1 Min         49.00 (  0.00%)       42.00 ( 14.29%)       41.00 ( 16.33%)
      Success 1 Mean        51.00 (  0.00%)       45.00 ( 11.76%)       42.60 ( 16.47%)
      Success 1 Max         55.00 (  0.00%)       51.00 (  7.27%)       46.00 ( 16.36%)
      Success 2 Min         53.00 (  0.00%)       47.00 ( 11.32%)       44.00 ( 16.98%)
      Success 2 Mean        59.60 (  0.00%)       50.80 ( 14.77%)       48.20 ( 19.13%)
      Success 2 Max         64.00 (  0.00%)       56.00 ( 12.50%)       52.00 ( 18.75%)
      Success 3 Min         84.00 (  0.00%)       82.00 (  2.38%)       78.00 (  7.14%)
      Success 3 Mean        85.60 (  0.00%)       82.80 (  3.27%)       79.40 (  7.24%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 1:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  6-nothp-1             6-nothp-2             6-nothp-3
      Success 1 Min         49.00 (  0.00%)       44.00 ( 10.20%)       44.00 ( 10.20%)
      Success 1 Mean        51.80 (  0.00%)       46.00 ( 11.20%)       45.80 ( 11.58%)
      Success 1 Max         54.00 (  0.00%)       49.00 (  9.26%)       49.00 (  9.26%)
      Success 2 Min         58.00 (  0.00%)       49.00 ( 15.52%)       48.00 ( 17.24%)
      Success 2 Mean        60.40 (  0.00%)       51.80 ( 14.24%)       50.80 ( 15.89%)
      Success 2 Max         63.00 (  0.00%)       54.00 ( 14.29%)       55.00 ( 12.70%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       81.60 (  4.00%)       79.80 (  6.12%)
      Success 3 Max         86.00 (  0.00%)       82.00 (  4.65%)       82.00 (  4.65%)
      
      Patch 2:
      
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  7-nothp-1             7-nothp-2             7-nothp-3
      Success 1 Min         50.00 (  0.00%)       44.00 ( 12.00%)       39.00 ( 22.00%)
      Success 1 Mean        52.80 (  0.00%)       45.60 ( 13.64%)       42.40 ( 19.70%)
      Success 1 Max         55.00 (  0.00%)       46.00 ( 16.36%)       47.00 ( 14.55%)
      Success 2 Min         52.00 (  0.00%)       48.00 (  7.69%)       45.00 ( 13.46%)
      Success 2 Mean        53.40 (  0.00%)       49.80 (  6.74%)       48.80 (  8.61%)
      Success 2 Max         57.00 (  0.00%)       52.00 (  8.77%)       52.00 (  8.77%)
      Success 3 Min         84.00 (  0.00%)       81.00 (  3.57%)       79.00 (  5.95%)
      Success 3 Mean        85.00 (  0.00%)       82.40 (  3.06%)       79.60 (  6.35%)
      Success 3 Max         86.00 (  0.00%)       83.00 (  3.49%)       80.00 (  6.98%)
      
      Patch 3:
                                   3.19-rc4              3.19-rc4              3.19-rc4
                                  8-nothp-1             8-nothp-2             8-nothp-3
      Success 1 Min         46.00 (  0.00%)       44.00 (  4.35%)       42.00 (  8.70%)
      Success 1 Mean        50.20 (  0.00%)       45.60 (  9.16%)       44.00 ( 12.35%)
      Success 1 Max         52.00 (  0.00%)       47.00 (  9.62%)       47.00 (  9.62%)
      Success 2 Min         53.00 (  0.00%)       49.00 (  7.55%)       48.00 (  9.43%)
      Success 2 Mean        55.80 (  0.00%)       50.60 (  9.32%)       49.00 ( 12.19%)
      Success 2 Max         59.00 (  0.00%)       52.00 ( 11.86%)       51.00 ( 13.56%)
      Success 3 Min         84.00 (  0.00%)       80.00 (  4.76%)       79.00 (  5.95%)
      Success 3 Mean        85.40 (  0.00%)       81.60 (  4.45%)       80.40 (  5.85%)
      Success 3 Max         87.00 (  0.00%)       83.00 (  4.60%)       82.00 (  5.75%)
      
      While there's no improvement here, I consider the reduced fragmentation
      events to be worthwhile on their own.  Patch 2 also seems to reduce scanning
      for free pages, and migrations in compaction, suggesting it has somewhat
      less work to do:
      
      Patch 1:
      
      Compaction stalls                 4153        3959        3978
      Compaction success                1523        1441        1446
      Compaction failures               2630        2517        2531
      Page migrate success           4600827     4943120     5104348
      Page migrate failure             19763       16656       17806
      Compaction pages isolated      9597640    10305617    10653541
      Compaction migrate scanned    77828948    86533283    87137064
      Compaction free scanned      517758295   521312840   521462251
      Compaction cost                   5503        5932        6110
      
      Patch 2:
      
      Compaction stalls                 3800        3450        3518
      Compaction success                1421        1316        1317
      Compaction failures               2379        2134        2201
      Page migrate success           4160421     4502708     4752148
      Page migrate failure             19705       14340       14911
      Compaction pages isolated      8731983     9382374     9910043
      Compaction migrate scanned    98362797    96349194    98609686
      Compaction free scanned      496512560   469502017   480442545
      Compaction cost                   5173        5526        5811
      
      As with v2, /proc/pagetypeinfo appears unaffected with respect to numbers
      of unmovable and reclaimable pageblocks.
      
      Configuring the benchmark to allocate like a THP page fault (i.e.  no sync
      compaction) gives much noisier results for iterations 2 and 3 after
      reboot.  This is not so surprising given that [1] offers lower improvements
      in this scenario due to fewer restarts after deferred compaction, which
      would change the compaction pivot.
      
      Baseline:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          5-thp-1         5-thp-2         5-thp-3
      Page alloc extfrag event                                8148965     6227815     6646741
      Extfrag fragmenting                                     8147872     6227130     6646117
      Extfrag fragmenting for unmovable                         10324       12942       15975
      Extfrag fragmenting unmovable placed with movable          5972        8495       10907
      Extfrag fragmenting for reclaimable                         601        1707        2210
      Extfrag fragmenting reclaimable placed with movable         520        1570        2000
      Extfrag fragmenting for movable                         8136947     6212481     6627932
      
      Patch 1:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          6-thp-1         6-thp-2         6-thp-3
      Page alloc extfrag event                                8345457     7574471     7020419
      Extfrag fragmenting                                     8343546     7573777     7019718
      Extfrag fragmenting for unmovable                         10256       18535       30716
      Extfrag fragmenting unmovable placed with movable          6893       11726       22181
      Extfrag fragmenting for reclaimable                         465        1208        1023
      Extfrag fragmenting reclaimable placed with movable         353         996         843
      Extfrag fragmenting for movable                         8332825     7554034     6987979
      
      Patch 2:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          7-thp-1         7-thp-2         7-thp-3
      Page alloc extfrag event                                3512847     3020756     2891625
      Extfrag fragmenting                                     3511940     3020185     2891059
      Extfrag fragmenting for unmovable                          9017        6892        6191
      Extfrag fragmenting unmovable placed with movable          1524        3053        2435
      Extfrag fragmenting for reclaimable                         445        1081        1160
      Extfrag fragmenting reclaimable placed with movable         375         918         986
      Extfrag fragmenting for movable                         3502478     3012212     2883708
      
      Patch 3:
                                                         3.19-rc4        3.19-rc4        3.19-rc4
                                                          8-thp-1         8-thp-2         8-thp-3
      Page alloc extfrag event                                3181699     3082881     2674164
      Extfrag fragmenting                                     3180812     3082303     2673611
      Extfrag fragmenting for unmovable                          1201        4031        4040
      Extfrag fragmenting unmovable placed with movable           974        3611        3645
      Extfrag fragmenting for reclaimable                         478        1165        1294
      Extfrag fragmenting reclaimable placed with movable         387         985        1030
      Extfrag fragmenting for movable                         3179133     3077107     2668277
      
      The improvements for the first iteration are clear, the rest is much noisier
      and can appear like a regression for Patch 1.  In any case, patch 2 rectifies it.
      
      Allocation success rates are again unaffected so there's no point in
      making this e-mail any longer.
      
      [1] http://marc.info/?l=linux-mm&m=142166196321125&w=2
      
      This patch (of 3):
      
      When __rmqueue_fallback() is called to allocate a page of order X, it will
      find a page of order Y >= X of a fallback migratetype, which is different
      from the desired migratetype.  With the help of try_to_steal_freepages(),
      it may change the migratetype (to the desired one) also of:
      
      1) all currently free pages in the pageblock containing the fallback page
      2) the fallback pageblock itself
      3) buddy pages created by splitting the fallback page (when Y > X)
      
      These decisions take the order Y into account, as well as the desired
      migratetype, with the goal of preventing multiple fallback allocations
      that could e.g.  distribute UNMOVABLE allocations among multiple
      pageblocks.
      
      Originally, the decision for 1) implied the decision for 3).  Commit
      47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") changed that
      (probably unintentionally) so that the buddy pages in case 3) are always
      changed to the desired migratetype, except for CMA pageblocks.
      
      Commit fef903ef ("mm/page_allo.c: restructure free-page stealing code
      and fix a bug") did some refactoring and added a comment that the case of
      3) is intended.  Commit 0cbef29a ("mm: __rmqueue_fallback() should
      respect pageblock type") removed the comment and tried to restore the
      original behavior where 1) implies 3), but due to the previous
      refactoring, the result is instead that only 2) implies 3) - and the
      conditions for 2) are less frequently met than conditions for 1).  This
      may increase fragmentation in situations where the code decides to steal
      all free pages from the pageblock (case 1)), but then gives back the buddy
      pages produced by splitting.
      
      This patch restores the original intended logic where 1) implies 3).
      During testing with stress-highalloc from mmtests, this has shown to
      decrease the number of events where UNMOVABLE and RECLAIMABLE allocations
      steal from MOVABLE pageblocks, which can lead to permanent fragmentation.
      In some cases it has increased the number of events when MOVABLE
      allocations steal from UNMOVABLE or RECLAIMABLE pageblocks, but these are
      fixable by sync compaction and thus less harmful.
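      
      Schematically (an illustrative sketch, not the kernel function), the
      restored relation between the three cases is simply that case 1) implies
      case 3) again:
      
          /* Illustrative only: how the decision for case 3) is derived again. */
          static int change_split_buddies(int steal_all_free_pages,   /* case 1) */
                                          int change_pageblock_type)  /* case 2) */
          {
              /*
               * Before the fix only case 2) implied case 3), so split buddies
               * could be given back even when all free pages were stolen.
               * Restore the original logic: case 1) implies case 3).
               */
              return steal_all_free_pages;
          }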
      
      Note that evaluation has shown that the behavior introduced by
      47118af0 for buddy pages in case 3) is actually even better than the
      original logic, so the following patch will introduce it properly once
      again.  For stable backports of this patch it thus makes sense to only fix
      versions containing 0cbef29a.
      
      [iamjoonsoo.kim@lge.com: tracepoint fix]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: <stable@vger.kernel.org>	[3.13+ containing 0cbef29a]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99592d59
    • mincore: apply page table walker on do_mincore() · 1e25a271
      Naoya Horiguchi committed
      This patch makes do_mincore() use walk_page_vma(), which reduces many
      lines of code by using common page table walk code.
      
      [daeseok.youn@gmail.com: remove unneeded variable 'err']
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Daeseok Youn <daeseok.youn@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e25a271
    • mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP) · 48684a65
      Naoya Horiguchi committed
      walk_page_range() silently skips a vma having VM_PFNMAP set, which leads to
      undesirable behaviour for the caller of walk_page_range().  For example,
      for pagemap_read(), when no callbacks are called against a VM_PFNMAP vma,
      pagemap_read() may prepare pagemap data for the next virtual address range
      at the wrong index.  That could confuse and/or break userspace
      applications.
      
      This patch avoids this misbehavior for vma(VM_PFNMAP) as follows:
      - for pagemap_read() which has its own ->pte_hole(), call the ->pte_hole()
        over vma(VM_PFNMAP),
      - for clear_refs and queue_pages which have their own ->tests_walk,
        just return 1 and skip vma(VM_PFNMAP). This is no problem because
        these are not interested in hole regions,
      - for other callers, just skip the vma(VM_PFNMAP) as a default behavior.
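      
      For illustration, a ->test_walk() callback in the style described above
      might look roughly like this (a sketch against the mm_walk interface of
      this series, not the exact kernel code):
      
          /* Sketch: skip vma(VM_PFNMAP) without reporting it as a hole. */
          static int example_test_walk(unsigned long start, unsigned long end,
                                       struct mm_walk *walk)
          {
              struct vm_area_struct *vma = walk->vma;
      
              if (vma->vm_flags & VM_PFNMAP)
                  return 1;   /* 1 = skip this vma, continue the walk */
      
              return 0;       /* 0 = walk the page tables of this vma */
          }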
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Shiraz Hashim <shashim@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48684a65
    • mempolicy: apply page table walker on queue_pages_range() · 6f4576e3
      Naoya Horiguchi committed
      queue_pages_range() does page table walking in its own way now, but there
      is some code duplication.  This patch applies the page table walker to
      reduce lines of code.
      
      queue_pages_range() has to do some precheck to determine whether we really
      walk over the vma or just skip it.  Now we have test_walk() callback in
      mm_walk for this purpose, so we can do this replacement cleanly.
      queue_pages_test_walk() depends on not only the current vma but also the
      previous one, so queue_pages->prev is introduced to remember it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f4576e3
    • memcg: cleanup preparation for page table walk · 26bcd64a
      Naoya Horiguchi committed
      pagewalk.c can handle the vma by itself, so we don't have to pass the vma
      via walk->private.  Both mem_cgroup_count_precharge() and
      mem_cgroup_move_charge() do the for-each-vma loop themselves, but now that
      it's done in pagewalk.c, let's clean them up.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26bcd64a
    • pagewalk: add walk_page_vma() · 900fc5f1
      Naoya Horiguchi committed
      Introduce walk_page_vma(), which is useful for callers that want to
      walk over a given vma.  It's used by later patches.
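      
      A rough sketch of such a wrapper over the existing walker (the internal
      helper name below is a placeholder, not necessarily the kernel's):
      
          /* Sketch: walk the page tables of a single vma. */
          int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk)
          {
              if (!walk->mm || !vma)
                  return -EINVAL;
      
              walk->vma = vma;
              /* __walk_range() stands in for the walker's range routine */
              return __walk_range(vma->vm_start, vma->vm_end, walk);
          }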
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      900fc5f1
    • pagewalk: improve vma handling · fafaa426
      Naoya Horiguchi committed
      The current implementation of the page table walker has a fundamental
      problem in vma handling, which started when we tried to handle
      vma(VM_HUGETLB).  Because it's done in the pgd loop, considering vma
      boundaries makes the code complicated and bug-prone.
      
      From the user's viewpoint, a user checks some vma-related condition to
      determine whether it really does the page walk over the vma.
      
      In order to solve these, this patch moves the vma check outside the pgd
      loop and introduces a new callback ->test_walk().
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fafaa426
    • mm/pagewalk: remove pgd_entry() and pud_entry() · 0b1fbfe5
      Naoya Horiguchi committed
      Currently no user of the page table walker sets ->pgd_entry() or
      ->pud_entry(), so checking their existence in each loop just wastes
      CPU cycles.  Let's remove them to reduce overhead.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b1fbfe5
    • mm: gup: use get_user_pages_unlocked · 7e339128
      Andrea Arcangeli committed
      This allows those get_user_pages calls to pass FAULT_FLAG_ALLOW_RETRY to
      the page fault in order to release the mmap_sem during the I/O.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e339128
    • mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Andrea Arcangeli committed
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7b78075
    • mm: gup: add __get_user_pages_unlocked to customize gup_flags · 0fd71a56
      Andrea Arcangeli committed
      Some callers (like KVM) may want to set gup_flags like FOLL_HWPOISON
      to get a proper -EHWPOISON retval instead of -EFAULT, in order to take a
      more appropriate action if get_user_pages runs into a memory failure.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0fd71a56
    • mm: gup: add get_user_pages_locked and get_user_pages_unlocked · f0818f47
      Andrea Arcangeli committed
      FAULT_FLAG_ALLOW_RETRY allows the page fault to drop the mmap_sem for
      reading to reduce the mmap_sem contention (for writing), like while
      waiting for I/O completion.  The problem is that right now practically no
      get_user_pages call uses FAULT_FLAG_ALLOW_RETRY, so we're not leveraging
      that nifty feature.
      
      Andres fixed it for the KVM page fault.  However get_user_pages_fast
      remains uncovered, and 99% of other get_user_pages aren't using it either
      (the only exception being FOLL_NOWAIT in KVM which is really nonblocking
      and in fact it doesn't even release the mmap_sem).
      
      So this patchset extends the optimization Andres did in the KVM page
      fault to the whole kernel.  It makes the most important places (including
      gup_fast) use FAULT_FLAG_ALLOW_RETRY to reduce the mmap_sem hold times
      during I/O.
      
      The only few places that remain uncovered are drivers like v4l and other
      exceptions that tend to work on their own memory rather than on random
      user memory (unlike, for example, O_DIRECT, which uses gup_fast and is
      fully covered by this patch).
      
      A follow-up patch should probably also add a printk_once warning to
      get_user_pages, which should go obsolete and be phased out eventually.  The
      "vmas" parameter of get_user_pages makes it fundamentally incompatible
      with FAULT_FLAG_ALLOW_RETRY (the vmas array becomes meaningless the moment
      the mmap_sem is released).
      
      While this is just an optimization, this becomes an absolute requirement
      for the userfaultfd feature http://lwn.net/Articles/615086/ .
      
      The userfaultfd allows blocking the page fault, and in order to do so I
      need to drop the mmap_sem first.  So this patch also ensures that, for all
      memory where userfaultfd could be registered by KVM, the very first fault
      (no matter if it is a regular page fault, or a get_user_pages) always has
      FAULT_FLAG_ALLOW_RETRY set.  Then the userfaultfd blocks and it is woken
      only when the pagetable is already mapped.  The second fault attempt after
      the wakeup doesn't need FAULT_FLAG_ALLOW_RETRY, so it's ok to retry
      without it.
      
      This patch (of 5):
      
      We can leverage the VM_FAULT_RETRY functionality in the page fault paths
      better by using either get_user_pages_locked or get_user_pages_unlocked.
      
      The former allows conversion of get_user_pages invocations that will have
      to pass a "&locked" parameter to know if the mmap_sem was dropped during
      the call.  Example from:
      
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      to:
      
          int locked = 1;
          down_read(&mm->mmap_sem);
          do_something()
          get_user_pages_locked(tsk, mm, ..., pages, &locked);
          if (locked)
              up_read(&mm->mmap_sem);
      
      The latter is suitable only as a drop in replacement of the form:
      
          down_read(&mm->mmap_sem);
          get_user_pages(tsk, mm, ..., pages, NULL);
          up_read(&mm->mmap_sem);
      
      into:
      
          get_user_pages_unlocked(tsk, mm, ..., pages);
      
      Where tsk, mm, the intermediate "..." parameters and "pages" can be any
      value as before.  Just the last parameter of get_user_pages (vmas) must be
      NULL for get_user_pages_locked|unlocked to be usable (the latter original
      form wouldn't have been safe anyway if vmas wasn't NULL; for the former we
      just make it explicit by dropping the parameter).
      
      If vmas is not NULL these two methods cannot be used.
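      
      Roughly (a simplified sketch with an abridged argument list, not the exact
      implementation), the unlocked variant can be thought of as:
      
          /* Sketch: get_user_pages_unlocked() takes and drops mmap_sem itself. */
          long get_user_pages_unlocked(struct task_struct *tsk, struct mm_struct *mm,
                                       unsigned long start, unsigned long nr_pages,
                                       int write, int force, struct page **pages)
          {
              int locked = 1;
              long ret;
      
              down_read(&mm->mmap_sem);
              ret = get_user_pages_locked(tsk, mm, start, nr_pages,
                                          write, force, pages, &locked);
              if (locked)
                  up_read(&mm->mmap_sem);
              return ret;
          }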
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andres Lagar-Cavilla <andreslc@google.com>
      Reviewed-by: Peter Feiner <pfeiner@google.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0818f47
    • mm/mempolicy.c: merge alloc_hugepage_vma to alloc_pages_vma · be97a41b
      Vlastimil Babka committed
      The previous commit ("mm/thp: Allocate transparent hugepages on local
      node") introduced alloc_hugepage_vma() to mm/mempolicy.c to perform a
      special policy for THP allocations.  The function has the same interface
      as alloc_pages_vma(), shares a lot of boilerplate code and a long
      comment.
      
      This patch merges the hugepage special case into alloc_pages_vma.  The
      extra if condition should be a cheap enough price to pay.  We also prevent
      a (however unlikely) race with parallel mems_allowed update, which could
      make hugepage allocation restart only within the fallback call to
      alloc_hugepage_vma() and not reconsider the special rule in
      alloc_hugepage_vma().
      
      Also by making sure mpol_cond_put(pol) is always called before actual
      allocation attempt, we can use a single exit path within the function.
      
      Also update the comment for missing node parameter and obsolete reference
      to mm_sem.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be97a41b
    • mm/thp: allocate transparent hugepages on local node · 077fcf11
      Aneesh Kumar K.V committed
      This makes sure that we try to allocate hugepages from the local node if
      allowed by mempolicy.  If we can't, we fall back to small page allocation
      based on mempolicy.  This is based on the observation that allocating
      pages on the local node is more beneficial than allocating hugepages on a
      remote node.
      
      With this patch applied we may find transparent huge page allocation
      failures if the current node doesn't have enough free hugepages.  Before
      this patch such failures resulted in us retrying the allocation on other
      nodes in the numa node mask.
      
      [akpm@linux-foundation.org: fix comment, add CONFIG_TRANSPARENT_HUGEPAGE dependency]
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      077fcf11
    • mm/compaction: add tracepoint to observe behaviour of compaction defer · 24e2716f
      Joonsoo Kim committed
      The compaction deferring logic is a heavy hammer that blocks the way to
      compaction.  It doesn't consider overall system state, so it could wrongly
      prevent the user from doing compaction.  In other words, even if the system
      has a suitable range of memory to compact, compaction would be skipped due
      to the deferring logic.  This patch adds a new tracepoint to understand the
      work of the deferring logic.  It will also help to check compaction success
      and failure.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      24e2716f
    • mm/compaction: more trace to understand when/why compaction start/finish · 837d026d
      Joonsoo Kim committed
      It is not well analyzed when/why compaction starts and finishes, or
      doesn't.  With these new tracepoints, we can know much more about the
      start/finish reasons of compaction.  I could find the following bug with
      these tracepoints:
      
      http://www.spinics.net/lists/linux-mm/msg81582.html
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      837d026d
    • mm/compaction: print current range where compaction work · e34d85f0
      Joonsoo Kim committed
      It'd be useful, for detailed analysis, to know the current range where
      compaction works.  With it, we can know which pageblock we actually scan
      and isolate from, and how many pages we try in that pageblock, and roughly
      guess why it doesn't become a free page of pageblock order.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e34d85f0
    • mm/compaction: enhance tracepoint output for compaction begin/end · 16c4a097
      Joonsoo Kim committed
      We now have a tracepoint for the begin event of compaction, which prints
      the start position of both scanners, but the tracepoint for the end event
      doesn't print the finish position of both scanners.  It'd also be useful
      to know the finish position of both scanners, so this patch adds it.  It
      will help to find odd behavior or problems in compaction's internal logic.
      
      The compaction mode is also added to both begin/end tracepoint output,
      since compaction behavior differs quite a bit according to mode.
      
      Lastly, the status format is changed to a string rather than a status
      number for readability.
      
      [akpm@linux-foundation.org: fix sparse warning]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16c4a097
    • page_writeback: put account_page_redirty() after set_page_dirty() · 8d38633c
      Konstantin Khebnikov committed
      The helper account_page_redirty() fixes the dirty pages counter for
      redirtied pages.  This patch puts it after dirtying, preventing temporary
      underflows of the dirtied pages counters on the zone/bdi and in
      current->nr_dirtied.
      Signed-off-by: Konstantin Khebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d38633c
    • mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
      Kirill A. Shutemov committed
      The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which happens
      *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free(), we see an offset in the counters
      by the time of the checks.
      
      We tried to work around this by offsetting the expected counter value
      according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
      exit_mmap().  But it doesn't work in some cases:
      
      1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
         the upper addresses are occupied by huge pmd entries, so the trick of
         offsetting the expected counter value would get really ugly: we would
         have to apply it to nr_pmds, but not nr_ptes.
      
      2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page
         table allocation in pgd_alloc(); it just sets up a pgd entry which is
         allocated at boot and shared across all processes.
      
      The proposal is to move the check to check_mm(), which happens *after*
      pgd_free(), and to do proper accounting during pgd_alloc() and pgd_free(),
      which would bring the counters to zero if nothing leaked.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Nishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov committed
      Dave noticed that an unprivileged process can allocate a significant amount
      of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
      the memory cgroup.  The trick is to allocate a lot of PMD page tables.  The
      Linux kernel doesn't account PMD tables to the process, only PTE tables.
      
      The use case below uses a few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			/* keep the returned address so the failure check is meaningful */
      			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process the
      same way we account PTE tables.
      
      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range(), but there are a few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates a few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configurations where the PMD page table level is
      present (PMD is not folded).  As with nr_ptes, we use a per-mm counter.  The
      counter value is used to calculate the baseline for the badness score by
      the oom-killer.
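      
      The accounting itself can be pictured along these lines (a sketch; field
      and helper names are approximate):
      
          /* Sketch: nr_pmds is an atomic_long_t in struct mm_struct, next to nr_ptes. */
          static inline unsigned long mm_nr_pmds(struct mm_struct *mm)
          {
              return atomic_long_read(&mm->nr_pmds);
          }
      
          static inline void mm_inc_nr_pmds(struct mm_struct *mm)
          {
              atomic_long_inc(&mm->nr_pmds);  /* e.g. from __pmd_alloc() */
          }
      
          static inline void mm_dec_nr_pmds(struct mm_struct *mm)
          {
              atomic_long_dec(&mm->nr_pmds);  /* e.g. from free_pmd_range() */
          }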
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • mm: memcontrol: consolidate swap controller code · 21afa38e
      Johannes Weiner committed
      The swap controller code is scattered all over the file.  Gather all
      the code that isn't directly needed by the memory controller at the
      end of the file in its own CONFIG_MEMCG_SWAP section.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21afa38e
    • mm: memcontrol: consolidate memory controller initialization · 95a045f6
      Johannes Weiner committed
      The initialization code for the per-cpu charge stock and the soft
      limit tree is compact enough to inline it into mem_cgroup_init().
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95a045f6
    • mm: memcontrol: simplify soft limit tree init code · 9c608dbe
      Johannes Weiner committed
      - No need to test the node for N_MEMORY.  node_online() is enough for
        node fallback to work in slab; use NUMA_NO_NODE for everything else.
      
      - Remove the BUG_ON() for allocation failure.  A NULL pointer crash is
        just as descriptive, and the absent return value check is obvious.
      
      - Move local variables to the inner-most blocks.
      
      - Point to the tree structure after it's initialized, not before; it's
        just more logical that way.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c608dbe
    • mm: cma: fix totalcma_pages to include DT defined CMA regions · 94737a85
      George G. Davis committed
      The totalcma_pages variable is not updated to account for CMA regions
      defined via device tree reserved-memory sub-nodes.  Fix this omission by
      moving the calculation of totalcma_pages into cma_init_reserved_mem()
      instead of cma_declare_contiguous() such that it will include reserved
      memory used by all CMA regions.
      Signed-off-by: George G. Davis <george_davis@mentor.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94737a85
    • oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko committed
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") left a race window where the OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely, and a partial solution was
      deemed sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution, though, and insisted on a
      full solution.  That requires full exclusion between the OOM killer and the
      freezer's task freezing.  This is done by this patch, which introduces an
      oom_sem RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing, so it waits for all
      currently running OOM killer invocations.  Then it disables all further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
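      
      A simplified sketch of that barrier (error handling and the exact wait
      mechanism are omitted or approximated, not the patch's verbatim code):
      
          static DECLARE_RWSEM(oom_sem);
          static atomic_t oom_victims = ATOMIC_INIT(0);
          static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
          static bool oom_killer_disabled;
      
          bool oom_killer_disable(void)
          {
              /* wait for OOM invocations already in flight */
              down_write(&oom_sem);
              oom_killer_disabled = true;
              up_write(&oom_sem);
      
              /* wait until the last marked victim calls unmark_oom_victim() */
              wait_event(oom_victims_wait, !atomic_read(&oom_victims));
              return true;
          }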
      
      The page fault path is covered now as well, although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reachable from page fault." so it would be
      better and more robust to not rely on freezing points here.  The same
      applies to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger, and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation the same as before.  The page fault path
      will retry the fault (more on that later) and the Sysrq OOM trigger will
      simply complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • oom: thaw the OOM victim if it is frozen · 63a8ca9b
      Michal Hocko committed
      oom_kill_process only sets the TIF_MEMDIE flag and sends a signal to the
      victim.  This is basically a noop when the task is frozen, though, because
      the task sleeps in uninterruptible sleep.  The victim is eventually thawed
      later, when oom_scan_process_thread meets the task again in a later OOM
      invocation, so the OOM killer doesn't livelock.  But this is less than
      optimal.
      
      Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE for
      the victim.  We are not checking whether the task is frozen, because that
      would be racy and __thaw_task does that already.  oom_scan_process_thread
      doesn't need to care about the freezer anymore, as TIF_MEMDIE and the
      freezer are now completely excluded.
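      
      In sketch form (simplified, not the patch's verbatim code), the marking
      helper then becomes:
      
          void mark_tsk_oom_victim(struct task_struct *tsk)
          {
              set_tsk_thread_flag(tsk, TIF_MEMDIE);
              /*
               * Thaw the task unconditionally; __thaw_task() itself checks
               * whether the task is actually frozen, so no racy check here.
               */
              __thaw_task(tsk);
          }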
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63a8ca9b
    • oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Michal Hocko committed
      This patchset addresses a race which was described in the changelog for
      5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situtation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because that
      would require a more complex solution as it can be seen by this patchset.
      
      The primary motivation was to close the race condition between the OOM
      killer and the PM freezer _completely_.  As Tejun pointed out, even though
      the race condition is unlikely, weird bugs deep in the PM freezer would be
      that much harder to debug when the debugging options are reduced
      considerably.  I can only speculate what might happen when a task is
      unexpectedly still runnable.
      
      On the plus side, and as a side effect, the oom enable/disable has a better
      (full barrier) semantic without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - a simple module which allocates and frees 20M in 8K chunks; if it
        sees freezing(current), it tries another round of allocation before
        calling try_to_freeze (a hedged sketch of such a module follows this
        list)
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm, which means some necessary
        updates in mm/oom_kill.c.  mark_tsk_oom_victim is now called under
        task_lock, but I think this should be OK because __thaw_task
        shouldn't interfere with any locking down the wake_up_process path.
        Oleg?
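
      For reference, a hedged reconstruction of what the described
      allocation-loop test module might look like; the 20M/8K numbers come
      from the description above, while the kthread structure, the
      "kmem_eater" name and the msleep pacing are assumptions:

        /* Hypothetical reconstruction of the test module described above. */
        #include <linux/module.h>
        #include <linux/kthread.h>
        #include <linux/freezer.h>
        #include <linux/slab.h>
        #include <linux/delay.h>
        #include <linux/err.h>

        #define CHUNK_SIZE      (8 * 1024)
        #define NR_CHUNKS       ((20 * 1024 * 1024) / CHUNK_SIZE)

        static struct task_struct *eater;
        static void *chunks[NR_CHUNKS];

        /* Allocate 20M in 8K chunks, then free it all again. */
        static void eat_and_free(void)
        {
                int i;

                for (i = 0; i < NR_CHUNKS; i++)
                        chunks[i] = kmalloc(CHUNK_SIZE, GFP_KERNEL);
                for (i = 0; i < NR_CHUNKS; i++)
                        kfree(chunks[i]);       /* kfree(NULL) is fine */
        }

        static int mem_eater_thread(void *data)
        {
                set_freezable();
                while (!kthread_should_stop()) {
                        eat_and_free();
                        /*
                         * If the freezer wants us, do one more allocation
                         * round before actually freezing, as described in
                         * the list above.
                         */
                        if (freezing(current)) {
                                eat_and_free();
                                try_to_freeze();
                        }
                        msleep(100);
                }
                return 0;
        }

        static int __init mem_eater_init(void)
        {
                eater = kthread_run(mem_eater_thread, NULL, "kmem_eater");
                return PTR_ERR_OR_ZERO(eater);
        }

        static void __exit mem_eater_exit(void)
        {
                kthread_stop(eater);
        }

        module_init(mem_eater_init);
        module_exit(mem_eater_exit);
        MODULE_LICENSE("GPL");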
      
      As expected, there are no OOM-killed tasks after oom is disabled, and
      allocations requested by the kernel thread fail once all the tasks are
      frozen and OOM is disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait, but then I expected the
      race to be really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless of the rest, IMO.

      Patches 3 and 4 are a trivial printk -> pr_info conversion and should
      go in as well.

      The main patch is the last one and I would appreciate acks from Tejun
      and Rafael.  I think the OOM part should be OK (except for __thaw_task
      vs. task_lock, where a look from Oleg would be appreciated), but I am
      not so sure I haven't screwed anything up in the freezer code.  I have
      found several surprises there.
      
      This patch (of 5):
      
      This patch is just preparatory and doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE
      just to wait for the oom victim and to prevent new killing.  This is
      just a side effect of the flag.  Its primary meaning is to give the
      oom victim access to memory reserves, and that shouldn't be necessary
      here.
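
      A minimal sketch of the kind of helpers this preparatory patch
      introduces; the mark_tsk_oom_victim()/unmark_oom_victim() names match
      the later patches in the series, and the conversion of the existing
      open-coded TIF_MEMDIE sites is not shown:

        /* Hedged sketch -- not the verbatim mm/oom_kill.c code. */
        #include <linux/sched.h>

        /* Flag tsk as an OOM victim so it gets access to memory reserves. */
        void mark_tsk_oom_victim(struct task_struct *tsk)
        {
                set_tsk_thread_flag(tsk, TIF_MEMDIE);
        }

        /* Called by the victim (current) once its memory has been freed. */
        void unmark_oom_victim(void)
        {
                clear_thread_flag(TIF_MEMDIE);
        }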
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49550b60
    • J
      mm: memcontrol: fold move_anon() and move_file() · 1dfab5ab
      Committed by Johannes Weiner
      Turn the move type enum into flags and give the flags field a shorter
      name.  Once that is done, move_anon() and move_file() are simple enough to
      just fold them into the callsites.
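
      A hedged sketch of the post-patch shape; the flag names follow the
      MOVE_MASK mentioned in the note below, and the struct layout is
      abridged:

        /* Hedged sketch -- abridged from what mm/memcontrol.c ends up with. */
        #define MOVE_ANON       0x1U
        #define MOVE_FILE       0x2U
        #define MOVE_MASK       (MOVE_ANON | MOVE_FILE)

        static struct move_charge_struct {
                /* ... other members elided ... */
                unsigned long flags;            /* the shorter field name */
        } mc;

        /* A former move_anon() callsite folds into a direct flag test: */
        static inline bool may_move_anon_charge(void)
        {
                return mc.flags & MOVE_ANON;
        }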
      
      [akpm@linux-foundation.org: tweak MOVE_MASK definition, per Michal]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Reviewed-by: NVladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1dfab5ab
    • J
      mm: memcontrol: default hierarchy interface for memory · 241994ed
      Committed by Johannes Weiner
      Introduce the basic control files to account, partition, and limit
      memory using cgroups in default hierarchy mode.
      
      This interface versioning allows us to address fundamental design
      issues in the existing memory cgroup interface, further explained
      below.  The old interface will be maintained indefinitely, but a
      clearer model and improved workload performance should encourage
      existing users to switch over to the new one eventually.
      
      The control files are thus:
      
        - memory.current shows the current consumption of the cgroup and its
          descendants, in bytes.
      
        - memory.low configures the lower end of the cgroup's expected
          memory consumption range.  The kernel considers memory below that
          boundary to be a reserve - the minimum that the workload needs in
          order to make forward progress - and generally avoids reclaiming
          it, unless there is an imminent risk of entering an OOM situation.
      
        - memory.high configures the upper end of the cgroup's expected
          memory consumption range.  A cgroup whose consumption grows beyond
          this threshold is forced into direct reclaim, to work off the
          excess and to throttle new allocations heavily, but is generally
          allowed to continue and the OOM killer is not invoked.
      
        - memory.max configures the hard maximum amount of memory that the
          cgroup is allowed to consume before the OOM killer is invoked.
      
        - memory.events shows event counters that indicate how often the
          cgroup was reclaimed while below memory.low, how often it was
          forced to reclaim excess beyond memory.high, how often it hit
          memory.max, and how often it entered OOM due to memory.max.  This
          allows users to identify configuration problems when observing a
          degradation in workload performance.  An overcommitted system will
          have an increased rate of low boundary breaches, whereas increased
          rates of high limit breaches, maximum hits, or even OOM situations
          will indicate internally overcommitted cgroups.
      
      For existing users of memory cgroups, the following deviations from
      the current interface are worth pointing out and explaining:
      
        - The original lower boundary, the soft limit, is defined as a limit
          that is per default unset.  As a result, the set of cgroups that
          global reclaim prefers is opt-in, rather than opt-out.  The costs
          for optimizing these mostly negative lookups are so high that the
          implementation, despite its enormous size, does not even provide
          the basic desirable behavior.  First off, the soft limit has no
          hierarchical meaning.  All configured groups are organized in a
          global rbtree and treated like equal peers, regardless where they
          are located in the hierarchy.  This makes subtree delegation
          impossible.  Second, the soft limit reclaim pass is so aggressive
          that it not just introduces high allocation latencies into the
          system, but also impacts system performance due to overreclaim, to
          the point where the feature becomes self-defeating.
      
          The memory.low boundary on the other hand is a top-down allocated
          reserve.  A cgroup enjoys reclaim protection when it and all its
          ancestors are below their low boundaries, which makes delegation
          of subtrees possible.  Secondly, new cgroups have no reserve per
          default and in the common case most cgroups are eligible for the
          preferred reclaim pass.  This allows the new low boundary to be
          efficiently implemented with just a minor addition to the generic
          reclaim code, without the need for out-of-band data structures and
          reclaim passes.  Because the generic reclaim code considers all
          cgroups except for the ones running low in the preferred first
          reclaim pass, overreclaim of individual groups is eliminated as
          well, resulting in much better overall workload performance.
      
        - The original high boundary, the hard limit, is defined as a strict
          limit that can not budge, even if the OOM killer has to be called.
          But this generally goes against the goal of making the most out of
          the available memory.  The memory consumption of workloads varies
          during runtime, and that requires users to overcommit.  But doing
          that with a strict upper limit requires either a fairly accurate
          prediction of the working set size or adding slack to the limit.
          Since working set size estimation is hard and error prone, and
          getting it wrong results in OOM kills, most users tend to err on
          the side of a looser limit and end up wasting precious resources.
      
          The memory.high boundary on the other hand can be set much more
          conservatively.  When hit, it throttles allocations by forcing
          them into direct reclaim to work off the excess, but it never
          invokes the OOM killer.  As a result, a high boundary that is
          chosen too aggressively will not terminate the processes, but
          instead it will lead to gradual performance degradation.  The user
          can monitor this and make corrections until the minimal memory
          footprint that still gives acceptable performance is found.
      
          In extreme cases, with many concurrent allocations and a complete
          breakdown of reclaim progress within the group, the high boundary
          can be exceeded.  But even then it's mostly better to satisfy the
          allocation from the slack available in other groups or the rest of
          the system than killing the group.  Otherwise, memory.max is there
          to limit this type of spillover and ultimately contain buggy or
          even malicious applications.
      
        - The original control file names are unwieldy and inconsistent in
          many different ways.  For example, the upper boundary hit count is
          exported in the memory.failcnt file, but an OOM event count has to
          be manually counted by listening to memory.oom_control events, and
          lower boundary / soft limit events have to be counted by first
          setting a threshold for that value and then counting those events.
          Also, usage and limit files encode their units in the filename.
          That makes the filenames very long, even though this is not
          information that a user needs to be reminded of every time they
          type out those names.
      
          To address these naming issues, as well as to signal clearly that
          the new interface carries a new configuration model, the naming
          conventions in it necessarily differ from the old interface.
      
        - The original limit files indicate the state of an unset limit with
          a very high number, and a configured limit can be unset by echoing
          -1 into those files.  But that very high number is implementation-
          and architecture-dependent and not very descriptive.  And while -1
          can be understood as an underflow into the highest possible value,
          -2 or -10M etc. do not work, so it's not consistent.
      
          memory.low, memory.high, and memory.max will use the string
          "infinity" to indicate and set the highest possible value.
      
      [akpm@linux-foundation.org: use seq_puts() for basic strings]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      241994ed
    • J
      mm: page_counter: pull "-1" handling out of page_counter_memparse() · 650c5e56
      Committed by Johannes Weiner
      The unified hierarchy interface for memory cgroups will no longer use "-1"
      to mean maximum possible resource value.  In preparation for this, make
      the string an argument and let the caller supply it.
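
      A hedged sketch of the resulting prototype and two callers (the exact
      parameter names may differ):

        /* Hedged sketch -- the caller now supplies the "maximum" string. */
        int page_counter_memparse(const char *buf, const char *max,
                                  unsigned long *nr_pages);

        static int legacy_limit_parse(const char *args, unsigned long *nr_pages)
        {
                /* Old-interface callers keep the legacy "-1" spelling. */
                return page_counter_memparse(args, "-1", nr_pages);
        }

        static int unified_limit_parse(const char *buf, unsigned long *nr_pages)
        {
                /* The unified hierarchy can use "infinity" instead. */
                return page_counter_memparse(buf, "infinity", nr_pages);
        }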
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      650c5e56
    • J
      mm: use correct format specifiers when printing address ranges · 8d29e18a
      Committed by Juergen Gross
      Especially on 32-bit kernels, memory node ranges are printed with only
      32-bit-wide addresses.  Use u64 types and %llx specifiers to print the
      full width of the addresses.
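
      A hedged illustration of the kind of change; the message text and the
      helper wrapper are made up, the point is the u64 widening and the
      %llx-style specifier:

        /* Hedged illustration; not the verbatim mm/ code. */
        static void print_node_range(int nid, unsigned long start_pfn,
                                     unsigned long end_pfn)
        {
                /* Widen to u64 before shifting so PAE/LPAE addresses survive. */
                u64 start = (u64)start_pfn << PAGE_SHIFT;
                u64 end = (u64)end_pfn << PAGE_SHIFT;

                /*
                 * %#018Lx (a %llx variant) prints the full width even on
                 * 32-bit kernels, where %lx would truncate to 32 bits.
                 */
                pr_info("  node %3d: [mem %#018Lx-%#018Lx]\n",
                        nid, start, end - 1);
        }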
      Signed-off-by: NJuergen Gross <jgross@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d29e18a
    • G
      memcg: add BUILD_BUG_ON() for string tables · 0ca44b14
      Committed by Greg Thelen
      Use BUILD_BUG_ON() to compile assert that memcg string tables are in sync
      with corresponding enums.  There aren't currently any issues with these
      tables.  This is just defensive.
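
      A hedged sketch of the kind of compile-time assert added; the actual
      patch covers the memcg stat/event string tables, the LRU table here is
      just an illustration:

        /* Hedged sketch: assert a string table stays in sync with its enum. */
        static const char * const mem_cgroup_lru_names[] = {
                "inactive_anon", "active_anon",
                "inactive_file", "active_file",
                "unevictable",
        };

        static void memcg_check_string_tables(void)
        {
                BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
        }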
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ca44b14
    • V
      vmscan: force scan offline memory cgroups · 90cbc250
      Committed by Vladimir Davydov
      Since commit b2052564 ("mm: memcontrol: continue cache reclaim from
      offlined groups") pages charged to a memory cgroup are not reparented when
      the cgroup is removed.  Instead, they are supposed to be reclaimed in a
      regular way, along with pages accounted to online memory cgroups.
      
      However, an lruvec of an offline memory cgroup will sooner or later get so
      small that it will be scanned only at low scan priorities (see
      get_scan_count()).  Therefore, if there are enough reclaimable pages in
      big lruvecs, pages accounted to offline memory cgroups will never be
      scanned at all, wasting memory.
      
      Fix this by unconditionally forcing kswapd to scan dead lruvecs.
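
      A hedged sketch of the kind of check this adds to get_scan_count();
      the mem_cgroup_lruvec_online() helper name and the excerpt framing are
      assumptions:

        /* Hedged sketch, excerpt-style; not the verbatim get_scan_count(). */
        static void maybe_force_scan(struct lruvec *lruvec, bool *force_scan)
        {
                /*
                 * Kswapd must always scan lruvecs of offlined (dead) memcgs,
                 * however small, or their pages are never reclaimed.
                 */
                if (current_is_kswapd() && !mem_cgroup_lruvec_online(lruvec))
                        *force_scan = true;
        }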
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90cbc250
    • K
      mm: more checks on free_pages_prepare() for tail pages · 81422f29
      Committed by Kirill A. Shutemov
      Although it was not called, destroy_compound_page() did some potentially
      useful checks.  Let's re-introduce them in free_pages_prepare(), where
      they can be actually triggered when CONFIG_DEBUG_VM=y.
      
      A compound_order() assert is already in free_pages_prepare(); what is
      still missing there are the checks for tail pages.
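
      A hedged sketch of the kind of tail-page checks being re-introduced,
      assuming the pre-4.4 page->first_page linkage; the helper name is an
      assumption:

        /* Hedged sketch; the real checks live in free_pages_prepare(). */
        static int check_tail_page(struct page *head_page, struct page *page)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_VM))
                        return 0;
                if (unlikely(!PageTail(page))) {
                        bad_page(page, "PageTail not set", 0);
                        return 1;
                }
                if (unlikely(page->first_page != head_page)) {
                        bad_page(page, "first_page not consistent", 0);
                        return 1;
                }
                return 0;
        }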
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81422f29
    • K
      mm/page_alloc.c: drop dead destroy_compound_page() · 6e9f0d58
      Committed by Kirill A. Shutemov
      The only caller is __free_one_page().  By that time, page->flags
      should already have been cleared:
      
       - for 0-order pages through the PCP list:
      	free_hot_cold_page()
      		free_pages_prepare()
      			free_pages_check()
      				page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
      		<put the page to PCP list>
      
      	free_pcppages_bulk()
      		page = <withdraw pages from PCP list>
      		__free_one_page(page)
      
       - for non-0-order pages:
      	__free_pages_ok()
      		free_pages_prepare()
      			free_pages_check()
      				page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
      		free_one_page()
      			__free_one_page()
      
      So there's no way PageCompound() will return true in __free_one_page().
      Let's remove the dead destroy_compound_page() and put an assert for
      page->flags there instead.
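
      A hedged sketch of the replacement assert in __free_one_page(); the
      mask matches the PAGE_FLAGS_CHECK_AT_PREP clearing shown in the call
      chains above:

        /* Hedged sketch: the prep-time flags must already be clear here. */
        VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);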
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e9f0d58
    • V
      mm: microoptimize zonelist operations · 05891fb0
      Committed by Vlastimil Babka
      next_zones_zonelist() returns a zoneref pointer, as well as a zone
      pointer via an extra parameter.  Since the latter can be trivially
      obtained by dereferencing the former, the overhead of the extra
      parameter is unjustified.
      
      This patch thus removes the zone parameter from next_zones_zonelist().
      Both callers happen to be in the same header file, so it's simple to add
      the zoneref dereference inline.  We save some bytes of code size.
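
      A hedged, simplified sketch of the two header-file callers after the
      change (the real helpers in include/linux/mmzone.h carry more
      comments):

        /* next_zones_zonelist() no longer takes a struct zone ** argument. */
        static inline struct zoneref *
        first_zones_zonelist(struct zonelist *zonelist,
                             enum zone_type highest_zoneidx, nodemask_t *nodes)
        {
                return next_zones_zonelist(zonelist->_zonerefs,
                                           highest_zoneidx, nodes);
        }

        /* The iteration macro dereferences the returned zoneref itself: */
        #define for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
                for (z = first_zones_zonelist(zlist, highidx, nodemask),        \
                                zone = zonelist_zone(z);                        \
                     zone;                                                      \
                     z = next_zones_zonelist(++z, highidx, nodemask),           \
                                zone = zonelist_zone(z))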
      
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_nodemask                      2300    2285     -15
      get_page_from_freelist                      2652    2576     -76
      
      add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
      function                                     old     new   delta
      try_to_compact_pages                         569     579     +10
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05891fb0
    • V
      mm: reduce try_to_compact_pages parameters · 1a6d53a1
      Committed by Vlastimil Babka
      Expand the usage of the struct alloc_context introduced in the
      previous patch also to the try_to_compact_pages() call, to reduce the
      number of its parameters.  Since the function is in a different
      compilation unit, we need to move the alloc_context definition into
      the shared mm/internal.h header.
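
      A hedged sketch of the reduced prototype (types and parameter names
      are assumptions based on this series):

        /*
         * Hedged sketch -- compare with the pre-patch parameter list, which
         * passed zonelist, nodemask and classzone_idx individually.
         */
        unsigned long try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
                                           int alloc_flags,
                                           const struct alloc_context *ac,
                                           enum migrate_mode mode,
                                           int *contended);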
      
      With this change we get simpler code and small savings of code size and stack
      usage:
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-27 (-27)
      function                                     old     new   delta
      __alloc_pages_direct_compact                 283     256     -27
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-13 (-13)
      function                                     old     new   delta
      try_to_compact_pages                         582     569     -13
      
      Stack usage of __alloc_pages_direct_compact goes from 24 to none (per
      scripts/checkstack.pl).
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a6d53a1
    • V
      mm, page_alloc: reduce number of alloc_pages* functions' parameters · a9263751
      Committed by Vlastimil Babka
      Introduce struct alloc_context to accumulate the numerous parameters
      passed between the alloc_pages* family of functions and
      get_page_from_freelist().  This excludes gfp_flags and alloc_flags,
      which mutate too much along the way, and the allocation order, which
      is conceptually different.
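
      A hedged sketch of the accumulated context as introduced by this
      series (member names are close to, but not guaranteed identical with,
      mm/internal.h):

        /* Hedged sketch of struct alloc_context. */
        struct alloc_context {
                struct zonelist *zonelist;
                nodemask_t *nodemask;
                struct zone *preferred_zone;
                int classzone_idx;
                int migratetype;
                enum zone_type high_zoneidx;
        };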
      
      The result is shorter function signatures, as well as overall code
      size and stack usage reductions.
      
      bloat-o-meter:
      
      add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
      function                                     old     new   delta
      get_page_from_freelist                      2525    2652    +127
      __alloc_pages_direct_compact                 329     283     -46
      __alloc_pages_nodemask                      2564    2300    -264
      
      checkstack.pl:
      
      function                            old    new
      __alloc_pages_nodemask              248    200
      get_page_from_freelist              168    184
      __alloc_pages_direct_compact         40     24
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9263751