1. 20 5月, 2016 13 次提交
    • M
      mm, page_alloc: convert alloc_flags to unsigned · c603844b
      Mel Gorman 提交于
      alloc_flags is a bitmask of flags but it is signed which does not
      necessarily generate the best code depending on the compiler.  Even
      without an impact, it makes more sense that this be unsigned.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c603844b
    • M
      mm, page_alloc: avoid unnecessary zone lookups during pageblock operations · f75fb889
      Mel Gorman 提交于
      Pageblocks have an associated bitmap to store migrate types and whether
      the pageblock should be skipped during compaction.  The bitmap may be
      associated with a memory section or a zone but the zone is looked up
      unconditionally.  The compiler should optimise this away automatically
      so this is a cosmetic patch only in many cases.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f75fb889
    • M
      mm, page_alloc: use __dec_zone_state for order-0 page allocation · 754078eb
      Mel Gorman 提交于
      __dec_zone_state is cheaper to use for removing an order-0 page as it
      has fewer conditions to check.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                               optiter-v1r20              decstat-v1r20
        Min      alloc-odr0-1               382.00 (  0.00%)           381.00 (  0.26%)
        Min      alloc-odr0-2               282.00 (  0.00%)           275.00 (  2.48%)
        Min      alloc-odr0-4               233.00 (  0.00%)           229.00 (  1.72%)
        Min      alloc-odr0-8               203.00 (  0.00%)           199.00 (  1.97%)
        Min      alloc-odr0-16              188.00 (  0.00%)           186.00 (  1.06%)
        Min      alloc-odr0-32              182.00 (  0.00%)           179.00 (  1.65%)
        Min      alloc-odr0-64              177.00 (  0.00%)           174.00 (  1.69%)
        Min      alloc-odr0-128             175.00 (  0.00%)           172.00 (  1.71%)
        Min      alloc-odr0-256             184.00 (  0.00%)           181.00 (  1.63%)
        Min      alloc-odr0-512             197.00 (  0.00%)           193.00 (  2.03%)
        Min      alloc-odr0-1024            203.00 (  0.00%)           201.00 (  0.99%)
        Min      alloc-odr0-2048            209.00 (  0.00%)           206.00 (  1.44%)
        Min      alloc-odr0-4096            214.00 (  0.00%)           212.00 (  0.93%)
        Min      alloc-odr0-8192            218.00 (  0.00%)           215.00 (  1.38%)
        Min      alloc-odr0-16384           219.00 (  0.00%)           216.00 (  1.37%)
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      754078eb
    • M
      mm, page_alloc: inline the fast path of the zonelist iterator · 682a3385
      Mel Gorman 提交于
      The page allocator iterates through a zonelist for zones that match the
      addressing limitations and nodemask of the caller but many allocations
      will not be restricted.  Despite this, there is always functional call
      overhead which builds up.
      
      This patch inlines the optimistic basic case and only calls the iterator
      function for the complex case.  A hindrance was the fact that
      cpuset_current_mems_allowed is used in the fastpath as the allowed
      nodemask even though all nodes are allowed on most systems.  The patch
      handles this by only considering cpuset_current_mems_allowed if a cpuset
      exists.  As well as being faster in the fast-path, this removes some
      junk in the slowpath.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statinline-v1r20              optiter-v1r20
        Min      alloc-odr0-1               412.00 (  0.00%)           382.00 (  7.28%)
        Min      alloc-odr0-2               301.00 (  0.00%)           282.00 (  6.31%)
        Min      alloc-odr0-4               247.00 (  0.00%)           233.00 (  5.67%)
        Min      alloc-odr0-8               215.00 (  0.00%)           203.00 (  5.58%)
        Min      alloc-odr0-16              199.00 (  0.00%)           188.00 (  5.53%)
        Min      alloc-odr0-32              191.00 (  0.00%)           182.00 (  4.71%)
        Min      alloc-odr0-64              187.00 (  0.00%)           177.00 (  5.35%)
        Min      alloc-odr0-128             185.00 (  0.00%)           175.00 (  5.41%)
        Min      alloc-odr0-256             193.00 (  0.00%)           184.00 (  4.66%)
        Min      alloc-odr0-512             207.00 (  0.00%)           197.00 (  4.83%)
        Min      alloc-odr0-1024            213.00 (  0.00%)           203.00 (  4.69%)
        Min      alloc-odr0-2048            220.00 (  0.00%)           209.00 (  5.00%)
        Min      alloc-odr0-4096            226.00 (  0.00%)           214.00 (  5.31%)
        Min      alloc-odr0-8192            229.00 (  0.00%)           218.00 (  4.80%)
        Min      alloc-odr0-16384           229.00 (  0.00%)           219.00 (  4.37%)
      
      perf indicated that next_zones_zonelist disappeared in the profile and
      __next_zones_zonelist did not appear.  This is expected as the
      micro-benchmark would hit the inlined fast-path every time.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      682a3385
    • M
      mm, page_alloc: inline zone_statistics · 060e7417
      Mel Gorman 提交于
      zone_statistics has one call-site but it's a public function.  Make it
      static and inline.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statbranch-v1r20           statinline-v1r20
        Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
        Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
        Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
        Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
        Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
        Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
        Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
        Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
        Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
        Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
        Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
        Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
        Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
        Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
        Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      060e7417
    • M
      mm, page_alloc: use new PageAnonHead helper in the free page fast path · 17514574
      Mel Gorman 提交于
      The PageAnon check always checks for compound_head but this is a
      relatively expensive check if the caller already knows the page is a
      head page.  This patch creates a helper and uses it in the page free
      path which only operates on head pages.
      
      With this patch and "Only check PageCompound for high-order pages", the
      performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                                     vanilla           nocompound-v1r20
        Min      alloc-odr0-1               425.00 (  0.00%)           417.00 (  1.88%)
        Min      alloc-odr0-2               313.00 (  0.00%)           308.00 (  1.60%)
        Min      alloc-odr0-4               257.00 (  0.00%)           253.00 (  1.56%)
        Min      alloc-odr0-8               224.00 (  0.00%)           221.00 (  1.34%)
        Min      alloc-odr0-16              208.00 (  0.00%)           205.00 (  1.44%)
        Min      alloc-odr0-32              199.00 (  0.00%)           199.00 (  0.00%)
        Min      alloc-odr0-64              195.00 (  0.00%)           193.00 (  1.03%)
        Min      alloc-odr0-128             192.00 (  0.00%)           191.00 (  0.52%)
        Min      alloc-odr0-256             204.00 (  0.00%)           200.00 (  1.96%)
        Min      alloc-odr0-512             213.00 (  0.00%)           212.00 (  0.47%)
        Min      alloc-odr0-1024            219.00 (  0.00%)           219.00 (  0.00%)
        Min      alloc-odr0-2048            225.00 (  0.00%)           225.00 (  0.00%)
        Min      alloc-odr0-4096            230.00 (  0.00%)           231.00 ( -0.43%)
        Min      alloc-odr0-8192            235.00 (  0.00%)           234.00 (  0.43%)
        Min      alloc-odr0-16384           235.00 (  0.00%)           234.00 (  0.43%)
        Min      free-odr0-1                215.00 (  0.00%)           191.00 ( 11.16%)
        Min      free-odr0-2                152.00 (  0.00%)           136.00 ( 10.53%)
        Min      free-odr0-4                119.00 (  0.00%)           107.00 ( 10.08%)
        Min      free-odr0-8                106.00 (  0.00%)            96.00 (  9.43%)
        Min      free-odr0-16                97.00 (  0.00%)            87.00 ( 10.31%)
        Min      free-odr0-32                91.00 (  0.00%)            83.00 (  8.79%)
        Min      free-odr0-64                89.00 (  0.00%)            81.00 (  8.99%)
        Min      free-odr0-128               88.00 (  0.00%)            80.00 (  9.09%)
        Min      free-odr0-256              106.00 (  0.00%)            95.00 ( 10.38%)
        Min      free-odr0-512              116.00 (  0.00%)           111.00 (  4.31%)
        Min      free-odr0-1024             125.00 (  0.00%)           118.00 (  5.60%)
        Min      free-odr0-2048             133.00 (  0.00%)           126.00 (  5.26%)
        Min      free-odr0-4096             136.00 (  0.00%)           130.00 (  4.41%)
        Min      free-odr0-8192             138.00 (  0.00%)           130.00 (  5.80%)
        Min      free-odr0-16384            137.00 (  0.00%)           130.00 (  5.11%)
      
      There is a sizable boost to the free allocator performance.  While there
      is an apparent boost on the allocation side, it's likely a co-incidence
      or due to the patches slightly reducing cache footprint.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17514574
    • M
      mm, page_alloc: only check PageCompound for high-order pages · d61f8590
      Mel Gorman 提交于
      Another year, another round of page allocator optimisations focusing
      this time on the alloc and free fast paths.  This should be of help to
      workloads that are allocator-intensive from kernel space where the cost
      of zeroing is not nceessraily incurred.
      
      The series is motivated by the observation that page alloc
      microbenchmarks on multiple machines regressed between 3.12.44 and 4.4.
      Second, there is discussions before LSF/MM considering the possibility
      of adding another page allocator which is potentially hazardous but a
      patch series improving performance is better than whining.
      
      After the series is applied, there are still hazards.  In the free
      paths, the debugging checking and page zone/pageblock lookups dominate
      but there was not an obvious solution to that.  In the alloc path, the
      major contributers are dealing with zonelists, new page preperation, the
      fair zone allocation and numerous statistic updates.  The fair zone
      allocator is removed by the per-node LRU series if that gets merged so
      it's nor a major concern at the moment.
      
      On normal userspace benchmarks, there is little impact as the zeroing
      cost is significant but it's visible
      
        aim9
                                       4.6.0-rc3             4.6.0-rc3
                                         vanilla         deferalloc-v3
        Min      page_test   828693.33 (  0.00%)   887060.00 (  7.04%)
        Min      brk_test   4847266.67 (  0.00%)  4966266.67 (  2.45%)
        Min      exec_test     1271.00 (  0.00%)     1275.67 (  0.37%)
        Min      fork_test    12371.75 (  0.00%)    12380.00 (  0.07%)
      
      The overall impact on a page allocator microbenchmark for a range of orders
      and number of pages allocated in a batch is
      
                                                  4.6.0-rc3                  4.6.0-rc3
                                                     vanilla            deferalloc-v3r7
        Min      alloc-odr0-1               428.00 (  0.00%)           316.00 ( 26.17%)
        Min      alloc-odr0-2               314.00 (  0.00%)           231.00 ( 26.43%)
        Min      alloc-odr0-4               256.00 (  0.00%)           192.00 ( 25.00%)
        Min      alloc-odr0-8               222.00 (  0.00%)           166.00 ( 25.23%)
        Min      alloc-odr0-16              207.00 (  0.00%)           154.00 ( 25.60%)
        Min      alloc-odr0-32              197.00 (  0.00%)           148.00 ( 24.87%)
        Min      alloc-odr0-64              193.00 (  0.00%)           144.00 ( 25.39%)
        Min      alloc-odr0-128             191.00 (  0.00%)           143.00 ( 25.13%)
        Min      alloc-odr0-256             203.00 (  0.00%)           153.00 ( 24.63%)
        Min      alloc-odr0-512             212.00 (  0.00%)           165.00 ( 22.17%)
        Min      alloc-odr0-1024            221.00 (  0.00%)           172.00 ( 22.17%)
        Min      alloc-odr0-2048            225.00 (  0.00%)           179.00 ( 20.44%)
        Min      alloc-odr0-4096            232.00 (  0.00%)           185.00 ( 20.26%)
        Min      alloc-odr0-8192            235.00 (  0.00%)           187.00 ( 20.43%)
        Min      alloc-odr0-16384           236.00 (  0.00%)           188.00 ( 20.34%)
        Min      alloc-odr1-1               519.00 (  0.00%)           450.00 ( 13.29%)
        Min      alloc-odr1-2               391.00 (  0.00%)           336.00 ( 14.07%)
        Min      alloc-odr1-4               313.00 (  0.00%)           268.00 ( 14.38%)
        Min      alloc-odr1-8               277.00 (  0.00%)           235.00 ( 15.16%)
        Min      alloc-odr1-16              256.00 (  0.00%)           218.00 ( 14.84%)
        Min      alloc-odr1-32              252.00 (  0.00%)           212.00 ( 15.87%)
        Min      alloc-odr1-64              244.00 (  0.00%)           206.00 ( 15.57%)
        Min      alloc-odr1-128             244.00 (  0.00%)           207.00 ( 15.16%)
        Min      alloc-odr1-256             243.00 (  0.00%)           207.00 ( 14.81%)
        Min      alloc-odr1-512             245.00 (  0.00%)           209.00 ( 14.69%)
        Min      alloc-odr1-1024            248.00 (  0.00%)           214.00 ( 13.71%)
        Min      alloc-odr1-2048            253.00 (  0.00%)           220.00 ( 13.04%)
        Min      alloc-odr1-4096            258.00 (  0.00%)           224.00 ( 13.18%)
        Min      alloc-odr1-8192            261.00 (  0.00%)           229.00 ( 12.26%)
        Min      alloc-odr2-1               560.00 (  0.00%)           753.00 (-34.46%)
        Min      alloc-odr2-2               424.00 (  0.00%)           351.00 ( 17.22%)
        Min      alloc-odr2-4               339.00 (  0.00%)           393.00 (-15.93%)
        Min      alloc-odr2-8               298.00 (  0.00%)           246.00 ( 17.45%)
        Min      alloc-odr2-16              276.00 (  0.00%)           227.00 ( 17.75%)
        Min      alloc-odr2-32              271.00 (  0.00%)           221.00 ( 18.45%)
        Min      alloc-odr2-64              264.00 (  0.00%)           217.00 ( 17.80%)
        Min      alloc-odr2-128             264.00 (  0.00%)           217.00 ( 17.80%)
        Min      alloc-odr2-256             264.00 (  0.00%)           218.00 ( 17.42%)
        Min      alloc-odr2-512             269.00 (  0.00%)           223.00 ( 17.10%)
        Min      alloc-odr2-1024            279.00 (  0.00%)           230.00 ( 17.56%)
        Min      alloc-odr2-2048            283.00 (  0.00%)           235.00 ( 16.96%)
        Min      alloc-odr2-4096            285.00 (  0.00%)           239.00 ( 16.14%)
        Min      alloc-odr3-1               629.00 (  0.00%)           505.00 ( 19.71%)
        Min      alloc-odr3-2               472.00 (  0.00%)           374.00 ( 20.76%)
        Min      alloc-odr3-4               383.00 (  0.00%)           301.00 ( 21.41%)
        Min      alloc-odr3-8               341.00 (  0.00%)           266.00 ( 21.99%)
        Min      alloc-odr3-16              316.00 (  0.00%)           248.00 ( 21.52%)
        Min      alloc-odr3-32              308.00 (  0.00%)           241.00 ( 21.75%)
        Min      alloc-odr3-64              305.00 (  0.00%)           241.00 ( 20.98%)
        Min      alloc-odr3-128             308.00 (  0.00%)           244.00 ( 20.78%)
        Min      alloc-odr3-256             317.00 (  0.00%)           249.00 ( 21.45%)
        Min      alloc-odr3-512             327.00 (  0.00%)           256.00 ( 21.71%)
        Min      alloc-odr3-1024            331.00 (  0.00%)           261.00 ( 21.15%)
        Min      alloc-odr3-2048            333.00 (  0.00%)           266.00 ( 20.12%)
        Min      alloc-odr4-1               767.00 (  0.00%)           572.00 ( 25.42%)
        Min      alloc-odr4-2               578.00 (  0.00%)           429.00 ( 25.78%)
        Min      alloc-odr4-4               474.00 (  0.00%)           346.00 ( 27.00%)
        Min      alloc-odr4-8               422.00 (  0.00%)           310.00 ( 26.54%)
        Min      alloc-odr4-16              399.00 (  0.00%)           295.00 ( 26.07%)
        Min      alloc-odr4-32              392.00 (  0.00%)           293.00 ( 25.26%)
        Min      alloc-odr4-64              394.00 (  0.00%)           293.00 ( 25.63%)
        Min      alloc-odr4-128             405.00 (  0.00%)           305.00 ( 24.69%)
        Min      alloc-odr4-256             417.00 (  0.00%)           319.00 ( 23.50%)
        Min      alloc-odr4-512             425.00 (  0.00%)           326.00 ( 23.29%)
        Min      alloc-odr4-1024            426.00 (  0.00%)           329.00 ( 22.77%)
        Min      free-odr0-1                216.00 (  0.00%)           178.00 ( 17.59%)
        Min      free-odr0-2                152.00 (  0.00%)           125.00 ( 17.76%)
        Min      free-odr0-4                120.00 (  0.00%)            99.00 ( 17.50%)
        Min      free-odr0-8                106.00 (  0.00%)            85.00 ( 19.81%)
        Min      free-odr0-16                97.00 (  0.00%)            80.00 ( 17.53%)
        Min      free-odr0-32                92.00 (  0.00%)            76.00 ( 17.39%)
        Min      free-odr0-64                89.00 (  0.00%)            74.00 ( 16.85%)
        Min      free-odr0-128               89.00 (  0.00%)            73.00 ( 17.98%)
        Min      free-odr0-256              107.00 (  0.00%)            90.00 ( 15.89%)
        Min      free-odr0-512              117.00 (  0.00%)           108.00 (  7.69%)
        Min      free-odr0-1024             125.00 (  0.00%)           118.00 (  5.60%)
        Min      free-odr0-2048             132.00 (  0.00%)           125.00 (  5.30%)
        Min      free-odr0-4096             135.00 (  0.00%)           130.00 (  3.70%)
        Min      free-odr0-8192             137.00 (  0.00%)           130.00 (  5.11%)
        Min      free-odr0-16384            137.00 (  0.00%)           131.00 (  4.38%)
        Min      free-odr1-1                318.00 (  0.00%)           289.00 (  9.12%)
        Min      free-odr1-2                228.00 (  0.00%)           207.00 (  9.21%)
        Min      free-odr1-4                182.00 (  0.00%)           165.00 (  9.34%)
        Min      free-odr1-8                163.00 (  0.00%)           146.00 ( 10.43%)
        Min      free-odr1-16               151.00 (  0.00%)           135.00 ( 10.60%)
        Min      free-odr1-32               146.00 (  0.00%)           129.00 ( 11.64%)
        Min      free-odr1-64               145.00 (  0.00%)           130.00 ( 10.34%)
        Min      free-odr1-128              148.00 (  0.00%)           134.00 (  9.46%)
        Min      free-odr1-256              148.00 (  0.00%)           137.00 (  7.43%)
        Min      free-odr1-512              151.00 (  0.00%)           140.00 (  7.28%)
        Min      free-odr1-1024             154.00 (  0.00%)           143.00 (  7.14%)
        Min      free-odr1-2048             156.00 (  0.00%)           144.00 (  7.69%)
        Min      free-odr1-4096             156.00 (  0.00%)           142.00 (  8.97%)
        Min      free-odr1-8192             156.00 (  0.00%)           140.00 ( 10.26%)
        Min      free-odr2-1                361.00 (  0.00%)           457.00 (-26.59%)
        Min      free-odr2-2                258.00 (  0.00%)           224.00 ( 13.18%)
        Min      free-odr2-4                208.00 (  0.00%)           223.00 ( -7.21%)
        Min      free-odr2-8                185.00 (  0.00%)           160.00 ( 13.51%)
        Min      free-odr2-16               173.00 (  0.00%)           149.00 ( 13.87%)
        Min      free-odr2-32               166.00 (  0.00%)           145.00 ( 12.65%)
        Min      free-odr2-64               166.00 (  0.00%)           146.00 ( 12.05%)
        Min      free-odr2-128              169.00 (  0.00%)           148.00 ( 12.43%)
        Min      free-odr2-256              170.00 (  0.00%)           152.00 ( 10.59%)
        Min      free-odr2-512              177.00 (  0.00%)           156.00 ( 11.86%)
        Min      free-odr2-1024             182.00 (  0.00%)           162.00 ( 10.99%)
        Min      free-odr2-2048             181.00 (  0.00%)           160.00 ( 11.60%)
        Min      free-odr2-4096             180.00 (  0.00%)           159.00 ( 11.67%)
        Min      free-odr3-1                431.00 (  0.00%)           367.00 ( 14.85%)
        Min      free-odr3-2                306.00 (  0.00%)           259.00 ( 15.36%)
        Min      free-odr3-4                249.00 (  0.00%)           208.00 ( 16.47%)
        Min      free-odr3-8                224.00 (  0.00%)           186.00 ( 16.96%)
        Min      free-odr3-16               208.00 (  0.00%)           176.00 ( 15.38%)
        Min      free-odr3-32               206.00 (  0.00%)           174.00 ( 15.53%)
        Min      free-odr3-64               210.00 (  0.00%)           178.00 ( 15.24%)
        Min      free-odr3-128              215.00 (  0.00%)           182.00 ( 15.35%)
        Min      free-odr3-256              224.00 (  0.00%)           189.00 ( 15.62%)
        Min      free-odr3-512              232.00 (  0.00%)           195.00 ( 15.95%)
        Min      free-odr3-1024             230.00 (  0.00%)           195.00 ( 15.22%)
        Min      free-odr3-2048             229.00 (  0.00%)           193.00 ( 15.72%)
        Min      free-odr4-1                561.00 (  0.00%)           439.00 ( 21.75%)
        Min      free-odr4-2                418.00 (  0.00%)           318.00 ( 23.92%)
        Min      free-odr4-4                339.00 (  0.00%)           269.00 ( 20.65%)
        Min      free-odr4-8                299.00 (  0.00%)           239.00 ( 20.07%)
        Min      free-odr4-16               289.00 (  0.00%)           234.00 ( 19.03%)
        Min      free-odr4-32               291.00 (  0.00%)           235.00 ( 19.24%)
        Min      free-odr4-64               298.00 (  0.00%)           238.00 ( 20.13%)
        Min      free-odr4-128              308.00 (  0.00%)           251.00 ( 18.51%)
        Min      free-odr4-256              321.00 (  0.00%)           267.00 ( 16.82%)
        Min      free-odr4-512              327.00 (  0.00%)           269.00 ( 17.74%)
        Min      free-odr4-1024             326.00 (  0.00%)           271.00 ( 16.87%)
        Min      total-odr0-1               644.00 (  0.00%)           494.00 ( 23.29%)
        Min      total-odr0-2               466.00 (  0.00%)           356.00 ( 23.61%)
        Min      total-odr0-4               376.00 (  0.00%)           291.00 ( 22.61%)
        Min      total-odr0-8               328.00 (  0.00%)           251.00 ( 23.48%)
        Min      total-odr0-16              304.00 (  0.00%)           234.00 ( 23.03%)
        Min      total-odr0-32              289.00 (  0.00%)           224.00 ( 22.49%)
        Min      total-odr0-64              282.00 (  0.00%)           218.00 ( 22.70%)
        Min      total-odr0-128             280.00 (  0.00%)           216.00 ( 22.86%)
        Min      total-odr0-256             310.00 (  0.00%)           243.00 ( 21.61%)
        Min      total-odr0-512             329.00 (  0.00%)           273.00 ( 17.02%)
        Min      total-odr0-1024            346.00 (  0.00%)           290.00 ( 16.18%)
        Min      total-odr0-2048            357.00 (  0.00%)           304.00 ( 14.85%)
        Min      total-odr0-4096            367.00 (  0.00%)           315.00 ( 14.17%)
        Min      total-odr0-8192            372.00 (  0.00%)           317.00 ( 14.78%)
        Min      total-odr0-16384           373.00 (  0.00%)           319.00 ( 14.48%)
        Min      total-odr1-1               838.00 (  0.00%)           739.00 ( 11.81%)
        Min      total-odr1-2               619.00 (  0.00%)           543.00 ( 12.28%)
        Min      total-odr1-4               495.00 (  0.00%)           433.00 ( 12.53%)
        Min      total-odr1-8               440.00 (  0.00%)           382.00 ( 13.18%)
        Min      total-odr1-16              407.00 (  0.00%)           353.00 ( 13.27%)
        Min      total-odr1-32              398.00 (  0.00%)           341.00 ( 14.32%)
        Min      total-odr1-64              389.00 (  0.00%)           336.00 ( 13.62%)
        Min      total-odr1-128             392.00 (  0.00%)           341.00 ( 13.01%)
        Min      total-odr1-256             391.00 (  0.00%)           344.00 ( 12.02%)
        Min      total-odr1-512             396.00 (  0.00%)           349.00 ( 11.87%)
        Min      total-odr1-1024            402.00 (  0.00%)           357.00 ( 11.19%)
        Min      total-odr1-2048            409.00 (  0.00%)           364.00 ( 11.00%)
        Min      total-odr1-4096            414.00 (  0.00%)           366.00 ( 11.59%)
        Min      total-odr1-8192            417.00 (  0.00%)           369.00 ( 11.51%)
        Min      total-odr2-1               921.00 (  0.00%)          1210.00 (-31.38%)
        Min      total-odr2-2               682.00 (  0.00%)           576.00 ( 15.54%)
        Min      total-odr2-4               547.00 (  0.00%)           616.00 (-12.61%)
        Min      total-odr2-8               483.00 (  0.00%)           406.00 ( 15.94%)
        Min      total-odr2-16              449.00 (  0.00%)           376.00 ( 16.26%)
        Min      total-odr2-32              437.00 (  0.00%)           366.00 ( 16.25%)
        Min      total-odr2-64              431.00 (  0.00%)           363.00 ( 15.78%)
        Min      total-odr2-128             433.00 (  0.00%)           365.00 ( 15.70%)
        Min      total-odr2-256             434.00 (  0.00%)           371.00 ( 14.52%)
        Min      total-odr2-512             446.00 (  0.00%)           379.00 ( 15.02%)
        Min      total-odr2-1024            461.00 (  0.00%)           392.00 ( 14.97%)
        Min      total-odr2-2048            464.00 (  0.00%)           395.00 ( 14.87%)
        Min      total-odr2-4096            465.00 (  0.00%)           398.00 ( 14.41%)
        Min      total-odr3-1              1060.00 (  0.00%)           872.00 ( 17.74%)
        Min      total-odr3-2               778.00 (  0.00%)           633.00 ( 18.64%)
        Min      total-odr3-4               632.00 (  0.00%)           510.00 ( 19.30%)
        Min      total-odr3-8               565.00 (  0.00%)           452.00 ( 20.00%)
        Min      total-odr3-16              524.00 (  0.00%)           424.00 ( 19.08%)
        Min      total-odr3-32              514.00 (  0.00%)           415.00 ( 19.26%)
        Min      total-odr3-64              515.00 (  0.00%)           419.00 ( 18.64%)
        Min      total-odr3-128             523.00 (  0.00%)           426.00 ( 18.55%)
        Min      total-odr3-256             541.00 (  0.00%)           438.00 ( 19.04%)
        Min      total-odr3-512             559.00 (  0.00%)           451.00 ( 19.32%)
        Min      total-odr3-1024            561.00 (  0.00%)           456.00 ( 18.72%)
        Min      total-odr3-2048            562.00 (  0.00%)           459.00 ( 18.33%)
        Min      total-odr4-1              1328.00 (  0.00%)          1011.00 ( 23.87%)
        Min      total-odr4-2               997.00 (  0.00%)           747.00 ( 25.08%)
        Min      total-odr4-4               813.00 (  0.00%)           615.00 ( 24.35%)
        Min      total-odr4-8               721.00 (  0.00%)           550.00 ( 23.72%)
        Min      total-odr4-16              689.00 (  0.00%)           529.00 ( 23.22%)
        Min      total-odr4-32              683.00 (  0.00%)           528.00 ( 22.69%)
        Min      total-odr4-64              692.00 (  0.00%)           531.00 ( 23.27%)
        Min      total-odr4-128             713.00 (  0.00%)           556.00 ( 22.02%)
        Min      total-odr4-256             738.00 (  0.00%)           586.00 ( 20.60%)
        Min      total-odr4-512             753.00 (  0.00%)           595.00 ( 20.98%)
        Min      total-odr4-1024            752.00 (  0.00%)           600.00 ( 20.21%)
      
      This patch (of 27):
      
      order-0 pages by definition cannot be compound so avoid the check in the
      fast path for those pages.
      
      [akpm@linux-foundation.org: use unlikely(order) in free_pages_prepare(), per Vlastimil]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d61f8590
    • M
      mm, oom: move GFP_NOFS check to out_of_memory · 3da88fb3
      Michal Hocko 提交于
      __alloc_pages_may_oom is the central place to decide when the
      out_of_memory should be invoked.  This is a good approach for most
      checks there because they are page allocator specific and the allocation
      fails right after for all of them.
      
      The notable exception is GFP_NOFS context which is faking
      did_some_progress and keep the page allocator looping even though there
      couldn't have been any progress from the OOM killer.  This patch doesn't
      change this behavior because we are not ready to allow those allocation
      requests to fail yet (and maybe we will face the reality that we will
      never manage to safely fail these request).  Instead __GFP_FS check is
      moved down to out_of_memory and prevent from OOM victim selection there.
      There are two reasons for that
      
      	- OOM notifiers might release some memory even from this context
      	  as none of the registered notifier seems to be FS related
      	- this might help a dying thread to get an access to memory
                reserves and move on which will make the behavior more
                consistent with the case when the task gets killed from a
                different context.
      
      Keep a comment in __alloc_pages_may_oom to make sure we do not forget
      how GFP_NOFS is special and that we really want to do something about
      it.
      
      Note to the current oom_notifier users:
      
      The observable difference for you is that oom notifiers cannot depend on
      any fs locks because we could deadlock.  Not that this would be allowed
      today because that would just lockup machine in most of the cases and
      ruling out the OOM killer along the way.  Another difference is that
      callbacks might be invoked sooner now because GFP_NOFS is a weaker
      reclaim context and so there could be reclaimable memory which is just
      not reachable now.  That would require GFP_NOFS only loads which are
      really rare and more importantly the observable result would be dropping
      of reconstructible object and potential performance drop which is not
      such a big deal when we are struggling to fulfill other important
      allocation requests.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3da88fb3
    • J
      mm/page_alloc: correct highmem memory statistics · fc2bd799
      Joonsoo Kim 提交于
      ZONE_MOVABLE could be treated as highmem so we need to consider it for
      accurate statistics.  And, in following patches, ZONE_CMA will be
      introduced and it can be treated as highmem, too.  So, instead of
      manually adding stat of ZONE_MOVABLE, looping all zones and check
      whether the zone is highmem or not and add stat of the zone which can be
      treated as highmem.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc2bd799
    • J
      power: add zone range overlapping check · ba6b0979
      Joonsoo Kim 提交于
      There is a system thats node's pfns are overlapped as follows:
      
        -----pfn-------->
        N0 N1 N2 N0 N1 N2
      
      Therefore, we need to care this overlapping when iterating pfn range.
      
      mark_free_pages() iterates requested zone's pfn range and unset all
      range's bitmap first.  And then it marks freepages in a zone to the
      bitmap.  If there is an overlapping zone, above unset could clear
      previous marked bit and reference to this bitmap in the future will
      cause the problem.  To prevent it, this patch adds a zone check in
      mark_free_pages().
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba6b0979
    • J
      mm/memory_hotplug: add comment to some functions related to memory hotplug · b9eb6319
      Joonsoo Kim 提交于
      __offline_isolated_pages() and test_pages_isolated() are used by memory
      hotplug.  These functions require that range is in a single zone but
      there is no code to do this because memory hotplug checks it before
      calling these functions.  To avoid confusing future user of these
      functions, this patch adds comments to them.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b9eb6319
    • L
      mm/page_alloc: Remove useless parameter of __free_pages_boot_core · 949698a3
      Li Zhang 提交于
      __free_pages_boot_core has parameter pfn which is not used at all.
      Remove it.
      Signed-off-by: NLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Reviewed-by: NPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      949698a3
    • J
      mm: rename _count, field of the struct page, to _refcount · 0139aa7b
      Joonsoo Kim 提交于
      Many developers already know that field for reference count of the
      struct page is _count and atomic type.  They would try to handle it
      directly and this could break the purpose of page reference count
      tracepoint.  To prevent direct _count modification, this patch rename it
      to _refcount and add warning message on the code.  After that, developer
      who need to handle reference count will find that field should not be
      accessed directly.
      
      [akpm@linux-foundation.org: fix comments, per Vlastimil]
      [akpm@linux-foundation.org: Documentation/vm/transhuge.txt too]
      [sfr@canb.auug.org.au: sync ethernet driver changes]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Sunil Goutham <sgoutham@cavium.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Manish Chopra <manish.chopra@qlogic.com>
      Cc: Yuval Mintz <yuval.mintz@qlogic.com>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Cc: Saeed Mahameed <saeedm@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0139aa7b
  2. 06 5月, 2016 1 次提交
    • J
      mm: update min_free_kbytes from khugepaged after core initialization · bc22af74
      Jason Baron 提交于
      Khugepaged attempts to raise min_free_kbytes if its set too low.
      However, on boot khugepaged sets min_free_kbytes first from
      subsys_initcall(), and then the mm 'core' over-rides min_free_kbytes
      after from init_per_zone_wmark_min(), via a module_init() call.
      
      Khugepaged used to use a late_initcall() to set min_free_kbytes (such
      that it occurred after the core initialization), however this was
      removed when the initialization of min_free_kbytes was integrated into
      the starting of the khugepaged thread.
      
      The fix here is simply to invoke the core initialization using a
      core_initcall() instead of module_init(), such that the previous
      initialization ordering is restored.  I didn't restore the
      late_initcall() since start_stop_khugepaged() already sets
      min_free_kbytes via set_recommended_min_free_kbytes().
      
      This was noticed when we had a number of page allocation failures when
      moving a workload to a kernel with this new initialization ordering.  On
      an 8GB system this restores min_free_kbytes back to 67584 from 11365
      when CONFIG_TRANSPARENT_HUGEPAGE=y is set and either
      CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y or
      CONFIG_TRANSPARENT_HUGEPAGE_MADVISE=y.
      
      Fixes: 79553da2 ("thp: cleanup khugepaged startup")
      Signed-off-by: NJason Baron <jbaron@akamai.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bc22af74
  3. 26 3月, 2016 1 次提交
    • V
      mm/page_alloc: prevent merging between isolated and other pageblocks · d9dddbf5
      Vlastimil Babka 提交于
      Hanjun Guo has reported that a CMA stress test causes broken accounting of
      CMA and free pages:
      
      > Before the test, I got:
      > -bash-4.3# cat /proc/meminfo | grep Cma
      > CmaTotal:         204800 kB
      > CmaFree:          195044 kB
      >
      >
      > After running the test:
      > -bash-4.3# cat /proc/meminfo | grep Cma
      > CmaTotal:         204800 kB
      > CmaFree:         6602584 kB
      >
      > So the freed CMA memory is more than total..
      >
      > Also the the MemFree is more than mem total:
      >
      > -bash-4.3# cat /proc/meminfo
      > MemTotal:       16342016 kB
      > MemFree:        22367268 kB
      > MemAvailable:   22370528 kB
      
      Laura Abbott has confirmed the issue and suspected the freepage accounting
      rewrite around 3.18/4.0 by Joonsoo Kim.  Joonsoo had a theory that this is
      caused by unexpected merging between MIGRATE_ISOLATE and MIGRATE_CMA
      pageblocks:
      
      > CMA isolates MAX_ORDER aligned blocks, but, during the process,
      > partialy isolated block exists. If MAX_ORDER is 11 and
      > pageblock_order is 9, two pageblocks make up MAX_ORDER
      > aligned block and I can think following scenario because pageblock
      > (un)isolation would be done one by one.
      >
      > (each character means one pageblock. 'C', 'I' means MIGRATE_CMA,
      > MIGRATE_ISOLATE, respectively.
      >
      > CC -> IC -> II (Isolation)
      > II -> CI -> CC (Un-isolation)
      >
      > If some pages are freed at this intermediate state such as IC or CI,
      > that page could be merged to the other page that is resident on
      > different type of pageblock and it will cause wrong freepage count.
      
      This was supposed to be prevented by CMA operating on MAX_ORDER blocks,
      but since it doesn't hold the zone->lock between pageblocks, a race
      window does exist.
      
      It's also likely that unexpected merging can occur between
      MIGRATE_ISOLATE and non-CMA pageblocks.  This should be prevented in
      __free_one_page() since commit 3c605096 ("mm/page_alloc: restrict
      max order of merging on isolated pageblock").  However, we only check
      the migratetype of the pageblock where buddy merging has been initiated,
      not the migratetype of the buddy pageblock (or group of pageblocks)
      which can be MIGRATE_ISOLATE.
      
      Joonsoo has suggested checking for buddy migratetype as part of
      page_is_buddy(), but that would add extra checks in allocator hotpath
      and bloat-o-meter has shown significant code bloat (the function is
      inline).
      
      This patch reduces the bloat at some expense of more complicated code.
      The buddy-merging while-loop in __free_one_page() is initially bounded
      to pageblock_border and without any migratetype checks.  The checks are
      placed outside, bumping the max_order if merging is allowed, and
      returning to the while-loop with a statement which can't be possibly
      considered harmful.
      
      This fixes the accounting bug and also removes the arguably weird state
      in the original commit 3c605096 where buddies could be left
      unmerged.
      
      Fixes: 3c605096 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
      Link: https://lkml.org/lkml/2016/3/2/280Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NHanjun Guo <guohanjun@huawei.com>
      Tested-by: NHanjun Guo <guohanjun@huawei.com>
      Acked-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Debugged-by: NLaura Abbott <labbott@redhat.com>
      Debugged-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>	[3.18+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9dddbf5
  4. 18 3月, 2016 12 次提交
    • T
      mm,oom: do not loop !__GFP_FS allocation if the OOM killer is disabled · 0a687aac
      Tetsuo Handa 提交于
      After the OOM killer is disabled during suspend operation, any
      !__GFP_NOFAIL && __GFP_FS allocations are forced to fail.  Thus, any
      !__GFP_NOFAIL && !__GFP_FS allocations should be forced to fail as well.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a687aac
    • L
      mm: meminit: initialise more memory for inode/dentry hash tables in early boot · 987b3095
      Li Zhang 提交于
      Upstream has supported page parallel initialisation for X86 and the boot
      time is improved greately.  Some tests have been done for Power.
      
      Here is the result I have done with different memory size.
      
      * 4GB memory:
          boot time is as the following:
          with patch vs without patch: 10.4s vs 24.5s
          boot time is improved 57%
      * 200GB memory:
          boot time looks the same with and without patches.
          boot time is about 38s
      * 32TB memory:
          boot time looks the same with and without patches
          boot time is about 160s.
          The boot time is much shorter than X86 with 24TB memory.
          From community discussion, it costs about 694s for X86 24T system.
      
      Parallel initialisation improves the performance by deferring memory
      initilisation to kswap with N kthreads, it should improve the performance
      therotically.
      
      In testing on X86, performance is improved greatly with huge memory.  But
      on Power platform, it is improved greatly with less than 100GB memory.
      For huge memory, it is not improved greatly.  But it saves the time with
      several threads at least, as the following information shows(32TB system
      log):
      
      [   22.648169] node 9 initialised, 16607461 pages in 280ms
      [   22.783772] node 3 initialised, 23937243 pages in 410ms
      [   22.858877] node 6 initialised, 29179347 pages in 490ms
      [   22.863252] node 2 initialised, 29179347 pages in 490ms
      [   22.907545] node 0 initialised, 32049614 pages in 540ms
      [   22.920891] node 15 initialised, 32212280 pages in 550ms
      [   22.923236] node 4 initialised, 32306127 pages in 550ms
      [   22.923384] node 12 initialised, 32314319 pages in 550ms
      [   22.924754] node 8 initialised, 32314319 pages in 550ms
      [   22.940780] node 13 initialised, 33353677 pages in 570ms
      [   22.940796] node 11 initialised, 33353677 pages in 570ms
      [   22.941700] node 5 initialised, 33353677 pages in 570ms
      [   22.941721] node 10 initialised, 33353677 pages in 570ms
      [   22.941876] node 7 initialised, 33353677 pages in 570ms
      [   22.944946] node 14 initialised, 33353677 pages in 570ms
      [   22.946063] node 1 initialised, 33345485 pages in 580ms
      
      It saves the time about 550*16 ms at least, although it can be ignore to
      compare the boot time about 160 seconds.  What's more, the boot time is
      much shorter on Power even without patches than x86 for huge memory
      machine.
      
      So this patchset is still necessary to be enabled for Power.
      
      This patch (of 2):
      
      This patch is based on Mel Gorman's old patch in the mailing list,
      https://lkml.org/lkml/2015/5/5/280 which is discussed but it is fixed with
      a completion to wait for all memory initialised in page_alloc_init_late().
      It is to fix the OOM problem on X86 with 24TB memory which allocates
      memory in late initialisation.  But for Power platform with 32TB memory,
      it causes a call trace in vfs_caches_init->inode_init() and inode hash
      table needs more memory.  So this patch allocates 1GB for 0.25TB/node for
      large system as it is mentioned in https://lkml.org/lkml/2015/5/1/627
      
      This call trace is found on Power with 32TB memory, 1024CPUs, 16nodes.
      Currently, it only allocates 2GB*16=32GB for early initialisation.  But
      Dentry cache hash table needes 16GB and Inode cache hash table needs 16GB.
      So the system have no enough memory for it.  The log from dmesg as the
      following:
      
        Dentry cache hash table entries: 2147483648 (order: 18,17179869184 bytes)
        vmalloc: allocation failure, allocated 16021913600 of 17179934720 bytes
        swapper/0: page allocation failure: order:0,mode:0x2080020
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.4.0-0-ppc64
        Call Trace:
          .dump_stack+0xb4/0xb664 (unreliable)
          .warn_alloc_failed+0x114/0x160
          .__vmalloc_area_node+0x1a4/0x2b0
          .__vmalloc_node_range+0xe4/0x110
          .__vmalloc_node+0x40/0x50
          .alloc_large_system_hash+0x134/0x2a4
          .inode_init+0xa4/0xf0
          .vfs_caches_init+0x80/0x144
          .start_kernel+0x40c/0x4e0
          start_here_common+0x20/0x4a4
      Signed-off-by: NLi Zhang <zhlcindy@linux.vnet.ibm.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      987b3095
    • J
      mm: convert printk(KERN_<LEVEL> to pr_<level> · 1170532b
      Joe Perches 提交于
      Most of the mm subsystem uses pr_<level> so make it consistent.
      
      Miscellanea:
      
       - Realign arguments
       - Add missing newline to format
       - kmemleak-test.c has a "kmemleak: " prefix added to the
         "Kmemleak testing" logging message via pr_fmt
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1170532b
    • J
      mm: coalesce split strings · 756a025f
      Joe Perches 提交于
      Kernel style prefers a single string over split strings when the string is
      'user-visible'.
      
      Miscellanea:
      
       - Add a missing newline
       - Realign arguments
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: Tejun Heo <tj@kernel.org>	[percpu]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      756a025f
    • M
      mm: remove __GFP_NOFAIL is deprecated comment · 0f352e53
      Michal Hocko 提交于
      Commit 64775719 ("mm: clarify __GFP_NOFAIL deprecation status") was
      incomplete and didn't remove the comment about __GFP_NOFAIL being
      deprecated in buffered_rmqueue.
      
      Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
      because we should really discourage from using __GFP_NOFAIL with higher
      order allocations because those are just too subtle.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NNikolay Borisov <kernel@kyup.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f352e53
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim 提交于
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
    • M
      mm: thp: set THP defrag by default to madvise and add a stall-free defrag option · 444eb2a4
      Mel Gorman 提交于
      THP defrag is enabled by default to direct reclaim/compact but not wake
      kswapd in the event of a THP allocation failure.  The problem is that
      THP allocation requests potentially enter reclaim/compaction.  This
      potentially incurs a severe stall that is not guaranteed to be offset by
      reduced TLB misses.  While there has been considerable effort to reduce
      the impact of reclaim/compaction, it is still a high cost and workloads
      that should fit in memory fail to do so.  Specifically, a simple
      anon/file streaming workload will enter direct reclaim on NUMA at least
      even though the working set size is 80% of RAM.  It's been years and
      it's time to throw in the towel.
      
      First, this patch defines THP defrag as follows;
      
       madvise: A failed allocation will direct reclaim/compact if the application requests it
       never:   Neither reclaim/compact nor wake kswapd
       defer:   A failed allocation will wake kswapd/kcompactd
       always:  A failed allocation will direct reclaim/compact (historical behaviour)
                khugepaged defrag will enter direct/reclaim but not wake kswapd.
      
      Next it sets the default defrag option to be "madvise" to only enter
      direct reclaim/compaction for applications that specifically requested
      it.
      
      Lastly, it removes a check from the page allocator slowpath that is
      related to __GFP_THISNODE to allow "defer" to work.  The callers that
      really cares are slub/slab and they are updated accordingly.  The slab
      one may be surprising because it also corrects a comment as kswapd was
      never woken up by that path.
      
      This means that a THP fault will no longer stall for most applications
      by default and the ideal for most users that get THP if they are
      immediately available.  There are still options for users that prefer a
      stall at startup of a new application by either restoring historical
      behaviour with "always" or pick a half-way point with "defer" where
      kswapd does some of the work in the background and wakes kcompactd if
      necessary.  THP defrag for khugepaged remains enabled and will enter
      direct/reclaim but no wakeup kswapd or kcompactd.
      
      After this patch a THP allocation failure will quickly fallback and rely
      on khugepaged to recover the situation at some time in the future.  In
      some cases, this will reduce THP usage but the benefit of THP is hard to
      measure and not a universal win where as a stall to reclaim/compaction
      is definitely measurable and can be painful.
      
      The first test for this is using "usemem" to read a large file and write
      a large anonymous mapping (to avoid the zero page) multiple times.  The
      total size of the mappings is 80% of RAM and the benchmark simply
      measures how long it takes to complete.  It uses multiple threads to see
      if that is a factor.  On UMA, the performance is almost identical so is
      not reported but on NUMA, we see this
      
      usemem
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Amean    System-1       102.86 (  0.00%)       46.81 ( 54.50%)
      Amean    System-4        37.85 (  0.00%)       34.02 ( 10.12%)
      Amean    System-7        48.12 (  0.00%)       46.89 (  2.56%)
      Amean    System-12       51.98 (  0.00%)       56.96 ( -9.57%)
      Amean    System-21       80.16 (  0.00%)       79.05 (  1.39%)
      Amean    System-30      110.71 (  0.00%)      107.17 (  3.20%)
      Amean    System-48      127.98 (  0.00%)      124.83 (  2.46%)
      Amean    Elapsd-1       185.84 (  0.00%)      105.51 ( 43.23%)
      Amean    Elapsd-4        26.19 (  0.00%)       25.58 (  2.33%)
      Amean    Elapsd-7        21.65 (  0.00%)       21.62 (  0.16%)
      Amean    Elapsd-12       18.58 (  0.00%)       17.94 (  3.43%)
      Amean    Elapsd-21       17.53 (  0.00%)       16.60 (  5.33%)
      Amean    Elapsd-30       17.45 (  0.00%)       17.13 (  1.84%)
      Amean    Elapsd-48       15.40 (  0.00%)       15.27 (  0.82%)
      
      For a single thread, the benchmark completes 43.23% faster with this
      patch applied with smaller benefits as the thread increases.  Similar,
      notice the large reduction in most cases in system CPU usage.  The
      overall CPU time is
      
                     4.4.0       4.4.0
              kcompactd-v1r1 nodefrag-v1r3
      User        10357.65    10438.33
      System       3988.88     3543.94
      Elapsed      2203.01     1634.41
      
      Which is substantial. Now, the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 128458477   278352931
      Major Faults                   2174976         225
      Swap Ins                      16904701           0
      Swap Outs                     17359627           0
      Allocation stalls                43611           0
      DMA allocs                           0           0
      DMA32 allocs                  19832646    19448017
      Normal allocs                614488453   580941839
      Movable allocs                       0           0
      Direct pages scanned          24163800           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed        20691346           0
      Compaction stalls                42263           0
      Compaction success                 938           0
      Compaction failures              41325           0
      
      This patch eliminates almost all swapping and direct reclaim activity.
      There is still overhead but it's from NUMA balancing which does not
      identify that it's pointless trying to do anything with this workload.
      
      I also tried the thpscale benchmark which forces a corner case where
      compaction can be used heavily and measures the latency of whether base
      or huge pages were used
      
      thpscale Fault Latencies
                                             4.4.0                 4.4.0
                                    kcompactd-v1r1         nodefrag-v1r3
      Amean    fault-base-1      5288.84 (  0.00%)     2817.12 ( 46.73%)
      Amean    fault-base-3      6365.53 (  0.00%)     3499.11 ( 45.03%)
      Amean    fault-base-5      6526.19 (  0.00%)     4363.06 ( 33.15%)
      Amean    fault-base-7      7142.25 (  0.00%)     4858.08 ( 31.98%)
      Amean    fault-base-12    13827.64 (  0.00%)    10292.11 ( 25.57%)
      Amean    fault-base-18    18235.07 (  0.00%)    13788.84 ( 24.38%)
      Amean    fault-base-24    21597.80 (  0.00%)    24388.03 (-12.92%)
      Amean    fault-base-30    26754.15 (  0.00%)    19700.55 ( 26.36%)
      Amean    fault-base-32    26784.94 (  0.00%)    19513.57 ( 27.15%)
      Amean    fault-huge-1      4223.96 (  0.00%)     2178.57 ( 48.42%)
      Amean    fault-huge-3      2194.77 (  0.00%)     2149.74 (  2.05%)
      Amean    fault-huge-5      2569.60 (  0.00%)     2346.95 (  8.66%)
      Amean    fault-huge-7      3612.69 (  0.00%)     2997.70 ( 17.02%)
      Amean    fault-huge-12     3301.75 (  0.00%)     6727.02 (-103.74%)
      Amean    fault-huge-18     6696.47 (  0.00%)     6685.72 (  0.16%)
      Amean    fault-huge-24     8000.72 (  0.00%)     9311.43 (-16.38%)
      Amean    fault-huge-30    13305.55 (  0.00%)     9750.45 ( 26.72%)
      Amean    fault-huge-32     9981.71 (  0.00%)    10316.06 ( -3.35%)
      
      The average time to fault pages is substantially reduced in the majority
      of caseds but with the obvious caveat that fewer THPs are actually used
      in this adverse workload
      
                                         4.4.0                 4.4.0
                                kcompactd-v1r1         nodefrag-v1r3
      Percentage huge-1         0.71 (  0.00%)       14.04 (1865.22%)
      Percentage huge-3        10.77 (  0.00%)       33.05 (206.85%)
      Percentage huge-5        60.39 (  0.00%)       38.51 (-36.23%)
      Percentage huge-7        45.97 (  0.00%)       34.57 (-24.79%)
      Percentage huge-12       68.12 (  0.00%)       40.07 (-41.17%)
      Percentage huge-18       64.93 (  0.00%)       47.82 (-26.35%)
      Percentage huge-24       62.69 (  0.00%)       44.23 (-29.44%)
      Percentage huge-30       43.49 (  0.00%)       55.38 ( 27.34%)
      Percentage huge-32       50.72 (  0.00%)       51.90 (  2.35%)
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                  37429143    47564000
      Major Faults                      1916        1558
      Swap Ins                          1466        1079
      Swap Outs                      2936863      149626
      Allocation stalls                62510           3
      DMA allocs                           0           0
      DMA32 allocs                   6566458     6401314
      Normal allocs                216361697   216538171
      Movable allocs                       0           0
      Direct pages scanned          25977580       17998
      Kswapd pages scanned                 0     3638931
      Kswapd pages reclaimed               0      207236
      Direct pages reclaimed         8833714          88
      Compaction stalls               103349           5
      Compaction success                 270           4
      Compaction failures             103079           1
      
      Note again that while this does swap as it's an aggressive workload, the
      direct relcim activity and allocation stalls is substantially reduced.
      There is some kswapd activity but ftrace showed that the kswapd activity
      was due to normal wakeups from 4K pages being allocated.
      Compaction-related stalls and activity are almost eliminated.
      
      I also tried the stutter benchmark.  For this, I do not have figures for
      NUMA but it's something that does impact UMA so I'll report what is
      available
      
      stutter
                                       4.4.0                 4.4.0
                              kcompactd-v1r1         nodefrag-v1r3
      Min         mmap      7.3571 (  0.00%)      7.3438 (  0.18%)
      1st-qrtle   mmap      7.5278 (  0.00%)     17.9200 (-138.05%)
      2nd-qrtle   mmap      7.6818 (  0.00%)     21.6055 (-181.25%)
      3rd-qrtle   mmap     11.0889 (  0.00%)     21.8881 (-97.39%)
      Max-90%     mmap     27.8978 (  0.00%)     22.1632 ( 20.56%)
      Max-93%     mmap     28.3202 (  0.00%)     22.3044 ( 21.24%)
      Max-95%     mmap     28.5600 (  0.00%)     22.4580 ( 21.37%)
      Max-99%     mmap     29.6032 (  0.00%)     25.5216 ( 13.79%)
      Max         mmap   4109.7289 (  0.00%)   4813.9832 (-17.14%)
      Mean        mmap     12.4474 (  0.00%)     19.3027 (-55.07%)
      
      This benchmark is trying to fault an anonymous mapping while there is a
      heavy IO load -- a scenario that desktop users used to complain about
      frequently.  This shows a mix because the ideal case of mapping with THP
      is not hit as often.  However, note that 99% of the mappings complete
      13.79% faster.  The CPU usage here is particularly interesting
      
                     4.4.0       4.4.0
              kcompactd-v1r1nodefrag-v1r3
      User           67.50        0.99
      System       1327.88       91.30
      Elapsed      2079.00     2128.98
      
      And once again we look at the reclaim figures
      
                                       4.4.0       4.4.0
                                kcompactd-v1r1nodefrag-v1r3
      Minor Faults                 335241922  1314582827
      Major Faults                       715         819
      Swap Ins                             0           0
      Swap Outs                            0           0
      Allocation stalls               532723           0
      DMA allocs                           0           0
      DMA32 allocs                1822364341  1177950222
      Normal allocs               1815640808  1517844854
      Movable allocs                       0           0
      Direct pages scanned          21892772           0
      Kswapd pages scanned          20015890    41879484
      Kswapd pages reclaimed        19961986    41822072
      Direct pages reclaimed        21892741           0
      Compaction stalls              1065755           0
      Compaction success                 514           0
      Compaction failures            1065241           0
      
      Allocation stalls and all direct reclaim activity is eliminated as well
      as compaction-related stalls.
      
      THP gives impressive gains in some cases but only if they are quickly
      available.  We're not going to reach the point where they are completely
      free so lets take the costs out of the fast paths finally and defer the
      cost to kswapd, kcompactd and khugepaged where it belongs.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      444eb2a4
    • J
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner 提交于
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • I
      mm/page_alloc.c: calculate 'available' memory in a separate function · d02bd27b
      Igor Redko 提交于
      Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
      statistics protocol, corresponding to 'Available' in /proc/meminfo.
      
      It indicates to the hypervisor how big the balloon can be inflated
      without pushing the guest system to swap.  This metric would be very
      useful in VM orchestration software to improve memory management of
      different VMs under overcommit.
      
      This patch (of 2):
      
      Factor out calculation of the available memory counter into a separate
      exportable function, in order to be able to use it in other parts of the
      kernel.
      
      In particular, it appears a relevant metric to report to the hypervisor
      via virtio-balloon statistics interface (in a followup patch).
      Signed-off-by: NIgor Redko <redkoi@virtuozzo.com>
      Signed-off-by: NDenis V. Lunev <den@openvz.org>
      Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d02bd27b
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
    • J
      sound: query dynamic DEBUG_PAGEALLOC setting · 505f6d22
      Joonsoo Kim 提交于
      We can disable debug_pagealloc processing even if the code is compiled
      with CONFIG_DEBUG_PAGEALLOC.  This patch changes the code to query
      whether it is enabled or not in runtime.
      
      [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NTakashi Iwai <tiwai@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      505f6d22
    • N
      /proc/kpageflags: return KPF_BUDDY for "tail" buddy pages · 832fc1de
      Naoya Horiguchi 提交于
      Currently /proc/kpageflags returns nothing for "tail" buddy pages, which
      is inconvenient when grasping how free pages are distributed.  This
      patch sets KPF_BUDDY for such pages.
      
      With this patch:
      
        $ grep MemFree /proc/meminfo ; tools/vm/page-types -b buddy
        MemFree:         3134992 kB
                     flags      page-count       MB  symbolic-flags                     long-symbolic-flags
        0x0000000000000400          779272     3044  __________B_______________________________ buddy
        0x0000000000000c00            4385       17  __________BM______________________________ buddy,mmap
                     total          783657     3061
      
      783657 pages is 3134628 kB (roughly consistent with the global counter,)
      so it's OK.
      
      [akpm@linux-foundation.org: update comment, per Naoya]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com&gt;>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      832fc1de
  5. 16 3月, 2016 11 次提交
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim 提交于
      There is a performance drop report due to hugepage allocation and in
      there half of cpu time are spent on pageblock_pfn_to_page() in
      compaction [1].
      
      In that workload, compaction is triggered to make hugepage but most of
      pageblocks are un-available for compaction due to pageblock type and
      skip bit so compaction usually fails.  Most costly operations in this
      case is to find valid pageblock while scanning whole zone range.  To
      check if pageblock is valid to compact, valid pfn within pageblock is
      required and we can obtain it by calling pageblock_pfn_to_page().  This
      function checks whether pageblock is in a single zone and return valid
      pfn if possible.  Problem is that we need to check it every time before
      scanning pageblock even if we re-visit it and this turns out to be very
      expensive in this workload.
      
      Although we have no way to skip this pageblock check in the system where
      hole exists at arbitrary position, we can use cached value for zone
      continuity and just do pfn_to_page() in the system where hole doesn't
      exist.  This optimization considerably speeds up in above workload.
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min: 635 MB/s 1015 MB/s
        Avg: 899 MB/s 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • L
      mm/page_poisoning.c: allow for zero poisoning · 1414c7f4
      Laura Abbott 提交于
      By default, page poisoning uses a poison value (0xaa) on free.  If this
      is changed to 0, the page is not only sanitized but zeroing on alloc
      with __GFP_ZERO can be skipped as well.  The tradeoff is that detecting
      corruption from the poisoning is harder to detect.  This feature also
      cannot be used with hibernation since pages are not guaranteed to be
      zeroed after hibernation.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Acked-by: NRafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1414c7f4
    • L
      mm/page_poison.c: enable PAGE_POISONING as a separate option · 8823b1db
      Laura Abbott 提交于
      Page poisoning is currently set up as a feature if architectures don't
      have architecture debug page_alloc to allow unmapping of pages.  It has
      uses apart from that though.  Clearing of the pages on free provides an
      increase in security as it helps to limit the risk of information leaks.
      Allow page poisoning to be enabled as a separate option independent of
      kernel_map pages since the two features do separate work.  Because of
      how hiberanation is implemented, the checks on alloc cannot occur if
      hibernation is enabled.  The runtime alloc checks can also be enabled
      with an option when !HIBERNATION.
      
      Credit to Grsecurity/PaX team for inspiring this work
      Signed-off-by: NLaura Abbott <labbott@fedoraproject.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mathias Krause <minipli@googlemail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Jianyu Zhan <nasa4836@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8823b1db
    • V
      mm, debug: move bad flags printing to bad_page() · ff8e8116
      Vlastimil Babka 提交于
      Since bad_page() is the only user of the badflags parameter of
      dump_page_badflags(), we can move the code to bad_page() and simplify a
      bit.
      
      The dump_page_badflags() function is renamed to __dump_page() and can
      still be called separately from dump_page() for temporary debug prints
      where page_owner info is not desired.
      
      The only user-visible change is that page->mem_cgroup is printed before
      the bad flags.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff8e8116
    • V
      mm, page_owner: dump page owner info from dump_page() · 4e462112
      Vlastimil Babka 提交于
      The page_owner mechanism is useful for dealing with memory leaks.  By
      reading /sys/kernel/debug/page_owner one can determine the stack traces
      leading to allocations of all pages, and find e.g.  a buggy driver.
      
      This information might be also potentially useful for debugging, such as
      the VM_BUG_ON_PAGE() calls to dump_page().  So let's print the stored
      info from dump_page().
      
      Example output:
      
        page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
        flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
        page dumped because: VM_BUG_ON_PAGE(1)
        page->mem_cgroup:ffff8801392c5000
        page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
         [<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
         [<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
        page has been migrated, last migrate reason: compaction
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e462112
    • V
      mm, page_owner: print migratetype of page and pageblock, symbolic flags · 60f30350
      Vlastimil Babka 提交于
      The information in /sys/kernel/debug/page_owner includes the migratetype
      of the pageblock the page belongs to.  This is also checked against the
      page's migratetype (as declared by gfp_flags during its allocation), and
      the page is reported as Fallback if its migratetype differs from the
      pageblock's one.  t This is somewhat misleading because in fact fallback
      allocation is not the only reason why these two can differ.  It also
      doesn't direcly provide the page's migratetype, although it's possible
      to derive that from the gfp_flags.
      
      It's arguably better to print both page and pageblock's migratetype and
      leave the interpretation to the consumer than to suggest fallback
      allocation as the only possible reason.  While at it, we can print the
      migratetypes as string the same way as /proc/pagetypeinfo does, as some
      of the numeric values depend on kernel configuration.  For that, this
      patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
      of mm/vmstat.c to mm/page_alloc.c and exports it.
      
      With the new format strings for flags, we can now also provide symbolic
      page and gfp flags in the /sys/kernel/debug/page_owner file.  This
      replaces the positional printing of page flags as single letters, which
      might have looked nicer, but was limited to a subset of flags, and
      required the user to remember the letters.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
        PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b4058>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
         [<ffffffff81160523>] generic_file_read_iter+0x453/0x760
         [<ffffffff811e0d57>] __vfs_read+0xa7/0xd0
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60f30350
    • V
      mm, page_alloc: print symbolic gfp_flags on allocation failure · c5c990e8
      Vlastimil Babka 提交于
      It would be useful to translate gfp_flags into string representation
      when printing in case of an allocation failure, especially as the flags
      have been undergoing some changes recently and the script
      ./scripts/gfp-translate needs a matching source version to be accurate.
      
      Example output:
      
        stapio: page allocation failure: order:9, mode:0x2080020(GFP_ATOMIC)
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5c990e8
    • C
      mm/debug_pagealloc: ask users for default setting of debug_pagealloc · ea6eabb0
      Christian Borntraeger 提交于
      Since commit 031bc574 ("mm/debug-pagealloc: make debug-pagealloc
      boottime configurable") CONFIG_DEBUG_PAGEALLOC is by default not adding
      any page debugging.
      
      This resulted in several unnoticed bugs, e.g.
      
          https://lkml.kernel.org/g/<569F5E29.3090107@de.ibm.com>
      or
          https://lkml.kernel.org/g/<56A20F30.4050705@de.ibm.com>
      
      as this behaviour change was not even documented in Kconfig.
      
      Let's provide a new Kconfig symbol that allows to change the default
      back to enabled, e.g.  for debug kernels.  This also makes the change
      obvious to kernel packagers.
      
      Let's also change the Kconfig description for CONFIG_DEBUG_PAGEALLOC, to
      indicate that there are two stages of overhead.
      Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea6eabb0
    • A
      mm/page_alloc.c: rework code layout in memmap_init_zone() · b72d0ffb
      Andrew Morton 提交于
      This function is getting full of weird tricks to avoid word-wrapping.
      Use a goto to eliminate a tab stop then use the new space
      
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b72d0ffb
    • T
      mm/page_alloc.c: introduce kernelcore=mirror option · 342332e6
      Taku Izumi 提交于
      This patch extends existing "kernelcore" option and introduces
      kernelcore=mirror option.  By specifying "mirror" instead of specifying
      the amount of memory, non-mirrored (non-reliable) region will be
      arranged into ZONE_MOVABLE.
      
      [akpm@linux-foundation.org: fix build with CONFIG_HAVE_MEMBLOCK_NODE_MAP=n]
      Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com>
      Tested-by: NSudeep Holla <sudeep.holla@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      342332e6
    • T
      mm/page_alloc.c: calculate zone_start_pfn at zone_spanned_pages_in_node() · d91749c1
      Taku Izumi 提交于
      Xeon E7 v3 based systems supports Address Range Mirroring and UEFI BIOS
      complied with UEFI spec 2.5 can notify which ranges are mirrored
      (reliable) via EFI memory map.  Now Linux kernel utilize its information
      and allocates boot time memory from reliable region.
      
      My requirement is:
        - allocate kernel memory from mirrored region
        - allocate user memory from non-mirrored region
      
      In order to meet my requirement, ZONE_MOVABLE is useful.  By arranging
      non-mirrored range into ZONE_MOVABLE, mirrored memory is used for kernel
      allocations.
      
      My idea is to extend existing "kernelcore" option and introduces
      kernelcore=mirror option.  By specifying "mirror" instead of specifying
      the amount of memory, non-mirrored region will be arranged into
      ZONE_MOVABLE.
      
      Earlier discussions are at:
       https://lkml.org/lkml/2015/10/9/24
       https://lkml.org/lkml/2015/10/15/9
       https://lkml.org/lkml/2015/11/27/18
       https://lkml.org/lkml/2015/12/8/836
      
      For example, suppose 2-nodes system with the following memory range:
      
        node 0 [mem 0x0000000000001000-0x000000109fffffff]
        node 1 [mem 0x00000010a0000000-0x000000209fffffff]
      and the following ranges are marked as reliable (mirrored):
        [0x0000000000000000-0x0000000100000000]
        [0x0000000100000000-0x0000000180000000]
        [0x0000000800000000-0x0000000880000000]
        [0x00000010a0000000-0x0000001120000000]
        [0x00000017a0000000-0x0000001820000000]
      
      If you specify kernelcore=mirror, ZONE_NORMAL and ZONE_MOVABLE are
      arranged like bellow:
      
       - node 0:
        ZONE_NORMAL : [0x0000000100000000-0x00000010a0000000]
        ZONE_MOVABLE: [0x0000000180000000-0x00000010a0000000]
       - node 1:
        ZONE_NORMAL : [0x00000010a0000000-0x00000020a0000000]
        ZONE_MOVABLE: [0x0000001120000000-0x00000020a0000000]
      
      In overlapped range, pages to be ZONE_MOVABLE in ZONE_NORMAL are treated
      as absent pages, and vice versa.
      
      This patch (of 2):
      
      Currently each zone's zone_start_pfn is calculated at
      free_area_init_core().  However zone's range is fixed at the time when
      invoking zone_spanned_pages_in_node().
      
      This patch changes how each zone->zone_start_pfn is calculated in
      zone_spanned_pages_in_node().
      Signed-off-by: NTaku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d91749c1
  6. 06 2月, 2016 1 次提交
    • V
      mm, hugetlb: don't require CMA for runtime gigantic pages · 080fe206
      Vlastimil Babka 提交于
      Commit 944d9fec ("hugetlb: add support for gigantic page allocation
      at runtime") has added the runtime gigantic page allocation via
      alloc_contig_range(), making this support available only when CONFIG_CMA
      is enabled.  Because it doesn't depend on MIGRATE_CMA pageblocks and the
      associated infrastructure, it is possible with few simple adjustments to
      require only CONFIG_MEMORY_ISOLATION instead of full CONFIG_CMA.
      
      After this patch, alloc_contig_range() and related functions are
      available and used for gigantic pages with just CONFIG_MEMORY_ISOLATION
      enabled.  Note CONFIG_CMA selects CONFIG_MEMORY_ISOLATION.  This allows
      supporting runtime gigantic pages without the CMA-specific checks in
      page allocator fastpaths.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      080fe206
  7. 04 2月, 2016 1 次提交