1. 10 10月, 2014 12 次提交
    • M
      mm: page_alloc: default node-ordering on 64-bit NUMA, zone-ordering on 32-bit · 3193913c
      Mel Gorman 提交于
      Zones are allocated by the page allocator in either node or zone order.
      Node ordering is preferred in terms of locality and is applied
      automatically in one of three cases:
      
        1. If a node has only low memory
      
        2. If DMA/DMA32 is a high percentage of memory
      
        3. If low memory on a single node is greater than 70% of the node size
      
      Otherwise zone ordering is used to preserve low memory for devices that
      require it.  Unfortunately a consequence of this is that applications
      running on a machine with balanced NUMA nodes will experience different
      performance characteristics depending on which node they happen to start
      from.
      
      The point of zone ordering is to protect lower zones for devices that
      require DMA/DMA32 memory.  When NUMA was first introduced, this was
      critical as 32-bit NUMA machines existed and exhausting low memory
      triggered OOMs easily as so many allocations required low memory.  On
      64-bit machines the primary concern is devices that are 32-bit only which
      is less severe than the low memory exhaustion problem on 32-bit NUMA.  It
      seems there are really few devices that depends on it.
      
      AGP -- I assume this is getting more rare but even then I think the allocations
      	happen early in boot time where lowmem pressure is less of a problem
      
      DRM -- If the device is 32-bit only then there may be low pressure. I didn't
      	evaluate these in detail but it looks like some of these are mobile
      	graphics card. Not many NUMA laptops out there. DRM folk should know
      	better though.
      
      Some TV cards -- Much demand for 32-bit capable TV cards on NUMA machines?
      
      B43 wireless card -- again not really a NUMA thing.
      
      I cannot find a good reason to incur a performance penalty on all 64-bit NUMA
      machines in case someone throws a brain damanged TV or graphics card in there.
      This patch defaults to node-ordering on 64-bit NUMA machines. I was tempted
      to make it default everywhere but I understand that some embedded arches may
      be using 32-bit NUMA where I cannot predict the consequences.
      
      The performance impact depends on the workload and the characteristics of the
      machine and the machine I tested on had a large Normal zone on node 0 so the
      impact is within the noise for the majority of tests. The allocation stats
      show more allocation requests were from DMA32 and local node. Running SpecJBB
      with multiple JVMs and automatic NUMA balancing disabled the results were
      
      specjbb
                           3.17.0-rc2            3.17.0-rc2
                              vanilla        nodeorder-v1r1
      Min    1      29534.00 (  0.00%)     30020.00 (  1.65%)
      Min    10    115717.00 (  0.00%)    134038.00 ( 15.83%)
      Min    19    109718.00 (  0.00%)    114186.00 (  4.07%)
      Min    28    104459.00 (  0.00%)    103639.00 ( -0.78%)
      Min    37     98245.00 (  0.00%)    103756.00 (  5.61%)
      Min    46     97198.00 (  0.00%)     96197.00 ( -1.03%)
      Mean   1      30953.25 (  0.00%)     31917.75 (  3.12%)
      Mean   10    124432.50 (  0.00%)    140904.00 ( 13.24%)
      Mean   19    116033.50 (  0.00%)    119294.75 (  2.81%)
      Mean   28    108365.25 (  0.00%)    106879.50 ( -1.37%)
      Mean   37    102984.75 (  0.00%)    106924.25 (  3.83%)
      Mean   46    100783.25 (  0.00%)    105368.50 (  4.55%)
      Stddev 1       1260.38 (  0.00%)      1109.66 ( 11.96%)
      Stddev 10      7434.03 (  0.00%)      5171.91 ( 30.43%)
      Stddev 19      8453.84 (  0.00%)      5309.59 ( 37.19%)
      Stddev 28      4184.55 (  0.00%)      2906.63 ( 30.54%)
      Stddev 37      5409.49 (  0.00%)      3192.12 ( 40.99%)
      Stddev 46      4521.95 (  0.00%)      7392.52 (-63.48%)
      Max    1      32738.00 (  0.00%)     32719.00 ( -0.06%)
      Max    10    136039.00 (  0.00%)    148614.00 (  9.24%)
      Max    19    130566.00 (  0.00%)    127418.00 ( -2.41%)
      Max    28    115404.00 (  0.00%)    111254.00 ( -3.60%)
      Max    37    112118.00 (  0.00%)    111732.00 ( -0.34%)
      Max    46    108541.00 (  0.00%)    116849.00 (  7.65%)
      TPut   1     123813.00 (  0.00%)    127671.00 (  3.12%)
      TPut   10    497730.00 (  0.00%)    563616.00 ( 13.24%)
      TPut   19    464134.00 (  0.00%)    477179.00 (  2.81%)
      TPut   28    433461.00 (  0.00%)    427518.00 ( -1.37%)
      TPut   37    411939.00 (  0.00%)    427697.00 (  3.83%)
      TPut   46    403133.00 (  0.00%)    421474.00 (  4.55%)
      
                                  3.17.0-rc2  3.17.0-rc2
                                     vanillanodeorder-v1r1
      DMA allocs                           0           0
      DMA32 allocs                        57     1491992
      Normal allocs                 32543566    30026383
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned                 0           0
      Kswapd pages reclaimed               0           0
      Direct pages reclaimed               0           0
      Kswapd efficiency                 100%        100%
      Kswapd velocity                  0.000       0.000
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Zone normal velocity             0.000       0.000
      Zone dma32 velocity              0.000       0.000
      Zone dma velocity                0.000       0.000
      THP fault alloc                  55164       52987
      THP collapse alloc                 139         147
      THP splits                          26          21
      NUMA alloc hit                 4169066     4250692
      NUMA alloc miss                      0           0
      
      Note that there were more DMA32 allocations with the patch applied.  In this
      particular case there was no difference in numa_hit and numa_miss. The
      expectation is that DMA32 was being used at the low watermark instead of
      falling into the slow path. kswapd was not woken but it's not worken for
      THP allocations.
      
      On 32-bit, this patch defaults to zone-ordering as low memory depletion
      can be a serious problem on 32-bit large memory machines. If the default
      ordering was node then processes on node 0 will deplete the Normal zone
      due to normal activity.  The problem is worse if CONFIG_HIGHPTE is not
      set. If combined with large amounts of dirty/writeback pages in Normal
      zone then there is also a high risk of OOM. The heuristics are removed
      as it's not clear they were ever important on 32-bit. They were only
      relevant for setting node-ordering on 64-bit.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3193913c
    • M
      mm: page_alloc: Make paranoid check in move_freepages a VM_BUG_ON · 97ee4ba7
      Mel Gorman 提交于
      Since 2.6.24 there has been a paranoid check in move_freepages that looks
      up the zone of two pages.  This is a very slow path and the only time I've
      seen this bug trigger recently is when memory initialisation was broken
      during patch development.  Despite the fact it's a slow path, this patch
      converts the check to a VM_BUG_ON anyway as it has served its purpose by
      now.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97ee4ba7
    • J
      mm: clean up zone flags · 57054651
      Johannes Weiner 提交于
      Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
      sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending the
      reader through layers indirection just to track down a simple bit.
      
      Remove all zone flag wrappers and just use bitops against zone->flags
      directly.  It's just as readable and the lines are barely any longer.
      
      Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
      remove the zone_flags_t typedef.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57054651
    • W
      mm: page_alloc: avoid wakeup kswapd on the unintended node · 7ade3c99
      Weijie Yang 提交于
      When entering the page_alloc slowpath, we wakeup kswapd on every pgdat
      according to the zonelist and high_zoneidx.  However, this doesn't take
      nodemask into account, and could prematurely wakeup kswapd on some
      unintended nodes.
      
      This patch uses for_each_zone_zonelist_nodemask() instead of
      for_each_zone_zonelist() in wake_all_kswapds() to avoid the above
      situation.
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ade3c99
    • S
      mm: introduce dump_vma · 0bf55139
      Sasha Levin 提交于
      Introduce a helper to dump information about a VMA, this also makes
      dump_page_flags more generic and re-uses that so the output looks very
      similar to dump_page:
      
      [   61.903437] vma ffff88070f88be00 start 00007fff25970000 end 00007fff25992000
      [   61.903437] next ffff88070facd600 prev ffff88070face400 mm ffff88070fade000
      [   61.903437] prot 8000000000000025 anon_vma ffff88070fa1e200 vm_ops           (null)
      [   61.903437] pgoff 7ffffffdd file           (null) private_data           (null)
      [   61.909129] flags: 0x100173(read|write|mayread|maywrite|mayexec|growsdown|account)
      
      [akpm@linux-foundation.org: make dump_vma() require CONFIG_DEBUG_VM]
      [swarren@nvidia.com: fix dump_vma() compilation]
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NStephen Warren <swarren@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bf55139
    • D
      mm: rename allocflags_to_migratetype for clarity · 43e7a34d
      David Rientjes 提交于
      The page allocator has gfp flags (like __GFP_WAIT) and alloc flags (like
      ALLOC_CPUSET) that have separate semantics.
      
      The function allocflags_to_migratetype() actually takes gfp flags, not
      alloc flags, and returns a migratetype.  Rename it to
      gfpflags_to_migratetype().
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43e7a34d
    • V
      mm, compaction: khugepaged should not give up due to need_resched() · 1f9efdef
      Vlastimil Babka 提交于
      Async compaction aborts when it detects zone lock contention or
      need_resched() is true.  David Rientjes has reported that in practice,
      most direct async compactions for THP allocation abort due to
      need_resched().  This means that a second direct compaction is never
      attempted, which might be OK for a page fault, but khugepaged is intended
      to attempt a sync compaction in such case and in these cases it won't.
      
      This patch replaces "bool contended" in compact_control with an int that
      distinguishes between aborting due to need_resched() and aborting due to
      lock contention.  This allows propagating the abort through all compaction
      functions as before, but passing the abort reason up to
      __alloc_pages_slowpath() which decides when to continue with direct
      reclaim and another compaction attempt.
      
      Another problem is that try_to_compact_pages() did not act upon the
      reported contention (both need_resched() or lock contention) immediately
      and would proceed with another zone from the zonelist.  When
      need_resched() is true, that means initializing another zone compaction,
      only to check again need_resched() in isolate_migratepages() and aborting.
       For zone lock contention, the unintended consequence is that the lock
      contended status reported back to the allocator is detrmined from the last
      zone where compaction was attempted, which is rather arbitrary.
      
      This patch fixes the problem in the following way:
      - async compaction of a zone aborting due to need_resched() or fatal signal
        pending means that further zones should not be tried. We report
        COMPACT_CONTENDED_SCHED to the allocator.
      - aborting zone compaction due to lock contention means we can still try
        another zone, since it has different set of locks. We report back
        COMPACT_CONTENDED_LOCK only if *all* zones where compaction was attempted,
        it was aborted due to lock contention.
      
      As a result of these fixes, khugepaged will proceed with second sync
      compaction as intended, when the preceding async compaction aborted due to
      need_resched().  Page fault compactions aborting due to need_resched()
      will spare some cycles previously wasted by initializing another zone
      compaction only to abort again.  Lock contention will be reported only
      when compaction in all zones aborted due to lock contention, and therefore
      it's not a good idea to try again after reclaim.
      
      In stress-highalloc from mmtests configured to use __GFP_NO_KSWAPD, this
      has improved number of THP collapse allocations by 10%, which shows
      positive effect on khugepaged.  The benchmark's success rates are
      unchanged as it is not recognized as khugepaged.  Numbers of compact_stall
      and compact_fail events have however decreased by 20%, with
      compact_success still a bit improved, which is good.  With benchmark
      configured not to use __GFP_NO_KSWAPD, there is 6% improvement in THP
      collapse allocations, and only slight improvement in stalls and failures.
      
      [akpm@linux-foundation.org: fix warnings]
      Reported-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f9efdef
    • V
      mm, compaction: move pageblock checks up from isolate_migratepages_range() · edc2ca61
      Vlastimil Babka 提交于
      isolate_migratepages_range() is the main function of the compaction
      scanner, called either on a single pageblock by isolate_migratepages()
      during regular compaction, or on an arbitrary range by CMA's
      __alloc_contig_migrate_range().  It currently perfoms two pageblock-wide
      compaction suitability checks, and because of the CMA callpath, it tracks
      if it crossed a pageblock boundary in order to repeat those checks.
      
      However, closer inspection shows that those checks are always true for CMA:
      - isolation_suitable() is true because CMA sets cc->ignore_skip_hint to true
      - migrate_async_suitable() check is skipped because CMA uses sync compaction
      
      We can therefore move the compaction-specific checks to
      isolate_migratepages() and simplify isolate_migratepages_range().
      Furthermore, we can mimic the freepage scanner family of functions, which
      has isolate_freepages_block() function called both by compaction from
      isolate_freepages() and by CMA from isolate_freepages_range(), where each
      use-case adds own specific glue code.  This allows further code
      simplification.
      
      Thus, we rename isolate_migratepages_range() to
      isolate_migratepages_block() and limit its functionality to a single
      pageblock (or its subset).  For CMA, a new different
      isolate_migratepages_range() is created as a CMA-specific wrapper for the
      _block() function.  The checks specific to compaction are moved to
      isolate_migratepages().  As part of the unification of these two families
      of functions, we remove the redundant zone parameter where applicable,
      since zone pointer is already passed in cc->zone.
      
      Furthermore, going back to compact_zone() and compact_finished() when
      pageblock is found unsuitable (now by isolate_migratepages()) is wasteful
      - the checks are meant to skip pageblocks quickly.  The patch therefore
      also introduces a simple loop into isolate_migratepages() so that it does
      not return immediately on failed pageblock checks, but keeps going until
      isolate_migratepages_range() gets called once.  Similarily to
      isolate_freepages(), the function periodically checks if it needs to
      reschedule or abort async compaction.
      
      [iamjoonsoo.kim@lge.com: fix isolated page counting bug in compaction]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edc2ca61
    • V
      mm, compaction: do not count compact_stall if all zones skipped compaction · 98dd3b48
      Vlastimil Babka 提交于
      The compact_stall vmstat counter counts the number of allocations stalled
      by direct compaction.  It does not count when all attempted zones had
      deferred compaction, but it does count when all zones skipped compaction.
      The skipping is decided based on very early check of
      compaction_suitable(), based on watermarks and memory fragmentation.
      Therefore it makes sense not to count skipped compactions as stalls.
      Moreover, compact_success or compact_fail is also already not being
      counted when compaction was skipped, so this patch changes the
      compact_stall counting to match the other two.
      
      Additionally, restructure __alloc_pages_direct_compact() code for better
      readability.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98dd3b48
    • V
      mm, compaction: defer each zone individually instead of preferred zone · 53853e2d
      Vlastimil Babka 提交于
      When direct sync compaction is often unsuccessful, it may become deferred
      for some time to avoid further useless attempts, both sync and async.
      Successful high-order allocations un-defer compaction, while further
      unsuccessful compaction attempts prolong the compaction deferred period.
      
      Currently the checking and setting deferred status is performed only on
      the preferred zone of the allocation that invoked direct compaction.  But
      compaction itself is attempted on all eligible zones in the zonelist, so
      the behavior is suboptimal and may lead both to scenarios where 1)
      compaction is attempted uselessly, or 2) where it's not attempted despite
      good chances of succeeding, as shown on the examples below:
      
      1) A direct compaction with Normal preferred zone failed and set
         deferred compaction for the Normal zone.  Another unrelated direct
         compaction with DMA32 as preferred zone will attempt to compact DMA32
         zone even though the first compaction attempt also included DMA32 zone.
      
         In another scenario, compaction with Normal preferred zone failed to
         compact Normal zone, but succeeded in the DMA32 zone, so it will not
         defer compaction.  In the next attempt, it will try Normal zone which
         will fail again, instead of skipping Normal zone and trying DMA32
         directly.
      
      2) Kswapd will balance DMA32 zone and reset defer status based on
         watermarks looking good.  A direct compaction with preferred Normal
         zone will skip compaction of all zones including DMA32 because Normal
         was still deferred.  The allocation might have succeeded in DMA32, but
         won't.
      
      This patch makes compaction deferring work on individual zone basis
      instead of preferred zone.  For each zone, it checks compaction_deferred()
      to decide if the zone should be skipped.  If watermarks fail after
      compacting the zone, defer_compaction() is called.  The zone where
      watermarks passed can still be deferred when the allocation attempt is
      unsuccessful.  When allocation is successful, compaction_defer_reset() is
      called for the zone containing the allocated page.  This approach should
      approximate calling defer_compaction() only on zones where compaction was
      attempted and did not yield allocated page.  There might be corner cases
      but that is inevitable as long as the decision to stop compacting dues not
      guarantee that a page will be allocated.
      
      Due to a new COMPACT_DEFERRED return value, some functions relying
      implicitly on COMPACT_SKIPPED = 0 had to be updated, with comments made
      more accurate.  The did_some_progress output parameter of
      __alloc_pages_direct_compact() is removed completely, as the caller
      actually does not use it after compaction sets it - it is only considered
      when direct reclaim sets it.
      
      During testing on a two-node machine with a single very small Normal zone
      on node 1, this patch has improved success rates in stress-highalloc
      mmtests benchmark.  The success here were previously made worse by commit
      3a025760 ("mm: page_alloc: spill to remote nodes before waking
      kswapd") as kswapd was no longer resetting often enough the deferred
      compaction for the Normal zone, and DMA32 zones on both nodes were thus
      not considered for compaction.  On different machine, success rates were
      improved with __GFP_NO_KSWAPD allocations.
      
      [akpm@linux-foundation.org: fix CONFIG_COMPACTION=n build]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53853e2d
    • V
      mm: page_alloc: determine migratetype only once · 21bb9bd1
      Vlastimil Babka 提交于
      The check for ALLOC_CMA in __alloc_pages_nodemask() derives migratetype
      from gfp_mask in each retry pass, although the migratetype variable
      already has the value determined and it does not change.  Use the variable
      and perform the check only once.  Also convert #ifdef CONFIG_CMA to
      IS_ENABLED.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21bb9bd1
    • J
      topology: add support for node_to_mem_node() to determine the fallback node · ad2c8144
      Joonsoo Kim 提交于
      Anton noticed (http://www.spinics.net/lists/linux-mm/msg67489.html) that
      on ppc LPARs with memoryless nodes, a large amount of memory was consumed
      by slabs and was marked unreclaimable.  He tracked it down to slab
      deactivations in the SLUB core when we allocate remotely, leading to poor
      efficiency always when memoryless nodes are present.
      
      After much discussion, Joonsoo provided a few patches that help
      significantly.  They don't resolve the problem altogether:
      
       - memory hotplug still needs testing, that is when a memoryless node
         becomes memory-ful, we want to dtrt
       - there are other reasons for going off-node than memoryless nodes,
         e.g., fully exhausted local nodes
      
      Neither case is resolved with this series, but I don't think that should
      block their acceptance, as they can be explored/resolved with follow-on
      patches.
      
      The series consists of:
      
      [1/3] topology: add support for node_to_mem_node() to determine the
            fallback node
      
      [2/3] slub: fallback to node_to_mem_node() node if allocating on
            memoryless node
      
            - Joonsoo's patches to cache the nearest node with memory for each
              NUMA node
      
      [3/3] Partial revert of 81c98869 (""kthread: ensure locality of
            task_struct allocations")
      
       - At Tejun's request, keep the knowledge of memoryless node fallback
         to the allocator core.
      
      This patch (of 3):
      
      We need to determine the fallback node in slub allocator if the allocation
      target node is memoryless node.  Without it, the SLUB wrongly select the
      node which has no memory and can't use a partial slab, because of node
      mismatch.  Introduced function, node_to_mem_node(X), will return a node Y
      with memory that has the nearest distance.  If X is memoryless node, it
      will return nearest distance node, but, if X is normal node, it will
      return itself.
      
      We will use this function in following patch to determine the fallback
      node.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Han Pingtian <hanpt@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2c8144
  2. 03 10月, 2014 1 次提交
  3. 07 8月, 2014 10 次提交
  4. 31 7月, 2014 1 次提交
    • D
      mm, thp: do not allow thp faults to avoid cpuset restrictions · b104a35d
      David Rientjes 提交于
      The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
      should be set in allocflags.  ALLOC_CPUSET controls if a page allocation
      should be restricted only to the set of allowed cpuset mems.
      
      Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
      the fault path from using memory compaction or direct reclaim.  Thus, it
      is unfairly able to allocate outside of its cpuset mems restriction as a
      side-effect.
      
      This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
      truly GFP_ATOMIC by verifying it is also not a thp allocation.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NAlex Thorlton <athorlton@sgi.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b104a35d
  5. 30 7月, 2014 1 次提交
    • R
      mm: fix page_alloc.c kernel-doc warnings · 1aab4d77
      Randy Dunlap 提交于
      Fix kernel-doc warnings and function name in mm/page_alloc.c:
      
        Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
        Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
        Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
        Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
        Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
        Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1aab4d77
  6. 04 7月, 2014 1 次提交
    • M
      mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER · dc78327c
      Michal Nazarewicz 提交于
      With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
      the following is triggered at early boot:
      
        SMP: Total of 8 processors activated.
        devtmpfs: initialized
        Unable to handle kernel NULL pointer dereference at virtual address 00000008
        pgd = fffffe0000050000
        [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
        Internal error: Oops: 96000006 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
        task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
        PC is at __list_add+0x10/0xd4
        LR is at free_one_page+0x270/0x638
        ...
        Call trace:
          __list_add+0x10/0xd4
          free_one_page+0x26c/0x638
          __free_pages_ok.part.52+0x84/0xbc
          __free_pages+0x74/0xbc
          init_cma_reserved_pageblock+0xe8/0x104
          cma_init_reserved_areas+0x190/0x1e4
          do_one_initcall+0xc4/0x154
          kernel_init_freeable+0x204/0x2a8
          kernel_init+0xc/0xd4
      
      This happens because init_cma_reserved_pageblock() calls
      __free_one_page() with pageblock_order as page order but it is bigger
      than MAX_ORDER.  This in turn causes accesses past zone->free_list[].
      
      Fix the problem by changing init_cma_reserved_pageblock() such that it
      splits pageblock into individual MAX_ORDER pages if pageblock is bigger
      than a MAX_ORDER page.
      
      In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
      architectures expect for ia64, powerpc and tile at the moment, the
      “pageblock_order > MAX_ORDER” condition will be optimised out since both
      sides of the operator are constants.  In cases where pageblock size is
      variable, the performance degradation should not be significant anyway
      since init_cma_reserved_pageblock() is called only at boot time at most
      MAX_CMA_AREAS times which by default is eight.
      Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
      Reported-by: NMark Salter <msalter@redhat.com>
      Tested-by: NMark Salter <msalter@redhat.com>
      Tested-by: NChristopher Covington <cov@codeaurora.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc78327c
  7. 24 6月, 2014 1 次提交
    • D
      mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      David Rientjes 提交于
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cd2b0a3
  8. 07 6月, 2014 1 次提交
  9. 05 6月, 2014 12 次提交