1. 29 7月, 2016 15 次提交
    • M
      mm: page_alloc: cache the last node whose dirty limit is reached · 3b8c0be4
      Mel Gorman 提交于
      If a page is about to be dirtied then the page allocator attempts to
      limit the total number of dirty pages that exists in any given zone.
      The call to node_dirty_ok is expensive so this patch records if the last
      pgdat examined hit the dirty limits.  In some cases, this reduces the
      number of calls to node_dirty_ok().
      
      Link: http://lkml.kernel.org/r/1467970510-21195-31-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b8c0be4
    • M
      mm, page_alloc: remove fair zone allocation policy · e6cbd7f2
      Mel Gorman 提交于
      The fair zone allocation policy interleaves allocation requests between
      zones to avoid an age inversion problem whereby new pages are reclaimed
      to balance a zone.  Reclaim is now node-based so this should no longer
      be an issue and the fair zone allocation policy is not free.  This patch
      removes it.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-30-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6cbd7f2
    • M
      mm: convert zone_reclaim to node_reclaim · a5f5f91d
      Mel Gorman 提交于
      As reclaim is now per-node based, convert zone_reclaim to be
      node_reclaim.  It is possible that a node will be reclaimed multiple
      times if it has multiple zones but this is unavoidable without caching
      all nodes traversed so far.  The documentation and interface to
      userspace is the same from a configuration perspective and will will be
      similar in behaviour unless the node-local allocation requests were also
      limited to lower zones.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5f5f91d
    • M
      mm, page_alloc: wake kswapd based on the highest eligible zone · 52e9f87a
      Mel Gorman 提交于
      The ac_classzone_idx is used as the basis for waking kswapd and that is
      based on the preferred zoneref.  If the preferred zoneref's first zone
      is lower than what is available on other nodes, it's possible that
      kswapd is woken on a zone with only higher, but still eligible, zones.
      As classzone_idx is strictly adhered to now, it causes a problem because
      eligible pages are skipped.
      
      For example, node 0 has only DMA32 and node 1 has only NORMAL.  An
      allocating context running on node 0 may wake kswapd on node 1 telling
      it to skip all NORMAL pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-23-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52e9f87a
    • M
      mm, vmscan: only wakeup kswapd once per node for the requested classzone · e1a55637
      Mel Gorman 提交于
      kswapd is woken when zones are below the low watermark but the wakeup
      decision is not taking the classzone into account.  Now that reclaim is
      node-based, it is only required to wake kswapd once per node and only if
      all zones are unbalanced for the requested classzone.
      
      Note that one node might be checked multiple times if the zonelist is
      ordered by node because there is no cheap way of tracking what nodes
      have already been visited.  For zone-ordering, each node should be
      checked only once.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-22-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1a55637
    • M
      mm: move most file-based accounting to the node · 11fb9989
      Mel Gorman 提交于
      There are now a number of accounting oddities such as mapped file pages
      being accounted for on the node while the total number of file pages are
      accounted on the zone.  This can be coped with to some extent but it's
      confusing so this patch moves the relevant file-based accounted.  Due to
      throttling logic in the page allocator for reliable OOM detection, it is
      still necessary to track dirty and writeback pages on a per-zone basis.
      
      [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
        Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11fb9989
    • M
      mm: move page mapped accounting to the node · 50658e2e
      Mel Gorman 提交于
      Reclaim makes decisions based on the number of pages that are mapped but
      it's mixing node and zone information.  Account NR_FILE_MAPPED and
      NR_ANON_PAGES pages on the node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-18-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      50658e2e
    • M
      mm, page_alloc: consider dirtyable memory in terms of nodes · 281e3726
      Mel Gorman 提交于
      Historically dirty pages were spread among zones but now that LRUs are
      per-node it is more appropriate to consider dirty pages in a node.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-17-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      281e3726
    • M
      mm, vmscan: make shrink_node decisions more node-centric · a9dd0a83
      Mel Gorman 提交于
      Earlier patches focused on having direct reclaim and kswapd use data
      that is node-centric for reclaiming but shrink_node() itself still uses
      too much zone information.  This patch removes unnecessary zone-based
      information with the most important decision being whether to continue
      reclaim or not.  Some memcg APIs are adjusted as a result even though
      memcg itself still uses some zone information.
      
      [mgorman@techsingularity.net: optimization]
        Link: http://lkml.kernel.org/r/1468588165-12461-2-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-14-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9dd0a83
    • M
      mm, vmscan: simplify the logic deciding whether kswapd sleeps · 38087d9b
      Mel Gorman 提交于
      kswapd goes through some complex steps trying to figure out if it should
      stay awake based on the classzone_idx and the requested order.  It is
      unnecessarily complex and passes in an invalid classzone_idx to
      balance_pgdat().  What matters most of all is whether a larger order has
      been requsted and whether kswapd successfully reclaimed at the previous
      order.  This patch irons out the logic to check just that and the end
      result is less headache inducing.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38087d9b
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman 提交于
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
    • M
      mm, vmscan: move lru_lock to the node · a52633d8
      Mel Gorman 提交于
      Node-based reclaim requires node-based LRUs and locking.  This is a
      preparation patch that just moves the lru_lock to the node so later
      patches are easier to review.  It is a mechanical change but note this
      patch makes contention worse because the LRU lock is hotter and direct
      reclaim and kswapd can contend on the same lock even when reclaiming
      from different zones.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a52633d8
    • M
      mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman 提交于
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly different to node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in distribution of file/anon pages due
         to when they were allocated results can result in a difference in
         again. While the fair zone allocation policy mitigates some of the
         problems here, the page reclaim results on a multi-zone node will
         always be different to a single-zone node.
         it was scheduled on as a result.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other but it's sensitive to timing.  This
         mitigates the page allocator using pages that were allocated very recently
         in the ideal case but it's sensitive to timing. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have relatively low highmem:lowmem
      ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artifically benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or not impact on tiobench which is
      desirable and a reduction in system CPU usage. It indicates that the fair
      zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in numbers of inodes reclaimed but the workload has few active inodes and
      is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with even though I could not reproduce any problems in
      that area.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targetted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THP's worth page pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different but not
         necessarily in any way that matters but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ef7184
    • M
      mm, meminit: remove early_page_nid_uninitialised · a621184a
      Mel Gorman 提交于
      The helper early_page_nid_uninitialised() has been dead since commit
      974a786e ("mm, page_alloc: remove MIGRATE_RESERVE") so remove the
      dead code.
      
      Link: http://lkml.kernel.org/r/1468008031-3848-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a621184a
    • Z
      mm: update the comment in __isolate_free_page · 400bc7fd
      zhong jiang 提交于
      We need to assure the comment is consistent with the code.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1466171914-21027-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      400bc7fd
  2. 27 7月, 2016 11 次提交
  3. 15 7月, 2016 2 次提交
  4. 04 6月, 2016 4 次提交
  5. 21 5月, 2016 8 次提交
    • N
      mm: check_new_page_bad() directly returns in __PG_HWPOISON case · e570f56c
      Naoya Horiguchi 提交于
      Currently we check page->flags twice for "HWPoisoned" case of
      check_new_page_bad(), which can cause a race with unpoisoning.
      
      This race unnecessarily taints kernel with "BUG: Bad page state".
      check_new_page_bad() is the only caller of bad_page() which is
      interested in __PG_HWPOISON, so let's move the hwpoison related code in
      bad_page() to it.
      
      Link: http://lkml.kernel.org/r/20160518100949.GA17299@hori1.linux.bs1.fc.nec.co.jpSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e570f56c
    • S
      mm, kasan: fix to call kasan_free_pages() after poisoning page · 29b52de1
      seokhoon.yoon 提交于
      When CONFIG_PAGE_POISONING and CONFIG_KASAN is enabled,
      free_pages_prepare()'s codeflow is below.
      
        1)kmemcheck_free_shadow()
        2)kasan_free_pages()
          - set shadow byte of page is freed
        3)kernel_poison_pages()
        3.1) check access to page is valid or not using kasan
          ---> error occur, kasan think it is invalid access
        3.2) poison page
        4)kernel_map_pages()
      
      So kasan_free_pages() should be called after poisoning the page.
      
      Link: http://lkml.kernel.org/r/1463220405-7455-1-git-send-email-iamyooon@gmail.comSigned-off-by: Nseokhoon.yoon <iamyooon@gmail.com>
      Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Laura Abbott <labbott@fedoraproject.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29b52de1
    • S
      mm: use phys_addr_t for reserve_bootmem_region() arguments · 4b50bcc7
      Stefan Bader 提交于
      Since commit 92923ca3 ("mm: meminit: only set page reserved in the
      memblock region") the reserved bit is set on reserved memblock regions.
      However start and end address are passed as unsigned long.  This is only
      32bit on i386, so it can end up marking the wrong pages reserved for
      ranges at 4GB and above.
      
      This was observed on a 32bit Xen dom0 which was booted with initial
      memory set to a value below 4G but allowing to balloon in memory
      (dom0_mem=1024M for example).  This would define a reserved bootmem
      region for the additional memory (for example on a 8GB system there was
      a reverved region covering the 4GB-8GB range).  But since the addresses
      were passed on as unsigned long, this was actually marking all pages
      from 0 to 4GB as reserved.
      
      Fixes: 92923ca3 ("mm: meminit: only set page reserved in the memblock region")
      Link: http://lkml.kernel.org/r/1463491221-10573-1-git-send-email-stefan.bader@canonical.comSigned-off-by: NStefan Bader <stefan.bader@canonical.com>
      Cc: <stable@vger.kernel.org>	[4.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b50bcc7
    • M
      mm: use existing helper to convert "on"/"off" to boolean · 2a138dc7
      Minfei Huang 提交于
      It's more convenient to use existing function helper to convert string
      "on/off" to boolean.
      
      Link: http://lkml.kernel.org/r/1461908824-16129-1-git-send-email-mnghuan@gmail.comSigned-off-by: NMinfei Huang <mnghuan@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2a138dc7
    • M
      mm, oom: protect !costly allocations some more for !CONFIG_COMPACTION · 31e49bfd
      Michal Hocko 提交于
      Joonsoo has reported that he is able to trigger OOM for !costly high
      order requests (heavy fork() workload close the OOM) with the new oom
      detection rework.  This is because we rely only on should_reclaim_retry
      when the compaction is disabled and it only checks watermarks for the
      requested order and so we might trigger OOM when there is a lot of free
      memory.
      
      It is not very clear what are the usual workloads when the compaction is
      disabled.  Relying on high order allocations heavily without any
      mechanism to create those orders except for unbound amount of reclaim is
      certainly not a good idea.
      
      To prevent from potential regressions let's help this configuration
      some.  We have to sacrifice the determinsm though because there simply
      is none here possible.  should_compact_retry implementation for
      !CONFIG_COMPACTION, which was empty so far, will do watermark check for
      order-0 on all eligible zones.  This will cause retrying until either
      the reclaim cannot make any further progress or all the zones are
      depleted even for order-0 pages.  This means that the number of retries
      is basically unbounded for !costly orders but that was the case before
      the rework as well so this shouldn't regress.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1463051677-29418-3-git-send-email-mhocko@kernel.orgReported-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31e49bfd
    • M
      mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Michal Hocko 提交于
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch did miss a mis interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below low watermark + 1<<order gap.  If we are getting really close
      to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
      high order request (e.g.  hugetlb order-9) while the reclaim is not able
      to release enough pages to get us over low watermark.  The reclaim is
      still able to make some progress (usually trashing over few remaining
      pages) so we are not able to break out from the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizont.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just unblock
      __compaction_suitable which doesn't guarantee we will make a further
      progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86a294a8
    • M
      mm: consider compaction feedback also for costly allocation · 7854ea6c
      Michal Hocko 提交于
      PAGE_ALLOC_COSTLY_ORDER retry logic is mostly handled inside
      should_reclaim_retry currently where we decide to not retry after at
      least order worth of pages were reclaimed or the watermark check for at
      least one zone would succeed after reclaiming all pages if the reclaim
      hasn't made any progress.  Compaction feedback is mostly ignored and we
      just try to make sure that the compaction did at least something before
      giving up.
      
      The first condition was added by a41f24ea ("page allocator: smarter
      retry of costly-order allocations) and it assumed that lumpy reclaim
      could have created a page of the sufficient order.  Lumpy reclaim, has
      been removed quite some time ago so the assumption doesn't hold anymore.
      Remove the check for the number of reclaimed pages and rely on the
      compaction feedback solely.  should_reclaim_retry now only makes sure
      that we keep retrying reclaim for high order pages only if they are
      hidden by watermaks so order-0 reclaim makes really sense.
      
      should_compact_retry now keeps retrying even for the costly allocations.
      The number of retries is reduced wrt.  !costly requests because they are
      less important and harder to grant and so their pressure shouldn't cause
      contention for other requests or cause an over reclaim.  We also do not
      reset no_progress_loops for costly request to make sure we do not keep
      reclaiming too agressively.
      
      This has been tested by running a process which fragments memory:
      	- compact memory
      	- mmap large portion of the memory (1920M on 2GRAM machine with 2G
      	  of swapspace)
      	- MADV_DONTNEED single page in PAGE_SIZE*((1UL<<MAX_ORDER)-1)
      	  steps until certain amount of memory is freed (250M in my test)
      	  and reduce the step to (step / 2) + 1 after reaching the end of
      	  the mapping
      	- then run a script which populates the page cache 2G (MemTotal)
      	  from /dev/zero to a new file
      And then tries to allocate
      nr_hugepages=$(awk '/MemAvailable/{printf "%d\n", $2/(2*1024)}' /proc/meminfo)
      huge pages.
      
      root@test1:~# echo 1 > /proc/sys/vm/overcommit_memory;echo 1 > /proc/sys/vm/compact_memory; ./fragment-mem-and-run /root/alloc_hugepages.sh 1920M 250M
      Node 0, zone      DMA     31     28     31     10      2      0      2      1      2      3      1
      Node 0, zone    DMA32    437    319    171     50     28     25     20     16     16     14    437
      
      * This is the /proc/buddyinfo after the compaction
      
      Done fragmenting. size=2013265920 freed=262144000
      Node 0, zone      DMA    165     48      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32  35109  14575    185     51     41     12      6      0      0      0      0
      
      * /proc/buddyinfo after memory got fragmented
      
      Executing "/root/alloc_hugepages.sh"
      Eating some pagecache
      508623+0 records in
      508623+0 records out
      2083319808 bytes (2.1 GB) copied, 11.7292 s, 178 MB/s
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    111    344    153     20     24     10      3      0      0      0      0
      
      * /proc/buddyinfo after page cache got eaten
      
      Trying to allocate 129
      129
      
      * 129 hugepages requested and all of them granted.
      
      Node 0, zone      DMA      3      5      3      1      2      0      2      2      2      2      0
      Node 0, zone    DMA32    127     97     30     99     11      6      2      1      4      0      0
      
      * /proc/buddyinfo after hugetlb allocation.
      
      10 runs will behave as follows:
      Trying to allocate 130
      130
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 132
      132
      --
      Trying to allocate 129
      129
      --
      Trying to allocate 128
      128
      --
      Trying to allocate 129
      129
      
      So basically 100% success for all 10 attempts.
      Without the patch numbers looked much worse:
      Trying to allocate 128
      12
      --
      Trying to allocate 129
      14
      --
      Trying to allocate 129
      7
      --
      Trying to allocate 129
      16
      --
      Trying to allocate 129
      30
      --
      Trying to allocate 129
      38
      --
      Trying to allocate 129
      19
      --
      Trying to allocate 129
      37
      --
      Trying to allocate 129
      28
      --
      Trying to allocate 129
      37
      
      Just for completness the base kernel without oom detection rework looks
      as follows:
      Trying to allocate 127
      30
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      52
      --
      Trying to allocate 128
      32
      --
      Trying to allocate 129
      12
      --
      Trying to allocate 129
      10
      --
      Trying to allocate 129
      32
      --
      Trying to allocate 128
      14
      --
      Trying to allocate 128
      16
      --
      Trying to allocate 129
      8
      
      As we can see the success rate is much more volatile and smaller without
      this patch. So the patch not only makes the retry logic for costly
      requests more sensible the success rate is even higher.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7854ea6c
    • M
      mm, oom: protect !costly allocations some more · 33c2d214
      Michal Hocko 提交于
      should_reclaim_retry will give up retries for higher order allocations
      if none of the eligible zones has any requested or higher order pages
      available even if we pass the watermak check for order-0.  This is done
      because there is no guarantee that the reclaimable and currently free
      pages will form the required order.
      
      This can, however, lead to situations where the high-order request (e.g.
      order-2 required for the stack allocation during fork) will trigger OOM
      too early - e.g.  after the first reclaim/compaction round.  Such a
      system would have to be highly fragmented and there is no guarantee
      further reclaim/compaction attempts would help but at least make sure
      that the compaction was active before we go OOM and keep retrying even
      if should_reclaim_retry tells us to oom if
      
      	- the last compaction round backed off or
      	- we haven't completed at least MAX_COMPACT_RETRIES active
      	  compaction rounds.
      
      The first rule ensures that the very last attempt for compaction was not
      ignored while the second guarantees that the compaction has done some
      work.  Multiple retries might be needed to prevent occasional pigggy
      backing of other contexts to steal the compacted pages before the
      current context manages to retry to allocate them.
      
      compaction_failed() is taken as a final word from the compaction that
      the retry doesn't make much sense.  We have to be careful though because
      the first compaction round is MIGRATE_ASYNC which is rather weak as it
      ignores pages under writeback and gives up too easily in other
      situations.  We therefore have to make sure that MIGRATE_SYNC_LIGHT mode
      has been used before we give up.  With this logic in place we do not
      have to increase the migration mode unconditionally and rather do it
      only if the compaction failed for the weaker mode.  A nice side effect
      is that the stronger migration mode is used only when really needed so
      this has a potential of smaller latencies in some cases.
      
      Please note that the compaction doesn't tell us much about how
      successful it was when returning compaction_made_progress so we just
      have to blindly trust that another retry is worthwhile and cap the
      number to something reasonable to guarantee a convergence.
      
      If the given number of successful retries is not sufficient for a
      reasonable workloads we should focus on the collected compaction
      tracepoints data and try to address the issue in the compaction code.
      If this is not feasible we can increase the retries limit.
      
      [mhocko@suse.com: fix warning]
        Link: http://lkml.kernel.org/r/20160512061636.GA4200@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c2d214