1. 29 7月, 2016 4 次提交
    • M
      mm, mmzone: clarify the usage of zone padding · 0f661148
      Mel Gorman 提交于
      Zone padding separates write-intensive fields used by page allocation,
      compaction and vmstats but the comments are a little misleading and need
      clarification.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-5-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f661148
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman 提交于
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
    • M
      mm, vmscan: move lru_lock to the node · a52633d8
      Mel Gorman 提交于
      Node-based reclaim requires node-based LRUs and locking.  This is a
      preparation patch that just moves the lru_lock to the node so later
      patches are easier to review.  It is a mechanical change but note this
      patch makes contention worse because the LRU lock is hotter and direct
      reclaim and kswapd can contend on the same lock even when reclaiming
      from different zones.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a52633d8
    • M
      mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Mel Gorman 提交于
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly different to node 1. For
         example, direct reclaim scans in zonelist order and reclaims even if
         the zone is over the high watermark regardless of the age of pages
         in that LRU. Kswapd on the other hand starts reclaim on the highest
         unbalanced zone. A difference in distribution of file/anon pages due
         to when they were allocated results can result in a difference in
         again. While the fair zone allocation policy mitigates some of the
         problems here, the page reclaim results on a multi-zone node will
         always be different to a single-zone node.
         it was scheduled on as a result.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other but it's sensitive to timing.  This
         mitigates the page allocator using pages that were allocated very recently
         in the ideal case but it's sensitive to timing. When kswapd is allocating
         from lower zones then it's great but during the rebalancing of the highest
         zone, the page allocator and kswapd interfere with each other. It's worse
         if the highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have relatively low highmem:lowmem
      ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested uip to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artifically benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or not impact on tiobench which is
      desirable and a reduction in system CPU usage. It indicates that the fair
      zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in numbers of inodes reclaimed but the workload has few active inodes and
      is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                                mmotm-20160623nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with even though I could not reproduce any problems in
      that area.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targetted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THP's worth page pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different but not
         necessarily in any way that matters but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75ef7184
  2. 27 7月, 2016 3 次提交
  3. 21 5月, 2016 2 次提交
    • W
      mm fix commmets: if SPARSEMEM, pgdata doesn't have page_ext · 0c9ad804
      Weijie Yang 提交于
      If SPARSEMEM, use page_ext in mem_section
      if !SPARSEMEM, use page_ext in pgdata
      Signed-off-by: NWeijie Yang <weijie.yang@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c9ad804
    • M
      mm, oom, compaction: prevent from should_compact_retry looping for ever for costly orders · 86a294a8
      Michal Hocko 提交于
      "mm: consider compaction feedback also for costly allocation" has
      removed the upper bound for the reclaim/compaction retries based on the
      number of reclaimed pages for costly orders.  While this is desirable
      the patch did miss a mis interaction between reclaim, compaction and the
      retry logic.  The direct reclaim tries to get zones over min watermark
      while compaction backs off and returns COMPACT_SKIPPED when all zones
      are below low watermark + 1<<order gap.  If we are getting really close
      to OOM then __compaction_suitable can keep returning COMPACT_SKIPPED a
      high order request (e.g.  hugetlb order-9) while the reclaim is not able
      to release enough pages to get us over low watermark.  The reclaim is
      still able to make some progress (usually trashing over few remaining
      pages) so we are not able to break out from the loop.
      
      I have seen this happening with the same test described in "mm: consider
      compaction feedback also for costly allocation" on a swapless system.
      The original problem got resolved by "vmscan: consider classzone_idx in
      compaction_ready" but it shows how things might go wrong when we
      approach the oom event horizont.
      
      The reason why compaction requires being over low rather than min
      watermark is not clear to me.  This check was there essentially since
      56de7263 ("mm: compaction: direct compact when a high-order
      allocation fails").  It is clearly an implementation detail though and
      we shouldn't pull it into the generic retry logic while we should be
      able to cope with such eventuality.  The only place in
      should_compact_retry where we retry without any upper bound is for
      compaction_withdrawn() case.
      
      Introduce compaction_zonelist_suitable function which checks the given
      zonelist and returns true only if there is at least one zone which would
      would unblock __compaction_suitable if more memory got reclaimed.  In
      this implementation it checks __compaction_suitable with NR_FREE_PAGES
      plus part of the reclaimable memory as the target for the watermark
      check.  The reclaimable memory is reduced linearly by the allocation
      order.  The idea is that we do not want to reclaim all the remaining
      memory for a single allocation request just unblock
      __compaction_suitable which doesn't guarantee we will make a further
      progress.
      
      The new helper is then used if compaction_withdrawn() feedback was
      provided so we do not retry if there is no outlook for a further
      progress.  !costly requests shouldn't be affected much - e.g.  order-2
      pages would require to have at least 64kB on the reclaimable LRUs while
      order-9 would need at least 32M which should be enough to not lock up.
      
      [vbabka@suse.cz: fix classzone_idx vs. high_zoneidx usage in compaction_zonelist_suitable]
      [akpm@linux-foundation.org: fix it for Mel's mm-page_alloc-remove-field-from-alloc_context.patch]
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86a294a8
  4. 20 5月, 2016 5 次提交
    • M
      mm, page_alloc: inline pageblock lookup in page free fast paths · 0b423ca2
      Mel Gorman 提交于
      The function call overhead of get_pfnblock_flags_mask() is measurable in
      the page free paths.  This patch uses an inlined version that is faster.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b423ca2
    • M
      mm, page_alloc: avoid looking up the first zone in a zonelist twice · c33d6c06
      Mel Gorman 提交于
      The allocator fast path looks up the first usable zone in a zonelist and
      then get_page_from_freelist does the same job in the zonelist iterator.
      This patch preserves the necessary information.
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                              fastmark-v1r20             initonce-v1r20
        Min      alloc-odr0-1               364.00 (  0.00%)           359.00 (  1.37%)
        Min      alloc-odr0-2               262.00 (  0.00%)           260.00 (  0.76%)
        Min      alloc-odr0-4               214.00 (  0.00%)           214.00 (  0.00%)
        Min      alloc-odr0-8               186.00 (  0.00%)           186.00 (  0.00%)
        Min      alloc-odr0-16              173.00 (  0.00%)           173.00 (  0.00%)
        Min      alloc-odr0-32              165.00 (  0.00%)           165.00 (  0.00%)
        Min      alloc-odr0-64              161.00 (  0.00%)           162.00 ( -0.62%)
        Min      alloc-odr0-128             159.00 (  0.00%)           161.00 ( -1.26%)
        Min      alloc-odr0-256             168.00 (  0.00%)           170.00 ( -1.19%)
        Min      alloc-odr0-512             180.00 (  0.00%)           181.00 ( -0.56%)
        Min      alloc-odr0-1024            190.00 (  0.00%)           190.00 (  0.00%)
        Min      alloc-odr0-2048            196.00 (  0.00%)           196.00 (  0.00%)
        Min      alloc-odr0-4096            202.00 (  0.00%)           202.00 (  0.00%)
        Min      alloc-odr0-8192            206.00 (  0.00%)           205.00 (  0.49%)
        Min      alloc-odr0-16384           206.00 (  0.00%)           205.00 (  0.49%)
      
      The benefit is negligible and the results are within the noise but each
      cycle counts.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c33d6c06
    • M
      mm, page_alloc: convert alloc_flags to unsigned · c603844b
      Mel Gorman 提交于
      alloc_flags is a bitmask of flags but it is signed which does not
      necessarily generate the best code depending on the compiler.  Even
      without an impact, it makes more sense that this be unsigned.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c603844b
    • M
      mm, page_alloc: inline the fast path of the zonelist iterator · 682a3385
      Mel Gorman 提交于
      The page allocator iterates through a zonelist for zones that match the
      addressing limitations and nodemask of the caller but many allocations
      will not be restricted.  Despite this, there is always functional call
      overhead which builds up.
      
      This patch inlines the optimistic basic case and only calls the iterator
      function for the complex case.  A hindrance was the fact that
      cpuset_current_mems_allowed is used in the fastpath as the allowed
      nodemask even though all nodes are allowed on most systems.  The patch
      handles this by only considering cpuset_current_mems_allowed if a cpuset
      exists.  As well as being faster in the fast-path, this removes some
      junk in the slowpath.
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statinline-v1r20              optiter-v1r20
        Min      alloc-odr0-1               412.00 (  0.00%)           382.00 (  7.28%)
        Min      alloc-odr0-2               301.00 (  0.00%)           282.00 (  6.31%)
        Min      alloc-odr0-4               247.00 (  0.00%)           233.00 (  5.67%)
        Min      alloc-odr0-8               215.00 (  0.00%)           203.00 (  5.58%)
        Min      alloc-odr0-16              199.00 (  0.00%)           188.00 (  5.53%)
        Min      alloc-odr0-32              191.00 (  0.00%)           182.00 (  4.71%)
        Min      alloc-odr0-64              187.00 (  0.00%)           177.00 (  5.35%)
        Min      alloc-odr0-128             185.00 (  0.00%)           175.00 (  5.41%)
        Min      alloc-odr0-256             193.00 (  0.00%)           184.00 (  4.66%)
        Min      alloc-odr0-512             207.00 (  0.00%)           197.00 (  4.83%)
        Min      alloc-odr0-1024            213.00 (  0.00%)           203.00 (  4.69%)
        Min      alloc-odr0-2048            220.00 (  0.00%)           209.00 (  5.00%)
        Min      alloc-odr0-4096            226.00 (  0.00%)           214.00 (  5.31%)
        Min      alloc-odr0-8192            229.00 (  0.00%)           218.00 (  4.80%)
        Min      alloc-odr0-16384           229.00 (  0.00%)           219.00 (  4.37%)
      
      perf indicated that next_zones_zonelist disappeared in the profile and
      __next_zones_zonelist did not appear.  This is expected as the
      micro-benchmark would hit the inlined fast-path every time.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      682a3385
    • C
      mm/highmem: simplify is_highmem() · 29f9cb53
      Chanho Min 提交于
      is_highmem() can be simplified by use of is_highmem_idx().  This patch
      removes redundant code and will make it easier to maintain if the zone
      policy is changed or a new zone is added.
      
      (akpm: saves me 25 bytes of text per is_highmem() callsite)
      Signed-off-by: NChanho Min <chanho.min@lge.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29f9cb53
  5. 18 3月, 2016 2 次提交
    • J
      mm: scale kswapd watermarks in proportion to memory · 795ae7a0
      Johannes Weiner 提交于
      In machines with 140G of memory and enterprise flash storage, we have
      seen read and write bursts routinely exceed the kswapd watermarks and
      cause thundering herds in direct reclaim.  Unfortunately, the only way
      to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
      system's emergency reserves - which is entirely unrelated to the
      system's latency requirements.  In order to get kswapd to maintain a
      250M buffer of free memory, the emergency reserves need to be set to 1G.
      That is a lot of memory wasted for no good reason.
      
      On the other hand, it's reasonable to assume that allocation bursts and
      overall allocation concurrency scale with memory capacity, so it makes
      sense to make kswapd aggressiveness a function of that as well.
      
      Change the kswapd watermark scale factor from the currently fixed 25% of
      the tunable emergency reserve to a tunable 0.1% of memory.
      
      Beyond 1G of memory, this will produce bigger watermark steps than the
      current formula in default settings.  Ensure that the new formula never
      chooses steps smaller than that, i.e.  25% of the emergency reserve.
      
      On a 140G machine, this raises the default watermark steps - the
      distance between min and low, and low and high - from 16M to 143M.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      795ae7a0
    • V
      mm, compaction: introduce kcompactd · 698b1b30
      Vlastimil Babka 提交于
      Memory compaction can be currently performed in several contexts:
      
       - kswapd balancing a zone after a high-order allocation failure
       - direct compaction to satisfy a high-order allocation, including THP
         page fault attemps
       - khugepaged trying to collapse a hugepage
       - manually from /proc
      
      The purpose of compaction is two-fold.  The obvious purpose is to
      satisfy a (pending or future) high-order allocation, and is easy to
      evaluate.  The other purpose is to keep overal memory fragmentation low
      and help the anti-fragmentation mechanism.  The success wrt the latter
      purpose is more
      
      The current situation wrt the purposes has a few drawbacks:
      
       - compaction is invoked only when a high-order page or hugepage is not
         available (or manually).  This might be too late for the purposes of
         keeping memory fragmentation low.
       - direct compaction increases latency of allocations.  Again, it would
         be better if compaction was performed asynchronously to keep
         fragmentation low, before the allocation itself comes.
       - (a special case of the previous) the cost of compaction during THP
         page faults can easily offset the benefits of THP.
       - kswapd compaction appears to be complex, fragile and not working in
         some scenarios.  It could also end up compacting for a high-order
         allocation request when it should be reclaiming memory for a later
         order-0 request.
      
      To improve the situation, we should be able to benefit from an
      equivalent of kswapd, but for compaction - i.e. a background thread
      which responds to fragmentation and the need for high-order allocations
      (including hugepages) somewhat proactively.
      
      One possibility is to extend the responsibilities of kswapd, which could
      however complicate its design too much.  It should be better to let
      kswapd handle reclaim, as order-0 allocations are often more critical
      than high-order ones.
      
      Another possibility is to extend khugepaged, but this kthread is a
      single instance and tied to THP configs.
      
      This patch goes with the option of a new set of per-node kthreads called
      kcompactd, and lays the foundations, without introducing any new
      tunables.  The lifecycle mimics kswapd kthreads, including the memory
      hotplug hooks.
      
      For compaction, kcompactd uses the standard compaction_suitable() and
      ompact_finished() criteria and the deferred compaction functionality.
      Unlike direct compaction, it uses only sync compaction, as there's no
      allocation latency to minimize.
      
      This patch doesn't yet add a call to wakeup_kcompactd.  The kswapd
      compact/reclaim loop for high-order pages will be replaced by waking up
      kcompactd in the next patch with the description of what's wrong with
      the old approach.
      
      Waking up of the kcompactd threads is also tied to kswapd activity and
      follows these rules:
       - we don't want to affect any fastpaths, so wake up kcompactd only from
         the slowpath, as it's done for kswapd
       - if kswapd is doing reclaim, it's more important than compaction, so
         don't invoke kcompactd until kswapd goes to sleep
       - the target order used for kswapd is passed to kcompactd
      
      Future possible future uses for kcompactd include the ability to wake up
      kcompactd on demand in special situations, such as when hugepages are
      not available (currently not done due to __GFP_NO_KSWAPD) or when a
      fragmentation event (i.e.  __rmqueue_fallback()) occurs.  It's also
      possible to perform periodic compaction with kcompactd.
      
      [arnd@arndb.de: fix build errors with kcompactd]
      [paul.gortmaker@windriver.com: don't use modular references for non modular code]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      698b1b30
  6. 16 3月, 2016 3 次提交
    • J
      mm/compaction: speed up pageblock_pfn_to_page() when zone is contiguous · 7cf91a98
      Joonsoo Kim 提交于
      There is a performance drop report due to hugepage allocation and in
      there half of cpu time are spent on pageblock_pfn_to_page() in
      compaction [1].
      
      In that workload, compaction is triggered to make hugepage but most of
      pageblocks are un-available for compaction due to pageblock type and
      skip bit so compaction usually fails.  Most costly operations in this
      case is to find valid pageblock while scanning whole zone range.  To
      check if pageblock is valid to compact, valid pfn within pageblock is
      required and we can obtain it by calling pageblock_pfn_to_page().  This
      function checks whether pageblock is in a single zone and return valid
      pfn if possible.  Problem is that we need to check it every time before
      scanning pageblock even if we re-visit it and this turns out to be very
      expensive in this workload.
      
      Although we have no way to skip this pageblock check in the system where
      hole exists at arbitrary position, we can use cached value for zone
      continuity and just do pfn_to_page() in the system where hole doesn't
      exist.  This optimization considerably speeds up in above workload.
      
      Before vs After
        Max: 1096 MB/s vs 1325 MB/s
        Min: 635 MB/s 1015 MB/s
        Avg: 899 MB/s 1194 MB/s
      
      Avg is improved by roughly 30% [2].
      
      [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
      [2]: https://lkml.org/lkml/2015/12/9/23
      
      [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Reported-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NAaron Lu <aaron.lu@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cf91a98
    • J
      mm: workingset: per-cgroup cache thrash detection · 23047a96
      Johannes Weiner 提交于
      Cache thrash detection (see a528910e "mm: thrash detection-based
      file cache sizing" for details) currently only works on the system
      level, not inside cgroups.  Worse, as the refaults are compared to the
      global number of active cache, cgroups might wrongfully get all their
      refaults activated when their pages are hotter than those of others.
      
      Move the refault machinery from the zone to the lruvec, and then tag
      eviction entries with the memcg ID.  This makes the thrash detection
      work correctly inside cgroups.
      
      [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Reviewed-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23047a96
    • V
      mm, page_owner: print migratetype of page and pageblock, symbolic flags · 60f30350
      Vlastimil Babka 提交于
      The information in /sys/kernel/debug/page_owner includes the migratetype
      of the pageblock the page belongs to.  This is also checked against the
      page's migratetype (as declared by gfp_flags during its allocation), and
      the page is reported as Fallback if its migratetype differs from the
      pageblock's one.  t This is somewhat misleading because in fact fallback
      allocation is not the only reason why these two can differ.  It also
      doesn't direcly provide the page's migratetype, although it's possible
      to derive that from the gfp_flags.
      
      It's arguably better to print both page and pageblock's migratetype and
      leave the interpretation to the consumer than to suggest fallback
      allocation as the only possible reason.  While at it, we can print the
      migratetypes as string the same way as /proc/pagetypeinfo does, as some
      of the numeric values depend on kernel configuration.  For that, this
      patch moves the migratetype_names array from #ifdef CONFIG_PROC_FS part
      of mm/vmstat.c to mm/page_alloc.c and exports it.
      
      With the new format strings for flags, we can now also provide symbolic
      page and gfp flags in the /sys/kernel/debug/page_owner file.  This
      replaces the positional printing of page flags as single letters, which
      might have looked nicer, but was limited to a subset of flags, and
      required the user to remember the letters.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
        PFN 520 type Movable Block 1 type Movable Flags 0xfffff8001006c(referenced|uptodate|lru|active|mappedtodisk)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b4058>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116bfb1>] page_cache_sync_readahead+0x31/0x50
         [<ffffffff81160523>] generic_file_read_iter+0x453/0x760
         [<ffffffff811e0d57>] __vfs_read+0xa7/0xd0
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60f30350
  7. 04 2月, 2016 1 次提交
  8. 15 1月, 2016 4 次提交
  9. 07 11月, 2015 6 次提交
    • A
      include/linux/mmzone.h: reflow comment · 89903327
      Andrew Morton 提交于
      Someone has an 86 column display.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      89903327
    • M
      mm, page_alloc: reserve pageblocks for high-order atomic allocations on demand · 0aaa29a5
      Mel Gorman 提交于
      High-order watermark checking exists for two reasons -- kswapd high-order
      awareness and protection for high-order atomic requests.  Historically the
      kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
      high-order free pages for as long as possible.  This patch introduces
      MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
      allocations on demand and avoids using those blocks for order-0
      allocations.  This is more flexible and reliable than MIGRATE_RESERVE was.
      
      A MIGRATE_HIGHORDER pageblock is created when an atomic high-order
      allocation request steals a pageblock but limits the total number to 1% of
      the zone.  Callers that speculatively abuse atomic allocations for
      long-lived high-order allocations to access the reserve will quickly fail.
       Note that SLUB is currently not such an abuser as it reclaims at least
      once.  It is possible that the pageblock stolen has few suitable
      high-order pages and will need to steal again in the near future but there
      would need to be strong justification to search all pageblocks for an
      ideal candidate.
      
      The pageblocks are unreserved if an allocation fails after a direct
      reclaim attempt.
      
      The watermark checks account for the reserved pageblocks when the
      allocation request is not a high-order atomic allocation.
      
      The reserved pageblocks can not be used for order-0 allocations.  This may
      allow temporary wastage until a failed reclaim reassigns the pageblock.
      This is deliberate as the intent of the reservation is to satisfy a
      limited number of atomic high-order short-lived requests if the system
      requires them.
      
      The stutter benchmark was used to evaluate this but while it was running
      there was a systemtap script that randomly allocated between 1 high-order
      page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC.  This
      is much larger than the potential reserve and it does not attempt to be
      realistic.  It is intended to stress random high-order allocations from an
      unknown source, show that there is a reduction in failures without
      introducing an anomaly where atomic allocations are more reliable than
      regular allocations.  The amount of memory reserved varied throughout the
      workload as reserves were created and reclaimed under memory pressure.
      The allocation failures once the workload warmed up were as follows;
      
      4.2-rc5-vanilla		70%
      4.2-rc5-atomic-reserve	56%
      
      The failure rate was also measured while building multiple kernels.  The
      failure rate was 14% but is 6% with this patch applied.
      
      Overall, this is a small reduction but the reserves are small relative to
      the number of allocation requests.  In early versions of the patch, the
      failure rate reduced by a much larger amount but that required much larger
      reserves and perversely made atomic allocations seem more reliable than
      regular allocations.
      
      [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Nyalin wang <yalin.wang2010@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0aaa29a5
    • M
      mm, page_alloc: remove MIGRATE_RESERVE · 974a786e
      Mel Gorman 提交于
      MIGRATE_RESERVE preserves an old property of the buddy allocator that
      existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
      tended to remain contiguous until the only alternative was to fail the
      allocation.  At the time it was discovered that high-order atomic
      allocations relied on this property so MIGRATE_RESERVE was introduced.  A
      later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
      deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
      Note that this patch in isolation may look like a false regression if
      someone was bisecting high-order atomic allocation failures.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      974a786e
    • M
      mm, page_alloc: delete the zonelist_cache · f77cf4e4
      Mel Gorman 提交于
      The zonelist cache (zlc) was introduced to skip over zones that were
      recently known to be full.  This avoided expensive operations such as the
      cpuset checks, watermark calculations and zone_reclaim.  The situation
      today is different and the complexity of zlc is harder to justify.
      
      1) The cpuset checks are no-ops unless a cpuset is active and in general
         are a lot cheaper.
      
      2) zone_reclaim is now disabled by default and I suspect that was a large
         source of the cost that zlc wanted to avoid. When it is enabled, it's
         known to be a major source of stalling when nodes fill up and it's
         unwise to hit every other user with the overhead.
      
      3) Watermark checks are expensive to calculate for high-order
         allocation requests. Later patches in this series will reduce the cost
         of the watermark checking.
      
      4) The most important issue is that in the current implementation it
         is possible for a failed THP allocation to mark a zone full for order-0
         allocations and cause a fallback to remote nodes.
      
      The last issue could be addressed with additional complexity but as the
      benefit of zlc is questionable, it is better to remove it.  If stalls due
      to zone_reclaim are ever reported then an alternative would be to
      introduce deferring logic based on a timeout inside zone_reclaim itself
      and leave the page allocator fast paths alone.
      
      The impact on page-allocator microbenchmarks is negligible as they don't
      hit the paths where the zlc comes into play.  Most page-reclaim related
      workloads showed no noticeable difference as a result of the removal.
      
      The impact was noticeable in a workload called "stutter".  One part uses a
      lot of anonymous memory, a second measures mmap latency and a third copies
      a large file.  In an ideal world the latency application would not notice
      the mmap latency.  On a 2-node machine the results of this patch are
      
      stutter
                                   4.3.0-rc1             4.3.0-rc1
                                    baseline              nozlc-v4
      Min         mmap     20.9243 (  0.00%)     20.7716 (  0.73%)
      1st-qrtle   mmap     22.0612 (  0.00%)     22.0680 ( -0.03%)
      2nd-qrtle   mmap     22.3291 (  0.00%)     22.3809 ( -0.23%)
      3rd-qrtle   mmap     25.2244 (  0.00%)     25.2396 ( -0.06%)
      Max-90%     mmap     48.0995 (  0.00%)     28.3713 ( 41.02%)
      Max-93%     mmap     52.5557 (  0.00%)     36.0170 ( 31.47%)
      Max-95%     mmap     55.8173 (  0.00%)     47.3163 ( 15.23%)
      Max-99%     mmap     67.3781 (  0.00%)     70.1140 ( -4.06%)
      Max         mmap  24447.6375 (  0.00%)  12915.1356 ( 47.17%)
      Mean        mmap     33.7883 (  0.00%)     27.7944 ( 17.74%)
      Best99%Mean mmap     27.7825 (  0.00%)     25.2767 (  9.02%)
      Best95%Mean mmap     26.3912 (  0.00%)     23.7994 (  9.82%)
      Best90%Mean mmap     24.9886 (  0.00%)     23.2251 (  7.06%)
      Best50%Mean mmap     22.0157 (  0.00%)     22.0261 ( -0.05%)
      Best10%Mean mmap     21.6705 (  0.00%)     21.6083 (  0.29%)
      Best5%Mean  mmap     21.5581 (  0.00%)     21.4611 (  0.45%)
      Best1%Mean  mmap     21.3079 (  0.00%)     21.1631 (  0.68%)
      
      Note that the maximum stall latency went from 24 seconds to 12 which is
      still bad but an improvement.  The milage varies considerably 2-node
      machine on an earlier test went from 494 seconds to 47 seconds and a
      4-node machine that tested an earlier version of this patch went from a
      worst case stall time of 6 seconds to 67ms.  The nature of the benchmark
      is inherently unpredictable as it is hammering the system and the milage
      will vary between machines.
      
      There is a secondary impact with potentially more direct reclaim because
      zones are now being considered instead of being skipped by zlc.  In this
      particular test run it did not occur so will not be described.  However,
      in at least one test the following was observed
      
      1. Direct reclaim rates were higher. This was likely due to direct reclaim
        being entered instead of the zlc disabling a zone and busy looping.
        Busy looping may have the effect of allowing kswapd to make more
        progress and in some cases may be better overall. If this is found then
        the correct action is to put direct reclaimers to sleep on a waitqueue
        and allow kswapd make forward progress. Busy looping on the zlc is even
        worse than when the allocator used to blindly call congestion_wait().
      
      2. There was higher swap activity as direct reclaim was active.
      
      3. Direct reclaim efficiency was lower. This is related to 1 as more
        scanning activity also encountered more pages that could not be
        immediately reclaimed
      
      In that case, the direct page scan and reclaim rates are noticeable but
      it is not considered a problem for a few reasons
      
      1. The test is primarily concerned with latency. The mmap attempts are also
         faulted which means there are THP allocation requests. The ZLC could
         cause zones to be disabled causing the process to busy loop instead
         of reclaiming.  This looks like elevated direct reclaim activity but
         it's the correct action to take based on what processes requested.
      
      2. The test hammers reclaim and compaction heavily. The number of successful
         THP faults is highly variable but affects the reclaim stats. It's not a
         realistic or reasonable measure of page reclaim activity.
      
      3. No other page-reclaim intensive workload that was tested showed a problem.
      
      4. If a workload is identified that benefitted from the busy looping then it
         should be fixed by having direct reclaimers sleep on a wait queue until
         woken by kswapd instead of busy looping. We had this class of problem before
         when congestion_waits() with a fixed timeout was a brain damaged decision
         but happened to benefit some workloads.
      
      If a workload is identified that relied on the zlc to busy loop then it
      should be fixed correctly and have a direct reclaimer sleep on a waitqueue
      until woken by kswapd.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f77cf4e4
    • M
      mm, page_alloc: use masks and shifts when converting GFP flags to migrate types · 016c13da
      Mel Gorman 提交于
      This patch redefines which GFP bits are used for specifying mobility and
      the order of the migrate types.  Once redefined it's possible to convert
      GFP flags to a migrate type with a simple mask and shift.  The only
      downside is that readers of OOM kill messages and allocation failures may
      have been used to the existing values but scripts/gfp-translate will help.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      016c13da
    • M
      mm, page_alloc: remove unnecessary parameter from zone_watermark_ok_safe · e2b19197
      Mel Gorman 提交于
      Overall, the intent of this series is to remove the zonelist cache which
      was introduced to avoid high overhead in the page allocator.  Once this is
      done, it is necessary to reduce the cost of watermark checks.
      
      The series starts with minor micro-optimisations.
      
      Next it notes that GFP flags that affect watermark checks are abused.
      __GFP_WAIT historically identified callers that could not sleep and could
      access reserves.  This was later abused to identify callers that simply
      prefer to avoid sleeping and have other options.  A patch distinguishes
      between atomic callers, high-priority callers and those that simply wish
      to avoid sleep.
      
      The zonelist cache has been around for a long time but it is of dubious
      merit with a lot of complexity and some issues that are explained.  The
      most important issue is that a failed THP allocation can cause a zone to
      be treated as "full".  This potentially causes unnecessary stalls, reclaim
      activity or remote fallbacks.  The issues could be fixed but it's not
      worth it.  The series places a small number of other micro-optimisations
      on top before examining GFP flags watermarks.
      
      High-order watermarks enforcement can cause high-order allocations to fail
      even though pages are free.  The watermark checks both protect high-order
      atomic allocations and make kswapd aware of high-order pages but there is
      a much better way that can be handled using migrate types.  This series
      uses page grouping by mobility to reserve pageblocks for high-order
      allocations with the size of the reservation depending on demand.  kswapd
      awareness is maintained by examining the free lists.  By patch 12 in this
      series, there are no high-order watermark checks while preserving the
      properties that motivated the introduction of the watermark checks.
      
      This patch (of 10):
      
      No user of zone_watermark_ok_safe() specifies alloc_flags.  This patch
      removes the unnecessary parameter.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2b19197
  10. 06 11月, 2015 1 次提交
  11. 05 9月, 2015 1 次提交
  12. 28 8月, 2015 1 次提交
    • D
      mm: ZONE_DEVICE for "device memory" · 033fbae9
      Dan Williams 提交于
      While pmem is usable as a block device or via DAX mappings to userspace
      there are several usage scenarios that can not target pmem due to its
      lack of struct page coverage. In preparation for "hot plugging" pmem
      into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
      separately from the ones that are subject to standard page allocations.
      Importantly "device memory" can be removed at will by userspace
      unbinding the driver of the device.
      
      Having a separate zone prevents allocation and otherwise marks these
      pages that are distinct from typical uniform memory.  Device memory has
      different lifetime and performance characteristics than RAM.  However,
      since we have run out of ZONES_SHIFT bits this functionality currently
      depends on sacrificing ZONE_DMA.
      
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Jerome Glisse <j.glisse@gmail.com>
      [hch: various simplifications in the arch interface]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      033fbae9
  13. 01 7月, 2015 3 次提交
  14. 16 4月, 2015 1 次提交
  15. 08 4月, 2015 1 次提交
    • M
      mm: move zone lock to a different cache line than order-0 free page lists · a368ab67
      Mel Gorman 提交于
      Huang Ying reported the following problem due to commit 3484b2de ("mm:
      rearrange zone fields into read-only, page alloc, statistics and page
      reclaim lines") from the Intel performance tests
      
          24b7e581  3484b2de
          ----------------  --------------------------
                   %stddev     %change         %stddev
                       \          |                \
              152288 \261  0%     -46.2%      81911 \261  0%  aim7.jobs-per-min
                 237 \261  0%     +85.6%        440 \261  0%  aim7.time.elapsed_time
                 237 \261  0%     +85.6%        440 \261  0%  aim7.time.elapsed_time.max
               25026 \261  0%     +70.7%      42712 \261  0%  aim7.time.system_time
             2186645 \261  5%     +32.0%    2885949 \261  4%  aim7.time.voluntary_context_switches
             4576561 \261  1%     +24.9%    5715773 \261  0%  aim7.time.involuntary_context_switches
      
      The problem is specific to very large machines under stress.  It was not
      reproducible with the machines I had used to justify the original patch
      because large numbers of CPUs are required.  When pressure is high enough,
      the cache line is bouncing between CPUs trying to acquire the lock and the
      holder of the lock adjusting free lists.  The intention was that the
      acquirer of the lock would automatically have the cache line holding the
      free lists but according to Huang, this is not a universal win.
      
      One possibility is to move the zone lock to its own cache line but it
      increases the size of the zone.  This patch moves the lock to the other
      end of the free lists where they do not contend under high pressure.  It
      does mean the page allocator paths now require more cache lines but Huang
      reports that it restores performance to previous levels on large machines
      
                   %stddev     %change         %stddev
                       \          |                \
               84568 \261  1%     +94.3%     164280 \261  1%  aim7.jobs-per-min
             2881944 \261  2%     -35.1%    1870386 \261  8%  aim7.time.voluntary_context_switches
                 681 \261  1%      -3.4%        658 \261  0%  aim7.time.user_time
             5538139 \261  0%     -12.1%    4867884 \261  0%  aim7.time.involuntary_context_switches
               44174 \261  1%     -46.0%      23848 \261  1%  aim7.time.system_time
                 426 \261  1%     -48.4%        219 \261  1%  aim7.time.elapsed_time
                 426 \261  1%     -48.4%        219 \261  1%  aim7.time.elapsed_time.max
                 468 \261  1%     -43.1%        266 \261  2%  uptime.boot
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NHuang Ying <ying.huang@intel.com>
      Tested-by: NHuang Ying <ying.huang@intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a368ab67
  16. 12 2月, 2015 2 次提交
    • V
      mm: microoptimize zonelist operations · 05891fb0
      Vlastimil Babka 提交于
      next_zones_zonelist() returns a zoneref pointer, as well as a zone pointer
      via extra parameter.  Since the latter can be trivially obtained by
      dereferencing the former, the overhead of the extra parameter is
      unjustified.
      
      This patch thus removes the zone parameter from next_zones_zonelist().
      Both callers happen to be in the same header file, so it's simple to add
      the zoneref dereference inline.  We save some bytes of code size.
      
      add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-105 (-105)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_nodemask                      2300    2285     -15
      get_page_from_freelist                      2652    2576     -76
      
      add/remove: 0/0 grow/shrink: 1/0 up/down: 10/0 (10)
      function                                     old     new   delta
      try_to_compact_pages                         569     579     +10
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05891fb0
    • B
      mm: fix typo of MIGRATE_RESERVE in comment · 44628d97
      Baoquan He 提交于
      Found it when I want to jump to the definition of MIGRATE_RESERVE ctags.
      Signed-off-by: NBaoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44628d97