1. 07 Jul 2017 (1 commit)
  2. 29 Jul 2016 (4 commits)
    • mm: vmstat: replace __count_zone_vm_events with a zone id equivalent · 16709d1d
      Committed by Mel Gorman
      This is partially a preparation patch for more vmstat work but it also
      has the slight advantage that __count_zid_vm_events is cheaper to
      calculate than __count_zone_vm_events().
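
      For reference, the shape of the two helpers is roughly as follows (a
      sketch of the idea, not the exact kernel macros):

        /* Old helper: derives the zone id from a zone pointer on every call. */
        #define __count_zone_vm_events(item, zone, delta) \
                __count_vm_events(item##_NORMAL - ZONE_NORMAL + zone_idx(zone), delta)

        /* New helper: takes the zone id directly and skips the zone_idx() lookup. */
        #define __count_zid_vm_events(item, zid, delta) \
                __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)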
      
      Link: http://lkml.kernel.org/r/1467970510-21195-32-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, workingset: make working set detection node-aware · 1e6b1085
      Committed by Mel Gorman
      Working set and refault detection are still zone-based; make them node-aware.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-16-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmscan: move LRU lists to node · 599d0c95
      Committed by Mel Gorman
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to the reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both a zone and a
      node basis.  Most reclaim logic is based on the node counters but the
      retry logic uses the zone counters, which do not distinguish inactive
      and active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but that is a heavier calculation across multiple cache
      lines and is performed much more frequently than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but use per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes some odd behaviour that is fixed
      by later patches but keeps this patch easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the LRU lists being node-based, there are two potential solutions:
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages would get continually scanned. The idea would be that
         lowmem reclaim keeps those pages on a separate list until a reclaim for
         highmem pages arrives and splices the highmem pages back onto the LRU.
         It could potentially be implemented similarly to the UNEVICTABLE list.

         That would reduce the skip rate, with the potential corner case that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, vmstat: add infrastructure for per-node vmstats · 75ef7184
      Committed by Mel Gorman
      Patchset: "Move LRU page reclaim from zones to nodes v9"
      
      This series moves LRUs from the zones to the node.  While this is a
      current rebase, the test results were based on mmotm as of June 23rd.
      Conceptually, this series is simple but there are a lot of details.
      Some of the broad motivations for this are;
      
      1. The residency of a page partially depends on what zone the page was
         allocated from.  This is partially combatted by the fair zone allocation
         policy but that is a partial solution that introduces overhead in the
         page allocator paths.
      
      2. Currently, reclaim on node 0 behaves slightly differently to reclaim
         on node 1. For example, direct reclaim scans in zonelist order and
         reclaims even if the zone is over the high watermark regardless of the
         age of pages in that LRU. Kswapd on the other hand starts reclaim on
         the highest unbalanced zone. A difference in the distribution of
         file/anon pages due to when they were allocated can result in a
         difference in aging. While the fair zone allocation policy mitigates
         some of the problems here, the page reclaim results on a multi-zone
         node will always be different to those on a single-zone node, so a
         workload can see different reclaim behaviour depending on which node
         it was scheduled on.
      
      3. kswapd and the page allocator scan zones in the opposite order to
         avoid interfering with each other. In the ideal case this stops the
         page allocator using pages that were allocated very recently, but it's
         sensitive to timing. When kswapd is reclaiming from the lower zones it
         works well, but during the rebalancing of the highest zone the page
         allocator and kswapd interfere with each other. It's worse if the
         highest zone is small and difficult to balance.
      
      4. slab shrinkers are node-based which makes it harder to identify the exact
         relationship between slab reclaim and LRU reclaim.
      
      The reason we have zone-based reclaim is that we used to have
      large highmem zones in common configurations and it was necessary
      to quickly find ZONE_NORMAL pages for reclaim. Today, this is much
      less of a concern as machines with lots of memory will (or should) use
      64-bit kernels. Combinations of 32-bit hardware and 64-bit hardware are
      rare. Machines that do use highmem should have lower highmem:lowmem
      ratios than we worried about in the past.
      
      Conceptually, moving to node LRUs should be easier to understand. The
      page allocator plays fewer tricks to game reclaim and reclaim behaves
      similarly on all nodes.
      
      The series has been tested on a 16 core UMA machine and a 2-socket 48
      core NUMA machine. The UMA results are presented in most cases as the NUMA
      machine behaved similarly.
      
      pagealloc
      ---------
      
      This is a microbenchmark that shows the benefit of removing the fair zone
      allocation policy. It was tested up to order-4 but only orders 0 and 1 are
      shown as the other orders were comparable.
      
                                                 4.7.0-rc4                  4.7.0-rc4
                                            mmotm-20160623                 nodelru-v9
      Min      total-odr0-1               490.00 (  0.00%)           457.00 (  6.73%)
      Min      total-odr0-2               347.00 (  0.00%)           329.00 (  5.19%)
      Min      total-odr0-4               288.00 (  0.00%)           273.00 (  5.21%)
      Min      total-odr0-8               251.00 (  0.00%)           239.00 (  4.78%)
      Min      total-odr0-16              234.00 (  0.00%)           222.00 (  5.13%)
      Min      total-odr0-32              223.00 (  0.00%)           211.00 (  5.38%)
      Min      total-odr0-64              217.00 (  0.00%)           208.00 (  4.15%)
      Min      total-odr0-128             214.00 (  0.00%)           204.00 (  4.67%)
      Min      total-odr0-256             250.00 (  0.00%)           230.00 (  8.00%)
      Min      total-odr0-512             271.00 (  0.00%)           269.00 (  0.74%)
      Min      total-odr0-1024            291.00 (  0.00%)           282.00 (  3.09%)
      Min      total-odr0-2048            303.00 (  0.00%)           296.00 (  2.31%)
      Min      total-odr0-4096            311.00 (  0.00%)           309.00 (  0.64%)
      Min      total-odr0-8192            316.00 (  0.00%)           314.00 (  0.63%)
      Min      total-odr0-16384           317.00 (  0.00%)           315.00 (  0.63%)
      Min      total-odr1-1               742.00 (  0.00%)           712.00 (  4.04%)
      Min      total-odr1-2               562.00 (  0.00%)           530.00 (  5.69%)
      Min      total-odr1-4               457.00 (  0.00%)           433.00 (  5.25%)
      Min      total-odr1-8               411.00 (  0.00%)           381.00 (  7.30%)
      Min      total-odr1-16              381.00 (  0.00%)           356.00 (  6.56%)
      Min      total-odr1-32              372.00 (  0.00%)           346.00 (  6.99%)
      Min      total-odr1-64              372.00 (  0.00%)           343.00 (  7.80%)
      Min      total-odr1-128             375.00 (  0.00%)           351.00 (  6.40%)
      Min      total-odr1-256             379.00 (  0.00%)           351.00 (  7.39%)
      Min      total-odr1-512             385.00 (  0.00%)           355.00 (  7.79%)
      Min      total-odr1-1024            386.00 (  0.00%)           358.00 (  7.25%)
      Min      total-odr1-2048            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-4096            390.00 (  0.00%)           362.00 (  7.18%)
      Min      total-odr1-8192            388.00 (  0.00%)           363.00 (  6.44%)
      
      This shows a steady improvement throughout. The primary benefit is from
      reduced system CPU usage which is obvious from the overall times;
      
                 4.7.0-rc4   4.7.0-rc4
        mmotm-20160623  nodelru-v8
      User          189.19      191.80
      System       2604.45     2533.56
      Elapsed      2855.30     2786.39
      
      The vmstats also showed that the fair zone allocation policy was definitely
      removed as can be seen here;
      
                                   4.7.0-rc3   4.7.0-rc3
                               mmotm-20160623 nodelru-v8
      DMA32 allocs               28794729769           0
      Normal allocs              48432501431 77227309877
      Movable allocs                       0           0
      
      tiobench on ext4
      ----------------
      
      tiobench is a benchmark that artificially benefits if old pages remain resident
      while new pages get reclaimed. The fair zone allocation policy mitigates this
      problem so pages age fairly. While the benchmark has problems, it is important
      that tiobench performance remains constant as it implies that the page aging
      problems that the fair zone allocation policy fixes are not re-introduced.
      
                                               4.7.0-rc4             4.7.0-rc4
                                          mmotm-20160623            nodelru-v9
      Min      PotentialReadSpeed        89.65 (  0.00%)       90.21 (  0.62%)
      Min      SeqRead-MB/sec-1          82.68 (  0.00%)       82.01 ( -0.81%)
      Min      SeqRead-MB/sec-2          72.76 (  0.00%)       72.07 ( -0.95%)
      Min      SeqRead-MB/sec-4          75.13 (  0.00%)       74.92 ( -0.28%)
      Min      SeqRead-MB/sec-8          64.91 (  0.00%)       65.19 (  0.43%)
      Min      SeqRead-MB/sec-16         62.24 (  0.00%)       62.22 ( -0.03%)
      Min      RandRead-MB/sec-1          0.88 (  0.00%)        0.88 (  0.00%)
      Min      RandRead-MB/sec-2          0.95 (  0.00%)        0.92 ( -3.16%)
      Min      RandRead-MB/sec-4          1.43 (  0.00%)        1.34 ( -6.29%)
      Min      RandRead-MB/sec-8          1.61 (  0.00%)        1.60 ( -0.62%)
      Min      RandRead-MB/sec-16         1.80 (  0.00%)        1.90 (  5.56%)
      Min      SeqWrite-MB/sec-1         76.41 (  0.00%)       76.85 (  0.58%)
      Min      SeqWrite-MB/sec-2         74.11 (  0.00%)       73.54 ( -0.77%)
      Min      SeqWrite-MB/sec-4         80.05 (  0.00%)       80.13 (  0.10%)
      Min      SeqWrite-MB/sec-8         72.88 (  0.00%)       73.20 (  0.44%)
      Min      SeqWrite-MB/sec-16        75.91 (  0.00%)       76.44 (  0.70%)
      Min      RandWrite-MB/sec-1         1.18 (  0.00%)        1.14 ( -3.39%)
      Min      RandWrite-MB/sec-2         1.02 (  0.00%)        1.03 (  0.98%)
      Min      RandWrite-MB/sec-4         1.05 (  0.00%)        0.98 ( -6.67%)
      Min      RandWrite-MB/sec-8         0.89 (  0.00%)        0.92 (  3.37%)
      Min      RandWrite-MB/sec-16        0.92 (  0.00%)        0.93 (  1.09%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 approx-v9
      User          645.72      525.90
      System        403.85      331.75
      Elapsed      6795.36     6783.67
      
      This shows that the series has little or no impact on tiobench, which is
      desirable, along with a reduction in system CPU usage. It indicates that the
      fair zone allocation policy was removed in a manner that didn't reintroduce
      one class of page aging bug. There were only minor differences in overall
      reclaim activity:
      
                                   4.7.0-rc4   4.7.0-rc4
                             mmotm-20160623  nodelru-v8
      Minor Faults                    645838      647465
      Major Faults                       573         640
      Swap Ins                             0           0
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                  46041453    44190646
      Normal allocs                 78053072    79887245
      Movable allocs                       0           0
      Allocation stalls                   24          67
      Stall zone DMA                       0           0
      Stall zone DMA32                     0           0
      Stall zone Normal                    0           2
      Stall zone HighMem                   0           0
      Stall zone Movable                   0          65
      Direct pages scanned             10969       30609
      Kswapd pages scanned          93375144    93492094
      Kswapd pages reclaimed        93372243    93489370
      Direct pages reclaimed           10969       30609
      Kswapd efficiency                  99%         99%
      Kswapd velocity              13741.015   13781.934
      Direct efficiency                 100%        100%
      Direct velocity                  1.614       4.512
      Percentage direct scans             0%          0%
      
      kswapd activity was roughly comparable. There were differences in direct
      reclaim activity but negligible in the context of the overall workload
      (velocity of 4 pages per second with the patches applied, 1.6 pages per
      second in the baseline kernel).
      
      pgbench read-only large configuration on ext4
      ---------------------------------------------
      
      pgbench is a database benchmark that can be sensitive to page reclaim
      decisions. This also checks if removing the fair zone allocation policy
      is safe
      
      pgbench Transactions
                              4.7.0-rc4             4.7.0-rc4
                         mmotm-20160623            nodelru-v8
      Hmean    1       188.26 (  0.00%)      189.78 (  0.81%)
      Hmean    5       330.66 (  0.00%)      328.69 ( -0.59%)
      Hmean    12      370.32 (  0.00%)      380.72 (  2.81%)
      Hmean    21      368.89 (  0.00%)      369.00 (  0.03%)
      Hmean    30      382.14 (  0.00%)      360.89 ( -5.56%)
      Hmean    32      428.87 (  0.00%)      432.96 (  0.95%)
      
      Negligible differences again. As with tiobench, overall reclaim activity
      was comparable.
      
      bonnie++ on ext4
      ----------------
      
      No interesting performance difference, negligible differences on reclaim
      stats.
      
      paralleldd on ext4
      ------------------
      
      This workload uses varying numbers of dd instances to read large amounts of
      data from disk.
      
                                     4.7.0-rc3             4.7.0-rc3
                                mmotm-20160623            nodelru-v9
      Amean    Elapsd-1       186.04 (  0.00%)      189.41 ( -1.82%)
      Amean    Elapsd-3       192.27 (  0.00%)      191.38 (  0.46%)
      Amean    Elapsd-5       185.21 (  0.00%)      182.75 (  1.33%)
      Amean    Elapsd-7       183.71 (  0.00%)      182.11 (  0.87%)
      Amean    Elapsd-12      180.96 (  0.00%)      181.58 ( -0.35%)
      Amean    Elapsd-16      181.36 (  0.00%)      183.72 ( -1.30%)
      
                 4.7.0-rc4   4.7.0-rc4
              mmotm-20160623 nodelru-v9
      User         1548.01     1552.44
      System       8609.71     8515.08
      Elapsed      3587.10     3594.54
      
      There is little or no change in performance but some drop in system CPU usage.
      
                                   4.7.0-rc3   4.7.0-rc3
                              mmotm-20160623  nodelru-v9
      Minor Faults                    362662      367360
      Major Faults                      1204        1143
      Swap Ins                            22           0
      Swap Outs                         2855        1029
      DMA allocs                           0           0
      DMA32 allocs                  31409797    28837521
      Normal allocs                 46611853    49231282
      Movable allocs                       0           0
      Direct pages scanned                 0           0
      Kswapd pages scanned          40845270    40869088
      Kswapd pages reclaimed        40830976    40855294
      Direct pages reclaimed               0           0
      Kswapd efficiency                  99%         99%
      Kswapd velocity              11386.711   11369.769
      Direct efficiency                 100%        100%
      Direct velocity                  0.000       0.000
      Percentage direct scans             0%          0%
      Page writes by reclaim            2855        1029
      Page writes file                     0           0
      Page writes anon                  2855        1029
      Page reclaim immediate             771        1628
      Sector Reads                 293312636   293536360
      Sector Writes                 18213568    18186480
      Page rescued immediate               0           0
      Slabs scanned                   128257      132747
      Direct inode steals                181          56
      Kswapd inode steals                 59        1131
      
      It basically shows that kswapd was active at roughly the same rate in
      both kernels. There was also comparable slab scanning activity and direct
      reclaim was avoided in both cases. There appears to be a large difference
      in the number of inodes reclaimed but the workload has few active inodes
      so this is likely a timing artifact.
      
      stutter
      -------
      
      stutter simulates a simple workload. One part uses a lot of anonymous
      memory, a second measures mmap latency and a third copies a large file.
      The primary metric is checking for mmap latency.
      
      stutter
                                   4.7.0-rc4             4.7.0-rc4
                              mmotm-20160623            nodelru-v8
      Min         mmap     16.6283 (  0.00%)     13.4258 ( 19.26%)
      1st-qrtle   mmap     54.7570 (  0.00%)     34.9121 ( 36.24%)
      2nd-qrtle   mmap     57.3163 (  0.00%)     46.1147 ( 19.54%)
      3rd-qrtle   mmap     58.9976 (  0.00%)     47.1882 ( 20.02%)
      Max-90%     mmap     59.7433 (  0.00%)     47.4453 ( 20.58%)
      Max-93%     mmap     60.1298 (  0.00%)     47.6037 ( 20.83%)
      Max-95%     mmap     73.4112 (  0.00%)     82.8719 (-12.89%)
      Max-99%     mmap     92.8542 (  0.00%)     88.8870 (  4.27%)
      Max         mmap   1440.6569 (  0.00%)    121.4201 ( 91.57%)
      Mean        mmap     59.3493 (  0.00%)     42.2991 ( 28.73%)
      Best99%Mean mmap     57.2121 (  0.00%)     41.8207 ( 26.90%)
      Best95%Mean mmap     55.9113 (  0.00%)     39.9620 ( 28.53%)
      Best90%Mean mmap     55.6199 (  0.00%)     39.3124 ( 29.32%)
      Best50%Mean mmap     53.2183 (  0.00%)     33.1307 ( 37.75%)
      Best10%Mean mmap     45.9842 (  0.00%)     20.4040 ( 55.63%)
      Best5%Mean  mmap     43.2256 (  0.00%)     17.9654 ( 58.44%)
      Best1%Mean  mmap     32.9388 (  0.00%)     16.6875 ( 49.34%)
      
      This shows a number of improvements with the worst-case outlier greatly
      improved.
      
      Some of the vmstats are interesting
      
                                   4.7.0-rc4   4.7.0-rc4
                          mmotm-20160623  nodelru-v8
      Swap Ins                           163         502
      Swap Outs                            0           0
      DMA allocs                           0           0
      DMA32 allocs                 618719206  1381662383
      Normal allocs                891235743   564138421
      Movable allocs                       0           0
      Allocation stalls                 2603           1
      Direct pages scanned            216787           2
      Kswapd pages scanned          50719775    41778378
      Kswapd pages reclaimed        41541765    41777639
      Direct pages reclaimed          209159           0
      Kswapd efficiency                  81%         99%
      Kswapd velocity              16859.554   14329.059
      Direct efficiency                  96%          0%
      Direct velocity                 72.061       0.001
      Percentage direct scans             0%          0%
      Page writes by reclaim         6215049           0
      Page writes file               6215049           0
      Page writes anon                     0           0
      Page reclaim immediate           70673          90
      Sector Reads                  81940800    81680456
      Sector Writes                100158984    98816036
      Page rescued immediate               0           0
      Slabs scanned                  1366954       22683
      
      While this is not guaranteed in all cases, this particular test showed
      a large reduction in direct reclaim activity. It's also worth noting
      that no page writes were issued from reclaim context.
      
      This series is not without its hazards. There are at least three areas
      that I'm concerned with, even though I could not reproduce any problems
      in those areas.
      
      1. Reclaim/compaction is going to be affected because the amount of reclaim is
         no longer targeted at a specific zone. Compaction works on a per-zone basis
         so there is no guarantee that reclaiming a few THPs' worth of pages will
         have a positive impact on compaction success rates.
      
      2. The Slab/LRU reclaim ratio is affected because the frequency the shrinkers
         are called is now different. This may or may not be a problem but if it
         is, it'll be because shrinkers are not called enough and some balancing
         is required.
      
      3. The anon/file reclaim ratio may be affected. Pages about to be dirtied are
         distributed between zones and the fair zone allocation policy used to do
         something very similar for anon. The distribution is now different, not
         necessarily in any way that matters, but it's still worth bearing in mind.
      
      VM statistic counters for reclaim decisions are zone-based.  If the kernel
      is to reclaim on a per-node basis then we need to track per-node
      statistics but there is no infrastructure for that.  The most notable
      change is that the old node_page_state is renamed to
      sum_zone_node_page_state.  The new node_page_state takes a pglist_data and
      uses per-node stats but none exist yet.  There is some renaming such as
      vm_stat to vm_zone_stat and the addition of vm_node_stat and the renaming
      of mod_state to mod_zone_state.  Otherwise, this is mostly a mechanical
      patch with no functional change.  There is a lot of similarity between the
      node and zone helpers which is unfortunate but there was no obvious way of
      reusing the code and maintaining type safety.
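
      To illustrate the split, the two lookup helpers end up with roughly the
      following shapes (signatures approximate, shown only to clarify the rename):

        /* Old node_page_state(): sums the zone counters of every zone on a node. */
        unsigned long sum_zone_node_page_state(int node, enum zone_stat_item item);

        /* New node_page_state(): reads a true per-node counter from the pgdat. */
        unsigned long node_page_state(struct pglist_data *pgdat,
                                      enum node_stat_item item);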
      
      Link: http://lkml.kernel.org/r/1467970510-21195-2-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 20 May 2016 (2 commits)
    • mm, page_alloc: inline zone_statistics · 060e7417
      Committed by Mel Gorman
      zone_statistics has one call-site but it's a public function.  Make it
      static and inline.
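
      A minimal sketch of that kind of change (illustrative, not the actual diff):

        /* Before: a global function, callable from other translation units. */
        void zone_statistics(struct zone *preferred_zone, struct zone *z, gfp_t flags);

        /* After: visible only to its single caller in mm/page_alloc.c and
         * trivially inlinable there. */
        static inline void zone_statistics(struct zone *preferred_zone,
                                           struct zone *z, gfp_t flags)
        {
                /* ...NUMA hit/miss/local/other accounting as before... */
        }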
      
      The performance difference on a page allocator microbenchmark is;
      
                                                   4.6.0-rc2                  4.6.0-rc2
                                            statbranch-v1r20           statinline-v1r20
        Min      alloc-odr0-1               419.00 (  0.00%)           412.00 (  1.67%)
        Min      alloc-odr0-2               305.00 (  0.00%)           301.00 (  1.31%)
        Min      alloc-odr0-4               250.00 (  0.00%)           247.00 (  1.20%)
        Min      alloc-odr0-8               219.00 (  0.00%)           215.00 (  1.83%)
        Min      alloc-odr0-16              203.00 (  0.00%)           199.00 (  1.97%)
        Min      alloc-odr0-32              195.00 (  0.00%)           191.00 (  2.05%)
        Min      alloc-odr0-64              191.00 (  0.00%)           187.00 (  2.09%)
        Min      alloc-odr0-128             189.00 (  0.00%)           185.00 (  2.12%)
        Min      alloc-odr0-256             198.00 (  0.00%)           193.00 (  2.53%)
        Min      alloc-odr0-512             210.00 (  0.00%)           207.00 (  1.43%)
        Min      alloc-odr0-1024            216.00 (  0.00%)           213.00 (  1.39%)
        Min      alloc-odr0-2048            221.00 (  0.00%)           220.00 (  0.45%)
        Min      alloc-odr0-4096            227.00 (  0.00%)           226.00 (  0.44%)
        Min      alloc-odr0-8192            232.00 (  0.00%)           229.00 (  1.29%)
        Min      alloc-odr0-16384           232.00 (  0.00%)           229.00 (  1.29%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: /proc/sys/vm/stat_refresh to force vmstat update · 52b6f46b
      Committed by Hugh Dickins
      Provide /proc/sys/vm/stat_refresh to force an immediate update of
      per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
      before checking counts when testing.  Originally added to work around a
      bug which left counts stranded indefinitely on a cpu going idle (an
      inaccuracy magnified when small below-batch numbers represent "huge"
      amounts of memory), but I believe that bug is now fixed: nonetheless,
      this is still a useful knob.
      
      Its schedule_on_each_cpu() is probably too expensive just to fold into
      reading /proc/meminfo itself: give this mode 0600 to prevent abuse.
      Allow a write or a read to do the same: nothing to read, but "grep -h
      Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is convenient.  Oh, and
      since global_page_state() itself is careful to disguise any underflow as
      0, hack in an "Invalid argument" and pr_warn() if a counter is negative
      after the refresh - this helped to fix a misaccounting of
      NR_ISOLATED_FILE in my migration code.
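
      For completeness, triggering the refresh from a C test program might look
      roughly like this (a sketch; requires the privileges implied by mode 0600):

        #include <fcntl.h>
        #include <unistd.h>

        /* Reading (or writing) the file forces the per-cpu deltas to be folded
         * into the global vmstat counters before we go on to sample
         * /proc/meminfo or /proc/vmstat. */
        static void vm_stat_refresh(void)
        {
                char buf[1];
                int fd = open("/proc/sys/vm/stat_refresh", O_RDONLY);

                if (fd >= 0) {
                        read(fd, buf, sizeof(buf)); /* returns 0 bytes; side effect only */
                        close(fd);
                }
        }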
      
      But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
      often go negative some of the time.  I have not yet worked out why, but
      have no evidence that it's actually harmful.  Punt for the moment by
      just ignoring the anomaly on those.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Yang Shi <yang.shi@linaro.org>
      Cc: Ning Qu <quning@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 15 Jan 2016 (1 commit)
    • vmstat: make vmstat_updater deferrable again and shut down on idle · 0eb77e98
      Committed by Christoph Lameter
      Currently the vmstat updater is not deferrable as a result of commit
      ba4877b9 ("vmstat: do not use deferrable delayed work for
      vmstat_update").  This in turn can cause multiple interruptions of the
      applications because the vmstat updater may run at any time.

      Make vmstat_update deferrable again and provide a function that folds
      the differentials when the processor is going into idle mode, thus
      addressing the issue of the above commit in a clean way.
      
      Note that the shepherd thread will continue scanning the differentials
      from another processor and will reenable the vmstat workers if it
      detects any changes.
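
      The mechanics are roughly as below; this is a simplified sketch with
      illustrative names (the helper names and the exact fold call are
      assumptions, not the real diff):

        static DEFINE_PER_CPU(struct delayed_work, vmstat_work);

        /* Use deferrable delayed work so that an idle CPU is not woken up
         * purely to run the periodic vmstat update. */
        static void start_vmstat_update(int cpu)
        {
                INIT_DEFERRABLE_WORK(per_cpu_ptr(&vmstat_work, cpu), vmstat_update);
                schedule_delayed_work_on(cpu, per_cpu_ptr(&vmstat_work, cpu),
                                         round_jiffies_relative(HZ));
        }

        /* Called when this CPU is about to enter idle: fold its differentials
         * into the global counters now and cancel the pending deferred work.
         * The shepherd re-arms the work later if new differentials appear. */
        static void vmstat_fold_on_idle(void)
        {
                struct delayed_work *dw = this_cpu_ptr(&vmstat_work);

                if (delayed_work_pending(dw))
                        cancel_delayed_work(dw);
                /* ...fold this CPU's vm_stat_diff[] into the global vm_stat... */
        }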
      
      Fixes: ba4877b9 ("vmstat: do not use deferrable delayed work for vmstat_update")
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 30 Dec 2015 (1 commit)
    • mm/vmstat: fix overflow in mod_zone_page_state() · 6cdb18ad
      Committed by Heiko Carstens
      mod_zone_page_state() takes a "delta" integer argument.  delta contains
      the number of pages that should be added or subtracted from a struct
      zone's vm_stat field.
      
      If a zone is larger than 8TB this will cause overflows.  E.g.  for a
      zone with a size slightly larger than 8TB the line
      
          mod_zone_page_state(zone, NR_ALLOC_BATCH, zone->managed_pages);
      
      in mm/page_alloc.c:free_area_init_core() will result in a negative
      result for the NR_ALLOC_BATCH entry within the zone's vm_stat, since 8TB
      contain 0x8xxxxxxx pages which will be sign extended to a negative
      value.
      
      Fix this by changing the delta argument to long type.
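
      A sketch of the change and the failure mode it avoids (illustrative, not
      the exact diff):

        /* Before: a zone with more than 2^31 pages (> 8TB with 4K pages) cannot
         * be represented; the page count ends up as a negative int. */
        void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                                 int delta);

        /* After: delta is a long, so zone->managed_pages fits on 64-bit systems. */
        void mod_zone_page_state(struct zone *zone, enum zone_stat_item item,
                                 long delta);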
      
      This could fix an early boot problem seen on s390, where we have a 9TB
      system with only one node.  ZONE_DMA contains 2GB and ZONE_NORMAL the
      rest.  The system is trying to allocate a GFP_DMA page but ZONE_DMA is
      completely empty, so it tries to reclaim pages in an endless loop.
      
      This was seen on a heavily patched 3.10 kernel.  One possible
      explanation seems to be the overflows caused by mod_zone_page_state().
      Unfortunately I did not have the chance to verify that this patch
      actually fixes the problem, since I don't have access to the system
      right now.  However the overflow problem does exist anyway.
      
      Given the description that a system with slightly less than 8TB does
      work, this seems to be a candidate for the observed problem.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 06 Nov 2015 (2 commits)
  7. 05 Jun 2014 (1 commit)
  8. 08 Apr 2014 (1 commit)
  9. 04 Apr 2014 (1 commit)
    • mm: vmstat: fix UP zone state accounting · 6a3ed212
      Committed by Johannes Weiner
      Summary:
      
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used list
      "inactive list" and the frequently used list "active list".
      
      Currently, the VM aims for a 1:1 ratio between the lists, which is the
      "perfect" trade-off between the ability to *protect* frequently used
      pages and the ability to *detect* frequently used pages.  This means
      that working set changes bigger than half of cache memory go undetected
      and thrash indefinitely, whereas working sets bigger than half of cache
      memory are unprotected against used-once streams that don't even need
      caching.
      
      This happens on file servers and media streaming servers, where the
      popular files and file sections change over time.  Even though the
      individual files might be smaller than half of memory, concurrent access
      to many of them may still result in their inter-reference distance being
      greater than half of memory.  It's also been reported as a problem on
      database workloads that switch back and forth between tables that are
      bigger than half of memory.  In these cases the VM never recognizes the
      new working set and will for the remainder of the workload thrash disk
      data which could easily live in memory.
      
      Historically, every reclaim scan of the inactive list also took a
      smaller number of pages from the tail of the active list and moved them
      to the head of the inactive list.  This model gave established working
      sets more gracetime in the face of temporary use-once streams, but
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This series solves the problem by maintaining a history of pages evicted
      from the inactive list, enabling the VM to detect frequently used pages
      regardless of inactive list size and facilitate working set transitions.
      
      Tests:
      
      The reported database workload is easily demonstrated on an 8G machine
      with two 6G filesets.  This fio workload operates on one set first,
      then switches to the other.  The VM should obviously always cache the
      set that the workload is currently using.
      
      This test is based on a problem encountered by Citus Data customers:
        http://citusdata.com/blog/72-linux-memory-manager-and-your-big-data
      
      unpatched:
        db1: READ: io=98304MB, aggrb=885559KB/s, minb=885559KB/s, maxb=885559KB/s, mint= 113672msec, maxt= 113672msec
        db2: READ: io=98304MB, aggrb= 66169KB/s, minb= 66169KB/s, maxb= 66169KB/s, mint=1521302msec, maxt=1521302msec
        sdb: ios=835750/4, merge=2/1, ticks=4659739/60016, in_queue=4719203, util=98.92%
      
        real    27m15.541s
        user    0m19.059s
        sys     0m51.459s
      
      patched:
        db1: READ: io=98304MB, aggrb=877783KB/s, minb=877783KB/s, maxb=877783KB/s, mint=114679msec, maxt=114679msec
        db2: READ: io=98304MB, aggrb=397449KB/s, minb=397449KB/s, maxb=397449KB/s, mint=253273msec, maxt=253273msec
        sdb: ios=170587/4, merge=2/1, ticks=954910/61123, in_queue=1015923, util=90.40%
      
        real    6m8.630s
        user    0m14.714s
        sys     0m31.233s
      
      As can be seen, the unpatched kernel simply never adapts to the
      workingset change and db2 is stuck indefinitely with secondary storage
      speed.  The patched kernel needs 2-3 iterations over db2 before it
      replaces db1 and reaches full memory speed.  Given the unbounded
      negative affect of the existing VM behavior, these patches should be
      considered correctness fixes rather than performance optimizations.
      
      Another test resembles a fileserver or streaming server workload, where
      data in excess of memory size is accessed at different frequencies.
      There is very hot data accessed at a high frequency.  Machines should be
      fitted so that the hot set of such a workload can be fully cached or all
      bets are off.  Then there is a very big (compared to available memory)
      set of data that is used-once or at a very low frequency; this is what
      drives the inactive list and does not really benefit from caching.
      Lastly, there is a big set of warm data in between that is accessed at
      medium frequencies and benefits from caching the pages between the first
      and last streamer of each burst.
      
      unpatched:
         hot: READ: io=128000MB, aggrb=160693KB/s, minb=160693KB/s, maxb=160693KB/s, mint=815665msec, maxt=815665msec
        warm: READ: io= 81920MB, aggrb=109853KB/s, minb= 27463KB/s, maxb= 29244KB/s, mint=717110msec, maxt=763617msec
        cold: READ: io= 30720MB, aggrb= 35245KB/s, minb= 35245KB/s, maxb= 35245KB/s, mint=892530msec, maxt=892530msec
         sdb: ios=797960/4, merge=11763/1, ticks=4307910/796, in_queue=4308380, util=100.00%
      
      patched:
         hot: READ: io=128000MB, aggrb=160678KB/s, minb=160678KB/s, maxb=160678KB/s, mint=815740msec, maxt=815740msec
        warm: READ: io= 81920MB, aggrb=147747KB/s, minb= 36936KB/s, maxb= 40960KB/s, mint=512000msec, maxt=567767msec
        cold: READ: io= 30720MB, aggrb= 40960KB/s, minb= 40960KB/s, maxb= 40960KB/s, mint=768000msec, maxt=768000msec
         sdb: ios=596514/4, merge=9341/1, ticks=2395362/997, in_queue=2396484, util=79.18%
      
      In both kernels, the hot set is propagated to the active list and then
      served from cache.
      
      In both kernels, the beginning of the warm set is propagated to the
      active list as well, but in the unpatched case the active list
      eventually takes up half of memory and no new pages from the warm set
      get activated, despite repeated access, and despite most of the active
      list soon being stale.  The patched kernel on the other hand detects the
      thrashing and manages to keep this cache window rolling through the data
      set.  This frees up enough IO bandwidth that the cold set is served at
      full speed as well and disk utilization even drops by 20%.
      
      For reference, this same test was performed with the traditional
      demotion mechanism, where deactivation is coupled to inactive list
      reclaim.  However, this had the same outcome as the unpatched kernel:
      while the warm set does indeed get activated continuously, it is forced
      out of the active list by inactive list pressure, which is dictated
      primarily by the unrelated cold set.  The warm set is evicted before
      subsequent streamers can benefit from it, even though there would be
      enough space available to cache the pages of interest.
      
      Costs:
      
      Page reclaim used to shrink the radix trees but now the tree nodes are
      reused for shadow entries, where the cost depends heavily on the page
      cache access patterns.  However, with workloads that maintain spatial or
      temporal locality, the shadow entries are either refaulted quickly or
      reclaimed along with the inode object itself.  Workloads that will
      experience a memory cost increase are those that don't really benefit
      from caching in the first place.
      
      A more predictable alternative would be a fixed-cost separate pool of
      shadow entries, but this would incur relatively higher memory cost for
      well-behaved workloads at the benefit of cornercases.  It would also
      make the shadow entry lookup more costly compared to storing them
      directly in the cache structure.
      
      Future:
      
      To simplify the merging process, this patch set is implementing thrash
      detection on a global per-zone level only for now, but the design is
      such that it can be extended to memory cgroups as well.  All we need to
      do is store the unique cgroup ID along the node and zone identifier
      inside the eviction cookie to identify the lruvec.
      
      Right now we have a fixed ratio (50:50) between inactive and active list
      but we already have complaints about working sets exceeding half of
      memory being pushed out of the cache by simple streaming in the
      background.  Ultimately, we want to adjust this ratio and allow for a
      much smaller inactive list.  These patches are an essential step in this
      direction because they decouple the VM's ability to detect working set
      changes from the inactive list size.  This would allow us to base the
      inactive list size on the combined readahead window size for example and
      potentially protect a much bigger working set.
      
      It's also a big step towards activating pages with a reuse distance
      larger than memory, as long as they are the most frequently used pages
      in the workload.  This will require knowing more about the access
      frequency of active pages than what we measure right now, so it's also
      deferred in this series.
      
      Another possibility of having thrashing information would be to revisit
      the idea of local reclaim in the form of zero-config memory control
      groups.  Instead of having allocating tasks go straight to global
      reclaim, they could try to reclaim the pages in the memcg they are part
      of first as long as the group is not thrashing.  This would allow a user
      to drop e.g.  a back-up job in an otherwise unconfigured memcg and it
      would only inflate (and possibly do global reclaim) until it has enough
      memory to do proper readahead.  But once it reaches that point and stops
      thrashing it would just recycle its own used-once pages without kicking
      out the cache of any other tasks in the system more than necessary.
      
      This patch (of 10):
      
      Fengguang Wu's build testing spotted problems with inc_zone_state() and
      dec_zone_state() on UP configurations in out-of-tree patches.
      
      inc_zone_state() is declared but not defined, dec_zone_state() is
      missing entirely.
      
      Just like with *_zone_page_state(), they can be defined like their
      preemption-unsafe counterparts on UP.
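
      A sketch of what the UP definitions boil down to (simplified; the real
      header differs in detail):

        /* On UP there is no preemption-safety concern, so the "safe" variants
         * can simply alias the __-prefixed ones, just like *_zone_page_state(). */
        #define inc_zone_state __inc_zone_state
        #define dec_zone_state __dec_zone_state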
      
      [akpm@linux-foundation.org: make it build]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 30 Jan 2014 (1 commit)
  11. 25 Jan 2014 (1 commit)
  12. 12 Sep 2013 (2 commits)
    • mm: vmscan: fix do_try_to_free_pages() livelock · 6e543d57
      Committed by Lisa Du
      This patch is based on KOSAKI's work and I added a little more description;
      please refer to https://lkml.org/lkml/2012/6/14/74.
      
      Currently, I found the system can enter a state where there are lots of
      free pages in a zone but only order-0 and order-1 pages, which means the
      zone is heavily fragmented.  A high-order allocation can then cause a long
      stall (e.g., 60 seconds) in the direct reclaim path, especially in a
      no-swap and no-compaction environment.  This problem happened on v3.4, but
      the issue seems to still exist in the current tree.  The reason is that
      do_try_to_free_pages() enters a livelock:
      
      kswapd will go to sleep if the zones have been fully scanned and are still
      not balanced.  As kswapd thinks there's little point trying all over again
      to avoid an infinite loop, it instead changes the order from high-order to
      0-order because kswapd thinks order-0 is the most important.  Look at
      commit 73ce02e9 for detail.  If the watermarks are ok, kswapd will go back
      to sleep and may leave zone->all_unreclaimable = 0.  It assumes high-order
      users can still perform direct reclaim if they wish.
      
      Direct reclaim continues to reclaim for a high order which is not a
      COSTLY_ORDER, without invoking the oom-killer, until kswapd turns on
      zone->all_unreclaimable.  This is to avoid a too-early oom-kill.
      So direct reclaim depends on kswapd to break this loop.
      
      In the worst case, direct reclaim may continue page reclaim forever while
      kswapd sleeps forever, until something like a watchdog detects it and
      finally kills the process.  As described in:
      http://thread.gmane.org/gmane.linux.kernel.mm/103737
      
      We can't turn on zone->all_unreclaimable from the direct reclaim path
      because the direct reclaim path doesn't take any lock, so doing it that
      way would be racy.  Thus this patch removes the zone->all_unreclaimable
      field completely and recalculates the zone's reclaimable state every time.
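
      Conceptually, the recalculated state boils down to something like the
      following (a sketch; the exact helper and scan-count bookkeeping live in
      mm/vmscan.c and the factor is illustrative):

        static bool zone_reclaimable(struct zone *zone)
        {
                /* Treat the zone as unreclaimable once several times its
                 * reclaimable pages have been scanned without progress. */
                return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
        }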
      
      Note: we can't take the approach of having direct reclaim look at
      zone->pages_scanned directly while kswapd continues to use
      zone->all_unreclaimable, because that is also racy.  Commit 929bea7c
      ("vmscan: all_unreclaimable() use zone->all_unreclaimable as a name")
      describes the detail.
      
      [akpm@linux-foundation.org: uninline zone_reclaimable_pages() and zone_reclaimable()]
      Cc: Aaditya Kumar <aaditya.kumar.30@gmail.com>
      Cc: Ying Han <yinghan@google.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Neil Zhang <zhangwm@marvell.com>
      Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lisa Du <cldu@marvell.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • vmstat: create separate function to fold per cpu diffs into local counters · 2bb921e5
      Committed by Christoph Lameter
      The main idea behind this patchset is to reduce the vmstat update overhead
      by avoiding interrupt enable/disable and the use of per cpu atomics.
      
      This patch (of 3):
      
      It is better to have a separate folding function because
      refresh_cpu_vm_stats() also does other things like expire pages in the
      page allocator caches.
      
      If we have a separate function then refresh_cpu_vm_stats() is only called
      from the local cpu which allows additional optimizations.
      
      The folding function is only called when a cpu is being downed, and
      therefore no other processor will be accessing the counters.  This also
      simplifies synchronization.
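
      The resulting split is roughly as follows (a sketch of the idea, not the
      real prototypes):

        /* Runs only on the local CPU from the periodic vmstat work; besides
         * folding the per-cpu differentials it also does other housekeeping
         * such as expiring pages in the per-cpu page allocator caches. */
        static void refresh_cpu_vm_stats(void);

        /* Called only while @cpu is being taken down, so nothing else can be
         * touching that CPU's counters; no interrupt disabling or per-cpu
         * atomics are needed. */
        void cpu_vm_stats_fold(int cpu);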
      
      [akpm@linux-foundation.org: fix UP build]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 30 Apr 2013 (1 commit)
  14. 24 Feb 2013 (1 commit)
    • mm: numa: handle side-effects in count_vm_numa_events() for !CONFIG_NUMA_BALANCING · 3c0ff468
      Committed by Mel Gorman
      The current definition of count_vm_numa_events() is wrong for
      !CONFIG_NUMA_BALANCING, as the following would miss the side-effect.
      
      	count_vm_numa_events(NUMA_FOO, bar++);
      
      There are no such users of count_vm_numa_events() but this patch fixes
      it as it is a potential pitfall.  Ideally both would be converted to
      static inline but NUMA_PTE_UPDATES is not defined if
      !CONFIG_NUMA_BALANCING and creating dummy constants just to have a
      static inline would be similarly clumsy.
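
      The pitfall and the fix look roughly like this (a sketch; the macro bodies
      are abbreviated):

        /* Broken: under !CONFIG_NUMA_BALANCING the second argument is never
         * evaluated, so count_vm_numa_events(NUMA_FOO, bar++) would silently
         * skip the increment of bar. */
        #define count_vm_numa_events(x, y) do {} while (0)

        /* Fixed: still counts nothing, but evaluates y for its side-effects. */
        #define count_vm_numa_events(x, y) do { (void)(y); } while (0)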
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Simon Jeons <simon.jeons@gmail.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 11 Dec 2012 (1 commit)
    • mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
      Committed by Mel Gorman
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
              benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
      expectation would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
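
      As a worked example (every number below is an assumption chosen purely for
      illustration): with u = sizeof(void *) = 8 and sizeof(struct page) = 64,
      Ca = Cpte = 64/8 = 8.  Taking Wi = 10 gives Ci = 18; with 4K pages,
      Cpagerw = 8 + 4096/8 = 520, and with Wnuma = 1.2,
      Cpagecopy = 520 + 624 + 18 + 21.6 = 1183.6.  A run with 1000 numa_pte_updates,
      100 numa_hint_faults (Cnumahint = 1000) and 10 numa_pages_migrated would then
      have a balancing cost of roughly 8*1000 + 1000*100 + 18*10 + 1183.6*10, or
      about 120,000 units.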
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
  16. 09 Oct 2012 (2 commits)
  17. 01 Aug 2012 (1 commit)
  18. 27 Jul 2011 (1 commit)
  19. 27 May 2011 (1 commit)
    • mm: move enum vm_event_item into a standalone header file · f042e707
      Committed by Andrew Morton
      enums are problematic because they cannot be forward-declared:
      
        akpm2:/home/akpm> cat t.c
      
        enum foo;
      
        static inline void bar(enum foo f)
        {
        }
        akpm2:/home/akpm> gcc -c t.c
        t.c:4: error: parameter 1 ('f') has incomplete type
      
      So move the enum's definition into a standalone header file which can be used
      wherever its definition is needed.
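
      In other words, the pattern becomes roughly (the enum body is abbreviated
      here):

        /* linux/vm_event_item.h: the enum definition lives on its own. */
        enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, /* ... */ NR_VM_EVENT_ITEMS };

        /* linux/vmstat.h (or any other header): include the definition instead
         * of trying to forward-declare the enum. */
        #include <linux/vm_event_item.h>

        static inline void count_vm_event(enum vm_event_item item)
        {
                /* ... */
        }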
      
      Cc: Ying Han <yinghan@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 25 May 2011 (2 commits)
  21. 15 Apr 2011 (1 commit)
  22. 23 Mar 2011 (1 commit)
    • mm: add __GFP_OTHER_NODE flag · 78afd561
      Committed by Andi Kleen
      Add a new __GFP_OTHER_NODE flag to tell the low level numa statistics in
      zone_statistics() that an allocation is on behalf of another thread.  This
      way the local and remote counters can be still correct, even when
      background daemons like khugepaged are changing memory mappings.
      
      This only affects the accounting, but I think it's worth doing that right
      to avoid confusing users.
      
      I first tried to just pass down the right node, but this required a lot of
      changes to pass down this parameter and at least one addition of a 10th
      argument to a 9 argument function.  Using the flag is a lot less
      intrusive.
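
      Inside zone_statistics(), the accounting decision ends up looking roughly
      like this (a simplified sketch, not the exact code):

        /* When a daemon such as khugepaged allocates on behalf of another
         * process, __GFP_OTHER_NODE means "judge locality against the preferred
         * zone's node rather than the node this thread happens to run on". */
        int local_nid = (flags & __GFP_OTHER_NODE) ?
                                preferred_zone->node : numa_node_id();

        if (z->node == local_nid)
                __inc_zone_state(z, NUMA_LOCAL);
        else
                __inc_zone_state(z, NUMA_OTHER);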
      
      Open: should this also be used for migration?
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 14 Jan 2011 (2 commits)
    • mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Committed by Mel Gorman
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
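
      The consolidated setter has roughly this shape (signature approximate):

        /* The callback computes the desired per-cpu threshold for a zone, e.g.
         * a reduced "pressure" threshold while kswapd is awake or the normal
         * drift threshold once it goes back to sleep. */
        void set_pgdat_percpu_threshold(pg_data_t *pgdat,
                                        int (*calculate_pressure)(struct zone *));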
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b44129b3
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman authored
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when the per-cpu delta rises above a
      threshold.  On large CPU systems, the difference between the estimated and
      real values of NR_FREE_PAGES can be very high.  The system can get into a
      state where pages are allocated far below the min watermark, potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty under heavy
      memory pressure, by a factor that depends on the workload and the machine,
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps, but the event is not expected to be frequent - in Shaohua's test
      case at least, there was a single recorded sleep and wake event.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
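      
      Roughly, the idea can be sketched like this (a simplified, self-contained
      userspace model, not the kernel function; the drift mark and field names
      are stand-ins): the cheap global counter is used by default, and the
      expensive folded-up reading is taken only when the cheap value is close
      enough to the watermark that drift could matter.
      
        #include <stdio.h>
        
        #define NR_CPUS_DEMO 4
        
        struct zone_model {
                long vm_stat_free;                   /* global counter, may drift */
                long percpu_free_diff[NR_CPUS_DEMO]; /* undrained per-cpu deltas */
                long percpu_drift_mark;              /* below this, take a snapshot */
        };
        
        static long free_pages_snapshot(const struct zone_model *z)
        {
                long x = z->vm_stat_free;
        
                for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
                        x += z->percpu_free_diff[cpu];
                return x < 0 ? 0 : x;
        }
        
        static int watermark_ok_safe(const struct zone_model *z, long min_wmark)
        {
                long free = z->vm_stat_free;         /* cheap estimate first */
        
                if (free < z->percpu_drift_mark)     /* estimate may be unsafe */
                        free = free_pages_snapshot(z);
                return free > min_wmark;
        }
        
        int main(void)
        {
                struct zone_model z = {
                        .vm_stat_free = 900,
                        .percpu_free_diff = { -300, -250, -200, -100 },
                        .percpu_drift_mark = 1000,
                };
        
                /* the drifted estimate (900) looks fine, the snapshot (50) does not */
                printf("safe check: %s\n", watermark_ok_safe(&z, 256) ? "ok" : "below min");
                return 0;
        }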
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report shows the cumulative percentage
      of time spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
  24. 10 Sep 2010, 1 commit
    • C
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory... · aa454840
      Christoph Lameter authored
      mm: page allocator: calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake
      
      Ordinarily watermark checks are based on the vmstat NR_FREE_PAGES as it is
      cheaper than scanning a number of lists.  To avoid synchronization
      overhead, counter deltas are maintained on a per-cpu basis and drained
      both periodically and when the delta is above a threshold.  On large CPU
      systems, the difference between the estimated and real value of
      NR_FREE_PAGES can be very high.  If NR_FREE_PAGES is much higher than the
      number of real free pages in the buddy lists, the VM can allocate pages below
      the min watermark, at worst reducing the real number of free pages to zero.
      Even if the OOM killer kills a victim to free memory, no memory may actually
      be freed if the exit path itself requires a new page, resulting in livelock.
      
      This patch introduces a zone_page_state_snapshot() function (courtesy of
      Christoph) that takes a slightly more accurate view of an arbitrary vmstat
      counter.  It is used to read NR_FREE_PAGES while kswapd is awake to avoid
      the watermark being accidentally broken.  The estimate is not perfect and
      may result in cache line bounces but is expected to be lighter than the
      IPI calls necessary to continually drain the per-cpu counters while kswapd
      is awake.
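      
      As a rough, self-contained illustration of the difference between the two
      readings (a toy model, not the kernel code; the structures are invented
      for the example): the cheap reading ignores whatever has accumulated in
      the per-cpu deltas, so it can be off by up to nr_cpus * threshold pages,
      while the snapshot folds those deltas back in.
      
        #include <stdio.h>
        
        #define NR_CPUS_DEMO 8
        
        struct counter_model {
                long global;                  /* the value vmstat reports cheaply */
                long cpu_diff[NR_CPUS_DEMO];  /* per-cpu deltas not yet drained */
        };
        
        /* cheap reading: just the global counter, as used for ordinary checks */
        static long read_cheap(const struct counter_model *c)
        {
                return c->global;
        }
        
        /* snapshot: fold the undrained per-cpu deltas back in */
        static long read_snapshot(const struct counter_model *c)
        {
                long x = c->global;
        
                for (int cpu = 0; cpu < NR_CPUS_DEMO; cpu++)
                        x += c->cpu_diff[cpu];
                return x < 0 ? 0 : x;
        }
        
        int main(void)
        {
                /* each CPU has allocated pages but not yet crossed its threshold */
                struct counter_model nr_free = {
                        .global = 1024,
                        .cpu_diff = { -120, -120, -120, -120, -120, -120, -120, -120 },
                };
        
                printf("cheap estimate : %ld\n", read_cheap(&nr_free));    /* 1024 */
                printf("snapshot       : %ld\n", read_snapshot(&nr_free)); /*   64 */
                return 0;
        }
      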
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aa454840
  25. 25 May 2010, 2 commits
    • M
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman authored
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined whether an allocation failed due to external fragmentation
      rather than low memory and, if so, the calling process compacts memory
      until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  After each block is
      compacted, a check is made for whether a suitable page has been freed
      and, if so, compaction returns immediately.
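      
      A toy sketch of that control flow (a self-contained model, not the kernel
      implementation; "compacting" a block simply frees a suitable page here):
      
        #include <stdio.h>
        
        #define NR_BLOCKS 8
        
        struct zone_toy {
                int fragmented[NR_BLOCKS]; /* 1 = block holds no suitably sized free page */
                int high_order_free;       /* free pages of the requested order */
        };
        
        /* "Compacting" a block frees one suitably sized page in this toy model. */
        static void compact_block(struct zone_toy *z, int blk)
        {
                if (z->fragmented[blk]) {
                        z->fragmented[blk] = 0;
                        z->high_order_free++;
                }
        }
        
        static int direct_compact(struct zone_toy *z)
        {
                for (int blk = 0; blk < NR_BLOCKS; blk++) {
                        compact_block(z, blk);
                        if (z->high_order_free > 0)
                                return 1;   /* return as soon as a page is freed */
                }
                return 0;                   /* caller falls back to direct reclaim */
        }
        
        int main(void)
        {
                struct zone_toy z = { .fragmented = { 0, 0, 1, 0, 1, 1, 0, 1 } };
        
                printf("compaction %s\n", direct_compact(&z) ? "freed a page" : "failed");
                return 0;
        }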
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • M
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman authored
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas, consuming the free pages within them and making them
      available to the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
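      
      A self-contained toy of a single run (not the kernel scanners; pages are
      modelled as a small array): the migration scanner walks up from the
      bottom picking movable pages, the free scanner walks down from the top
      picking free pages, and movable pages are relocated towards the end of
      the zone until the two scanners meet.
      
        #include <stdio.h>
        
        #define NR_PAGES 16
        
        enum page_state { FREE, MOVABLE, PINNED };
        
        static void compact_zone_toy(enum page_state page[NR_PAGES])
        {
                int migrate = 0, free = NR_PAGES - 1;
        
                while (migrate < free) {
                        if (page[migrate] != MOVABLE) { migrate++; continue; }
                        if (page[free] != FREE)       { free--;    continue; }
                        /* relocate the movable page into the free slot near the top */
                        page[free--] = MOVABLE;
                        page[migrate++] = FREE;
                }
        }
        
        int main(void)
        {
                enum page_state page[NR_PAGES] = {
                        MOVABLE, FREE, PINNED, MOVABLE, FREE, MOVABLE, FREE, FREE,
                        MOVABLE, FREE, FREE, PINNED, FREE, MOVABLE, FREE, FREE
                };
        
                compact_zone_toy(page);
                for (int i = 0; i < NR_PAGES; i++)
                        putchar(page[i] == FREE ? '.' : page[i] == MOVABLE ? 'M' : 'P');
                putchar('\n'); /* movable pages end up packed towards the top of the zone */
                return 0;
        }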
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      748446bb
  26. 16 Dec 2009, 2 commits
    • K
      vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596
      KOSAKI Motohiro authored
      If reclaim fails to make sufficient progress, the priority is raised.
      Once the priority is higher, kswapd starts waiting on congestion.
      However, if the zone is below the min watermark then kswapd needs to
      continue working without delay as there is a danger of an increased rate
      of GFP_ATOMIC allocation failures.
      
      This patch changes the conditions under which kswapd waits on congestion
      by only going to sleep if the min watermarks are being met.
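      
      A simplified sketch of the changed condition (a self-contained userspace
      model with invented structures, not the kernel source): the congestion
      wait is skipped whenever any zone kswapd is responsible for is below its
      min watermark.
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct zone_model { long free_pages; long min_wmark; };
        
        static bool any_zone_under_min(const struct zone_model *zones, int nr)
        {
                for (int i = 0; i < nr; i++)
                        if (zones[i].free_pages < zones[i].min_wmark)
                                return true;
                return false;
        }
        
        /* In the real code this also requires that reclaim priority was raised. */
        static void maybe_wait_on_congestion(const struct zone_model *zones, int nr)
        {
                if (!any_zone_under_min(zones, nr))
                        puts("kswapd: waiting on congestion");
                else
                        puts("kswapd: min watermark at risk, continuing without delay");
        }
        
        int main(void)
        {
                struct zone_model zones[] = { { 100, 256 }, { 5000, 1024 } };
        
                maybe_wait_on_congestion(zones, 2);
                return 0;
        }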
      
      [mel@csn.ul.ie: add stats to track how relevant the logic is]
      [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb3ab596
    • M
      vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3
      Mel Gorman authored
      After kswapd balances all zones in a pgdat, it goes to sleep.  In the
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there is a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for a sufficient length of time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
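      
      The two-stage decision can be sketched as follows (a self-contained
      model; the stubbed checks stand in for the real watermark and wait-queue
      logic):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        /* Stand-in for re-checking that all zones still meet the high watermark. */
        static bool watermarks_breached(void)   { return false; }
        /* Stand-in for schedule_timeout(HZ/10): true if something woke kswapd. */
        static bool short_nap_interrupted(void) { return false; }
        /* Stand-in for the full schedule(). */
        static void full_sleep(void)            { puts("kswapd: fully asleep"); }
        
        static void kswapd_try_to_sleep(void)
        {
                /* Stage 1: a short nap, roughly HZ/10. */
                bool woken = short_nap_interrupted();
        
                /* Premature if woken early or the high watermark no longer holds. */
                if (woken || watermarks_breached()) {
                        puts("kswapd: premature sleep, continuing work");
                        return;
                }
        
                /* Stage 2: nothing happened during the nap, sleep for real. */
                full_sleep();
        }
        
        int main(void)
        {
                kswapd_try_to_sleep();
                return 0;
        }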
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd went to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f50de2d3
  27. 29 Oct 2009, 1 commit
  28. 03 Oct 2009, 1 commit
  29. 22 Sep 2009, 1 commit