1. 21 Jan, 2011 1 commit
  2. 18 Jan, 2011 1 commit
  3. 14 Jan, 2011 17 commits
    • S
      mm: batch activate_page() to reduce lock contention · 744ed144
      Shaohua Li committed
      The zone->lru_lock is heavily contended in workloads where activate_page()
      is frequently used.  We could batch activate_page() to reduce the lock
      contention.  The batched pages will be added to the zone list when the pool
      is full or page reclaim is trying to drain them.
      
      For example, on a 4-socket, 64-CPU system, create a sparse file and 64
      processes that share a mapping of the file.  Each process reads the
      whole file and then exits.  The process exit calls unmap_vmas() and causes
      a lot of activate_page() calls.  In this workload, we saw about a 58%
      reduction in total time with the patch below.  Other workloads that call
      activate_page() heavily also benefit.
      
      I tested some microbenchmarks:
      case-anon-cow-rand-mt		0.58%
      case-anon-cow-rand		-3.30%
      case-anon-cow-seq-mt		-0.51%
      case-anon-cow-seq		-5.68%
      case-anon-r-rand-mt		0.23%
      case-anon-r-rand		0.81%
      case-anon-r-seq-mt		-0.71%
      case-anon-r-seq			-1.99%
      case-anon-rx-rand-mt		2.11%
      case-anon-rx-seq-mt		3.46%
      case-anon-w-rand-mt		-0.03%
      case-anon-w-rand		-0.50%
      case-anon-w-seq-mt		-1.08%
      case-anon-w-seq			-0.12%
      case-anon-wx-rand-mt		-5.02%
      case-anon-wx-seq-mt		-1.43%
      case-fork			1.65%
      case-fork-sleep			-0.07%
      case-fork-withmem		1.39%
      case-hugetlb			-0.59%
      case-lru-file-mmap-read-mt	-0.54%
      case-lru-file-mmap-read		0.61%
      case-lru-file-mmap-read-rand	-2.24%
      case-lru-file-readonce		-0.64%
      case-lru-file-readtwice		-11.69%
      case-lru-memcg			-1.35%
      case-mmap-pread-rand-mt		1.88%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq-mt		0.89%
      case-mmap-pread-seq		-69.72%
      case-mmap-xread-rand-mt		0.71%
      case-mmap-xread-seq-mt		0.38%
      
      The most significant are:
      case-lru-file-readtwice		-11.69%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq		-69.72%
      
      which use activate_page a lot.  The others are basically run-to-run
      variation, because each run differs slightly.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      744ed144
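The batching idea in the commit above can be sketched in user-space C. This is a minimal illustration, not the kernel code: `struct lru_sketch`, `struct activate_batch`, `SKETCH_BATCH_SIZE` and both functions are hypothetical names, and the lock is modelled only as a counter of acquisitions so the saving is visible.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_BATCH_SIZE 14 /* hypothetical pool size, chosen for illustration */

/* Stand-in for a zone LRU that would be protected by zone->lru_lock. */
struct lru_sketch {
    unsigned long lock_acquisitions; /* how many times the "lock" was taken */
    unsigned long active_pages;      /* pages moved to the active list */
};

/* Pool of pages waiting to be activated (per CPU in a real design). */
struct activate_batch {
    int pages[SKETCH_BATCH_SIZE]; /* page ids stand in for struct page pointers */
    size_t count;
};

/* Move every pooled page to the LRU under a single lock acquisition. */
static void batch_drain(struct activate_batch *b, struct lru_sketch *lru)
{
    if (b->count == 0)
        return;
    lru->lock_acquisitions++; /* one acquisition covers the whole batch */
    lru->active_pages += b->count;
    b->count = 0;
}

/* Queue one page and drain only when the pool is full. */
static void batch_activate(struct activate_batch *b, struct lru_sketch *lru,
                           int page)
{
    assert(b->count < SKETCH_BATCH_SIZE);
    b->pages[b->count++] = page;
    if (b->count == SKETCH_BATCH_SIZE)
        batch_drain(b, lru);
}
```

With a pool of 14, activating 30 pages takes the lock twice instead of 30 times; the remaining 2 pages stay pooled until something (page reclaim, in the commit's description) drains them.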
    • R
      thp: scale nr_rotated to balance memory pressure · 9992af10
      Rik van Riel committed
      Make sure we scale up nr_rotated when we encounter a referenced
      transparent huge page.  This ensures pageout scanning balance is not
      distorted when there are huge pages on the LRU.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9992af10
    • R
      thp: fix anon memory statistics with transparent hugepages · 2c888cfb
      Rik van Riel committed
      Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
      statistics, so the Active(anon) and Inactive(anon) statistics in
      /proc/meminfo are correct.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c888cfb
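The accounting rule in the commit above amounts to weighting each LRU entry by the number of base pages it covers. A minimal sketch, assuming x86-64 with 4 KiB base pages and 2 MiB huge pages, where HPAGE_PMD_NR is 512; `SKETCH_HPAGE_PMD_NR`, `struct lru_stats` and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Assumption: 2 MiB huge page / 4 KiB base page = 512 base pages. */
#define SKETCH_HPAGE_PMD_NR 512UL

/* Stand-in for the per-zone LRU counters behind Active(anon). */
struct lru_stats {
    unsigned long nr_active_anon;
};

/* Number of base pages one LRU entry represents. */
static unsigned long page_weight(bool is_trans_huge)
{
    return is_trans_huge ? SKETCH_HPAGE_PMD_NR : 1UL;
}

/* Account an entry by its weight, not as a single page. */
static void lru_add_page(struct lru_stats *s, bool is_trans_huge)
{
    s->nr_active_anon += page_weight(is_trans_huge);
}
```

The same weighting is what keeps a rotation counter such as nr_rotated proportionate when the rotated entry is a huge page.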
    • A
      thp: use compaction in kswapd for GFP_ATOMIC order > 0 · 5a03b051
      Andrea Arcangeli committed
      This takes advantage of memory compaction to properly generate pages of
      order > 0 if regular page reclaim fails and priority level becomes more
      severe and we don't reach the proper watermarks.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a03b051
    • M
      mm: kswapd: use the classzone idx that kswapd was using for sleeping_prematurely() · dc83edd9
      Mel Gorman committed
      When kswapd is woken up for a high-order allocation, it takes account of
      the highest usable zone by the caller (the classzone idx).  During
      allocation, this index is used to select the lowmem_reserve[] that should
      be applied to the watermark calculation in zone_watermark_ok().
      
      When balancing a node, kswapd considers the highest unbalanced zone to be
      the classzone index.  This will always be at least the caller's
      classzone_idx and can be higher.  However, sleeping_prematurely() always
      considers the lowest zone (e.g. ZONE_DMA) to be the classzone index.
      This means that sleeping_prematurely() can consider a zone to be balanced
      that is unusable by the allocation request that originally woke kswapd.
      This patch changes sleeping_prematurely() to use a classzone_idx matching
      the value it used in balance_pgdat().
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc83edd9
    • M
      mm: kswapd: treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() · 355b09c4
      Mel Gorman committed
      After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
      be balanced but sleeping_prematurely does not.  This can force kswapd to
      stay awake longer than it should.  This patch fixes it.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      355b09c4
    • M
      mm: kswapd: reset kswapd_max_order and classzone_idx after reading · 4d40502e
      Mel Gorman committed
      When kswapd wakes up, it reads its order and classzone from pgdat and
      calls balance_pgdat().  While it's awake, it potentially reclaims at a high
      order and a low classzone index.  This might have been a once-off that was
      not required by subsequent callers.  However, because the pgdat values
      were not reset, they remain artificially high while balance_pgdat() is
      running and potentially kswapd enters a second unnecessary reclaim cycle.
      Reset the pgdat order and classzone index after reading.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d40502e
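The "read, then reset" step described in the commit above can be sketched as follows. The struct and function names are hypothetical, and the reset values are an assumption: order 0 as the least demanding order, and the highest zone index as the least restrictive classzone.

```c
#define SKETCH_MAX_NR_ZONES 4 /* assumption: number of zone types on the node */

/* Stand-in for the wakeup request fields stored in the node descriptor. */
struct pgdat_req {
    int kswapd_max_order;
    int classzone_idx;
};

/* Private copy that kswapd works from for one reclaim cycle. */
struct kswapd_request {
    int order;
    int classzone_idx;
};

/*
 * Copy the pending request and immediately reset the shared fields, so a
 * one-off high-order wakeup does not trigger a second reclaim cycle.
 */
static struct kswapd_request kswapd_read_request(struct pgdat_req *p)
{
    struct kswapd_request r = { p->kswapd_max_order, p->classzone_idx };

    p->kswapd_max_order = 0;                      /* least demanding order */
    p->classzone_idx = SKETCH_MAX_NR_ZONES - 1;   /* least restrictive zone */
    return r;
}
```

A later waker that needs a higher order or a lower classzone index would overwrite the reset values, so only requests that arrive after the read can start another cycle.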
    • M
      mm: kswapd: use the order that kswapd was reclaiming at for sleeping_prematurely() · 0abdee2b
      Mel Gorman committed
      Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
      there was a race pushing a zone below its watermark.  If the race
      happened, it stays awake.  However, balance_pgdat() can decide to reclaim
      at order-0 if it decides that high-order reclaim is not working as
      expected.  This information is not passed back to sleeping_prematurely().
      The impact is that kswapd remains awake reclaiming pages long after it
      should have gone to sleep.  This patch passes the adjusted order to
      sleeping_prematurely and uses the same logic as balance_pgdat to decide if
      it's ok to go to sleep.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0abdee2b
    • M
      mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced · 1741c877
      Mel Gorman committed
      When reclaiming for high-orders, kswapd is responsible for balancing a
      node but it should not reclaim excessively.  It avoids excessive reclaim
      by treating the node as balanced if any zone in it is balanced.  In
      cases where zone sizes are imbalanced (e.g. ZONE_DMA with both ZONE_DMA32
      and ZONE_NORMAL), kswapd can go to sleep prematurely because just one
      small zone was balanced.
      
      This alters the sleep logic of kswapd slightly.  It counts the number of
      pages that make up the balanced zones.  If the total number of balanced
      pages is more than a quarter of the zone, kswapd will go back to sleep.
      This should keep a node balanced without reclaiming an excessive number of
      pages.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1741c877
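The "more than a quarter" rule from the commit above can be written as a small predicate. This is a sketch under assumptions: the total is taken over the zones usable by the caller (index 0 up to classzone_idx), and `struct zone_sketch` and `node_balanced_enough()` are hypothetical names.

```c
#include <stdbool.h>

/* Stand-in for one zone of a node. */
struct zone_sketch {
    unsigned long present_pages; /* pages that make up the zone */
    bool balanced;               /* zone currently meets its watermark */
};

/*
 * Return true when the balanced zones account for more than a quarter of
 * the pages in zones [0, classzone_idx].
 */
static bool node_balanced_enough(const struct zone_sketch *zones,
                                 int classzone_idx)
{
    unsigned long present = 0, balanced = 0;

    for (int i = 0; i <= classzone_idx; i++) {
        present += zones[i].present_pages;
        if (zones[i].balanced)
            balanced += zones[i].present_pages;
    }
    return balanced > (present >> 2);
}
```

With a small DMA-like zone of 4096 pages balanced and two much larger zones unbalanced, the predicate stays false, so a single tiny balanced zone no longer lets kswapd sleep.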
    • M
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman committed
      Simon Kirby reported the following problem:
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevailing
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      and is too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artificially awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  First, the old figures were forcing
      my network card to use GFP_ATOMIC in an attempt to replicate Simon's problem.
      Second, I previously specified slub_min_order=3, again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact on the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
                                               traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      This patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99504748
    • M
      mm: vmscan: rename lumpy_mode to reclaim_mode · f3a310bc
      Mel Gorman committed
      With compaction being used instead of lumpy reclaim, the name lumpy_mode
      and its associated variables are a bit misleading.  Rename lumpy_mode to
      reclaim_mode, which is a better fit.  There is no functional change.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3a310bc
    • M
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman committed
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path

      Migration synchronously waits for writeback if the initial passes fail.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
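The effect of the sync parameter described in the commit above can be modelled in a few lines. This is a simplified sketch, not the signature of migrate_pages(): `struct page_sketch` and `migrate_pages_sketch()` are hypothetical, and "waiting for writeback" is modelled as always succeeding.

```c
#include <stdbool.h>

/* Stand-in for a page that may currently be under writeback. */
struct page_sketch {
    bool under_writeback;
};

/*
 * Return how many of the n pages get migrated.  In async mode a page
 * under writeback is skipped instead of waited for; in sync mode the
 * caller is assumed to wait for writeback and then migrate the page.
 */
static int migrate_pages_sketch(const struct page_sketch *pages, int n,
                                bool sync)
{
    int migrated = 0;

    for (int i = 0; i < n; i++) {
        if (pages[i].under_writeback && !sync)
            continue; /* async: never block on writeback */
        migrated++;
    }
    return migrated;
}
```

This mirrors the ordering in the commit text: a cheap asynchronous attempt first, and a synchronous attempt only after direct reclaim, when blocking is more likely to pay off.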
    • M
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman committed
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim when compaction is available, called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later performed by either kswapd (compact_zone_order()) or
      direct compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • M
      mm: vmscan: convert lumpy_mode into a bitmask · ee64fc93
      Mel Gorman committed
      Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
      synchronous or asynchronous.  In preparation for using compaction instead of
      lumpy reclaim, this patch converts the flags into a bitmask.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee64fc93
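Why a bitmask helps here can be shown with a small sketch. An enum can only express one state at a time, while independent bits let a later patch combine "what kind of reclaim" with "sync or async". The flag names and values below are illustrative assumptions, not the kernel's definitions.

```c
#include <stdbool.h>

typedef unsigned int reclaim_flags_t;

/* Hypothetical flag bits; each property gets its own bit. */
#define SKETCH_MODE_SINGLE 0x01u /* reclaim one page at a time */
#define SKETCH_MODE_ASYNC  0x02u /* do not wait on pages */
#define SKETCH_MODE_SYNC   0x04u /* may wait on pages */
#define SKETCH_MODE_LUMPY  0x08u /* reclaim contiguous ranges */

/* True only when both the LUMPY and SYNC bits are set. */
static bool mode_is_sync_lumpy(reclaim_flags_t mode)
{
    const reclaim_flags_t want = SKETCH_MODE_LUMPY | SKETCH_MODE_SYNC;

    return (mode & want) == want;
}

/* Switch an existing mode from async to sync without touching other bits. */
static reclaim_flags_t mode_make_sync(reclaim_flags_t mode)
{
    return (mode & ~SKETCH_MODE_ASYNC) | SKETCH_MODE_SYNC;
}
```

Adding a further reclaim strategy later then only needs one more bit, rather than a new enum value for every strategy and sync/async combination.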
    • K
      vmscan: factor out kswapd sleeping logic from kswapd() · f0bc0a60
      KOSAKI Motohiro committed
      Currently, kswapd() has deep nesting and is slightly hard to read.  Clean
      this up.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f0bc0a60
    • M
      mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman committed
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b44129b3
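The refactoring described in the commit above, two near-duplicate functions replaced by one setter that takes a callback, can be sketched like this. Only the idea is taken from the commit text; the types, the callback names and both threshold formulas are made up for illustration.

```c
#define SKETCH_NR_CPUS 4 /* assumption: fixed CPU count for the sketch */

/* Stand-in for the per-cpu vmstat thresholds of one node. */
struct pgdat_thresholds {
    int stat_threshold[SKETCH_NR_CPUS];
};

/* Callback type: compute the desired threshold for the machine. */
typedef int (*threshold_fn)(int nr_cpus);

/* Made-up formula for normal operation: larger threshold, more drift. */
static int normal_threshold(int nr_cpus)
{
    return 8 * nr_cpus;
}

/* Made-up formula for memory pressure: tiny threshold, little drift. */
static int pressure_threshold(int nr_cpus)
{
    (void)nr_cpus;
    return 1;
}

/* Single setter: the policy lives in the callback, the loop is shared. */
static void set_percpu_threshold(struct pgdat_thresholds *p, threshold_fn calc)
{
    int threshold = calc(SKETCH_NR_CPUS);

    for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
        p->stat_threshold[cpu] = threshold;
}
```

A caller would pass `pressure_threshold` when kswapd wakes and `normal_threshold` when it goes back to sleep, so the two former code paths differ only in the function pointer.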
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman committed
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a per-cpu delta exceeds a
      threshold.  On systems with many CPUs, the difference between the estimate
      and the real value of NR_FREE_PAGES can be very high.  The system can get
      into a case where pages are allocated far below the min watermark,
      potentially causing livelock issues.  The commit solved the problem by
      taking a better reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was at least one recorded sleep and wake event.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the cumulative
      percentage of time spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>    [2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
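The counter drift that the commit above deals with can be reproduced in a small model. This is a sketch under assumptions: `struct drift_counter` and its helpers are hypothetical, a per-cpu delta is folded into the global value only once its magnitude exceeds the threshold, and the "snapshot" read stands in for the more accurate but more expensive reading.

```c
#define SKETCH_NCPU 4 /* assumption: fixed CPU count for the sketch */

struct drift_counter {
    long global;              /* cheap value, may lag behind reality */
    long pcp[SKETCH_NCPU];    /* per-cpu deltas not yet folded in */
    long threshold;           /* fold when |delta| exceeds this */
};

/* Apply a delta on one CPU; fold into the global value past the threshold. */
static void counter_add(struct drift_counter *c, int cpu, long delta)
{
    c->pcp[cpu] += delta;
    if (c->pcp[cpu] > c->threshold || c->pcp[cpu] < -c->threshold) {
        c->global += c->pcp[cpu];
        c->pcp[cpu] = 0;
    }
}

/* Cheap read: ignores pending per-cpu deltas. */
static long counter_read(const struct drift_counter *c)
{
    return c->global;
}

/* Accurate read: touches every CPU's delta, which is what gets expensive. */
static long counter_snapshot(const struct drift_counter *c)
{
    long sum = c->global;

    for (int cpu = 0; cpu < SKETCH_NCPU; cpu++)
        sum += c->pcp[cpu];
    return sum;
}
```

With a threshold of 10 and four CPUs each holding a delta of -10, the cheap read overestimates free pages by 40. Lowering the threshold bounds that error at roughly the threshold times the number of CPUs, which is the trade the commit makes while kswapd is awake.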
  4. 02 Dec, 2010 1 commit
    • L
      Call the filesystem back whenever a page is removed from the page cache · 6072d13c
      Linus Torvalds committed
      NFS needs to be able to release objects that are stored in the page
      cache once the page itself is no longer visible from the page cache.
      
      This patch adds a callback to the address space operations that allows
      filesystems to perform page cleanups once the page has been removed
      from the page cache.
      
      Original patch by: Linus Torvalds <torvalds@linux-foundation.org>
      [trondmy: cover the cases of invalidate_inode_pages2() and
                truncate_inode_pages()]
      Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
      6072d13c
  5. 12 Nov, 2010 1 commit
  6. 27 Oct, 2010 11 commits
    • K
      vmscan,tmpfs: treat used once pages on tmpfs as used once · 2e30244a
      KOSAKI Motohiro committed
      When a page has PG_referenced, shrink_page_list() discards it only if it
      is not dirty.  This rule works fine if the backing filesystem is a regular
      one.  PG_dirty is a good signal that the page was used recently because
      the flusher threads clean pages periodically.  In addition, page writeback
      is costlier than simple page discard.
      
      However, when a page is on tmpfs this heuristic doesn't work because
      flusher threads don't write back tmpfs pages.  Consequently tmpfs pages
      always rotate around the LRU at least twice, which adds unnecessary LRU
      churn.  Simple tmpfs streaming io shouldn't cause large anonymous page
      swap-out.

      Remove this unnecessary reclaim bonus of tmpfs pages.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e30244a
    • M
      writeback: do not sleep on the congestion queue if there are no congested BDIs... · 0e093d99
      Mel Gorman committed
      writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
      
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested.  Otherwise, it calls cond_resched() to ensure the
      caller is not hogging the CPU longer than its quota but will not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
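
      The difference between the two waits can be modelled in userspace as
      follows.  This is only a sketch of the decision described above: the
      struct, the model_* function names and the jiffies bookkeeping are
      hypothetical, and the real functions operate on wait queues rather than
      counters.

```c
/* Hypothetical userspace model of a reclaimer's waiting behaviour. */
struct reclaim_model {
    int nr_congested_bdis;   /* backing devices currently congested */
    long slept_jiffies;      /* total time spent asleep */
    long yields;             /* times the caller only yielded the CPU */
};

/* Old behaviour: always sleep for the full timeout. */
long model_congestion_wait(struct reclaim_model *m, long timeout)
{
    m->slept_jiffies += timeout;
    return timeout;
}

/*
 * New behaviour: sleep only if some BDI is congested; otherwise just
 * give other tasks a chance to run (standing in for cond_resched()).
 * Returns the number of jiffies slept.
 */
long model_wait_iff_congested(struct reclaim_model *m, long timeout)
{
    if (m->nr_congested_bdis == 0) {
        m->yields++;
        return 0;
    }
    m->slept_jiffies += timeout;
    return timeout;
}
```

      With no congested BDI the model records a yield and no sleep, which is
      the case the patch targets for kswapd reclaiming clean page cache.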
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e093d99
    • K
      vmscan: isolate_lru_pages(): stop neighbour search if neighbour cannot be isolated · 08fc468f
      KOSAKI Motohiro committed
      isolate_lru_pages() does not just isolate LRU tail pages, but also
      isolates neighbour pages of the eviction page.  The neighbour search does
      not stop even if neighbours cannot be isolated which is excessive as the
      lumpy reclaim will no longer result in a successful higher order
      allocation.  This patch stops the PFN neighbour search if an isolation
      fails and moves on to the next block.
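
      A simplified sketch of the early exit, assuming a hypothetical
      can_isolate[] array indexed by PFN in place of the real per-page checks.
      It is not the kernel loop, which also skips the page that triggered the
      search.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Walk the PFN range [start, end) of one block and count the pages that
 * were isolated.  The walk stops at the first page that cannot be
 * isolated, because a hole means the block can no longer satisfy a
 * contiguous higher-order allocation.
 */
size_t isolate_block_sketch(const bool *can_isolate, size_t start, size_t end)
{
    size_t taken = 0;

    for (size_t pfn = start; pfn < end; pfn++) {
        if (!can_isolate[pfn])
            break;          /* give up on this block, caller moves on */
        taken++;
    }
    return taken;
}
```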
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      08fc468f
    • K
      vmscan: remove dead code in shrink_inactive_list() · 47185052
      KOSAKI Motohiro committed
      After synchronous lumpy reclaim, the page_list is guaranteed to not have
      active pages as page activation in shrink_page_list() disables lumpy
      reclaim.  Remove the dead code.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47185052
    • K
      vmscan: narrow the scenarios in which lumpy reclaim uses synchronous reclaim · 7d3579e8
      KOSAKI Motohiro committed
      shrink_page_list() can decide to give up reclaiming a page under a
      number of conditions such as
      
        1. trylock_page() failure
        2. page is unevictable
        3. zone reclaim and page is mapped
        4. PageWriteback() is true
        5. page is swapbacked and swap is full
        6. add_to_swap() failure
        7. page is dirty and gfpmask doesn't have GFP_IO, GFP_FS
        8. page is pinned
        9. IO queue is congested
       10. pageout() start IO, but not finished
      
      With lumpy reclaim, failures result in entering synchronous lumpy reclaim
      but this can be unnecessary.  In cases (2), (3), (5), (6), (7) and (8),
      there is no point retrying.  This patch causes lumpy reclaim to abort when
      it is known it will fail.
      
      Case (9) is more interesting.  The current behavior is:
        1. start shrink_page_list(async)
        2. found queue_congested()
        3. skip pageout write
        4. still start shrink_page_list(sync)
        5. wait on a lot of pages
        6. again, found queue_congested()
        7. give up pageout write again
      
      So, it's useless time wasting.  However, just skipping page reclaim is
      also not good, as x86 allocating a huge page needs 512 pages, for example.
      It can have more dirty pages than the queue congestion threshold (~=128).
      
      After this patch, pageout() behaves as follows:
      
       - If order > PAGE_ALLOC_COSTLY_ORDER
      	Ignore queue congestion always.
       - If order <= PAGE_ALLOC_COSTLY_ORDER
      	skip write page and disable lumpy reclaim.
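
      The two rules above can be written as a small decision function.  This
      is a sketch only: the enum and the function name are hypothetical, and
      PAGE_ALLOC_COSTLY_ORDER is defined locally with the value 3 that the
      Linux kernel uses.

```c
#include <stdbool.h>

#define PAGE_ALLOC_COSTLY_ORDER 3   /* value used by the Linux kernel */

/* Hypothetical outcome of the congestion check in pageout(). */
enum pageout_action {
    WRITE_PAGE,                 /* go ahead with the write */
    SKIP_WRITE_DISABLE_LUMPY    /* skip the write, turn lumpy reclaim off */
};

/*
 * order:           allocation order that triggered reclaim
 * queue_congested: whether the backing queue reports congestion
 */
enum pageout_action congested_pageout_policy(int order, bool queue_congested)
{
    if (!queue_congested)
        return WRITE_PAGE;
    if (order > PAGE_ALLOC_COSTLY_ORDER)
        return WRITE_PAGE;      /* costly orders ignore congestion */
    return SKIP_WRITE_DISABLE_LUMPY;
}
```

      For an x86 huge page (512 pages, order 9) the sketch returns WRITE_PAGE
      even when the queue is congested, matching the first rule.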
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d3579e8
    • K
      vmscan: synchronous lumpy reclaim should not call congestion_wait() · bc57e00f
      KOSAKI Motohiro committed
      congestion_wait() means "wait until queue congestion is cleared".
      However, synchronous lumpy reclaim does not need this congestion_wait() as
      shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback() and it
      provides the necessary waiting.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bc57e00f
    • M
      tracing, vmscan: add trace events for LRU list shrinking · e11da5b4
      Mel Gorman committed
      There have been numerous reports of stalls that pointed at the problem
      being somewhere in the VM.  There are multiple root causes, which means
      that fixing any one of them in isolation is tricky to justify on its own
      and each fix would still need integration testing.  This
      patch series puts together two different patch sets which in combination
      should tackle some of the root causes of latency problems being reported.
      
      Patch 1 adds a tracepoint for shrink_inactive_list.  For this series, the
      most important result is being able to calculate the scanning/reclaim
      ratio as a measure of the amount of work being done by page reclaim.
      
      Patch 2 accounts for time spent in congestion_wait.
      
      Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
      this series.  It has been noted that lumpy reclaim is far too aggressive
      and trashes the system somewhat.  As SLUB uses high-order allocations, a
      large cost incurred by lumpy reclaim will be noticeable.  It was also
      reported during transparent hugepage support testing that lumpy reclaim
      was trashing the system and these patches should mitigate that problem
      without disabling lumpy reclaim.
      
      Patch 7 adds wait_iff_congested() and replaces some callers of
      congestion_wait().  wait_iff_congested() only sleeps if there is a BDI
      that is currently congested.  Patch 8 notes that any BDI being congested
      is not necessarily a problem because there could be multiple BDIs of
      varying speeds and numerous zones.  It attempts to track when a zone
      being reclaimed contains many pages backed by a congested BDI and if so,
      reclaimers wait on the congestion queue.
      
      I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
      machine had 3G of RAM and the CPUs were
      
      X86:    Intel P4 2-core
      X86-64: AMD Phenom 4-core
      PPC64:  PPC970MP
      
      Each used a single disk and the onboard IO controller.  Dirty ratio was
      left at 20.  I'm just going to report for X86-64 and PPC64 in a vague
      attempt to keep this report short.  Four kernels were tested each based on
      v2.6.36-rc4
      
      traceonly-v2r2:     Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
      lowlumpy-v2r3:      Patches 1-6 to test if lumpy reclaim is better
      waitcongest-v2r3:   Patches 1-7 to only wait on congestion
      waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
      
      nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
      nodirect-v1r5:  Patches 1-10 to disable filesystem writeback for better IO
      
      The tests run were as follows
      
      kernbench
      	compile-based benchmark. Smoke test performance
      
      sysbench
      	OLTP read-only benchmark. Will be re-run in the future as read-write
      
      micro-mapped-file-stream
      	This is a micro-benchmark from Johannes Weiner that accesses a
      	large sparse-file through mmap(). It was configured to run in only
      	single-CPU mode but can be indicative of how well page reclaim
      	identifies suitable pages.
      
      stress-highalloc
      	Tries to allocate huge pages under heavy load.
      
      kernbench, iozone and sysbench did not report any performance regression
      on any machine.  sysbench did pressure the system lightly and there was
      reclaim activity but there was no difference of major interest between
      the kernels.
      
      X86-64 micro-mapped-file-stream
      
                                            traceonly-v2r2           lowlumpy-v2r3        waitcongest-v2r3     waitwriteback-v2r4
      pgalloc_dma                       1639.00 (   0.00%)       667.00 (-145.73%)      1167.00 ( -40.45%)       578.00 (-183.56%)
      pgalloc_dma32                  2842410.00 (   0.00%)   2842626.00 (   0.01%)   2843043.00 (   0.02%)   2843014.00 (   0.02%)
      pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgsteal_dma                        729.00 (   0.00%)        85.00 (-757.65%)       609.00 ( -19.70%)       125.00 (-483.20%)
      pgsteal_dma32                  2338721.00 (   0.00%)   2447354.00 (   4.44%)   2429536.00 (   3.74%)   2436772.00 (   4.02%)
      pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_kswapd_dma                 1469.00 (   0.00%)       532.00 (-176.13%)      1078.00 ( -36.27%)       220.00 (-567.73%)
      pgscan_kswapd_dma32            4597713.00 (   0.00%)   4503597.00 (  -2.09%)   4295673.00 (  -7.03%)   3891686.00 ( -18.14%)
      pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_direct_dma                   71.00 (   0.00%)       134.00 (  47.01%)       243.00 (  70.78%)       352.00 (  79.83%)
      pgscan_direct_dma32             305820.00 (   0.00%)    280204.00 (  -9.14%)    600518.00 (  49.07%)    957485.00 (  68.06%)
      pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pageoutrun                       16296.00 (   0.00%)     21254.00 (  23.33%)     18447.00 (  11.66%)     20067.00 (  18.79%)
      allocstall                         443.00 (   0.00%)       273.00 ( -62.27%)       513.00 (  13.65%)      1568.00 (  71.75%)
      
      These are based on the raw figures taken from /proc/vmstat.  It's a rough
      measure of reclaim activity.  Note that allocstall counts are higher
      because we are entering direct reclaim more often as a result of not
      sleeping in congestion.  In itself, it's not necessarily a bad thing.
      It's easier to get a view of what happened from the vmscan tracepoint
      report.
      
      FTrace Reclaim Statistics: vmscan
      
                                      traceonly-v2r2   lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
      Direct reclaims                                443        273        513       1568
      Direct reclaim pages scanned                305968     280402     600825     957933
      Direct reclaim pages reclaimed               43503      19005      30327     117191
      Direct reclaim write file async I/O              0          0          0          0
      Direct reclaim write anon async I/O              0          3          4         12
      Direct reclaim write file sync I/O               0          0          0          0
      Direct reclaim write anon sync I/O               0          0          0          0
      Wake kswapd requests                        187649     132338     191695     267701
      Kswapd wakeups                                   3          1          4          1
      Kswapd pages scanned                       4599269    4454162    4296815    3891906
      Kswapd pages reclaimed                     2295947    2428434    2399818    2319706
      Kswapd reclaim write file async I/O              1          0          1          1
      Kswapd reclaim write anon async I/O             59        187         41        222
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)         4.34       2.52       6.63       2.96
      Time kswapd awake (seconds)                  11.15      10.25      11.01      10.19
      
      Total pages scanned                        4905237   4734564   4897640   4849839
      Total pages reclaimed                      2339450   2447439   2430145   2436897
      %age total pages scanned/reclaimed          47.69%    51.69%    49.62%    50.25%
      %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
      %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
      Percentage Time Spent Direct Reclaim        29.23%    19.02%    38.48%    20.25%
      Percentage Time kswapd Awake                78.58%    78.85%    76.83%    79.86%
      
      What is interesting here for nocongest in particular is that while direct
      reclaim scans more pages, the overall number of pages scanned remains the
      same and the ratio of pages scanned to pages reclaimed is more or less the
      same.  In other words, while we are sleeping less, reclaim is not doing
      more work and as direct reclaim and kswapd are awake for less time, it
      would appear to be doing less work.
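
      The "%age total pages scanned/reclaimed" rows in the table above are the
      reclaimed page count expressed as a percentage of the scanned page count.
      A small sketch of that arithmetic, with a hypothetical function name:

```c
/*
 * Reclaim efficiency: pages reclaimed as a percentage of pages scanned.
 * Returns 0.0 when nothing was scanned to avoid dividing by zero.
 */
double reclaim_efficiency_pct(unsigned long scanned, unsigned long reclaimed)
{
    if (scanned == 0)
        return 0.0;
    return 100.0 * (double)reclaimed / (double)scanned;
}
```

      For the traceonly-v2r2 column, reclaim_efficiency_pct(4905237, 2339450)
      gives roughly 47.69, which agrees with the reported figure.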
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited                87        196         64          0
      Direct time   congest     waited            4604ms     4732ms     5420ms        0ms
      Direct full   congest     waited                72        145         53          0
      Direct number conditional waited                 0          0        324       1315
      Direct time   conditional waited               0ms        0ms        0ms        0ms
      Direct full   conditional waited                 0          0          0          0
      KSwapd number congest     waited                20         10         15          7
      KSwapd time   congest     waited            1264ms      536ms      884ms      284ms
      KSwapd full   congest     waited                10          4          6          2
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
      all asleep with the patches.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         10.51     10.73      10.6     11.66
      Total Elapsed Time (seconds)                 14.19     13.00     14.33     12.76
      
      Overall, the tests completed faster. It is interesting to note that backing off further
      when a zone is congested and not just a BDI was more efficient overall.
      
      PPC64 micro-mapped-file-stream
      pgalloc_dma                    3024660.00 (   0.00%)   3027185.00 (   0.08%)   3025845.00 (   0.04%)   3026281.00 (   0.05%)
      pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgsteal_dma                    2508073.00 (   0.00%)   2565351.00 (   2.23%)   2463577.00 (  -1.81%)   2532263.00 (   0.96%)
      pgsteal_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_kswapd_dma              4601307.00 (   0.00%)   4128076.00 ( -11.46%)   3912317.00 ( -17.61%)   3377165.00 ( -36.25%)
      pgscan_kswapd_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pgscan_direct_dma               629825.00 (   0.00%)    971622.00 (  35.18%)   1063938.00 (  40.80%)   1711935.00 (  63.21%)
      pgscan_direct_normal                 0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
      pageoutrun                       27776.00 (   0.00%)     20458.00 ( -35.77%)     18763.00 ( -48.04%)     18157.00 ( -52.98%)
      allocstall                         977.00 (   0.00%)      2751.00 (  64.49%)      2098.00 (  53.43%)      5136.00 (  80.98%)
      
      Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
      
      FTrace Reclaim Statistics: vmscan
      Direct reclaims                                977       2709       2098       5136
      Direct reclaim pages scanned                629825     963814    1063938    1711935
      Direct reclaim pages reclaimed               75550     242538     150904     387647
      Direct reclaim write file async I/O              0          0          0          2
      Direct reclaim write anon async I/O              0         10          0          4
      Direct reclaim write file sync I/O               0          0          0          0
      Direct reclaim write anon sync I/O               0          0          0          0
      Wake kswapd requests                        392119    1201712     571935     571921
      Kswapd wakeups                                   3          2          3          3
      Kswapd pages scanned                       4601307    4128076    3912317    3377165
      Kswapd pages reclaimed                     2432523    2318797    2312673    2144616
      Kswapd reclaim write file async I/O             20          1          1          1
      Kswapd reclaim write anon async I/O             57        132         11        121
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)         6.19       7.30      13.04      10.88
      Time kswapd awake (seconds)                  21.73      26.51      25.55      23.90
      
      Total pages scanned                        5231132   5091890   4976255   5089100
      Total pages reclaimed                      2508073   2561335   2463577   2532263
      %age total pages scanned/reclaimed          47.95%    50.30%    49.51%    49.76%
      %age total pages scanned/written             0.00%     0.00%     0.00%     0.00%
      %age  file pages scanned/written             0.00%     0.00%     0.00%     0.00%
      Percentage Time Spent Direct Reclaim        18.89%    20.65%    32.65%    27.65%
      Percentage Time kswapd Awake                72.39%    80.68%    78.21%    77.40%
      
      Again, a similar trend: the congestion_wait changes mean that direct
      reclaim scans more pages but the overall number of pages scanned, while
      slightly reduced, is very similar.  The ratio of scanning/reclaimed
      remains roughly similar.  The downside is that kswapd and direct reclaim
      were awake longer and for a larger percentage of the overall workload.
      It's possible there were big differences in the amount of time spent
      reclaiming slab pages between the different kernels which is plausible
      considering that the micro tests runs after fsmark and sysbench.
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited               845       1312        104          0
      Direct time   congest     waited           19416ms    26560ms     7544ms        0ms
      Direct full   congest     waited               745       1105         72          0
      Direct number conditional waited                 0          0       1322       2935
      Direct time   conditional waited               0ms        0ms       12ms      312ms
      Direct full   conditional waited                 0          0          0          3
      KSwapd number congest     waited                39        102         75         63
      KSwapd time   congest     waited            2484ms     6760ms     5756ms     3716ms
      KSwapd full   congest     waited                20         48         46         25
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      The vanilla kernel spent 20 seconds asleep in direct reclaim and only
      312ms asleep with the patches.  The time kswapd spent congest waited was
      also reduced by a large factor.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         26.58     28.05      26.9     28.47
      Total Elapsed Time (seconds)                 30.02     32.86     32.67     30.88
      
      With all patches applied, the completion times are very similar.
      
      X86-64 STRESS-HIGHALLOC
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
      Pass 1          82.00 ( 0.00%)    84.00 ( 2.00%)    85.00 ( 3.00%)    85.00 ( 3.00%)
      Pass 2          90.00 ( 0.00%)    87.00 (-3.00%)    88.00 (-2.00%)    89.00 (-1.00%)
      At Rest         92.00 ( 0.00%)    90.00 (-2.00%)    90.00 (-2.00%)    91.00 (-1.00%)
      
      Success figures across the board are broadly similar.
      
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
      Direct reclaims                               1045        944        886        887
      Direct reclaim pages scanned                135091     119604     109382     101019
      Direct reclaim pages reclaimed               88599      47535      47863      46671
      Direct reclaim write file async I/O            494        283        465        280
      Direct reclaim write anon async I/O          29357      13710      16656      13462
      Direct reclaim write file sync I/O             154          2          2          3
      Direct reclaim write anon sync I/O           14594        571        509        561
      Wake kswapd requests                          7491        933        872        892
      Kswapd wakeups                                 814        778        731        780
      Kswapd pages scanned                       7290822   15341158   11916436   13703442
      Kswapd pages reclaimed                     3587336    3142496    3094392    3187151
      Kswapd reclaim write file async I/O          91975      32317      28022      29628
      Kswapd reclaim write anon async I/O        1992022     789307     829745     849769
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)      4588.93    2467.16    2495.41    2547.07
      Time kswapd awake (seconds)                2497.66    1020.16    1098.06    1176.82
      
      Total pages scanned                        7425913  15460762  12025818  13804461
      Total pages reclaimed                      3675935   3190031   3142255   3233822
      %age total pages scanned/reclaimed          49.50%    20.63%    26.13%    23.43%
      %age total pages scanned/written            28.66%     5.41%     7.28%     6.47%
      %age  file pages scanned/written             1.25%     0.21%     0.24%     0.22%
      Percentage Time Spent Direct Reclaim        57.33%    42.15%    42.41%    42.99%
      Percentage Time kswapd Awake                43.56%    27.87%    29.76%    31.25%
      
      Scanned/reclaimed ratios again look good with big improvements in
      efficiency.  The Scanned/written ratios also look much improved.  With a
      better scanned/written ratio, there is an expectation that IO would be
      more efficient and indeed, the time spent in direct reclaim is much
      reduced by the full series and kswapd spends a little less time awake.
      
      Overall, indications here are that allocations were happening much faster
      and this can be seen with a graph of the latency figures as the
      allocations were taking place
      http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited              1333        204        169          4
      Direct time   congest     waited           78896ms     8288ms     7260ms      200ms
      Direct full   congest     waited               756         92         69          2
      Direct number conditional waited                 0          0         26        186
      Direct time   conditional waited               0ms        0ms        0ms     2504ms
      Direct full   conditional waited                 0          0          0         25
      KSwapd number congest     waited                 4        395        227        282
      KSwapd time   congest     waited             384ms    25136ms    10508ms    18380ms
      KSwapd full   congest     waited                 3        232         98        176
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      KSwapd full   conditional waited               318          0        312          9
      
      Overall, the time spent sleeping is reduced.  kswapd is still hitting
      congestion_wait() but that is because there are callers remaining where it
      wasn't clear in advance if they should be changed to wait_iff_congested()
      or not.  Overall the sleep times are reduced though - from 79ish seconds to
      about 19.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)       3415.43   3386.65   3388.39    3377.5
      Total Elapsed Time (seconds)               5733.48   3660.33   3689.41   3765.39
      
      With the full series, the time to complete the tests is reduced by 30%.
      
      PPC64 STRESS-HIGHALLOC
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
      Pass 1          17.00 ( 0.00%)    34.00 (17.00%)    38.00 (21.00%)    43.00 (26.00%)
      Pass 2          25.00 ( 0.00%)    37.00 (12.00%)    42.00 (17.00%)    46.00 (21.00%)
      At Rest         49.00 ( 0.00%)    43.00 (-6.00%)    45.00 (-4.00%)    51.00 ( 2.00%)
      
      Success rates there are *way* up particularly considering that the 16MB
      huge pages on PPC64 mean that it's always much harder to allocate them.
      
      FTrace Reclaim Statistics: vmscan
                    stress-highalloc  stress-highalloc  stress-highalloc  stress-highalloc
                      traceonly-v2r2     lowlumpy-v2r3  waitcongest-v2r3 waitwriteback-v2r4
      Direct reclaims                                499        505        564        509
      Direct reclaim pages scanned                223478      41898      51818      45605
      Direct reclaim pages reclaimed              137730      21148      27161      23455
      Direct reclaim write file async I/O            399        136        162        136
      Direct reclaim write anon async I/O          46977       2865       4686       3998
      Direct reclaim write file sync I/O              29          0          1          3
      Direct reclaim write anon sync I/O           31023        159        237        239
      Wake kswapd requests                           420        351        360        326
      Kswapd wakeups                                 185        294        249        277
      Kswapd pages scanned                      15703488   16392500   17821724   17598737
      Kswapd pages reclaimed                     5808466    2908858    3139386    3145435
      Kswapd reclaim write file async I/O         159938      18400      18717      13473
      Kswapd reclaim write anon async I/O        3467554     228957     322799     234278
      Kswapd reclaim write file sync I/O               0          0          0          0
      Kswapd reclaim write anon sync I/O               0          0          0          0
      Time stalled direct reclaim (seconds)      9665.35    1707.81    2374.32    1871.23
      Time kswapd awake (seconds)                9401.21    1367.86    1951.75    1328.88
      
      Total pages scanned                       15926966  16434398  17873542  17644342
      Total pages reclaimed                      5946196   2930006   3166547   3168890
      %age total pages scanned/reclaimed          37.33%    17.83%    17.72%    17.96%
      %age total pages scanned/written            23.27%     1.52%     1.94%     1.43%
      %age  file pages scanned/written             1.01%     0.11%     0.11%     0.08%
      Percentage Time Spent Direct Reclaim        44.55%    35.10%    41.42%    36.91%
      Percentage Time kswapd Awake                86.71%    43.58%    52.67%    41.14%
      
      While the scanning rates are slightly up, the scanned/reclaimed and
      scanned/written figures are much improved.  The time spent in direct
      reclaim and with kswapd are massively reduced, mostly by the lowlumpy
      patches.
      
      FTrace Reclaim Statistics: congestion_wait
      Direct number congest     waited               725        303        126          3
      Direct time   congest     waited           45524ms     9180ms     5936ms      300ms
      Direct full   congest     waited               487        190         52          3
      Direct number conditional waited                 0          0        200        301
      Direct time   conditional waited               0ms        0ms        0ms     1904ms
      Direct full   conditional waited                 0          0          0         19
      KSwapd number congest     waited                 0          2         23          4
      KSwapd time   congest     waited               0ms      200ms      420ms      404ms
      KSwapd full   congest     waited                 0          2          2          4
      KSwapd number conditional waited                 0          0          0          0
      KSwapd time   conditional waited               0ms        0ms        0ms        0ms
      KSwapd full   conditional waited                 0          0          0          0
      
      Not as dramatic a story here, but the time spent asleep is reduced and we
      can still see that wait_iff_congested is going to sleep when necessary.
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)      12028.09   3157.17   3357.79   3199.16
      Total Elapsed Time (seconds)              10842.07   3138.72   3705.54   3229.85
      
      The time to complete this test goes way down.  With the full series, we
      are allocating over twice the number of huge pages in 30% of the time,
      and there is a corresponding impact on the allocation latency graph
      available at:
      
      http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
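
The "30% of the time" figure can be reproduced from the "Total Elapsed Time" row. The sketch below assumes that the first column is the baseline kernel and the last column is the full series; the header that would confirm this is outside the excerpt.

```python
# Elapsed times in seconds, copied from the "Total Elapsed Time" row.
# Assumption: first column = baseline, last column = full patch series.
baseline_elapsed = 10842.07
full_series_elapsed = 3229.85

# Fraction of the baseline run time that the full series needs.
fraction = full_series_elapsed / baseline_elapsed
print(f"full series completes in {fraction:.1%} of the baseline time")
```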
      
      This patch:
      
      Add a trace event for shrink_inactive_list() and update the sample
      postprocessing script appropriately.  It can be used to determine how many
      pages were reclaimed and, for non-lumpy reclaim, where exactly the pages
      were reclaimed from.
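
A post-processing step of the kind described above could be sketched as follows. This is not the script shipped with the patch. The event name `mm_vmscan_lru_shrink_inactive`, the field names (`nid`, `zid`, `nr_scanned`, `nr_reclaimed`, `priority`) and the sample lines are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical trace output. The event name and key=value fields are assumed,
# not taken from the commit message.
SAMPLE = """\
kswapd0-311 [001] 100.000001: mm_vmscan_lru_shrink_inactive: nid=0 zid=1 nr_scanned=32 nr_reclaimed=30 priority=12
bench-2042 [003] 100.000420: mm_vmscan_lru_shrink_inactive: nid=0 zid=2 nr_scanned=32 nr_reclaimed=4 priority=10
kswapd0-311 [001] 100.001337: mm_vmscan_lru_shrink_inactive: nid=0 zid=1 nr_scanned=64 nr_reclaimed=50 priority=12
"""

EVENT = re.compile(r"mm_vmscan_lru_shrink_inactive: (.*)$")


def summarise(trace_text):
    """Sum scanned and reclaimed page counts per (node, zone) pair."""
    totals = defaultdict(lambda: {"scanned": 0, "reclaimed": 0})
    for line in trace_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue  # not a shrink_inactive_list event
        fields = dict(kv.split("=", 1) for kv in match.group(1).split())
        key = (int(fields["nid"]), int(fields["zid"]))
        totals[key]["scanned"] += int(fields["nr_scanned"])
        totals[key]["reclaimed"] += int(fields["nr_reclaimed"])
    return dict(totals)


summary = summarise(SAMPLE)
for (nid, zid), counts in sorted(summary.items()):
    print(f"node {nid} zone {zid}: {counts}")
```

Grouping by node and zone is one way to answer "where exactly the pages were reclaimed from" for the non-lumpy case.
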
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e11da5b4
    • S
      vmscan: delete dead code · 66d9a986
      Shaohua Li committed
      `priority' cannot be negative here.  And the comment is obsolete.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      66d9a986
    • M
      vmscan: prevent background aging of anon page in no swap system · 74e3f3c3
      Minchan Kim committed
      Ying Han reported that background aging of anon pages on a system
      without swap causes unnecessary TLB flushes.
      
      When I sent patch 69c85481, I wanted this change as well, but Rik
      pointed out a concern, so aging of anon pages was still allowed in order
      to give pages a chance to be promoted from the inactive to the active LRU.
      
      That has two problems.
      
      1) non-swap system
      
      It never makes sense to age anon pages.
      
      2) swap configured but not yet enabled with swapon
      
      It doesn't make sense to age anon pages until swapon time, although that
      is arguable.  If we have aged anon pages before swapon, the VM has
      already moved anon pages from the active to the inactive list.  Then,
      when the admin does swapon, the VM does not reclaim hot pages, so hot
      pages are protected from swapout.
      
      But let's think about it.  When does swapon happen?  It depends on the
      admin; we can't predict it.  Nonetheless, we have been aging anon pages to
      protect hot pages from swapout.  It means we pay run-time overhead while
      below the high watermark in order to save hot-page swap-in/out overhead
      when the VM decides to swap out.  Is that really true?  Let's look in more
      detail.  We don't promote anon pages on a non-swap system.  So even though
      the VM ages anon pages, the pages stay on the inactive LRU for a long
      time.  It means many of the pages there will have their access bit set
      again.  So hot/cold separation based on the access bit would be pointless.
      
      This patch prevents unnecessary demotion of anon pages on systems where
      swap is not yet enabled or not configured.  On a system without swap
      configured, inactive_anon_is_low can even be compiled out.
      
      A side effect is that hot anon pages could be swapped out when the admin
      does swapon.  But I think it would reach a steady state sooner or later,
      so it's not a big problem.
      
      We could lose something, but we gain more (fewer TLB flushes and no
      unnecessary function calls to demote anon pages).
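
The decision described above can be modelled in a few lines. This is a toy model, not the kernel code: `inactive_anon_is_low` is the name used in the commit message, but the parameters (`nr_inactive`, `nr_active`, `inactive_ratio`, `total_swap_pages`) and the ratio test are assumptions chosen to illustrate the idea.

```python
def inactive_anon_is_low(nr_inactive, nr_active, inactive_ratio, total_swap_pages):
    """Toy model: should active anon pages be demoted to the inactive list?

    All parameters are illustrative assumptions, not the kernel's signature.
    """
    # With no swap available, anon pages cannot be reclaimed anyway, so
    # demoting them only costs TLB flushes. Never ask for demotion.
    if total_swap_pages == 0:
        return False
    # Otherwise demote when the inactive list is small relative to the
    # active list.
    return nr_inactive * inactive_ratio < nr_active


# No swap: aging is skipped even though the inactive list is tiny.
print(inactive_anon_is_low(10, 1000, 3, total_swap_pages=0))
# Swap enabled: the same list sizes now trigger demotion.
print(inactive_anon_is_low(10, 1000, 3, total_swap_pages=4096))
```

Putting the swap check first means the ratio comparison is never evaluated on a swapless system, which mirrors the point that the whole function can be compiled out when swap is not configured.
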
      Signed-off-by: Ying Han <yinghan@google.com>
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74e3f3c3
    • T
      mm: only build per-node scan_unevictable functions when NUMA is enabled · e4455abb
      Thadeu Lima de Souza Cascardo committed
      Non-NUMA systems never create these files anyway, since they are only
      created by the driver subsystem when NUMA is configured.
      
      [akpm@linux-foundation.org: cleanup]
      Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4455abb
    • W
      writeback: remove nonblocking/encountered_congestion references · 1b430bee
      Wu Fengguang committed
      This removes more dead code that was somehow missed by commit 0d99519e
      (writeback: remove unused nonblocking and congestion checks).  There is
      no behavior change except for the removal of two entries from one of the
      ext4 tracing interfaces.
      
      The nonblocking checks in ->writepages are no longer used because the
      flusher now prefers to block on get_request_wait() rather than skip
      inodes on IO congestion.  The latter would lead to more seeky IO.
      
      The nonblocking checks in ->writepage are no longer used because they
      are redundant with the WB_SYNC_NONE check.
      
      We no longer set ->nonblocking in VM page out and page migration, because
      a) it's effectively redundant with WB_SYNC_NONE in current code
      b) its old semantic of "Don't get stuck on request queues" is misbehavior:
         that would skip some dirty inodes on congestion and page out others, which
         is unfair in terms of LRU age.
      
      Inspired by Christoph Hellwig. Thanks!
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Sage Weil <sage@newdream.net>
      Cc: Steve French <sfrench@samba.org>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b430bee
  7. 23 Sep, 2010 1 commit
  8. 18 Aug, 2010 1 commit
  9. 11 Aug, 2010 4 commits
  10. 10 Aug, 2010 2 commits
    • W
      vmscan: raise the bar to PAGEOUT_IO_SYNC stalls · e31f3698
      Wu Fengguang committed
      Fix "system goes unresponsive under memory pressure and lots of
      dirty/writeback pages" bug.
      
      	http://lkml.org/lkml/2010/4/4/86
      
      In the above thread, Andreas Mohr described that
      
      	Invoking any command locked up for minutes (note that I'm
      	talking about attempted additional I/O to the _other_,
      	_unaffected_ main system HDD - such as loading some shell
      	binaries -, NOT the external SSD18M!!).
      
      This happens when these two conditions are both met:
      - under memory pressure
      - writing heavily to a slow device
      
      OOM also happens in Andreas' system.  The OOM trace shows that 3 processes
      are stuck in wait_on_page_writeback() in the direct reclaim path.  One in
      do_fork() and the other two in unix_stream_sendmsg().  They are blocked on
      this condition:
      
      	(sc->order && priority < DEF_PRIORITY - 2)
      
      which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
      also should use PAGEOUT_IO_SYNC) one year ago.  That condition may be too
      permissive.  In Andreas' case, 512MB/1024 = 512KB.  If the direct reclaim
      for the order-1 fork() allocation runs into a range of 512KB
      hard-to-reclaim LRU pages, it will be stalled.
      
      It's a severe problem in three ways.
      
      Firstly, it can easily happen in daily desktop usage.  vmscan priority can
      easily go below (DEF_PRIORITY - 2) on _local_ memory pressure.  Even if
      50% of the system's pages are globally reclaimable, there is still a good
      chance of hitting hard-to-reclaim ranges of 0.1% of the LRU.  For example, a
      simple dd can easily create a big range (up to 20%) of dirty pages in the
      LRU lists.  And order-1 to order-3 allocations are more than common with
      SLUB.  Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
      slab caches.  For example, the order-1 radix_tree_node slab cache may
      stall applications at swap-in time; the order-3 inode cache on most
      filesystems may stall applications when trying to read some file; the
      order-2 proc_inode_cache may stall applications when trying to open a
      /proc file.
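
The grep above drops every line whose pages-per-slab column is 1. A slightly more explicit version of the same filter is sketched below. It assumes the usual /proc/slabinfo column layout (name, active_objs, num_objs, objsize, objperslab, pagesperslab), and the sample lines and their numbers are invented for illustration rather than taken from a real system.

```python
# Hypothetical excerpt in /proc/slabinfo layout; all numbers are made up.
SAMPLE = """\
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables ...
ext4_inode_cache 5000 5040 888 36 8 : tunables 0 0 0 : slabdata 140 140 0
proc_inode_cache 1200 1222 632 25 4 : tunables 0 0 0 : slabdata 47 47 0
radix_tree_node 9000 9016 560 14 2 : tunables 0 0 0 : slabdata 644 644 0
dentry 30000 30030 192 21 1 : tunables 0 0 0 : slabdata 1430 1430 0
"""


def high_order_caches(text):
    """Map cache name to allocation order for caches using more than one page."""
    result = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(("#", "slabinfo")):
            continue  # skip blank lines and the header lines
        columns = line.split(":", 1)[0].split()
        name, pages_per_slab = columns[0], int(columns[5])
        if pages_per_slab > 1:
            # pages_per_slab is a power of two, so the order is its log2.
            result[name] = pages_per_slab.bit_length() - 1
    return result


orders = high_order_caches(SAMPLE)
for name, order in sorted(orders.items()):
    print(f"{name}: order-{order}")
```

On a real system the text would come from reading /proc/slabinfo, which typically requires root.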
      
      Secondly, once triggered, it will stall unrelated processes (not doing IO
      at all) in the system.  This "one slow USB device stalls the whole system"
      avalanching effect is very bad.
      
      Thirdly, once stalled, the stall time could be intolerably long for the
      users.  When there are 20MB of queued writeback pages and USB 1.1 is
      writing them at 1MB/s, wait_on_page_writeback() will be stuck for up to
      20 seconds.  Not to mention it may be called multiple times.
      
      So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
      DEF_PRIORITY/3, or 6.25% LRU size.  As the default dirty throttle ratio is
      20%, it will hardly be triggered by pure dirty pages.  We'd better treat
      PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
      uncomfortably long (easily goes beyond 1s).
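
The two thresholds can be compared with a little arithmetic. The sketch below assumes that a reclaim pass at a given priority covers roughly `lru_size >> priority` and that `DEF_PRIORITY` is 12. The helper `scan_window` is a hypothetical name used only for this illustration.

```python
DEF_PRIORITY = 12  # assumed kernel default; reclaim starts here and counts down
MiB = 1 << 20


def scan_window(lru_bytes, priority):
    """Approximate amount of LRU covered by one pass at this priority."""
    return lru_bytes >> priority


# Old condition: priority < DEF_PRIORITY - 2, i.e. windows of 1/1024 of the LRU.
old_threshold = DEF_PRIORITY - 2
old_window = scan_window(512 * MiB, old_threshold)

# New condition: priority below DEF_PRIORITY / 3, i.e. 1/16 of the LRU.
new_threshold = DEF_PRIORITY // 3
new_fraction = 1 / (1 << new_threshold)

print(f"old window on a 512MB LRU: {old_window // 1024}KB")
print(f"new window as a share of the LRU: {new_fraction:.2%}")
```

With a 20% default dirty throttle ratio, a 6.25% window is much less likely to consist purely of dirty pages than a 0.1% one, which is the reasoning behind the new bar.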
      
      The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
      which are easy to satisfy in 1TB memory boxes.  So, although 6.25% of
      memory could be an awful lot of pages to scan on a system with 1TB of
      memory, it won't really have to busy scan that much.
      
      Andreas tested an older version of this patch and reported that it mostly
      fixed his problem.  Mel Gorman helped improve it and KOSAKI Motohiro will
      fix it further in the next patch.
      Reported-by: Andreas Mohr <andi@lisas.de>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e31f3698
    • K
      memcg, vmscan: add memcg reclaim tracepoint · bdce6d9e
      KOSAKI Motohiro committed
      Memcg also needs to trace reclaim progress, as direct reclaim does.  This
      patch adds it.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bdce6d9e