1. 16 12月, 2009 6 次提交
    • K
      vmscan: kill hibernation specific reclaim logic and unify it · 7b51755c
      KOSAKI Motohiro 提交于
      shrink_all_zone() was introduced by commit d6277db4 (swsusp: rework
      memory shrinker) for hibernate performance improvement.  and
      sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
      memory for suspend).
      
      commit a06fe4d307 said
      
         Without the patch:
         Freed  14600 pages in  1749 jiffies = 32.61 MB/s (Anomolous!)
         Freed  88563 pages in 14719 jiffies = 23.50 MB/s
         Freed 205734 pages in 32389 jiffies = 24.81 MB/s
      
         With the patch:
         Freed  68252 pages in   496 jiffies = 537.52 MB/s
         Freed 116464 pages in   569 jiffies = 798.54 MB/s
         Freed 209699 pages in   705 jiffies = 1161.89 MB/s
      
      At that time, their patch was pretty worth.  However, Modern Hardware
      trend and recent VM improvement broke its worth.  From several reason, I
      think we should remove shrink_all_zones() at all.
      
      detail:
      
      1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
        at no i/o congestion.
        but current shrink_zone() is sane, not slow.
      
      2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
        fine on numa system.
        example)
          System has 4GB memory and each node have 2GB. and hibernate need 1GB.
      
          optimal)
             steal 500MB from each node.
          shrink_all_zones)
             steal 1GB from node-0.
      
        Oh, Cache balancing logic was broken. ;)
        Unfortunately, Desktop system moved ahead NUMA at nowadays.
        (Side note, if hibernate require 2GB, shrink_all_zones() never success
         on above machine)
      
      3) if the node has several I/O flighting pages, shrink_all_zones() makes
        pretty bad result.
      
        schenario) hibernate need 1GB
      
        1) shrink_all_zones() try to reclaim 1GB from Node-0
        2) but it only reclaimed 990MB
        3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
        4) it reclaimed 990MB
      
        Oh, well. it reclaimed twice much than required.
        In the other hand, current shrink_zone() has sane baling out logic.
        then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
      
      4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
        shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
      
      Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
      it bring good reviewability and debuggability, and solve above problems.
      
      side note: Reclaim logic unificication makes two good side effect.
       - Fix recursive reclaim bug on shrink_all_memory().
         it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
       - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NRafael J. Wysocki <rjw@sisk.pl>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b51755c
    • K
      vmscan: separate sc.swap_cluster_max and sc.nr_max_reclaim · 22fba335
      KOSAKI Motohiro 提交于
      Currently, sc.scap_cluster_max has double meanings.
      
       1) reclaim batch size as isolate_lru_pages()'s argument
       2) reclaim baling out thresolds
      
      The two meanings pretty unrelated. Thus, Let's separate it.
      this patch doesn't change any behavior.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      22fba335
    • K
      vmscan: stop kswapd waiting on congestion when the min watermark is not being met · bb3ab596
      KOSAKI Motohiro 提交于
      If reclaim fails to make sufficient progress, the priority is raised.
      Once the priority is higher, kswapd starts waiting on congestion.
      However, if the zone is below the min watermark then kswapd needs to
      continue working without delay as there is a danger of an increased rate
      of GFP_ATOMIC allocation failure.
      
      This patch changes the conditions under which kswapd waits on congestion
      by only going to sleep if the min watermarks are being met.
      
      [mel@csn.ul.ie: add stats to track how relevant the logic is]
      [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb3ab596
    • M
      vmscan: have kswapd sleep for a short interval and double check it should be asleep · f50de2d3
      Mel Gorman 提交于
      After kswapd balances all zones in a pgdat, it goes to sleep.  In the
      event of no IO congestion, kswapd can go to sleep very shortly after the
      high watermark was reached.  If there are a constant stream of allocations
      from parallel processes, it can mean that kswapd went to sleep too quickly
      and the high watermark is not being maintained for sufficient length time.
      
      This patch makes kswapd go to sleep as a two-stage process.  It first
      tries to sleep for HZ/10.  If it is woken up by another process or the
      high watermark is no longer met, it's considered a premature sleep and
      kswapd continues work.  Otherwise it goes fully to sleep.
      
      This adds more counters to distinguish between fast and slow breaches of
      watermarks.  A "fast" premature sleep is one where the low watermark was
      hit in a very short time after kswapd going to sleep.  A "slow" premature
      sleep indicates that the high watermark was breached after a very short
      interval.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Frans Pop <elendil@planet.nl>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f50de2d3
    • V
      mm/vmscan: change comment generic_file_write to __generic_file_aio_write · 6aceb53b
      Vincent Li 提交于
      Commit 543ade1f ("Streamline generic_file_* interfaces and filemap
      cleanups") removed generic_file_write() in filemap.  Change the comment in
      vmscan pageout() to __generic_file_aio_write().
      Signed-off-by: NVincent Li <macli@brc.ubc.ca>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6aceb53b
    • D
      mm: clear node in N_HIGH_MEMORY and stop kswapd when all memory is offlined · 8fe23e05
      David Rientjes 提交于
      When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
      there are no present pages left.
      
      In such a situation, kswapd must also be stopped since it has nothing left
      to do.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe23e05
  2. 29 10月, 2009 3 次提交
  3. 26 9月, 2009 1 次提交
  4. 24 9月, 2009 2 次提交
  5. 22 9月, 2009 17 次提交
  6. 16 9月, 2009 1 次提交
    • A
      HWPOISON: Use bitmask/action code for try_to_unmap behaviour · 14fa31b8
      Andi Kleen 提交于
      try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
      which are selected by magic flag variables. The logic is not very straight
      forward, because each of these flag change multiple behaviours (e.g.
      migration turns off aging, not only sets up migration ptes etc.)
      Also the different flags interact in magic ways.
      
      A later patch in this series adds another mode to try_to_unmap, so
      this becomes quickly unmanageable.
      
      Replace the different flags with a action code (migration, munlock, munmap)
      and some additional flags as modifiers (ignore mlock, ignore aging).
      This makes the logic more straight forward and allows easier extension
      to new behaviours. Change all the caller to declare what they want to
      do.
      
      This patch is supposed to be a nop in behaviour. If anyone can prove
      it is not that would be a bug.
      
      Cc: Lee.Schermerhorn@hp.com
      Cc: npiggin@suse.de
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      14fa31b8
  7. 11 9月, 2009 1 次提交
    • J
      writeback: switch to per-bdi threads for flushing data · 03ba3782
      Jens Axboe 提交于
      This gets rid of pdflush for bdi writeout and kupdated style cleaning.
      pdflush writeout suffers from lack of locality and also requires more
      threads to handle the same workload, since it has to work in a
      non-blocking fashion against each queue. This also introduces lumpy
      behaviour and potential request starvation, since pdflush can be starved
      for queue access if others are accessing it. A sample ffsb workload that
      does random writes to files is about 8% faster here on a simple SATA drive
      during the benchmark phase. File layout also seems a LOT more smooth in
      vmstat:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
       0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
       1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
       0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
       0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
       0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
       0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
       0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
       0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
       0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45
      
      where vanilla tends to fluctuate a lot in the creation phase:
      
       r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
       1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
       1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
       0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
       0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
       1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
       0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
       0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
       1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
       0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
       1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
       1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
       0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54
      
      A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
      SSD based writeback test on XFS performs over 20% better as well, with
      the throughput being very stable around 1GB/sec, where pdflush only
      manages 750MB/sec and fluctuates wildly while doing so. Random buffered
      writes to many files behave a lot better as well, as does random mmap'ed
      writes.
      
      A separate thread is added to sync the super blocks. In the long term,
      adding sync_supers_bdi() functionality could get rid of this thread again.
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      03ba3782
  8. 27 8月, 2009 1 次提交
  9. 11 7月, 2009 1 次提交
  10. 24 6月, 2009 1 次提交
  11. 19 6月, 2009 1 次提交
  12. 17 6月, 2009 5 次提交
    • K
      mm: fix lumpy reclaim lru handling at isolate_lru_pages · ee993b13
      KAMEZAWA Hiroyuki 提交于
      At lumpy reclaim, a page failed to be taken by __isolate_lru_page() can be
      pushed back to "src" list by list_move().  But the page may not be from
      "src" list.  This pushes the page back to wrong LRU.  And list_move()
      itself is unnecessary because the page is not on top of LRU.  Then, leave
      it as it is if __isolate_lru_page() fails.
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee993b13
    • M
      vmscan: count the number of times zone_reclaim() scans and fails · 24cf7251
      Mel Gorman 提交于
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targetted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but it is
      possible that the heuristic will fail and the CPU gets tied up scanning
      uselessly.  Detecting the situation requires some guesswork and
      experimentation so this patch adds a counter "zreclaim_failed" to
      /proc/vmstat.  If during high CPU utilisation this counter is increasing
      rapidly, then the resolution to the problem may be to set
      /proc/sys/vm/zone_reclaim_mode to 0.
      
      [akpm@linux-foundation.org: name things consistently]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      24cf7251
    • M
      vmscan: do not unconditionally treat zones that fail zone_reclaim() as full · fa5e084e
      Mel Gorman 提交于
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targetted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.  The problem is that zone_reclaim() failing at all means the
      zone gets marked full.
      
      This can cause situations where a zone is usable, but is being skipped
      because it has been considered full.  Take a situation where a large tmpfs
      mount is occuping a large percentage of memory overall.  The pages do not
      get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
      and the zonelist cache considers them not worth trying in the future.
      
      This patch makes zone_reclaim() return more fine-grained information about
      what occured when zone_reclaim() failued.  The zone only gets marked full
      if it really is unreclaimable.  If it's a case that the scan did not occur
      or if enough pages were not reclaimed with the limited reclaim_mode, then
      the zone is simply skipped.
      
      There is a side-effect to this patch.  Currently, if zone_reclaim()
      successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
      ahead.  With this patch applied, zone watermarks are rechecked after
      zone_reclaim() does some work.
      
      This bug was introduced by commit 9276b1bc
      ("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
      zonelist_cache was introduced.  It was not intended that zone_reclaim()
      aggressively consider the zone to be full when it failed as full direct
      reclaim can still be an option.  Due to the age of the bug, it should be
      considered a -stable candidate.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa5e084e
    • M
      vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim · 90afa5de
      Mel Gorman 提交于
      A bug was brought to my attention against a distro kernel but it affects
      mainline and I believe problems like this have been reported in various
      guises on the mailing lists although I don't have specific examples at the
      moment.
      
      The reported problem was that malloc() stalled for a long time (minutes in
      some cases) if a large tmpfs mount was occupying a large percentage of
      memory overall.  The pages did not get cleaned or reclaimed by
      zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
      are uselessly scanned frequencly making the CPU spin at near 100%.
      
      This patchset intends to address that bug and bring the behaviour of
      zone_reclaim() more in line with expectations which were noticed during
      investigation.  It is based on top of mmotm and takes advantage of
      Kosaki's work with respect to zone_reclaim().
      
      Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
      	scan should go ahead. The broken heuristic is what was causing the
      	malloc() stall as it uselessly scanned the LRU constantly. Currently,
      	zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
      	could not deal with tmpfs pages at all. This fixes up the heuristic so
      	that an unnecessary scan is more likely to be correctly avoided.
      
      Patch 2 notes that zone_reclaim() returning a failure automatically means
      	the zone is marked full. This is not always true. It could have
      	failed because the GFP mask or zone_reclaim_mode were unsuitable.
      
      Patch 3 introduces a counter zreclaim_failed that will increment each
      	time the zone_reclaim scan-avoidance heuristics fail. If that
      	counter is rapidly increasing, then zone_reclaim_mode should be
      	set to 0 as a temporarily resolution and a bug reported because
      	the scan-avoidance heuristic is still broken.
      
      This patch:
      
      On NUMA machines, the administrator can configure zone_reclaim_mode that
      is a more targetted form of direct reclaim.  On machines with large NUMA
      distances for example, a zone_reclaim_mode defaults to 1 meaning that
      clean unmapped pages will be reclaimed if the zone watermarks are not
      being met.
      
      There is a heuristic that determines if the scan is worthwhile but the
      problem is that the heuristic is not being properly applied and is
      basically assuming zone_reclaim_mode is 1 if it is enabled.  The lack of
      proper detection can manfiest as high CPU usage as the LRU list is scanned
      uselessly.
      
      Historically, once enabled it was depending on NR_FILE_PAGES which may
      include swapcache pages that the reclaim_mode cannot deal with.  Patch
      vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
      Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
      pages that were not file-backed such as swapcache and made a calculation
      based on the inactive, active and mapped files.  This is far superior when
      zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
      reasonable starting figure.
      
      This patch alters how zone_reclaim() works out how many pages it might be
      able to reclaim given the current reclaim_mode.  If RECLAIM_SWAP is set in
      the reclaim_mode it will either consider NR_FILE_PAGES as potential
      candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
      swapcache and other non-file-backed pages.  If RECLAIM_WRITE is not set,
      then NR_FILE_DIRTY number of pages are not candidates.  If RECLAIM_SWAP is
      not set, then NR_FILE_MAPPED are not.
      
      [kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
      [fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90afa5de
    • D
      vmscan: handle may_swap more strictly · 9198e96c
      Daisuke Nishimura 提交于
      Commit 2e2e4259 ("vmscan,memcg:
      reintroduce sc->may_swap) add may_swap flag and handle it at
      get_scan_ratio().
      
      But the result of get_scan_ratio() is ignored when priority == 0, so anon
      lru is scanned even if may_swap == 0 or nr_swap_pages == 0.  IMHO, this is
      not an expected behavior.
      
      As for memcg especially, because of this behavior many and many pages are
      swapped-out just in vain when oom is invoked by mem+swap limit.
      
      This patch is for handling may_swap flag more strictly.
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9198e96c