1. 27 5月, 2011 5 次提交
    • Y
      memcg: rename mem_cgroup_zone_nr_pages() to mem_cgroup_zone_nr_lru_pages() · 1bac180b
      Ying Han 提交于
      The caller of the function has been renamed to zone_nr_lru_pages(), and
      this is just fixing up in the memcg code.  The current name is easily to
      be mis-read as zone's total number of pages.
      Signed-off-by: NYing Han <yinghan@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bac180b
    • K
      memcg: fix get_scan_count() for small targets · 246e87a9
      KAMEZAWA Hiroyuki 提交于
      During memory reclaim we determine the number of pages to be scanned per
      zone as
      
      	(anon + file) >> priority.
      Assume
      	scan = (anon + file) >> priority.
      
      If scan < SWAP_CLUSTER_MAX, the scan will be skipped for this time and
      priority gets higher.  This has some problems.
      
        1. This increases priority as 1 without any scan.
           To do scan in this priority, amount of pages should be larger than 512M.
           If pages>>priority < SWAP_CLUSTER_MAX, it's recorded and scan will be
           batched, later. (But we lose 1 priority.)
           If memory size is below 16M, pages >> priority is 0 and no scan in
           DEF_PRIORITY forever.
      
        2. If zone->all_unreclaimabe==true, it's scanned only when priority==0.
           So, x86's ZONE_DMA will never be recoverred until the user of pages
           frees memory by itself.
      
        3. With memcg, the limit of memory can be small. When using small memcg,
           it gets priority < DEF_PRIORITY-2 very easily and need to call
           wait_iff_congested().
           For doing scan before priorty=9, 64MB of memory should be used.
      
      Then, this patch tries to scan SWAP_CLUSTER_MAX of pages in force...when
      
        1. the target is enough small.
        2. it's kswapd or memcg reclaim.
      
      Then we can avoid rapid priority drop and may be able to recover
      all_unreclaimable in a small zones.  And this patch removes nr_saved_scan.
       This will allow scanning in this priority even when pages >> priority is
      very small.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NYing Han <yinghan@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      246e87a9
    • Y
      memcg: reclaim memory from nodes in round-robin order · 889976db
      Ying Han 提交于
      Presently, memory cgroup's direct reclaim frees memory from the current
      node.  But this has some troubles.  Usually when a set of threads works in
      a cooperative way, they tend to operate on the same node.  So if they hit
      limits under memcg they will reclaim memory from themselves, damaging the
      active working set.
      
      For example, assume 2 node system which has Node 0 and Node 1 and a memcg
      which has 1G limit.  After some work, file cache remains and the usages
      are
      
         Node 0:  1M
         Node 1:  998M.
      
      and run an application on Node 0, it will eat its foot before freeing
      unnecessary file caches.
      
      This patch adds round-robin for NUMA and adds equal pressure to each node.
      When using cpuset's spread memory feature, this will work very well.
      
      But yes, a better algorithm is needed.
      
      [akpm@linux-foundation.org: comment editing]
      [kamezawa.hiroyu@jp.fujitsu.com: fix time comparisons]
      Signed-off-by: NYing Han <yinghan@google.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      889976db
    • Y
      memcg: add the soft_limit reclaim in global direct reclaim. · d149e3b2
      Ying Han 提交于
      We recently added the change in global background reclaim which counts the
      return value of soft_limit reclaim.  Now this patch adds the similar logic
      on global direct reclaim.
      
      We should skip scanning global LRU on shrink_zone if soft_limit reclaim
      does enough work.  This is the first step where we start with counting the
      nr_scanned and nr_reclaimed from soft_limit reclaim into global
      scan_control.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d149e3b2
    • Y
      memcg: count the soft_limit reclaim in global background reclaim · 0ae5e89c
      Ying Han 提交于
      The global kswapd scans per-zone LRU and reclaims pages regardless of the
      cgroup. It breaks memory isolation since one cgroup can end up reclaiming
      pages from another cgroup. Instead we should rely on memcg-aware target
      reclaim including per-memcg kswapd and soft_limit hierarchical reclaim under
      memory pressure.
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU. However, the return value is ignored. This patch is the first
      step to skip shrink_zone() if soft_limit reclaim does enough work.
      
      This is part of the effort which tries to reduce reclaiming pages in global
      LRU in memcg. The per-memcg background reclaim patchset further enhances the
      per-cgroup targetting reclaim, which I should have V4 posted shortly.
      
      Try running multiple memory intensive workloads within seperate memcgs. Watch
      the counters of soft_steal in memory.stat.
      
        $ cat /dev/cgroup/A/memory.stat | grep 'soft'
        soft_steal 240000
        soft_scan 240000
        total_soft_steal 240000
        total_soft_scan 240000
      
      This patch:
      
      In the global background reclaim, we do soft reclaim before scanning the
      per-zone LRU.  However, the return value is ignored.
      
      We would like to skip shrink_zone() if soft_limit reclaim does enough
      work.  Also, we need to make the memory pressure balanced across per-memcg
      zones, like the logic vm-core.  This patch is the first step where we
      start with counting the nr_scanned and nr_reclaimed from soft_limit
      reclaim into the global scan_control.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ae5e89c
  2. 25 5月, 2011 5 次提交
    • Y
      vmscan: change shrinker API by passing shrink_control struct · 1495f230
      Ying Han 提交于
      Change each shrinker's API by consolidating the existing parameters into
      shrink_control struct.  This will simplify any further features added w/o
      touching each file of shrinker.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: fix warning]
      [kosaki.motohiro@jp.fujitsu.com: fix up new shrinker API]
      [akpm@linux-foundation.org: fix xfs warning]
      [akpm@linux-foundation.org: update gfs2]
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Steven Whitehouse <swhiteho@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1495f230
    • Y
      vmscan: change shrink_slab() interfaces by passing shrink_control · a09ed5e0
      Ying Han 提交于
      Consolidate the existing parameters to shrink_slab() into a new
      shrink_control struct.  This is needed later to pass the same struct to
      shrinkers.
      Signed-off-by: NYing Han <yinghan@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a09ed5e0
    • K
      mm: strictly require elevated page refcount in isolate_lru_page() · 0c917313
      Konstantin Khlebnikov 提交于
      isolate_lru_page() must be called only with stable reference to the page,
      this is what is written in the comment above it, this is reasonable.
      
      current isolate_lru_page() users and its page extra reference sources:
      
       mm/huge_memory.c:
        __collapse_huge_page_isolate()	- reference from pte
      
       mm/memcontrol.c:
        mem_cgroup_move_parent()		- get_page_unless_zero()
        mem_cgroup_move_charge_pte_range()	- reference from pte
      
       mm/memory-failure.c:
        soft_offline_page()			- fixed, reference from get_any_page()
        delete_from_lru_cache() - reference from caller or get_page_unless_zero()
      	[ seems like there bug, because __memory_failure() can call
      	  page_action() for hpages tail, but it is ok for
      	  isolate_lru_page(), tail getted and not in lru]
      
       mm/memory_hotplug.c:
        do_migrate_range()			- fixed, get_page_unless_zero()
      
       mm/mempolicy.c:
        migrate_page_add()			- reference from pte
      
       mm/migrate.c:
        do_move_page_to_node_array()		- reference from follow_page()
      
       mlock.c:				- various external references
      
       mm/vmscan.c:
        putback_lru_page()			- reference from isolate_lru_page()
      
      It seems that all isolate_lru_page() users are ready now for this
      restriction.  So, let's replace redundant get_page_unless_zero() with
      get_page() and add page initial reference count check with VM_BUG_ON()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c917313
    • M
      mm: vmscan: correctly check if reclaimer should schedule during shrink_slab · f06590bd
      Minchan Kim 提交于
      It has been reported on some laptops that kswapd is consuming large
      amounts of CPU and not being scheduled when SLUB is enabled during large
      amounts of file copying.  It is expected that this is due to kswapd
      missing every cond_resched() point because;
      
      shrink_page_list() calls cond_resched() if inactive pages were isolated
              which in turn may not happen if all_unreclaimable is set in
              shrink_zones(). If for whatver reason, all_unreclaimable is
              set on all zones, we can miss calling cond_resched().
      
      balance_pgdat() only calls cond_resched if the zones are not
              balanced. For a high-order allocation that is balanced, it
              checks order-0 again. During that window, order-0 might have
              become unbalanced so it loops again for order-0 and returns
              that it was reclaiming for order-0 to kswapd(). It can then
              find that a caller has rewoken kswapd for a high-order and
              re-enters balance_pgdat() without ever calling cond_resched().
      
      shrink_slab only calls cond_resched() if we are reclaiming slab
      	pages. If there are a large number of direct reclaimers, the
      	shrinker_rwsem can be contended and prevent kswapd calling
      	cond_resched().
      
      This patch modifies the shrink_slab() case.  If the semaphore is
      contended, the caller will still check cond_resched().  After each
      successful call into a shrinker, the check for cond_resched() remains in
      case one shrinker is particularly slow.
      
      [mgorman@suse.de: preserve call to cond_resched after each call into shrinker]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: NColin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f06590bd
    • J
      mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely · afc7e326
      Johannes Weiner 提交于
      There are a few reports of people experiencing hangs when copying large
      amounts of data with kswapd using a large amount of CPU which appear to be
      due to recent reclaim changes.  SLUB using high orders is the trigger but
      not the root cause as SLUB has been using high orders for a while.  The
      root cause was bugs introduced into reclaim which are addressed by the
      following two patches.
      
      Patch 1 corrects logic introduced by commit 1741c877 ("mm: kswapd:
              keep kswapd awake for high-order allocations until a percentage of
              the node is balanced") to allow kswapd to go to sleep when
              balanced for high orders.
      
      Patch 2 notes that it is possible for kswapd to miss every
              cond_resched() and updates shrink_slab() so it'll at least reach
              that scheduling point.
      
      Chris Wood reports that these two patches in isolation are sufficient to
      prevent the system hanging.  AFAIK, they should also resolve similar hangs
      experienced by James Bottomley.
      
      This patch:
      
      Johannes Weiner poined out that the logic in commit 1741c877 ("mm: kswapd:
      keep kswapd awake for high-order allocations until a percentage of the
      node is balanced") is backwards.  Instead of allowing kswapd to go to
      sleep when balancing for high order allocations, it keeps it kswapd
      running uselessly.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Tested-by: NColin King <colin.king@canonical.com>
      Cc: Raghavendra D Prabhu <raghu.prabhu13@gmail.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NWu Fengguang <fengguang.wu@intel.com>
      Cc: <stable@kernel.org>		[2.6.38+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afc7e326
  3. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  4. 18 5月, 2011 1 次提交
  5. 15 4月, 2011 1 次提交
    • K
      vmscan: all_unreclaimable() use zone->all_unreclaimable as a name · 929bea7c
      KOSAKI Motohiro 提交于
      all_unreclaimable check in direct reclaim has been introduced at 2.6.19
      by following commit.
      
      	2006 Sep 25; commit 408d8544; oom: use unreclaimable info
      
      And it went through strange history. firstly, following commit broke
      the logic unintentionally.
      
      	2008 Apr 29; commit a41f24ea; page allocator: smarter retry of
      				      costly-order allocations
      
      Two years later, I've found obvious meaningless code fragment and
      restored original intention by following commit.
      
      	2010 Jun 04; commit bb21c7ce; vmscan: fix do_try_to_free_pages()
      				      return value when priority==0
      
      But, the logic didn't works when 32bit highmem system goes hibernation
      and Minchan slightly changed the algorithm and fixed it .
      
      	2010 Sep 22: commit d1908362: vmscan: check all_unreclaimable
      				      in direct reclaim path
      
      But, recently, Andrey Vagin found the new corner case. Look,
      
      	struct zone {
      	  ..
      	        int                     all_unreclaimable;
      	  ..
      	        unsigned long           pages_scanned;
      	  ..
      	}
      
      zone->all_unreclaimable and zone->pages_scanned are neigher atomic
      variables nor protected by lock.  Therefore zones can become a state of
      zone->page_scanned=0 and zone->all_unreclaimable=1.  In this case, current
      all_unreclaimable() return false even though zone->all_unreclaimabe=1.
      
      This resulted in the kernel hanging up when executing a loop of the form
      
      1. fork
      2. mmap
      3. touch memory
      4. read memory
      5. munmmap
      
      as described in
      http://www.gossamer-threads.com/lists/linux/kernel/1348725#1348725
      
      Is this ignorable minor issue?  No.  Unfortunately, x86 has very small dma
      zone and it become zone->all_unreclamble=1 easily.  and if it become
      all_unreclaimable=1, it never restore all_unreclaimable=0.  Why?  if
      all_unreclaimable=1, vmscan only try DEF_PRIORITY reclaim and
      a-few-lru-pages>>DEF_PRIORITY always makes 0.  that mean no page scan at
      all!
      
      Eventually, oom-killer never works on such systems.  That said, we can't
      use zone->pages_scanned for this purpose.  This patch restore
      all_unreclaimable() use zone->all_unreclaimable as old.  and in addition,
      to add oom_killer_disabled check to avoid reintroduce the issue of commit
      d1908362 ("vmscan: check all_unreclaimable in direct reclaim path").
      Reported-by: NAndrey Vagin <avagin@openvz.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      929bea7c
  6. 31 3月, 2011 1 次提交
  7. 23 3月, 2011 3 次提交
  8. 10 3月, 2011 1 次提交
  9. 26 2月, 2011 1 次提交
    • M
      mm: vmscan: stop reclaim/compaction earlier due to insufficient progress if !__GFP_REPEAT · 2876592f
      Mel Gorman 提交于
      should_continue_reclaim() for reclaim/compaction allows scanning to
      continue even if pages are not being reclaimed until the full list is
      scanned.  In terms of allocation success, this makes sense but potentially
      it introduces unwanted latency for high-order allocations such as
      transparent hugepages and network jumbo frames that would prefer to fail
      the allocation attempt and fallback to order-0 pages.  Worse, there is a
      potential that the full LRU scan will clear all the young bits, distort
      page aging information and potentially push pages into swap that would
      have otherwise remained resident.
      
      This patch will stop reclaim/compaction if no pages were reclaimed in the
      last SWAP_CLUSTER_MAX pages that were considered.  For allocations such as
      hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full
      LRU list may still be scanned.
      
      Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION
      is not set so the following avoids the gfp_mask being examined:
      
              if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION))
                      return false;
      
      A tool was developed based on ftrace that tracked the latency of
      high-order allocations while transparent hugepage support was enabled and
      three benchmarks were run.  The "fix-infinite" figures are 2.6.38-rc4 with
      Johannes's patch "vmscan: fix zone shrinking exit when scan work is done"
      applied.
      
        STREAM Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            10298           10229
        1 :: Min             0.4560          0.4640
        1 :: Mean            1.0589          1.0183
        1 :: Max            14.5990         11.7510
        1 :: Stddev          0.5208          0.4719
        2 :: Count                2               1
        2 :: Min             1.8610          3.7240
        2 :: Mean            3.4325          3.7240
        2 :: Max             5.0040          3.7240
        2 :: Stddev          1.5715          0.0000
        9 :: Count           111696          111694
        9 :: Min             0.5230          0.4110
        9 :: Mean           10.5831         10.5718
        9 :: Max            38.4480         43.2900
        9 :: Stddev          1.1147          1.1325
      
      Mean time for order-1 allocations is reduced.  order-2 looks increased but
      with so few allocations, it's not particularly significant.  THP mean
      allocation latency is also reduced.  That said, allocation time varies so
      significantly that the reductions are within noise.
      
      Max allocation time is reduced by a significant amount for low-order
      allocations but reduced for THP allocations which presumably are now
      breaking before reclaim has done enough work.
      
        SysBench Highorder Allocation Latency Statistics
                       fix-infinite     break-early
        1 :: Count            15745           15677
        1 :: Min             0.4250          0.4550
        1 :: Mean            1.1023          1.0810
        1 :: Max            14.4590         10.8220
        1 :: Stddev          0.5117          0.5100
        2 :: Count                1               1
        2 :: Min             3.0040          2.1530
        2 :: Mean            3.0040          2.1530
        2 :: Max             3.0040          2.1530
        2 :: Stddev          0.0000          0.0000
        9 :: Count             2017            1931
        9 :: Min             0.4980          0.7480
        9 :: Mean           10.4717         10.3840
        9 :: Max            24.9460         26.2500
        9 :: Stddev          1.1726          1.1966
      
      Again, mean time for order-1 allocations is reduced while order-2
      allocations are too few to draw conclusions from.  The mean time for THP
      allocations is also slightly reduced albeit the reductions are within
      varianes.
      
      Once again, our maximum allocation time is significantly reduced for
      low-order allocations and slightly increased for THP allocations.
      
        Anon stream mmap reference Highorder Allocation Latency Statistics
        1 :: Count             1376            1790
        1 :: Min             0.4940          0.5010
        1 :: Mean            1.0289          0.9732
        1 :: Max             6.2670          4.2540
        1 :: Stddev          0.4142          0.2785
        2 :: Count                1               -
        2 :: Min             1.9060               -
        2 :: Mean            1.9060               -
        2 :: Max             1.9060               -
        2 :: Stddev          0.0000               -
        9 :: Count            11266           11257
        9 :: Min             0.4990          0.4940
        9 :: Mean        27250.4669      24256.1919
        9 :: Max      11439211.0000    6008885.0000
        9 :: Stddev     226427.4624     186298.1430
      
      This benchmark creates one thread per CPU which references an amount of
      anonymous memory 1.5 times the size of physical RAM.  This pounds swap
      quite heavily and is intended to exercise THP a bit.
      
      Mean allocation time for order-1 is reduced as before.  It's also reduced
      for THP allocations but the variations here are pretty massive due to
      swap.  As before, maximum allocation times are significantly reduced.
      
      Overall, the patch reduces the mean and maximum allocation latencies for
      the smaller high-order allocations.  This was with Slab configured so it
      would be expected to be more significant with Slub which uses these size
      allocations more aggressively.
      
      The mean allocation times for THP allocations are also slightly reduced.
      The maximum latency was slightly increased as predicted by the comments
      due to reclaim/compaction breaking early.  However, workloads care more
      about the latency of lower-order allocations than THP so it's an
      acceptable trade-off.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Kent Overstreet <kent.overstreet@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2876592f
  10. 12 2月, 2011 1 次提交
  11. 26 1月, 2011 1 次提交
    • D
      mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      David Rientjes 提交于
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate from given
      the type requested, and whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f33261d7
  12. 21 1月, 2011 1 次提交
  13. 18 1月, 2011 1 次提交
  14. 14 1月, 2011 17 次提交
    • S
      mm: batch activate_page() to reduce lock contention · 744ed144
      Shaohua Li 提交于
      The zone->lru_lock is heavily contented in workload where activate_page()
      is frequently used.  We could do batch activate_page() to reduce the lock
      contention.  The batched pages will be added into zone list when the pool
      is full or page reclaim is trying to drain them.
      
      For example, in a 4 socket 64 CPU system, create a sparse file and 64
      processes, processes shared map to the file.  Each process read access the
      whole file and then exit.  The process exit will do unmap_vmas() and cause
      a lot of activate_page() call.  In such workload, we saw about 58% total
      time reduction with below patch.  Other workloads with a lot of
      activate_page also benefits a lot too.
      
      I tested some microbenchmarks:
      case-anon-cow-rand-mt		0.58%
      case-anon-cow-rand		-3.30%
      case-anon-cow-seq-mt		-0.51%
      case-anon-cow-seq		-5.68%
      case-anon-r-rand-mt		0.23%
      case-anon-r-rand		0.81%
      case-anon-r-seq-mt		-0.71%
      case-anon-r-seq			-1.99%
      case-anon-rx-rand-mt		2.11%
      case-anon-rx-seq-mt		3.46%
      case-anon-w-rand-mt		-0.03%
      case-anon-w-rand		-0.50%
      case-anon-w-seq-mt		-1.08%
      case-anon-w-seq			-0.12%
      case-anon-wx-rand-mt		-5.02%
      case-anon-wx-seq-mt		-1.43%
      case-fork			1.65%
      case-fork-sleep			-0.07%
      case-fork-withmem		1.39%
      case-hugetlb			-0.59%
      case-lru-file-mmap-read-mt	-0.54%
      case-lru-file-mmap-read		0.61%
      case-lru-file-mmap-read-rand	-2.24%
      case-lru-file-readonce		-0.64%
      case-lru-file-readtwice		-11.69%
      case-lru-memcg			-1.35%
      case-mmap-pread-rand-mt		1.88%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq-mt		0.89%
      case-mmap-pread-seq		-69.72%
      case-mmap-xread-rand-mt		0.71%
      case-mmap-xread-seq-mt		0.38%
      
      The most significent are:
      case-lru-file-readtwice		-11.69%
      case-mmap-pread-rand		-15.26%
      case-mmap-pread-seq		-69.72%
      
      which use activate_page a lot.  others are basically variations because
      each run has slightly difference.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      744ed144
    • R
      thp: scale nr_rotated to balance memory pressure · 9992af10
      Rik van Riel 提交于
      Make sure we scale up nr_rotated when we encounter a referenced
      transparent huge page.  This ensures pageout scanning balance is not
      distorted when there are huge pages on the LRU.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9992af10
    • R
      thp: fix anon memory statistics with transparent hugepages · 2c888cfb
      Rik van Riel 提交于
      Count each transparent hugepage as HPAGE_PMD_NR pages in the LRU
      statistics, so the Active(anon) and Inactive(anon) statistics in
      /proc/meminfo are correct.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c888cfb
    • A
      thp: use compaction in kswapd for GFP_ATOMIC order > 0 · 5a03b051
      Andrea Arcangeli 提交于
      This takes advantage of memory compaction to properly generate pages of
      order > 0 if regular page reclaim fails and priority level becomes more
      severe and we don't reach the proper watermarks.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a03b051
    • M
      mm: kswapd: use the classzone idx that kswapd was using for sleeping_prematurely() · dc83edd9
      Mel Gorman 提交于
      When kswapd is woken up for a high-order allocation, it takes account of
      the highest usable zone by the caller (the classzone idx).  During
      allocation, this index is used to select the lowmem_reserve[] that should
      be applied to the watermark calculation in zone_watermark_ok().
      
      When balancing a node, kswapd considers the highest unbalanced zone to be
      the classzone index.  This will always be at least be the callers
      classzone_idx and can be higher.  However, sleeping_prematurely() always
      considers the lowest zone (e.g.  ZONE_DMA) to be the classzone index.
      This means that sleeping_prematurely() can consider a zone to be balanced
      that is unusable by the allocation request that originally woke kswapd.
      This patch changes sleeping_prematurely() to use a classzone_idx matching
      the value it used in balance_pgdat().
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc83edd9
    • M
      mm: kswapd: treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() · 355b09c4
      Mel Gorman 提交于
      After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
      be balanced but sleeping_prematurely does not.  This can force kswapd to
      stay awake longer than it should.  This patch fixes it.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      355b09c4
    • M
      mm: kswapd: reset kswapd_max_order and classzone_idx after reading · 4d40502e
      Mel Gorman 提交于
      When kswapd wakes up, it reads its order and classzone from pgdat and
      calls balance_pgdat.  While its awake, it potentially reclaimes at a high
      order and a low classzone index.  This might have been a once-off that was
      not required by subsequent callers.  However, because the pgdat values
      were not reset, they remain artifically high while balance_pgdat() is
      running and potentially kswapd enters a second unnecessary reclaim cycle.
      Reset the pgdat order and classzone index after reading.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d40502e
    • M
      mm: kswapd: use the order that kswapd was reclaiming at for sleeping_prematurely() · 0abdee2b
      Mel Gorman 提交于
      Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
      there was a race pushing a zone below its watermark.  If the race
      happened, it stays awake.  However, balance_pgdat() can decide to reclaim
      at order-0 if it decides that high-order reclaim is not working as
      expected.  This information is not passed back to sleeping_prematurely().
      The impact is that kswapd remains awake reclaiming pages long after it
      should have gone to sleep.  This patch passes the adjusted order to
      sleeping_prematurely and uses the same logic as balance_pgdat to decide if
      it's ok to go to sleep.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0abdee2b
    • M
      mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced · 1741c877
      Mel Gorman 提交于
      When reclaiming for high-orders, kswapd is responsible for balancing a
      node but it should not reclaim excessively.  It avoids excessive reclaim
      by considering if any zone in a node is balanced then the node is
      balanced.  In the cases where there are imbalanced zone sizes (e.g.
      ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep
      prematurely as just one small zone was balanced.
      
      This alters the sleep logic of kswapd slightly.  It counts the number of
      pages that make up the balanced zones.  If the total number of balanced
      pages is more than a quarter of the zone, kswapd will go back to sleep.
      This should keep a node balanced without reclaiming an excessive number of
      pages.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1741c877
    • M
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman 提交于
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevaling
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artifically awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
       Second, I previous specified slub_min_order=3 again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      Tgis patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99504748
    • M
      mm: vmscan: rename lumpy_mode to reclaim_mode · f3a310bc
      Mel Gorman 提交于
      With compaction being used instead of lumpy reclaim, the name lumpy_mode
      and associated variables is a bit misleading.  Rename lumpy_mode to
      reclaim_mode which is a better fit.  There is no functional change.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3a310bc
    • M
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman 提交于
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path
      
      Migration synchronously waits for writeback if the initial passes fails.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
    • M
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman 提交于
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim whe compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • M
      mm: vmscan: convert lumpy_mode into a bitmask · ee64fc93
      Mel Gorman 提交于
      Currently lumpy_mode is an enum and determines if lumpy reclaim is off,
      syncronous or asyncronous.  In preparation for using compaction instead of
      lumpy reclaim, this patch converts the flags into a bitmap.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee64fc93
    • K
      vmscan: factor out kswapd sleeping logic from kswapd() · f0bc0a60
      KOSAKI Motohiro 提交于
      Currently, kswapd() has deep nesting and is slightly hard to read.  Clean
      this up.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0bc0a60
    • M
      mm: vmstat: use a single setter function and callback for adjusting percpu thresholds · b44129b3
      Mel Gorman 提交于
      reduce_pgdat_percpu_threshold() and restore_pgdat_percpu_threshold() exist
      to adjust the per-cpu vmstat thresholds while kswapd is awake to avoid
      errors due to counter drift.  The functions duplicate some code so this
      patch replaces them with a single set_pgdat_percpu_threshold() that takes
      a callback function to calculate the desired threshold as a parameter.
      
      [akpm@linux-foundation.org: readability tweak]
      [kosaki.motohiro@jp.fujitsu.com: set_pgdat_percpu_threshold(): don't use for_each_online_cpu]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b44129b3
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman 提交于
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reported-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f5acf8