1. 09 Jan, 2009 (9 commits)
  2. 07 Jan, 2009 (13 commits)
    • mm: stop kswapd's infinite loop at high order allocation · 73ce02e9
      Committed by KOSAKI Motohiro
      Wassim Dagash reported the following kswapd infinite-loop problem:
      
        kswapd runs in some infinite loop trying to swap until order 10 of zone
        highmem is OK.... kswapd will continue to try to balance order 10 of zone
        highmem forever (or until someone release a very large chunk of highmem).
      
      For non order-0 allocations, the system may never be balanced due to
      fragmentation but kswapd should not infinitely loop as a result.
      
      Instead, recheck all watermarks at order-0, as they are the most
      important ones.  If the order-0 watermarks are OK, kswapd goes back to
      sleep (a sketch of this fallback follows this entry).
      
      [akpm@linux-foundation.org: fix comment]
      Reported-by: wassim dagash <wassim.dagash@gmail.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73ce02e9
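      A minimal userspace sketch of the fallback described above (the zone
      structure and watermark check are simplified stand-ins, not the actual
      balance_pgdat() code):

        #include <stdbool.h>
        #include <stdio.h>

        struct zone { long free_pages; long min_wmark; };

        /* Simplified stand-in for zone_watermark_ok() at order 0. */
        static bool order0_watermark_ok(const struct zone *z)
        {
            return z->free_pages >= z->min_wmark;
        }

        /* If a high-order request cannot be balanced, recheck only order 0
         * and let kswapd sleep instead of looping on the failing order. */
        static bool kswapd_may_sleep(const struct zone *zones, int nr_zones)
        {
            for (int i = 0; i < nr_zones; i++)
                if (!order0_watermark_ok(&zones[i]))
                    return false;   /* still short of order-0 pages */
            return true;            /* order-0 is fine: go back to sleep */
        }

        int main(void)
        {
            struct zone zones[] = { { 4096, 128 }, { 512, 128 } };
            printf("kswapd may sleep: %d\n", kswapd_may_sleep(zones, 2));
            return 0;
        }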
    • vmscan: shrink_active_list(): reduce lru_lock hold time · b555749a
      Committed by Andrew Morton
      These three statements manipulate local variables and do not need to be
      covered by the lock (see the sketch after this entry).
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b555749a
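      A generic sketch of the change (not the kernel code): statements that
      only touch local variables can be hoisted out of the locked region so
      the lock is held only for the shared-state update.

        #include <pthread.h>
        #include <stdio.h>

        static pthread_spinlock_t lru_lock;
        static long shared_counter;

        /* Do the purely local arithmetic before taking the lock, so the
         * lock hold time covers only the shared update. */
        static void update(const long *batch, int n)
        {
            long local_total = 0;

            for (int i = 0; i < n; i++)      /* local work, no lock needed */
                local_total += batch[i];

            pthread_spin_lock(&lru_lock);
            shared_counter += local_total;   /* shared work, lock held briefly */
            pthread_spin_unlock(&lru_lock);
        }

        int main(void)
        {
            long batch[] = { 1, 2, 3 };
            pthread_spin_init(&lru_lock, PTHREAD_PROCESS_PRIVATE);
            update(batch, 3);
            printf("%ld\n", shared_counter);
            return 0;
        }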
    • mm: kill zone_is_near_oom() · 09f445e7
      Committed by KOSAKI Motohiro
      zone_is_near_oom() is unused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      09f445e7
    • vmscan: improve reclaim throughput to bail out patch · 01dbe5c9
      Committed by KOSAKI Motohiro
      The vmscan bail-out patch moved the nr_reclaimed variable into struct
      scan_control.  Unfortunately, the indirect access can easily cause
      cache misses.
      
      Under heavy memory pressure that is fine: cache misses are already
      plentiful, so the extra ones are not observable.
      
      But under light memory pressure the performance degradation is
      observable.
      I compared the following three workloads (each measured 10 times):
      
      hackbench 125 process 3000
      hackbench 130 process 3000
      hackbench 135 process 3000
      
                  2.6.28-rc6                       bail-out
      
      	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		93.414	73.254	193.382
      	74.145	78.295	77.27		74.897	75.021	80.17
      	70.305	77.643	75.855		70.134	77.571	79.896
      	74.288	73.986	75.955		77.222	78.48	80.619
      	72.029	79.947	78.312		75.128	82.172	79.708
      	71.499	77.615	77.042		74.177	76.532	77.306
      	76.188	74.471	83.562		73.839	72.43	79.833
      	73.236	75.606	78.743		76.001	76.557	82.726
      	69.427	77.271	76.691		76.236	79.371	103.189
      	72.473	76.978	80.643		69.128	78.932	75.736
      
      avg	72.545	76.767	78.534		76.017	77.03	93.256
      std	1.89	1.71	2.41		6.29	2.79	34.16
      min	69.427	73.986	75.855		69.128	72.43	75.736
      max	76.188	79.947	83.562		93.414	82.172	193.382
      
      That is about a 4-5% degradation.
      
      This patch therefore introduces a temporary local variable (a sketch of
      the idiom follows this entry).
      
      result:
      
                  2.6.28-rc6                       this patch
      
      num	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		67.302	68.269	77.161
      	74.145	78.295	77.27   	72.616	72.712	79.06
      	70.305	77.643	75.855  	72.475	75.712	77.735
      	74.288	73.986	75.955  	69.229	73.062	78.814
      	72.029	79.947	78.312  	71.551	74.392	78.564
      	71.499	77.615	77.042  	69.227	74.31	78.837
      	76.188	74.471	83.562  	70.759	75.256	76.6
      	73.236	75.606	78.743  	69.966	76.001	78.464
      	69.427	77.271	76.691  	69.068	75.218	80.321
      	72.473	76.978	80.643  	72.057	77.151	79.068
      
      avg	72.545	76.767	78.534 		70.425	74.2083	78.462
      std 	1.89	1.71	2.41    	1.66	2.34	1.00
      min 	69.427	73.986	75.855  	67.302	68.269	76.6
      max 	76.188	79.947	83.562  	72.616	77.151	80.321
      
      OK, the degradation has disappeared.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01dbe5c9
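      A generic sketch of the idiom (names simplified, not the actual
      shrink_zone() code): accumulate into a stack local inside the hot loop
      and write the total back to the control structure once at the end,
      instead of bumping the structure field on every iteration.

        #include <stdio.h>

        struct scan_control { unsigned long nr_reclaimed; };

        /* Accumulate into a local and write back once, avoiding repeated
         * loads/stores through the scan_control pointer in the hot loop. */
        static void shrink_lists(struct scan_control *sc,
                                 const unsigned long *freed, int n)
        {
            unsigned long nr_reclaimed = sc->nr_reclaimed;

            for (int i = 0; i < n; i++)
                nr_reclaimed += freed[i];

            sc->nr_reclaimed = nr_reclaimed;   /* single write-back */
        }

        int main(void)
        {
            struct scan_control sc = { 0 };
            unsigned long freed[] = { 32, 32, 17 };
            shrink_lists(&sc, freed, 3);
            printf("reclaimed %lu pages\n", sc.nr_reclaimed);
            return 0;
        }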
    • vmscan: bail out of direct reclaim after swap_cluster_max pages · a79311c1
      Committed by Rik van Riel
      When the VM is under pressure, it can happen that several direct reclaim
      processes are in the pageout code simultaneously.  It also happens that
      the reclaiming processes run into mostly referenced, mapped and dirty
      pages in the first round.
      
      This results in multiple direct reclaim processes having a lower
      pageout priority, which corresponds to a higher target of pages to
      scan.
      
      This in turn can result in each direct reclaim process freeing
      many pages.  Together, they can end up freeing way too many pages.
      
      This kicks useful data out of memory (in some cases more than half
      of all memory is swapped out).  It also impacts performance by
      keeping tasks stuck in the pageout code for too long.
      
      A 30% improvement in hackbench has been observed with this patch.
      
      The fix is relatively simple: in shrink_zone() we can check how many
      pages we have already freed; direct reclaim tasks break out of the
      scanning loop once they have freed enough pages and have dropped to a
      lower priority level (a sketch of the check follows this entry).
      
      We do not break out of shrink_zone() when priority == DEF_PRIORITY,
      to ensure that equal pressure is applied to every zone in the common
      case.
      
      However, in order to do this we do need to know how many pages we already
      freed, so move nr_reclaimed into scan_control.
      
      akpm: a historical interlude...
      
      We tried this in 2004:
      
      :commit e468e46a9bea3297011d5918663ce6d19094cf87
      :Author: akpm <akpm>
      :Date:   Thu Jun 24 15:53:52 2004 +0000
      :
      :[PATCH] vmscan.c: dont reclaim too many pages
      :
      :    The shrink_zone() logic can, under some circumstances, cause far too many
      :    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
      :    a large number of reclaimable pages on the LRU.
      :    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
      
      And we reverted it in 2006:
      
      :commit 210fe530
      :Author: Andrew Morton <akpm@osdl.org>
      :Date:   Fri Jan 6 00:11:14 2006 -0800
      :
      :    [PATCH] vmscan: balancing fix
      :
      :    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
      :
      :      The shrink_zone() logic can, under some circumstances, cause far too many
      :      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
      :      hit a large number of reclaimable pages on the LRU.
      :
      :      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
      :      reclaimed.
      :
      :    Problem is, this change caused significant imbalance in inter-zone scan
      :    balancing by truncating scans of larger zones.
      :
      :    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
      :    balancing algorithm would require that if we're scanning 100 pages of
      :    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
      :    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
      :    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
      :    harder than large ones.
      :
      :    Now I need to remember what the workload was which caused me to write this
      :    patch originally, then fix it up in a different way...
      
      And we haven't demonstrated that whatever problem caused that reversion is
      not being reintroduced by this change in 2008.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a79311c1
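      A simplified sketch of the bail-out check (the constant and field names
      follow the commit message, but this is illustrative, not the actual
      shrink_zone()): direct reclaimers stop scanning once they have reclaimed
      a cluster's worth of pages and are no longer at the default priority,
      so the first pass still applies equal pressure to every zone.

        #include <stdbool.h>
        #include <stdio.h>

        #define SWAP_CLUSTER_MAX 32UL
        #define DEF_PRIORITY     12

        struct scan_control {
            unsigned long nr_reclaimed;
            bool is_kswapd;   /* kswapd keeps balancing; only direct reclaim bails */
        };

        /* Return true when a direct reclaimer may stop scanning the zone. */
        static bool should_bail(const struct scan_control *sc, int priority)
        {
            if (sc->is_kswapd)
                return false;
            if (priority == DEF_PRIORITY)   /* keep inter-zone pressure balanced */
                return false;
            return sc->nr_reclaimed > SWAP_CLUSTER_MAX;
        }

        int main(void)
        {
            struct scan_control sc = { .nr_reclaimed = 40, .is_kswapd = false };
            printf("bail at priority 12: %d\n", should_bail(&sc, DEF_PRIORITY));
            printf("bail at priority 11: %d\n", should_bail(&sc, 11));
            return 0;
        }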
    • mm: make scan_zone_unevictable_pages() static · 14b90b22
      Committed by KOSAKI Motohiro
      sparse outputs the following warning:
      
      	mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      14b90b22
    • mm: make scan_all_zones_unevictable_pages() static · ff30153b
      Committed by KOSAKI Motohiro
      sparse outputs the following warning:
      
      	mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff30153b
    • memcg: reclaim shouldn't change zone->recent_rotated statistics · 077cbc58
      Committed by KOSAKI Motohiro
      memcg reclaim shouldn't change zone->recent_rotated statistics.  If
      memcgroup reclaim changes zone statistics, global reclaim can get a bit
      confused.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      077cbc58
    • mm: optimize get_scan_ratio for no swap · b962716b
      Committed by Hugh Dickins
      Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the
      gcc optimizer gives us that when nr_swap_pages is #defined as 0L (a
      sketch of the trick follows this entry).  Move the usual declaration to
      swapfile.c: it never belonged in page_alloc.c.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b962716b
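      A self-contained illustration of the trick (simplified; the real
      declaration lives in the kernel headers and swapfile.c): with
      nr_swap_pages #defined to a compile-time 0L for !CONFIG_SWAP builds,
      the swap branch becomes dead code and the optimizer removes it, so
      get_scan_ratio() needs no explicit #ifdef.

        #include <stdio.h>

        #define CONFIG_SWAP 0

        #if CONFIG_SWAP
        extern long nr_swap_pages;    /* real counter lives in swapfile.c */
        #else
        #define nr_swap_pages 0L      /* constant: the compiler folds the branch */
        #endif

        /* Toy stand-in for get_scan_ratio(): percent[0] = anon, percent[1] = file. */
        static void get_scan_ratio(unsigned int percent[2])
        {
            if (nr_swap_pages <= 0) {   /* always true without swap */
                percent[0] = 0;
                percent[1] = 100;
                return;
            }
            percent[0] = 50;            /* placeholder policy for the swap case */
            percent[1] = 50;
        }

        int main(void)
        {
            unsigned int percent[2];
            get_scan_ratio(percent);
            printf("anon %u%%, file %u%%\n", percent[0], percent[1]);
            return 0;
        }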
    • mm: add add_to_swap stub · 60371d97
      Committed by Hugh Dickins
      If we add a failing stub for add_to_swap() (a sketch of such a stub
      follows this entry), then we can remove the #ifdef CONFIG_SWAP from
      mm/vmscan.c.
      
      This was intended as a source cleanup, but looking more closely, it turns
      out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
      page, whereas now it goes to the more suitable activate_locked, like the
      CONFIG_SWAP nr_swap_pages 0 case.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60371d97
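      A sketch of the pattern (the stub's real home and exact signature are in
      the kernel's swap header; this version is simplified): provide a
      !CONFIG_SWAP add_to_swap() that always fails, so callers such as
      shrink_page_list() need no #ifdef and simply take their no-swap path.

        #include <stdio.h>

        struct page { int dummy; };

        #define CONFIG_SWAP 0

        #if CONFIG_SWAP
        int add_to_swap(struct page *page);   /* real implementation */
        #else
        /* Failing stub: without swap, allocating swap space always "fails". */
        static inline int add_to_swap(struct page *page)
        {
            (void)page;
            return 0;
        }
        #endif

        int main(void)
        {
            struct page p = { 0 };
            if (!add_to_swap(&p))
                printf("no swap available, keeping the page in memory\n");
            return 0;
        }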
    • mm: remove gfp_mask from add_to_swap · ac47b003
      Committed by Hugh Dickins
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ac47b003
    • mm: remove try_to_munlock from vmscan · 63d6c5ad
      Committed by Hugh Dickins
      An unfortunate feature of the Unevictable LRU work was that reclaiming an
      anonymous page involved an extra scan through the anon_vma: to check that
      the page is evictable before allocating swap, because the swap could not
      be freed reliably soon afterwards.
      
      Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
      not an issue any more: remove try_to_munlock() call from
      shrink_page_list(), leaving it to try_to_unmap() to discover if the page
      is one to be culled to the unevictable list - in which case then
      try_to_free_swap().
      
      Update unevictable-lru.txt to remove comments on the try_to_munlock() in
      shrink_page_list(), and shorten some lines over 80 columns.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63d6c5ad
    • mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Committed by Hugh Dickins
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount() (a sketch of that test follows this entry).
      
      Similarly, remove the page_count limitation from free_swap_and_cache(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2c43eed
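      A simplified sketch of the new test (illustrative only; the real
      try_to_free_swap() works on struct page and the swap cache): free the
      swap as soon as nothing else references the swap entry, regardless of
      page_count or page_mapcount, provided no writeback is in progress.

        #include <stdbool.h>
        #include <stdio.h>

        struct page {
            int  swap_count;   /* references to the swap entry, not the page */
            bool writeback;
        };

        /* Sketch of try_to_free_swap(): only swap references and writeback
         * matter; the page's own refcount and mapcount are irrelevant. */
        static bool try_to_free_swap(struct page *page)
        {
            if (page->writeback)
                return false;
            if (page->swap_count != 0)   /* page_swapcount(page) in the kernel */
                return false;
            /* delete_from_swap_cache(page) would go here */
            return true;
        }

        int main(void)
        {
            struct page p = { .swap_count = 0, .writeback = false };
            printf("freed swap: %d\n", try_to_free_swap(&p));
            return 0;
        }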
  3. 01 Jan, 2009 (2 commits)
  4. 01 Dec, 2008 (1 commit)
  5. 20 Nov, 2008 (2 commits)
  6. 16 Nov, 2008 (1 commit)
  7. 20 Oct, 2008 (12 commits)
    • mm: unlockless reclaim · a978d6f5
      Committed by Nick Piggin
      unlock_page() is fairly expensive.  It can be avoided in the page
      reclaim success path.  By definition, if we have any other references to
      the page it would be a bug anyway.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a978d6f5
    • vmscan: don't accumulate scan pressure on unrelated lists · e0f79b8f
      Committed by Johannes Weiner
      During each reclaim scan we accumulate scan pressure on unrelated lists
      which will result in bogus scans and unwanted reclaims eventually.
      
      Scanning lists with few reclaim candidates results in a lot of rotation
      and therefore also disturbs the list balancing, putting even more
      pressure on the wrong lists.
      
      In a test case with a lot of streaming IO, and therefore a crowded
      inactive file page list, swapping started because
      
        a) anon pages were reclaimed after swap_cluster_max reclaim
        invocations -- nr_scan of this list has just accumulated
      
        b) active file pages were scanned because *their* nr_scan has also
        accumulated through the same logic.  And this in return created a
        lot of rotation for file pages and resulted in a decrease of file
        list priority, again increasing the pressure on anon pages.
      
      The result was an evicted working set of anon pages while there were
      tons of inactive file pages that should have been taken instead.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e0f79b8f
    • vmscan: unevictable LRU scan sysctl · af936a16
      Committed by Lee Schermerhorn
      This patch adds a function to scan individual or all zones' unevictable
      lists and move any pages that have become evictable onto the respective
      zone's inactive list, where shrink_inactive_list() will deal with them.
      
      Adds a sysctl to scan all nodes, and per-node attributes to individual
      nodes' zones.
      
      Kosaki: if an evictable page is found on an unevictable LRU when writing
      to /proc/sys/vm/scan_unevictable_pages, print the filename and file
      offset of that page.
      
      [akpm@linux-foundation.org: fix one CONFIG_MMU=n build error]
      [kosaki.motohiro@jp.fujitsu.com: adapt vmscan-unevictable-lru-scan-sysctl.patch to new sysfs API]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af936a16
    • mlock: mlocked pages are unevictable · b291f000
      Committed by Nick Piggin
      Make sure that mlocked pages also live on the unevictable LRU, so kswapd
      will not scan them over and over again.
      
      This is achieved through various strategies:
      
      1) add yet another page flag--PG_mlocked--to indicate that
         the page is locked for efficient testing in vmscan and,
         optionally, fault path.  This allows early culling of
         unevictable pages, preventing them from getting to
         page_referenced()/try_to_unmap().  Also allows separate
         accounting of mlock'd pages, as Nick's original patch
         did.
      
         Note:  Nick's original mlock patch used a PG_mlocked
         flag.  I had removed this in favor of the PG_unevictable
         flag + an mlock_count [new page struct member].  I
         restored the PG_mlocked flag to eliminate the new
         count field.
      
      2) add the mlock/unevictable infrastructure to mm/mlock.c,
         with internal APIs in mm/internal.h.  This is a rework
         of Nick's original patch to these files, taking into
         account that mlocked pages are now kept on unevictable
         LRU list.
      
      3) update vmscan.c:page_evictable() to check PageMlocked()
         and, if a vma is passed in, the vm_flags.  Note that the vma
         will only be passed in for new pages in the fault path;
         and then only if the "cull unevictable pages in fault
         path" patch is included.  (A sketch of this check follows
         this entry.)
      
      4) add try_to_unlock() to rmap.c to walk a page's rmap and
         ClearPageMlocked() if no other vmas have it mlocked.
         Reuses as much of try_to_unmap() as possible.  This
         effectively replaces the use of one of the lru list links
         as an mlock count.  If this mechanism lets pages in mlocked
         vmas leak through w/o PG_mlocked set [I don't know that it
         does], we should catch them later in try_to_unmap().  One
         hopes this will be rare, as it will be relatively expensive.
      
      Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      
      splitlru: introduce __get_user_pages():
      
        The new munlock processing needs GUP_FLAGS_IGNORE_VMA_PERMISSIONS,
        because the current get_user_pages() can't grab PROT_NONE pages and
        therefore PROT_NONE pages can't be munlocked.
      
      [akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
      [akpm@linux-foundation.org: untangle patch interdependencies]
      [akpm@linux-foundation.org: fix things after out-of-order merging]
      [hugh@veritas.com: fix page-flags mess]
      [lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
      [kosaki.motohiro@jp.fujitsu.com: build fix]
      [kosaki.motohiro@jp.fujitsu.com: fix truncate race and several comments]
      [kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b291f000
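      A rough sketch of the evictability test from point 3 (the flag value and
      structures are illustrative stand-ins, not the kernel's struct page and
      vm_area_struct): a page is unevictable if it is marked mlocked, or if
      the VMA it is being faulted into is VM_LOCKED.

        #include <stdbool.h>
        #include <stdio.h>

        #define VM_LOCKED 0x2000u   /* illustrative flag value */

        struct page { bool mlocked; };
        struct vm_area_struct { unsigned int vm_flags; };

        /* Sketch of page_evictable(page, vma): cull mlocked pages early so
         * they never reach page_referenced()/try_to_unmap() during reclaim. */
        static bool page_evictable(const struct page *page,
                                   const struct vm_area_struct *vma)
        {
            if (page->mlocked)
                return false;
            if (vma && (vma->vm_flags & VM_LOCKED))   /* fault-path culling */
                return false;
            return true;
        }

        int main(void)
        {
            struct page p = { .mlocked = true };
            printf("evictable: %d\n", page_evictable(&p, NULL));
            return 0;
        }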
    • SHM_LOCKED pages are unevictable · 89e004ea
      Committed by Lee Schermerhorn
      Shmem segments locked into memory via shmctl(SHM_LOCKED) should not be
      kept on the normal LRU, since scanning them is a waste of time and might
      throw off kswapd's balancing algorithms.  Place them on the unevictable
      LRU list instead.
      
      Use the AS_UNEVICTABLE flag to mark address_space of SHM_LOCKed shared
      memory regions as unevictable.  Then these pages will be culled off the
      normal LRU lists during vmscan.
      
      Add a new wrapper function to clear the mapping's unevictable state
      when/if the shared memory segment is munlocked (the wrappers are
      sketched after this entry).
      
      Add 'scan_mapping_unevictable_page()' to mm/vmscan.c to scan all pages in
      the shmem segment's mapping [struct address_space] for evictability now
      that they're no longer locked.  If so, move them to the appropriate zone
      lru list.
      
      Changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [kosaki.motohiro@jp.fujitsu.com: revert shm change]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e004ea
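      A sketch of the address_space flag and wrappers described above (the bit
      position and structure are simplified; the real helpers manipulate the
      mapping's flags word in the kernel): SHM_LOCK sets the flag, SHM_UNLOCK
      clears it and is followed by the rescan that drifts pages back to the
      normal LRUs.

        #include <stdbool.h>
        #include <stdio.h>

        #define AS_UNEVICTABLE (1u << 0)   /* illustrative bit position */

        struct address_space { unsigned int flags; };

        static void mapping_set_unevictable(struct address_space *mapping)
        {
            mapping->flags |= AS_UNEVICTABLE;    /* SHM_LOCK */
        }

        static void mapping_clear_unevictable(struct address_space *mapping)
        {
            mapping->flags &= ~AS_UNEVICTABLE;   /* SHM_UNLOCK, then rescan */
        }

        static bool mapping_unevictable(const struct address_space *mapping)
        {
            return mapping && (mapping->flags & AS_UNEVICTABLE);
        }

        int main(void)
        {
            struct address_space shm = { 0 };
            mapping_set_unevictable(&shm);
            printf("locked:   unevictable = %d\n", mapping_unevictable(&shm));
            mapping_clear_unevictable(&shm);
            printf("unlocked: unevictable = %d\n", mapping_unevictable(&shm));
            return 0;
        }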
    • Ramfs and Ram Disk pages are unevictable · ba9ddf49
      Committed by Lee Schermerhorn
      Christoph Lameter pointed out that ram disk pages also clutter the LRU
      lists.  When vmscan finds them dirty and tries to clean them, the ram disk
      writeback function just redirties the page so that it goes back onto the
      active list.  Round and round she goes...
      
      With the ram disk driver [rd.c] replaced by the newer 'brd.c', this is no
      longer the case, as ram disk pages are no longer maintained on the lru.
      [This makes them unmigratable for defrag or memory hot remove, but that
      can be addressed by a separate patch series.] However, the ramfs pages
      behave like ram disk pages used to, so:
      
      Define new address_space flag [shares address_space flags member with
      mapping's gfp mask] to indicate that the address space contains all
      unevictable pages.  This will provide for efficient testing of ramfs pages
      in page_evictable().
      
      Also provide wrapper functions to set/test the unevictable state to
      minimize #ifdefs in ramfs driver and any other users of this facility.
      
      Set the unevictable state on address_space structures for new ramfs
      inodes.  Test the unevictable state in page_evictable() to cull
      unevictable pages.
      
      These changes depend on [CONFIG_]UNEVICTABLE_LRU.
      
      [riel@redhat.com: undo the brd.c part]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Debugged-by: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba9ddf49
    • unevictable lru: add event counting with statistics · bbfd28ee
      Committed by Lee Schermerhorn
      Fix to unevictable-lru-page-statistics.patch
      
      Add unevictable lru infrastructure vm events to the statistics patch.
      Rename the "NORECL_" and "noreclaim_" symbols and text strings to
      "UNEVICTABLE_" and "unevictable_", respectively.
      
      Currently, both the infrastructure and the mlocked pages event are
      added by a single patch later in the series.  This makes it difficult
      to add or rework the incremental patches.  The events actually "belong"
      with the stats, so pull them up to here.
      
      Also, restore the event counting to putback_lru_page().  This was removed
      from previous patch in series where it was "misplaced".  The actual events
      weren't defined that early.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bbfd28ee
    • Unevictable LRU Infrastructure · 894bc310
      Committed by Lee Schermerhorn
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.  (A schematic sketch of this recheck follows this entry.)
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: Benjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      894bc310
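      A schematic sketch of the putback_lru_page() recheck described above
      (the data structures are toy stand-ins and the race is only hinted at in
      a comment): place the page on the list chosen from its current state,
      then recheck; if the state changed in the meantime, redo the putback so
      evictable pages are never stranded on the unevictable list.

        #include <stdbool.h>
        #include <stdio.h>

        enum lru { LRU_EVICTABLE, LRU_UNEVICTABLE };

        struct page {
            bool evictable;   /* may be flipped concurrently by mlock/munlock */
            enum lru lru;
        };

        static void putback_lru_page(struct page *page)
        {
            enum lru target;

            do {
                target = page->evictable ? LRU_EVICTABLE : LRU_UNEVICTABLE;
                page->lru = target;   /* add to that per-zone LRU list */
                /* a racing mlock/munlock may flip page->evictable here */
            } while ((page->evictable ? LRU_EVICTABLE : LRU_UNEVICTABLE) != target);
            /* only now is it safe to drop our reference to the page */
        }

        int main(void)
        {
            struct page p = { .evictable = false, .lru = LRU_EVICTABLE };
            putback_lru_page(&p);
            printf("page is on the %s list\n",
                   p.lru == LRU_UNEVICTABLE ? "unevictable" : "evictable");
            return 0;
        }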
    • more aggressively use lumpy reclaim · 33c120ed
      Committed by Rik van Riel
      During an AIM7 run on a 16GB system, fork started failing around 32000
      threads, despite the system having plenty of free swap and 15GB of
      pageable memory.  This was on x86-64, so 8k stacks.
      
      If a higher order allocation fails, we can either:
      - keep evicting pages off the end of the LRUs and hope that
        we eventually create a contiguous region; this is somewhat
        unlikely if the system is under enough stress by new
        allocations
      - after trying normal eviction for a bit, use lumpy reclaim
      
      This patch switches the system to lumpy reclaim if the VM is having
      trouble freeing enough pages, using the same threshold for detection as
      is used for the pageout congestion wait (a sketch of the trigger follows
      this entry).
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33c120ed
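      A hedged sketch of the trigger (the patch reuses the pageout congestion
      wait threshold; the constants below are illustrative, not the exact
      values): order-0 reclaim never goes lumpy, while higher-order reclaim
      switches over once normal eviction has failed for a couple of priority
      levels.

        #include <stdbool.h>
        #include <stdio.h>

        #define DEF_PRIORITY 12

        static bool use_lumpy_reclaim(int order, int priority)
        {
            if (order == 0)
                return false;                    /* plain reclaim is enough */
            return priority < DEF_PRIORITY - 2;  /* normal eviction is struggling */
        }

        int main(void)
        {
            printf("order 3, priority 12: %d\n", use_lumpy_reclaim(3, DEF_PRIORITY));
            printf("order 3, priority 9:  %d\n", use_lumpy_reclaim(3, 9));
            return 0;
        }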
    • vmscan: fix pagecache reclaim referenced bit check · 7e9cd484
      Committed by Rik van Riel
      Moving referenced pages back to the head of the active list creates a huge
      scalability problem, because by the time a large memory system finally
      runs out of free memory, every single page in the system will have been
      referenced.
      
      Not only do we not have the time to scan every single page on the active
      list, but since they will all have the referenced bit set, that bit
      conveys no useful information.
      
      A more scalable solution is to just move every page that hits the end of
      the active list to the inactive list.
      
      We clear the referenced bit off of mapped pages, which need just one
      reference to be moved back onto the active list.
      
      Unmapped pages will be moved back to the active list after two references
      (see mark_page_accessed).  We preserve the PG_referenced flag on unmapped
      pages to preserve accesses that were made while the page was on the active
      list.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e9cd484
    • vmscan: second chance replacement for anonymous pages · 556adecb
      Committed by Rik van Riel
      We avoid evicting and scanning anonymous pages for the most part, but
      under some workloads we can end up with most of memory filled with
      anonymous pages.  At that point, we suddenly need to clear the referenced
      bits on all of memory, which can take ages on very large memory systems.
      
      We can reduce the maximum number of pages that need to be scanned by not
      taking the referenced state into account when deactivating an anonymous
      page.  After all, every anonymous page starts out referenced, so why
      check?
      
      If an anonymous page gets referenced again before it reaches the end of
      the inactive list, we move it back to the active list.
      
      To keep the maximum amount of necessary work reasonable, we scale the
      active to inactive ratio with the size of memory, using the formula
      active:inactive ratio = sqrt(memory in GB * 10) (a worked example
      follows this entry).
      
      Kswapd CPU use now seems to scale by the amount of pageout bandwidth,
      instead of by the amount of memory present in the system.
      
      [kamezawa.hiroyu@jp.fujitsu.com: fix OOM with memcg]
      [kamezawa.hiroyu@jp.fujitsu.com: memcg: lru scan fix]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      556adecb
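      A small worked example of that formula (the integer square root helper
      below stands in for the kernel's int_sqrt(); this is not the actual zone
      initialization code): a 1 GB zone allows roughly a 3:1 active:inactive
      anon ratio, 16 GB about 12:1, and 256 GB about 50:1, so the inactive
      list stays a small, scannable fraction even on large machines.

        #include <stdio.h>

        /* Integer square root by linear search (stand-in for int_sqrt). */
        static unsigned long int_sqrt(unsigned long x)
        {
            unsigned long r = 0;
            while ((r + 1) * (r + 1) <= x)
                r++;
            return r;
        }

        /* active:inactive ratio = sqrt(memory in GB * 10), at least 1. */
        static unsigned long inactive_ratio(unsigned long gb)
        {
            unsigned long ratio = int_sqrt(10UL * gb);
            return ratio ? ratio : 1;
        }

        int main(void)
        {
            unsigned long sizes[] = { 1, 16, 64, 256 };
            for (int i = 0; i < 4; i++)
                printf("%4lu GB -> active:inactive = %lu:1\n",
                       sizes[i], inactive_ratio(sizes[i]));
            return 0;
        }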
    • vmscan: split LRU lists into anon & file sets · 4f98a2fe
      Committed by Rik van Riel
      Split the LRU lists in two, one set for pages that are backed by real file
      systems ("file") and one for pages that are backed by memory and swap
      ("anon").  The latter includes tmpfs.
      
      The advantage of doing this is that the VM will not have to scan over lots
      of anonymous pages (which we generally do not want to swap out), just to
      find the page cache pages that it should evict.
      
      This patch has the infrastructure and a basic policy to balance how much
      we scan the anon lists and how much we scan the file lists.  The big
      policy changes are in separate patches.  (The resulting per-zone list
      layout is sketched after this entry.)
      
      [lee.schermerhorn@hp.com: collect lru meminfo statistics from correct offset]
      [kosaki.motohiro@jp.fujitsu.com: prevent incorrect oom under split_lru]
      [kosaki.motohiro@jp.fujitsu.com: fix pagevec_move_tail() doesn't treat unevictable page]
      [hugh@veritas.com: memcg swapbacked pages active]
      [hugh@veritas.com: splitlru: BDI_CAP_SWAP_BACKED]
      [akpm@linux-foundation.org: fix /proc/vmstat units]
      [nishimura@mxp.nes.nec.co.jp: memcg: fix handling of shmem migration]
      [kosaki.motohiro@jp.fujitsu.com: adjust Quicklists field of /proc/meminfo]
      [kosaki.motohiro@jp.fujitsu.com: fix style issue of get_scan_ratio()]
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f98a2fe
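      The split described above can be pictured as the per-zone LRU set
      growing from two lists to four (a schematic enum; the unevictable list
      added by the patches above is omitted here, and the names are meant as
      illustration rather than the kernel's exact definitions):

        #include <stdio.h>

        /* Schematic per-zone LRU lists after the anon/file split. */
        enum lru_list {
            LRU_INACTIVE_ANON,   /* anonymous + tmpfs, not recently used */
            LRU_ACTIVE_ANON,     /* anonymous + tmpfs, recently used */
            LRU_INACTIVE_FILE,   /* file-backed page cache, not recently used */
            LRU_ACTIVE_FILE,     /* file-backed page cache, recently used */
            NR_LRU_LISTS
        };

        static int is_file_lru(enum lru_list lru)
        {
            return lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE;
        }

        int main(void)
        {
            printf("%d LRU lists; file lists start at index %d\n",
                   NR_LRU_LISTS, LRU_INACTIVE_FILE);
            printf("LRU_ACTIVE_FILE is file-backed: %d\n",
                   is_file_lru(LRU_ACTIVE_FILE));
            return 0;
        }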