1. 01 4月, 2009 8 次提交
  2. 15 3月, 2009 1 次提交
  3. 13 3月, 2009 1 次提交
  4. 22 2月, 2009 2 次提交
  5. 16 2月, 2009 1 次提交
    • I
      lockdep: annotate reclaim context (__GFP_NOFS), fix · 6700ec65
      Ingo Molnar 提交于
      Impact: fix build warning
      
      Fix:
      
        mm/vmscan.c: In function ‘kswapd’:
        mm/vmscan.c:1969: warning: ISO C90 forbids mixed declarations and code
      
      node_to_cpumask_ptr(cpumask, pgdat->node_id), has a side-effect: it
      defines the 'cpumask' local variable as well, so it has to go into
      the variable definition section.
      
      Sidenote: it might make sense to make this purpose of these macros
      more apparent, by naming them the standard way, such as:
      
        DEFINE_node_to_cpumask_ptr(cpumask, pgdat->node_id);
      
      (But that is outside the scope of this patch.)
      
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Mike Travis <travis@sgi.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6700ec65
  6. 15 2月, 2009 1 次提交
    • N
      lockdep: annotate reclaim context (__GFP_NOFS) · cf40bd16
      Nick Piggin 提交于
      Here is another version, with the incremental patch rolled up, and
      added reclaim context annotation to kswapd, and allocation tracing
      to slab allocators (which may only ever reach the page allocator
      in rare cases, so it is good to put annotations here too).
      
      Haven't tested this version as such, but it should be getting closer
      to merge worthy ;)
      
      --
      After noticing some code in mm/filemap.c accidentally perform a __GFP_FS
      allocation when it should not have been, I thought it might be a good idea to
      try to catch this kind of thing with lockdep.
      
      I coded up a little idea that seems to work. Unfortunately the system has to
      actually be in __GFP_FS page reclaim, then take the lock, before it will mark
      it. But at least that might still be some orders of magnitude more common
      (and more debuggable) than an actual deadlock condition, so we have some
      improvement I hope (the concept is no less complete than discovery of a lock's
      interrupt contexts).
      
      I guess we could even do the same thing with __GFP_IO (normal reclaim), and
      even GFP_NOIO locks too... but filesystems will have the most locks and fiddly
      code paths, so let's start there and see how it goes.
      
      It *seems* to work. I did a quick test.
      
      =================================
      [ INFO: inconsistent lock state ]
      2.6.28-rc6-00007-ged313489-dirty #26
      ---------------------------------
      inconsistent {in-reclaim-W} -> {ov-reclaim-W} usage.
      modprobe/8526 [HC0[0]:SC0[0]:HE1:SE1] takes:
       (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
      {in-reclaim-W} state was registered at:
        [<ffffffff80267bdb>] __lock_acquire+0x75b/0x1a60
        [<ffffffff80268f71>] lock_acquire+0x91/0xc0
        [<ffffffff8070f0e1>] mutex_lock_nested+0xb1/0x310
        [<ffffffffa002002b>] brd_init+0x2b/0x216 [brd]
        [<ffffffff8020903b>] _stext+0x3b/0x170
        [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
        [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
        [<ffffffffffffffff>] 0xffffffffffffffff
      irq event stamp: 3929
      hardirqs last  enabled at (3929): [<ffffffff8070f2b5>] mutex_lock_nested+0x285/0x310
      hardirqs last disabled at (3928): [<ffffffff8070f089>] mutex_lock_nested+0x59/0x310
      softirqs last  enabled at (3732): [<ffffffff8061f623>] sk_filter+0x83/0xe0
      softirqs last disabled at (3730): [<ffffffff8061f5b6>] sk_filter+0x16/0xe0
      
      other info that might help us debug this:
      1 lock held by modprobe/8526:
       #0:  (testlock){--..}, at: [<ffffffffa0020055>] brd_init+0x55/0x216 [brd]
      
      stack backtrace:
      Pid: 8526, comm: modprobe Not tainted 2.6.28-rc6-00007-ged313489-dirty #26
      Call Trace:
       [<ffffffff80265483>] print_usage_bug+0x193/0x1d0
       [<ffffffff80266530>] mark_lock+0xaf0/0xca0
       [<ffffffff80266735>] mark_held_locks+0x55/0xc0
       [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
       [<ffffffff802667ca>] trace_reclaim_fs+0x2a/0x60
       [<ffffffff80285005>] __alloc_pages_internal+0x475/0x580
       [<ffffffff8070f29e>] ? mutex_lock_nested+0x26e/0x310
       [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
       [<ffffffffa002006a>] brd_init+0x6a/0x216 [brd]
       [<ffffffffa0020000>] ? brd_init+0x0/0x216 [brd]
       [<ffffffff8020903b>] _stext+0x3b/0x170
       [<ffffffff8070f8b9>] ? mutex_unlock+0x9/0x10
       [<ffffffff8070f83d>] ? __mutex_unlock_slowpath+0x10d/0x180
       [<ffffffff802669ec>] ? trace_hardirqs_on_caller+0x12c/0x190
       [<ffffffff80272ebf>] sys_init_module+0xaf/0x1e0
       [<ffffffff8020c3fb>] system_call_fastpath+0x16/0x1b
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cf40bd16
  7. 09 1月, 2009 13 次提交
  8. 07 1月, 2009 13 次提交
    • K
      mm: stop kswapd's infinite loop at high order allocation · 73ce02e9
      KOSAKI Motohiro 提交于
      Wassim Dagash reported following kswapd infinite loop problem.
      
        kswapd runs in some infinite loop trying to swap until order 10 of zone
        highmem is OK.... kswapd will continue to try to balance order 10 of zone
        highmem forever (or until someone release a very large chunk of highmem).
      
      For non order-0 allocations, the system may never be balanced due to
      fragmentation but kswapd should not infinitely loop as a result.
      
      Instead, recheck all watermarks at order-0 as they are the most important.
      If watermarks are ok, kswapd will go back to sleep.
      
      [akpm@linux-foundation.org: fix comment]
      Reported-by: Nwassim dagash <wassim.dagash@gmail.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73ce02e9
    • A
      vmscan: shrink_active_list(): reduce lru_lock hold time · b555749a
      Andrew Morton 提交于
      These three statements manipulate local variables and do not need the lock
      coverage.
      
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b555749a
    • K
      mm: kill zone_is_near_oom() · 09f445e7
      KOSAKI Motohiro 提交于
      zone_is_near_oom() is unused.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      09f445e7
    • K
      vmscan: improve reclaim throughput to bail out patch · 01dbe5c9
      KOSAKI Motohiro 提交于
      The vmscan bail out patch move nr_reclaimed variable to struct
      scan_control.  Unfortunately, indirect access can easily happen cache
      miss.
      
      if heavy memory pressure happend, that's ok.
      cache miss already plenty. it is not observable.
      
      but, if memory pressure is lite, performance degression is obserbable.
      
      I compared following three pattern (it was mesured 10 times each)
      
      hackbench 125 process 3000
      hackbench 130 process 3000
      hackbench 135 process 3000
      
                  2.6.28-rc6                       bail-out
      
      	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		93.414	73.254	193.382
      	74.145	78.295	77.27		74.897	75.021	80.17
      	70.305	77.643	75.855		70.134	77.571	79.896
      	74.288	73.986	75.955		77.222	78.48	80.619
      	72.029	79.947	78.312		75.128	82.172	79.708
      	71.499	77.615	77.042		74.177	76.532	77.306
      	76.188	74.471	83.562		73.839	72.43	79.833
      	73.236	75.606	78.743		76.001	76.557	82.726
      	69.427	77.271	76.691		76.236	79.371	103.189
      	72.473	76.978	80.643		69.128	78.932	75.736
      
      avg	72.545	76.767	78.534		76.017	77.03	93.256
      std	1.89	1.71	2.41		6.29	2.79	34.16
      min	69.427	73.986	75.855		69.128	72.43	75.736
      max	76.188	79.947	83.562		93.414	82.172	193.382
      
      about 4-5% degression.
      
      Then, this patch introduces a temporary local variable.
      
      result:
      
                  2.6.28-rc6                       this patch
      
      num	125	130	135		125	130	135
            ==============================================================
      	71.866	75.86	81.274		67.302	68.269	77.161
      	74.145	78.295	77.27   	72.616	72.712	79.06
      	70.305	77.643	75.855  	72.475	75.712	77.735
      	74.288	73.986	75.955  	69.229	73.062	78.814
      	72.029	79.947	78.312  	71.551	74.392	78.564
      	71.499	77.615	77.042  	69.227	74.31	78.837
      	76.188	74.471	83.562  	70.759	75.256	76.6
      	73.236	75.606	78.743  	69.966	76.001	78.464
      	69.427	77.271	76.691  	69.068	75.218	80.321
      	72.473	76.978	80.643  	72.057	77.151	79.068
      
      avg	72.545	76.767	78.534 		70.425	74.2083	78.462
      std 	1.89	1.71	2.41    	1.66	2.34	1.00
      min 	69.427	73.986	75.855  	67.302	68.269	76.6
      max 	76.188	79.947	83.562  	72.616	77.151	80.321
      
      OK. the degression is disappeared.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01dbe5c9
    • R
      vmscan: bail out of direct reclaim after swap_cluster_max pages · a79311c1
      Rik van Riel 提交于
      When the VM is under pressure, it can happen that several direct reclaim
      processes are in the pageout code simultaneously.  It also happens that
      the reclaiming processes run into mostly referenced, mapped and dirty
      pages in the first round.
      
      This results in multiple direct reclaim processes having a lower
      pageout priority, which corresponds to a higher target of pages to
      scan.
      
      This in turn can result in each direct reclaim process freeing
      many pages.  Together, they can end up freeing way too many pages.
      
      This kicks useful data out of memory (in some cases more than half
      of all memory is swapped out).  It also impacts performance by
      keeping tasks stuck in the pageout code for too long.
      
      A 30% improvement in hackbench has been observed with this patch.
      
      The fix is relatively simple: in shrink_zone() we can check how many
      pages we have already freed, direct reclaim tasks break out of the
      scanning loop if they have already freed enough pages and have reached
      a lower priority level.
      
      We do not break out of shrink_zone() when priority == DEF_PRIORITY,
      to ensure that equal pressure is applied to every zone in the common
      case.
      
      However, in order to do this we do need to know how many pages we already
      freed, so move nr_reclaimed into scan_control.
      
      akpm: a historical interlude...
      
      We tried this in 2004:
      
      :commit e468e46a9bea3297011d5918663ce6d19094cf87
      :Author: akpm <akpm>
      :Date:   Thu Jun 24 15:53:52 2004 +0000
      :
      :[PATCH] vmscan.c: dont reclaim too many pages
      :
      :    The shrink_zone() logic can, under some circumstances, cause far too many
      :    pages to be reclaimed.  Say, we're scanning at high priority and suddenly hit
      :    a large number of reclaimable pages on the LRU.
      :    Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed.
      
      And we reverted it in 2006:
      
      :commit 210fe530
      :Author: Andrew Morton <akpm@osdl.org>
      :Date:   Fri Jan 6 00:11:14 2006 -0800
      :
      :    [PATCH] vmscan: balancing fix
      :
      :    Revert a patch which went into 2.6.8-rc1.  The changelog for that patch was:
      :
      :      The shrink_zone() logic can, under some circumstances, cause far too many
      :      pages to be reclaimed.  Say, we're scanning at high priority and suddenly
      :      hit a large number of reclaimable pages on the LRU.
      :
      :      Change things so we bale out when SWAP_CLUSTER_MAX pages have been
      :      reclaimed.
      :
      :    Problem is, this change caused significant imbalance in inter-zone scan
      :    balancing by truncating scans of larger zones.
      :
      :    Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL.  The zone
      :    balancing algorithm would require that if we're scanning 100 pages of
      :    ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL.  But this logic will
      :    cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are
      :    reclaimed.  Thus effectively causing smaller zones to be scanned relatively
      :    harder than large ones.
      :
      :    Now I need to remember what the workload was which caused me to write this
      :    patch originally, then fix it up in a different way...
      
      And we haven't demonstrated that whatever problem caused that reversion is
      not being reintroduced by this change in 2008.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a79311c1
    • K
      mm: make scan_zone_unevictable_pages() static · 14b90b22
      KOSAKI Motohiro 提交于
      sparse output following warning
      
      	mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      14b90b22
    • K
      mm: make scan_all_zones_unevictable_pages() static · ff30153b
      KOSAKI Motohiro 提交于
      sparse output following warning.
      
      	mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static?
      
      cleanup here.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff30153b
    • K
      memcg: reclaim shouldn't change zone->recent_rotated statistics · 077cbc58
      KOSAKI Motohiro 提交于
      memcg reclaim shouldn't change zone->recent_rotated statistics.  If
      memcgroup reclaim changes zone statistics, global reclaim can get a bit
      confused.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      077cbc58
    • H
      mm: optimize get_scan_ratio for no swap · b962716b
      Hugh Dickins 提交于
      Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP.  Yes, the gcc
      optimizer gives us that, when nr_swap_pages is #defined as 0L.  Move usual
      declaration to swapfile.c: it never belonged in page_alloc.c.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b962716b
    • H
      mm: add add_to_swap stub · 60371d97
      Hugh Dickins 提交于
      If we add a failing stub for add_to_swap(), then we can remove the #ifdef
      CONFIG_SWAP from mm/vmscan.c.
      
      This was intended as a source cleanup, but looking more closely, it turns
      out that the !CONFIG_SWAP case was going to keep_locked for an anonymous
      page, whereas now it goes to the more suitable activate_locked, like the
      CONFIG_SWAP nr_swap_pages 0 case.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60371d97
    • H
      mm: remove gfp_mask from add_to_swap · ac47b003
      Hugh Dickins 提交于
      Remove gfp_mask argument from add_to_swap(): it's misleading because its
      only caller, shrink_page_list(), is not atomic at that point; and in due
      course (implementing discard) we'll sometimes want to allocate some memory
      with GFP_NOIO (as is used in swap_writepage) when allocating swap.
      
      No change to the gfp_mask passed down to add_to_swap_cache(): still use
      __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before):
      though it's not obvious if that's the best combination to ask for here.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac47b003
    • H
      mm: remove try_to_munlock from vmscan · 63d6c5ad
      Hugh Dickins 提交于
      An unfortunate feature of the Unevictable LRU work was that reclaiming an
      anonymous page involved an extra scan through the anon_vma: to check that
      the page is evictable before allocating swap, because the swap could not
      be freed reliably soon afterwards.
      
      Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's
      not an issue any more: remove try_to_munlock() call from
      shrink_page_list(), leaving it to try_to_munmap() to discover if the page
      is one to be culled to the unevictable list - in which case then
      try_to_free_swap().
      
      Update unevictable-lru.txt to remove comments on the try_to_munlock() in
      shrink_page_list(), and shorten some lines over 80 columns.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63d6c5ad
    • H
      mm: try_to_free_swap replaces remove_exclusive_swap_page · a2c43eed
      Hugh Dickins 提交于
      remove_exclusive_swap_page(): its problem is in living up to its name.
      
      It doesn't matter if someone else has a reference to the page (raised
      page_count); it doesn't matter if the page is mapped into userspace
      (raised page_mapcount - though that hints it may be worth keeping the
      swap): all that matters is that there be no more references to the swap
      (and no writeback in progress).
      
      swapoff (try_to_unuse) has been removing pages from swapcache for years,
      with no concern for page count or page mapcount, and we used to have a
      comment in lookup_swap_cache() recognizing that: if you go for a page of
      swapcache, you'll get the right page, but it could have been removed from
      swapcache by the time you get page lock.
      
      So, give up asking for exclusivity: get rid of
      remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and
      remove_exclusive_swap_page_count() which were spawned for the recent LRU
      work: replace them by the simpler try_to_free_swap() which just checks
      page_swapcount().
      
      Similarly, remove the page_count limitation from free_swap_and_count(),
      but assume that it's worth holding on to the swap if page is mapped and
      swap nowhere near full.  Add a vm_swap_full() test in free_swap_cache()?
      It would be consistent, but I think we probably have enough for now.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2c43eed