1. 25 5月, 2011 4 次提交
  2. 21 5月, 2011 1 次提交
    • L
      sanitize <linux/prefetch.h> usage · 268bb0ce
      Linus Torvalds 提交于
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
  3. 17 5月, 2011 1 次提交
  4. 12 5月, 2011 2 次提交
    • A
      mm: add alloc_pages_exact_nid() · ee85c2e1
      Andi Kleen 提交于
      Add a alloc_pages_exact_nid() that allocates on a specific node.
      
      The naming is quite broken, but fixing that would need a larger renaming
      action.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ee85c2e1
    • Y
      mm: use alloc_bootmem_node_nopanic() on really needed path · 8f389a99
      Yinghai Lu 提交于
      Stefan found nobootmem does not work on his system that has only 8M of
      RAM.  This causes an early panic:
      
        BIOS-provided physical RAM map:
         BIOS-88: 0000000000000000 - 000000000009f000 (usable)
         BIOS-88: 0000000000100000 - 0000000000840000 (usable)
        bootconsole [earlyser0] enabled
        Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
        DMI not present or invalid.
        last_pfn = 0x840 max_arch_pfn = 0x100000
        init_memory_mapping: 0000000000000000-0000000000840000
        8MB LOWMEM available.
          mapped low ram: 0 - 00840000
          low ram: 0 - 00840000
        Zone PFN ranges:
          DMA      0x00000001 -> 0x00001000
          Normal   empty
        Movable zone start PFN for each node
        early_node_map[2] active PFN ranges
            0: 0x00000001 -> 0x0000009f
            0: 0x00000100 -> 0x00000840
        BUG: Int 6: CR2 (null)
             EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
             EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
             err (null)  EIP c0353191   CS c0320060  flg 00010082
        Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
               00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
               (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
        Pid: 0, comm: swapper Not tainted 2.6.36 #5
        Call Trace:
         [<c02e3707>] ? 0xc02e3707
         [<c035e6e5>] 0xc035e6e5
         [<c0353191>] ? 0xc0353191
         [<c03534d6>] 0xc03534d6
         [<c034f1cd>] 0xc034f1cd
         [<c034a824>] 0xc034a824
         [<c03513cb>] ? 0xc03513cb
         [<c0349432>] 0xc0349432
         [<c0349066>] 0xc0349066
      
      It turns out that we should ignore the low limit of 16M.
      
      Use alloc_bootmem_node_nopanic() in this case.
      
      [akpm@linux-foundation.org: less mess]
      Signed-off-by: NYinghai LU <yinghai@kernel.org>
      Reported-by: NStefan Hellermann <stefan@the2masters.de>
      Tested-by: NStefan Hellermann <stefan@the2masters.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>		[2.6.34+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f389a99
  5. 15 4月, 2011 1 次提交
  6. 10 4月, 2011 1 次提交
  7. 31 3月, 2011 1 次提交
  8. 25 3月, 2011 1 次提交
  9. 24 3月, 2011 1 次提交
  10. 23 3月, 2011 7 次提交
  11. 18 3月, 2011 1 次提交
  12. 26 2月, 2011 2 次提交
  13. 24 2月, 2011 2 次提交
  14. 26 1月, 2011 2 次提交
    • D
      mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator · 2ff754fa
      David Rientjes 提交于
      Commit 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone") uncovered a livelock in the page
      allocator that resulted in tasks infinitely looping trying to find
      memory and kswapd running at 100% cpu.
      
      The issue occurs because drain_all_pages() is called immediately
      following direct reclaim when no memory is freed and try_to_free_pages()
      returns non-zero because all zones in the zonelist do not have their
      all_unreclaimable flag set.
      
      When draining the per-cpu pagesets back to the buddy allocator for each
      zone, the zone->pages_scanned counter is cleared to avoid erroneously
      setting zone->all_unreclaimable later.  The problem is that no pages may
      actually be drained and, thus, the unreclaimable logic never fails
      direct reclaim so the oom killer may be invoked.
      
      This apparently only manifested after wait_iff_congested() was
      introduced and the zone was full of anonymous memory that would not
      congest the backing store.  The page allocator would infinitely loop if
      there were no other tasks waiting to be scheduled and clear
      zone->pages_scanned because of drain_all_pages() as the result of this
      change before kswapd could scan enough pages to trigger the reclaim
      logic.  Additionally, with every loop of the page allocator and in the
      reclaim path, kswapd would be kicked and would end up running at 100%
      cpu.  In this scenario, current and kswapd are all running continuously
      with kswapd incrementing zone->pages_scanned and current clearing it.
      
      The problem is even more pronounced when current swaps some of its
      memory to swap cache and the reclaimable logic then considers all active
      anonymous memory in the all_unreclaimable logic, which requires a much
      higher zone->pages_scanned value for try_to_free_pages() to return zero
      that is never attainable in this scenario.
      
      Before wait_iff_congested(), the page allocator would incur an
      unconditional timeout and allow kswapd to elevate zone->pages_scanned to
      a level that the oom killer would be called the next time it loops.
      
      The fix is to only attempt to drain pcp pages if there is actually a
      quantity to be drained.  The unconditional clearing of
      zone->pages_scanned in free_pcppages_bulk() need not be changed since
      other callers already ensure that draining will occur.  This patch
      ensures that free_pcppages_bulk() will actually free memory before
      calling into it from drain_all_pages() so zone->pages_scanned is only
      cleared if appropriate.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ff754fa
    • D
      mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      David Rientjes 提交于
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate from given
      the type requested, and whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f33261d7
  15. 14 1月, 2011 13 次提交
    • A
      mm/page_alloc.c: don't cache `current' in a local · c06b1fca
      Andrew Morton 提交于
      It's old-fashioned and unneeded.
      
      akpm:/usr/src/25> size mm/page_alloc.o
         text    data     bss     dec     hex filename
        39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
        39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c06b1fca
    • K
      mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists · 43506fad
      KyongHo Cho 提交于
      The previous approach of calucation of combined index was
      
      	page_idx & ~(1 << order))
      
      but we have same result with
      
      	page_idx & buddy_idx
      
      This reduces instructions slightly as well as enhances readability.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix used-unintialised warning]
      Signed-off-by: NKyongHo Cho <pullip.cho@samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      43506fad
    • A
      thp: remove PG_buddy · 5f24ce5f
      Andrea Arcangeli 提交于
      PG_buddy can be converted to _mapcount == -2.  So the PG_compound_lock can
      be added to page->flags without overflowing (because of the sparse section
      bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
      has to move the memory hotplug code from _mapcount to lru.next to avoid
      any risk of clashes.  We can't use lru.next for PG_buddy removal, but
      memory hotplug can use lru.next even more easily than the mapcount
      instead.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f24ce5f
    • A
      thp: don't alloc harder for gfp nomemalloc even if nowait · 5c3240d9
      Andrea Arcangeli 提交于
      Not worth throwing away the precious reserved free memory pool for
      allocations that can fail gracefully (either through mempool or because
      they're transhuge allocations later falling back to 4k allocations).
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c3240d9
    • A
      thp: _GFP_NO_KSWAPD · 32dba98e
      Andrea Arcangeli 提交于
      Transparent hugepage allocations must be allowed not to invoke kswapd or
      any other kind of indirect reclaim (especially when the defrag sysfs is
      control disabled).  It's unacceptable to swap out anonymous pages
      (potentially anonymous transparent hugepages) in order to create new
      transparent hugepages.  This is true for the MADV_HUGEPAGE areas too
      (swapping out a kvm virtual machine and so having it suffer an unbearable
      slowdown, so another one with guest physical memory marked MADV_HUGEPAGE
      can run 30% faster if it is running memory intensive workloads, makes no
      sense).  If a transparent hugepage allocation fails the slowdown is minor
      and there is total fallback, so kswapd should never be asked to swapout
      memory to allow the high order allocation to succeed.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32dba98e
    • A
      thp: comment reminder in destroy_compound_page · 59ff4216
      Andrea Arcangeli 提交于
      Warn destroy_compound_page that __split_huge_page_refcount is heavily
      dependent on its internal behavior.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59ff4216
    • A
      thp: clear compound mapping · 8dd60a3a
      Andrea Arcangeli 提交于
      Clear compound mapping for anonymous compound pages like it already
      happens for regular anonymous pages.  But crash if mapping is set for any
      tail page, also the PageAnon check is meaningless for tail pages.  This
      check only makes sense for the head page, for tail page it can only hide
      bugs and we definitely don't want to hide bugs.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dd60a3a
    • A
      thp: fix bad_page to show the real reason the page is bad · 4e9f64c4
      Andrea Arcangeli 提交于
      page_count shows the count of the head page, but the actual check is done
      on the tail page, so show what is really being checked.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e9f64c4
    • V
      mm: set correct numa_zonelist_order string when configured on the kernel command line · ecb256f8
      Volodymyr G. Lukiianyk 提交于
      When numa_zonelist_order parameter is set to "node" or "zone" on the
      command line it's still showing as "default" in sysctl.  That's because
      early_param parsing function changes only user_zonelist_order variable.
      Fix this by copying user-provided string to numa_zonelist_order if it was
      successfully parsed.
      Signed-off-by: NVolodymyr G Lukiianyk <volodymyrgl@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb256f8
    • M
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman 提交于
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevaling
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artifically awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
       Second, I previous specified slub_min_order=3 again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      Tgis patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99504748
    • M
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman 提交于
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path
      
      Migration synchronously waits for writeback if the initial passes fails.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
    • M
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman 提交于
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim whe compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman 提交于
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reported-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f5acf8