1. 14 1月, 2011 8 次提交
    • A
      thp: comment reminder in destroy_compound_page · 59ff4216
      Andrea Arcangeli 提交于
      Warn destroy_compound_page that __split_huge_page_refcount is heavily
      dependent on its internal behavior.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59ff4216
    • A
      thp: clear compound mapping · 8dd60a3a
      Andrea Arcangeli 提交于
      Clear compound mapping for anonymous compound pages like it already
      happens for regular anonymous pages.  But crash if mapping is set for any
      tail page, also the PageAnon check is meaningless for tail pages.  This
      check only makes sense for the head page, for tail page it can only hide
      bugs and we definitely don't want to hide bugs.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8dd60a3a
    • A
      thp: fix bad_page to show the real reason the page is bad · 4e9f64c4
      Andrea Arcangeli 提交于
      page_count shows the count of the head page, but the actual check is done
      on the tail page, so show what is really being checked.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e9f64c4
    • V
      mm: set correct numa_zonelist_order string when configured on the kernel command line · ecb256f8
      Volodymyr G. Lukiianyk 提交于
      When numa_zonelist_order parameter is set to "node" or "zone" on the
      command line it's still showing as "default" in sysctl.  That's because
      early_param parsing function changes only user_zonelist_order variable.
      Fix this by copying user-provided string to numa_zonelist_order if it was
      successfully parsed.
      Signed-off-by: NVolodymyr G Lukiianyk <volodymyrgl@gmail.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ecb256f8
    • M
      mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Mel Gorman 提交于
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevaling
      theory became that kswapd is reclaiming order-3 pages requested by SLUB
      too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artifically awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
       Second, I previous specified slub_min_order=3 again in an attempt to
      reproduce Simon's problem.  In this posting, I'm depending on Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      Tgis patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NEric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      99504748
    • M
      mm: migration: allow migration to operate asynchronously and avoid synchronous... · 77f1fe6b
      Mel Gorman 提交于
      mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path
      
      Migration synchronously waits for writeback if the initial passes fails.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
    • M
      mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Mel Gorman 提交于
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim whe compaction is available called reclaim/compaction.  The basic
      operation is very simple - instead of selecting a contiguous range of
      pages to reclaim, a number of order-0 pages are reclaimed and then
      compaction is later by either kswapd (compact_zone_order()) or direct
      compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • M
      mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Mel Gorman 提交于
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when a threshold is above a
      threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortately, as reported by Shaohua Li this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reported-by: NShaohua Li <shaohua.li@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Tested-by: NNicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
  2. 07 12月, 2010 1 次提交
    • R
      PM / Hibernate: Fix memory corruption related to swap · c9e664f1
      Rafael J. Wysocki 提交于
      There is a problem that swap pages allocated before the creation of
      a hibernation image can be released and used for storing the contents
      of different memory pages while the image is being saved.  Since the
      kernel stored in the image doesn't know of that, it causes memory
      corruption to occur after resume from hibernation, especially on
      systems with relatively small RAM that need to swap often.
      
      This issue can be addressed by keeping the GFP_IOFS bits clear
      in gfp_allowed_mask during the entire hibernation, including the
      saving of the image, until the system is finally turned off or
      the hibernation is aborted.  Unfortunately, for this purpose
      it's necessary to rework the way in which the hibernate and
      suspend code manipulates gfp_allowed_mask.
      
      This change is based on an earlier patch from Hugh Dickins.
      Signed-off-by: NRafael J. Wysocki <rjw@sisk.pl>
      Reported-by: NOndrej Zary <linux@rainbow-software.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      c9e664f1
  3. 29 11月, 2010 1 次提交
    • J
      Kill off a bunch of warning: ‘inline’ is not at beginning of declaration · fa9f90be
      Jesper Juhl 提交于
      These warnings are spewed during a build of a 'allnoconfig' kernel
      (especially the ones from u64_stats_sync.h show up a lot) when building
      with -Wextra (which I often do)..
      They are
        a) annoying
        b) easy to get rid of.
      This patch kills them off.
      
      include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
      kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
      mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
      Signed-off-by: NJesper Juhl <jj@chaosbits.net>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      fa9f90be
  4. 25 11月, 2010 1 次提交
    • K
      mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly... · e9959f0f
      KAMEZAWA Hiroyuki 提交于
      mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run()
      
      During memory hotplug, build_allzonelists() may be called under
      stop_machine_run().  In this function, setup_zone_pageset() is called.
      But it's bug because it will do page allocation under stop_machine_run().
      
      Here is a report from Alok Kataria.
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:94
        in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
        Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
        Call Trace:
         [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
         [<ffffffff81468245>] mutex_lock+0x24/0x50
         [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
         [<ffffffff81048888>] ? load_balance+0xbe/0x60e
         [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
         [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
         [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
         [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
         [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
         [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
         [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
         [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
         [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
         [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
         [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
         [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
         [<ffffffff81065f29>] kthread+0x7f/0x87
         [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
         [<ffffffff81065eaa>] ? kthread+0x0/0x87
         [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
        Policy zone: Normal
      
      This patch tries to fix the issue by moving setup_zone_pageset() out from
      stop_machine_run(). It's obviously not necessary to be called under
      stop_machine_run().
      
      [akpm@linux-foundation.org: remove unneeded local]
      Reported-by: NAlok Kataria <akataria@vmware.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Petr Vandrovec <petr@vmware.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9959f0f
  5. 27 10月, 2010 5 次提交
    • N
      mm: add casts to/from gfp_t in gfp_to_alloc_flags() · e6223a3b
      Namhyung Kim 提交于
      This removes following warning from sparse:
      
       mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6223a3b
    • M
      writeback: do not sleep on the congestion queue if there are no congested BDIs... · 0e093d99
      Mel Gorman 提交于
      writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
      
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested else, it calls cond_resched() to ensure the caller
      is not hogging the CPU longer than its quota but otherwise will not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e093d99
    • K
      memory hotplug: unify is_removable and offline detection code · 49ac8255
      KAMEZAWA Hiroyuki 提交于
      Now, sysfs interface of memory hotplug shows whether the section is
      removable or not.  But it checks only migrateype of pages and doesn't
      check details of cluster of pages.
      
      Next, memory hotplug's set_migratetype_isolate() has the same kind of
      check, too.
      
      This patch adds the function __count_unmovable_pages() and makes above 2
      checks to use the same logic.  Then, is_removable and hotremove code uses
      the same logic.  No changes in the hotremove logic itself.
      
      TODO: need to find a way to check RECLAMABLE. But, considering bit,
            calling shrink_slab() against a range before starting memory hotremove
            sounds better. If so, this patch's logic doesn't need to be changed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49ac8255
    • K
      memory hotplug: fix notifier's return value check · 4b20477f
      KAMEZAWA Hiroyuki 提交于
      Even if notifier cannot find any pages, it doesn't mean no pages are
      available...And, if there are no notifiers registered, this condition will
      be always true and memory hotplug will show -EBUSY.
      
      This is a bug but not critical.
      
      In most case, a pageblock which will be offlined is MIGRATE_MOVABLE This
      "notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE.
      But if not MIGRATE_MOVABLE, it's common case that memory hotplug will
      fail.  So, no one notice this bug.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b20477f
    • M
      mm, page-allocator: do not check the state of a non-existant buddy during free · b7f50cfa
      Mel Gorman 提交于
      There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
      in buddy allocator by adding buddies that are merging to the tail of the
      free lists") that means a buddy at order MAX_ORDER is checked for merging.
       A page of this order never exists so at times, an effectively random
      piece of memory is being checked.
      
      Alan Curry has reported that this is causing memory corruption in
      userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
      It is not clear why this is happening.  It could be a cache coherency
      problem where pages mapped in both user and kernel space are getting
      different cache lines due to the bad read from kernel space
      (http://lkml.org/lkml/2010/10/13/179).  It could also be that there are
      some special registers being io-remapped at the end of the memmap array
      and that a read has special meaning on them.  Compiler bugs have been
      ruled out because the assembly before and after the patch looks relatively
      harmless.
      
      This patch fixes the problem by ensuring we are not reading a possibly
      invalid location of memory.  It's not clear why the read causes corruption
      but one way or the other it is a buggy read.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Reported-by: NAlan Curry <pacman@kosh.dhis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7f50cfa
  6. 08 10月, 2010 1 次提交
  7. 10 9月, 2010 3 次提交
  8. 28 8月, 2010 2 次提交
    • Y
      x86: Use memblock to replace early_res · 72d7c3b3
      Yinghai Lu 提交于
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range.
      4. NO_BOOTMEM will switch to use memblock too.
      5. use _e820, _early wrap in the patch, in following patch, will
         replace them all
      6. because memblock_x86_free_range support partial free, we can remove some special care
      7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
         so adjust some calling later in setup.c::setup_arch()
         -- corruption_check and mptable_update
      
      -v2: Move reserve_brk() early
          Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
          that could happen We have more then 128 RAM entry in E820 tables, and
          memblock_x86_fill() could use memblock_find_in_range() to find a new place for
          memblock.memory.region array.
          and We don't need to use extend_brk() after fill_memblock_area()
          So move reserve_brk() early before fill_memblock_area().
      -v3: Move find_smp_config early
          To make sure memblock_find_in_range not find wrong place, if BIOS doesn't put mptable
          in right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory. and they are already in
          memblock.reserved already..
          use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
      -v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
          active_region for 32bit does include high pages
          need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
      -v6: Use current_limit instead
      -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries
      -v9: update after kmemleak changes in mainline
      Suggested-by: NDavid S. Miller <davem@davemloft.net>
      Suggested-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      72d7c3b3
    • Y
      memblock: Add find_memory_core_early() · edbe7d23
      Yinghai Lu 提交于
      According to node range in early_node_map[] with __memblock_find_in_range
      to find free range.
      
      Will be used by memblock_x86_find_in_range_node()
      
      memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
      edbe7d23
  9. 10 8月, 2010 3 次提交
    • K
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro 提交于
      Since 2.6.28 zone->prev_priority is unused. Then it can be removed
      safely. It reduce stack usage slightly.
      
      Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
      can be integrate again, it's useful. but four (or more) times trying
      haven't got good performance number. Thus I give up such approach.
      
      The rest of this changelog is notes on prev_priority and why it existed in
      the first place and why it might be not necessary any more. This information
      is based heavily on discussions between Andrew Morton, Rik van Riel and
      Kosaki Motohiro who is heavily quotes from.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the liklihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority prev_priority wouldn't works
      nowadays. First of all, current vmscan still a lot of UP centric code. it
      expose some weakness on some dozens CPUs machine. I think we need more and
      more improvement.
      
      The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
      and per-task-pressure a bit. example, prev_priority try to boost priority to
      other concurrent priority. but if the another task have mempolicy restriction,
      it is unnecessary, but also makes wrong big latency and exceeding reclaim.
      per-task based priority + prev_priority adjustment make the emulation of
      per-system pressure. but it have two issue 1) too rough and brutal emulation
      2) we need per-zone pressure, not per-system.
      
      Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
      2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
      but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
      system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
      prev_priority can't solve such multithreads workload issue. In other word,
      prev_priority concept assume the sysmtem don't have lots threads."
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25edde03
    • M
      mm: rename try_set_zone_oom() to try_set_zonelist_oom() · ff321fea
      Minchan Kim 提交于
      We have been used naming try_set_zone_oom and clear_zonelist_oom.
      The role of functions is to lock of zonelist for preventing parallel
      OOM. So clear_zonelist_oom makes sense but try_set_zone_oome is rather
      awkward and unmatched with clear_zonelist_oom.
      
      Let's change it with try_set_zonelist_oom.
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff321fea
    • D
      oom: avoid oom killer for lowmem allocations · 03668b3c
      David Rientjes 提交于
      If memory has been depleted in lowmem zones even with the protection
      afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
      killing current users will help.  The memory is either reclaimable (or
      migratable) already, in which case we should not invoke the oom killer at
      all, or it is pinned by an application for I/O.  Killing such an
      application may leave the hardware in an unspecified state and there is no
      guarantee that it will be able to make a timely exit.
      
      Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
      not used so that the task can perhaps recover or try again later.
      
      Previously, the heuristic provided some protection for those tasks with
      CAP_SYS_RAWIO, but this is no longer necessary since we will not be
      killing tasks for the purposes of ISA allocations.
      
      high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
      default for all allocations that are not __GFP_DMA, __GFP_DMA32,
      __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
      flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
      return true for allocations that have either __GFP_DMA or __GFP_DMA32.
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03668b3c
  10. 21 7月, 2010 1 次提交
    • Y
      x86,nobootmem: make alloc_bootmem_node fall back to other node when 32bit numa is used · b8ab9f82
      Yinghai Lu 提交于
      Borislav Petkov reported his 32bit numa system has problem:
      
      [    0.000000] Reserving total of 4c00 pages for numa KVA remap
      [    0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
      [    0.000000] max_pfn = 238000
      [    0.000000] 8202MB HIGHMEM available.
      [    0.000000] 885MB LOWMEM available.
      [    0.000000]   mapped low ram: 0 - 375fe000
      [    0.000000]   low ram: 0 - 375fe000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
      [    0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
      [    0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
      [    0.000000] BUG: unable to handle kernel paging request at 40000000
      [    0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
      [    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
      ...
      [    0.000000] Call Trace:
      [    0.000000]  [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
      [    0.000000]  [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
      [    0.000000]  [<c2c9149e>] ? sparse_init+0x1dc/0x499
      [    0.000000]  [<c2c79118>] ? paging_init+0x168/0x1df
      [    0.000000]  [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb
      
      looks like it allocates too much high address for bootmem.
      
      Try to cut limit with get_max_mapped()
      Reported-by: NBorislav Petkov <borislav.petkov@amd.com>
      Tested-by: NConny Seidel <conny.seidel@amd.com>
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8ab9f82
  11. 19 7月, 2010 1 次提交
  12. 28 5月, 2010 2 次提交
    • L
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Lee Schermerhorn 提交于
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7aac7898
    • L
      numa: add generic percpu var numa_node_id() implementation · 72812019
      Lee Schermerhorn 提交于
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implemention will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implemenation.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72812019
  13. 25 5月, 2010 10 次提交
    • H
      mem-hotplug: fix potential race while building zonelist for new populated zone · 4eaf3f64
      Haicheng Li 提交于
      Add global mutex zonelists_mutex to fix the possible race:
      
           CPU0                                  CPU1                    CPU2
      (1) zone->present_pages += online_pages;
      (2)                                       build_all_zonelists();
      (3)                                                               alloc_page();
      (4)                                                               free_page();
      (5) build_all_zonelists();
      (6)   __build_all_zonelists();
      (7)     zone->pageset = alloc_percpu();
      
      In step (3,4), zone->pageset still points to boot_pageset, so bad
      things may happen if 2+ nodes are in this state. Even if only 1 node
      is accessing the boot_pageset, (3) may still consume too much memory
      to fail the memory allocations in step (7).
      
      Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
      since there is a new fresh memory block added in step(6).
      
      [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: NAndi Kleen <andi.kleen@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eaf3f64
    • H
      mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset · 1f522509
      Haicheng Li 提交于
      For each new populated zone of hotadded node, need to update its pagesets
      with dynamically allocated per_cpu_pageset struct for all possible CPUs:
      
          1) Detach zone->pageset from the shared boot_pageset
             at end of __build_all_zonelists().
      
          2) Use mutex to protect zone->pageset when it's still
             shared in onlined_pages()
      
      Otherwises, multiple zones of different nodes would share same boot strapping
      boot_pageset for same CPU, which will finally cause below kernel panic:
      
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:1239!
        invalid opcode: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
         [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
         [<ffffffff81128407>] __page_cache_alloc+0x67/0x70
         [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
         [<ffffffff81132751>] ra_submit+0x21/0x30
         [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
         [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
         [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
         [<ffffffff81266cfa>] nfs_file_read+0xca/0x130
         [<ffffffff8117b20a>] do_sync_read+0xfa/0x140
         [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
         [<ffffffff8117c151>] sys_read+0x51/0x80
         [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
        RIP  [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
         RSP <ffff88000d1e78a8>
        ---[ end trace 4bda28328b9990db ]
      
      [akpm@linux-foundation.org: merge fix]
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: NAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f522509
    • W
      mem-hotplug: separate setup_per_cpu_pageset() into separate functions · 319774e2
      Wu Fengguang 提交于
      No behavior change here.
      
      Move some of setup_per_cpu_pageset() code into a new function
      setup_zone_pageset() that will be useful for memory hotplug.
      Signed-off-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NHaicheng Li <haicheng.li@linux.intel.com>
      Reviewed-by: NAndi Kleen <andi.kleen@intel.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      319774e2
    • K
      mm: introduce free_pages_prepare() · ec95f53a
      KOSAKI Motohiro 提交于
      free_hot_cold_page() and __free_pages_ok() have very similar freeing
      preparation.  Consolidate them.
      
      [akpm@linux-foundation.org: fix busted coding style]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec95f53a
    • M
      mm: compaction: defer compaction using an exponential backoff when compaction fails · 4f92e258
      Mel Gorman 提交于
      The fragmentation index may indicate that a failure is due to external
      fragmentation but after a compaction run completes, it is still possible
      for an allocation to fail.  There are two obvious reasons as to why
      
        o Page migration cannot move all pages so fragmentation remains
        o A suitable page may exist but watermarks are not met
      
      In the event of compaction followed by an allocation failure, this patch
      defers further compaction in the zone (1 << compact_defer_shift) times.
      If the next compaction attempt also fails, compact_defer_shift is
      increased up to a maximum of 6.  If compaction succeeds, the defer
      counters are reset again.
      
      The zone that is deferred is the first zone in the zonelist - i.e.  the
      preferred zone.  To defer compaction in the other zones, the information
      would need to be stored in the zonelist or implemented similar to the
      zonelist_cache.  This would impact the fast-paths and is not justified at
      this time.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f92e258
    • M
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman 提交于
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined if an allocation failed due to external fragmentation instead
      of low memory and if so, the calling process will compact until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  As each block is
      compacted, it is checked if a suitable page has been freed and if so, it
      returns.
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • M
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman 提交于
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
    • D
      mm: default to node zonelist ordering when nodes have only lowmem · e325c90f
      David Rientjes 提交于
      There are two types of zonelist ordering methodologies:
      
       - node order, preferring allocations on a node to stay local to and
      
       - zone order, preferring allocations come from a higher zone to avoid
         allocating in lowmem zones even though they may not be local.
      
      The ordering technique used by the kernel is configurable on the command
      line, but also has some logic to determine what the default should be.
      
      This logic currently lacks knowledge of systems where a node may only have
      lowmem.  For such systems, it is necessary to use node order so that
      GFP_KERNEL allocations may be satisfied by nodes consisting of only
      lowmem.
      
      If zone order is used, GFP_KERNEL allocations to such nodes are actually
      allocated on a node with local affinity that includes ZONE_NORMAL.
      
      This change defaults to node zonelist ordering if any node lacks
      ZONE_NORMAL.
      
      To force zone order, append 'numa_zonelist_order=zone' to the kernel
      command line.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e325c90f
    • M
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie 提交于
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But in the way, the allocator may find that
      there is no node to alloc memory.
      
      The reason is that cpuset rebinds the task's mempolicy, it cleans the
      nodes which the allocater can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first(set newly
      allowed bits) and shrink it lazily(clear newly disallowed bits).  So we
      use a variable to tell the write-side task that read-side task is reading
      nodemask, and the write-side task clears newly disallowed nodes after
      read-side task ends the current memory allocation.
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • C
      page allocator: reduce fragmentation in buddy allocator by adding buddies that... · 6dda9d55
      Corrado Zoccolo 提交于
      page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists
      
      In order to reduce fragmentation, this patch classifies freed pages in two
      groups according to their probability of being part of a high order merge.
       Pages belonging to a compound whose next-highest buddy is free are more
      likely to be part of a high order merge in the near future, so they will
      be added at the tail of the freelist.  The remaining pages are put at the
      front of the freelist.
      
      In this way, the pages that are more likely to cause a big merge are kept
      free longer.  Consequently there is a tendency to aggregate the
      long-living allocations on a subset of the compounds, reducing the
      fragmentation.
      
      This heuristic was tested on three machines, x86, x86-64 and ppc64 with
      3GB of RAM in each machine.  The tests were kernbench, netperf, sysbench
      and STREAM for performance and a high-order stress test for huge page
      allocations.
      
      KernBench X86
      Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
      User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
      System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
      CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
      
      KernBench X86-64
      Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
      User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
      System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
      CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
      
      KernBench PPC64
      Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
      User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
      System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
      CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
      
      Nothing notable for kernbench.
      
      NetPerf UDP X86
            64    42.68 ( 0.00%)     42.77 ( 0.21%)
           128    85.62 ( 0.00%)     85.32 (-0.35%)
           256   170.01 ( 0.00%)    168.76 (-0.74%)
          1024   655.68 ( 0.00%)    652.33 (-0.51%)
          2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
          3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
          4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
          8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
         16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
      
      NetPerf UDP X86-64
            64   148.82 ( 0.00%)    154.92 ( 3.94%)
           128   298.96 ( 0.00%)    312.95 ( 4.47%)
           256   583.67 ( 0.00%)    626.39 ( 6.82%)
          1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
          2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
          3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
          4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
          8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
         16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
                   1.64%             2.73%
      
      NetPerf UDP PPC64
            64    49.98 ( 0.00%)     50.25 ( 0.54%)
           128    98.66 ( 0.00%)    100.95 ( 2.27%)
           256   197.33 ( 0.00%)    191.03 (-3.30%)
          1024   761.98 ( 0.00%)    785.07 ( 2.94%)
          2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
          3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
          4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
          8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
         16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
      
      The tests are run to have confidence limits within 1%.  Results marked
      with a * were not confident although in this case, it's only outside by
      small amounts.  Even with some results that were not confident, the
      netperf UDP results were generally positive.
      
      NetPerf TCP X86
            64   652.25 ( 0.00%)*   648.12 (-0.64%)*
                  23.80%            22.82%
           128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
                  21.03%            18.90%
           256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
                   1.00%            16.46%
          1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
                  13.37%            11.39%
          2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
                   9.76%            12.48%
          3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
                   6.49%             8.75%
          4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
                   9.85%             8.50%
          8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
                   9.13%            13.04%
         16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
                   7.73%             8.68%
      
      NETPERF TCP X86-64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
                   5.79%             4.78%
           128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
                   3.68%             6.06%
           256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
                   3.02%             2.10%
          1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
                   5.89%             6.55%
          2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
                   7.08%             7.44%
          3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
                   6.87%             7.33%
          4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
                   6.86%             8.18%
          8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
                   7.49%             5.55%
         16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
                   7.36%             6.49%
      
      NETPERF TCP PPC64
                  netperf-tcp-vanilla-netperf       netperf-tcp
                         tcp-vanilla     pgalloc-delay
            64   594.17 ( 0.00%)    596.04 ( 0.31%)*
                   1.00%             2.29%
           128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
                   1.30%             1.40%
           256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
                   1.25%             1.00%
          1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
                   1.02%             1.00%
          2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
                   1.15%             1.04%
          3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
          4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
          8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
         16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
      
      There was a distinct lack of confidence in the X86* figures so I included
      what the devation was where the results were not confident.  Many of the
      results, whether gains or losses were within the standard deviation so no
      solid conclusion can be reached on performance impact.  Looking at the
      figures, only the X86-64 ones look suspicious with a few losses that were
      outside the noise.  However, the results were so unstable that without
      knowing why they vary so much, a solid conclusion cannot be reached.
      
      SYSBENCH X86
                    sysbench-vanilla     pgalloc-delay
                 1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
                 2 14901.11 ( 0.00%) 13683.44 (-8.90%)
                 3 15171.71 ( 0.00%) 14888.25 (-1.90%)
                 4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
                 5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
                 6 14870.33 ( 0.00%) 14845.57 (-0.17%)
                 7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
                 8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
      
      SYSBENCH X86-64
                 1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
                 2 34276.39 ( 0.00%) 34251.00 (-0.07%)
                 3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
                 4 66667.10 ( 0.00%) 66174.69 (-0.74%)
                 5 66003.91 ( 0.00%) 65685.25 (-0.49%)
                 6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
                 7 64933.16 ( 0.00%) 64379.23 (-0.86%)
                 8 63353.30 ( 0.00%) 63281.22 (-0.11%)
                 9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
                10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
                11 62092.81 ( 0.00%) 61787.75 (-0.49%)
                12 61330.11 ( 0.00%) 61036.34 (-0.48%)
                13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
                14 62304.48 ( 0.00%) 62064.90 (-0.39%)
                15 63296.48 ( 0.00%) 62875.16 (-0.67%)
                16 63951.76 ( 0.00%) 63769.09 (-0.29%)
      
      SYSBENCH PPC64
                                   -sysbench-pgalloc-delay-sysbench
                    sysbench-vanilla     pgalloc-delay
                 1  7645.08 ( 0.00%)  7467.43 (-2.38%)
                 2 14856.67 ( 0.00%) 14558.73 (-2.05%)
                 3 21952.31 ( 0.00%) 21683.64 (-1.24%)
                 4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
                 5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
                 6 27477.10 ( 0.00%) 27337.45 (-0.51%)
                 7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
                 8 26642.91 ( 0.00%) 25274.33 (-5.41%)
                 9 25137.27 ( 0.00%) 24810.06 (-1.32%)
                10 24451.99 ( 0.00%) 24275.85 (-0.73%)
                11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
                12 24234.81 ( 0.00%) 23640.89 (-2.51%)
                13 24577.75 ( 0.00%) 24433.50 (-0.59%)
                14 25640.19 ( 0.00%) 25116.52 (-2.08%)
                15 26188.84 ( 0.00%) 26181.36 (-0.03%)
                16 26782.37 ( 0.00%) 26255.99 (-2.00%)
      
      Again, there is little to conclude here.  While there are a few losses,
      the results vary by +/- 8% in some cases.  They are the results of most
      concern as there are some large losses but it's also within the variance
      typically seen between kernel releases.
      
      The STREAM results varied so little and are so verbose that I didn't
      include them here.
      
      The final test stressed how many huge pages can be allocated.  The
      absolute number of huge pages allocated are the same with or without the
      page.  However, the "unusability free space index" which is a measure of
      external fragmentation was slightly lower (lower is better) throughout the
      lifetime of the system.  I also measured the latency of how long it took
      to successfully allocate a huge page.  The latency was slightly lower and
      on X86 and PPC64, more huge pages were allocated almost immediately from
      the free lists.  The improvement is slight but there.
      
      [mel@csn.ul.ie: Tested, reworked for less branches]
      [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dda9d55
  14. 16 3月, 2010 1 次提交