1. 26 January 2011, 2 commits
    • mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator · 2ff754fa
      Committed by David Rientjes
      Commit 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone") uncovered a livelock in the page
      allocator that resulted in tasks infinitely looping trying to find
      memory and kswapd running at 100% cpu.
      
      The issue occurs because drain_all_pages() is called immediately
      following direct reclaim when no memory is freed and try_to_free_pages()
      returns non-zero because all zones in the zonelist do not have their
      all_unreclaimable flag set.
      
      When draining the per-cpu pagesets back to the buddy allocator for each
      zone, the zone->pages_scanned counter is cleared to avoid erroneously
      setting zone->all_unreclaimable later.  The problem is that no pages may
      actually be drained and, thus, the unreclaimable logic never fails
      direct reclaim so the oom killer may be invoked.
      
      This apparently only manifested after wait_iff_congested() was
      introduced and the zone was full of anonymous memory that would not
      congest the backing store.  If no other tasks were waiting to be
      scheduled, the page allocator would loop indefinitely, clearing
      zone->pages_scanned via drain_all_pages() (as a result of this change)
      before kswapd could scan enough pages to trigger the reclaim logic.
      Additionally, with every loop of the page allocator and in the reclaim
      path, kswapd would be kicked and would end up running at 100% cpu.  In
      this scenario, both current and kswapd run continuously, with kswapd
      incrementing zone->pages_scanned and current clearing it.
      
      The problem is even more pronounced when current swaps some of its
      memory to swap cache and the reclaim logic then considers all active
      anonymous memory in the all_unreclaimable logic, which requires a much
      higher zone->pages_scanned value for try_to_free_pages() to return zero,
      a level that is never attainable in this scenario.
      
      Before wait_iff_congested(), the page allocator would incur an
      unconditional timeout, allowing kswapd to elevate zone->pages_scanned to
      a level at which the oom killer would be invoked on the next loop.
      
      The fix is to only attempt to drain pcp pages if there is actually a
      quantity to be drained.  The unconditional clearing of
      zone->pages_scanned in free_pcppages_bulk() need not be changed since
      other callers already ensure that draining will occur.  This patch
      ensures that free_pcppages_bulk() will actually free memory before
      calling into it from drain_all_pages() so zone->pages_scanned is only
      cleared if appropriate.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ff754fa
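
      A minimal standalone sketch of the fix described above: pages are only handed
      back to the buddy allocator (which is where pages_scanned gets reset) when the
      per-cpu pageset actually holds pages.  The types here are simplified stand-ins
      for illustration, not the real kernel structures.

        #include <stdio.h>

        struct per_cpu_pages { int count; };          /* simplified stand-in */
        struct zone { unsigned long pages_scanned; }; /* simplified stand-in */

        /* In the kernel this frees pcp->count pages back to the buddy lists and
         * clears zone->pages_scanned as a side effect. */
        static void free_pcppages_bulk(struct zone *zone, int count,
                                       struct per_cpu_pages *pcp)
        {
                zone->pages_scanned = 0;   /* the reset that must not happen spuriously */
                pcp->count -= count;
        }

        static void drain_pcp(struct zone *zone, struct per_cpu_pages *pcp)
        {
                /* The fix: skip the call entirely when there is nothing to drain,
                 * so pages_scanned is only cleared if pages really move to buddy. */
                if (pcp->count) {
                        free_pcppages_bulk(zone, pcp->count, pcp);
                        pcp->count = 0;
                }
        }

        int main(void)
        {
                struct zone z = { .pages_scanned = 1234 };
                struct per_cpu_pages pcp = { .count = 0 };

                drain_pcp(&z, &pcp);   /* empty pcp: pages_scanned must survive */
                printf("pages_scanned after empty drain: %lu\n", z.pages_scanned);
                return 0;
        }
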
    • mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      Committed by David Rientjes
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate given the
      requested type, and to decide whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f33261d7
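
      A hedged sketch of the slowpath part of this change: when the allocation is
      not constrained by the caller's cpuset (no ALLOC_CPUSET) and no explicit
      nodemask was passed, preferred_zone is recomputed from the whole zonelist
      instead of keeping a zone the current context may not be allowed to use.
      This is an illustrative, non-compilable fragment following the shape of
      first_zones_zonelist(); treat the exact placement and condition as an
      approximation of the real hunk.

        /* __alloc_pages_slowpath(), illustrative fragment */
        if (!(alloc_flags & ALLOC_CPUSET) && !nodemask) {
                /*
                 * The caller is not constrained by its cpuset here, so pick the
                 * preferred zone from the full zonelist; otherwise a disallowed
                 * preferred_zone can defer wait_iff_congested() indefinitely.
                 */
                first_zones_zonelist(zonelist, high_zoneidx, NULL, &preferred_zone);
        }
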
  2. 14 January 2011, 13 commits
    • mm/page_alloc.c: don't cache `current' in a local · c06b1fca
      Committed by Andrew Morton
      It's old-fashioned and unneeded.
      
      akpm:/usr/src/25> size mm/page_alloc.o
         text    data     bss     dec     hex filename
        39884 1241317   18808 1300009  13d629 mm/page_alloc.o (before)
        39838 1241317   18808 1299963  13d5fb mm/page_alloc.o (after)
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c06b1fca
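
      The change is mechanical; a hedged before/after fragment (the exact call
      sites in gfp_to_alloc_flags() and friends may differ from this illustration):

        /* before: the task pointer was cached in a local */
        struct task_struct *p = current;
        if (!in_interrupt() && (p->flags & PF_MEMALLOC))
                alloc_flags |= ALLOC_NO_WATERMARKS;

        /* after: 'current' is already a cheap per-cpu/register lookup, so the
         * local buys nothing and only costs a few bytes of text. */
        if (!in_interrupt() && (current->flags & PF_MEMALLOC))
                alloc_flags |= ALLOC_NO_WATERMARKS;
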
    • mm/page_alloc.c: simplify calculation of combined index of adjacent buddy lists · 43506fad
      Committed by KyongHo Cho
      The previous approach to calculating the combined index was
      
      	page_idx & ~(1 << order)
      
      but the same result is obtained with
      
      	page_idx & buddy_idx
      
      This reduces instructions slightly as well as enhancing readability.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix used-uninitialised warning]
      Signed-off-by: KyongHo Cho <pullip.cho@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      43506fad
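
      A small standalone check of the identity the patch relies on.  In the buddy
      allocator, buddy_idx = page_idx ^ (1 << order), so masking page_idx with
      buddy_idx clears exactly the bit that ~(1 << order) would clear; the program
      below verifies this exhaustively for small inputs (the kernel additionally
      keeps page_idx aligned, but the identity holds regardless).

        #include <assert.h>
        #include <stdio.h>

        int main(void)
        {
                for (unsigned int order = 0; order < 11; order++) {
                        for (unsigned long page_idx = 0; page_idx < 4096; page_idx++) {
                                unsigned long buddy_idx    = page_idx ^ (1UL << order);
                                unsigned long combined_old = page_idx & ~(1UL << order);
                                unsigned long combined_new = page_idx & buddy_idx;

                                assert(combined_old == combined_new);
                        }
                }
                printf("combined index identity holds\n");
                return 0;
        }
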
    • thp: remove PG_buddy · 5f24ce5f
      Committed by Andrea Arcangeli
      PG_buddy can be converted to _mapcount == -2, so PG_compound_lock can be
      added to page->flags without overflowing (because of the sparse section
      bits increasing) with CONFIG_X86_PAE=y and CONFIG_X86_PAT=y.  This also
      requires moving the memory hotplug code from _mapcount to lru.next to
      avoid any risk of clashes: lru.next cannot be used for the PG_buddy
      removal, but memory hotplug can use lru.next even more easily than it
      used the mapcount.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5f24ce5f
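
      A hedged sketch of the encoding: instead of a dedicated PG_buddy page flag,
      a free page sitting in the buddy allocator is marked by setting
      page->_mapcount to a sentinel (-2 here; -1 already means "unmapped", so no
      real mapcount ever takes that value).  The snippet models the idea with a
      plain int; the kernel uses atomic_set()/atomic_read() on page->_mapcount,
      and the macro name/value are assumptions for illustration.

        #include <assert.h>
        #include <stdio.h>

        #define PAGE_BUDDY_MAPCOUNT_VALUE (-2)  /* assumed sentinel name/value */

        struct page { int _mapcount; };  /* -1 == unmapped, >= 0 == mapped */

        static int PageBuddy(const struct page *page)
        {
                return page->_mapcount == PAGE_BUDDY_MAPCOUNT_VALUE;
        }

        static void __SetPageBuddy(struct page *page)
        {
                assert(page->_mapcount == -1);  /* only unmapped pages enter the buddy */
                page->_mapcount = PAGE_BUDDY_MAPCOUNT_VALUE;
        }

        static void __ClearPageBuddy(struct page *page)
        {
                assert(PageBuddy(page));
                page->_mapcount = -1;
        }

        int main(void)
        {
                struct page p = { ._mapcount = -1 };

                __SetPageBuddy(&p);
                printf("buddy? %d\n", PageBuddy(&p));
                __ClearPageBuddy(&p);
                printf("buddy? %d\n", PageBuddy(&p));
                return 0;
        }
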
    • thp: don't alloc harder for gfp nomemalloc even if nowait · 5c3240d9
      Committed by Andrea Arcangeli
      It is not worth throwing away the precious reserved free memory pool for
      allocations that can fail gracefully (either through a mempool or because
      they are transhuge allocations that later fall back to 4k allocations).
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5c3240d9
    • thp: _GFP_NO_KSWAPD · 32dba98e
      Committed by Andrea Arcangeli
      Transparent hugepage allocations must be allowed not to invoke kswapd or
      any other kind of indirect reclaim (especially when the defrag sysfs
      control is disabled).  It is unacceptable to swap out anonymous pages
      (potentially anonymous transparent hugepages) in order to create new
      transparent hugepages.  This is true for the MADV_HUGEPAGE areas too: it
      makes no sense to swap out a kvm virtual machine, making it suffer an
      unbearable slowdown, just so another one with guest physical memory
      marked MADV_HUGEPAGE can run 30% faster on memory-intensive workloads.
      If a transparent hugepage allocation fails, the slowdown is minor and
      there is a total fallback, so kswapd should never be asked to swap out
      memory to allow the high-order allocation to succeed.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      32dba98e
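
      A standalone model of how such a flag is honoured in the allocator slowpath:
      the kswapd wakeup is simply skipped when the caller passed __GFP_NO_KSWAPD.
      The bit value and the bare wake_all_kswapd() call are illustrative
      stand-ins, not the kernel definitions.

        #include <stdio.h>

        typedef unsigned int gfp_t;
        #define __GFP_NO_KSWAPD ((gfp_t)0x400000u)   /* illustrative bit value */

        static void wake_all_kswapd(void) { puts("kswapd woken"); }

        static void slowpath_enter(gfp_t gfp_mask)
        {
                /* THP allocations pass __GFP_NO_KSWAPD so that a failed huge-page
                 * allocation never leaves kswapd swapping on their behalf. */
                if (!(gfp_mask & __GFP_NO_KSWAPD))
                        wake_all_kswapd();
                else
                        puts("kswapd left alone; caller falls back to 4k pages");
        }

        int main(void)
        {
                slowpath_enter(0);                  /* ordinary allocation */
                slowpath_enter(__GFP_NO_KSWAPD);    /* transparent hugepage attempt */
                return 0;
        }
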
    • thp: comment reminder in destroy_compound_page · 59ff4216
      Committed by Andrea Arcangeli
      Warn destroy_compound_page that __split_huge_page_refcount is heavily
      dependent on its internal behavior.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      59ff4216
    • thp: clear compound mapping · 8dd60a3a
      Committed by Andrea Arcangeli
      Clear the compound mapping for anonymous compound pages, as already
      happens for regular anonymous pages.  But crash if a mapping is set for
      any tail page; the PageAnon check is also meaningless for tail pages.
      That check only makes sense for the head page; for a tail page it can
      only hide bugs, and we definitely don't want to hide bugs.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8dd60a3a
    • thp: fix bad_page to show the real reason the page is bad · 4e9f64c4
      Committed by Andrea Arcangeli
      page_count shows the count of the head page, but the actual check is done
      on the tail page, so show what is really being checked.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e9f64c4
    • mm: set correct numa_zonelist_order string when configured on the kernel command line · ecb256f8
      Committed by Volodymyr G. Lukiianyk
      When the numa_zonelist_order parameter is set to "node" or "zone" on the
      command line, it still shows up as "default" in sysctl.  That's because
      the early_param parsing function changes only the user_zonelist_order
      variable.  Fix this by copying the user-provided string to
      numa_zonelist_order if it was successfully parsed.
      Signed-off-by: Volodymyr G Lukiianyk <volodymyrgl@gmail.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ecb256f8
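
      A standalone model of the fix: the early_param handler parses the string
      and, on success, also copies it into the buffer that sysctl later reports.
      The helper names mirror the kernel's setup_numa_zonelist_order() path, but
      the parsing is simplified and a portable strncpy() stands in for the
      kernel's string copy.

        #include <stdio.h>
        #include <string.h>

        #define NUMA_ZONELIST_ORDER_LEN 16
        static char numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN] = "default";
        static int user_zonelist_order;   /* 0 = default, 1 = node, 2 = zone */

        static int parse_numa_zonelist_order(const char *s)
        {
                if (*s == 'd' || *s == 'D')      user_zonelist_order = 0;
                else if (*s == 'n' || *s == 'N') user_zonelist_order = 1;
                else if (*s == 'z' || *s == 'Z') user_zonelist_order = 2;
                else                             return -1;
                return 0;
        }

        /* Models the early_param("numa_zonelist_order", ...) handler. */
        static int setup_numa_zonelist_order(const char *s)
        {
                int ret = parse_numa_zonelist_order(s);

                /* The fix: also update the string reported via
                 * /proc/sys/vm/numa_zonelist_order, not just the enum. */
                if (ret == 0) {
                        strncpy(numa_zonelist_order, s, NUMA_ZONELIST_ORDER_LEN - 1);
                        numa_zonelist_order[NUMA_ZONELIST_ORDER_LEN - 1] = '\0';
                }
                return ret;
        }

        int main(void)
        {
                setup_numa_zonelist_order("zone");
                printf("sysctl now reports: %s (order=%d)\n",
                       numa_zonelist_order, user_zonelist_order);
                return 0;
        }
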
    • mm: kswapd: stop high-order balancing when any suitable zone is balanced · 99504748
      Committed by Mel Gorman
      Simon Kirby reported the following problem
      
         We're seeing cases on a number of servers where cache never fully
         grows to use all available memory.  Sometimes we see servers with 4 GB
         of memory that never seem to have less than 1.5 GB free, even with a
         constantly-active VM.  In some cases, these servers also swap out while
         this happens, even though they are constantly reading the working set
         into memory.  We have been seeing this happening for a long time; I
         don't think it's anything recent, and it still happens on 2.6.36.
      
      After some debugging work by Simon, Dave Hansen and others, the prevailing
      theory became that kswapd was reclaiming order-3 pages requested by SLUB
      and being too aggressive about it.
      
      There are two apparent problems here.  On the target machine, there is a
      small Normal zone in comparison to DMA32.  As kswapd tries to balance all
      zones, it would continually try reclaiming for Normal even though DMA32
      was balanced enough for callers.  The second problem is that
      sleeping_prematurely() does not use the same logic as balance_pgdat() when
      deciding whether to sleep or not.  This keeps kswapd artificially awake.
      
      A number of tests were run and the figures from previous postings will
      look very different for a few reasons.  One, the old figures were forcing
      my network card to use GFP_ATOMIC in an attempt to replicate Simon's
      problem.  Second, I previously specified slub_min_order=3, again in an
      attempt to reproduce Simon's problem.  In this posting, I'm depending on
      Simon to say
      whether his problem is fixed or not and these figures are to show the
      impact to the ordinary cases.  Finally, the "vmscan" figures are taken
      from /proc/vmstat instead of the tracepoints.  There is less information
      but recording is less disruptive.
      
      The first test of relevance was postmark with a process running in the
      background reading a large amount of anonymous memory in blocks.  The
      objective was to vaguely simulate what was happening on Simon's machine
      and it's memory intensive enough to have kswapd awake.
      
      POSTMARK
                                                  traceonly          kanyzone
      Transactions per second:              156.00 ( 0.00%)   153.00 (-1.96%)
      Data megabytes read per second:        21.51 ( 0.00%)    21.52 ( 0.05%)
      Data megabytes written per second:     29.28 ( 0.00%)    29.11 (-0.58%)
      Files created alone per second:       250.00 ( 0.00%)   416.00 (39.90%)
      Files create/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      Files deleted alone per second:       520.00 ( 0.00%)   420.00 (-23.81%)
      Files delete/transact per second:      79.00 ( 0.00%)    76.00 (-3.95%)
      
      MMTests Statistics: duration
      User/Sys Time Running Test (seconds)         16.58      17.4
      Total Elapsed Time (seconds)                218.48    222.47
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                                  0          4
      Direct reclaim pages scanned                     0        203
      Direct reclaim pages reclaimed                   0        184
      Kswapd pages scanned                        326631     322018
      Kswapd pages reclaimed                      312632     309784
      Kswapd low wmark quickly                         1          4
      Kswapd high wmark quickly                      122        475
      Kswapd skip congestion_wait                      1          0
      Pages activated                             700040     705317
      Pages deactivated                           212113     203922
      Pages written                                 9875       6363
      
      Total pages scanned                         326631    322221
      Total pages reclaimed                       312632    309968
      %age total pages scanned/reclaimed          95.71%    96.20%
      %age total pages scanned/written             3.02%     1.97%
      
      proc vmstat: Faults
      Major Faults                                   300       254
      Minor Faults                                645183    660284
      Page ins                                    493588    486704
      Page outs                                  4960088   4986704
      Swap ins                                      1230       661
      Swap outs                                     9869      6355
      
      Performance is mildly affected because kswapd is no longer doing as much
      work and the background memory consumer process is getting in the way.
      Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
      and overall fewer pages were scanned and reclaimed.  Swap in/out is
      particularly reduced again reflecting kswapd throwing out fewer pages.
      
      The slight performance impact is unfortunate here but it looks like a
      direct result of kswapd being less aggressive.  As the bug report is about
      too many pages being freed by kswapd, it may have to be accepted for now.
      
      The second test is a streaming IO benchmark that was previously used by
      Johannes to show regressions in page reclaim.
      
      MICRO
      					 traceonly  kanyzone
      User/Sys Time Running Test (seconds)         29.29     28.87
      Total Elapsed Time (seconds)                492.18    488.79
      
      VMstat Reclaim Statistics: vmscan
      Direct reclaims                               2128       1460
      Direct reclaim pages scanned               2284822    1496067
      Direct reclaim pages reclaimed              148919     110937
      Kswapd pages scanned                      15450014   16202876
      Kswapd pages reclaimed                     8503697    8537897
      Kswapd low wmark quickly                      3100       3397
      Kswapd high wmark quickly                     1860       7243
      Kswapd skip congestion_wait                    708        801
      Pages activated                               9635       9573
      Pages deactivated                             1432       1271
      Pages written                                  223       1130
      
      Total pages scanned                       17734836  17698943
      Total pages reclaimed                      8652616   8648834
      %age total pages scanned/reclaimed          48.79%    48.87%
      %age total pages scanned/written             0.00%     0.01%
      
      proc vmstat: Faults
      Major Faults                                   165       221
      Minor Faults                               9655785   9656506
      Page ins                                      3880      7228
      Page outs                                 37692940  37480076
      Swap ins                                         0        69
      Swap outs                                       19        15
      
      Again fewer pages are scanned and reclaimed as expected and this time the
      test completed faster.  Note that kswapd is hitting its watermarks faster
      (low and high wmark quickly) which I expect is due to kswapd reclaiming
      fewer pages.
      
      I also ran fs-mark, iozone and sysbench but there is nothing interesting
      to report in the figures.  Performance is not significantly changed and
      the reclaim statistics look reasonable.
      
      This patch:
      
      When the allocator enters its slow path, kswapd is woken up to balance the
      node.  It continues working until all zones within the node are balanced.
      For order-0 allocations, this makes perfect sense but for higher orders it
      can have unintended side-effects.  If the zone sizes are imbalanced,
      kswapd may reclaim heavily within a smaller zone discarding an excessive
      number of pages.  The user-visible behaviour is that kswapd is awake and
      reclaiming even though plenty of pages are free from a suitable zone.
      
      This patch alters the "balance" logic for high-order reclaim allowing
      kswapd to stop if any suitable zone becomes balanced to reduce the number
      of pages it reclaims from other zones.  kswapd still tries to ensure that
      order-0 watermarks for all zones are met before sleeping.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Eric B Munson <emunson@mgebm.net>
      Cc: Simon Kirby <sim@hostway.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      99504748
    • mm: migration: allow migration to operate asynchronously and avoid synchronous compaction in the faster path · 77f1fe6b
      Committed by Mel Gorman
      Migration synchronously waits for writeback if the initial passes fail.
      Callers of memory compaction do not necessarily want this behaviour if the
      caller is latency sensitive or expects that synchronous migration is not
      going to have a significantly better success rate.
      
      This patch adds a sync parameter to migrate_pages() allowing the caller to
      indicate if wait_on_page_writeback() is allowed within migration or not.
      For reclaim/compaction, try_to_compact_pages() is first called
      asynchronously, direct reclaim runs and then try_to_compact_pages() is
      called synchronously as there is a greater expectation that it'll succeed.
      
      [akpm@linux-foundation.org: build/merge fix]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77f1fe6b
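
      A hedged, non-compilable fragment showing the shape of the interface change:
      migrate_pages() and the per-page helper grow a 'sync' flag, and only callers
      that asked for synchronous migration are allowed to block on page writeback.
      Parameter order and the failure path are approximations, not the literal diff.

        int migrate_pages(struct list_head *from, new_page_t get_new_page,
                          unsigned long private, bool offlining, bool sync);

        /* inside the per-page migration helper */
        if (PageWriteback(page)) {
                if (!sync)
                        goto skip;              /* async caller: fail this page */
                wait_on_page_writeback(page);   /* sync caller may sleep here */
        }
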
    • mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim · 3e7d3449
      Committed by Mel Gorman
      Lumpy reclaim is disruptive.  It reclaims a large number of pages and
      ignores the age of the pages it reclaims.  This can incur significant
      stalls and potentially increase the number of major faults.
      
      Compaction has reached the point where it is considered reasonably stable
      (meaning it has passed a lot of testing) and is a potential candidate for
      displacing lumpy reclaim.  This patch introduces an alternative to lumpy
      reclaim, available when compaction is enabled, called reclaim/compaction.
      The basic operation is very simple - instead of selecting a contiguous
      range of pages to reclaim, a number of order-0 pages are reclaimed and
      compaction is then run later, either by kswapd (compact_zone_order()) or
      by direct compaction (__alloc_pages_direct_compact()).
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: use conventional task_struct naming]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e7d3449
    • mm: page allocator: adjust the per-cpu counter threshold when memory is low · 88f5acf8
      Committed by Mel Gorman
      Commit aa454840 ("calculate a better estimate of NR_FREE_PAGES when memory
      is low") noted that watermarks were based on the vmstat NR_FREE_PAGES.  To
      avoid synchronization overhead, these counters are maintained on a per-cpu
      basis and drained both periodically and when the per-cpu delta rises above
      a threshold.  On large CPU systems, the difference between the estimate and
      real value of NR_FREE_PAGES can be very high.  The system can get into a
      case where pages are allocated far below the min watermark potentially
      causing livelock issues.  The commit solved the problem by taking a better
      reading of NR_FREE_PAGES when memory was low.
      
      Unfortunately, as reported by Shaohua Li, this accurate reading can consume a
      large amount of CPU time on systems with many sockets due to cache line
      bouncing.  This patch takes a different approach.  For large machines
      where counter drift might be unsafe and while kswapd is awake, the per-cpu
      thresholds for the target pgdat are reduced to limit the level of drift to
      what should be a safe level.  This incurs a performance penalty in heavy
      memory pressure by a factor that depends on the workload and the machine
      but the machine should function correctly without accidentally exhausting
      all memory on a node.  There is an additional cost when kswapd wakes and
      sleeps but the event is not expected to be frequent - in Shaohua's test
      case, there was one recorded sleep and wake event at least.
      
      To ensure that kswapd wakes up, a safe version of zone_watermark_ok() is
      introduced that takes a more accurate reading of NR_FREE_PAGES when called
      from wakeup_kswapd, when deciding whether it is really safe to go back to
      sleep in sleeping_prematurely() and when deciding if a zone is really
      balanced or not in balance_pgdat().  We are still using an expensive
      function but limiting how often it is called.
      
      When the test case is reproduced, the time spent in the watermark
      functions is reduced.  The following report is on the percentage of time
      spent cumulatively spent in the functions zone_nr_free_pages(),
      zone_watermark_ok(), __zone_watermark_ok(), zone_watermark_ok_safe(),
      zone_page_state_snapshot(), zone_page_state().
      
      vanilla                      11.6615%
      disable-threshold            0.2584%
      
      David said:
      
      : We had to pull aa454840 "mm: page allocator: calculate a better estimate
      : of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
      : internally because tests showed that it would cause the machine to stall
      : as the result of heavy kswapd activity.  I merged it back with this fix as
      : it is pending in the -mm tree and it solves the issue we were seeing, so I
      : definitely think this should be pushed to -stable (and I would seriously
      : consider it for 2.6.37 inclusion even at this late date).
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reported-by: Shaohua Li <shaohua.li@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Tested-by: Nicolas Bareil <nico@chdir.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: <stable@kernel.org>		[2.6.37.1, 2.6.36.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88f5acf8
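
      A standalone model of the drift problem and of the "safe" read this patch
      introduces.  The real helpers are zone_page_state() (cheap, may drift) and
      zone_page_state_snapshot() (folds the per-cpu deltas back in); the structure
      below is a simplified stand-in, not the kernel's.

        #include <stdio.h>

        #define NR_CPUS 4

        struct zone_model {
                long vm_stat_free;        /* global NR_FREE_PAGES estimate */
                long pcp_diff[NR_CPUS];   /* per-cpu deltas not yet folded in */
        };

        /* Cheap read: just the global counter, which can lag badly. */
        static long zone_page_state(const struct zone_model *z)
        {
                return z->vm_stat_free;
        }

        /* Accurate read: visit every cpu's delta; expensive, so only used when
         * kswapd wakes, decides whether to sleep, or checks zone balance. */
        static long zone_page_state_snapshot(const struct zone_model *z)
        {
                long x = z->vm_stat_free;

                for (int cpu = 0; cpu < NR_CPUS; cpu++)
                        x += z->pcp_diff[cpu];
                return x < 0 ? 0 : x;
        }

        int main(void)
        {
                struct zone_model z = {
                        .vm_stat_free = 1000,
                        .pcp_diff = { -300, -280, -310, -90 },
                };

                printf("estimate: %ld, snapshot: %ld\n",
                       zone_page_state(&z), zone_page_state_snapshot(&z));
                return 0;
        }
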
  3. 07 December 2010, 1 commit
    • PM / Hibernate: Fix memory corruption related to swap · c9e664f1
      Committed by Rafael J. Wysocki
      There is a problem that swap pages allocated before the creation of
      a hibernation image can be released and used for storing the contents
      of different memory pages while the image is being saved.  Since the
      kernel stored in the image doesn't know of that, it causes memory
      corruption to occur after resume from hibernation, especially on
      systems with relatively small RAM that need to swap often.
      
      This issue can be addressed by keeping the GFP_IOFS bits clear
      in gfp_allowed_mask during the entire hibernation, including the
      saving of the image, until the system is finally turned off or
      the hibernation is aborted.  Unfortunately, for this purpose
      it's necessary to rework the way in which the hibernate and
      suspend code manipulates gfp_allowed_mask.
      
      This change is based on an earlier patch from Hugh Dickins.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Ondrej Zary <linux@rainbow-software.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      c9e664f1
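
      A standalone model of the gfp_allowed_mask handling: before the image is
      created, __GFP_IO and __GFP_FS are masked out and stay masked until
      hibernation finishes or is aborted.  The helper names follow the
      restrict/restore pair this change describes, but the snippet is a simplified
      userspace model (no pm_mutex, no WARN_ONs).

        #include <stdio.h>

        typedef unsigned int gfp_t;
        #define __GFP_IO  0x40u
        #define __GFP_FS  0x80u
        #define GFP_IOFS  (__GFP_IO | __GFP_FS)

        static gfp_t gfp_allowed_mask = (gfp_t)~0u;  /* everything allowed at boot */
        static gfp_t saved_gfp_mask;

        /* Called before creating the hibernation image; stays in force until the
         * system powers off or hibernation is aborted. */
        static void pm_restrict_gfp_mask(void)
        {
                saved_gfp_mask = gfp_allowed_mask;
                gfp_allowed_mask &= ~(gfp_t)GFP_IOFS;
        }

        static void pm_restore_gfp_mask(void)
        {
                if (saved_gfp_mask) {
                        gfp_allowed_mask = saved_gfp_mask;
                        saved_gfp_mask = 0;
                }
        }

        int main(void)
        {
                pm_restrict_gfp_mask();
                printf("IO/FS allowed during image write? %s\n",
                       (gfp_allowed_mask & GFP_IOFS) ? "yes" : "no");
                pm_restore_gfp_mask();
                printf("IO/FS allowed after restore?      %s\n",
                       (gfp_allowed_mask & GFP_IOFS) ? "yes" : "no");
                return 0;
        }
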
  4. 29 November 2010, 1 commit
    • Kill off a bunch of warning: ‘inline’ is not at beginning of declaration · fa9f90be
      Committed by Jesper Juhl
      These warnings are spewed during a build of an 'allnoconfig' kernel
      (especially the ones from u64_stats_sync.h show up a lot) when building
      with -Wextra (which I often do).
      They are
        a) annoying
        b) easy to get rid of.
      This patch kills them off.
      
      include/linux/u64_stats_sync.h:70:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:77:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:84:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:96:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:115:1: warning: ‘inline’ is not at beginning of declaration
      include/linux/u64_stats_sync.h:127:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:241:1: warning: ‘inline’ is not at beginning of declaration
      kernel/time.c:257:1: warning: ‘inline’ is not at beginning of declaration
      kernel/perf_event.c:4513:1: warning: ‘inline’ is not at beginning of declaration
      mm/page_alloc.c:4012:1: warning: ‘inline’ is not at beginning of declaration
      Signed-off-by: Jesper Juhl <jj@chaosbits.net>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      fa9f90be
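
      A small standalone illustration of the warning and the fix: placing 'inline'
      after the type is legal C but trips -Wextra (via -Wold-style-declaration);
      moving the function specifier to the front silences it without changing
      behaviour.  The function here is just an example, not one of the affected
      kernel sites.

        #include <stdio.h>

        static int inline squared_old(int x) { return x * x; }  /* warns with -Wextra */
        static inline int squared_new(int x) { return x * x; }  /* clean */

        int main(void)
        {
                printf("%d %d\n", squared_old(5), squared_new(5));
                return 0;
        }
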
  5. 25 November 2010, 1 commit
    • mm/page_alloc.c: fix build_all_zonelist() where percpu_alloc() is wrongly called under stop_machine_run() · e9959f0f
      Committed by KAMEZAWA Hiroyuki
      During memory hotplug, build_all_zonelists() may be called under
      stop_machine_run().  In this function, setup_zone_pageset() is called,
      but that is a bug because it performs a page allocation under
      stop_machine_run().
      
      Here is a report from Alok Kataria.
      
        BUG: sleeping function called from invalid context at kernel/mutex.c:94
        in_atomic(): 0, irqs_disabled(): 1, pid: 4, name: migration/0
        Pid: 4, comm: migration/0 Not tainted 2.6.35.6-45.fc14.x86_64 #1
        Call Trace:
         [<ffffffff8103d12b>] __might_sleep+0xeb/0xf0
         [<ffffffff81468245>] mutex_lock+0x24/0x50
         [<ffffffff8110eaa6>] pcpu_alloc+0x6d/0x7ee
         [<ffffffff81048888>] ? load_balance+0xbe/0x60e
         [<ffffffff8103a1b3>] ? rt_se_boosted+0x21/0x2f
         [<ffffffff8103e1cf>] ? dequeue_rt_stack+0x18b/0x1ed
         [<ffffffff8110f237>] __alloc_percpu+0x10/0x12
         [<ffffffff81465e22>] setup_zone_pageset+0x38/0xbe
         [<ffffffff810d6d81>] ? build_zonelists_node.clone.58+0x79/0x8c
         [<ffffffff81452539>] __build_all_zonelists+0x419/0x46c
         [<ffffffff8108ef01>] ? cpu_stopper_thread+0xb2/0x198
         [<ffffffff8108f075>] stop_machine_cpu_stop+0x8e/0xc5
         [<ffffffff8108efe7>] ? stop_machine_cpu_stop+0x0/0xc5
         [<ffffffff8108ef57>] cpu_stopper_thread+0x108/0x198
         [<ffffffff81467a37>] ? schedule+0x5b2/0x5cc
         [<ffffffff8108ee4f>] ? cpu_stopper_thread+0x0/0x198
         [<ffffffff81065f29>] kthread+0x7f/0x87
         [<ffffffff8100aae4>] kernel_thread_helper+0x4/0x10
         [<ffffffff81065eaa>] ? kthread+0x0/0x87
         [<ffffffff8100aae0>] ? kernel_thread_helper+0x0/0x10
        Built 5 zonelists in Node order, mobility grouping on.  Total pages: 289456
        Policy zone: Normal
      
      This patch fixes the issue by moving setup_zone_pageset() out from under
      stop_machine_run(); there is obviously no need for it to be called there.
      
      [akpm@linux-foundation.org: remove unneeded local]
      Reported-by: Alok Kataria <akataria@vmware.com>
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Petr Vandrovec <petr@vmware.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e9959f0f
  6. 27 October 2010, 5 commits
    • mm: add casts to/from gfp_t in gfp_to_alloc_flags() · e6223a3b
      Committed by Namhyung Kim
      This removes the following warning from sparse:
      
       mm/page_alloc.c:1934:9: warning: restricted gfp_t degrades to integer
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6223a3b
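
      A hedged fragment of the kind of annotation involved: gfp_t is a sparse
      __bitwise type, so mixing it with plain integer flags needs explicit __force
      casts in both directions.  The exact lines in gfp_to_alloc_flags() may differ
      from this illustration.

        /* __GFP_HIGH is assumed to equal ALLOC_HIGH to save a branch; the cast
         * tells sparse the cross-type comparison is intentional. */
        BUILD_BUG_ON(__GFP_HIGH != (__force gfp_t) ALLOC_HIGH);

        /* ...and going the other way, from gfp_t to the plain-int alloc_flags: */
        alloc_flags |= (__force int) (gfp_mask & __GFP_HIGH);
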
    • writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone · 0e093d99
      Committed by Mel Gorman
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested; otherwise it calls cond_resched() to ensure the
      caller is not hogging the CPU longer than its quota, but does not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e093d99
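
      A standalone model of the idea behind wait_iff_congested(): only pay the
      full congestion timeout when a backing device is actually congested and the
      zone being reclaimed has recently seen congestion; otherwise just yield the
      CPU.  The real kernel function takes a zone, a sync/async flag and a timeout
      and tracks congestion per BDI; this is a simplified illustration.

        #include <stdbool.h>
        #include <stdio.h>

        static int nr_bdi_congested;          /* how many BDIs are congested right now */

        static void cond_resched(void)        { puts("yield the CPU, no sleep"); }
        static void congestion_wait(long t)   { printf("sleep up to %ld ticks\n", t); }

        static void wait_iff_congested(bool zone_congested, long timeout)
        {
                if (nr_bdi_congested == 0 || !zone_congested) {
                        cond_resched();
                        return;
                }
                congestion_wait(timeout);
        }

        int main(void)
        {
                wait_iff_congested(false, 100);   /* clean page cache: no stall */
                nr_bdi_congested = 1;
                wait_iff_congested(true, 100);    /* real congestion: back off */
                return 0;
        }
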
    • memory hotplug: unify is_removable and offline detection code · 49ac8255
      Committed by KAMEZAWA Hiroyuki
      Currently, the sysfs interface of memory hotplug shows whether a section
      is removable or not, but it checks only the migratetype of the pages and
      doesn't check the details of the cluster of pages.
      
      Next, memory hotplug's set_migratetype_isolate() has the same kind of
      check, too.
      
      This patch adds the function __count_unmovable_pages() and makes the
      above two checks use the same logic, so the is_removable and hotremove
      code share it.  There are no changes in the hotremove logic itself.
      
      TODO: we need to find a way to check RECLAIMABLE.  But, thinking about it
            a bit, calling shrink_slab() against a range before starting memory
            hotremove sounds better.  If so, this patch's logic doesn't need to
            be changed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49ac8255
    • memory hotplug: fix notifier's return value check · 4b20477f
      Committed by KAMEZAWA Hiroyuki
      Even if the notifier cannot find any pages, it doesn't mean no pages are
      available.  And if there are no notifiers registered, this condition will
      always be true and memory hotplug will return -EBUSY.
      
      This is a bug, but not a critical one.
      
      In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.
      This "notifier" is called only when the pageblock is _not_
      MIGRATE_MOVABLE, and in that case it is common for memory hotplug to fail
      anyway, so nobody noticed this bug.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b20477f
    • mm, page-allocator: do not check the state of a non-existant buddy during free · b7f50cfa
      Committed by Mel Gorman
      There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
      in buddy allocator by adding buddies that are merging to the tail of the
      free lists") that means a buddy at order MAX_ORDER is checked for merging.
       A page of this order never exists so at times, an effectively random
      piece of memory is being checked.
      
      Alan Curry has reported that this is causing memory corruption in
      userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
      It is not clear why this is happening.  It could be a cache coherency
      problem where pages mapped in both user and kernel space are getting
      different cache lines due to the bad read from kernel space
      (http://lkml.org/lkml/2010/10/13/179).  It could also be that there are
      some special registers being io-remapped at the end of the memmap array
      and that a read has special meaning on them.  Compiler bugs have been
      ruled out because the assembly before and after the patch looks relatively
      harmless.
      
      This patch fixes the problem by ensuring we are not reading a possibly
      invalid location of memory.  It's not clear why the read causes corruption
      but one way or the other it is a buggy read.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Reported-by: Alan Curry <pacman@kosh.dhis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7f50cfa
  7. 08 October 2010, 1 commit
  8. 10 September 2010, 3 commits
  9. 28 August 2010, 2 commits
    • x86: Use memblock to replace early_res · 72d7c3b3
      Committed by Yinghai Lu
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range.
      4. NO_BOOTMEM will switch to use memblock too.
      5. use the _e820 and _early wrappers in this patch; a following patch
         will replace them all
      6. because memblock_x86_free_range supports partial free, we can remove some special care
      7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill()
         so adjust some calling later in setup.c::setup_arch()
         -- corruption_check and mptable_update
      
      -v2: Move reserve_brk() earlier, before fill_memblock_area(), to avoid an
          overlap between brk and memblock_find_in_range() that could happen when
          there are more than 128 RAM entries in the E820 tables and
          memblock_x86_fill() uses memblock_find_in_range() to find a new place
          for the memblock.memory.region array.  We also don't need extend_brk()
          after fill_memblock_area(), so reserve_brk() can simply be moved before it.
      -v3: Move find_smp_config earlier, to make sure memblock_find_in_range()
          does not pick the wrong place if the BIOS doesn't put the mptable in
          the right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory (those ranges are already
          in memblock.reserved anyway); use __NOT_KEEP_MEMBLOCK to make sure
          memblock-related code can be freed later.
      -v5: The generic __memblock_find_in_range() goes from high to low, and on
          32bit the active_region does include high pages, so the limit needs to
          be replaced with memblock.default_alloc_limit, aka get_max_mapped()
      -v6: Use current_limit instead
      -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries
      -v9: update after kmemleak changes in mainline
      Suggested-by: David S. Miller <davem@davemloft.net>
      Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      72d7c3b3
    • memblock: Add find_memory_core_early() · edbe7d23
      Committed by Yinghai Lu
      Use the node ranges in early_node_map[] together with
      __memblock_find_in_range() to find a free range.
      
      This will be used by memblock_x86_find_in_range_node(), which in turn
      will be used to find the right buffer for NODE_DATA.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      edbe7d23
  10. 10 August 2010, 3 commits
    • vmscan: kill prev_priority completely · 25edde03
      Committed by KOSAKI Motohiro
      Since 2.6.28, zone->prev_priority has been unused, so it can be removed
      safely.  This reduces stack usage slightly.
      
      Now I have to say that I'm sorry.  Two years ago I thought prev_priority
      could be integrated again and would be useful, but four (or more)
      attempts have not produced good performance numbers, so I am giving up
      on that approach.
      
      The rest of this changelog consists of notes on prev_priority: why it existed
      in the first place and why it might not be necessary any more.  This
      information is based heavily on discussions between Andrew Morton, Rik van
      Riel and Kosaki Motohiro, who is heavily quoted below.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list, reducing the likelihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority wouldn't work nowadays.  First of
      all, current vmscan still has a lot of UP-centric code, which exposes some
      weaknesses on machines with dozens of CPUs.  I think we need more and more
      improvement.
      
      The problem is that current vmscan mixes up per-system pressure, per-zone
      pressure and per-task pressure a bit.  For example, prev_priority tries to
      boost the priority of other concurrent reclaimers, but if another task has
      a mempolicy restriction that is unnecessary, and it also causes large
      latencies and excessive reclaim.  Per-task priority plus the prev_priority
      adjustment amounts to an emulation of per-system pressure, but it has two
      issues: 1) the emulation is too rough and brutal, and 2) we need per-zone
      pressure, not per-system pressure.
      
      Another example: currently DEF_PRIORITY is 12, which means the LRU is
      rotated about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking
      the OOM killer.  But if 10,000 threads enter DEF_PRIORITY reclaim at the
      same time, the system is under higher memory pressure than priority==0
      (1/4096 * 10,000 > 2).  prev_priority can't solve such a multithreaded
      workload issue.  In other words, the prev_priority concept assumes the
      system doesn't have lots of threads.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25edde03
    • mm: rename try_set_zone_oom() to try_set_zonelist_oom() · ff321fea
      Committed by Minchan Kim
      We have been using the names try_set_zone_oom and clear_zonelist_oom.
      The role of these functions is to lock the zonelist to prevent parallel
      OOM handling.  clear_zonelist_oom makes sense, but try_set_zone_oom is
      rather awkward and does not match clear_zonelist_oom.
      
      Let's rename it to try_set_zonelist_oom.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff321fea
    • oom: avoid oom killer for lowmem allocations · 03668b3c
      Committed by David Rientjes
      If memory has been depleted in lowmem zones even with the protection
      afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
      killing current users will help.  The memory is either reclaimable (or
      migratable) already, in which case we should not invoke the oom killer at
      all, or it is pinned by an application for I/O.  Killing such an
      application may leave the hardware in an unspecified state and there is no
      guarantee that it will be able to make a timely exit.
      
      Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
      not used so that the task can perhaps recover or try again later.
      
      Previously, the heuristic provided some protection for those tasks with
      CAP_SYS_RAWIO, but this is no longer necessary since we will not be
      killing tasks for the purposes of ISA allocations.
      
      high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
      default for all allocations that are not __GFP_DMA, __GFP_DMA32,
      __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
      flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
      return true for allocations that have either __GFP_DMA or __GFP_DMA32.
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03668b3c
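
      A hedged fragment of the check this implies in the allocator's OOM path: if
      the allocation can only be satisfied from the DMA/DMA32 zones (high_zoneidx
      below ZONE_NORMAL) and __GFP_NOFAIL is not set, fail the allocation instead
      of invoking the OOM killer.  Treat the surrounding context as illustrative.

        /* __alloc_pages_may_oom(), illustrative fragment */
        if (!(gfp_mask & __GFP_NOFAIL)) {
                /* The OOM killer won't needlessly kill tasks for lowmem:
                 * only __GFP_DMA/__GFP_DMA32 requests land here. */
                if (high_zoneidx < ZONE_NORMAL)
                        goto out;       /* fail; caller may retry later */
        }
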
  11. 21 July 2010, 1 commit
    • x86,nobootmem: make alloc_bootmem_node fall back to other node when 32bit numa is used · b8ab9f82
      Committed by Yinghai Lu
      Borislav Petkov reported that his 32bit NUMA system has a problem:
      
      [    0.000000] Reserving total of 4c00 pages for numa KVA remap
      [    0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
      [    0.000000] max_pfn = 238000
      [    0.000000] 8202MB HIGHMEM available.
      [    0.000000] 885MB LOWMEM available.
      [    0.000000]   mapped low ram: 0 - 375fe000
      [    0.000000]   low ram: 0 - 375fe000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
      [    0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
      [    0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
      [    0.000000] BUG: unable to handle kernel paging request at 40000000
      [    0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
      [    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
      ...
      [    0.000000] Call Trace:
      [    0.000000]  [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
      [    0.000000]  [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
      [    0.000000]  [<c2c9149e>] ? sparse_init+0x1dc/0x499
      [    0.000000]  [<c2c79118>] ? paging_init+0x168/0x1df
      [    0.000000]  [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb
      
      It looks like it allocates bootmem from too high an address.
      
      Try to cap the limit with get_max_mapped().
      Reported-by: Borislav Petkov <borislav.petkov@amd.com>
      Tested-by: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8ab9f82
  12. 19 July 2010, 1 commit
  13. 28 May 2010, 2 commits
    • numa: introduce numa_mem_id() - effective local memory node id · 7aac7898
      Committed by Lee Schermerhorn
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpu's] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7aac7898
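
      A hedged sketch of the API shape described above, as it would appear in
      <linux/topology.h>.  The per-cpu variable name and the exact percpu
      accessors are assumptions for illustration; the fallback path simply reuses
      numa_node_id(), since a node with a CPU but no memory is the only case the
      new variable has to handle.

        #ifdef CONFIG_HAVE_MEMORYLESS_NODES
        DECLARE_PER_CPU(int, _numa_mem_);       /* assumed variable name */

        /* Returns the nearest node that actually has memory. */
        static inline int numa_mem_id(void)
        {
                return this_cpu_read(_numa_mem_);
        }

        /* Called as CPUs come online (and from __build_all_zonelists()). */
        static inline void set_numa_mem(int node)
        {
                this_cpu_write(_numa_mem_, node);
        }
        #else
        /* Without memoryless-node support, the local node always has memory. */
        static inline int numa_mem_id(void)
        {
                return numa_node_id();
        }
        #endif
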
    • numa: add generic percpu var numa_node_id() implementation · 72812019
      Committed by Lee Schermerhorn
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implementation will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implementation.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72812019
  14. 25 May 2010, 4 commits