1. 13 Jul 2011, 1 commit
    • x86, numa: Implement pfn -> nid mapping granularity check · 1e01979c
      Committed by Tejun Heo
      SPARSEMEM w/o VMEMMAP and DISCONTIGMEM, both used only on 32bit, use
      the sections array to map pfn to nid, which is limited in granularity.
      If NUMA nodes are laid out such that the mapping cannot be accurate,
      boot will fail, triggering the BUG_ON() in mminit_verify_page_links().
      
      On 32bit, the mapping granularity is 512MiB w/ PAE and SPARSEMEM.
      This seems to have been granular enough until commit 2706a0bf (x86,
      NUMA: Enable CONFIG_AMD_NUMA on 32bit too).  Apparently, there is a
      machine which aligns NUMA nodes to 128MiB and has only AMD NUMA but
      no SRAT.  This led to the following BUG_ON().
      
       On node 0 totalpages: 2096615
         DMA zone: 32 pages used for memmap
         DMA zone: 0 pages reserved
         DMA zone: 3927 pages, LIFO batch:0
         Normal zone: 1740 pages used for memmap
         Normal zone: 220978 pages, LIFO batch:31
         HighMem zone: 16405 pages used for memmap
         HighMem zone: 1853533 pages, LIFO batch:31
       BUG: Int 6: CR2   (null)
            EDI   (null)  ESI 00000002  EBP 00000002  ESP c1543ecc
            EBX f2400000  EDX 00000006  ECX   (null)  EAX 00000001
            err   (null)  EIP c16209aa   CS 00000060  flg 00010002
       Stack: f2400000 00220000 f7200800 c1620613 00220000 01000000 04400000 00238000
                (null) f7200000 00000002 f7200b58 f7200800 c1620929 000375fe   (null)
              f7200b80 c16395f0 00200a02 f7200a80   (null) 000375fe 00000002   (null)
       Pid: 0, comm: swapper Not tainted 2.6.39-rc5-00181-g2706a0bf #17
       Call Trace:
        [<c136b1e5>] ? early_fault+0x2e/0x2e
        [<c16209aa>] ? mminit_verify_page_links+0x12/0x42
        [<c1620613>] ? memmap_init_zone+0xaf/0x10c
        [<c1620929>] ? free_area_init_node+0x2b9/0x2e3
        [<c1607e99>] ? free_area_init_nodes+0x3f2/0x451
        [<c1601d80>] ? paging_init+0x112/0x118
        [<c15f578d>] ? setup_arch+0x791/0x82f
        [<c15f43d9>] ? start_kernel+0x6a/0x257
      
      This patch implements node_map_pfn_alignment(), which determines the
      maximum internode alignment, and updates numa_register_memblks() to
      reject the NUMA configuration if that alignment exceeds the pfn -> nid
      mapping granularity of the memory model as determined by
      PAGES_PER_SECTION.
      
      This makes the problematic machine boot w/ flatmem by rejecting the
      NUMA config and provides protection against crazy NUMA configurations.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20110712074534.GB2872@htj.dyndns.org
      LKML-Reference: <20110628174613.GP478@escobedo.osrc.amd.com>
      Reported-and-Tested-by: Hans Rosenfeld <hans.rosenfeld@amd.com>
      Cc: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
      1e01979c
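      A minimal sketch of the granularity check this entry describes (not the
      kernel's node_map_pfn_alignment(); the struct, helper name, and section
      size below are assumptions for illustration): every boundary between
      ranges owned by different nodes has to fall on a section boundary,
      otherwise the section-based pfn -> nid lookup cannot be accurate.

         #include <stdbool.h>            /* for a standalone build of this sketch */

         #define PAGES_PER_SECTION (1UL << 17)   /* e.g. 512MiB of 4KiB pages w/ PAE */

         struct pfn_range { unsigned long start_pfn, end_pfn; int nid; };

         /* hypothetical helper: ranges are sorted by start_pfn */
         static bool numa_layout_is_representable(const struct pfn_range *r, int nr)
         {
                 int i;

                 for (i = 1; i < nr; i++) {
                         if (r[i].nid == r[i - 1].nid)
                                 continue;
                         /* an internode boundary must be section aligned */
                         if (r[i].start_pfn & (PAGES_PER_SECTION - 1))
                                 return false;   /* reject the NUMA config */
                 }
                 return true;
         }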
  2. 02 Jun 2011, 1 commit
  3. 27 May 2011, 1 commit
    • memcg: fix get_scan_count() for small targets · 246e87a9
      Committed by KAMEZAWA Hiroyuki
      During memory reclaim we determine the number of pages to be scanned per
      zone as
      
      	(anon + file) >> priority.
      Assume
      	scan = (anon + file) >> priority.
      
      If scan < SWAP_CLUSTER_MAX, the scan is skipped for this round and the
      reclaim priority is raised (the priority value drops towards 0).  This
      has some problems.
      
        1. The priority is raised by 1 without any scan.
           For a scan to happen at this priority, the amount of pages must be
           larger than 512M.  If pages >> priority < SWAP_CLUSTER_MAX, the
           remainder is recorded and the scan is batched later (but we lose
           one priority level).  If the memory size is below 16M,
           pages >> priority is 0 and no scan ever happens at DEF_PRIORITY.
      
        2. If zone->all_unreclaimable==true, the zone is scanned only when
           priority==0, so x86's ZONE_DMA will never be recovered until the
           user of its pages frees memory by itself.
      
        3. With memcg, the memory limit can be small.  A small memcg reaches
           priority < DEF_PRIORITY-2 very easily and then needs to call
           wait_iff_congested().  For a scan to happen before priority=9, at
           least 64MB of memory must be in use.
      
      This patch therefore forces a scan of SWAP_CLUSTER_MAX pages when

        1. the target is small enough, and
        2. the reclaim is from kswapd or memcg.
      
      This avoids a rapid priority drop and may allow small zones to recover
      from all_unreclaimable.  The patch also removes nr_saved_scan, since
      scanning now proceeds at the current priority even when
      pages >> priority is very small.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Ying Han <yinghan@google.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      246e87a9
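      A minimal sketch of the forced-scan idea described above (simplified;
      the helper name and its arguments are assumptions, not the actual
      get_scan_count()):

         #include <stdbool.h>            /* for a standalone build of this sketch */

         #define SWAP_CLUSTER_MAX 32UL

         static unsigned long scan_target(unsigned long anon, unsigned long file,
                                          int priority, bool kswapd_or_memcg)
         {
                 unsigned long scan = (anon + file) >> priority;

                 /*
                  * For small targets the shift yields almost nothing and the
                  * scan would be skipped while the priority keeps dropping;
                  * instead, scan at least SWAP_CLUSTER_MAX pages when the
                  * reclaim comes from kswapd or memcg.
                  */
                 if (scan < SWAP_CLUSTER_MAX && kswapd_or_memcg)
                         scan = SWAP_CLUSTER_MAX;

                 return scan;
         }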
  4. 25 May 2011, 10 commits
  5. 21 May 2011, 1 commit
    • sanitize <linux/prefetch.h> usage · 268bb0ce
      Committed by Linus Torvalds
      Commit e66eed65 ("list: remove prefetching from regular list
      iterators") removed the include of prefetch.h from list.h, which
      uncovered several cases that had apparently relied on that rather
      obscure header file dependency.
      
      So this fixes things up a bit, using
      
         grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]')
         grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]')
      
      to guide us in finding files that either need <linux/prefetch.h>
      inclusion, or have it despite not needing it.
      
      There are more of them around (mostly network drivers), but this gets
      many core ones.
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      268bb0ce
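      The pattern of the fix, for illustration (the file and call are
      hypothetical): a file that calls prefetch()/prefetchw() now includes the
      header itself instead of relying on <linux/list.h> to drag it in.

         /* e.g. in a driver that uses prefetchw() on list entries */
         #include <linux/prefetch.h>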
  6. 17 May 2011, 1 commit
  7. 12 May 2011, 2 commits
    • mm: add alloc_pages_exact_nid() · ee85c2e1
      Committed by Andi Kleen
      Add an alloc_pages_exact_nid() that allocates on a specific node.
      
      The naming is quite broken, but fixing that would need a larger renaming
      action.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: tweak comment]
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee85c2e1
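      A hedged usage sketch of the new helper (the caller, its name, and the
      cleanup hint are made-up examples, not taken from the patch):

         #include <linux/gfp.h>          /* alloc_pages_exact_nid(), GFP_KERNEL */

         /* hypothetical caller: allocate an exactly sized, physically
            contiguous buffer on a given NUMA node */
         static void *make_node_buffer(int nid, size_t size)
         {
                 void *buf = alloc_pages_exact_nid(nid, size, GFP_KERNEL);

                 if (!buf)
                         return NULL;
                 return buf;     /* release later with free_pages_exact(buf, size) */
         }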
    • mm: use alloc_bootmem_node_nopanic() on really needed path · 8f389a99
      Committed by Yinghai Lu
      Stefan found that nobootmem does not work on his system, which has only
      8M of RAM.  This causes an early panic:
      
        BIOS-provided physical RAM map:
         BIOS-88: 0000000000000000 - 000000000009f000 (usable)
         BIOS-88: 0000000000100000 - 0000000000840000 (usable)
        bootconsole [earlyser0] enabled
        Notice: NX (Execute Disable) protection missing in CPU or disabled in BIOS!
        DMI not present or invalid.
        last_pfn = 0x840 max_arch_pfn = 0x100000
        init_memory_mapping: 0000000000000000-0000000000840000
        8MB LOWMEM available.
          mapped low ram: 0 - 00840000
          low ram: 0 - 00840000
        Zone PFN ranges:
          DMA      0x00000001 -> 0x00001000
          Normal   empty
        Movable zone start PFN for each node
        early_node_map[2] active PFN ranges
            0: 0x00000001 -> 0x0000009f
            0: 0x00000100 -> 0x00000840
        BUG: Int 6: CR2 (null)
             EDI c034663c  ESI (null)  EBP c0329f38  ESP c0329ef4
             EBX c0346380  EDX 00000006  ECX ffffffff  EAX fffffff4
             err (null)  EIP c0353191   CS c0320060  flg 00010082
        Stack: (null) c030c533 000007cd (null) c030c533 00000001 (null) (null)
               00000003 0000083f 00000018 00000002 00000002 c0329f6c c03534d6 (null)
               (null) 00000100 00000840 (null) c0329f64 00000001 00001000 (null)
        Pid: 0, comm: swapper Not tainted 2.6.36 #5
        Call Trace:
         [<c02e3707>] ? 0xc02e3707
         [<c035e6e5>] 0xc035e6e5
         [<c0353191>] ? 0xc0353191
         [<c03534d6>] 0xc03534d6
         [<c034f1cd>] 0xc034f1cd
         [<c034a824>] 0xc034a824
         [<c03513cb>] ? 0xc03513cb
         [<c0349432>] 0xc0349432
         [<c0349066>] 0xc0349066
      
      It turns out that we should ignore the low limit of 16M.
      
      Use alloc_bootmem_node_nopanic() in this case.
      
      [akpm@linux-foundation.org: less mess]
      Signed-off-by: Yinghai LU <yinghai@kernel.org>
      Reported-by: Stefan Hellermann <stefan@the2masters.de>
      Tested-by: Stefan Hellermann <stefan@the2masters.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: <stable@kernel.org>		[2.6.34+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f389a99
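      The shape of the change, roughly (the call site and failure handling
      shown here are illustrative assumptions): the _nopanic variant returns
      NULL instead of panicking, so early boot can cope on a tiny-memory
      machine rather than die.

         /* before: a failure here would panic() during early boot */
         map = alloc_bootmem_node(pgdat, size);

         /* after: a failure returns NULL, letting the caller cope */
         map = alloc_bootmem_node_nopanic(pgdat, size);
         if (!map)
                 pr_warn("early allocation of %lu bytes failed\n", size);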
  8. 15 Apr 2011, 1 commit
  9. 10 Apr 2011, 1 commit
  10. 31 Mar 2011, 1 commit
  11. 25 Mar 2011, 1 commit
  12. 24 Mar 2011, 1 commit
  13. 23 Mar 2011, 7 commits
  14. 18 Mar 2011, 1 commit
  15. 26 Feb 2011, 2 commits
  16. 24 Feb 2011, 2 commits
  17. 26 Jan 2011, 2 commits
    • mm: clear pages_scanned only if draining a pcp adds pages to the buddy allocator · 2ff754fa
      Committed by David Rientjes
      Commit 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone") uncovered a livelock in the page
      allocator that resulted in tasks infinitely looping trying to find
      memory and kswapd running at 100% cpu.
      
      The issue occurs because drain_all_pages() is called immediately
      following direct reclaim when no memory is freed and try_to_free_pages()
      returns non-zero because all zones in the zonelist do not have their
      all_unreclaimable flag set.
      
      When draining the per-cpu pagesets back to the buddy allocator for each
      zone, the zone->pages_scanned counter is cleared to avoid erroneously
      setting zone->all_unreclaimable later.  The problem is that no pages may
      actually be drained and, thus, the unreclaimable logic never fails
      direct reclaim so the oom killer may be invoked.
      
      This apparently only manifested after wait_iff_congested() was
      introduced and the zone was full of anonymous memory that would not
      congest the backing store.  The page allocator would infinitely loop if
      there were no other tasks waiting to be scheduled and clear
      zone->pages_scanned because of drain_all_pages() as the result of this
      change before kswapd could scan enough pages to trigger the reclaim
      logic.  Additionally, with every loop of the page allocator and in the
      reclaim path, kswapd would be kicked and would end up running at 100%
      cpu.  In this scenario, current and kswapd are all running continuously
      with kswapd incrementing zone->pages_scanned and current clearing it.
      
      The problem is even more pronounced when current swaps some of its
      memory to swap cache and the reclaimable logic then considers all active
      anonymous memory in the all_unreclaimable logic, which requires a much
      higher zone->pages_scanned value for try_to_free_pages() to return zero,
      a value that is never attainable in this scenario.
      
      Before wait_iff_congested(), the page allocator would incur an
      unconditional timeout and allow kswapd to elevate zone->pages_scanned to
      a level at which the oom killer would be called the next time it loops.
      
      The fix is to only attempt to drain pcp pages if there is actually a
      quantity to be drained.  The unconditional clearing of
      zone->pages_scanned in free_pcppages_bulk() need not be changed since
      other callers already ensure that draining will occur.  This patch
      ensures that free_pcppages_bulk() will actually free memory before
      calling into it from drain_all_pages() so zone->pages_scanned is only
      cleared if appropriate.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ff754fa
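      A sketch of the fix's shape as described above (simplified; locking,
      the per-cpu iteration, and the surrounding code are assumptions):
      free_pcppages_bulk(), which clears zone->pages_scanned, is only entered
      when the pcp list actually holds pages to give back to the buddy
      allocator.

         /* inside the per-cpu drain loop (simplified) */
         if (pcp->count)
                 free_pcppages_bulk(zone, pcp->count, pcp);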
    • mm: fix deferred congestion timeout if preferred zone is not allowed · f33261d7
      Committed by David Rientjes
      Before 0e093d99 ("writeback: do not sleep on the congestion queue if
      there are no congested BDIs or if significant congestion is not being
      encountered in the current zone"), preferred_zone was only used for NUMA
      statistics, to determine the zoneidx from which to allocate given the
      type requested, and to decide whether to utilize memory compaction.
      
      wait_iff_congested(), though, uses preferred_zone to determine if the
      congestion wait should be deferred because its dirty pages are backed by
      a congested bdi.  This incorrectly defers the timeout and busy loops in
      the page allocator with various cond_resched() calls if preferred_zone
      is not allowed in the current context, usually consuming 100% of a cpu.
      
      This patch ensures preferred_zone is an allowed zone in the fastpath
      depending on whether current is constrained by its cpuset or nodes in
      its mempolicy (when the nodemask passed is non-NULL).  This is correct
      since the fastpath allocation always passes ALLOC_CPUSET when trying to
      allocate memory.  In the slowpath, this patch resets preferred_zone to
      the first zone of the allowed type when the allocation is not
      constrained by current's cpuset, i.e.  it does not pass ALLOC_CPUSET.
      
      This patch also ensures preferred_zone is from the set of allowed nodes
      when called from within direct reclaim since allocations are always
      constrained by cpusets in this context (it is blockable).
      
      Both of these uses of cpuset_current_mems_allowed are protected by
      get_mems_allowed().
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f33261d7
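      A sketch of the fastpath idea (the exact call site and the surrounding
      code are assumptions): preferred_zone is derived only from zones the
      task is actually allowed in, so wait_iff_congested() consults a
      relevant zone.

         /* in the allocator fastpath (simplified): pick the first allowed
            zone of the requested type */
         first_zones_zonelist(zonelist, high_zoneidx,
                              nodemask ? : &cpuset_current_mems_allowed,
                              &preferred_zone);
         if (!preferred_zone)
                 return NULL;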
  18. 14 Jan 2011, 4 commits