1. 27 Oct, 2010 4 commits
    • M
      writeback: do not sleep on the congestion queue if there are no congested BDIs... · 0e093d99
      Mel Gorman committed
      writeback: do not sleep on the congestion queue if there are no congested BDIs or if significant congestion is not being encountered in the current zone
      
      If congestion_wait() is called with no BDI congested, the caller will
      sleep for the full timeout and this may be an unnecessary sleep.  This
      patch adds a wait_iff_congested() that checks congestion and only sleeps
      if a BDI is congested. If none is, it calls cond_resched() to ensure the
      caller is not hogging the CPU longer than its quota, but does not sleep.
      
      This is aimed at reducing some of the major desktop stalls reported during
      IO.  For example, while kswapd is operating, it calls congestion_wait()
      but it could just have been reclaiming clean page cache pages with no
      congestion.  Without this patch, it would sleep for a full timeout but
      after this patch, it'll just call schedule() if it has been on the CPU too
      long.  Similar logic applies to direct reclaimers that are not making
      enough progress.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e093d99
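The behaviour described for wait_iff_congested() in the commit above can be modelled as a small decision function. This is a minimal Python sketch, not the kernel code: the parameter names, the boolean return value, and the `events` list that records what would happen are all assumptions made for illustration.

```python
def wait_iff_congested(nr_congested_bdis, zone_congested, timeout, events):
    """Model of the decision only; returns True if the caller would sleep.

    nr_congested_bdis, zone_congested and events are hypothetical stand-ins
    for kernel state, not real kernel parameters.
    """
    if nr_congested_bdis == 0 or not zone_congested:
        # No congestion worth waiting for: give up the CPU if needed,
        # but do not sleep for the timeout.
        events.append("cond_resched")
        return False
    # Congestion is present: sleep on the congestion queue as before.
    events.append(f"sleep({timeout})")
    return True


events = []
slept_when_idle = wait_iff_congested(0, False, 100, events)
slept_when_congested = wait_iff_congested(2, True, 100, events)
```

In this model, a caller such as kswapd that has only been reclaiming clean page cache takes the first branch and never pays the full timeout.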
    • K
      memory hotplug: unify is_removable and offline detection code · 49ac8255
      KAMEZAWA Hiroyuki committed
      Now, the sysfs interface of memory hotplug shows whether a section is
      removable or not.  But it checks only the migratetype of pages and doesn't
      check the details of each cluster of pages.
      
      Next, memory hotplug's set_migratetype_isolate() has the same kind of
      check, too.
      
      This patch adds the function __count_unmovable_pages() and makes the above
      two checks use the same logic.  As a result, is_removable and the hotremove
      code use the same logic.  No changes in the hotremove logic itself.
      
      TODO: need to find a way to check RECLAIMABLE. But, considering it a bit,
            calling shrink_slab() against a range before starting memory hotremove
            sounds better. If so, this patch's logic doesn't need to be changed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reported-by: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49ac8255
    • K
      memory hotplug: fix notifier's return value check · 4b20477f
      KAMEZAWA Hiroyuki committed
      Even if the notifier cannot find any pages, it doesn't mean no pages are
      available.  And if there are no notifiers registered, this condition will
      always be true and memory hotplug will fail with -EBUSY.
      
      This is a bug but not critical.
      
      In most cases, a pageblock which will be offlined is MIGRATE_MOVABLE.  This
      "notifier" is called only when the pageblock is _not_ MIGRATE_MOVABLE.
      But if it is not MIGRATE_MOVABLE, it is common for memory hotplug to
      fail anyway.  So no one noticed this bug.
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b20477f
    • M
      mm, page-allocator: do not check the state of a non-existant buddy during free · b7f50cfa
      Mel Gorman committed
      There is a bug in commit 6dda9d55 ("page allocator: reduce fragmentation
      in buddy allocator by adding buddies that are merging to the tail of the
      free lists") that means a buddy at order MAX_ORDER is checked for merging.
      A page of this order never exists so at times, an effectively random
      piece of memory is being checked.
      
      Alan Curry has reported that this is causing memory corruption in
      userspace data on a PPC32 platform (http://lkml.org/lkml/2010/10/9/32).
      It is not clear why this is happening.  It could be a cache coherency
      problem where pages mapped in both user and kernel space are getting
      different cache lines due to the bad read from kernel space
      (http://lkml.org/lkml/2010/10/13/179).  It could also be that there are
      some special registers being io-remapped at the end of the memmap array
      and that a read has special meaning on them.  Compiler bugs have been
      ruled out because the assembly before and after the patch looks relatively
      harmless.
      
      This patch fixes the problem by ensuring we are not reading a possibly
      invalid location of memory.  It's not clear why the read causes corruption
      but one way or the other it is a buggy read.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Reported-by: Alan Curry <pacman@kosh.dhis.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7f50cfa
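The kind of bound the commit above is concerned with can be illustrated with the usual buddy arithmetic. The following Python sketch is only a model: the value of `MAX_ORDER`, the helper names, and the exact bound are assumptions for illustration, since the commit message does not quote the changed condition.

```python
MAX_ORDER = 11  # assumed value for illustration; orders run from 0 to MAX_ORDER - 1


def buddy_index(page_idx, order):
    """Index of the buddy of the block starting at page_idx at this order."""
    return page_idx ^ (1 << order)


def has_mergeable_buddy(order, max_order=MAX_ORDER):
    """A block at the largest order has no buddy to merge with."""
    return order < max_order - 1


def may_check_higher_buddy(order, max_order=MAX_ORDER):
    """Hypothetical guard: the look-ahead inspects the buddy of the merged
    block at order + 1, so that buddy must be one that can exist."""
    return has_mergeable_buddy(order + 1, max_order)
```

Without a guard of this shape, the look-ahead at the top orders would compute an index for a buddy that never exists and read whatever memory happens to be there.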
  2. 08 Oct, 2010 1 commit
  3. 10 Sep, 2010 3 commits
  4. 28 Aug, 2010 2 commits
    • Y
      x86: Use memblock to replace early_res · 72d7c3b3
      Yinghai Lu committed
      1. replace find_e820_area with memblock_find_in_range
      2. replace reserve_early with memblock_x86_reserve_range
      3. replace free_early with memblock_x86_free_range.
      4. NO_BOOTMEM will switch to use memblock too.
      5. use _e820 and _early wrappers in this patch; a following patch will
         replace them all
      6. because memblock_x86_free_range supports partial free, we can remove some special handling
      7. Need to make sure that memblock_find_in_range() is called after memblock_x86_fill(),
         so move some calls later in setup.c::setup_arch()
         -- corruption_check and mptable_update
      
      -v2: Move reserve_brk() early
          Before fill_memblock_area, to avoid overlap between brk and memblock_find_in_range()
          that could happen when we have more than 128 RAM entries in the E820 tables, and
          memblock_x86_fill() could use memblock_find_in_range() to find a new place for
          the memblock.memory.region array.
          And we don't need to use extend_brk() after fill_memblock_area(),
          so move reserve_brk() early before fill_memblock_area().
      -v3: Move find_smp_config early
          To make sure memblock_find_in_range does not find the wrong place if the BIOS
          doesn't put the mptable in the right place.
      -v4: Treat RESERVED_KERN as RAM in memblock.memory; they are already in
          memblock.reserved.
          Use __NOT_KEEP_MEMBLOCK to make sure memblock related code could be freed later.
      -v5: Generic version __memblock_find_in_range() is going from high to low, and for 32bit
          active_region does include high pages,
          so need to replace the limit with memblock.default_alloc_limit, aka get_max_mapped()
      -v6: Use current_limit instead
      -v7: check with MEMBLOCK_ERROR instead of -1ULL or -1L
      -v8: Set memblock_can_resize early to handle EFI with more RAM entries
      -v9: update after kmemleak changes in mainline
      Suggested-by: David S. Miller <davem@davemloft.net>
      Suggested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      72d7c3b3
    • Y
      memblock: Add find_memory_core_early() · edbe7d23
      Yinghai Lu committed
      Use __memblock_find_in_range to find a free range according to the node
      ranges in early_node_map[].
      
      Will be used by memblock_x86_find_in_range_node()
      
      memblock_x86_find_in_range_node will be used to find right buffer for NODE_DATA
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: H. Peter Anvin <hpa@zytor.com>
      edbe7d23
  5. 10 Aug, 2010 3 commits
    • K
      vmscan: kill prev_priority completely · 25edde03
      KOSAKI Motohiro committed
      Since 2.6.28 zone->prev_priority is unused, so it can be removed
      safely. This reduces stack usage slightly.
      
      Now I have to say that I'm sorry. Two years ago, I thought prev_priority
      could be integrated again and would be useful, but four (or more) attempts
      haven't produced good performance numbers. Thus I gave up on that approach.
      
      The rest of this changelog is notes on prev_priority and why it existed in
      the first place and why it might not be necessary any more. This information
      is based heavily on discussions between Andrew Morton, Rik van Riel and
      Kosaki Motohiro, who is heavily quoted.
      
      Historically prev_priority was important because it determined when the VM
      would start unmapping PTE pages. i.e. there are no balances of note within
      the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
      is a potential risk of unnecessarily increasing minor faults as a large
      amount of read activity of use-once pages could push mapped pages to the
      end of the LRU and get unmapped.
      
      There is no proof this is still a problem but currently it is not considered
      to be. Active files are not deactivated if the active file list is smaller
      than the inactive list reducing the likelihood that file-mapped pages are
      being pushed off the LRU and referenced executable pages are kept on the
      active list to avoid them getting pushed out by read activity.
      
      Even if it is a problem, prev_priority wouldn't work
      nowadays. First of all, current vmscan still has a lot of UP-centric code. It
      exposes some weaknesses on machines with some dozens of CPUs. I think we need more and
      more improvement.
      
      The problem is, current vmscan mixes up per-system pressure, per-zone pressure
      and per-task pressure a bit. For example, prev_priority tries to boost the priority of
      other concurrent reclaimers. But if another task has a mempolicy restriction,
      this is unnecessary, and it also causes needlessly large latency and excessive reclaim.
      Per-task priority plus the prev_priority adjustment emulates
      per-system pressure, but it has two issues: 1) the emulation is too rough and brutal;
      2) we need per-zone pressure, not per-system.
      
      Another example: currently DEF_PRIORITY is 12. It means the LRU rotates about
      2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking the OOM killer.
      But if 10,000 threads enter DEF_PRIORITY reclaim at the same time, the
      system has higher memory pressure than priority==0 (1/4096*10,000 > 2).
      prev_priority can't solve such a multithreaded workload issue. In other words,
      the prev_priority concept assumes the system doesn't have lots of threads.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Chris Mason <chris.mason@oracle.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michael Rubin <mrubin@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25edde03
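The DEF_PRIORITY arithmetic quoted in the commit above can be checked directly. This short Python sketch only reproduces the two sums from the changelog; the variable names are chosen here for illustration, and it assumes, as the changelog does, that a scan at priority p covers 1/2^p of the LRU.

```python
DEF_PRIORITY = 12

# One task walking every priority from DEF_PRIORITY down to 0 scans
# 1/4096 + 1/2048 + ... + 1 of the LRU in total.
single_task_scan = sum(1 / (1 << p) for p in range(DEF_PRIORITY, -1, -1))

# 10,000 threads each doing only the first, gentlest pass at DEF_PRIORITY.
threads = 10_000
concurrent_scan = threads / (1 << DEF_PRIORITY)
```

The single-task total is just under 2 LRU cycles, while the concurrent first passes add up to roughly 2.44, which is the comparison the changelog makes.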
    • M
      mm: rename try_set_zone_oom() to try_set_zonelist_oom() · ff321fea
      Minchan Kim committed
      We have been using the names try_set_zone_oom and clear_zonelist_oom.
      The role of these functions is to lock the zonelist to prevent parallel
      OOM kills. So clear_zonelist_oom makes sense, but try_set_zone_oom is rather
      awkward and does not match clear_zonelist_oom.
      
      Let's change it to try_set_zonelist_oom.
      Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ff321fea
    • D
      oom: avoid oom killer for lowmem allocations · 03668b3c
      David Rientjes committed
      If memory has been depleted in lowmem zones even with the protection
      afforded to it by /proc/sys/vm/lowmem_reserve_ratio, it is unlikely that
      killing current users will help.  The memory is either reclaimable (or
      migratable) already, in which case we should not invoke the oom killer at
      all, or it is pinned by an application for I/O.  Killing such an
      application may leave the hardware in an unspecified state and there is no
      guarantee that it will be able to make a timely exit.
      
      Lowmem allocations are now failed in oom conditions when __GFP_NOFAIL is
      not used so that the task can perhaps recover or try again later.
      
      Previously, the heuristic provided some protection for those tasks with
      CAP_SYS_RAWIO, but this is no longer necessary since we will not be
      killing tasks for the purposes of ISA allocations.
      
      high_zoneidx is gfp_zone(gfp_flags), meaning that ZONE_NORMAL will be the
      default for all allocations that are not __GFP_DMA, __GFP_DMA32,
      __GFP_HIGHMEM, and __GFP_MOVABLE on kernels configured to support those
      flags.  Testing for high_zoneidx being less than ZONE_NORMAL will only
      return true for allocations that have either __GFP_DMA or __GFP_DMA32.
      Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03668b3c
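The high_zoneidx test described in the commit above can be modelled in a few lines. This Python sketch is a simplification under stated assumptions: the `Zone` and `Gfp` classes, the flag bit values, and `may_invoke_oom_killer()` are illustrative names, and `gfp_zone()` here is a reduced model of the mapping, not the kernel's table.

```python
from enum import IntEnum, IntFlag


class Zone(IntEnum):
    # Lower zones first; which zones exist depends on the kernel config.
    DMA = 0
    DMA32 = 1
    NORMAL = 2
    HIGHMEM = 3
    MOVABLE = 4


class Gfp(IntFlag):
    # Illustrative bit values only, not the kernel's definitions.
    DMA = 1
    HIGHMEM = 2
    DMA32 = 4
    MOVABLE = 8
    NOFAIL = 16


def gfp_zone(flags):
    """Simplified model of the highest zone an allocation may use."""
    if flags & Gfp.DMA:
        return Zone.DMA
    if flags & Gfp.DMA32:
        return Zone.DMA32
    if flags & Gfp.HIGHMEM:
        return Zone.MOVABLE if flags & Gfp.MOVABLE else Zone.HIGHMEM
    return Zone.NORMAL


def may_invoke_oom_killer(flags):
    """Lowmem allocations fail instead of killing, unless NOFAIL is set."""
    lowmem = gfp_zone(flags) < Zone.NORMAL
    return (not lowmem) or bool(flags & Gfp.NOFAIL)
```

Only the DMA and DMA32 cases compare below `Zone.NORMAL`, which matches the changelog's statement about which allocations the test catches.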
  6. 21 Jul, 2010 1 commit
    • Y
      x86,nobootmem: make alloc_bootmem_node fall back to other node when 32bit numa is used · b8ab9f82
      Yinghai Lu committed
      Borislav Petkov reported that his 32bit NUMA system has a problem:
      
      [    0.000000] Reserving total of 4c00 pages for numa KVA remap
      [    0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
      [    0.000000] max_pfn = 238000
      [    0.000000] 8202MB HIGHMEM available.
      [    0.000000] 885MB LOWMEM available.
      [    0.000000]   mapped low ram: 0 - 375fe000
      [    0.000000]   low ram: 0 - 375fe000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
      [    0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
      [    0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
      [    0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
      [    0.000000] BUG: unable to handle kernel paging request at 40000000
      [    0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
      [    0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
      ...
      [    0.000000] Call Trace:
      [    0.000000]  [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
      [    0.000000]  [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
      [    0.000000]  [<c2c9149e>] ? sparse_init+0x1dc/0x499
      [    0.000000]  [<c2c79118>] ? paging_init+0x168/0x1df
      [    0.000000]  [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb
      
      It looks like it allocates too high an address for bootmem.
      
      Try to cap the limit with get_max_mapped().
      Reported-by: Borislav Petkov <borislav.petkov@amd.com>
      Tested-by: Conny Seidel <conny.seidel@amd.com>
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: <stable@kernel.org>		[2.6.34.x]
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8ab9f82
  7. 19 Jul, 2010 1 commit
  8. 28 May, 2010 2 commits
    • L
      numa: introduce numa_mem_id()- effective local memory node id · 7aac7898
      Lee Schermerhorn committed
      Introduce numa_mem_id(), based on generic percpu variable infrastructure
      to track "nearest node with memory" for archs that support memoryless
      nodes.
      
      Define API in <linux/topology.h> when CONFIG_HAVE_MEMORYLESS_NODES
      defined, else stubs.  Architectures will define HAVE_MEMORYLESS_NODES
      if/when they support them.
      
      Archs can override definitions of:
      
      numa_mem_id() - returns node number of "local memory" node
      set_numa_mem() - initialize [this cpus'] per cpu variable 'numa_mem'
      cpu_to_mem()  - return numa_mem for specified cpu; may be used as lvalue
      
      Generic initialization of 'numa_mem' occurs in __build_all_zonelists().
      This will initialize the boot cpu at boot time, and all cpus on change of
      numa_zonelist_order, or when node or memory hot-plug requires zonelist
      rebuild.  Archs that support memoryless nodes will need to initialize
      'numa_mem' for secondary cpus as they're brought on-line.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7aac7898
    • L
      numa: add generic percpu var numa_node_id() implementation · 72812019
      Lee Schermerhorn committed
      Rework the generic version of the numa_node_id() function to use the new
      generic percpu variable infrastructure.
      
      Guard the new implementation with a new config option:
      
              CONFIG_USE_PERCPU_NUMA_NODE_ID.
      
      Archs which support this new implementation will default this option to 'y'
      when NUMA is configured.  This config option could be removed if/when all
      archs switch over to the generic percpu implementation of numa_node_id().
      Arch support involves:
      
        1) converting any existing per cpu variable implementations to use
           this implementation.  x86_64 is an instance of such an arch.
        2) archs that don't use a per cpu variable for numa_node_id() will
           need to initialize the new per cpu variable "numa_node" as cpus
           are brought on-line.  ia64 is an example.
        3) Defining USE_PERCPU_NUMA_NODE_ID in arch dependent Kconfig--e.g.,
           when NUMA is configured.  This is required because I have
           retained the old implementation by default to allow archs to
           be modified incrementally, as desired.
      
      Subsequent patches will convert x86_64 and ia64 to use this implementation.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72812019
  9. 25 May, 2010 10 commits
    • H
      mem-hotplug: fix potential race while building zonelist for new populated zone · 4eaf3f64
      Haicheng Li committed
      Add global mutex zonelists_mutex to fix the possible race:
      
           CPU0                                  CPU1                    CPU2
      (1) zone->present_pages += online_pages;
      (2)                                       build_all_zonelists();
      (3)                                                               alloc_page();
      (4)                                                               free_page();
      (5) build_all_zonelists();
      (6)   __build_all_zonelists();
      (7)     zone->pageset = alloc_percpu();
      
      In step (3,4), zone->pageset still points to boot_pageset, so bad
      things may happen if 2+ nodes are in this state. Even if only 1 node
      is accessing the boot_pageset, (3) may still consume too much memory
      to fail the memory allocations in step (7).
      
      Besides, atomic operation ensures alloc_percpu() in step (7) will never fail
      since there is a new fresh memory block added in step(6).
      
      [haicheng.li@linux.intel.com: hold zonelists_mutex when build_all_zonelists]
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <andi.kleen@intel.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4eaf3f64
    • H
      mem-hotplug: avoid multiple zones sharing same boot strapping boot_pageset · 1f522509
      Haicheng Li committed
      For each newly populated zone of a hot-added node, we need to update its pagesets
      with a dynamically allocated per_cpu_pageset struct for all possible CPUs:
      
          1) Detach zone->pageset from the shared boot_pageset
             at end of __build_all_zonelists().
      
          2) Use mutex to protect zone->pageset when it's still
             shared in onlined_pages()
      
      Otherwise, multiple zones of different nodes would share the same bootstrapping
      boot_pageset for the same CPU, which will eventually cause the kernel panic below:
      
        ------------[ cut here ]------------
        kernel BUG at mm/page_alloc.c:1239!
        invalid opcode: 0000 [#1] SMP
        ...
        Call Trace:
         [<ffffffff811300c1>] __alloc_pages_nodemask+0x131/0x7b0
         [<ffffffff81162e67>] alloc_pages_current+0x87/0xd0
         [<ffffffff81128407>] __page_cache_alloc+0x67/0x70
         [<ffffffff811325f0>] __do_page_cache_readahead+0x120/0x260
         [<ffffffff81132751>] ra_submit+0x21/0x30
         [<ffffffff811329c6>] ondemand_readahead+0x166/0x2c0
         [<ffffffff81132ba0>] page_cache_async_readahead+0x80/0xa0
         [<ffffffff8112a0e4>] generic_file_aio_read+0x364/0x670
         [<ffffffff81266cfa>] nfs_file_read+0xca/0x130
         [<ffffffff8117b20a>] do_sync_read+0xfa/0x140
         [<ffffffff8117bf75>] vfs_read+0xb5/0x1a0
         [<ffffffff8117c151>] sys_read+0x51/0x80
         [<ffffffff8103c032>] system_call_fastpath+0x16/0x1b
        RIP  [<ffffffff8112ff13>] get_page_from_freelist+0x883/0x900
         RSP <ffff88000d1e78a8>
        ---[ end trace 4bda28328b9990db ]
      
      [akpm@linux-foundation.org: merge fix]
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Reviewed-by: Andi Kleen <andi.kleen@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f522509
    • W
      mem-hotplug: separate setup_per_cpu_pageset() into separate functions · 319774e2
      Wu Fengguang committed
      No behavior change here.
      
      Move some of setup_per_cpu_pageset() code into a new function
      setup_zone_pageset() that will be useful for memory hotplug.
      Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Haicheng Li <haicheng.li@linux.intel.com>
      Reviewed-by: Andi Kleen <andi.kleen@intel.com>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      319774e2
    • K
      mm: introduce free_pages_prepare() · ec95f53a
      KOSAKI Motohiro committed
      free_hot_cold_page() and __free_pages_ok() have very similar freeing
      preparation.  Consolidate them.
      
      [akpm@linux-foundation.org: fix busted coding style]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec95f53a
    • M
      mm: compaction: defer compaction using an exponential backoff when compaction fails · 4f92e258
      Mel Gorman committed
      The fragmentation index may indicate that a failure is due to external
      fragmentation but after a compaction run completes, it is still possible
      for an allocation to fail.  There are two obvious reasons why:
      
        o Page migration cannot move all pages so fragmentation remains
        o A suitable page may exist but watermarks are not met
      
      In the event of compaction followed by an allocation failure, this patch
      defers further compaction in the zone (1 << compact_defer_shift) times.
      If the next compaction attempt also fails, compact_defer_shift is
      increased up to a maximum of 6.  If compaction succeeds, the defer
      counters are reset again.
      
      The zone that is deferred is the first zone in the zonelist - i.e.  the
      preferred zone.  To defer compaction in the other zones, the information
      would need to be stored in the zonelist or implemented similar to the
      zonelist_cache.  This would impact the fast-paths and is not justified at
      this time.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f92e258
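The exponential backoff described in the commit above can be sketched as a small per-zone state machine. This Python model is an illustration under assumptions: the class, the `considered` counter, and the method names are invented here; only `compact_defer_shift`, the `1 << compact_defer_shift` window, the maximum of 6, and the reset on success come from the changelog.

```python
COMPACT_MAX_DEFER_SHIFT = 6  # maximum shift stated in the changelog


class ZoneDeferState:
    """Hypothetical model of a zone's compaction deferral counters."""

    def __init__(self):
        self.considered = 0   # attempts seen since the last failure
        self.defer_shift = 0  # models compact_defer_shift

    def defer(self):
        """Compaction ran but the allocation still failed: back off more."""
        self.considered = 0
        self.defer_shift = min(self.defer_shift + 1, COMPACT_MAX_DEFER_SHIFT)

    def deferred(self):
        """Return True if this compaction attempt should be skipped."""
        limit = 1 << self.defer_shift
        self.considered = min(self.considered + 1, limit)
        return self.considered < limit

    def reset(self):
        """Compaction succeeded: clear the deferral counters."""
        self.considered = 0
        self.defer_shift = 0


zone = ZoneDeferState()
first_attempt = zone.deferred()
zone.defer()
after_one_failure = [zone.deferred() for _ in range(2)]
zone.defer()
after_two_failures = [zone.deferred() for _ in range(4)]
for _ in range(10):
    zone.defer()
capped_shift = zone.defer_shift
zone.reset()
```

In this model each failure doubles the window, so the zone skips 1, then 3, then 7 attempts and so on, up to the cap.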
    • M
      mm: compaction: direct compact when a high-order allocation fails · 56de7263
      Mel Gorman committed
      Ordinarily when a high-order allocation fails, direct reclaim is entered
      to free pages to satisfy the allocation.  With this patch, it is
      determined if an allocation failed due to external fragmentation instead
      of low memory and if so, the calling process will compact until a suitable
      page is freed.  Compaction by moving pages in memory is considerably
      cheaper than paging out to disk and works where there are locked pages or
      no swap.  If compaction fails to free a page of a suitable size, then
      reclaim will still occur.
      
      Direct compaction returns as soon as possible.  As each block is
      compacted, it is checked if a suitable page has been freed and if so, it
      returns.
      
      [akpm@linux-foundation.org: Fix build errors]
      [aarcange@redhat.com: fix count_vm_event preempt in memory compaction direct reclaim]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56de7263
    • M
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman committed
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      748446bb
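The two-scanner scheme described in the commit above can be shown on a toy zone. This Python sketch is a deliberate simplification and not the kernel algorithm: it works on single pages rather than pageblock-sized areas, moves one page at a time instead of building a migratelist, and uses a made-up string encoding where `M` is a movable page, `F` a free page and `U` an unmovable page.

```python
def compact(zone):
    """Move movable pages towards the end of the zone; return the new layout."""
    pages = list(zone)
    migrate = 0            # migration scanner starts at the bottom
    free = len(pages) - 1  # free scanner starts at the top
    while migrate < free:
        if pages[migrate] != "M":
            migrate += 1   # nothing to move here, scan upwards
        elif pages[free] != "F":
            free -= 1      # not a usable target, scan downwards
        else:
            # Migrate the movable page into the free page found from the top.
            pages[free], pages[migrate] = "M", "F"
            migrate += 1
            free -= 1
    return "".join(pages)


before = "MFMUFFMF"
after = compact(before)
```

For the layout used here the longest run of free pages grows from two to three, with the unmovable page left in place.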
    • D
      mm: default to node zonelist ordering when nodes have only lowmem · e325c90f
      David Rientjes committed
      There are two types of zonelist ordering methodologies:
      
       - node order, preferring allocations on a node to stay local to and
      
       - zone order, preferring allocations come from a higher zone to avoid
         allocating in lowmem zones even though they may not be local.
      
      The ordering technique used by the kernel is configurable on the command
      line, but also has some logic to determine what the default should be.
      
      This logic currently lacks knowledge of systems where a node may only have
      lowmem.  For such systems, it is necessary to use node order so that
      GFP_KERNEL allocations may be satisfied by nodes consisting of only
      lowmem.
      
      If zone order is used, GFP_KERNEL allocations to such nodes are actually
      allocated on a node with local affinity that includes ZONE_NORMAL.
      
      This change defaults to node zonelist ordering if any node lacks
      ZONE_NORMAL.
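
The rule this change adds can be sketched in isolation. The snippet below is only an illustration under assumptions: the per-node `node_has_normal` array, the enum, and `pick_default_order` are hypothetical names, and the kernel's other default-selection logic is represented by a single `fallback` argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two ordering choices. */
enum zonelist_order_sketch { ORDER_NODE, ORDER_ZONE };

/*
 * If any node lacks ZONE_NORMAL (it has only lowmem), default to node
 * order.  Otherwise keep whatever the remaining heuristics would choose,
 * which this sketch receives as 'fallback'.
 */
enum zonelist_order_sketch
pick_default_order(const bool *node_has_normal, size_t nr_nodes,
                   enum zonelist_order_sketch fallback)
{
    assert(node_has_normal != NULL || nr_nodes == 0);

    for (size_t i = 0; i < nr_nodes; i++) {
        if (!node_has_normal[i])
            return ORDER_NODE;
    }
    return fallback;
}
```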
      
      To force zone order, append 'numa_zonelist_order=zone' to the kernel
      command line.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e325c90f
    • M
      cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Miao Xie committed
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But done this way, the allocator may find that
      there is no node from which to allocate memory.
      
      The reason is that when cpuset rebinds the task's mempolicy, it clears the
      nodes on which the allocator can allocate pages, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes the problem by expanding the node range first (setting
      newly allowed bits) and shrinking it lazily (clearing newly disallowed
      bits).  A variable tells the write-side task that the read-side task is
      reading the nodemask, and the write-side task clears the newly disallowed
      nodes after the read-side task ends the current memory allocation.
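
The expand-then-shrink ordering can be illustrated with a single-threaded toy model. This is a sketch under assumptions, not the patch itself: `struct task_sketch`, its `in_alloc` flag and both helper functions are hypothetical, and real code would need proper synchronization between the two tasks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task state: bit n of mems_allowed set => node n allowed. */
struct task_sketch {
    unsigned long mems_allowed;
    bool in_alloc;   /* read side is currently walking the nodemask */
};

/* Step 1: add the newly allowed nodes; nothing is removed yet. */
void mems_expand(struct task_sketch *t, unsigned long newmems)
{
    assert(t != NULL);
    t->mems_allowed |= newmems;
}

/*
 * Step 2: drop the newly disallowed nodes, but only once no allocation is
 * in flight.  Returns true if the shrink was applied, false if it must be
 * retried later.
 */
bool mems_try_shrink(struct task_sketch *t, unsigned long newmems)
{
    assert(t != NULL);
    if (t->in_alloc)
        return false;
    t->mems_allowed &= newmems;
    return true;
}
```

Because the mask only grows while an allocation is in progress, the reader never observes an empty nodemask, which is the failure shown in the timeline above.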
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • C
      page allocator: reduce fragmentation in buddy allocator by adding buddies that... · 6dda9d55
      Corrado Zoccolo committed
      page allocator: reduce fragmentation in buddy allocator by adding buddies that are merging to the tail of the free lists
      
      In order to reduce fragmentation, this patch classifies freed pages in two
      groups according to their probability of being part of a high order merge.
       Pages belonging to a compound whose next-highest buddy is free are more
      likely to be part of a high order merge in the near future, so they will
      be added at the tail of the freelist.  The remaining pages are put at the
      front of the freelist.
      
      In this way, the pages that are more likely to cause a big merge are kept
      free longer.  Consequently there is a tendency to aggregate the
      long-living allocations on a subset of the compounds, reducing the
      fragmentation.
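
The "next-highest buddy is free" test comes down to index arithmetic on the buddy layout. The sketch below is a userspace illustration under assumptions: the `free_order` array (which records the order of a free block starting at each index, or `NOT_FREE`) and all function names are hypothetical, not the kernel's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NOT_FREE (-1)

/* Index of the buddy of the block starting at idx, at the given order. */
static size_t buddy_index(size_t idx, unsigned int order)
{
    return idx ^ ((size_t)1 << order);
}

/* Start index of the block formed if idx merged with its buddy. */
static size_t combined_index(size_t idx, unsigned int order)
{
    return idx & ~((size_t)1 << order);
}

/*
 * Decide where a block that could not merge should go on the free list.
 * If the buddy of the would-be merged (order + 1) block is already free,
 * this block is a good candidate for a larger merge soon, so it goes to
 * the tail (true).  Otherwise it goes to the front (false).
 */
bool free_to_tail(const int *free_order, size_t nr_pages,
                  size_t idx, unsigned int order)
{
    size_t higher_idx = combined_index(idx, order);
    size_t higher_buddy = buddy_index(higher_idx, order + 1);

    assert(free_order != NULL);
    if (higher_buddy >= nr_pages)
        return false;
    return free_order[higher_buddy] == (int)(order + 1);
}
```

For instance, when page 0 is freed at order 0 while pages 2-3 already form a free order-1 block, the function returns true: once page 1 is also freed, pages 0-3 can merge into an order-2 block.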
      
      This heuristic was tested on three machines, x86, x86-64 and ppc64 with
      3GB of RAM in each machine.  The tests were kernbench, netperf, sysbench
      and STREAM for performance and a high-order stress test for huge page
      allocations.
      
      KernBench X86
      Elapsed mean     374.77 ( 0.00%)   375.10 (-0.09%)
      User    mean     649.53 ( 0.00%)   650.44 (-0.14%)
      System  mean      54.75 ( 0.00%)    54.18 ( 1.05%)
      CPU     mean     187.75 ( 0.00%)   187.25 ( 0.27%)
      
      KernBench X86-64
      Elapsed mean      94.45 ( 0.00%)    94.01 ( 0.47%)
      User    mean     323.27 ( 0.00%)   322.66 ( 0.19%)
      System  mean      36.71 ( 0.00%)    36.50 ( 0.57%)
      CPU     mean     380.75 ( 0.00%)   381.75 (-0.26%)
      
      KernBench PPC64
      Elapsed mean     173.45 ( 0.00%)   173.74 (-0.17%)
      User    mean     587.99 ( 0.00%)   587.95 ( 0.01%)
      System  mean      60.60 ( 0.00%)    60.57 ( 0.05%)
      CPU     mean     373.50 ( 0.00%)   372.75 ( 0.20%)
      
      Nothing notable for kernbench.
      
      NetPerf UDP X86
            64    42.68 ( 0.00%)     42.77 ( 0.21%)
           128    85.62 ( 0.00%)     85.32 (-0.35%)
           256   170.01 ( 0.00%)    168.76 (-0.74%)
          1024   655.68 ( 0.00%)    652.33 (-0.51%)
          2048  1262.39 ( 0.00%)   1248.61 (-1.10%)
          3312  1958.41 ( 0.00%)   1944.61 (-0.71%)
          4096  2345.63 ( 0.00%)   2318.83 (-1.16%)
          8192  4132.90 ( 0.00%)   4089.50 (-1.06%)
         16384  6770.88 ( 0.00%)   6642.05 (-1.94%)*
      
      NetPerf UDP X86-64
            64   148.82 ( 0.00%)    154.92 ( 3.94%)
           128   298.96 ( 0.00%)    312.95 ( 4.47%)
           256   583.67 ( 0.00%)    626.39 ( 6.82%)
          1024  2293.18 ( 0.00%)   2371.10 ( 3.29%)
          2048  4274.16 ( 0.00%)   4396.83 ( 2.79%)
          3312  6356.94 ( 0.00%)   6571.35 ( 3.26%)
          4096  7422.68 ( 0.00%)   7635.42 ( 2.79%)*
          8192 12114.81 ( 0.00%)* 12346.88 ( 1.88%)
         16384 17022.28 ( 0.00%)* 17033.19 ( 0.06%)*
                   1.64%             2.73%
      
      NetPerf UDP PPC64
            64    49.98 ( 0.00%)     50.25 ( 0.54%)
           128    98.66 ( 0.00%)    100.95 ( 2.27%)
           256   197.33 ( 0.00%)    191.03 (-3.30%)
          1024   761.98 ( 0.00%)    785.07 ( 2.94%)
          2048  1493.50 ( 0.00%)   1510.85 ( 1.15%)
          3312  2303.95 ( 0.00%)   2271.72 (-1.42%)
          4096  2774.56 ( 0.00%)   2773.06 (-0.05%)
          8192  4918.31 ( 0.00%)   4793.59 (-2.60%)
         16384  7497.98 ( 0.00%)   7749.52 ( 3.25%)
      
      The tests are run to have confidence limits within 1%.  Results marked
      with a * were not confident although in this case, it's only outside by
      small amounts.  Even with some results that were not confident, the
      netperf UDP results were generally positive.
      
      NetPerf TCP X86
            64   652.25 ( 0.00%)*   648.12 (-0.64%)*
                  23.80%            22.82%
           128  1229.98 ( 0.00%)*  1220.56 (-0.77%)*
                  21.03%            18.90%
           256  2105.88 ( 0.00%)   1872.03 (-12.49%)*
                   1.00%            16.46%
          1024  3476.46 ( 0.00%)*  3548.28 ( 2.02%)*
                  13.37%            11.39%
          2048  4023.44 ( 0.00%)*  4231.45 ( 4.92%)*
                   9.76%            12.48%
          3312  4348.88 ( 0.00%)*  4396.96 ( 1.09%)*
                   6.49%             8.75%
          4096  4726.56 ( 0.00%)*  4877.71 ( 3.10%)*
                   9.85%             8.50%
          8192  4732.28 ( 0.00%)*  5777.77 (18.10%)*
                   9.13%            13.04%
         16384  5543.05 ( 0.00%)*  5906.24 ( 6.15%)*
                   7.73%             8.68%
      
      NETPERF TCP X86-64
                      tcp-vanilla      pgalloc-delay
            64  1895.87 ( 0.00%)*  1775.07 (-6.81%)*
                   5.79%             4.78%
           128  3571.03 ( 0.00%)*  3342.20 (-6.85%)*
                   3.68%             6.06%
           256  5097.21 ( 0.00%)*  4859.43 (-4.89%)*
                   3.02%             2.10%
          1024  8919.10 ( 0.00%)*  8892.49 (-0.30%)*
                   5.89%             6.55%
          2048 10255.46 ( 0.00%)* 10449.39 ( 1.86%)*
                   7.08%             7.44%
          3312 10839.90 ( 0.00%)* 10740.15 (-0.93%)*
                   6.87%             7.33%
          4096 10814.84 ( 0.00%)* 10766.97 (-0.44%)*
                   6.86%             8.18%
          8192 11606.89 ( 0.00%)* 11189.28 (-3.73%)*
                   7.49%             5.55%
         16384 12554.88 ( 0.00%)* 12361.22 (-1.57%)*
                   7.36%             6.49%
      
      NETPERF TCP PPC64
                      tcp-vanilla      pgalloc-delay
            64   594.17 ( 0.00%)    596.04 ( 0.31%)*
                   1.00%             2.29%
           128  1064.87 ( 0.00%)*  1074.77 ( 0.92%)*
                   1.30%             1.40%
           256  1852.46 ( 0.00%)*  1856.95 ( 0.24%)
                   1.25%             1.00%
          1024  3839.46 ( 0.00%)*  3813.05 (-0.69%)
                   1.02%             1.00%
          2048  4885.04 ( 0.00%)*  4881.97 (-0.06%)*
                   1.15%             1.04%
          3312  5506.90 ( 0.00%)   5459.72 (-0.86%)
          4096  6449.19 ( 0.00%)   6345.46 (-1.63%)
          8192  7501.17 ( 0.00%)   7508.79 ( 0.10%)
         16384  9618.65 ( 0.00%)   9490.10 (-1.35%)
      
      There was a distinct lack of confidence in the X86* figures so I included
      what the deviation was where the results were not confident.  Many of the
      results, whether gains or losses, were within the standard deviation so no
      solid conclusion can be reached on performance impact.  Looking at the
      figures, only the X86-64 ones look suspicious with a few losses that were
      outside the noise.  However, the results were so unstable that without
      knowing why they vary so much, a solid conclusion cannot be reached.
      
      SYSBENCH X86
                    sysbench-vanilla     pgalloc-delay
                 1  7722.85 ( 0.00%)  7756.79 ( 0.44%)
                 2 14901.11 ( 0.00%) 13683.44 (-8.90%)
                 3 15171.71 ( 0.00%) 14888.25 (-1.90%)
                 4 14966.98 ( 0.00%) 15029.67 ( 0.42%)
                 5 14370.47 ( 0.00%) 14865.00 ( 3.33%)
                 6 14870.33 ( 0.00%) 14845.57 (-0.17%)
                 7 14429.45 ( 0.00%) 14520.85 ( 0.63%)
                 8 14354.35 ( 0.00%) 14362.31 ( 0.06%)
      
      SYSBENCH X86-64
                 1 17448.70 ( 0.00%) 17484.41 ( 0.20%)
                 2 34276.39 ( 0.00%) 34251.00 (-0.07%)
                 3 50805.25 ( 0.00%) 50854.80 ( 0.10%)
                 4 66667.10 ( 0.00%) 66174.69 (-0.74%)
                 5 66003.91 ( 0.00%) 65685.25 (-0.49%)
                 6 64981.90 ( 0.00%) 65125.60 ( 0.22%)
                 7 64933.16 ( 0.00%) 64379.23 (-0.86%)
                 8 63353.30 ( 0.00%) 63281.22 (-0.11%)
                 9 63511.84 ( 0.00%) 63570.37 ( 0.09%)
                10 62708.27 ( 0.00%) 63166.25 ( 0.73%)
                11 62092.81 ( 0.00%) 61787.75 (-0.49%)
                12 61330.11 ( 0.00%) 61036.34 (-0.48%)
                13 61438.37 ( 0.00%) 61994.47 ( 0.90%)
                14 62304.48 ( 0.00%) 62064.90 (-0.39%)
                15 63296.48 ( 0.00%) 62875.16 (-0.67%)
                16 63951.76 ( 0.00%) 63769.09 (-0.29%)
      
      SYSBENCH PPC64
                    sysbench-vanilla     pgalloc-delay
                 1  7645.08 ( 0.00%)  7467.43 (-2.38%)
                 2 14856.67 ( 0.00%) 14558.73 (-2.05%)
                 3 21952.31 ( 0.00%) 21683.64 (-1.24%)
                 4 27946.09 ( 0.00%) 28623.29 ( 2.37%)
                 5 28045.11 ( 0.00%) 28143.69 ( 0.35%)
                 6 27477.10 ( 0.00%) 27337.45 (-0.51%)
                 7 26489.17 ( 0.00%) 26590.06 ( 0.38%)
                 8 26642.91 ( 0.00%) 25274.33 (-5.41%)
                 9 25137.27 ( 0.00%) 24810.06 (-1.32%)
                10 24451.99 ( 0.00%) 24275.85 (-0.73%)
                11 23262.20 ( 0.00%) 23674.88 ( 1.74%)
                12 24234.81 ( 0.00%) 23640.89 (-2.51%)
                13 24577.75 ( 0.00%) 24433.50 (-0.59%)
                14 25640.19 ( 0.00%) 25116.52 (-2.08%)
                15 26188.84 ( 0.00%) 26181.36 (-0.03%)
                16 26782.37 ( 0.00%) 26255.99 (-2.00%)
      
      Again, there is little to conclude here.  While there are a few losses,
      the results vary by +/- 8% in some cases.  They are the results of most
      concern as there are some large losses but it's also within the variance
      typically seen between kernel releases.
      
      The STREAM results varied so little and are so verbose that I didn't
      include them here.
      
      The final test stressed how many huge pages can be allocated.  The
      absolute number of huge pages allocated is the same with or without the
      patch.  However, the "unusability free space index" which is a measure of
      external fragmentation was slightly lower (lower is better) throughout the
      lifetime of the system.  I also measured the latency of how long it took
      to successfully allocate a huge page.  The latency was slightly lower and
      on X86 and PPC64, more huge pages were allocated almost immediately from
      the free lists.  The improvement is slight but there.
      
      [mel@csn.ul.ie: Tested, reworked for less branches]
      [czoccolo@gmail.com: fix oops by checking pfn_valid_within()]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Corrado Zoccolo <czoccolo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dda9d55
  10. 16 Mar, 2010 1 commit
  11. 13 Mar, 2010 2 commits
  12. 07 Mar, 2010 6 commits
    • D
      mm: suppress pfn range output for zones without pages · 72f0ba02
      David Rientjes committed
      free_area_init_nodes() emits pfn ranges for all zones on the system.
      There may be no pages on a higher zone, however, due to memory limitations
      or the use of the mem= kernel parameter.  For example:
      
      Zone PFN ranges:
        DMA      0x00000001 -> 0x00001000
        DMA32    0x00001000 -> 0x00100000
        Normal   0x00100000 -> 0x00100000
      
      The implementation copies the previous zone's highest pfn, if any, as the
      next zone's lowest pfn.  If its highest pfn is then greater than the
      amount of addressable memory, the upper memory limit is used instead.
      Thus, both the lowest and highest possible pfn for higher zones without
      memory may be the same.
      
      The pfn range for zones without memory is now shown as "empty" instead.
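
A minimal sketch of the resulting output logic, assuming a zone counts as empty when its lowest and highest pfn are equal. `format_zone_range` is a hypothetical helper written for this example; the format string only mirrors the sample output quoted above.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Format one line of the "Zone PFN ranges:" listing into buf.
 * A zone whose start and end pfn coincide holds no pages, so it is
 * reported as "empty" instead of a zero-length range.
 * Returns the value of snprintf().
 */
int format_zone_range(char *buf, size_t len, const char *name,
                      unsigned long start_pfn, unsigned long end_pfn)
{
    assert(buf != NULL && name != NULL);

    if (start_pfn == end_pfn)
        return snprintf(buf, len, "  %-8s empty", name);

    return snprintf(buf, len, "  %-8s 0x%08lx -> 0x%08lx",
                    name, start_pfn, end_pfn);
}
```

With the example values above, the Normal zone (0x00100000 -> 0x00100000) would be rendered as empty, while DMA and DMA32 keep their ranges.
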
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72f0ba02
    • R
      mm/pm: force GFP_NOIO during suspend/hibernation and resume · 452aa699
      Rafael J. Wysocki committed
      There are quite a few GFP_KERNEL memory allocations made during
      suspend/hibernation and resume that may cause the system to hang, because
      the I/O operations they depend on cannot be completed due to the
      underlying devices being suspended.
      
      Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
      gfp_allowed_mask before suspend/hibernation and restoring the original
      values of these bits in gfp_allowed_mask during the subsequent resume.
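
The save, clear and restore pattern described here can be sketched in userspace. Everything below is an assumption made for illustration: the `SK_GFP_*` bit values are invented and are not the kernel's, and the helper names are not taken from the patch.

```c
#include <assert.h>

typedef unsigned int gfp_sketch_t;

/* Hypothetical flag bits; illustrative values only. */
#define SK_GFP_WAIT   0x1u
#define SK_GFP_IO     0x2u
#define SK_GFP_FS     0x4u
#define SK_GFP_KERNEL (SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)

/* Global mask applied to every allocation request. */
static gfp_sketch_t allowed_mask = SK_GFP_KERNEL;

/* Clear 'bits' from the allowed mask and return the previous mask. */
gfp_sketch_t clear_allowed_bits(gfp_sketch_t bits)
{
    gfp_sketch_t old = allowed_mask;

    allowed_mask &= ~bits;
    return old;
}

/* Put back a mask previously returned by clear_allowed_bits(). */
void restore_allowed_mask(gfp_sketch_t saved)
{
    allowed_mask = saved;
}

/* What an allocation request is reduced to under the current mask. */
gfp_sketch_t effective_gfp(gfp_sketch_t requested)
{
    return requested & allowed_mask;
}
```

While the I/O and filesystem bits are cleared, a request made with the full flag set is silently reduced to one that cannot start I/O, which is the effect the commit title describes as forcing GFP_NOIO.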
      
      [akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      452aa699
    • K
      mm: restore zone->all_unreclaimable to independence word · 93e4a89a
      KOSAKI Motohiro committed
      commit e815af95 ("change all_unreclaimable zone member to flags") changed
      the all_unreclaimable member to a bit flag.  But it had an undesirable side
      effect.  free_one_page() is one of the hottest paths in the Linux kernel
      and increasing atomic ops in it can reduce kernel performance a bit.
      
      Thus, this patch partially reverts that commit.  At least
      all_unreclaimable shouldn't share a memory word with other zone flags.
      
      [akpm@linux-foundation.org: fix patch interaction]
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Huang Shijie <shijie8@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93e4a89a
    • L
      mm: remove free_hot_page() · fc91668e
      Li Hong committed
      free_hot_page() is just a wrapper around free_hot_cold_page() with
      parameter 'cold = 0'.  After adding a clear comment for
      free_hot_cold_page(), it is reasonable to remove a level of call.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fc91668e
    • L
      mm/page_alloc.c: adjust a call site to trace_mm_page_free_direct · c475dab6
      Li Hong committed
      Move a call of trace_mm_page_free_direct() from free_hot_page() to
      free_hot_cold_page().  It is clearer and closer to kmemcheck_free_shadow(),
      as is done in function __free_pages_ok().
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c475dab6
    • L
      mm/page_alloc.c: remove duplicate call to trace_mm_page_free_direct · f650316c
      Li Hong committed
      trace_mm_page_free_direct() is called in function __free_pages().  But it
      is called again in free_hot_page() if order == 0, producing duplicate
      records in the trace file for the mm_page_free_direct event, as below:
      
      K-PID    CPU#    TIMESTAMP  FUNCTION
        gnome-terminal-1567  [000]  4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
        gnome-terminal-1567  [000]  4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
      
      This patch removes the first call and adds a call to
      trace_mm_page_free_direct() in __free_pages_ok().
      Signed-off-by: Li Hong <lihong.hi@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Li Ming Chun <macli@brc.ubc.ca>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f650316c
  13. 22 Feb, 2010 1 commit
    • Y
      x86: Fix non-bootmem compilation on PowerPC · 2ee78f7b
      Yinghai Lu committed
      These build errors occur on some non-x86 platforms (PowerPC for example):
      
       mm/page_alloc.c: In function '__alloc_memory_core_early':
         mm/page_alloc.c:3468: error: implicit declaration of function 'find_early_area'
         mm/page_alloc.c:3483: error: implicit declaration of function 'reserve_early_without_check'
      
      The function is only needed on CONFIG_NO_BOOTMEM.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      LKML-Reference: <4B747239.4070907@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2ee78f7b
  14. 13 Feb, 2010 1 commit
  15. 30 Jan, 2010 1 commit
    • H
      mm: fix migratetype bug which slowed swapping · a7016235
      Hugh Dickins committed
      After memory pressure has forced it to dip into the reserves, 2.6.32's
      5f8dcc21 "page-allocator: split per-cpu list into
      one-list-per-migrate-type" has been returning MIGRATE_RESERVE pages to
      the MIGRATE_MOVABLE free_list: in some sense depleting reserves.
      
      Fix that in the most straightforward way (which, considering the overheads
      of alternative approaches, is Mel's preference): the right migratetype is
      already in page_private(page), but free_pcppages_bulk() wasn't using it.
      
      How did this bug show up?  As a 20% slowdown in my tmpfs loop kbuild
      swapping tests, on PowerMac G5 with SLUB allocator.  Bisecting to that
      commit was easy, but explaining the magnitude of the slowdown not easy.
      
      The same effect appears, but much less markedly, with SLAB, and even
      less markedly on other machines (the PowerMac divides into fewer zones
      than x86, I think that may be a factor).  We guess that lumpy reclaim
      of short-lived high-order pages is implicated in some way, and probably
      this bug has been tickling a poor decision somewhere in page reclaim.
      
      But instrumentation hasn't told me much, I've run out of time and
      imagination to determine exactly what's going on, and shouldn't hold up
      the fix any longer: it's valid, and might even fix other misbehaviours.
      Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a7016235
  16. 17 Jan, 2010 1 commit