1. 28 4月, 2008 10 次提交
    • P
      mm/page_alloc.c: remove hand-coded get_order() · 2309f9e6
      Pavel Machek 提交于
      Remove hand-coded get_order() from page_alloc.c.
      Signed-off-by: NPavel Machek <pavel@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2309f9e6
    • Y
      memory hotplug: free memmaps allocated by bootmem · 0c0a4a51
      Yasunori Goto 提交于
      This patch is to free memmaps which is allocated by bootmem.
      
      Freeing usemap is not necessary.  The pages of usemap may be necessary for
      other sections.
      
      If removing section is last section on the node, its section is the final user
      of usemap page.  (usemaps are allocated on its section by previous patch.) But
      it shouldn't be freed too, because the section must be logical offline state
      which all pages are isolated against page allocater.  If it is freed, page
      alloctor may use it which will be removed physically soon.  It will be
      disaster.  So, this patch keeps it as it is.
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Yinghai Lu <yhlu.kernel@gmail.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c0a4a51
    • C
      pageflags: eliminate PG_xxx aliases · 0a128b2b
      Christoph Lameter 提交于
      Remove aliases of PG_xxx.  We can easily drop those now and alias by
      specifying the PG_xxx flag in the macro that generates the functions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a128b2b
    • S
      mm/page_alloc.c: fix indentation · f05111f5
      S.Caglar Onur 提交于
      zlc_setup(): handle jiffies wraparound
      (10ed273f) changes tab with spaces
      Signed-off-by: NS.Caglar Onur <caglar@pardus.org.tr>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f05111f5
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • M
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman 提交于
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • M
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman 提交于
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • M
      mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman 提交于
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18ea7e71
    • M
      mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman 提交于
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e88460d
    • M
      mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman 提交于
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The forth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dac1d27b
  2. 20 4月, 2008 1 次提交
    • M
      nodemask: use new node_to_cpumask_ptr function · c5f59f08
      Mike Travis 提交于
        * Use new node_to_cpumask_ptr.  This creates a pointer to the
          cpumask for a given node.  This definition is in mm patch:
      
      	asm-generic-add-node_to_cpumask_ptr-macro.patch
      
        * Use new set_cpus_allowed_ptr function.
      
      Depends on:
      	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
      	[sched-devel]: sched: add new set_cpus_allowed_ptr function
      	[x86/latest]: x86: add cpus_scnprintf function
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Greg Banks <gnb@melbourne.sgi.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NMike Travis <travis@sgi.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c5f59f08
  3. 05 3月, 2008 2 次提交
  4. 24 2月, 2008 1 次提交
    • A
      Solve section mismatch for free_area_init_core. · b5a0e011
      Alexander van Heukelum 提交于
      WARNING: vmlinux.o(.meminit.text+0x649):
      Section mismatch in reference from the
      function free_area_init_core() to the function .init.text:setup_usemap()
      The function __meminit free_area_init_core() references
      a function __init setup_usemap().
      If free_area_init_core is only used by setup_usemap then
      annotate free_area_init_core with a matching annotation.
      
      The warning is covers this stack of functions in mm/page_alloc.c:
      
      alloc_bootmem_node must be marked __init.
      alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
      (usemap_size is only used by setup_usemap, if !SPARSEMEM.)
      setup_usemap is only used by free_area_init_core.
      free_area_init_core is only used by free_area_init_node.
      
      free_area_init_node is used by:
      arch/alpha/mm/numa.c: __init paging_init()
      arch/arm/mm/init.c: __init bootmem_init_node()
      arch/avr32/mm/init.c: __init paging_init()
      arch/cris/arch-v10/mm/init.c: __init paging_init()
      arch/cris/arch-v32/mm/init.c: __init paging_init()
      arch/m32r/mm/discontig.c: __init zone_sizes_init()
      arch/m32r/mm/init.c: __init zone_sizes_init()
      arch/m68k/mm/motorola.c: __init paging_init()
      arch/m68k/mm/sun3mmu.c: __init paging_init()
      arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
      arch/parisc/mm/init.c: __init paging_init()
      arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
      arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
      arch/sparc64/mm/init.c: __init paging_init()
      mm/page_alloc.c: __init free_area_init_nodes()
      mm/page_alloc.c: __init free_area_init()
      and
      mm/memory_hotplug.c: hotadd_new_pgdat()
      
      hotadd_new_pgdat can not be an __init function, but:
      
      It is compiled for MEMORY_HOTPLUG configurations only
      MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
      X86_64_ACPI_NUMA depends on X86_64
      ARCH_FLATMEM_ENABLE depends on X86_32
      ARCH_DISCONTIGMEM_ENABLE depends on X86_32
      So X86_64_ACPI_NUMA implies SPARSEMEM, right?
      
      So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
      them __meminit for SPARSEMEM configurations.  This is ok, because then the
      calls to alloc_bootmem_node are also avoided.
      
      Compile-tested on:
      silly minimal config
      defconfig x86_32
      defconfig x86_64
      defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG
      Signed-off-by: NAlexander van Heukelum <heukelum@fastmail.fm>
      Reviewed-by: NSam Ravnborg <sam@ravnborg.org>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5a0e011
  5. 09 2月, 2008 1 次提交
  6. 08 2月, 2008 1 次提交
    • B
      Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh 提交于
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a9f3ccd
  7. 06 2月, 2008 5 次提交
  8. 18 1月, 2008 1 次提交
  9. 09 1月, 2008 1 次提交
  10. 18 12月, 2007 1 次提交
    • M
      mm: fix page allocation for larger I/O segments · 81eabcbe
      Mel Gorman 提交于
      In some cases the IO subsystem is able to merge requests if the pages are
      adjacent in physical memory.  This was achieved in the allocator by having
      expand() return pages in physically contiguous order in situations were a
      large buddy was split.  However, list-based anti-fragmentation changed the
      order pages were returned in to avoid searching in buffered_rmqueue() for a
      page of the appropriate migrate type.
      
      This patch restores behaviour of rmqueue_bulk() preserving the physical
      order of pages returned by the allocator without incurring increased search
      costs for anti-fragmentation.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Mark Lord <mlord@pobox.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81eabcbe
  11. 30 11月, 2007 1 次提交
    • M
      Fix boot problem with iSeries lacking hugepage support · ba72cb8c
      Mel Gorman 提交于
      Ordinarily the size of a pageblock is determined at compile-time based on the
      hugepage size. On PPC64, the hugepage size is determined at runtime based on
      what is supported by the machine. With legacy machines such as iSeries that
      do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
      being set to -PAGE_SHIFT and a crash results shortly afterwards.
      
      This patch adds a function to select a sensible value for pageblock order by
      default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
      is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
      is used.
      
      This is a fix for 2.6.24.
      
      Credit goes to Stephen Rothwell for identifying the bug and testing candidate
      patches.  Additional credit goes to Andy Whitcroft for spotting a problem
      with respects to IA-64 before releasing. Additional credit to David Gibson
      for testing with the libhugetlbfs test suite.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba72cb8c
  12. 29 11月, 2007 1 次提交
  13. 13 11月, 2007 1 次提交
  14. 20 10月, 2007 1 次提交
  15. 17 10月, 2007 12 次提交