1. 28 4月, 2008 3 次提交
    • M
      mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman 提交于
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18ea7e71
    • M
      mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman 提交于
      Introduce a node_zonelist() helper function.  It is used to lookup the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e88460d
    • M
      mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman 提交于
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset, the
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered as an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE in 2.6.23 to the previously discussed hack that filters
      only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The forth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64 both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dac1d27b
  2. 20 4月, 2008 1 次提交
    • M
      nodemask: use new node_to_cpumask_ptr function · c5f59f08
      Mike Travis 提交于
        * Use new node_to_cpumask_ptr.  This creates a pointer to the
          cpumask for a given node.  This definition is in mm patch:
      
      	asm-generic-add-node_to_cpumask_ptr-macro.patch
      
        * Use new set_cpus_allowed_ptr function.
      
      Depends on:
      	[mm-patch]: asm-generic-add-node_to_cpumask_ptr-macro.patch
      	[sched-devel]: sched: add new set_cpus_allowed_ptr function
      	[x86/latest]: x86: add cpus_scnprintf function
      
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: Greg Banks <gnb@melbourne.sgi.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NMike Travis <travis@sgi.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c5f59f08
  3. 05 3月, 2008 2 次提交
  4. 24 2月, 2008 1 次提交
    • A
      Solve section mismatch for free_area_init_core. · b5a0e011
      Alexander van Heukelum 提交于
      WARNING: vmlinux.o(.meminit.text+0x649):
      Section mismatch in reference from the
      function free_area_init_core() to the function .init.text:setup_usemap()
      The function __meminit free_area_init_core() references
      a function __init setup_usemap().
      If free_area_init_core is only used by setup_usemap then
      annotate free_area_init_core with a matching annotation.
      
      The warning is covers this stack of functions in mm/page_alloc.c:
      
      alloc_bootmem_node must be marked __init.
      alloc_bootmem_node is used by setup_usemap, if !SPARSEMEM.
      (usemap_size is only used by setup_usemap, if !SPARSEMEM.)
      setup_usemap is only used by free_area_init_core.
      free_area_init_core is only used by free_area_init_node.
      
      free_area_init_node is used by:
      arch/alpha/mm/numa.c: __init paging_init()
      arch/arm/mm/init.c: __init bootmem_init_node()
      arch/avr32/mm/init.c: __init paging_init()
      arch/cris/arch-v10/mm/init.c: __init paging_init()
      arch/cris/arch-v32/mm/init.c: __init paging_init()
      arch/m32r/mm/discontig.c: __init zone_sizes_init()
      arch/m32r/mm/init.c: __init zone_sizes_init()
      arch/m68k/mm/motorola.c: __init paging_init()
      arch/m68k/mm/sun3mmu.c: __init paging_init()
      arch/mips/sgi-ip27/ip27-memory.c: __init paging_init()
      arch/parisc/mm/init.c: __init paging_init()
      arch/sparc/mm/srmmu.c: __init srmmu_paging_init()
      arch/sparc/mm/sun4c.c: __init sun4c_paging_init()
      arch/sparc64/mm/init.c: __init paging_init()
      mm/page_alloc.c: __init free_area_init_nodes()
      mm/page_alloc.c: __init free_area_init()
      and
      mm/memory_hotplug.c: hotadd_new_pgdat()
      
      hotadd_new_pgdat can not be an __init function, but:
      
      It is compiled for MEMORY_HOTPLUG configurations only
      MEMORY_HOTPLUG depends on SPARSEMEM || X86_64_ACPI_NUMA
      X86_64_ACPI_NUMA depends on X86_64
      ARCH_FLATMEM_ENABLE depends on X86_32
      ARCH_DISCONTIGMEM_ENABLE depends on X86_32
      So X86_64_ACPI_NUMA implies SPARSEMEM, right?
      
      So we can mark the stack of functions __init for !SPARSEMEM, but we must mark
      them __meminit for SPARSEMEM configurations.  This is ok, because then the
      calls to alloc_bootmem_node are also avoided.
      
      Compile-tested on:
      silly minimal config
      defconfig x86_32
      defconfig x86_64
      defconfig x86_64 -HIBERNATION +MEMORY_HOTPLUG
      Signed-off-by: NAlexander van Heukelum <heukelum@fastmail.fm>
      Reviewed-by: NSam Ravnborg <sam@ravnborg.org>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5a0e011
  5. 09 2月, 2008 1 次提交
  6. 08 2月, 2008 1 次提交
    • B
      Memory controller: memory accounting · 8a9f3ccd
      Balbir Singh 提交于
      Add the accounting hooks.  The accounting is carried out for RSS and Page
      Cache (unmapped) pages.  There is now a common limit and accounting for both.
      The RSS accounting is accounted at page_add_*_rmap() and page_remove_rmap()
      time.  Page cache is accounted at add_to_page_cache(),
      __delete_from_page_cache().  Swap cache is also accounted for.
      
      Each page's page_cgroup is protected with the last bit of the
      page_cgroup pointer, this makes handling of race conditions involving
      simultaneous mappings of a page easier.  A reference count is kept in the
      page_cgroup to deal with cases where a page might be unmapped from the RSS
      of all tasks, but still lives in the page cache.
      
      Credits go to Vaidyanathan Srinivasan for helping with reference counting work
      of the page cgroup.  Almost all of the page cache accounting code has help
      from Vaidyanathan Srinivasan.
      
      [hugh@veritas.com: fix swapoff breakage]
      [akpm@linux-foundation.org: fix locking]
      Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Kirill Korotaev <dev@sw.ru>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <Valdis.Kletnieks@vt.edu>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a9f3ccd
  7. 06 2月, 2008 5 次提交
  8. 18 1月, 2008 1 次提交
  9. 09 1月, 2008 1 次提交
  10. 18 12月, 2007 1 次提交
    • M
      mm: fix page allocation for larger I/O segments · 81eabcbe
      Mel Gorman 提交于
      In some cases the IO subsystem is able to merge requests if the pages are
      adjacent in physical memory.  This was achieved in the allocator by having
      expand() return pages in physically contiguous order in situations were a
      large buddy was split.  However, list-based anti-fragmentation changed the
      order pages were returned in to avoid searching in buffered_rmqueue() for a
      page of the appropriate migrate type.
      
      This patch restores behaviour of rmqueue_bulk() preserving the physical
      order of pages returned by the allocator without incurring increased search
      costs for anti-fragmentation.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Mark Lord <mlord@pobox.com
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81eabcbe
  11. 30 11月, 2007 1 次提交
    • M
      Fix boot problem with iSeries lacking hugepage support · ba72cb8c
      Mel Gorman 提交于
      Ordinarily the size of a pageblock is determined at compile-time based on the
      hugepage size. On PPC64, the hugepage size is determined at runtime based on
      what is supported by the machine. With legacy machines such as iSeries that
      do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
      being set to -PAGE_SHIFT and a crash results shortly afterwards.
      
      This patch adds a function to select a sensible value for pageblock order by
      default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
      is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
      is used.
      
      This is a fix for 2.6.24.
      
      Credit goes to Stephen Rothwell for identifying the bug and testing candidate
      patches.  Additional credit goes to Andy Whitcroft for spotting a problem
      with respects to IA-64 before releasing. Additional credit to David Gibson
      for testing with the libhugetlbfs test suite.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Tested-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba72cb8c
  12. 29 11月, 2007 1 次提交
  13. 13 11月, 2007 1 次提交
  14. 20 10月, 2007 1 次提交
  15. 17 10月, 2007 19 次提交
    • D
      oom: serialize out of memory calls · ff0ceb9d
      David Rientjes 提交于
      A final allocation attempt with a very high watermark needs to be attempted
      before invoking out_of_memory().  OOM killer serialization needs to occur
      before this final attempt, otherwise tasks attempting to OOM-lock all zones in
      its zonelist may spin and acquire the lock unnecessarily after the OOM
      condition has already been alleviated.
      
      If the final allocation does succeed, the zonelist is simply OOM-unlocked and
      __alloc_pages() returns the page.  Otherwise, the OOM killer is invoked.
      
      If the task cannot acquire OOM-locks on all zones in its zonelist, it is put
      to sleep and the allocation is retried when it gets rescheduled.  One of its
      zones is already marked as being in the OOM killer so it'll hopefully be
      getting some free memory soon, at least enough to satisfy a high watermark
      allocation attempt.  This prevents needlessly killing a task when the OOM
      condition would have already been alleviated if it had simply been given
      enough time.
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff0ceb9d
    • D
      oom: change all_unreclaimable zone member to flags · e815af95
      David Rientjes 提交于
      Convert the int all_unreclaimable member of struct zone to unsigned long
      flags.  This can now be used to specify several different zone flags such as
      all_unreclaimable and reclaim_in_progress, which can now be removed and
      converted to a per-zone flag.
      
      Flags are set and cleared as follows:
      
      	zone_set_flag(struct zone *zone, zone_flags_t flag)
      	zone_clear_flag(struct zone *zone, zone_flags_t flag)
      
      Defines the first zone flags, ZONE_ALL_UNRECLAIMABLE and ZONE_RECLAIM_LOCKED,
      which have the same semantics as the old zone->all_unreclaimable and
      zone->reclaim_in_progress, respectively.  Also converts all current users that
      set or clear either flag to use the new interface.
      
      Helper functions are defined to test the flags:
      
      	int zone_is_all_unreclaimable(const struct zone *zone)
      	int zone_is_reclaim_locked(const struct zone *zone)
      
      All flag operators are of the atomic variety because there are currently
      readers that are implemented that do not take zone->lock.
      
      [akpm@linux-foundation.org: add needed include]
      Cc: Andrea Arcangeli <andrea@suse.de>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e815af95
    • D
      oom: move prototypes to appropriate header file · 5a3135c2
      David Rientjes 提交于
      Move the OOM killer's extern function prototypes to include/linux/oom.h and
      include it where necessary.
      
      [clg@fr.ibm.com: build fix]
      Cc: Andrea Arcangeli <andrea@suse.de>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NCedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a3135c2
    • K
      memory unplug: page offline · 0c0e6195
      KAMEZAWA Hiroyuki 提交于
      Logic.
       - set all pages in  [start,end)  as isolated migration-type.
         by this, all free pages in the range will be not-for-use.
       - Migrate all LRU pages in the range.
       - Test all pages in the range's refcnt is zero or not.
      
      Todo:
       - allocate migration destination page from better area.
       - confirm page_count(page)== 0 && PageReserved(page) page is safe to be freed..
       (I don't like this kind of page but..
       - Find out pages which cannot be migrated.
       - more running tests.
       - Use reclaim for unplugging other memory type area.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c0e6195
    • K
      memory unplug: page isolation · a5d76b54
      KAMEZAWA Hiroyuki 提交于
      Implement generic chunk-of-pages isolation method by using page grouping ops.
      
      This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
       - MIGRATE_TYPES increases.
       - bitmap for migratetype is enlarged.
      
      pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
      By this, you can isolated *freed* pages from users. How-to-free pages is not
      a purpose of this patch. You may use reclaim and migrate codes to free pages.
      
      If start_isolate_page_range(start,end) is called,
       - migratetype of the range turns to be MIGRATE_ISOLATE  if
         its type is MIGRATE_MOVABLE. (*) this check can be updated if other
         memory reclaiming works make progress.
       - MIGRATE_ISOLATE is not on migratetype fallback list.
       - All free pages and will-be-freed pages are isolated.
      To check all pages in the range are isolated or not,  use test_pages_isolated(),
      To cancel isolation, use undo_isolate_page_range().
      
      Changes V6 -> V7
       - removed unnecessary #ifdef
      
      There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5d76b54
    • M
      Breakout page_order() to internal.h to avoid special knowledge of the buddy allocator · 48f13bf3
      Mel Gorman 提交于
      The statistics patch later needs to know what order a free page is on the free
      lists.  Rather than having special knowledge of page_private() when
      PageBuddy() is set, this patch places out page_order() in internal.h and adds
      a VM_BUG_ON to catch using it on non-PageBuddy pages.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      48f13bf3
    • A
      mm/page_alloc.c: make code static · 484f51f8
      Adrian Bunk 提交于
      This patch makes needlessly global code static.
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      484f51f8
    • M
      Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo · 467c996c
      Mel Gorman 提交于
      This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
       The information is collected only on request so there is no runtime overhead.
       The statistics are in three parts:
      
      The first part prints information on the size of blocks that pages are
      being grouped on and looks like
      
      Page block order: 10
      Pages per block:  1024
      
      The second part is a more detailed version of /proc/buddyinfo and looks like
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
      Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
      Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
      Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4
      
      The third part looks like
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
      Node 0, zone      DMA            0            1            2            1
      Node 0, zone   Normal            3           17           94            4
      
      To walk the zones within a node with interrupts disabled, walk_zones_in_node()
      is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
      /proc/pagetypeinfo to reduce code duplication.  It seems specific to what
      vmstat.c requires but could be broken out as a general utility function in
      mmzone.c if there were other other potential users.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      467c996c
    • M
      Do not depend on MAX_ORDER when grouping pages by mobility · d9c23400
      Mel Gorman 提交于
      Currently mobility grouping works at the MAX_ORDER_NR_PAGES level.  This makes
      sense for the majority of users where this is also the huge page size.
      However, on platforms like ia64 where the huge page size is runtime
      configurable it is desirable to group at a lower order.  On x86_64 and
      occasionally on x86, the hugepage size may not always be MAX_ORDER_NR_PAGES.
      
      This patch groups pages together based on the value of HUGETLB_PAGE_ORDER.  It
      uses a compile-time constant if possible and a variable where the huge page
      size is runtime configurable.
      
      It is assumed that grouping should be done at the lowest sensible order and
      that the user would not want to override this.  If this is not true,
      page_block order could be forced to a variable initialised via a boot-time
      kernel parameter.
      
      One potential issue with this patch is that IA64 now parses hugepagesz with
      early_param() instead of __setup().  __setup() is called after the memory
      allocator has been initialised and the pageblock bitmaps already setup.  In
      tests on one IA64 there did not seem to be any problem with using
      early_param() and in fact may be more correct as it guarantees the parameter
      is handled before the parsing of hugepages=.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9c23400
    • M
      Fix calculation in move_freepages_block for counting pages · d100313f
      Mel Gorman 提交于
      move_freepages_block() returns the number of blocks moved.  This value is used
      to determine if a block of pages should be stolen for the exclusive use of a
      migrate type or not.  However, the value returned is being used correctly.
      This patch fixes the calculation to return the number of base pages that have
      been moved.
      
      This should be considered a fix to the patch
      move-free-pages-between-lists-on-steal.patch
      
      Credit to Andy Whitcroft for spotting the problem.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d100313f
    • M
      don't group high order atomic allocations · 64c5e135
      Mel Gorman 提交于
      Grouping high-order atomic allocations together was intended to allow
      bursty users of atomic allocations to work such as e1000 in situations
      where their preallocated buffers were depleted.  This did not work in at
      least one case with a wireless network adapter needing order-1 allocations
      frequently.  To resolve that, the free pages used for min_free_kbytes were
      moved to separate contiguous blocks with the patch
      bias-the-location-of-pages-freed-for-min_free_kbytes-in-the-same-max_order_nr_pages-blocks.
      
      It is felt that keeping the free pages in the same contiguous blocks should
      be sufficient for bursty short-lived high-order atomic allocations to
      succeed, maybe even with the e1000.  Even if there is a failure, increasing
      the value of min_free_kbytes will free pages as contiguous bloks in
      contrast to the standard buddy allocator which makes no attempt to keep the
      minimum number of free pages contiguous.
      
      This patch backs out grouping high order atomic allocations together to
      determine if it is really needed or not.  If a new report comes in about
      high-order atomic allocations failing, the feature can be reintroduced to
      determine if it fixes the problem or not.  As a side-effect, this patch
      reduces by 1 the number of bits required to track the mobility type of
      pages within a MAX_ORDER_NR_PAGES block.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64c5e135
    • M
      remove PAGE_GROUP_BY_MOBILITY · ac0e5b7a
      Mel Gorman 提交于
      Grouping pages by mobility can be disabled at compile-time. This was
      considered undesirable by a number of people. However, in the current stack of
      patches, it is not a simple case of just dropping the configurable patch as it
      would cause merge conflicts.  This patch backs out the configuration option.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac0e5b7a
    • M
      Bias the location of pages freed for min_free_kbytes in the same MAX_ORDER_NR_PAGES blocks · 56fd56b8
      Mel Gorman 提交于
      The standard buddy allocator always favours the smallest block of pages.
      The effect of this is that the pages free to satisfy min_free_kbytes tends
      to be preserved since boot time at the same location of memory ffor a very
      long time and as a contiguous block.  When an administrator sets the
      reserve at 16384 at boot time, it tends to be the same MAX_ORDER blocks
      that remain free.  This allows the occasional high atomic allocation to
      succeed up until the point the blocks are split.  In practice, it is
      difficult to split these blocks but when they do split, the benefit of
      having min_free_kbytes for contiguous blocks disappears.  Additionally,
      increasing min_free_kbytes once the system has been running for some time
      has no guarantee of creating contiguous blocks.
      
      On the other hand, CONFIG_PAGE_GROUP_BY_MOBILITY favours splitting large
      blocks when there are no free pages of the appropriate type available.  A
      side-effect of this is that all blocks in memory tends to be used up and
      the contiguous free blocks from boot time are not preserved like in the
      vanilla allocator.  This can cause a problem if a new caller is unwilling
      to reclaim or does not reclaim for long enough.
      
      A failure scenario was found for a wireless network device allocating
      order-1 atomic allocations but the allocations were not intense or frequent
      enough for a whole block of pages to be preserved for MIGRATE_HIGHALLOC.
      This was reproduced on a desktop by booting with mem=256mb, forcing the
      driver to allocate at order-1, running a bittorrent client (downloading a
      debian ISO) and building a kernel with -j2.
      
      This patch addresses the problem on the desktop machine booted with
      mem=256mb.  It works by setting aside a reserve of MAX_ORDER_NR_PAGES
      blocks, the number of which depends on the value of min_free_kbytes.  These
      blocks are only fallen back to when there is no other free pages.  Then the
      smallest possible page is used just like the normal buddy allocator instead
      of the largest possible page to preserve contiguous pages The pages in free
      lists in the reserve blocks are never taken for another migrate type.  The
      results is that even if min_free_kbytes is set to a low value, contiguous
      blocks will be preserved in the MIGRATE_RESERVE blocks.
      
      This works better than the vanilla allocator because if min_free_kbytes is
      increased, a new reserve block will be chosen based on the location of
      reclaimable pages and the block will free up as contiguous pages.  In the
      vanilla allocator, no effort is made to target a block of pages to free as
      contiguous pages and min_free_kbytes pages are scattered randomly.
      
      This effect has been observed on the test machine.  min_free_kbytes was set
      initially low but it was kept as a contiguous free block within
      MIGRATE_RESERVE.  min_free_kbytes was then set to a higher value and over a
      period of time, the free blocks were within the reserve and coalescing.
      How long it takes to free up depends on how quickly LRU is rotating.
      Amusingly, this means that more activity will free the blocks faster.
      
      This mechanism potentially replaces MIGRATE_HIGHALLOC as it may be more
      effective than grouping contiguous free pages together.  It all depends on
      whether the number of active atomic high allocations exceeds
      min_free_kbytes or not.  If the number of active allocations exceeds
      min_free_kbytes, it's worth it but maybe in that situation, min_free_kbytes
      should be set higher.  Once there are no more reports of allocation
      failures, a patch will be submitted that backs out MIGRATE_HIGHALLOC and
      see if the reports stay missing.
      
      Credit to Mariusz Kozlowski for discovering the problem, describing the
      failure scenario and testing patches and scenarios.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56fd56b8
    • M
      Be more agressive about stealing when MIGRATE_RECLAIMABLE allocations fallback · 46dafbca
      Mel Gorman 提交于
      MIGRATE_RECLAIMABLE allocations tend to be very bursty in nature like when
      updatedb starts.  It is likely this will occur in situations where MAX_ORDER
      blocks of pages are not free.  This means that updatedb can scatter
      MIGRATE_RECLAIMABLE pages throughout the address space.  This patch is more
      agressive about stealing blocks of pages for MIGRATE_RECLAIMABLE.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46dafbca
    • M
      Bias the placement of kernel pages at lower PFNs · 5adc5be7
      Mel Gorman 提交于
      This patch chooses blocks with lower PFNs when placing kernel allocations.
      This is particularly important during fallback in low memory situations to
      stop unmovable pages being placed throughout the entire address space.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5adc5be7
    • M
      Do not group pages by mobility type on low memory systems · 9ef9acb0
      Mel Gorman 提交于
      Grouping pages by mobility can only successfully operate when there are more
      MAX_ORDER_NR_PAGES areas than mobility types.  When there are insufficient
      areas, fallbacks cannot be avoided.  This has noticeable performance impacts
      on machines with small amounts of memory in comparison to MAX_ORDER_NR_PAGES.
      For example, on IA64 with a configuration including huge pages spans 1GiB with
      MAX_ORDER_NR_PAGES so would need at least 4GiB of RAM before grouping pages by
      mobility would be useful.  In comparison, an x86 would need 16MB.
      
      This patch checks the size of vm_total_pages in build_all_zonelists(). If
      there are not enough areas,  mobility is effectivly disabled by considering
      all allocations as the same type (UNMOVABLE).  This is achived via a
      __read_mostly flag.
      
      With this patch, performance is comparable to disabling grouping pages
      by mobility at compile-time on a test machine with insufficient memory.
      With this patch, it is reasonable to get rid of grouping pages by mobility
      a compile-time option.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ef9acb0
    • M
      Group high-order atomic allocations · e010487d
      Mel Gorman 提交于
      In rare cases, the kernel needs to allocate a high-order block of pages
      without sleeping.  For example, this is the case with e1000 cards configured
      to use jumbo frames.  Migrating or reclaiming pages in this situation is not
      an option.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_HIGHATOMIC type are exactly what they sound
      like.  Care is taken that pages of other migrate types do not use the same
      blocks as high-order atomic allocations.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e010487d
    • M
      Group short-lived and reclaimable kernel allocations · e12ba74d
      Mel Gorman 提交于
      This patch marks a number of allocations that are either short-lived such as
      network buffers or are reclaimable such as inode allocations.  When something
      like updatedb is called, long-lived and unmovable kernel allocations tend to
      be spread throughout the address space which increases fragmentation.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
      reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
      them and re-reading the information from elsewhere.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e12ba74d
    • M
      Move free pages between lists on steal · c361be55
      Mel Gorman 提交于
      When a fallback occurs, there will be free pages for one allocation type
      stored on the list for another.  When a large steal occurs, this patch will
      move all the free pages within one list to the other.
      
      [y-goto@jp.fujitsu.com: fix BUG_ON check at move_freepages()]
      [apw@shadowen.org: Move to using pfn_valid_within()]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: NAndy Whitcroft <andyw@uk.ibm.com>
      Cc: Bob Picco <bob.picco@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c361be55