1. Dec 07, 2010 · 1 commit
    • PM / Hibernate: Fix memory corruption related to swap · c9e664f1
      Committed by Rafael J. Wysocki
      There is a problem that swap pages allocated before the creation of
      a hibernation image can be released and used for storing the contents
      of different memory pages while the image is being saved.  Since the
      kernel stored in the image doesn't know of that, it causes memory
      corruption to occur after resume from hibernation, especially on
      systems with relatively small RAM that need to swap often.
      
      This issue can be addressed by keeping the GFP_IOFS bits clear
      in gfp_allowed_mask during the entire hibernation, including the
      saving of the image, until the system is finally turned off or
      the hibernation is aborted.  Unfortunately, for this purpose
      it's necessary to rework the way in which the hibernate and
      suspend code manipulates gfp_allowed_mask.
      
      This change is based on an earlier patch from Hugh Dickins.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Reported-by: Ondrej Zary <linux@rainbow-software.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: stable@kernel.org
      c9e664f1
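      A minimal sketch of the approach described above; the helper names and
      placement are assumptions based on the commit message, not a verbatim copy
      of the patch:

          /* Keep GFP_IOFS clear for the whole hibernation window. */
          static gfp_t saved_gfp_mask;

          void pm_restrict_gfp_mask(void)
          {
                  saved_gfp_mask = gfp_allowed_mask;
                  gfp_allowed_mask &= ~GFP_IOFS;  /* no __GFP_IO / __GFP_FS allocations */
          }

          void pm_restore_gfp_mask(void)
          {
                  if (saved_gfp_mask) {           /* restore only after a restrict */
                          gfp_allowed_mask = saved_gfp_mask;
                          saved_gfp_mask = 0;
                  }
          }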
  2. Oct 27, 2010 · 1 commit
  3. May 25, 2010 · 2 commits
  4. Mar 07, 2010 · 3 commits
  5. Sep 23, 2009 · 1 commit
    • BUILD_BUG_ON(): fix it and a couple of bogus uses of it · 8c87df45
      Committed by Jan Beulich
      gcc permitting variable length arrays makes the current construct used for
      BUILD_BUG_ON() useless, as that doesn't produce any diagnostic if the
      controlling expression isn't really constant.  Instead, this patch makes
      it so that a bit field gets used here.  Consequently, those uses where the
      condition isn't really constant now also need fixing.
      
      Note that in the gfp.h, kmemcheck.h, and virtio_config.h cases
      MAYBE_BUILD_BUG_ON() really just serves documentation purposes - even if
      the expression is compile time constant (__builtin_constant_p() yields
      true), the array is still deemed of variable length by gcc, and hence the
      whole expression doesn't have the intended effect.
      
      [akpm@linux-foundation.org: make arch/sparc/include/asm/vio.h compile]
      [akpm@linux-foundation.org: more nonsensical assertions in tpm.c..]
      Signed-off-by: Jan Beulich <jbeulich@novell.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Rajiv Andrade <srajiv@linux.vnet.ibm.com>
      Cc: Mimi Zohar <zohar@us.ibm.com>
      Cc: James Morris <jmorris@namei.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c87df45
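      The bit-field trick reads roughly as below; this is an illustrative form of
      the construct the commit describes, not necessarily the exact macro added.
      A non-constant or true condition yields a negative bit-field width, which
      gcc rejects even where it would accept a variable length array:

          /* Illustrative bit-field based BUILD_BUG_ON */
          #define BUILD_BUG_ON(condition) \
                  ((void)sizeof(struct { int:-!!(condition); }))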
  6. Sep 22, 2009 · 2 commits
  7. Jun 19, 2009 · 1 commit
  8. Jun 17, 2009 · 5 commits
    • page-allocator: use integer fields lookup for gfp_zone and check for errors in... · b70d94ee
      Committed by Christoph Lameter
      page-allocator: use integer fields lookup for gfp_zone and check for errors in flags passed to the page allocator
      
      This simplifies the code in gfp_zone() and also keeps the ability of the
      compiler to use constant folding to get rid of gfp_zone processing.
      
      The lookup of the zone is done using a bitfield stored in an integer.  So
      the code in gfp_zone is a simple extraction of bits from a constant
      bitfield.  The compiler is generating a load of a constant into a register
      and then performs a shift and mask operation to get the zone from a gfp_t.
       No cachelines are touched and no branches have to be predicted by the
      compiler.
      
      We are doing some macro tricks here to convince the compiler to always do
      the constant folding if possible.
      Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b70d94ee
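      A simplified sketch of the lookup described above; the packing of
      GFP_ZONE_TABLE is omitted and the exact form is an assumption based on the
      commit message:

          /* The zone becomes a shift-and-mask of a compile-time constant table,
           * so constant gfp flags fold away entirely. */
          static inline enum zone_type gfp_zone(gfp_t flags)
          {
                  int bit = flags & GFP_ZONEMASK;         /* zone-selecting gfp bits */

                  return (GFP_ZONE_TABLE >> (bit * ZONES_SHIFT)) &
                         ((1 << ZONES_SHIFT) - 1);
          }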
    • mm, PM/Freezer: Disable OOM killer when tasks are frozen · 7f33d49a
      Committed by Rafael J. Wysocki
      Currently, the following scenario appears to be possible in theory:
      
      * Tasks are frozen for hibernation or suspend.
      * Free pages are almost exhausted.
      * Certain piece of code in the suspend code path attempts to allocate
        some memory using GFP_KERNEL and allocation order less than or
        equal to PAGE_ALLOC_COSTLY_ORDER.
      * __alloc_pages_internal() cannot find a free page so it invokes the
        OOM killer.
      * The OOM killer attempts to kill a task, but the task is frozen, so
        it doesn't die immediately.
      * __alloc_pages_internal() jumps to 'restart', unsuccessfully tries
        to find a free page and invokes the OOM killer.
      * No progress can be made.
      
      Although it is now hard to trigger during hibernation due to the memory
      shrinking carried out by the hibernation code, it is theoretically
      possible to trigger during suspend after the memory shrinking has been
      removed from that code path.  Moreover, since memory allocations are
      going to be used for the hibernation memory shrinking, it will be even
      more likely to happen during hibernation.
      
      To prevent it from happening, introduce the oom_killer_disabled switch
      that will cause __alloc_pages_internal() to fail in the situations in
      which the OOM killer would have been called and make the freezer set
      this switch after tasks have been successfully frozen.
      
      [akpm@linux-foundation.org: be nicer to the namespace]
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Fengguang Wu <fengguang.wu@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7f33d49a
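      A rough sketch of the switch described above; the helper names follow the
      commit message, but the bodies are reconstructed rather than quoted:

          /* Set by the freezer once all tasks are frozen, cleared on thaw. */
          int oom_killer_disabled __read_mostly;

          static inline void oom_killer_disable(void) { oom_killer_disabled = 1; }
          static inline void oom_killer_enable(void)  { oom_killer_disabled = 0; }

          /* In the allocator slow path, the OOM killer is skipped and the
           * allocation fails instead while oom_killer_disabled is set. */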
    • page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Committed by Mel Gorman
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
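      The new helper is essentially the sketch below, reconstructed from the
      commit message (the in-tree definition may differ in detail):

          /* Like alloc_pages_node(), but the caller guarantees nid is valid,
           * so the nid < 0 fallback comparison drops out of the fast path. */
          static inline struct page *alloc_pages_exact_node(int nid, gfp_t gfp_mask,
                                                            unsigned int order)
          {
                  VM_BUG_ON(nid < 0 || nid >= MAX_NUMNODES);

                  return __alloc_pages(gfp_mask, order, node_zonelist(nid, gfp_mask));
          }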
    • page allocator: do not sanity check order in the fast path · b3c466ce
      Committed by Mel Gorman
      No user of the allocator API should be passing in an order >= MAX_ORDER
      but we check for it on each and every allocation.  Delete this check and
      make it a VM_BUG_ON check further down the call path.
      
      [akpm@linux-foundation.org: s/VM_BUG_ON/WARN_ON_ONCE/]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3c466ce
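      The resulting shape is roughly the following (placement per the commit
      message; later patches in the same series give the slow path its own
      function):

          /* Fast path: no order validation at all.  Slow path only: */
          if (WARN_ON_ONCE(order >= MAX_ORDER))
                  return NULL;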
    • page allocator: replace __alloc_pages_internal() with __alloc_pages_nodemask() · d239171e
      Committed by Mel Gorman
      The start of a large patch series to clean up and optimise the page
      allocator.
      
      The performance improvements are in a wide range depending on the exact
      machine, but the results I've seen so far are approximately:
      
      kernbench:	0	to	 0.12% (elapsed time)
      		0.49%	to	 3.20% (sys time)
      aim9:		-4%	to	30% (for page_test and brk_test)
      tbench:		-1%	to	 4%
      hackbench:	-2.5%	to	 3.45% (mostly within the noise though)
      netperf-udp	-1.34%  to	 4.06% (varies between machines a bit)
      netperf-tcp	-0.44%  to	 5.22% (varies between machines a bit)
      
      I don't have sysbench figures at hand, but previously they were within the
      -0.5% to 2% range.
      
      On netperf, the client and server were bound to opposite number CPUs to
      maximise the problems with cache line bouncing of the struct pages so I
      expect different people to report different results for netperf depending
      on their exact machine and how they ran the test (different machines, same
      cpus client/server, shared cache but two threads client/server, different
      socket client/server etc).
      
      I also measured the vmlinux sizes for a single x86-based config with
      CONFIG_DEBUG_INFO enabled but not CONFIG_DEBUG_VM.  The core of the
      .config is based on the Debian Lenny kernel config so I expect it to be
      reasonably typical.
      
      This patch:
      
      __alloc_pages_internal is the core page allocator function but essentially
      it is an alias of __alloc_pages_nodemask.  Naming a publicly available and
      exported function "internal" is also a bit ugly.  This patch renames
      __alloc_pages_internal() to __alloc_pages_nodemask() and deletes the old
      nodemask function.
      
      Warning - This patch renames an exported symbol.  No in-tree kernel driver
      is affected, but external drivers calling __alloc_pages_internal() should
      change the call to __alloc_pages_nodemask() without any alteration of
      parameters.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d239171e
  9. Jun 15, 2009 · 2 commits
    • kmemcheck: add hooks for the page allocator · b1eeab67
      Committed by Vegard Nossum
      This adds support for tracking the initializedness of memory that
      was allocated with the page allocator. Highmem requests are not
      tracked.
      
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>

      [build fix for !CONFIG_KMEMCHECK]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>

      [rebased for mainline inclusion]
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      b1eeab67
    • kmemcheck: add mm functions · 2dff4405
      Committed by Vegard Nossum
      With kmemcheck enabled, the slab allocator needs to do this:
      
      1. Tell kmemcheck to allocate the shadow memory which stores the status of
         each byte in the allocation proper, e.g. whether it is initialized or
         uninitialized.
      2. Tell kmemcheck which parts of memory that should be marked uninitialized.
         There are actually a few more states, such as "not yet allocated" and
         "recently freed".
      
      If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
      memory that can take page faults because of kmemcheck.
      
      If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
      request memory with the __GFP_NOTRACK flag. This does not prevent the page
      faults from occurring, however, but marks the object in question as being
      initialized so that no warnings will ever be produced for this object.
      
      In addition to (and in contrast to) __GFP_NOTRACK, the
      __GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
      not be tracked _because_ it would produce a false positive. Their values
      are identical, but need not be so in the future (for example, we could now
      enable/disable false positives with a config option).
      
      Parts of this patch were contributed by Pekka Enberg but merged for
      atomicity.
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>

      [rebased for mainline inclusion]
      Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
      2dff4405
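      A usage sketch of the two opt-outs described above; the cache name, sizes
      and init function are made up for illustration:

          static struct kmem_cache *example_cache;

          static int __init example_init(void)
          {
                  void *obj;

                  /* Per-cache opt-out: objects from this cache are never tracked. */
                  example_cache = kmem_cache_create("example_cache", 128, 0,
                                                    SLAB_NOTRACK, NULL);
                  if (!example_cache)
                          return -ENOMEM;

                  /* Per-allocation opt-out: faults still occur, but the object is
                   * marked initialized so kmemcheck never warns about it. */
                  obj = kmalloc(64, GFP_KERNEL | __GFP_NOTRACK);
                  kfree(obj);

                  return 0;
          }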
  10. Jun 12, 2009 · 1 commit
    • slab,slub: don't enable interrupts during early boot · 7e85ee0c
      Committed by Pekka Enberg
      As explained by Benjamin Herrenschmidt:
      
        Oh and btw, your patch alone doesn't fix powerpc, because it's missing
        a whole bunch of GFP_KERNEL's in the arch code... You would have to
        grep the entire kernel for things that check slab_is_available() and
        even then you'll be missing some.
      
        For example, slab_is_available() didn't always exist, and so in the
        early days on powerpc, we used a mem_init_done global that is set from
        mem_init() (not perfect but works in practice). And we still have code
        using that to do the test.
      
      Therefore, mask out __GFP_WAIT, __GFP_IO, and __GFP_FS in the slab allocators
      in early boot code to avoid enabling interrupts.
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      7e85ee0c
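      A conceptual sketch of the masking; the mask name and where it is applied
      are assumptions. The point is only that blocking and I/O flags are stripped
      from slab allocations until boot has progressed far enough to allow them:

          /* Assumed name: flags tolerated before interrupts may be enabled */
          #define SLAB_GFP_BOOT_MASK \
                  (__GFP_BITS_MASK & ~(__GFP_WAIT | __GFP_IO | __GFP_FS))

          gfp_t slab_gfp_mask __read_mostly = SLAB_GFP_BOOT_MASK;

          static inline gfp_t slab_boot_fixup(gfp_t flags)
          {
                  return flags & slab_gfp_mask;   /* applied on slab entry paths */
          }

          /* once init is far enough along: slab_gfp_mask = __GFP_BITS_MASK; */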
  11. Mar 13, 2009 · 1 commit
    • numa, cpumask: move numa_node_id default implementation to topology.h · 082edb7b
      Committed by Rusty Russell
      Impact: cleanup, potential bugfix
      
      Not sure what changed to expose this, but clearly numa_node_id()
      doesn't belong in mmzone.h (the inline in gfp.h is probably overkill, too).
      
      In file included from include/linux/topology.h:34,
                       from arch/x86/mm/numa.c:2:
      /home/rusty/patches-cpumask/linux-2.6/arch/x86/include/asm/topology.h:64:1: warning: "numa_node_id" redefined
      In file included from include/linux/topology.h:32,
                       from arch/x86/mm/numa.c:2:
      include/linux/mmzone.h:770:1: warning: this is the location of the previous definition
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Mike Travis <travis@sgi.com>
      LKML-Reference: <200903132343.37661.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      082edb7b
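      After the move, the generic fallback in topology.h looks roughly like this
      (a sketch; the surrounding ifdef structure is simplified):

          /* Generic default, overridden by architectures with their own mapping */
          #ifndef numa_node_id
          #define numa_node_id()  (cpu_to_node(raw_smp_processor_id()))
          #endif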
  12. Jan 07, 2009 · 1 commit
  13. Jul 25, 2008 · 2 commits
  14. Apr 29, 2008 · 1 commit
    • mm: fix misleading __GFP_REPEAT related comments · ab857d09
      Committed by Nishanth Aravamudan
      The definition and use of __GFP_REPEAT, __GFP_NOFAIL and __GFP_NORETRY in the
      core VM have somewhat differing comments as to their actual semantics.
      Annoyingly, the flags definition has inline and header comments, which might
      be interpreted as not being equivalent.  Just add references to the header
      comments in the inline ones so they don't go out of sync in the future.  In
      their use in __alloc_pages() clarify that the current implementation treats
      low-order allocations and __GFP_REPEAT allocations as distinct cases.
      
      To clarify, the flags' semantics are:
      
      __GFP_NORETRY means try no harder than one run through __alloc_pages
      
      __GFP_REPEAT means __GFP_NOFAIL
      
      __GFP_NOFAIL means repeat forever
      
      order <= PAGE_ALLOC_COSTLY_ORDER means __GFP_NOFAIL
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab857d09
  15. Apr 28, 2008 · 5 commits
  16. Feb 14, 2008 · 1 commit
  17. Feb 06, 2008 · 1 commit
    • Page allocator: clean up pcp draining functions · 9f8f2172
      Committed by Christoph Lameter
      - Add comments explaining how drain_pages() works.

      - Eliminate useless functions

      - Rename drain_all_local_pages to drain_all_pages(). It drains
        all pages, not only those of the local processor.
      
      - Eliminate useless interrupt off / on sequences. drain_pages()
        disables interrupts on its own. The execution thread is
        pinned to processor by the caller. So there is no need to
        disable interrupts.
      
      - Put drain_all_pages() declaration in gfp.h and remove the
        declarations from suspend.h and from mm/memory_hotplug.c
      
      - Make software suspend call drain_all_pages(). Draining only the
        processor-local pages may not be the right approach if software
        suspend wants to support SMP. Once it calls drain_all_pages(),
        drain_pages() can be made static.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Daniel Walker <dwalker@mvista.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f8f2172
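      The resulting interface, sketched from the description above (the exact
      declarations may differ slightly from the patch):

          /* include/linux/gfp.h: one declaration replaces the copies that used
           * to live in suspend.h and mm/memory_hotplug.c */
          void drain_all_pages(void);     /* drain per-cpu pagesets on every CPU */

          /* software suspend now calls drain_all_pages() instead of draining only
           * the local processor, which lets drain_pages() become static */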
  18. Oct 17, 2007 · 4 commits
    • Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo · 467c996c
      Committed by Mel Gorman
      This patch provides fragmentation avoidance statistics via /proc/pagetypeinfo.
       The information is collected only on request so there is no runtime overhead.
       The statistics are in three parts:
      
      The first part prints information on the size of blocks that pages are
      being grouped on and looks like
      
      Page block order: 10
      Pages per block:  1024
      
      The second part is a more detailed version of /proc/buddyinfo and looks like
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type  Reclaimable      1      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Reserve      0      4      4      0      0      0      0      1      0      1      0
      Node    0, zone   Normal, type    Unmovable    111      8      4      4      2      3      1      0      0      0      0
      Node    0, zone   Normal, type  Reclaimable    293     89      8      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Movable      1      6     13      9      7      6      3      0      0      0      0
      Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      4
      
      The third part looks like
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve
      Node 0, zone      DMA            0            1            2            1
      Node 0, zone   Normal            3           17           94            4
      
      To walk the zones within a node with interrupts disabled, walk_zones_in_node()
      is introduced and shared between /proc/buddyinfo, /proc/zoneinfo and
      /proc/pagetypeinfo to reduce code duplication.  It seems specific to what
      vmstat.c requires but could be broken out as a general utility function in
      mmzone.c if there were other potential users.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      467c996c
    • Group short-lived and reclaimable kernel allocations · e12ba74d
      Committed by Mel Gorman
      This patch marks a number of allocations that are either short-lived such as
      network buffers or are reclaimable such as inode allocations.  When something
      like updatedb is called, long-lived and unmovable kernel allocations tend to
      be spread throughout the address space which increases fragmentation.
      
      This patch groups these allocations together as much as possible by adding a
      new MIGRATE_TYPE.  The MIGRATE_RECLAIMABLE type is for allocations that can be
      reclaimed on demand, but not moved.  i.e.  they can be migrated by deleting
      them and re-reading the information from elsewhere.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e12ba74d
    • Categorize GFP flags · 6cb06229
      Committed by Christoph Lameter
      The function of GFP_LEVEL_MASK seems to be unclear.  In order to clear up
      the mystery we get rid of it and replace GFP_LEVEL_MASK with 3 sets of GFP
      flags:
      
      GFP_RECLAIM_MASK	Flags used to control page allocator reclaim behavior.
      
      GFP_CONSTRAINT_MASK	Flags used to limit where allocations can occur.
      
      GFP_SLAB_BUG_MASK	Flags that the slab allocator BUG()s on.
      
      These replace the uses of GFP_LEVEL mask in the slab allocators and in
      vmalloc.c.
      
      The use of the flags not included in these sets may occur as a result of a
      slab allocation standing in for a page allocation when constructing scatter
      gather lists.  Extraneous flags are cleared and not passed through to the
      page allocator.  __GFP_MOVABLE/RECLAIMABLE, __GFP_COLD and __GFP_COMP will
      now be ignored if passed to a slab allocator.
      
      Change the allocation of allocator meta data in SLAB and vmalloc to not
      pass through flags listed in GFP_CONSTRAINT_MASK.  SLAB already removes the
      __GFP_THISNODE flag for such allocations.  Generalize that to also cover
      vmalloc.  The use of GFP_CONSTRAINT_MASK also includes __GFP_HARDWALL.
      
      The impact of allocator metadata placement on access latency to the
      cachelines of the object itself is minimal since metadata is only
      referenced on alloc and free.  The attempt is still made to place the meta
      data optimally but we consistently allow fallback both in SLAB and vmalloc
      (SLUB does not need to allocate metadata like that).
      
      Allocator metadata may serve multiple in-kernel users and thus should not
      be subject to the limitations arising from a single allocation context.
      
      [akpm@linux-foundation.org: fix fallback_alloc()]
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cb06229
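      A sketch of the three sets; the exact bit lists below are reconstructed and
      should be read as illustrative rather than authoritative:

          /* Flags that steer page allocator reclaim behaviour */
          #define GFP_RECLAIM_MASK    (__GFP_WAIT | __GFP_HIGH | __GFP_IO | __GFP_FS | \
                                       __GFP_NOWARN | __GFP_REPEAT | __GFP_NOFAIL | \
                                       __GFP_NORETRY | __GFP_NOMEMALLOC)

          /* Flags that limit where an allocation may be placed */
          #define GFP_CONSTRAINT_MASK (__GFP_HARDWALL | __GFP_THISNODE)

          /* Flags the slab allocators BUG() on if they ever see them */
          #define GFP_SLAB_BUG_MASK   (__GFP_DMA32 | __GFP_HIGHMEM | ~__GFP_BITS_MASK)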
    • Memoryless nodes: Fix GFP_THISNODE behavior · 523b9458
      Committed by Christoph Lameter
      GFP_THISNODE checks that the zone selected is within the pgdat (node) of the
      first zone of a nodelist.  That only works if the node has memory.  A
      memoryless node will have its first node on another pgdat (node).
      
      GFP_THISNODE currently will return simply memory on the first pgdat.  Thus it
      is returning memory on other nodes.  GFP_THISNODE should fail if there is no
      local memory on a node.
      
      Add a new set of zonelists for each node that contain only the zones
      belonging to the node itself, so that no fallback is possible.
      
      Then modify gfp_type to pick up the right zone based on the presence of
      __GFP_THISNODE.
      
      Drop the existing GFP_THISNODE checks from the page_allocators hot path.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Bob Picco <bob.picco@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@skynet.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      523b9458
  19. Jul 18, 2007 · 2 commits
    • Create the ZONE_MOVABLE zone · 2a1e274a
      Committed by Mel Gorman
      The following 8 patches against 2.6.20-mm2 create a zone called ZONE_MOVABLE
      that is only usable by allocations that specify both __GFP_HIGHMEM and
      __GFP_MOVABLE.  This has the effect of keeping all non-movable pages within a
      single memory partition while allowing movable allocations to be satisfied
      from either partition.  The patches may be applied with the list-based
      anti-fragmentation patches that group pages together based on mobility.
      
      The size of the zone is determined by a kernelcore= parameter specified at
      boot-time.  This specifies how much memory is usable by non-movable
      allocations and the remainder is used for ZONE_MOVABLE.  Any range of pages
      within ZONE_MOVABLE can be released by migrating the pages or by reclaiming.
      
      When selecting a zone to take pages from for ZONE_MOVABLE, there are two
      things to consider.  First, only memory from the highest populated zone is
      used for ZONE_MOVABLE.  On the x86, this is probably going to be ZONE_HIGHMEM
      but it would be ZONE_DMA on ppc64 or possibly ZONE_DMA32 on x86_64.  Second,
      the amount of memory usable by the kernel will be spread evenly throughout
      NUMA nodes where possible.  If the nodes are not of equal size, the amount of
      memory usable by the kernel on some nodes may be greater than others.
      
      By default, the zone is not as useful for hugetlb allocations because they are
      pinned and non-migratable (currently at least).  A sysctl is provided that
      allows huge pages to be allocated from that zone.  This means that the huge
      page pool can be resized to the size of ZONE_MOVABLE during the lifetime of
      the system assuming that pages are not mlocked.  Despite huge pages being
      non-movable, we do not introduce additional external fragmentation of note as
      huge pages are always the largest contiguous block we care about.
      
      Credit goes to Andy Whitcroft for catching a large variety of problems during
      review of the patches.
      
      This patch creates an additional zone, ZONE_MOVABLE.  This zone is only usable
      by allocations which specify both __GFP_HIGHMEM and __GFP_MOVABLE.  Hot-added
      memory continues to be placed in their existing destination as there is no
      mechanism to redirect them to a specific zone.
      
      [y-goto@jp.fujitsu.com: Fix section mismatch of memory hotplug related code]
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Signed-off-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a1e274a
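      A sketch of the selection rule and the boot parameter; the helper name is
      hypothetical, only the flag combination comes from the commit message:

          /* Boot: kernelcore=2G  ->  2G stays usable by non-movable allocations,
           * the remainder of each node backs ZONE_MOVABLE. */

          /* Only allocations that are both highmem-capable and movable qualify. */
          static inline bool gfp_wants_movable_zone(gfp_t flags)
          {
                  return (flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
                         (__GFP_HIGHMEM | __GFP_MOVABLE);
          }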
    • Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Committed by Mel Gorman
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickins.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickins for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      769848c0
  20. May 10, 2007 · 1 commit
    • Move remote node draining out of slab allocators · 4037d452
      Committed by Christoph Lameter
      Currently the slab allocators contain callbacks into the page allocator to
      perform the draining of pagesets on remote nodes.  This requires SLUB to have
      a whole subsystem in order to be compatible with SLAB.  Moving node draining
      out of the slab allocators avoids a section of code in SLUB.
      
      Move the node draining so that it is done when the vm statistics are updated.
      At that point we are already touching all the cachelines with the pagesets of
      a processor.
      
      Add an expire counter there.  If we have to update per zone or global vm
      statistics then assume that the pageset will require subsequent draining.
      
      The expire counter will be decremented on each vm stats update pass until it
      reaches zero.  Then we will drain one batch from the pageset.  The draining
      will cause vm counter updates which will then cause another expiration until
      the pcp is empty.  So we will drain a batch every 3 seconds.
      
      Note that remote node draining is a somewhat esoteric feature that is required
      on large NUMA systems because otherwise significant portions of system memory
      can become trapped in pcp queues.  The number of pcp is determined by the
      number of processors and nodes in a system.  A system with 4 processors and 2
      nodes has 8 pcps which is okay.  But a system with 1024 processors and 512
      nodes has 512k pcps with a high potential for a large amount of memory being
      caught in them.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4037d452
  21. May 08, 2007 · 1 commit
  22. Feb 12, 2007 · 1 commit