1. 28 April 2008 (19 commits)
    • mempolicy: add MPOL_F_RELATIVE_NODES flag · 4c50bc01
      David Rientjes committed
      Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
      nodemasks passed via set_mempolicy() or mbind() should be considered relative
      to the current task's mems_allowed.
      
      When the mempolicy is created, the passed nodemask is folded and mapped onto
      the current task's mems_allowed.  For example, consider a task using
      set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
      nodemask of 1-3.  If current's mems_allowed is 4-7, the resulting effective
      nodemask is 5-7 (the second, third, and fourth nodes of mems_allowed).
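
      For illustration, a minimal userspace sketch of that example, invoking the
      syscall directly.  The MPOL_* values below are taken from this patch series
      and are defined locally in case an older <numaif.h> does not provide them;
      this sketch is not part of the original patch.

      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/syscall.h>

      	/* Values per this patch series; defined here in case the installed
      	 * headers predate it. */
      	#define MPOL_INTERLEAVE        3
      	#define MPOL_F_RELATIVE_NODES  (1 << 14)

      	int main(void)
      	{
      		/* Nodes 1-3, interpreted relative to the task's mems_allowed:
      		 * with mems_allowed of 4-7 this folds to nodes 5-7. */
      		unsigned long nodemask = (1UL << 1) | (1UL << 2) | (1UL << 3);

      		if (syscall(SYS_set_mempolicy,
      			    MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES,
      			    &nodemask, 8 * sizeof(nodemask)) != 0)
      			perror("set_mempolicy");
      		return 0;
      	}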
      
      If the same task is attached to a cpuset, the mempolicy nodemask is rebound
      each time the mems are changed.  Some possible rebinds and results are:
      
      	mems			result
      	1-3			1-3
      	1-7			2-4
      	1,5-6			1,5-6
      	1,5-7			5-7
      
      Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
      to the resultant nodemask from the relative remap.
      
      In the MPOL_PREFERRED case, the preferred node is remapped from the currently
      effective nodemask to the relative nodemask.
      
      This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4c50bc01
    • mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      David Rientjes committed
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	}
      
      that stores the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side effect in that it allows the mempolicy
      "intent" to lie dormant, with no effect, until the task has access to the
      node(s) that it desires.  For example, if you currently ask for an interleaved
      policy over a set of nodes that you do not have access to, the mempolicy is not
      created and the task continues to use the previous policy.  With this change,
      however, it is possible to create the same mempolicy; it only takes effect once
      access to nodes in the nodemask is acquired, as the sketch below illustrates.
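
      A hedged sketch of that scenario, again calling the syscall directly with the
      flag value defined locally per this series; whether and when the policy takes
      effect follows the changelog description above:

      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/syscall.h>

      	#define MPOL_INTERLEAVE      3
      	#define MPOL_F_STATIC_NODES  (1 << 15)

      	int main(void)
      	{
      		/* Ask for interleaving over nodes 4-7 even if the current cpuset
      		 * does not (yet) include them; the intent lies dormant until
      		 * those nodes become allowed. */
      		unsigned long nodemask = 0xf0UL;

      		if (syscall(SYS_set_mempolicy,
      			    MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
      			    &nodemask, 8 * sizeof(nodemask)) != 0)
      			perror("set_mempolicy");
      		return 0;
      	}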
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
      	mount -o remount mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new(), since it
      is now obsolete.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5b087b5
    • mempolicy: support optional mode flags · 028fec41
      David Rientjes committed
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal parameter of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	}
      
      The splitting of the 'int' actual argument passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual argument with MPOL_MODE_FLAGS, rejecting the
      syscall if there are additional unknown flags, and storing the result in the
      new 'flags' member of struct mempolicy.  The intersection of the actual
      argument with ~MPOL_MODE_FLAGS is stored in the 'policy' member of the struct,
      and all current users of pol->policy remain unchanged.
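
      In rough form, the split looks like the hypothetical helper below; the actual
      checks sit inline in sys_set_mempolicy() and sys_mbind(), so treat this as a
      sketch rather than the literal code:

      	static int split_mode_arg(int arg, unsigned short *mode,
      				  unsigned short *flags)
      	{
      		*flags = arg & MPOL_MODE_FLAGS;   /* optional mode flags */
      		*mode  = arg & ~MPOL_MODE_FLAGS;  /* the policy mode itself */

      		if (*mode >= MPOL_MAX)            /* unknown bits: reject */
      			return -EINVAL;
      		return 0;
      	}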
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual argument does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such statements only stop working if the application itself
      starts passing optional mode flags to set_mempolicy() or mbind().  If an
      application does start using optional mode flags, it will need to mask the
      optional flags off the policy in switch and conditional statements that only
      test the mode, as in the sketch below.
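
      A userspace sketch of that masking (mode and flag values per this series,
      defined locally in case the installed headers predate it):

      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/syscall.h>

      	#define MPOL_BIND              2
      	#define MPOL_INTERLEAVE        3
      	#define MPOL_F_RELATIVE_NODES  (1 << 14)
      	#define MPOL_F_STATIC_NODES    (1 << 15)
      	#define MPOL_MODE_FLAGS        (MPOL_F_STATIC_NODES | MPOL_F_RELATIVE_NODES)

      	int main(void)
      	{
      		int policy;

      		if (syscall(SYS_get_mempolicy, &policy, NULL, 0, NULL, 0) != 0) {
      			perror("get_mempolicy");
      			return 1;
      		}

      		switch (policy & ~MPOL_MODE_FLAGS) {  /* strip the optional flags */
      		case MPOL_BIND:
      			puts("MPOL_BIND");
      			break;
      		case MPOL_INTERLEAVE:
      			puts("MPOL_INTERLEAVE");
      			break;
      		default:
      			puts("other policy");
      		}
      		return 0;
      	}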
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      028fec41
    • mempolicy: convert MPOL constants to enum · a3b51e01
      David Rientjes committed
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
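
      The resulting declaration is roughly:

      	enum {
      		MPOL_DEFAULT,
      		MPOL_PREFERRED,
      		MPOL_BIND,
      		MPOL_INTERLEAVE,
      		MPOL_MAX,	/* always last member of enum */
      	};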
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values fall within the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a3b51e01
    • mm: move cache_line_size() to <linux/cache.h> · 1b27d05b
      Pekka Enberg committed
      Not all architectures define cache_line_size(), so, as suggested by Andrew, move
      the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.
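
      The shared fallback is roughly the following macro in <linux/cache.h>
      (the exact config guard name is approximate):

      	#ifndef CONFIG_ARCH_HAS_CACHE_LINE_SIZE
      	#define cache_line_size()	L1_CACHE_BYTES	/* fallback */
      	#endif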
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b27d05b
    • hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages · 19fc3f0a
      Adam Litke committed
      To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
      pages, scan the page list in two parts.  First, transfer the needed pages to
      the hugetlb pool.  Then drop the lock and free the remaining pages back to the
      buddy allocator.
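
      A simplified sketch of the two-part walk (names abbreviated from
      mm/hugetlb.c; the accounting and error handling of the real function are
      omitted):

      	struct page *page, *tmp;

      	/* Part 1: under hugetlb_lock, move the pages we still need into
      	 * the hugetlb pool. */
      	spin_lock(&hugetlb_lock);
      	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
      		if (!needed)
      			break;
      		list_del(&page->lru);
      		enqueue_huge_page(page);
      		needed--;
      	}
      	spin_unlock(&hugetlb_lock);

      	/* Part 2: free any remaining surplus pages back to the buddy
      	 * allocator without holding the lock. */
      	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
      		list_del(&page->lru);
      		put_page(page);	/* last reference: page returns to buddy */
      	}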
      
      In the common case there are zero excess pages and no lock operations are
      required.
      
      Thanks to Mel Gorman for this improvement.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19fc3f0a
    • mm: try both endianess when checking for endianess · 797df574
      Chris Dearman committed
      When checking for the swap header, try byteswapping the endianness-dependent
      fields to allow the swap partition to be shared between big- and little-endian
      systems.
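
      Roughly, in the SWAPSPACE2 header check in sys_swapon() (a sketch; the
      fields follow union swap_header in <linux/swap.h>):

      	/* If the header looks byte-swapped, swap the endianness-dependent
      	 * fields in place before validating and using them. */
      	if (swab32(swap_header->info.version) == 1) {
      		swab32s(&swap_header->info.version);
      		swab32s(&swap_header->info.last_page);
      		swab32s(&swap_header->info.nr_badpages);
      		for (i = 0; i < swap_header->info.nr_badpages; i++)
      			swab32s(&swap_header->info.badpages[i]);
      	}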
      Signed-off-by: Chris Dearman <chris@mips.com>
      Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      797df574
    • mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman committed
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
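
      The new entry point has roughly this shape; the wrapper below is a
      hypothetical illustration, not code from the patch:

      	struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
      					    struct zonelist *zonelist,
      					    nodemask_t *nodemask);

      	/* An MPOL_BIND-style allocation now filters the local node's
      	 * distance-ordered zonelist with the policy's nodemask. */
      	static struct page *alloc_bound(gfp_t gfp, unsigned int order,
      					nodemask_t *allowed)
      	{
      		return __alloc_pages_nodemask(gfp, order,
      					      node_zonelist(numa_node_id(), gfp),
      					      allowed);
      	}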
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman committed
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a subtraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant, which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
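
      The new structure and two of its accessors look roughly like this:

      	struct zoneref {
      		struct zone *zone;	/* Pointer to the actual zone */
      		int zone_idx;		/* zone_idx(zoneref->zone) cached here */
      	};

      	static inline struct zone *zonelist_zone(struct zoneref *zoneref)
      	{
      		return zoneref->zone;
      	}

      	static inline int zonelist_zone_idx(struct zoneref *zoneref)
      	{
      		return zoneref->zone_idx;
      	}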
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman committed
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro called for_each_zone_zonelist() is introduced that iterates
      through each zone allowed by the GFP flags in the selected zonelist.
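
      Typical use of the iterator, as a sketch (nid and gfp_mask come from the
      caller; the body is only illustrative):

      	struct zoneref *z;
      	struct zone *zone;

      	/* Visit every zone usable for this GFP mask, closest first. */
      	for_each_zone_zonelist(zone, z, node_zonelist(nid, gfp_mask),
      			       gfp_zone(gfp_mask)) {
      		if (!populated_zone(zone))
      			continue;
      		/* ... try to allocate or reclaim from 'zone' ... */
      	}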
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman committed
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
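
      Roughly, the visible change is at the call site; the exact prototypes may
      differ, so treat this as a sketch:

      	/* Before: the first zone of the zonelist was assumed preferred. */
      	zone_statistics(zonelist, zone);
      	/* After: the preferred zone is remembered and passed explicitly. */
      	zone_statistics(preferred_zone, zone);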
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      18ea7e71
    • mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman committed
      Introduce a node_zonelist() helper function.  It is used to look up the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
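
      Illustrative use (the wrapper name is made up for the example):

      	/* Look up the zonelist that matches this GFP mask for the node
      	 * instead of indexing node_zonelists[] by hand. */
      	static struct page *alloc_on_node(int nid, gfp_t gfp_mask,
      					  unsigned int order)
      	{
      		return __alloc_pages(gfp_mask, order,
      				     node_zonelist(nid, gfp_mask));
      	}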
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0e88460d
    • mm: use zonelists instead of zones when direct reclaiming pages · dac1d27b
      Mel Gorman committed
      The following patches replace multiple zonelists per node with two zonelists
      that are filtered based on the GFP flags.  The patches as a set fix a bug with
      regard to the use of MPOL_BIND and ZONE_MOVABLE.  With this patchset,
      MPOL_BIND will apply to the two highest zones when the highest zone is
      ZONE_MOVABLE.  This should be considered an alternative fix for the
      MPOL_BIND+ZONE_MOVABLE problem in 2.6.23, rather than the previously discussed
      hack that filters only custom zonelists.
      
      The first patch cleans up an inconsistency where direct reclaim uses
      zonelist->zones where other places use zonelist.
      
      The second patch introduces a helper function node_zonelist() for looking up
      the appropriate zonelist for a GFP mask which simplifies patches later in the
      set.
      
      The third patch defines/remembers the "preferred zone" for numa statistics, as
      it is no longer always the first zone in a zonelist.
      
      The fourth patch replaces multiple zonelists with two zonelists that are
      filtered.  The two zonelists are due to the fact that the memoryless patchset
      introduces a second set of zonelists for __GFP_THISNODE.
      
      The fifth patch introduces helper macros for retrieving the zone and node
      indices of entries in a zonelist.
      
      The final patch introduces filtering of the zonelists based on a nodemask.
      Two zonelists exist per node, one for normal allocations and one for
      __GFP_THISNODE.
      
      Performance results varied depending on the machine configuration.  In real
      workloads the gain/loss will depend on how much the userspace portion of the
      benchmark benefits from having more cache available due to reduced referencing
      of zonelists.
      
      These are the range of performance losses/gains when running against
      2.6.24-rc4-mm1.  The set and these machines are a mix of i386, x86_64 and
      ppc64, both NUMA and non-NUMA.
      			     loss   to  gain
      Total CPU time on Kernbench: -0.86% to  1.13%
      Elapsed   time on Kernbench: -0.79% to  0.76%
      page_test from aim9:         -4.37% to  0.79%
      brk_test  from aim9:         -0.71% to  4.07%
      fork_test from aim9:         -1.84% to  4.60%
      exec_test from aim9:         -0.71% to  1.08%
      
      This patch:
      
      The allocator deals with zonelists which indicate the order in which zones
      should be targeted for an allocation.  Similarly, direct reclaim of pages
      iterates over an array of zones.  For consistency, this patch converts direct
      reclaim to use a zonelist.  No functionality is changed by this patch.  This
      simplifies zonelist iterators in the next patch.
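
      Roughly, the reclaim entry point changes shape as follows (prototypes
      reconstructed from the changelog, not copied from the diff):

      	/* before: an array of zones */
      	unsigned long try_to_free_pages(struct zone **zones, int order,
      					gfp_t gfp_mask);
      	/* after: the zonelist itself */
      	unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
      					gfp_t gfp_mask);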
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dac1d27b
    • mm: remove nopage · 3c18ddd1
      Nick Piggin committed
      Nothing in the tree uses nopage any more.  Remove support for it in the
      core mm code and documentation (and a few stray references to it in
      comments).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c18ddd1
    • mmap_region: cleanup the final vma_merge() related code · 4d3d5b41
      Oleg Nesterov committed
      It is not easy to actually understand the "if (!file || !vma_merge())"
      code, so turn it into "if (file && vma_merge())".  This makes it immediately
      obvious that the subsequent "if (file)" check is superfluous.
      
      As Hugh Dickins pointed out, we can also factor out the ->i_writecount
      corrections, and add a small comment about that.
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Miklos Szeredi <miklos@szeredi.hu>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d3d5b41
    • fix invalidate_inode_pages2_range() to not clear ret · 0dd1334f
      Hisashi Hifumi committed
      DIO invalidates page cache through invalidate_inode_pages2_range().
      invalidate_inode_pages2_range() sets ret=-EIO when
      invalidate_complete_page2() fails, but this ret is cleared if
      do_launder_page() succeeds on a page at a later index.
      
      In this case, DIO is carried out even if invalidate_complete_page2() fails
      on some pages.
      
      This can cause inconsistency between memory and blocks on HDD because the
      page cache still exists.
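
      The shape of the fix, inside the page loop (a sketch; 'ret' is what the
      function finally returns and now only ever accumulates errors):

      	ret2 = do_launder_page(mapping, page);
      	if (ret2 == 0) {
      		if (!invalidate_complete_page2(mapping, page))
      			ret2 = -EIO;
      	}
      	if (ret2 < 0)
      		ret = ret2;	/* a later success cannot clear an earlier error */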
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Zach Brown <zach.brown@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0dd1334f
    • hotplug-memory: make online_page() common · 180c06ef
      Jeremy Fitzhardinge committed
      All architectures use an effectively identical definition of online_page(), so
      just make it common code.  x86-64, ia64, powerpc and sh are actually
      identical; x86-32 is slightly different.
      
      x86-32's differences arise because it puts its hotplug pages in the highmem
      zone.  We can handle this in the generic code by inspecting the page to see if
      it is in highmem and updating the totalhigh_pages count appropriately.  This
      leaves init_32.c:free_new_highpage with a single caller, so I folded it into
      add_one_highpage_init.
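
      A simplified sketch of the resulting common helper (the real version also
      updates num_physpages/max_mapnr bookkeeping):

      	void online_page(struct page *page)
      	{
      		totalram_pages++;

      	#ifdef CONFIG_HIGHMEM
      		if (PageHighMem(page))
      			totalhigh_pages++;
      	#endif

      		ClearPageReserved(page);
      		init_page_count(page);
      		__free_page(page);
      	}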
      
      I also removed an incorrect comment referring to the NUMA case; any NUMA
      details have already been dealt with by the time online_page() is called.
      
      [akpm@linux-foundation.org: fix indenting]
      Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      Acked-by: Dave Hansen <dave@linux.vnet.ibm.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
      Tested-by: KAMEZAWA Hiroyuki <kamez.hiroyu@jp.fujitsu.com>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      180c06ef
    • hotplug memory remove: generic __remove_pages() support · ea01ea93
      Badari Pulavarty committed
      Add a generic helper function to remove the section mappings and sysfs entries
      for the section of memory we are removing.  offline_pages() has already
      correctly adjusted the zone and marked the pages reserved.
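
      The helper's interface, as I read it from the changelog (a sketch):

      	/* Tear down the sparsemem mapping and sysfs entries for nr_pages
      	 * starting at phys_start_pfn; the pages must already be offline. */
      	int __remove_pages(struct zone *zone, unsigned long phys_start_pfn,
      			   unsigned long nr_pages);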
      
      TODO: Yasunori Goto is working on patches to free up allocations from bootmem.
      Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com>
      Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea01ea93
    • mm: fix possible off-by-one in walk_pte_range() · 556637cd
      Johannes Weiner committed
      After the loop in walk_pte_range(), pte might point to the first address after
      the pmd it walks.  pte_unmap() is then applied to something bad.
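
      One way to restructure the loop so that pte never steps past the last
      entry before pte_unmap() (a sketch of the fix; argument details
      abbreviated):

      	pte = pte_offset_map(pmd, addr);
      	for (;;) {
      		err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
      		if (err)
      			break;
      		addr += PAGE_SIZE;
      		if (addr == end)
      			break;
      		pte++;		/* only advance while a next entry exists */
      	}
      	pte_unmap(pte);		/* pte still points inside this pmd */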
      
      Spotted by Roel Kluin and Andreas Schwab.
      Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
      Cc: Roel Kluin <12o3l@tiscali.nl>
      Cc: Andreas Schwab <schwab@suse.de>
      Acked-by: Matt Mackall <mpm@selenic.com>
      Acked-by: Mikael Pettersson <mikpe@it.uu.se>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      556637cd
  2. 27 April 2008 (6 commits)
    • s390: KVM preparation: host memory management changes for s390 kvm · 5b7baf05
      Christian Borntraeger committed
      This patch changes the s390 memory management definitions to use the pgste field
      for dirty and reference bit tracking of host and guest code. Usually on s390,
      dirty and referenced are tracked in storage keys, which belong to the physical
      page. This changes with virtualization: The guest and host dirty/reference bits
      are defined to be the logical OR of the values for the mapping and the physical
      page. This patch implements the necessary changes in pgtable.h for s390.
      
      There is a common code change in mm/rmap.c: the call to
      page_test_and_clear_young must be moved.  This is a no-op for all
      architectures but s390.  page_referenced checks the referenced bits for
      the physical page and for all mappings:
      o The physical page is checked with page_test_and_clear_young.
      o The mappings are checked with ptep_test_and_clear_young and friends.
      
      Without pgstes (the current implementation on Linux s390) the physical page
      check is implemented but the mapping callbacks are no-ops because dirty
      and referenced are not tracked in the s390 page tables.  The pgstes introduce
      guest and host dirty and reference bits for s390 in the host mapping.  These
      mappings must be checked before page_test_and_clear_young resets the reference
      bit.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Carsten Otte <cotte@de.ibm.com>
      Signed-off-by: Avi Kivity <avi@qumranet.com>
      5b7baf05
    • x86_64/mm: check and print vmemmap allocation continuous · c2b91e2e
      Yinghai Lu committed
      On big systems with lots of memory, don't print out too much during
      bootup, and make it easy to see whether the allocation is contiguous.
      
      On a 256G, 8-socket system we will get:
       [ffffe20000000000-ffffe20002bfffff] PMD -> [ffff810001400000-ffff810003ffffff] on node 0
      [ffffe2001c700000-ffffe2001c7fffff] potential offnode page_structs
       [ffffe20002c00000-ffffe2001c7fffff] PMD -> [ffff81000c000000-ffff8100255fffff] on node 0
      [ffffe20038700000-ffffe200387fffff] potential offnode page_structs
       [ffffe2001c800000-ffffe200387fffff] PMD -> [ffff810820200000-ffff81083c1fffff] on node 1
       [ffffe20040000000-ffffe2007fffffff] PUD ->ffff811027a00000 on node 2
       [ffffe20038800000-ffffe2003fffffff] PMD -> [ffff811020200000-ffff8110279fffff] on node 2
      [ffffe20054700000-ffffe200547fffff] potential offnode page_structs
       [ffffe20040000000-ffffe200547fffff] PMD -> [ffff811027c00000-ffff81103c3fffff] on node 2
      [ffffe20070700000-ffffe200707fffff] potential offnode page_structs
       [ffffe20054800000-ffffe200707fffff] PMD -> [ffff811820200000-ffff81183c1fffff] on node 3
       [ffffe20080000000-ffffe200bfffffff] PUD ->ffff81202fa00000 on node 4
       [ffffe20070800000-ffffe2007fffffff] PMD -> [ffff812020200000-ffff81202f9fffff] on node 4
      [ffffe2008c700000-ffffe2008c7fffff] potential offnode page_structs
       [ffffe20080000000-ffffe2008c7fffff] PMD -> [ffff81202fc00000-ffff81203c3fffff] on node 4
      [ffffe200a8700000-ffffe200a87fffff] potential offnode page_structs
       [ffffe2008c800000-ffffe200a87fffff] PMD -> [ffff812820200000-ffff81283c1fffff] on node 5
       [ffffe200c0000000-ffffe200ffffffff] PUD ->ffff813037a00000 on node 6
       [ffffe200a8800000-ffffe200bfffffff] PMD -> [ffff813020200000-ffff8130379fffff] on node 6
      [ffffe200c4700000-ffffe200c47fffff] potential offnode page_structs
       [ffffe200c0000000-ffffe200c47fffff] PMD -> [ffff813037c00000-ffff81303c3fffff] on node 6
       [ffffe200c4800000-ffffe200e07fffff] PMD -> [ffff813820200000-ffff81383c1fffff] on node 7
      
      instead of a very long printout...
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c2b91e2e
    • mm: allow reserve_bootmem() cross nodes · a5645a61
      Yinghai Lu committed
      Split reserve_bootmem_core() into two functions: one which checks for
      conflicts, and one which sets the bits.
      
      Also make reserve_bootmem() loop over bdata_list so that a reservation can
      cross nodes.
      
      Users could be crashkernel and ramdisk..., in case the range provided
      by those externalities crosses the nodes.
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a5645a61
    • mm: offset align in alloc_bootmem() · 9a2dc04c
      Yinghai Lu committed
      Offset alignment is needed when node_boot_start's alignment is less than
      the alignment required.
      
      Use a local node_boot_start to match the alignment, so that no extra operation
      is added in the search loop.
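
      The idea, in rough form (a sketch of the kind of adjustment involved, not
      the literal diff):

      	/* Compute the search offset relative to this node's start, so a
      	 * node_boot_start that is less aligned than 'align' still yields
      	 * correctly aligned results. */
      	unsigned long offset = 0;

      	if (align && (bdata->node_boot_start & (align - 1UL)) != 0)
      		offset = align - (bdata->node_boot_start & (align - 1UL));
      	offset >>= PAGE_SHIFT;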
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9a2dc04c
    • mm: fix alloc_bootmem_core to use fast searching for all nodes · ad09315c
      Yinghai Lu committed
      Make the nodes other than node 0 use bdata->last_success for fast
      search too.
      
      We need to use __alloc_bootmem_core() for vmemmap allocation for other
      nodes when numa and sparsemem/vmemmap are enabled.
      
      Also, make the fail_block path increase i by incr only after ALIGN,
      to avoid an extra increase when size is larger than align.
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ad09315c
    • mm: make mem_map allocation continuous · e123dd3f
      Yinghai Lu committed
      vmemmap allocation currently has this layout:
      
       [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
       [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001800000 on node 0
       [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001c00000 on node 0
       [ffffe20000600000-ffffe200007fffff] PMD ->ffff810002000000 on node 0
       [ffffe20000800000-ffffe200009fffff] PMD ->ffff810002400000 on node 0
      ...
      
      Note that there is a 2M hole between them - not optimal.
      
      The root cause is that the usemap (24 bytes) is allocated after every 2M
      mem_map, and it pushes the next vmemmap (2M) to the next 2M alignment.
      
      Solution: try to allocate the mem_map contiguously.
      
      After the patch, we get:
      
       [ffffe20000000000-ffffe200001fffff] PMD ->ffff810001400000 on node 0
       [ffffe20000200000-ffffe200003fffff] PMD ->ffff810001600000 on node 0
       [ffffe20000400000-ffffe200005fffff] PMD ->ffff810001800000 on node 0
       [ffffe20000600000-ffffe200007fffff] PMD ->ffff810001a00000 on node 0
       [ffffe20000800000-ffffe200009fffff] PMD ->ffff810001c00000 on node 0
      ...
      
      which is the ideal layout.
      
      and the usemaps will share a page because they are allocated contiguously too:
      
      sparse_early_usemap_alloc: usemap = ffff810024e00000 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00080 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00100 size = 24
      sparse_early_usemap_alloc: usemap = ffff810024e00180 size = 24
      ...
      
      So we make the bootmem allocation more compact and use less memory
      for the usemaps => mission accomplished ;-)
      Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e123dd3f
  3. 24 April 2008 (1 commit)
  4. 22 April 2008 (1 commit)
  5. 20 April 2008 (4 commits)
  6. 18 April 2008 (2 commits)
  7. 16 April 2008 (3 commits)
    • add "Isolate" migratetype name to /proc/pagetypeinfo · 91446b06
      KOSAKI Motohiro committed
      In a5d76b54 ("memory unplug: page isolation", by
      KAMEZAWA Hiroyuki), the "isolate" migratetype was added, but unfortunately the
      /proc/pagetypeinfo display logic was not updated to handle it.
      
      This patch adds "Isolate" to the pagetype name field.
      
      /proc/pagetypeinfo
      before:
      ------------------------------------------------------------------------------------------------------------------------
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      2      2      2      1      2      2      1      1      0      0
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Movable      2      3      3      1      3      3      2      0      0      0      0
      Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone      DMA, type       <NULL>      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable      1      9      7      4      1      1      1      1      0      0      0
      Node    0, zone   Normal, type  Reclaimable      5      2      0      0      1      1      0      0      0      1      0
      Node    0, zone   Normal, type      Movable      0      1      1      0      0      0      1      0      0      1     60
      Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone   Normal, type       <NULL>      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone  HighMem, type    Unmovable      0      0      1      1      1      0      1      1      2      2      0
      Node    0, zone  HighMem, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone  HighMem, type      Movable    236     62      6      2      2      1      1      0      1      1     16
      Node    0, zone  HighMem, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone  HighMem, type       <NULL>      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve       <NULL>
      Node 0, zone      DMA            1            0            2       1            0
      Node 0, zone   Normal           10           40          169       1            0
      Node 0, zone  HighMem            2            0          283       1            0
      
      after:
      ------------------------------------------------------------------------------------------------------------------------
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      2      2      2      1      2      2      1      1      0      0
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Movable      2      3      3      1      3      3      2      0      0      0      0
      Node    0, zone      DMA, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable      0      2      1      1      0      1      0      0      0      0      0
      Node    0, zone   Normal, type  Reclaimable      1      1      1      1      1      0      1      1      1      0      0
      Node    0, zone   Normal, type      Movable      0      1      1      1      0      1      0      1      0      0    196
      Node    0, zone   Normal, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone  HighMem, type    Unmovable      0      1      0      0      0      1      1      1      2      2      0
      Node    0, zone  HighMem, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone  HighMem, type      Movable      1      0      1      1      0      0      0      0      1      0    200
      Node    0, zone  HighMem, type      Reserve      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone  HighMem, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable  Reclaimable      Movable      Reserve      Isolate
      Node 0, zone      DMA            1            0            2       1            0
      Node 0, zone   Normal            8            4          207       1            0
      Node 0, zone  HighMem            2            0          283       1            0
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91446b06
    • memcg: fix oops in oom handling · e115f2d8
      Li Zefan committed
      When I used a test program to fork masses of processes and immediately move them
      to a cgroup where the memory limit is low enough to trigger an oom kill, I got an
      oops:
      
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000808
      IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
      PGD 4c95f067 PUD 4406c067 PMD 0
      Oops: 0002 [1] SMP
      CPU 2
      Modules linked in:
      
      Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5
      RIP: 0010:[<ffffffff8045c47f>]  [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
      RSP: 0018:ffff8100448c7c30  EFLAGS: 00010002
      RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3
      RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808
      RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900
      R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140
      R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200
      FS:  00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0)
      Stack:  ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0
       ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0
       ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0
      Call Trace:
       [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9
       [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2
       [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67
       [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202
       [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f
       [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe
       [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af
       [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202
       [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4
       [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429
       [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364
       [<ffffffff80210471>] ? sys_mmap+0xe5/0x111
       [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1
      
      Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00
      RIP  [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18
       RSP <ffff8100448c7c30>
      CR2: 0000000000000808
      ---[ end trace c3702fa668021ea4 ]---
      
      It's reproducible on an x86_64 box, but doesn't happen on x86_32.
      
      This is because tsk->sighand is not guarded by RCU, so we have to
      hold tasklist_lock, just as out_of_memory() does.
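
      A simplified sketch of mem_cgroup_out_of_memory() with the fix applied
      (retry and cgroup-locking details omitted; p, points, mem and gfp_mask as
      in mm/oom_kill.c):

      	read_lock(&tasklist_lock);	/* keeps tsk->sighand from going away */
      	p = select_bad_process(&points, mem);
      	if (p && !IS_ERR(p))
      		oom_kill_process(p, gfp_mask, 0, points,
      				 "Memory cgroup out of memory");
      	read_unlock(&tasklist_lock);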
      Signed-off-by: Li Zefan <lizf@cn.fujitsu>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e115f2d8
    • mm: sparsemem memory_present() fix · bead9a3a
      Ingo Molnar committed
      Fix memory corruption and crash on 32-bit x86 systems.
      
      If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of
      RAM, then we call memory_present() with a start/end that goes outside
      the scope of MAX_PHYSMEM_BITS.
      
      That causes this loop to happily walk over the limit of the sparse
      memory section map:
      
          for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
                      unsigned long section = pfn_to_section_nr(pfn);
                      struct mem_section *ms;
      
                      sparse_index_init(section, nid);
                      set_section_nid(section, nid);
      
                      ms = __nr_to_section(section);
                      if (!ms->section_mem_map)
                              ms->section_mem_map = sparse_encode_early_nid(nid) |
      			                                SECTION_MARKED_PRESENT;
      
      'ms' will be out of bounds and we'll corrupt a small amount of memory by
      encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
      
      The corruption might happen when encoding a non-zero node ID, or due to
      the SECTION_MARKED_PRESENT which is 0x1:
      
      	mmzone.h:#define	SECTION_MARKED_PRESENT	(1UL<<0)
      
      The fix is to sanity check anything the architecture passes to
      sparsemem.
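
      An illustrative clamp (not the literal patch): drop or trim anything
      beyond what the section map can represent.

      	unsigned long max_sparsemem_pfn = 1UL << (MAX_PHYSMEM_BITS - PAGE_SHIFT);

      	if (start >= max_sparsemem_pfn) {
      		WARN_ON_ONCE(1);	/* architecture passed a bogus range */
      		return;
      	}
      	if (end > max_sparsemem_pfn) {
      		WARN_ON_ONCE(1);
      		end = max_sparsemem_pfn;
      	}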
      
      This bug seems to be rather old (as old as sparsemem support itself),
      but the exact incarnation depended on random details like configs, which
      made this bug more prominent in v2.6.25-to-be.
      
      An additional enhancement might be to print a warning about ignored or
      trimmed memory ranges.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Tested-by: Christoph Lameter <clameter@sgi.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Yinghai Lu <Yinghai.Lu@sun.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bead9a3a
  8. 14 April 2008 (4 commits)