1. 28 Apr, 2008 17 commits
    •
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn committed
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superfluous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy reference counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      A lookup of a mempolicy based on (vma, address) need only add a reference
      for a shared policy, and we need only unref the policy when finished if it
      is shared.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
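      
      The conditional put and conditional copy described above can be sketched in
      user space as follows.  This is a minimal illustration, not the kernel code:
      the ex_ names, the plain int reference count, and the struct layout are all
      assumptions made for the example, and EX_MPOL_F_SHARED is assumed to sit in
      bit zero as the MPOL_F_SHARED patch in this series describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
#define EX_MPOL_F_SHARED (1u << 0)	/* assumed: internal flag in bit zero */

struct ex_mempolicy {
	int refcnt;		/* plain int here; the kernel uses an atomic */
	unsigned short mode;
	unsigned short flags;
};

/* Unconditional put: drop one reference, report whether the count hit zero. */
static bool ex_mpol_put(struct ex_mempolicy *pol)
{
	assert(pol->refcnt > 0);
	return --pol->refcnt == 0;
}

/* Conditional put: only shared policies took an extra reference on lookup. */
static bool ex_mpol_cond_put(struct ex_mempolicy *pol)
{
	if (pol && (pol->flags & EX_MPOL_F_SHARED))
		return ex_mpol_put(pol);
	return false;
}

/*
 * Conditional copy: for a shared policy, copy it into caller-provided
 * (for example on-stack) storage, clear the shared flag on the copy, and
 * release the lookup reference on the original.  Non-shared policies are
 * returned unchanged.
 */
static struct ex_mempolicy *ex_mpol_cond_copy(struct ex_mempolicy *to,
					      struct ex_mempolicy *from)
{
	if (!from || !(from->flags & EX_MPOL_F_SHARED))
		return from;
	*to = *from;	/* flat copy: assumes the struct owns no pointers */
	to->flags = (unsigned short)(to->flags & ~EX_MPOL_F_SHARED);
	ex_mpol_put(from);
	return to;
}
```

      With a copy made this way, a caller that allocates several pages in a row
      can pass the copy to each allocation; a conditional put on the copy is a
      no-op because the shared flag has been cleared.
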
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: mark shared policies for unref · aab0b102
      Lee Schermerhorn committed
      As part of yet another rework of mempolicy reference counting, we want to be
      able to identify shared policies efficiently, because they have an extra ref
      taken on lookup that needs to be removed when we're finished using the policy.
      
        Note:  the extra ref is required because the policies are
        shared between tasks/processes and can be changed/freed
        by one task while another task is using them--e.g., for
        page allocation.
      
      Building on David Rientjes' mempolicy "mode flags" enhancement, this patch
      indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
      member of the struct mempolicy added by David.  MPOL_F_SHARED and any future
      "internal mode flags" are reserved from bit zero up, as they will never be
      passed in the upper bits of the mode argument of a mempolicy API.
      
      I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
      rb-tree.  There is no need to clear the flag when removing the policy from
      the tree, as the mempolicy is freed [unref'd] internally by the sp_delete()
      function.  However, a task could hold another reference on this mempolicy
      from a prior lookup.  We need the MPOL_F_SHARED flag to stay put so that any
      tasks holding a ref will unref, eventually freeing, the mempolicy.
      
      A later patch in this series will introduce a function to conditionally unref
      [mpol_free] a policy.  The MPOL_F_SHARED flag is one reason [currently the
      only reason] to unref/free a policy via the conditional free.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: rename struct mempolicy 'policy' member to 'mode' · 45c4745a
      Lee Schermerhorn committed
      The terms 'policy' and 'mode' are both used in various places to describe the
      semantics of the value stored in the 'policy' member of struct mempolicy.
      Furthermore, the term 'policy' is used to refer to that member, to the entire
      struct mempolicy and to the more abstract concept of the tuple consisting of a
      "mode" and an optional node or set of nodes.  Recently, we have added "mode
      flags" that are passed in the upper bits of the 'mode' [or sometimes,
      'policy'] member of the numa APIs.
      
      I'd like to resolve this confusion, which perhaps only exists in my mind, by
      renaming the 'policy' member to 'mode' throughout, and fixing up the
      Documentation.  Man pages will be updated separately.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: fixup Fallback for Default Shmem Policy · ae4d8c16
      Lee Schermerhorn committed
      get_vma_policy() is not handling fallback to task policy correctly when the
      get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
      was holding the fallback task mempolicy.  So, it was falling back directly to
      system default policy.
      
      Fix get_vma_policy() to use only non-NULL policy returned from the vma
      get_policy op.
      
      shm_get_policy() was falling back to current task's mempolicy if the "backing
      file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
      the vma policy is null.  This is incorrect for show_numa_maps() which is
      likely querying the numa_maps of some task other than current.  Remove this
      fallback.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: write lock mmap_sem while changing task mempolicy · f4e53d91
      Lee Schermerhorn committed
      A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
      while examining each vma's mempolicy.  A vma's mempolicy can fall back to the
      task's policy.  However, the task could be changing its task policy and free
      the one that show_numa_maps() is examining.
      
      To prevent this, grab the mmap_sem for write when updating task mempolicy.
      Pointed out to me by Christoph Lameter and extracted and reworked from
      Christoph's alternative mempol reference counting patch.
      
      This is analogous to the way that do_mbind() and do_get_mempolicy() prevent
      races between tasks sharing an mm_struct [a.k.a. threads] setting and
      querying a mempolicy for a particular address.
      
      Note: this is necessary, but not sufficient, to allow us to stop taking an
      extra reference on "other task's mempolicy" in get_vma_policy.  Subsequent
      patches will complete this update, allowing us to simplify the tests for
      whether we need to unref a mempolicy at various points in the code.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: rename mpol_copy to mpol_dup · 846a16bf
      Lee Schermerhorn committed
      This patch renames mpol_copy() to mpol_dup() because, well, that's what it
      does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
      existing mempolicy, allocates a new one and copies the contents.
      
      In a later patch, I want to use the name mpol_copy() to copy the contents from
      one mempolicy to another like, e.g., strcpy() does for strings.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn committed
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: disallow static or relative flags for local preferred mode · 3e1f0645
      David Rientjes committed
      MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for
      MPOL_PREFERRED policies that were created with an empty nodemask (for purely
      local allocations).  Such policies are never invalidated when a task's
      allowed mems change, and they never need to be rebound relative to a
      cpuset's placement.
      
      Also fixes a bug identified by Lee Schermerhorn that disallowed empty
      nodemasks to be passed to MPOL_PREFERRED to specify local allocations.  [A
      different, somewhat incomplete, patch already existed in 25-rc5-mm1.]
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: create mempolicy_operations structure · 37012946
      David Rientjes committed
      Create a mempolicy_operations structure that currently points to two
      functions[*] for the various modes:
      
      	int (*create)(struct mempolicy *, const nodemask_t *);
      	void (*rebind)(struct mempolicy *, const nodemask_t *);
      
      This splits the implementation for the various modes out of two large
      functions, mpol_new() and mpol_rebind_policy().  Eventually it may be
      beneficial to add additional functions to accommodate the existing switch()
      statements in mm/mempolicy.c.
      
       [*] The ->create() function for MPOL_DEFAULT is currently NULL since no
           struct mempolicy is dynamically allocated.
      
      [Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: move rebind functions · 1d0d2680
      David Rientjes committed
      Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid
      having to declare function prototypes.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: add MPOL_F_RELATIVE_NODES flag · 4c50bc01
      David Rientjes committed
      Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
      nodemasks passed via set_mempolicy() or mbind() should be considered relative
      to the current task's mems_allowed.
      
      When the mempolicy is created, the passed nodemask is folded and mapped onto
      the current task's mems_allowed.  For example, consider a task using
      set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
      nodemask of 1-3.  If current's mems_allowed is 4-7, the effective nodemask is
      5-7 (the second, third, and fourth node of mems_allowed).
      
      If the same task is attached to a cpuset, the mempolicy nodemask is rebound
      each time the mems are changed.  Some possible rebinds and results are:
      
      	mems			result
      	1-3			1-3
      	1-7			2-4
      	1,5-6			1,5-6
      	1,5-7			5-7
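      
      The fold-and-map step behind the table above can be sketched as below.  This
      is an illustrative user-space model only: it assumes at most 64 nodes held
      in a uint64_t, and the ex_ function names are hypothetical.  I believe the
      kernel performs the equivalent steps with its nodes_fold() and nodes_onto()
      helpers, but treat that mapping as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Number of set bits in a 64-node mask. */
static int ex_weight(uint64_t mask)
{
	int n = 0;

	for (; mask; mask &= mask - 1)
		n++;
	return n;
}

/* Fold: wrap every set bit of 'rel' into the range [0, size). */
static uint64_t ex_fold(uint64_t rel, int size)
{
	uint64_t folded = 0;

	for (int bit = 0; bit < 64; bit++)
		if (rel & (UINT64_C(1) << bit))
			folded |= UINT64_C(1) << (bit % size);
	return folded;
}

/* Onto: map relative position n to the n-th set bit of 'allowed'. */
static uint64_t ex_onto(uint64_t folded, uint64_t allowed)
{
	uint64_t result = 0;
	int pos = 0;

	for (int node = 0; node < 64; node++) {
		if (!(allowed & (UINT64_C(1) << node)))
			continue;
		if (folded & (UINT64_C(1) << pos))
			result |= UINT64_C(1) << node;
		pos++;
	}
	return result;
}

/* Relative remap: fold the user's mask to the size of 'allowed', then map. */
static uint64_t ex_relative_remap(uint64_t rel, uint64_t allowed)
{
	int size = ex_weight(allowed);

	assert(size > 0);	/* an empty allowed set cannot be mapped onto */
	return ex_onto(ex_fold(rel, size), allowed);
}
```

      With the relative nodemask 1-3 (0x0E), calling ex_relative_remap() on mems
      of 4-7 (0xF0) gives 5-7 (0xE0), matching the example in the text; the four
      rows of the table can be checked the same way.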
      
      Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
      to the resultant nodemask from the relative remap.
      
      In the MPOL_PREFERRED case, the preferred node is remapped from the currently
      effective nodemask to the relative nodemask.
      
      This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      David Rientjes committed
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	}
      
      that stores the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side-effect in that it allows the mempolicy
      "intent" to lie dormant, taking no effect, until it has access to the node(s)
      that it desires.  For example, if you currently ask for an interleaved policy
      over a set of nodes that you do not have access to, the mempolicy is not
      created and the task continues to use the previous policy.  With this change,
      however, it is possible to create the same mempolicy; it only takes effect
      when access to nodes in the nodemask is acquired.
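      
      The "dormant intent" behavior can be illustrated with a small user-space
      sketch.  The struct and function below are hypothetical (nodemasks are
      modelled as uint64_t bitmasks); they only show the intersection the text
      describes, not the kernel's rebind code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a policy created with a static nodemask. */
struct ex_static_policy {
	uint64_t user_nodemask;	/* what the user originally asked for */
	uint64_t effective;	/* nodes the policy can currently use */
};

/*
 * Rebind with static semantics: always intersect the user's original
 * nodemask with the currently allowed nodes, with no remapping.
 * Returns false while the policy is dormant (no requested node allowed).
 */
static bool ex_rebind_static(struct ex_static_policy *pol, uint64_t allowed)
{
	pol->effective = pol->user_nodemask & allowed;
	return pol->effective != 0;
}
```

      Because user_nodemask is never modified, a later rebind that grants access
      to some of the requested nodes brings the policy back into effect without
      the user having to recreate it.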
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
      	mount -o remount mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new() since it
      is now obsoleted.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: support optional mode flags · 028fec41
      David Rientjes committed
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	}
      
      The splitting of the 'int' actual passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall if
      there are additional flags, and storing it in the new 'flags' member of struct
      mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
      the 'policy' member of the struct and all current users of pol->policy remain
      unchanged.
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such applications would need to use optional mode flags when
      calling set_mempolicy() or mbind() for these previously implemented statements
      to stop working.  If an application does start using optional mode flags, it
      will need to mask the optional flags off the policy in switch and conditional
      statements that only test mode.
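      
      A minimal sketch of the masking that such an application would need is
      shown below.  The EX_ constants are local stand-ins defined for this
      example; their values are assumptions modelled on the flags added by the
      later patches in this series, and real code should use the definitions
      from the system's mempolicy header instead.

```c
/* Local stand-ins for the mode constants (sequentially numbered). */
enum {
	EX_MPOL_DEFAULT,
	EX_MPOL_PREFERRED,
	EX_MPOL_BIND,
	EX_MPOL_INTERLEAVE
};

/* Assumed flag values: optional mode flags live in the upper bits. */
#define EX_MPOL_F_STATIC_NODES		(1 << 15)
#define EX_MPOL_F_RELATIVE_NODES	(1 << 14)
#define EX_MPOL_MODE_FLAGS \
	(EX_MPOL_F_STATIC_NODES | EX_MPOL_F_RELATIVE_NODES)

/* Mode with the optional flags masked off; safe to use in a switch. */
static int ex_policy_mode(int policy)
{
	return policy & ~EX_MPOL_MODE_FLAGS;
}

/* Only the optional mode flags. */
static int ex_policy_flags(int policy)
{
	return policy & EX_MPOL_MODE_FLAGS;
}
```

      A test such as "policy == MPOL_INTERLEAVE" stops matching once a flag is
      set in the same int, whereas comparing ex_policy_mode(policy) against the
      mode constant continues to work.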
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mempolicy: convert MPOL constants to enum · a3b51e01
      David Rientjes committed
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values are those in the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman committed
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman committed
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a subtraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if accessing zone->node turns out to be a
      significant cost, which may be the case on workloads where nodemasks are
      heavily used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
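      
      The idea of caching the zone index next to the zone pointer can be sketched
      as follows.  The ex_ types and helpers are hypothetical simplifications for
      illustration: the zone structure is reduced to a node id and a name, and the
      list is terminated by an entry with a NULL zone pointer.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for a zone. */
struct ex_zone {
	int node;		/* node this zone belongs to */
	const char *name;
};

/* One zonelist entry: the zone pointer plus its cached index. */
struct ex_zoneref {
	struct ex_zone *zone;	/* NULL marks the end of the list */
	int zone_idx;		/* cached, so no lookup or subtraction */
};

/* Helper: zone index of an entry, read straight from the cache. */
static int ex_zoneref_idx(const struct ex_zoneref *z)
{
	return z->zone_idx;
}

/* Helper: node of an entry (still goes through the zone pointer here). */
static int ex_zoneref_node(const struct ex_zoneref *z)
{
	return z->zone->node;
}

/*
 * Return the first entry at or after 'z' whose zone index does not exceed
 * 'highest_idx'; returns the terminating entry if none qualifies.
 */
static struct ex_zoneref *ex_next_zone(struct ex_zoneref *z, int highest_idx)
{
	while (z->zone && ex_zoneref_idx(z) > highest_idx)
		z++;
	return z;
}
```

      The filter loop touches only the zoneref array itself, which is the point
      of storing the index alongside the pointer.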
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    •
      mm: introduce node_zonelist() for accessing the zonelist for a GFP mask · 0e88460d
      Mel Gorman committed
      Introduce a node_zonelist() helper function.  It is used to look up the
      appropriate zonelist given a node and a GFP mask.  The patch on its own is a
      cleanup but it helps clarify parts of the two-zonelist-per-node patchset.  If
      necessary, it can be merged with the next patch in this set without problems.
      Reviewed-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 11 Mar, 2008 1 commit
    •
      mempolicy: fix reference counting bugs · 69682d85
      Lee Schermerhorn committed
      Address 3 known bugs in the current memory policy reference counting method.
      I have a series of patches to rework the reference counting to reduce overhead
      in the allocation path.  However, that series will require testing in -mm once
      I repost it.
      
      1) alloc_page_vma() does not release the extra reference taken for
         vma/shared mempolicy when the mode == MPOL_INTERLEAVE.  This can result in
         leaking mempolicy structures.  This is probably occurring, but not being
         noticed.
      
         Fix:  add the conditional release of the reference.
      
      2) hugezonelist unconditionally releases a reference on the mempolicy when
         mode == MPOL_INTERLEAVE.  This can result in decrementing the reference
         count for system default policy [should have no ill effect] or premature
         freeing of task policy.  If this occurred, the next allocation using task
         mempolicy would use the freed structure and probably BUG out.
      
         Fix:  add the necessary check to the release.
      
      3) The current reference counting method assumes that vma 'get_policy()'
         methods automatically add an extra reference to a non-NULL returned
         mempolicy.  This is true for shmem_get_policy() used by tmpfs mappings,
         including regular page shm segments.  However, SHM_HUGETLB shm's, backed
         by hugetlbfs, just use the vma policy without the extra reference.  This
         results in freeing of the vma policy on the first allocation, with reuse of
         the freed mempolicy structure on subsequent allocations.
      
         Fix: Rather than add another condition to the conditional reference
         release, which occurs in the allocation path, just add a reference when
         returning the vma policy in shm_get_policy() to match the assumptions.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <eric.whitney@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 15 Feb, 2008 1 commit
  4. 12 Feb, 2008 1 commit
    •
      mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      KOSAKI Motohiro committed
      Kosaki Motohiro noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mems_allowed, for the same reason:
               the node mask contains disallowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that the incoming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that the contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if the incoming mask was not empty
            and the resulting mask IS empty, the user specified invalid nodes.
            Return EINVAL.
         d) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
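
      The checks in steps 1 and 2 can be sketched in userspace C as below.
      This is only an illustration under stated assumptions, not the kernel
      code: a plain unsigned long stands in for nodemask_t, the "allowed"
      argument stands in for the task's mems_allowed, and every sketch_/SK_
      name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's MPOL_* policy modes. */
enum sketch_mode { SK_DEFAULT, SK_PREFERRED, SK_BIND, SK_INTERLEAVE };

/*
 * Toy model of the proposed check.  *nodes is the caller's nodemask and
 * allowed models the task's mems_allowed (assumed to contain only nodes
 * with memory).  Returns 0 if the request is valid, -EINVAL otherwise.
 */
int sketch_check_policy(enum sketch_mode mode, unsigned long *nodes,
                        unsigned long allowed)
{
    bool was_empty;

    assert(nodes != NULL);
    was_empty = (*nodes == 0);      /* 1a: remember whether mask was empty */
    *nodes &= allowed;              /* 1b: silently drop disallowed nodes */

    switch (mode) {
    case SK_DEFAULT:                /* 2a: the incoming mask must be empty */
        return was_empty ? 0 : -EINVAL;
    case SK_BIND:
    case SK_INTERLEAVE:             /* 2b: at least one node must survive */
        return *nodes ? 0 : -EINVAL;
    case SK_PREFERRED:              /* 2c: non-empty in, empty out is invalid */
        return (!was_empty && *nodes == 0) ? -EINVAL : 0;
    }
    return -EINVAL;                 /* unknown mode */
}
```

      In this model an empty mask with SK_PREFERRED is still accepted, which
      corresponds to local allocation, while a mask that names only
      disallowed nodes is rejected.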
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by KOSAKI Motohiro, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31f1de46
  5. 15 Nov, 2007 1 commit
    • L
      Migration: find correct vma in new_vma_page() · 3ad33b24
      Lee Schermerhorn committed
      We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
      mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas.  For
      anon-regions, we just fail to migrate any pages beyond the 1st vma in the
      range.
      
      This occurs because do_mbind() collects a list of pages to migrate by
      calling check_range().  check_range() walks the task's mm, spanning vmas as
      necessary, to collect the migratable pages into a list.  Then, do_mbind()
      calls migrate_pages() passing the list of pages, a function to allocate new
      pages based on vma policy [new_vma_page()], and a pointer to the first vma
      of the range.
      
      For each page in the list, new_vma_page() calls page_address_in_vma(),
      passing the page and the vma [first in range], to obtain the address to
      pass to alloc_page_vma().  The page address is needed to get interleaving
      policy correct.  If the pages in the list come from multiple vmas,
      eventually new_vma_page() will pass a page to page_address_in_vma()
      with the incorrect vma.  For !PageAnon pages, this will result in a bug
      check in rmap.c:vma_address().  For anon pages, vma_address() will just
      return EFAULT and fail the migration.
      
      This patch modifies new_vma_page() to check the return value from
      page_address_in_vma().  If the return value is EFAULT, new_vma_page()
      searches forward via vm_next for the vma that maps the page--i.e., that
      does not return EFAULT.  This assumes that the pages in the list handed
      to migrate_pages() are in address order.  This is currently the case.
      The patch documents this assumption in a new comment block for
      new_vma_page().
      
      If new_vma_page() cannot locate the vma mapping the page in a forward
      search in the mm, it will pass a NULL vma to alloc_page_vma().  This will
      result in the allocation using the task policy, if any, else system default
      policy.  This situation is unlikely, but the patch documents this behavior
      with a comment.
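
      The forward search can be modeled in userspace C roughly as follows.
      This is a simplified sketch, not the patch itself: struct sketch_vma
      and both functions are hypothetical, and a "page" is reduced to the
      virtual address at which it is mapped, whereas the kernel derives
      that address from the page and the vma.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified vma: a half-open range [start, end). */
struct sketch_vma {
    unsigned long start, end;
    struct sketch_vma *next;        /* plays the role of vm_next */
};

/*
 * Models the page_address_in_vma() contract described above: yield the
 * address if this vma maps it, otherwise -EFAULT.
 */
long sketch_address_in_vma(unsigned long addr, const struct sketch_vma *vma)
{
    assert(vma != NULL);
    if (addr >= vma->start && addr < vma->end)
        return (long)addr;
    return -EFAULT;
}

/*
 * Walk forward from the first vma of the range until one maps the page.
 * Returns NULL when no vma maps it; the caller would then allocate with
 * a NULL vma, i.e. under task or system default policy.
 */
const struct sketch_vma *sketch_find_vma(unsigned long addr,
                                         const struct sketch_vma *vma)
{
    while (vma != NULL) {
        if (sketch_address_in_vma(addr, vma) != -EFAULT)
            break;                  /* found the mapping vma */
        vma = vma->next;
    }
    return vma;
}
```

      Because every call starts from the first vma, the walk is repeated
      for each page in the list.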
      
      Note, this patch results in restarting from the first vma in a multi-vma
      range each time new_vma_page() is called.  If this is not acceptable, we
      can make the vma argument a pointer, both in new_vma_page() and its caller
      unmap_and_move() so that the value held by the loop in migrate_pages()
      always passes down the last vma in which a page was found.  This will
      require changes to all new_page_t functions passed to migrate_pages().  Is
      this necessary?
      
      For this patch to work, we can't bug check in vma_address() for pages
      outside the argument vma.  This patch removes the BUG_ON().  All other
      callers [besides new_vma_page()] already check the return status.
      
      Tested on x86_64, 4 node NUMA platform.
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ad33b24
  6. 20 Oct, 2007 3 commits
    • P
      Uninline find_task_by_xxx set of functions · 228ebcbe
      Pavel Emelyanov committed
      The find_task_by_something functions are a set of macros used to find a
      task by pid, depending on what kind of pid is proposed - global or virtual.
      All of them are wrappers around the most generic one -
      find_task_by_pid_type_ns() - and just substitute some args for it.
      
      It turned out that dereferencing the current->nsproxy->pid_ns construction
      and pushing one more argument on the stack inline causes the kernel text
      size to grow.
      
      This patch moves all this stuff out-of-line into kernel/pid.c.  Together
      with the next patch it saves a bit less than 400 bytes from the .text
      section.
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      228ebcbe
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov committed
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or obtained from the user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing a pid's numerical value to the user the virtual one
         should be used; however, when one shows a task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one needs to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
      Signed-off-by: Alexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b488893a
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage committed
      Remove the filesystem support logic from the cpusets system and make cpusets
      a cgroup subsystem.
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8793d854
  7. 17 Oct, 2007 6 commits
  8. 20 Sep, 2007 1 commit
    • L
      Fix NUMA Memory Policy Reference Counting · 480eccf9
      Lee Schermerhorn committed
      This patch proposes fixes to the reference counting of memory policy in the
      page allocation paths and in show_numa_map().  Extracted from my "Memory
      Policy Cleanups and Enhancements" series as stand-alone.
      
      Shared policy lookup [shmem] has always added a reference to the policy,
      but this was never unrefed after page allocation or after formatting the
      numa map data.
      
      Default system policy should not require additional ref counting, nor
      should the current task's task policy.  However, show_numa_map() calls
      get_vma_policy() to examine what may be [likely is] another task's policy.
      The latter case needs protection against freeing of the policy.
      
      This patch adds a reference count to a mempolicy returned by
      get_vma_policy() when the policy is a vma policy or another task's
      mempolicy.  Again, shared policy is already reference counted on lookup.  A
      matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
      shared and vma policies, and in show_numa_map() for shared and another
      task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
      inexpensive inline NULL test, because we know we have a non-NULL policy.
      
      Handling policy ref counts for hugepages is a bit trickier.
      huge_zonelist() returns a zone list that might come from a shared or vma
      MPOL_BIND policy.  In this case, we should hold the reference until after the
      huge page allocation in dequeue_hugepage().  The patch modifies
      huge_zonelist() to return a pointer to the mempolicy if it needs to be
      unref'd after allocation.
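
      The lookup/unref pairing described above can be sketched as a toy
      model.  Everything here is an assumption made for illustration: the
      struct, the origin enum and both functions are hypothetical, the
      counter is a plain int rather than the kernel's atomic refcount, and
      the shared-policy case is modeled as the same increment even though,
      per the text, the shared lookup itself takes that reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy object with a non-atomic reference count. */
struct sketch_policy {
    int refcnt;
    bool freed;
};

/* Where a looked-up policy came from. */
enum sketch_origin {
    SK_SYSTEM_DEFAULT, SK_CURRENT_TASK, SK_OTHER_TASK, SK_VMA, SK_SHARED
};

/*
 * Lookup side: take a reference only for policies that could be freed
 * underneath the caller.  Returns true when the caller owes an unref.
 */
bool sketch_get_policy(struct sketch_policy *pol, enum sketch_origin origin)
{
    assert(pol != NULL);
    switch (origin) {
    case SK_SHARED:
    case SK_VMA:
    case SK_OTHER_TASK:
        pol->refcnt++;
        return true;
    default:
        /* System default and the current task's own policy: no extra ref. */
        return false;
    }
}

/* Unref side: drop one reference and mark the object freed at zero. */
void sketch_put_policy(struct sketch_policy *pol)
{
    assert(pol != NULL && pol->refcnt > 0);
    if (--pol->refcnt == 0)
        pol->freed = true;
}
```

      A caller would pair each lookup that returns true with exactly one
      sketch_put_policy() once the allocation or formatting is done.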
      
      Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
      
                    w/o patch          w/ refcount patch
                   Avg   Std Devn        Avg   Std Devn
      Real:     100.59       0.38     100.63       0.43
      User:    1209.60       0.37    1209.91       0.31
      System:    81.52       0.42      81.64       0.34
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      480eccf9
  9. 31 Aug, 2007 1 commit
  10. 23 Aug, 2007 1 commit
    • M
      Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE · b377fd39
      Mel Gorman committed
      The NUMA layer only supports NUMA policies for the highest zone.  When
      ZONE_MOVABLE is configured with kernelcore=, the highest zone becomes
      ZONE_MOVABLE.  The result is that policies are only applied to allocations
      like anonymous pages and page cache allocated from ZONE_MOVABLE when the
      zone is used.
      
      This patch applies policies to the two highest zones when the highest zone
      is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
      zone, it's always functionally equivalent.
      
      The patch has been tested on a variety of machines both NUMA and non-NUMA
      covering x86, x86_64 and ppc64.  No abnormal results were seen in
      kernbench, tbench, dbench or hackbench.  It passes regression tests from
      the numactl package with and without kernelcore= once numactl tests are
      patched to wait for vmstat counters to update.
      
      akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
      ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
      the mobility or get the other solution that Mel is working on.  That solution
      would only use a single zonelist per node and filter on the fly.  That may
      help performance and also help to make memory policies work better."
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b377fd39
  11. 20 Jul, 2007 1 commit
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt committed
      Slab destructors were no longer supported after Christoph's c59def9f
      change. They've been BUGs for both slab and slub, and slob never
      supported them either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      20c2df83
  12. 18 Jul, 2007 2 commits
    • M
      Allow huge page allocations to use GFP_HIGH_MOVABLE · 396faf03
      Mel Gorman committed
      Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
      as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
      can be used to satisfy hugepage allocations even when the system has been
      running a long time.  This allows an administrator to resize the hugepage pool
      at runtime depending on the size of ZONE_MOVABLE.
      
      This patch adds a new sysctl called hugepages_treat_as_movable.  When a
      non-zero value is written to it, future allocations for the huge page pool
      will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
      introduce additional external fragmentation of note as huge pages are always
      the largest contiguous block we care about.
      
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      396faf03
    • M
      Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Mel Gorman committed
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickins.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickins for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      769848c0
  13. 17 Jul, 2007 2 commits
    • P
      numa: mempolicy: trivial debug fixes. · 140d5a49
      Paul Mundt committed
      Enabling debugging fails to build due to the nodemask variable in
      do_mbind() having changed names, and then oopses on boot due to the
      assumption that the nodemask can be dereferenced -- which doesn't work out
      so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by
      numa_default_policy().
      
      This fixes it up, and switches from PDprintk() to pr_debug() while
      we're at it.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      140d5a49
    • P
      numa: mempolicy: dynamic interleave map for system init · b71636e2
      Paul Mundt committed
      This converts the default system init memory policy to use a dynamically
      created node map instead of defaulting to all online nodes.  Nodes of a
      certain size (>= 16MB) are judged to be suitable for interleave, and are added
      to the map.  If all nodes are smaller in size, the largest one is
      automatically selected.
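
      The selection rule can be illustrated with a small userspace sketch.
      It is not the kernel implementation: the function name, the bitmask
      representation of the node map, and the per-node size array are all
      assumptions; only the 16MB threshold and the largest-node fallback
      come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* The 16MB suitability threshold mentioned in the text. */
#define SKETCH_MIN_BYTES (16UL << 20)

/*
 * Build an interleave map as a bitmask, where bit i represents node i.
 * node_bytes[i] is the (assumed) size of node i in bytes.  Nodes at or
 * above the threshold are added; if none qualifies, the largest node is
 * selected on its own.
 */
unsigned long sketch_interleave_map(const unsigned long *node_bytes,
                                    size_t nr_nodes)
{
    unsigned long map = 0, largest = 0;
    size_t i, largest_node = 0;

    assert(node_bytes != NULL);
    assert(nr_nodes <= 8 * sizeof(map));

    for (i = 0; i < nr_nodes; i++) {
        if (node_bytes[i] >= SKETCH_MIN_BYTES)
            map |= 1UL << i;        /* big enough to interleave over */
        if (node_bytes[i] > largest) {
            largest = node_bytes[i];
            largest_node = i;
        }
    }

    if (map == 0 && nr_nodes > 0)   /* every node is small: take largest */
        map = 1UL << largest_node;

    return map;
}
```

      With node sizes of 64MB, 4MB and 16MB the sketch selects nodes 0 and
      2; with 1MB, 8MB and 2MB it falls back to node 1 alone.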
      
      Without this, tiny nodes find themselves out of memory before we even make it
      to userspace.  Systems with large nodes will notice no change.
      
      Only the system init policy is affected by this change; the regular
      MPOL_DEFAULT policy is still switched to later on in the boot process as
      normal.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b71636e2
  14. 05 Mar, 2007 1 commit
  15. 21 Feb, 2007 1 commit