1. 28 4月, 2008 40 次提交
    • L
      mempolicy: support mpol=local tmpfs mount option · 3f226aa1
      Lee Schermerhorn 提交于
      For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent to
      "local allocation".  Because shared policies are at the same "scope" level
      [see Documentation/vm/numa_memory_policy.txt], as vma policies MPOL_DEFAULT
      means "fall back to current task policy".
      
      This patch extends the memory policy string parsing function to display
      "local" for MPOL_PREFERRED + MPOL_F_LOCAL.  This allows one to specify local
      allocation as the default policy for shared memory areas via the tmpfs mpol
      mount option, regardless of the current task's policy.
      
      Also, "local" is now displayed for this policy.  This patch allows us to
      accept the same input format as the display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f226aa1
    • L
      mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Lee Schermerhorn 提交于
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimeter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • L
      mempolicy: clean-up mpol-to-str() mempolicy formatting · 2291990a
      Lee Schermerhorn 提交于
      mpol-to-str() formats memory policies into printable strings.  Currently this
      is only used to display "numa_maps".  A subsequent patch will use
      mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to
      remove essentially duplicate code in mm/shmem.c.  This patch cleans up
      mpol_to_str() generally and in preparation for that patch.
      
      1) show_numa_maps() is not checking the return code from mpol_to_str().
         There's not a lot we can do in this context if mpol_to_str() did return the
         error [insufficient space in buffer].  Proposed "solution": just check,
         under DEBUG_VM, that callers are providing sufficient buffer space for the
         policy, flags, and a few nodes.  This way, we'll get some display.
         show_numa_maps() is providing a 50-byte buffer, so it won't trip this
         check.  50-bytes should be sufficient unless one has a large number of
         nodes in a very sparse nodemask.
      
      2) The display of the new mode flags ["static" & "relative"] was set up to
         display multiple flags, separated by a "bar" '|'.  However, this support is
         incomplete--e.g., need_bar was never incremented; and currently, these two
         flags are mutually exclusive.  So remove the "bar" support, for now, and
         only display one flag.
      
      3) Use snprint() to format flags, so as not to overflow the buffer.  Not
         that it's ever happed, AFAIK.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2291990a
    • L
      mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy · fc36b8d3
      Lee Schermerhorn 提交于
      Now that we're using "preferred local" policy for system default, we need to
      make this as fast as possible.  Because of the variable size of the mempolicy
      structure [based on size of nodemasks], the preferred_node may be in a
      different cacheline from the mode.  This can result in accessing an extra
      cacheline in the normal case of system default policy.  Suspect this is the
      cause of an observed 2-3% slowdown in page fault testing relative to kernel
      without this patch series.
      
      To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
      flags member which is guaranteed [?] to be in the same cacheline as the mode
      itself.
      
      Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
      for both anon and shmem segments with system default and vma [preferred local]
      policy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc36b8d3
    • L
      mempolicy: mPOL_PREFERRED cleanups for "local allocation" · 53f2556b
      Lee Schermerhorn 提交于
      Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
      v.preferred_node < 0 -- i.e., "local allocation":
      
      1)  [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
          to fetch the nodemask associated with a policy.  Currently,
          get_policy_nodemask() returns the set of nodes with memory, when
          the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
          Change to return an empty nodemask, as this is what was specified
          to achieve "local allocation".
      
      2)  When a task is moved into a [new] cpuset, mpol_rebind_policy() is
          called to adjust any task and vma policy nodes to be valid in the
          new cpuset.  However, when the policy is MPOL_PREFERRED, and the
          preferred_node is <0, no rebind is necessary.  The "local allocation"
          indication is valid in any cpuset.  Existing code will "do the right
          thing" because node_remap() will just return the argument node when
          it is outside of the valid range of node ids.  However, I think it is
          clearer and cleaner to skip the remap explicitly in this case.
      
      3)  mpol_to_str() produces a printable, "human readable" string from a
          struct mempolicy.  For MPOL_PREFERRED with preferred_node <0,  show
          "local", as this indicates local allocation, as the task migrates
          among nodes.  Note that this matches the usage of "local allocation"
          in libnuma() and numactl.  Without this change, I believe that node_set()
          [via set_bit()] will set bit 31, resulting in a misleading display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f2556b
    • L
      mempolicy: use MPOL_PREFERRED for system-wide default policy · bea904d5
      Lee Schermerhorn 提交于
      Currently, when one specifies MPOL_DEFAULT via a NUMA memory policy API
      [set_mempolicy(), mbind() and internal versions], the kernel simply installs a
      NULL struct mempolicy pointer in the appropriate context: task policy, vma
      policy, or shared policy.  This causes any use of that policy to "fall back"
      to the next most specific policy scope.
      
      The only use of MPOL_DEFAULT to mean "local allocation" is in the system
      default policy.  This requires extra checks/cases for MPOL_DEFAULT in many
      mempolicy.c functions.
      
      There is another, "preferred" way to specify local allocation via the APIs.
      That is using the MPOL_PREFERRED policy mode with an empty nodemask.
      Internally, the empty nodemask gets converted to a preferred_node id of '-1'.
      All internal usage of MPOL_PREFERRED will convert the '-1' to the id of the
      node local to the cpu where the allocation occurs.
      
      System default policy, except during boot, is hard-coded to "local
      allocation".  By using the MPOL_PREFERRED mode with a negative value of
      preferred node for system default policy, MPOL_DEFAULT will never occur in the
      'policy' member of a struct mempolicy.  Thus, we can remove all checks for
      MPOL_DEFAULT when converting policy to a node id/zonelist in the allocation
      paths.
      
      In slab_node() return local node id when policy pointer is NULL.  No need to
      set a pol value to take the switch default.  Replace switch default with
      BUG()--i.e., shouldn't happen.
      
      With this patch MPOL_DEFAULT is only used in the APIs, including internal
      calls to do_set_mempolicy() and in the display of policy in
      /proc/<pid>/numa_maps.  It always means "fall back" to the the next most
      specific policy scope.  This simplifies the description of memory policies
      quite a bit, with no visible change in behavior.
      
      get_mempolicy() continues to return MPOL_DEFAULT and an empty nodemask when
      the requested policy [task or vma/shared] is NULL.  These are the values one
      would supply via set_mempolicy() or mbind() to achieve that condition--default
      behavior.
      
      This patch updates Documentation to reflect this change.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bea904d5
    • L
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn 提交于
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • L
      mempolicy: mark shared policies for unref · aab0b102
      Lee Schermerhorn 提交于
      As part of yet another rework of mempolicy reference counting, we want to be
      able to identify shared policies efficiently, because they have an extra ref
      taken on lookup that needs to be removed when we're finished using the policy.
      
        Note:  the extra ref is required because the policies are
        shared between tasks/processes and can be changed/freed
        by one task while another task is using them--e.g., for
        page allocation.
      
      Building on David Rientjes mempolicy "mode flags" enhancement, this patch
      indicates a "shared" policy by setting a new MPOL_F_SHARED flag in the flags
      member of the struct mempolicy added by David.  MPOL_F_SHARED, and any future
      "internal mode flags" are reserved from bit zero up, as they will never be
      passed in the upper bits of the mode argument of a mempolicy API.
      
      I set the MPOL_F_SHARED flag when the policy is installed in the shared policy
      rb-tree.  Don't need/want to clear the flag when removing from the tree as the
      mempolicy is freed [unref'd] internally to the sp_delete() function.  However,
      a task could hold another reference on this mempolicy from a prior lookup.  We
      need the MPOL_F_SHARED flag to stay put so that any tasks holding a ref will
      unref, eventually freeing, the mempolicy.
      
      A later patch in this series will introduce a function to conditionally unref
      [mpol_free] a policy.  The MPOL_F_SHARED flag is one reason [currently the
      only reason] to unref/free a policy via the conditional free.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aab0b102
    • L
      mempolicy: rename struct mempolicy 'policy' member to 'mode' · 45c4745a
      Lee Schermerhorn 提交于
      The terms 'policy' and 'mode' are both used in various places to describe the
      semantics of the value stored in the 'policy' member of struct mempolicy.
      Furthermore, the term 'policy' is used to refer to that member, to the entire
      struct mempolicy and to the more abstract concept of the tuple consisting of a
      "mode" and an optional node or set of nodes.  Recently, we have added "mode
      flags" that are passed in the upper bits of the 'mode' [or sometimes,
      'policy'] member of the numa APIs.
      
      I'd like to resolve this confusion, which perhaps only exists in my mind, by
      renaming the 'policy' member to 'mode' throughout, and fixing up the
      Documentation.  Man pages will be updated separately.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      45c4745a
    • L
      mempolicy: fixup Fallback for Default Shmem Policy · ae4d8c16
      Lee Schermerhorn 提交于
      get_vma_policy() is not handling fallback to task policy correctly when the
      get_policy() vm_op returns NULL.  The NULL overwrites the 'pol' variable that
      was holding the fallback task mempolicy.  So, it was falling back directly to
      system default policy.
      
      Fix get_vma_policy() to use only non-NULL policy returned from the vma
      get_policy op.
      
      shm_get_policy() was falling back to current task's mempolicy if the "backing
      file system" [tmpfs vs hugetlbfs] does not support the get_policy vm_op and
      the vma policy is null.  This is incorrect for show_numa_maps() which is
      likely querying the numa_maps of some task other than current.  Remove this
      fallback.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae4d8c16
    • L
      mempolicy: write lock mmap_sem while changing task mempolicy · f4e53d91
      Lee Schermerhorn 提交于
      A read of /proc/<pid>/numa_maps holds the target task's mmap_sem for read
      while examining each vma's mempolicy.  A vma's mempolicy can fall back to the
      task's policy.  However, the task could be changing it's task policy and free
      the one that the show_numa_maps() is examining.
      
      To prevent this, grab the mmap_sem for write when updating task mempolicy.
      Pointed out to me by Christoph Lameter and extracted and reworked from
      Christoph's alternative mempol reference counting patch.
      
      This is analogous to the way that do_mbind() and do_get_mempolicy() prevent
      races between task's sharing an mm_struct [a.k.a.  threads] setting and
      querying a mempolicy for a particular address.
      
      Note: this is necessary, but not sufficient, to allow us to stop taking an
      extra reference on "other task's mempolicy" in get_vma_policy.  Subsequent
      patches will complete this update, allowing us to simplify the tests for
      whether we need to unref a mempolicy at various points in the code.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4e53d91
    • L
      mempolicy: rename mpol_copy to mpol_dup · 846a16bf
      Lee Schermerhorn 提交于
      This patch renames mpol_copy() to mpol_dup() because, well, that's what it
      does.  Like, e.g., strdup() for strings, mpol_dup() takes a pointer to an
      existing mempolicy, allocates a new one and copies the contents.
      
      In a later patch, I want to use the name mpol_copy() to copy the contents from
      one mempolicy to another like, e.g., strcpy() does for strings.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      846a16bf
    • L
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn 提交于
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • A
      Subject: [PATCH] hugetlb: vmstat events for huge page allocations · 3b116300
      Adam Litke 提交于
      Allocating huge pages directly from the buddy allocator is not guaranteed to
      succeed.  Success depends on several factors (such as the amount of physical
      memory available and the level of fragmentation).  With the addition of
      dynamic hugetlb pool resizing, allocations can occur much more frequently.
      For these reasons it is desirable to keep track of huge page allocation
      successes and failures.
      
      Add two new vmstat entries to track huge page allocations that succeed and
      fail.  The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
      being enabled.
      
      [akpm@linux-foundation.org: reduced ifdeffery]
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NEric Munson <ebmunson@us.ibm.com>
      Tested-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b116300
    • N
      xip: support non-struct page backed memory · 70688e4d
      Nick Piggin 提交于
      Convert XIP to support non-struct page backed memory, using VM_MIXEDMAP for
      the user mappings.
      
      This requires the get_xip_page API to be changed to an address based one.
      Improve the API layering a little bit too, while we're here.
      
      This is required in order to support XIP filesystems on memory that isn't
      backed with struct page (but memory with struct page is still supported too).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NCarsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70688e4d
    • N
      mm: add vm_insert_mixed · 423bad60
      Nick Piggin 提交于
      vm_insert_mixed will insert either a raw pfn or a refcounted struct page into
      the page tables, depending on whether vm_normal_page() will return the page or
      not.  With the introduction of the new pte bit, this is now a too tricky for
      drivers to be doing themselves.
      
      filemap_xip uses this in a subsequent patch.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      423bad60
    • N
      mm: introduce pte_special pte bit · 7e675137
      Nick Piggin 提交于
      s390 for one, cannot implement VM_MIXEDMAP with pfn_valid, due to their memory
      model (which is more dynamic than most).  Instead, they had proposed to
      implement it with an additional path through vm_normal_page(), using a bit in
      the pte to determine whether or not the page should be refcounted:
      
      vm_normal_page()
      {
      	...
              if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
                      if (vma->vm_flags & VM_MIXEDMAP) {
      #ifdef s390
      			if (!mixedmap_refcount_pte(pte))
      				return NULL;
      #else
                              if (!pfn_valid(pfn))
                                      return NULL;
      #endif
                              goto out;
                      }
      	...
      }
      
      This is fine, however if we are allowed to use a bit in the pte to determine
      refcountedness, we can use that to _completely_ replace all the vma based
      schemes.  So instead of adding more cases to the already complex vma-based
      scheme, we can have a clearly seperate and simple pte-based scheme (and get
      slightly better code generation in the process):
      
      vm_normal_page()
      {
      #ifdef s390
      	if (!mixedmap_refcount_pte(pte))
      		return NULL;
      	return pte_page(pte);
      #else
      	...
      #endif
      }
      
      And finally, we may rather make this concept usable by any architecture rather
      than making it s390 only, so implement a new type of pte state for this.
      Unfortunately the old vma based code must stay, because some architectures may
      not be able to spare pte bits.  This makes vm_normal_page a little bit more
      ugly than we would like, but the 2 cases are clearly seperate.
      
      So introduce a pte_special pte state, and use it in mm/memory.c.  It is
      currently a noop for all architectures, so this doesn't actually result in any
      compiled code changes to mm/memory.o.
      
      BTW:
      I haven't put vm_normal_page() into arch code as-per an earlier suggestion.
      The reason is that, regardless of where vm_normal_page is actually
      implemented, the *abstraction* is still exactly the same. Also, while it
      depends on whether the architecture has pte_special or not, that is the
      only two possible cases, and it really isn't an arch specific function --
      the role of the arch code should be to provide primitive functions and
      accessors with which to build the core code; pte_special does that. We do
      not want architectures to know or care about vm_normal_page itself, and
      we definitely don't want them being able to invent something new there
      out of sight of mm/ code. If we made vm_normal_page an arch function, then
      we have to make vm_insert_mixed (next patch) an arch function too. So I
      don't think moving it to arch code fundamentally improves any abstractions,
      while it does practically make the code more difficult to follow, for both
      mm and arch developers, and easier to misuse.
      
      [akpm@linux-foundation.org: build fix]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NCarsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e675137
    • J
      mm: introduce VM_MIXEDMAP · b379d790
      Jared Hulbert 提交于
      This series introduces some important infrastructure work.  The overall result
      is that:
      
      1. We now support XIP backed filesystems using memory that have no
         struct page allocated to them. And patches 6 and 7 actually implement
         this for s390.
      
         This is pretty important in a number of cases. As far as I understand,
         in the case of virtualisation (eg. s390), each guest may mount a
         readonly copy of the same filesystem (eg. the distro). Currently,
         guests need to allocate struct pages for this image. So if you have
         100 guests, you already need to allocate more memory for the struct
         pages than the size of the image. I think. (Carsten?)
      
         For other (eg. embedded) systems, you may have a very large non-
         volatile filesystem. If you have to have struct pages for this, then
         your RAM consumption will go up proportionally to fs size. Even
         though it is just a small proportion, the RAM can be much more costly
         eg in terms of power, so every KB less that Linux uses makes it more
         attractive to a lot of these guys.
      
      2. VM_MIXEDMAP allows us to support mappings where you actually do want
         to refcount _some_ pages in the mapping, but not others, and support
         COW on arbitrary (non-linear) mappings. Jared needs this for his NVRAM
         filesystem in progress. Future iterations of this filesystem will
         most likely want to migrate pages between pagecache and XIP backing,
         which is where the requirement for mixed (some refcounted, some not)
         comes from.
      
      3. pte_special also has a peripheral usage that I need for my lockless
         get_user_pages patch. That was shown to speed up "oltp" on db2 by
         10% on a 2 socket system, which is kind of significant because they
         scrounge for months to try to find 0.1% improvement on these
         workloads. I'm hoping we might finally be faster than AIX on
         pSeries with this :). My reference to lockless get_user_pages is not
         meant to justify this patchset (which doesn't include lockless gup),
         but just to show that pte_special is not some s390 specific thing that
         should be hidden in arch code or xip code: I definitely want to use it
         on at least x86 and powerpc as well.
      
      This patch:
      
      Introduce a new type of mapping, VM_MIXEDMAP.  This is unlike VM_PFNMAP in
      that it can support COW mappings of arbitrary ranges including ranges without
      struct page *and* ranges with a struct page that we actually want to refcount
      (PFNMAP can only support COW in those cases where the un-COW-ed translations
      are mapped linearly in the virtual address, and can only support non
      refcounted ranges).
      
      VM_MIXEDMAP achieves this by refcounting all pfn_valid pages, and not
      refcounting !pfn_valid pages (which is not an option for VM_PFNMAP, because it
      needs to avoid refcounting pfn_valid pages eg.  for /dev/mem mappings).
      Signed-off-by: NJared Hulbert <jaredeh@gmail.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Acked-by: NCarsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b379d790
    • C
      PAGEFLAGS_EXTENDED and separate page flags for Head and Tail · e20b8cca
      Christoph Lameter 提交于
      Having separate page flags for the head and the tail of a compound page allows
      the compiler to use bitops instead of operations on a word to check for a tail
      page.  That is f.e.  important for virt_to_head_page() which is used in
      various critical code paths (kfree for example):
      
      Code for PageTail(page)
      
      Before:
      
       mov    (%rdi),%rdx		page->flags
       mov    %rdx,%rax		3 bytes
       and    $0x12000,%eax		5 bytes
       cmp    $0x12000,%rax		6 bytes
       je     897 <kfree+0xa7>
      
      After:
      
       mov    (%rdi),%rax
       test   $0x40,%ah			(3 bytes)
       jne    887 <kfree+0x97>
      
      So we go from 14 bytes to 3 bytes and from 3 instructions to one.  From the
      use of 2 registers we go to none.
      
      We can only use page flags for this if we have page flags available.  This
      patch introduces CONFIG_PAGEFLAGS_EXTENDED that is set if pageflags are not
      scarce due to SPARSEMEM using page flags for its sectionid on 32 bit NUMA
      platforms.
      
      Additional page flag definitions can be added to the CONFIG_PAGEFLAGS_EXTENDED
      section in page-flags.h if the functionality depends on PAGEFLAGS_EXTENDED or
      if more page flag overlapping tricks are used for the !PAGEFLAGS_EXTENDED
      fallback (the upcoming virtual compound patch may hook in here and Rik's/Lee's
      additional page flags to solve the reclaim issues could also be added there
      [hint...  hint...  where are these patchsets?]).
      
      Avoiding the overlaying of Pg_reclaim also clears the way for possible use of
      compound pages for the pagecache or on the LRU.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e20b8cca
    • C
      pageflags: eliminate PG_xxx aliases · 0a128b2b
      Christoph Lameter 提交于
      Remove aliases of PG_xxx.  We can easily drop those now and alias by
      specifying the PG_xxx flag in the macro that generates the functions.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Jeremy Fitzhardinge <jeremy@goop.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a128b2b
    • C
      vmallocinfo: add caller information · 23016969
      Christoph Lameter 提交于
      Add caller information so that /proc/vmallocinfo shows where the allocation
      request for a slice of vmalloc memory originated.
      
      Results in output like this:
      
      0xffffc20000000000-0xffffc20000801000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
      0xffffc20000801000-0xffffc20000806000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
      0xffffc20000806000-0xffffc20000c07000 4198400 alloc_large_system_hash+0x127/0x246 pages=1024 vmalloc vpages
      0xffffc20000c07000-0xffffc20000c0a000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
      0xffffc20000c0a000-0xffffc20000c0c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c0c000-0xffffc20000c0f000   12288 acpi_os_map_memory+0x13/0x1c phys=cff64000 ioremap
      0xffffc20000c10000-0xffffc20000c15000   20480 acpi_os_map_memory+0x13/0x1c phys=cff65000 ioremap
      0xffffc20000c16000-0xffffc20000c18000    8192 acpi_os_map_memory+0x13/0x1c phys=cff69000 ioremap
      0xffffc20000c18000-0xffffc20000c1a000    8192 acpi_os_map_memory+0x13/0x1c phys=fed1f000 ioremap
      0xffffc20000c1a000-0xffffc20000c1c000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c1c000-0xffffc20000c1e000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c1e000-0xffffc20000c20000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c20000-0xffffc20000c22000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c22000-0xffffc20000c24000    8192 acpi_os_map_memory+0x13/0x1c phys=cff68000 ioremap
      0xffffc20000c24000-0xffffc20000c26000    8192 acpi_os_map_memory+0x13/0x1c phys=e0081000 ioremap
      0xffffc20000c26000-0xffffc20000c28000    8192 acpi_os_map_memory+0x13/0x1c phys=e0080000 ioremap
      0xffffc20000c28000-0xffffc20000c2d000   20480 alloc_large_system_hash+0x127/0x246 pages=4 vmalloc
      0xffffc20000c2d000-0xffffc20000c31000   16384 tcp_init+0xd5/0x31c pages=3 vmalloc
      0xffffc20000c31000-0xffffc20000c34000   12288 alloc_large_system_hash+0x127/0x246 pages=2 vmalloc
      0xffffc20000c34000-0xffffc20000c36000    8192 init_vdso_vars+0xde/0x1f1
      0xffffc20000c36000-0xffffc20000c38000    8192 pci_iomap+0x8a/0xb4 phys=d8e00000 ioremap
      0xffffc20000c38000-0xffffc20000c3a000    8192 usb_hcd_pci_probe+0x139/0x295 [usbcore] phys=d8e00000 ioremap
      0xffffc20000c3a000-0xffffc20000c3e000   16384 sys_swapon+0x509/0xa15 pages=3 vmalloc
      0xffffc20000c40000-0xffffc20000c61000  135168 e1000_probe+0x1c4/0xa32 phys=d8a20000 ioremap
      0xffffc20000c61000-0xffffc20000c6a000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c6a000-0xffffc20000c73000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c73000-0xffffc20000c7c000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20000c7c000-0xffffc20000c7f000   12288 e1000e_setup_tx_resources+0x29/0xbe pages=2 vmalloc
      0xffffc20000c80000-0xffffc20001481000 8392704 pci_mmcfg_arch_init+0x90/0x118 phys=e0000000 ioremap
      0xffffc20001481000-0xffffc20001682000 2101248 alloc_large_system_hash+0x127/0x246 pages=512 vmalloc
      0xffffc20001682000-0xffffc20001e83000 8392704 alloc_large_system_hash+0x127/0x246 pages=2048 vmalloc vpages
      0xffffc20001e83000-0xffffc20002204000 3674112 alloc_large_system_hash+0x127/0x246 pages=896 vmalloc vpages
      0xffffc20002204000-0xffffc2000220d000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc2000220d000-0xffffc20002216000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002216000-0xffffc2000221f000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc2000221f000-0xffffc20002228000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002228000-0xffffc20002231000   36864 _xfs_buf_map_pages+0x8e/0xc0 vmap
      0xffffc20002231000-0xffffc20002234000   12288 e1000e_setup_rx_resources+0x35/0x122 pages=2 vmalloc
      0xffffc20002240000-0xffffc20002261000  135168 e1000_probe+0x1c4/0xa32 phys=d8a60000 ioremap
      0xffffc20002261000-0xffffc2000270c000 4894720 sys_swapon+0x509/0xa15 pages=1194 vmalloc vpages
      0xffffffffa0000000-0xffffffffa0022000  139264 module_alloc+0x4f/0x55 pages=33 vmalloc
      0xffffffffa0022000-0xffffffffa0029000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc
      0xffffffffa002b000-0xffffffffa0034000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
      0xffffffffa0034000-0xffffffffa003d000   36864 module_alloc+0x4f/0x55 pages=8 vmalloc
      0xffffffffa003d000-0xffffffffa0049000   49152 module_alloc+0x4f/0x55 pages=11 vmalloc
      0xffffffffa0049000-0xffffffffa0050000   28672 module_alloc+0x4f/0x55 pages=6 vmalloc
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23016969
    • C
      vmalloc: show vmalloced areas via /proc/vmallocinfo · a10aa579
      Christoph Lameter 提交于
      Implement a new proc file that allows the display of the currently allocated
      vmalloc memory.
      
      It allows to see the users of vmalloc.  That is important if vmalloc space is
      scarce (i386 for example).
      
      And it's going to be important for the compound page fallback to vmalloc.
      Many of the current users can be switched to use compound pages with fallback.
       This means that the number of users of vmalloc is reduced and page tables no
      longer necessary to access the memory.  /proc/vmallocinfo allows to review how
      that reduction occurs.
      
      If memory becomes fragmented and larger order allocations are no longer
      possible then /proc/vmallocinfo allows to see which compound page allocations
      fell back to virtual compound pages.  That is important for new users of
      virtual compound pages.  Such as order 1 stack allocation etc that may
      fallback to virtual compound pages in the future.
      
      /proc/vmallocinfo permissions are made readable-only-by-root to avoid possible
      information leakage.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: CONFIG_MMU=n build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a10aa579
    • M
      mm: rotate_reclaimable_page() cleanup · ac6aadb2
      Miklos Szeredi 提交于
      Clean up messy conditional calling of test_clear_page_writeback() from both
      rotate_reclaimable_page() and end_page_writeback().
      
      The only user of rotate_reclaimable_page() is end_page_writeback() so this is
      OK.
      Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac6aadb2
    • S
      mm/page_alloc.c: fix indentation · f05111f5
      S.Caglar Onur 提交于
      zlc_setup(): handle jiffies wraparound
      (10ed273f) changes tab with spaces
      Signed-off-by: NS.Caglar Onur <caglar@pardus.org.tr>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f05111f5
    • A
      dmapool: enable debugging for CONFIG_SLUB_DEBUG_ON too · b5ee5bef
      Andi Kleen 提交于
      Previously it was only enabled for CONFIG_DEBUG_SLAB.
      
      Not hooked into the slub runtime debug configuration, so you currently only
      get it with CONFIG_SLUB_DEBUG_ON, not plain CONFIG_SLUB_DEBUG
      Acked-by: NMatthew Wilcox <willy@linux.intel.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b5ee5bef
    • L
      mempolicy: fix parsing of tmpfs mpol mount option · a43361cf
      Lee Schermerhorn 提交于
      Parsing of new mode flags in the tmpfs mpol mount option is slightly broken:
      
      Setting a valid flag works OK:
      	#mount -o remount,mpol=bind=static:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind=static:1-2)
      	...
      
      However, we can't remove them or change them, once we've
      set a valid flag:
      
      	#mount -o remount,mpol=bind:1-2 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=bind:1-2)
      	...
      
      It SAYS it removed it, but that's just a copy of the input
      string.  If we now try to set it to a different flag, we
      get:
      
      	#mount -o remount,mpol=bind=relative:1-2 /dev/shm
      	mount: /dev/shm not mounted already, or bad option
      
      And on the console, we see:
      	tmpfs: Bad value 'bind' for mount option 'mpol'
      	                      ^ lost remainder of string
      
      Furthermore, bogus flags are accepted with out error.
      Granted, they are a no-op:
      
      	#mount -o remount,mpol=interleave=foo:0-3 /dev/shm
      	#mount
      	...
      	tmpfs on /dev/shm type tmpfs (rw,mpol=interleave=foo:0-3)
      
      Again, that's just a copy of the input string shown by the mount command.
      
      This patch fixes the behavior by pre-zeroing the flags so that only one of the
      mutually exclusive flags can be set at one time.  It also reports an error
      when an unrecognized flag is specified.
      
      The check for both flags being set is removed because it can't happen with
      this implementation.  If we ever want to support multiple non-exclusive flags,
      this area will need rework and we will need to check that any mutually
      exclusive flags aren't specified.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43361cf
    • D
      mempolicy: disallow static or relative flags for local preferred mode · 3e1f0645
      David Rientjes 提交于
      MPOL_F_STATIC_NODES and MPOL_F_RELATIVE_NODES don't mean anything for
      MPOL_PREFERRED policies that were created with an empty nodemask (for purely
      local allocations).  They'll never be invalidated because the allowed mems of
      a task changes or need to be rebound relative to a cpuset's placement.
      
      Also fixes a bug identified by Lee Schermerhorn that disallowed empty
      nodemasks to be passed to MPOL_PREFERRED to specify local allocations.  [A
      different, somewhat incomplete, patch already existed in 25-rc5-mm1.]
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e1f0645
    • D
      mempolicy: create mempolicy_operations structure · 37012946
      David Rientjes 提交于
      Create a mempolicy_operations structure that currently points to two
      functions[*] for the various modes:
      
      	int (*create)(struct mempolicy *, const nodemask_t *);
      	void (*rebind)(struct mempolicy *, const nodemask_t *);
      
      This splits the implementation for the various modes out of two large
      functions, mpol_new() and mpol_rebind_policy().  Eventually it may be
      beneficial to add additional functions to accomodate the existing switch()
      statements in mm/mempolicy.c.
      
       [*] The ->create() function for MPOL_DEFAULT is currently NULL since no
           struct mempolicy is dynamically allocated.
      
      [Lee.Schermerhorn@hp.com: fix regression in the package mempolicy regression tests]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37012946
    • D
      mempolicy: move rebind functions · 1d0d2680
      David Rientjes 提交于
      Move the mpol_rebind_{policy,task,mm}() functions after mpol_new() to avoid
      having to declare function prototypes.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d0d2680
    • D
      mempolicy: add MPOL_F_RELATIVE_NODES flag · 4c50bc01
      David Rientjes 提交于
      Adds another optional mode flag, MPOL_F_RELATIVE_NODES, that specifies
      nodemasks passed via set_mempolicy() or mbind() should be considered relative
      to the current task's mems_allowed.
      
      When the mempolicy is created, the passed nodemask is folded and mapped onto
      the current task's mems_allowed.  For example, consider a task using
      set_mempolicy() to pass MPOL_INTERLEAVE | MPOL_F_RELATIVE_NODES with a
      nodemask of 1-3.  If current's mems_allowed is 4-7, the effected nodemask is
      5-7 (the second, third, and fourth node of mems_allowed).
      
      If the same task is attached to a cpuset, the mempolicy nodemask is rebound
      each time the mems are changed.  Some possible rebinds and results are:
      
      	mems			result
      	1-3			1-3
      	1-7			2-4
      	1,5-6			1,5-6
      	1,5-7			5-7
      
      Likewise, the zonelist built for MPOL_BIND acts on the set of zones assigned
      to the resultant nodemask from the relative remap.
      
      In the MPOL_PREFERRED case, the preferred node is remapped from the currently
      effected nodemask to the relative nodemask.
      
      This mempolicy mode flag was conceived of by Paul Jackson <pj@sgi.com>.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4c50bc01
    • D
      mempolicy: add MPOL_F_STATIC_NODES flag · f5b087b5
      David Rientjes 提交于
      Add an optional mempolicy mode flag, MPOL_F_STATIC_NODES, that suppresses the
      node remap when the policy is rebound.
      
      Adds another member to struct mempolicy, nodemask_t user_nodemask, as part of
      a union with cpuset_mems_allowed:
      
      	struct mempolicy {
      		...
      		union {
      			nodemask_t cpuset_mems_allowed;
      			nodemask_t user_nodemask;
      		} w;
      	}
      
      that stores the the nodemask that the user passed when he or she created the
      mempolicy via set_mempolicy() or mbind().  When using MPOL_F_STATIC_NODES,
      which is passed with any mempolicy mode, the user's passed nodemask
      intersected with the VMA or task's allowed nodes is always used when
      determining the preferred node, setting the MPOL_BIND zonelist, or creating
      the interleave nodemask.  This happens whenever the policy is rebound,
      including when a task's cpuset assignment changes or the cpuset's mems are
      changed.
      
      This creates an interesting side-effect in that it allows the mempolicy
      "intent" to lie dormant and uneffected until it has access to the node(s) that
      it desires.  For example, if you currently ask for an interleaved policy over
      a set of nodes that you do not have access to, the mempolicy is not created
      and the task continues to use the previous policy.  With this change, however,
      it is possible to create the same mempolicy; it is only effected when access
      to nodes in the nodemask is acquired.
      
      It is also possible to mount tmpfs with the static nodemask behavior when
      specifying a node or nodemask.  To do this, simply add "=static" immediately
      following the mempolicy mode at mount time:
      
      	mount -o remount mpol=interleave=static:1-3
      
      Also removes mpol_check_policy() and folds its logic into mpol_new() since it
      is now obsoleted.  The unused vma_mpol_equal() is also removed.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5b087b5
    • D
      mempolicy: support optional mode flags · 028fec41
      David Rientjes 提交于
      With the evolution of mempolicies, it is necessary to support mempolicy mode
      flags that specify how the policy shall behave in certain circumstances.  The
      most immediate need for mode flag support is to suppress remapping the
      nodemask of a policy at the time of rebind.
      
      Both the mempolicy mode and flags are passed by the user in the 'int policy'
      formal of either the set_mempolicy() or mbind() syscall.  A new constant,
      MPOL_MODE_FLAGS, represents the union of legal optional flags that may be
      passed as part of this int.  Mempolicies that include illegal flags as part of
      their policy are rejected as invalid.
      
      An additional member to struct mempolicy is added to support the mode flags:
      
      	struct mempolicy {
      		...
      		unsigned short policy;
      		unsigned short flags;
      	}
      
      The splitting of the 'int' actual passed by the user is done in
      sys_set_mempolicy() and sys_mbind() for their respective syscalls.  This is
      done by intersecting the actual with MPOL_MODE_FLAGS, rejecting the syscall of
      there are additional flags, and storing it in the new 'flags' member of struct
      mempolicy.  The intersection of the actual with ~MPOL_MODE_FLAGS is stored in
      the 'policy' member of the struct and all current users of pol->policy remain
      unchanged.
      
      The union of the policy mode and optional mode flags is passed back to the
      user in get_mempolicy().
      
      This combination of mode and flags within the same actual does not break
      userspace code that relies on get_mempolicy(&policy, ...) and either
      
      	switch (policy) {
      	case MPOL_BIND:
      		...
      	case MPOL_INTERLEAVE:
      		...
      	};
      
      statements or
      
      	if (policy == MPOL_INTERLEAVE) {
      		...
      	}
      
      statements.  Such applications would need to use optional mode flags when
      calling set_mempolicy() or mbind() for these previously implemented statements
      to stop working.  If an application does start using optional mode flags, it
      will need to mask the optional flags off the policy in switch and conditional
      statements that only test mode.
      
      An additional member is also added to struct shmem_sb_info to store the
      optional mode flags.
      
      [hugh@veritas.com: shmem mpol: fix build warning]
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      028fec41
    • D
      mempolicy: convert MPOL constants to enum · a3b51e01
      David Rientjes 提交于
      The mempolicy mode constants, MPOL_DEFAULT, MPOL_PREFERRED, MPOL_BIND, and
      MPOL_INTERLEAVE, are better declared as part of an enum since they are
      sequentially numbered and cannot be combined.
      
      The policy member of struct mempolicy is also converted from type short to
      type unsigned short.  A negative policy does not have any legitimate meaning,
      so it is possible to change its type in preparation for adding optional mode
      flags later.
      
      The equivalent member of struct shmem_sb_info is also changed from int to
      unsigned short.
      
      For compatibility, the policy formal to get_mempolicy() remains as a pointer
      to an int:
      
      	int get_mempolicy(int *policy, unsigned long *nmask,
      			  unsigned long maxnode, unsigned long addr,
      			  unsigned long flags);
      
      although the only possible values is the range of type unsigned short.
      
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3b51e01
    • P
      mm: move cache_line_size() to <linux/cache.h> · 1b27d05b
      Pekka Enberg 提交于
      Not all architectures define cache_line_size() so as suggested by Andrew move
      the private implementations in mm/slab.c and mm/slob.c to <linux/cache.h>.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b27d05b
    • A
      hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages · 19fc3f0a
      Adam Litke 提交于
      To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
      pages, scan the page list in two parts.  First, transfer the needed pages to
      the hugetlb pool.  Then drop the lock and free the remaining pages back to the
      buddy allocator.
      
      In the common case there are zero excess pages and no lock operations are
      required.
      
      Thanks Mel Gorman for this improvement.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fc3f0a
    • C
      mm: try both endianess when checking for endianess · 797df574
      Chris Dearman 提交于
      When checking for the swap header try byteswapping the endianess dependent
      fields to allow the swap partition to be shared between big & little endian
      systems.
      Signed-off-by: NChris Dearman <chris@mips.com>
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      797df574
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32
    • M
      mm: have zonelist contains structs with both a zone pointer and zone_idx · dd1a239f
      Mel Gorman 提交于
      Filtering zonelists requires very frequent use of zone_idx().  This is costly
      as it involves a lookup of another structure and a substraction operation.  As
      the zone_idx is often required, it should be quickly accessible.  The node idx
      could also be stored here if it was found that accessing zone->node is
      significant which may be the case on workloads where nodemasks are heavily
      used.
      
      This patch introduces a struct zoneref to store a zone pointer and a zone
      index.  The zonelist then consists of an array of these struct zonerefs which
      are looked up as necessary.  Helpers are given for accessing the zone index as
      well as the node index.
      
      [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
      [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
      [hugh@veritas.com: just return do_try_to_free_pages]
      [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd1a239f
    • M
      mm: use two zonelist that are filtered by GFP mask · 54a6eb5c
      Mel Gorman 提交于
      Currently a node has two sets of zonelists, one for each zone type in the
      system and a second set for GFP_THISNODE allocations.  Based on the zones
      allowed by a gfp mask, one of these zonelists is selected.  All of these
      zonelists consume memory and occupy cache lines.
      
      This patch replaces the multiple zonelists per-node with two zonelists.  The
      first contains all populated zones in the system, ordered by distance, for
      fallback allocations when the target/preferred node has no free pages.  The
      second contains all populated zones in the node suitable for GFP_THISNODE
      allocations.
      
      An iterator macro is introduced called for_each_zone_zonelist() that interates
      through each zone allowed by the GFP flags in the selected zonelist.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a6eb5c
    • M
      mm: remember what the preferred zone is for zone_statistics · 18ea7e71
      Mel Gorman 提交于
      On NUMA, zone_statistics() is used to record events like numa hit, miss and
      foreign.  It assumes that the first zone in a zonelist is the preferred zone.
      When multiple zonelists are replaced by one that is filtered, this is no
      longer the case.
      
      This patch records what the preferred zone is rather than assuming the first
      zone in the zonelist is it.  This simplifies the reading of later patches in
      this set.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18ea7e71