1. 28 4月, 2008 3 次提交
  2. 11 3月, 2008 1 次提交
    • L
      mempolicy: fix reference counting bugs · 69682d85
      Lee Schermerhorn 提交于
      Address 3 known bugs in the current memory policy reference counting method.
      I have a series of patches to rework the reference counting to reduce overhead
      in the allocation path.  However, that series will require testing in -mm once
      I repost it.
      
      1) alloc_page_vma() does not release the extra reference taken for
         vma/shared mempolicy when the mode == MPOL_INTERLEAVE.  This can result in
         leaking mempolicy structures.  This is probably occurring, but not being
         noticed.
      
         Fix:  add the conditional release of the reference.
      
      2) hugezonelist unconditionally releases a reference on the mempolicy when
         mode == MPOL_INTERLEAVE.  This can result in decrementing the reference
         count for system default policy [should have no ill effect] or premature
         freeing of task policy.  If this occurred, the next allocation using task
         mempolicy would use the freed structure and probably BUG out.
      
         Fix:  add the necessary check to the release.
      
      3) The current reference counting method assumes that vma 'get_policy()'
         methods automatically add an extra reference a non-NULL returned mempolicy.
          This is true for shmem_get_policy() used by tmpfs mappings, including
         regular page shm segments.  However, SHM_HUGETLB shm's, backed by
         hugetlbfs, just use the vma policy without the extra reference.  This
         results in freeing of the vma policy on the first allocation, with reuse of
         the freed mempolicy structure on subsequent allocations.
      
         Fix: Rather than add another condition to the conditional reference
         release, which occur in the allocation path, just add a reference when
         returning the vma policy in shm_get_policy() to match the assumptions.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Greg KH <greg@kroah.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <eric.whitney@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69682d85
  3. 15 2月, 2008 1 次提交
  4. 12 2月, 2008 1 次提交
    • K
      mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      KOSAKI Motohiro 提交于
      Kosaki Motohito noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mem_allowed--for the same reason:
               node mask contains dis-allowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that in coming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if in coming was not empty
            and resulting mask IS empty, user specified invalid nodes.
            Return EINVAL.
         c) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by Kosaki Motohito, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31f1de46
  5. 15 11月, 2007 1 次提交
    • L
      Migration: find correct vma in new_vma_page() · 3ad33b24
      Lee Schermerhorn 提交于
      We hit the BUG_ON() in mm/rmap.c:vma_address() when trying to migrate via
      mbind(MPOL_MF_MOVE) a non-anon region that spans multiple vmas.  For
      anon-regions, we just fail to migrate any pages beyond the 1st vma in the
      range.
      
      This occurs because do_mbind() collects a list of pages to migrate by
      calling check_range().  check_range() walks the task's mm, spanning vmas as
      necessary, to collect the migratable pages into a list.  Then, do_mbind()
      calls migrate_pages() passing the list of pages, a function to allocate new
      pages based on vma policy [new_vma_page()], and a pointer to the first vma
      of the range.
      
      For each page in the list, new_vma_page() calls page_address_in_vma()
      passing the page and the vma [first in range] to obtain the address to get
      for alloc_page_vma().  The page address is needed to get interleaving
      policy correct.  If the pages in the list come from multiple vmas,
      eventually, new_page_address() will pass that page to page_address_in_vma()
      with the incorrect vma.  For !PageAnon pages, this will result in a bug
      check in rmap.c:vma_address().  For anon pages, vma_address() will just
      return EFAULT and fail the migration.
      
      This patch modifies new_vma_page() to check the return value from
      page_address_in_vma().  If the return value is EFAULT, new_vma_page()
      searchs forward via vm_next for the vma that maps the page--i.e., that does
      not return EFAULT.  This assumes that the pages in the list handed to
      migrate_pages() is in address order.  This is currently case.  The patch
      documents this assumption in a new comment block for new_vma_page().
      
      If new_vma_page() cannot locate the vma mapping the page in a forward
      search in the mm, it will pass a NULL vma to alloc_page_vma().  This will
      result in the allocation using the task policy, if any, else system default
      policy.  This situation is unlikely, but the patch documents this behavior
      with a comment.
      
      Note, this patch results in restarting from the first vma in a multi-vma
      range each time new_vma_page() is called.  If this is not acceptable, we
      can make the vma argument a pointer, both in new_vma_page() and it's caller
      unmap_and_move() so that the value held by the loop in migrate_pages()
      always passes down the last vma in which a page was found.  This will
      require changes to all new_page_t functions passed to migrate_pages().  Is
      this necessary?
      
      For this patch to work, we can't bug check in vma_address() for pages
      outside the argument vma.  This patch removes the BUG_ON().  All other
      callers [besides new_vma_page()] already check the return status.
      
      Tested on x86_64, 4 node NUMA platform.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ad33b24
  6. 20 10月, 2007 3 次提交
    • P
      Uninline find_task_by_xxx set of functions · 228ebcbe
      Pavel Emelyanov 提交于
      The find_task_by_something is a set of macros are used to find task by pid
      depending on what kind of pid is proposed - global or virtual one.  All of
      them are wrappers above the most generic one - find_task_by_pid_type_ns() -
      and just substitute some args for it.
      
      It turned out, that dereferencing the current->nsproxy->pid_ns construction
      and pushing one more argument on the stack inline cause kernel text size to
      grow.
      
      This patch moves all this stuff out-of-line into kernel/pid.c.  Together
      with the next patch it saves a bit less than 400 bytes from the .text
      section.
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      228ebcbe
    • P
      pid namespaces: changes to show virtual ids to user · b488893a
      Pavel Emelyanov 提交于
      This is the largest patch in the set. Make all (I hope) the places where
      the pid is shown to or get from user operate on the virtual pids.
      
      The idea is:
       - all in-kernel data structures must store either struct pid itself
         or the pid's global nr, obtained with pid_nr() call;
       - when seeking the task from kernel code with the stored id one
         should use find_task_by_pid() call that works with global pids;
       - when showing pid's numerical value to the user the virtual one
         should be used, but however when one shows task's pid outside this
         task's namespace the global one is to be used;
       - when getting the pid from userspace one need to consider this as
         the virtual one and use appropriate task/pid-searching functions.
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: nuther build fix]
      [akpm@linux-foundation.org: yet nuther build fix]
      [akpm@linux-foundation.org: remove unneeded casts]
      Signed-off-by: NPavel Emelyanov <xemul@openvz.org>
      Signed-off-by: NAlexey Dobriyan <adobriyan@openvz.org>
      Cc: Sukadev Bhattiprolu <sukadev@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Paul Menage <menage@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b488893a
    • P
      Task Control Groups: make cpusets a client of cgroups · 8793d854
      Paul Menage 提交于
      Remove the filesystem support logic from the cpusets system and makes cpusets
      a cgroup subsystem
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: NPaul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8793d854
  7. 17 10月, 2007 6 次提交
  8. 20 9月, 2007 1 次提交
    • L
      Fix NUMA Memory Policy Reference Counting · 480eccf9
      Lee Schermerhorn 提交于
      This patch proposes fixes to the reference counting of memory policy in the
      page allocation paths and in show_numa_map().  Extracted from my "Memory
      Policy Cleanups and Enhancements" series as stand-alone.
      
      Shared policy lookup [shmem] has always added a reference to the policy,
      but this was never unrefed after page allocation or after formatting the
      numa map data.
      
      Default system policy should not require additional ref counting, nor
      should the current task's task policy.  However, show_numa_map() calls
      get_vma_policy() to examine what may be [likely is] another task's policy.
      The latter case needs protection against freeing of the policy.
      
      This patch adds a reference count to a mempolicy returned by
      get_vma_policy() when the policy is a vma policy or another task's
      mempolicy.  Again, shared policy is already reference counted on lookup.  A
      matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
      shared and vma policies, and in show_numa_map() for shared and another
      task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
      inexpensive inline NULL test, because we know we have a non-NULL policy.
      
      Handling policy ref counts for hugepages is a bit trickier.
      huge_zonelist() returns a zone list that might come from a shared or vma
      'BIND policy.  In this case, we should hold the reference until after the
      huge page allocation in dequeue_hugepage().  The patch modifies
      huge_zonelist() to return a pointer to the mempolicy if it needs to be
      unref'd after allocation.
      
      Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
      
      		w/o patch	w/ refcount patch
      	    Avg	  Std Devn	   Avg	  Std Devn
      Real:	 100.59	    0.38	 100.63	    0.43
      User:	1209.60	    0.37	1209.91	    0.31
      System:   81.52	    0.42	  81.64	    0.34
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NAndi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      480eccf9
  9. 31 8月, 2007 1 次提交
  10. 23 8月, 2007 1 次提交
    • M
      Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE · b377fd39
      Mel Gorman 提交于
      The NUMA layer only supports NUMA policies for the highest zone.  When
      ZONE_MOVABLE is configured with kernelcore=, the the highest zone becomes
      ZONE_MOVABLE.  The result is that policies are only applied to allocations
      like anonymous pages and page cache allocated from ZONE_MOVABLE when the
      zone is used.
      
      This patch applies policies to the two highest zones when the highest zone
      is ZONE_MOVABLE.  As ZONE_MOVABLE consists of pages from the highest "real"
      zone, it's always functionally equivalent.
      
      The patch has been tested on a variety of machines both NUMA and non-NUMA
      covering x86, x86_64 and ppc64.  No abnormal results were seen in
      kernbench, tbench, dbench or hackbench.  It passes regression tests from
      the numactl package with and without kernelcore= once numactl tests are
      patched to wait for vmstat counters to update.
      
      akpm: this is the nasty hack to fix NUMA mempolicies in the presence of
      ZONE_MOVABLE and kernelcore= in 2.6.23.  Christoph says "For .24 either merge
      the mobility or get the other solution that Mel is working on.  That solution
      would only use a single zonelist per node and filter on the fly.  That may
      help performance and also help to make memory policies work better."
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b377fd39
  11. 20 7月, 2007 1 次提交
    • P
      mm: Remove slab destructors from kmem_cache_create(). · 20c2df83
      Paul Mundt 提交于
      Slab destructors were no longer supported after Christoph's
      c59def9f change. They've been
      BUGs for both slab and slub, and slob never supported them
      either.
      
      This rips out support for the dtor pointer from kmem_cache_create()
      completely and fixes up every single callsite in the kernel (there were
      about 224, not including the slab allocator definitions themselves,
      or the documentation references).
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      20c2df83
  12. 18 7月, 2007 2 次提交
    • M
      Allow huge page allocations to use GFP_HIGH_MOVABLE · 396faf03
      Mel Gorman 提交于
      Huge pages are not movable so are not allocated from ZONE_MOVABLE.  However,
      as ZONE_MOVABLE will always have pages that can be migrated or reclaimed, it
      can be used to satisfy hugepage allocations even when the system has been
      running a long time.  This allows an administrator to resize the hugepage pool
      at runtime depending on the size of ZONE_MOVABLE.
      
      This patch adds a new sysctl called hugepages_treat_as_movable.  When a
      non-zero value is written to it, future allocations for the huge page pool
      will use ZONE_MOVABLE.  Despite huge pages being non-movable, we do not
      introduce additional external fragmentation of note as huge pages are always
      the largest contiguous block we care about.
      
      [akpm@linux-foundation.org: various fixes]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      396faf03
    • M
      Add __GFP_MOVABLE for callers to flag allocations from high memory that may be migrated · 769848c0
      Mel Gorman 提交于
      It is often known at allocation time whether a page may be migrated or not.
      This patch adds a flag called __GFP_MOVABLE and a new mask called
      GFP_HIGH_MOVABLE.  Allocations using the __GFP_MOVABLE can be either migrated
      using the page migration mechanism or reclaimed by syncing with backing
      storage and discarding.
      
      An API function very similar to alloc_zeroed_user_highpage() is added for
      __GFP_MOVABLE allocations called alloc_zeroed_user_highpage_movable().  The
      flags used by alloc_zeroed_user_highpage() are not changed because it would
      change the semantics of an existing API.  After this patch is applied there
      are no in-kernel users of alloc_zeroed_user_highpage() so it probably should
      be marked deprecated if this patch is merged.
      
      Note that this patch includes a minor cleanup to the use of __GFP_ZERO in
      shmem.c to keep all flag modifications to inode->mapping in the
      shmem_dir_alloc() helper function.  This clean-up suggestion is courtesy of
      Hugh Dickens.
      
      Additional credit goes to Christoph Lameter and Linus Torvalds for shaping the
      concept.  Credit to Hugh Dickens for catching issues with shmem swap vector
      and ramfs allocations.
      
      [akpm@linux-foundation.org: build fix]
      [hugh@veritas.com: __GFP_ZERO cleanup]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      769848c0
  13. 17 7月, 2007 2 次提交
    • P
      numa: mempolicy: trivial debug fixes. · 140d5a49
      Paul Mundt 提交于
      Enabling debugging fails to build due to the nodemask variable in
      do_mbind() having changed names, and then oopses on boot due to the
      assumption that the nodemask can be dereferenced -- which doesn't work out
      so well when the policy is changed to MPOL_DEFAULT with a NULL nodemask by
      numa_default_policy().
      
      This fixes it up, and switches from PDprintk() to pr_debug() while
      we're at it.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      140d5a49
    • P
      numa: mempolicy: dynamic interleave map for system init · b71636e2
      Paul Mundt 提交于
      This converts the default system init memory policy to use a dynamically
      created node map instead of defaulting to all online nodes.  Nodes of a
      certain size (>= 16MB) are judged to be suitable for interleave, and are added
      to the map.  If all nodes are smaller in size, the largest one is
      automatically selected.
      
      Without this, tiny nodes find themselves out of memory before we even make it
      to userspace.  Systems with large nodes will notice no change.
      
      Only the system init policy is effected by this change, the regular
      MPOL_DEFAULT policy is still switched to later on in the boot process as
      normal.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b71636e2
  14. 05 3月, 2007 1 次提交
  15. 21 2月, 2007 1 次提交
  16. 12 2月, 2007 1 次提交
    • C
      [PATCH] optional ZONE_DMA: deal with cases of ZONE_DMA meaning the first zone · 6267276f
      Christoph Lameter 提交于
      This patchset follows up on the earlier work in Andrew's tree to reduce the
      number of zones.  The patches allow to go to a minimum of 2 zones.  This one
      allows also to make ZONE_DMA optional and therefore the number of zones can be
      reduced to one.
      
      ZONE_DMA is usually used for ISA DMA devices.  There are a number of reasons
      why we would not want to have ZONE_DMA
      
      1. Some arches do not need ZONE_DMA at all.
      
      2. With the advent of IOMMUs DMA zones are no longer needed.
         The necessity of DMA zones may drastically be reduced
         in the future. This patchset allows a compilation of
         a kernel without that overhead.
      
      3. Devices that require ISA DMA get rare these days. All
         my systems do not have any need for ISA DMA.
      
      4. The presence of an additional zone unecessarily complicates
         VM operations because it must be scanned and balancing
         logic must operate on its.
      
      5. With only ZONE_NORMAL one can reach the situation where
         we have only one zone. This will allow the unrolling of many
         loops in the VM and allows the optimization of varous
         code paths in the VM.
      
      6. Having only a single zone in a NUMA system results in a
         1-1 correspondence between nodes and zones. Various additional
         optimizations to critical VM paths become possible.
      
      Many systems today can operate just fine with a single zone.  If you look at
      what is in ZONE_DMA then one usually sees that nothing uses it.  The DMA slabs
      are empty (Some arches use ZONE_DMA instead of ZONE_NORMAL, then ZONE_NORMAL
      will be empty instead).
      
      On all of my systems (i386, x86_64, ia64) ZONE_DMA is completely empty.  Why
      constantly look at an empty zone in /proc/zoneinfo and empty slab in
      /proc/slabinfo?  Non i386 also frequently have no need for ZONE_DMA and zones
      stay empty.
      
      The patchset was tested on i386 (UP / SMP), x86_64 (UP, NUMA) and ia64 (NUMA).
      
      The RFC posted earlier (see
      http://marc.theaimsgroup.com/?l=linux-kernel&m=115231723513008&w=2) had lots
      of #ifdefs in them.  An effort has been made to minize the number of #ifdefs
      and make this as compact as possible.  The job was made much easier by the
      ongoing efforts of others to extract common arch specific functionality.
      
      I have been running this for awhile now on my desktop and finally Linux is
      using all my available RAM instead of leaving the 16MB in ZONE_DMA untouched:
      
      christoph@pentium940:~$ cat /proc/zoneinfo
      Node 0, zone   Normal
        pages free     4435
              min      1448
              low      1810
              high     2172
              active   241786
              inactive 210170
              scanned  0 (a: 0 i: 0)
              spanned  524224
              present  524224
          nr_anon_pages 61680
          nr_mapped    14271
          nr_file_pages 390264
          nr_slab_reclaimable 27564
          nr_slab_unreclaimable 1793
          nr_page_table_pages 449
          nr_dirty     39
          nr_writeback 0
          nr_unstable  0
          nr_bounce    0
          cpu: 0 pcp: 0
                    count: 156
                    high:  186
                    batch: 31
          cpu: 0 pcp: 1
                    count: 9
                    high:  62
                    batch: 15
        vm stats threshold: 20
          cpu: 1 pcp: 0
                    count: 177
                    high:  186
                    batch: 31
          cpu: 1 pcp: 1
                    count: 12
                    high:  62
                    batch: 15
        vm stats threshold: 20
        all_unreclaimable: 0
        prev_priority:     12
        temp_priority:     12
        start_pfn:         0
      
      This patch:
      
      In two places in the VM we use ZONE_DMA to refer to the first zone.  If
      ZONE_DMA is optional then other zones may be first.  So simply replace
      ZONE_DMA with zone 0.
      
      This also fixes ZONETABLE_PGSHIFT.  If we have only a single zone then
      ZONES_PGSHIFT may become 0 because there is no need anymore to encode the zone
      number related to a pgdat.  However, we still need a zonetable to index all
      the zones for each node if this is a NUMA system.  Therefore define
      ZONETABLE_SHIFT unconditionally as the offset of the ZONE field in page flags.
      
      [apw@shadowen.org: fix mismerge]
      Acked-by: NChristoph Hellwig <hch@infradead.org>
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: James Bottomley <James.Bottomley@steeleye.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6267276f
  17. 23 1月, 2007 1 次提交
  18. 09 12月, 2006 1 次提交
  19. 08 12月, 2006 4 次提交
    • H
      [PATCH] struct seq_operations and struct file_operations constification · 15ad7cdc
      Helge Deller 提交于
       - move some file_operations structs into the .rodata section
      
       - move static strings from policy_types[] array into the .rodata section
      
       - fix generic seq_operations usages, so that those structs may be defined
         as "const" as well
      
      [akpm@osdl.org: couple of fixes]
      Signed-off-by: NHelge Deller <deller@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      15ad7cdc
    • C
      [PATCH] slab: remove SLAB_KERNEL · e94b1766
      Christoph Lameter 提交于
      SLAB_KERNEL is an alias of GFP_KERNEL.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e94b1766
    • A
      [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Andy Whitcroft 提交于
      NUMA node ids are passed as either int or unsigned int almost exclusivly
      page_to_nid and zone_to_nid both return unsigned long.  This is a throw
      back to when page_to_nid was a #define and was thus exposing the real type
      of the page flags field.
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      25ba77c1
    • P
      [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson 提交于
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last weeks patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I persued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cachline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch alot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
       GBs    N  	------------	   --------------    ----------------	systime
       mem threads   sys user  real	  sys  user  real     sys  user  real	 better
        12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
        12	19	99   22     8	   99	 22	8	0     0     0	   0%
        12	37     111   25     6	  112	 25	6	1     0     0	  -0%
        12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
        38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
        38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
        38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
        38	55     501   77    23	  511	 80    24      10     3     1	  -1%
        64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
        64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
        64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
        64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
        90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
        90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
        90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
        90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory along ways away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- lets hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallizable workload.
      
       7) The intermediate thread count passes, when asking for alot of memory forcing
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten per-cent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9276b1bc
  20. 12 10月, 2006 1 次提交
  21. 01 10月, 2006 1 次提交
  22. 27 9月, 2006 1 次提交
    • C
      [PATCH] GFP_THISNODE for the slab allocator · 765c4507
      Christoph Lameter 提交于
      This patch insures that the slab node lists in the NUMA case only contain
      slabs that belong to that specific node.  All slab allocations use
      GFP_THISNODE when calling into the page allocator.  If an allocation fails
      then we fall back in the slab allocator according to the zonelists appropriate
      for a certain context.
      
      This allows a replication of the behavior of alloc_pages and alloc_pages node
      in the slab layer.
      
      Currently allocations requested from the page allocator may be redirected via
      cpusets to other nodes.  This results in remote pages on nodelists and that in
      turn results in interrupt latency issues during cache draining.  Plus the slab
      is handing out memory as local when it is really remote.
      
      Fallback for slab memory allocations will occur within the slab allocator and
      not in the page allocator.  This is necessary in order to be able to use the
      existing pools of objects on the nodes that we fall back to before adding more
      pages to a slab.
      
      The fallback function insures that the nodes we fall back to obey cpuset
      restrictions of the current context.  We do not allocate objects from outside
      of the current cpuset context like before.
      
      Note that the implementation of locality constraints within the slab allocator
      requires importing logic from the page allocator.  This is a mischmash that is
      not that great.  Other allocators (uncached allocator, vmalloc, huge pages)
      face similar problems and have similar minimal reimplementations of the basic
      fallback logic of the page allocator.  There is another way of implementing a
      slab by avoiding per node lists (see modular slab) but this wont work within
      the existing slab.
      
      V1->V2:
      - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
      - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
        #ifdef
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      765c4507
  23. 26 9月, 2006 4 次提交