1. 08 12月, 2006 3 次提交
    • C
      [PATCH] slab: remove SLAB_KERNEL · e94b1766
      Christoph Lameter 提交于
      SLAB_KERNEL is an alias of GFP_KERNEL.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      e94b1766
    • A
      [PATCH] numa node ids are int, page_to_nid and zone_to_nid should return int · 25ba77c1
      Andy Whitcroft 提交于
      NUMA node ids are passed as either int or unsigned int almost exclusivly
      page_to_nid and zone_to_nid both return unsigned long.  This is a throw
      back to when page_to_nid was a #define and was thus exposing the real type
      of the page flags field.
      
      In addition to fixing up the definitions of page_to_nid and zone_to_nid I
      audited the users of these functions identifying the following incorrect
      uses:
      
      1) mm/page_alloc.c show_node() -- printk dumping the node id,
      2) include/asm-ia64/pgalloc.h pgtable_quicklist_free() -- comparison
         against numa_node_id() which returns an int from cpu_to_node(), and
      3) mm/mpolicy.c check_pte_range -- used as an index in node_isset which
         uses bit_set which in generic code takes an int.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      25ba77c1
    • P
      [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Paul Jackson 提交于
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      Remembers the zones in a zonelist that were short of free memory in the
      last second.  And it stashes a zone-to-node table in the zonelist struct,
      to optimize that conversion (minimize its cache footprint.)
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last weeks patch had, which on
          systems with multiple zones per node (such as DMA zone) would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved up to 10% fewer System CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I persued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cachline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch alot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
       GBs    N  	------------	   --------------    ----------------	systime
       mem threads   sys user  real	  sys  user  real     sys  user  real	 better
        12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
        12	19	99   22     8	   99	 22	8	0     0     0	   0%
        12	37     111   25     6	  112	 25	6	1     0     0	  -0%
        12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
        38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
        38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
        38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
        38	55     501   77    23	  511	 80    24      10     3     1	  -1%
        64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
        64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
        64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
        64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
        90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
        90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
        90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
        90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory along ways away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- lets hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallizable workload.
      
       7) The intermediate thread count passes, when asking for alot of memory forcing
          them to go to a few neighboring nodes, improved the most with this zonelist
          caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         up to ten per-cent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9276b1bc
  2. 12 10月, 2006 1 次提交
  3. 01 10月, 2006 1 次提交
  4. 27 9月, 2006 1 次提交
    • C
      [PATCH] GFP_THISNODE for the slab allocator · 765c4507
      Christoph Lameter 提交于
      This patch insures that the slab node lists in the NUMA case only contain
      slabs that belong to that specific node.  All slab allocations use
      GFP_THISNODE when calling into the page allocator.  If an allocation fails
      then we fall back in the slab allocator according to the zonelists appropriate
      for a certain context.
      
      This allows a replication of the behavior of alloc_pages and alloc_pages node
      in the slab layer.
      
      Currently allocations requested from the page allocator may be redirected via
      cpusets to other nodes.  This results in remote pages on nodelists and that in
      turn results in interrupt latency issues during cache draining.  Plus the slab
      is handing out memory as local when it is really remote.
      
      Fallback for slab memory allocations will occur within the slab allocator and
      not in the page allocator.  This is necessary in order to be able to use the
      existing pools of objects on the nodes that we fall back to before adding more
      pages to a slab.
      
      The fallback function insures that the nodes we fall back to obey cpuset
      restrictions of the current context.  We do not allocate objects from outside
      of the current cpuset context like before.
      
      Note that the implementation of locality constraints within the slab allocator
      requires importing logic from the page allocator.  This is a mischmash that is
      not that great.  Other allocators (uncached allocator, vmalloc, huge pages)
      face similar problems and have similar minimal reimplementations of the basic
      fallback logic of the page allocator.  There is another way of implementing a
      slab by avoiding per node lists (see modular slab) but this wont work within
      the existing slab.
      
      V1->V2:
      - Use NUMA_BUILD to avoid #ifdef CONFIG_NUMA
      - Exploit GFP_THISNODE being 0 in the NON_NUMA case to avoid another
        #ifdef
      
      [akpm@osdl.org: build fix]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      765c4507
  5. 26 9月, 2006 5 次提交
    • C
      [PATCH] NUMA: Add zone_to_nid function · 89fa3024
      Christoph Lameter 提交于
      There are many places where we need to determine the node of a zone.
      Currently we use a difficult to read sequence of pointer dereferencing.
      Put that into an inline function and use throughout VM.  Maybe we can find
      a way to optimize the lookup in the future.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      89fa3024
    • C
      [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore... · 9b819d20
      Christoph Lameter 提交于
      [PATCH] Add __GFP_THISNODE to avoid fallback to other nodes and ignore cpuset/memory policy restrictions
      
      Add a new gfp flag __GFP_THISNODE to avoid fallback to other nodes.  This
      flag is essential if a kernel component requires memory to be located on a
      certain node.  It will be needed for alloc_pages_node() to force allocation
      on the indicated node and for alloc_pages() to force allocation on the
      current node.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9b819d20
    • C
      [PATCH] linearly index zone->node_zonelists[] · 19655d34
      Christoph Lameter 提交于
      I wonder why we need this bitmask indexing into zone->node_zonelists[]?
      
      We always start with the highest zone and then include all lower zones
      if we build zonelists.
      
      Are there really cases where we need allocation from ZONE_DMA or
      ZONE_HIGHMEM but not ZONE_NORMAL? It seems that the current implementation
      of highest_zone() makes that already impossible.
      
      If we go linear on the index then gfp_zone() == highest_zone() and a lot
      of definitions fall by the wayside.
      
      We can now revert back to the use of gfp_zone() in mempolicy.c ;-)
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      19655d34
    • C
      [PATCH] Apply type enum zone_type · 2f6726e5
      Christoph Lameter 提交于
      After we have done this we can now do some typing cleanup.
      
      The memory policy layer keeps a policy_zone that specifies
      the zone that gets memory policies applied. This variable
      can now be of type enum zone_type.
      
      The check_highest_zone function and the build_zonelists funnctionm must
      then also take a enum zone_type parameter.
      
      Plus there are a number of loops over zones that also should use
      zone_type.
      
      We run into some troubles at some points with functions that need a
      zone_type variable to become -1. Fix that up.
      
      [pj@sgi.com: fix set_mempolicy() crash]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      2f6726e5
    • C
      [PATCH] mempolicies: fix policy_zone check · 4e4785bc
      Christoph Lameter 提交于
      There is a check in zonelist_policy that compares pieces of the bitmap
      obtained from a gfp mask via GFP_ZONETYPES with a zone number in function
      zonelist_policy().
      
      The bitmap is an ORed mask of __GFP_DMA, __GFP_DMA32 and __GFP_HIGHMEM.
      The policy_zone is a zone number with the possible values of ZONE_DMA,
      ZONE_DMA32, ZONE_HIGHMEM and ZONE_NORMAL. These are two different domains
      of values.
      
      For some reason seemed to work before the zone reduction patchset (It
      definitely works on SGI boxes since we just have one zone and the check
      cannot fail).
      
      With the zone reduction patchset this check definitely fails on systems
      with two zones if the system actually has memory in both zones.
      
      This is because ZONE_NORMAL is selected using no __GFP flag at
      all and thus gfp_zone(gfpmask) == 0. ZONE_DMA is selected when __GFP_DMA
      is set. __GFP_DMA is 0x01.  So gfp_zone(gfpmask) == 1.
      
      policy_zone is set to ZONE_NORMAL (==1) if ZONE_NORMAL and ZONE_DMA are
      populated.
      
      For ZONE_NORMAL gfp_zone(<no _GFP_DMA>) yields 0 which is <
      policy_zone(ZONE_NORMAL) and so policy is not applied to regular memory
      allocations!
      
      Instead gfp_zone(__GFP_DMA) == 1 which results in policy being applied
      to DMA allocations!
      
      What we realy want in that place is to establish the highest allowable
      zone for a given gfp_mask. If the highest zone is higher or equal to the
      policy_zone then memory policies need to be applied. We have such
      a highest_zone() function in page_alloc.c.
      
      So move the highest_zone() function from mm/page_alloc.c into
      include/linux/gfp.h.  On the way we simplify the function and use the new
      zone_type that was also introduced with the zone reduction patchset plus we
      also specify the right type for the gfp flags parameter.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      4e4785bc
  6. 02 9月, 2006 1 次提交
  7. 01 7月, 2006 1 次提交
    • C
      [PATCH] Use Zoned VM Counters for NUMA statistics · ca889e6c
      Christoph Lameter 提交于
      The numa statistics are really event counters.  But they are per node and
      so we have had special treatment for these counters through additional
      fields on the pcp structure.  We can now use the per zone nature of the
      zoned VM counters to realize these.
      
      This will shrink the size of the pcp structure on NUMA systems.  We will
      have some room to add additional per zone counters that will all still fit
      in the same cacheline.
      
       Bits	Prior pcp size	  	Size after patch	We can add
       ------------------------------------------------------------------
       64	128 bytes (16 words)	80 bytes (10 words)	48
       32	 76 bytes (19 words)	56 bytes (14 words)	8 (64 byte cacheline)
      							72 (128 byte)
      
      Remove the special statistics for numa and replace them with zoned vm
      counters.  This has the side effect that global sums of these events now
      show up in /proc/vmstat.
      
      Also take the opportunity to move the zone_statistics() function from
      page_alloc.c into vmstat.c.
      
      Discussions:
      V2 http://marc.theaimsgroup.com/?t=115048227000002&r=1&w=2Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      ca889e6c
  8. 27 6月, 2006 1 次提交
    • E
      [PATCH] proc: don't lock task_structs indefinitely · 99f89551
      Eric W. Biederman 提交于
      Every inode in /proc holds a reference to a struct task_struct.  If a
      directory or file is opened and remains open after the the task exits this
      pinning continues.  With 8K stacks on a 32bit machine the amount pinned per
      file descriptor is about 10K.
      
      Normally I would figure a reasonable per user process limit is about 100
      processes.  With 80 processes, with a 1000 file descriptors each I can trigger
      the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless
      data.
      
      This patch replaces the struct task_struct pointer with a pointer to a struct
      task_ref which has a struct task_struct pointer.  The so the pinning of dead
      tasks does not happen.
      
      The code now has to contend with the fact that the task may now exit at any
      time.  Which is a little but not muh more complicated.
      
      With this change it takes about 1000 processes each opening up 1000 file
      descriptors before I can trigger the OOM killer.  Much better.
      
      [mlp@google.com: task_mmu small fixes]
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Albert Cahalan <acahalan@gmail.com>
      Signed-off-by: NPrasanna Meda <mlp@google.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      99f89551
  9. 26 6月, 2006 1 次提交
    • C
      [PATCH] page migration: Support a vma migration function · 7b2259b3
      Christoph Lameter 提交于
      Hooks for calling vma specific migration functions
      
      With this patch a vma may define a vma->vm_ops->migrate function.  That
      function may perform page migration on its own (some vmas may not contain page
      structs and therefore cannot be handled by regular page migration.  Pages in a
      vma may require special preparatory treatment before migration is possible
      etc) .  Only mmap_sem is held when the migration function is called.  The
      migrate() function gets passed two sets of nodemasks describing the source and
      the target of the migration.  The flags parameter either contains
      
      MPOL_MF_MOVE	which means that only pages used exclusively by
      		the specified mm should be moved
      
      or
      
      MPOL_MF_MOVE_ALL which means that pages shared with other processes
      		should also be moved.
      
      The migration function returns 0 on success or an error condition.  An error
      condition will prevent regular page migration from occurring.
      
      On its own this patch cannot be included since there are no users for this
      functionality.  But it seems that the uncached allocator will need this
      functionality at some point.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      7b2259b3
  10. 23 6月, 2006 4 次提交
    • D
      [PATCH] SELinux: add security_task_movememory calls to mm code · 86c3a764
      David Quigley 提交于
      This patch inserts security_task_movememory hook calls into memory management
      code to enable security modules to mediate this operation between tasks.
      
      Since the last posting, the hook has been renamed following feedback from
      Christoph Lameter.
      Signed-off-by: NDavid Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: NStephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Acked-by: NChris Wright <chrisw@sous-sol.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      86c3a764
    • C
      [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter 提交于
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      
      long move_pages(pid, number_of_pages_to_move,
      		addresses_of_pages[],
      		nodes[] or NULL,
      		status[],
      		flags);
      
      The addresses of pages is an array of void * pointing to the
      pages to be moved.
      
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfullly.
      
      Possible page states in status[]:
      
      0..MAX_NUMNODES	The page is now on the indicated node.
      
      -ENOENT		Page is not present
      
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      
      -EBUSY		Page is busy and cannot be moved. Try again later.
      
      -EFAULT		Invalid address (no VMA or zero page).
      
      -ENOMEM		Unable to allocate memory on target node.
      
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      
      The flags parameter indicates what types of pages to move:
      
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      
      Possible return codes from move_pages()
      
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient priviledges.
      		or an attempt to move a process belonging to another user.
      
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      
      -ENODEV		One of the target nodes is not online.
      
      -ESRCH		Process does not exist.
      
      -E2BIG		Too many pages to move.
      
      -ENOMEM		Not enough memory to allocate control array.
      
      -EFAULT		Parameters could not be accessed.
      
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
      
      From: Christoph Lameter <clameter@sgi.com>
      
        Detailed results for sys_move_pages()
      
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pags() to report back exactly what happened to
        each page.
      
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      742755a1
    • C
      [PATCH] page migration: use allocator function for migrate_pages() · 95a402c3
      Christoph Lameter 提交于
      Instead of passing a list of new pages, pass a function to allocate a new
      page.  This allows the correct placement of MPOL_INTERLEAVE pages during page
      migration.  It also further simplifies the callers of migrate pages.
      migrate_pages() becomes similar to migrate_pages_to() so drop
      migrate_pages_to().  The batching of new page allocations becomes unnecessary.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      95a402c3
    • C
      [PATCH] page migration: handle freeing of pages in migrate_pages() · aaa994b3
      Christoph Lameter 提交于
      Do not leave pages on the lists passed to migrate_pages().  Seems that we will
      not need any postprocessing of pages.  This will simplify the handling of
      pages by the callers of migrate_pages().
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      aaa994b3
  11. 20 4月, 2006 1 次提交
  12. 29 3月, 2006 1 次提交
  13. 24 3月, 2006 1 次提交
    • P
      [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Paul Jackson 提交于
      The hooks in the slab cache allocator code path for support of NUMA
      mempolicies and cpuset memory spreading are in an important code path.  Many
      systems will use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current tasks task_struct flags.  For non NUMA systems, this hook and related
      code is already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special case memory placement mechanisms with a single test of the current
      tasks task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which
      once we get off the main code path (for local node allocation) seems a bit
      wasteful of instruction memory.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c61afb18
  14. 22 3月, 2006 2 次提交
  15. 17 3月, 2006 1 次提交
  16. 15 3月, 2006 1 次提交
  17. 09 3月, 2006 1 次提交
  18. 07 3月, 2006 1 次提交
    • C
      [PATCH] numa_maps update · 397874df
      Christoph Lameter 提交于
      Change the format of numa_maps to be more compact and contain additional
      information that is useful for managing and troubleshooting memory on a
      NUMA system.  Numa_maps can now also support huge pages.
      
      Fixes:
      
      1. More compact format. Only display fields if they contain additional
      	information.
      
      2. Always display information for all vmas. The old numa_maps did not display
      	vma with no mapped entries. This was a bit confusing because page
      	migration removes ptes for file backed vmas. After page migration
      	a part of the vmas vanished.
      
      3. Rename maxref to maxmap. This is the maximum mapcount of all the pages
      	in a vma and may be used as an indicator as to how many processes
      	may be using a certain vma.
      
      4. Include the ability to scan over huge page vmas.
      
      New items shown:
      
      dirty
      	Number of pages in a vma that have either the dirty bit set in the
      	page_struct or in the pte.
      
      file=<filename>
      	The file backing the pages if any
      
      stack
      	Stack area
      
      heap
      	Heap area
      
      huge
      	Huge page area. The number of pages shows is the number of huge
      	pages not the regular sized pages.
      
      swapcache
      	Number of pages with swap references. Must be >0 in order to
      	be shown.
      
      active
      	Number of active pages. Only displayed if different from the number
      	of pages mapped.
      
      writeback
      	Number of pages under writeback. Only displayed if >0.
      
      Sample ouput of a process using huge pages:
      
      00000000 default
      2000000000000000 default file=/lib/ld-2.3.90.so mapped=13 mapmax=30 N0=13
      2000000000044000 default file=/lib/ld-2.3.90.so anon=2 dirty=2 swapcache=2 N2=2
      2000000000064000 default file=/lib/librt-2.3.90.so mapped=2 active=1 N1=1 N3=1
      2000000000074000 default file=/lib/librt-2.3.90.so
      2000000000080000 default file=/lib/librt-2.3.90.so anon=1 swapcache=1 N2=1
      2000000000084000 default
      2000000000088000 default file=/lib/libc-2.3.90.so mapped=52 mapmax=32 active=48 N0=52
      20000000002bc000 default file=/lib/libc-2.3.90.so
      20000000002c8000 default file=/lib/libc-2.3.90.so anon=3 dirty=2 swapcache=3 active=2 N1=1 N2=2
      20000000002d4000 default anon=1 swapcache=1 N1=1
      20000000002d8000 default file=/lib/libpthread-2.3.90.so mapped=8 mapmax=3 active=7 N2=2 N3=6
      20000000002fc000 default file=/lib/libpthread-2.3.90.so
      2000000000308000 default file=/lib/libpthread-2.3.90.so anon=1 dirty=1 swapcache=1 N1=1
      200000000030c000 default anon=1 dirty=1 swapcache=1 N1=1
      2000000000320000 default anon=1 dirty=1 N1=1
      200000000071c000 default
      2000000000720000 default anon=2 dirty=2 swapcache=1 N1=1 N2=1
      2000000000f1c000 default
      2000000000f20000 default anon=2 dirty=2 swapcache=1 active=1 N2=1 N3=1
      200000000171c000 default
      2000000001720000 default anon=1 dirty=1 swapcache=1 N1=1
      2000000001b20000 default
      2000000001b38000 default file=/lib/libgcc_s.so.1 mapped=2 N1=2
      2000000001b48000 default file=/lib/libgcc_s.so.1
      2000000001b54000 default file=/lib/libgcc_s.so.1 anon=1 dirty=1 active=0 N1=1
      2000000001b58000 default file=/lib/libunwind.so.7.0.0 mapped=2 active=1 N1=2
      2000000001b74000 default file=/lib/libunwind.so.7.0.0
      2000000001b80000 default file=/lib/libunwind.so.7.0.0
      2000000001b84000 default
      4000000000000000 default file=/media/huge/test9 mapped=1 N1=1
      6000000000000000 default file=/media/huge/test9 anon=1 dirty=1 active=0 N1=1
      6000000000004000 default heap
      607fffff7fffc000 default anon=1 dirty=1 swapcache=1 N2=1
      607fffffff06c000 default stack anon=1 dirty=1 active=0 N1=1
      8000000060000000 default file=/mnt/huge/test0 huge dirty=3 N1=3
      8000000090000000 default file=/mnt/huge/test1 huge dirty=3 N0=1 N2=2
      80000000c0000000 default file=/mnt/huge/test2 huge dirty=3 N1=1 N3=2
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      397874df
  19. 03 3月, 2006 1 次提交
  20. 01 3月, 2006 1 次提交
  21. 25 2月, 2006 1 次提交
    • C
      [PATCH] page migration: Fix MPOL_INTERLEAVE behavior for migration via mbind() · 1e275d40
      Christoph Lameter 提交于
      migrate_pages_to() allocates a list of new pages on the intended target
      node or with the intended policy and then uses the list of new pages as
      targets for the migration of a list of pages out of place.
      
      When the pages are allocated it is not clear which of the out of place
      pages will be moved to the new pages.  So we cannot specify an address as
      needed by alloc_page_vma().  This causes problem for MPOL_INTERLEAVE which
      will currently allocate the pages on the first node of the set.  If mbind
      is used with vma that has the policy of MPOL_INTERLEAVE then the
      interleaving of pages may be destroyed.
      
      This patch fixes that by generating a fake address for each alloc_page_vma
      which will result is a distribution of pages as prescribed by
      MPOL_INTERLEAVE.
      
      Lee also noted that the sequence of nodes for the new pages seems to be
      inverted.  So we also invert the way the lists of pages for migration are
      build.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Looks-ok-to: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      1e275d40
  22. 21 2月, 2006 2 次提交
  23. 18 2月, 2006 2 次提交
  24. 05 2月, 2006 1 次提交
  25. 02 2月, 2006 1 次提交
  26. 19 1月, 2006 3 次提交
    • C
      [PATCH] mm: optimize numa policy handling in slab allocator · 86c562a9
      Christoph Lameter 提交于
      Move the interrupt check from slab_node into ___cache_alloc and adds an
      "unlikely()" to avoid pipeline stalls on some architectures.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      86c562a9
    • C
      [PATCH] NUMA policies in the slab allocator V2 · dc85da15
      Christoph Lameter 提交于
      This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
      imbalance in memory allocation during bootup.
      
      The slab allocator in 2.6.13 is not numa aware and simply calls
      alloc_pages().  This means that memory policies may control the behavior of
      alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
      resulting in the spreading out of allocations during bootup over all
      available nodes.  The slab allocator in 2.6.13 has only a single list of
      slab pages.  As a result the per cpu slab cache and the spinlock controlled
      page lists may contain slab entries from off node memory.  The slab
      allocator in 2.6.13 makes no effort to discern the locality of an entry on
      its lists.
      
      The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
      explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
      slab entries by having lists of available slab pages for each node.  The
      per cpu slab cache can only contain slab entries associated with the node
      local to the processor.  This guarantees that the default allocation mode
      of the slab allocator always assigns local memory if available.
      
      Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
      anymore.  In 2.6.14 all node unspecific slab allocations are performed on
      the boot processor.  This means that most of key data structures are
      allocated on one node.  Most processors will have to refer to these
      structures making the boot node a potential bottleneck.  This may reduce
      performance and cause unnecessary memory pressure on the boot node.
      
      This patch implements NUMA policies in the slab layer.  There is the need
      of explicit application of NUMA memory policies by the slab allcator itself
      since the NUMA slab allocator does no longer let the page_allocator control
      locality.
      
      The check for policies is made directly at the beginning of __cache_alloc
      using current->mempolicy.  The memory policy is already frequently checked
      by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
      is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
      kmalloc() will spread out each request to one node after another so that an
      equal distribution of allocations can be obtained during bootup.
      
      It is not possible to push the policy check to lower layers of the NUMA
      slab allocator since the per cpu caches are now only containing slab
      entries from the current node.  If the policy says that the local node is
      not to be preferred or forbidden then there is no point in checking the
      slab cache or local list of slab pages.  The allocation better be directed
      immediately to the lists containing slab entries for the allowed set of
      nodes.
      
      This way of applying policy also fixes another strange behavior in 2.6.13.
      alloc_pages() is controlled by the memory allocation policy of the current
      process.  It could therefore be that one process is running with
      MPOL_INTERLEAVE and would f.e.  obtain a new page following that policy
      since no slab entries are in the lists anymore.  A page can typically be
      used for multiple slab entries but lets say that the current process is
      only using one.  The other entries are then added to the slab lists.  These
      are now non local entries in the slab lists despite of the possible
      availability of local pages that would provide faster access and increase
      the performance of the application.
      
      Another process without MPOL_INTERLEAVE may now run and expect a local slab
      entry from kmalloc().  However, there are still these free slab entries
      from the off node page obtained from the other process via MPOL_INTERLEAVE
      in the cache.  The process will then get an off node slab entry although
      other slab entries may be available that are local to that process.  This
      means that the policy if one process may contaminate the locality of the
      slab caches for other processes.
      
      This patch in effect insures that a per process policy is followed for the
      allocation of slab entries and that there cannot be a memory policy
      influence from one process to another.  A process with default policy will
      always get a local slab entry if one is available.  And the process using
      memory policies will get its memory arranged as requested.  Off-node slab
      allocation will require the use of spinlocks and will make the use of per
      cpu caches not possible.  A process using memory policies to redirect
      allocations offnode will have to cope with additional lock overhead in
      addition to the latency added by the need to access a remote slab entry.
      
      Changes V1->V2
      - Remove #ifdef CONFIG_NUMA by moving forward declaration into
        prior #ifdef CONFIG_NUMA section.
      
      - Give the function determining the node number to use a saner
        name.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      dc85da15
    • C
      [PATCH] Simplify migrate_page_add · fc301289
      Christoph Lameter 提交于
      Simplify migrate_page_add after feedback from Hugh.  This also allows us to
      drop one parameter from migrate_page_add.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fc301289