1. 23 Jun, 2006 4 commits
    • D
      [PATCH] SELinux: add security_task_movememory calls to mm code · 86c3a764
      David Quigley committed
      This patch inserts security_task_movememory hook calls into memory management
      code to enable security modules to mediate this operation between tasks.
      
      Since the last posting, the hook has been renamed following feedback from
      Christoph Lameter.
      Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov>
      Acked-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: James Morris <jmorris@namei.org>
      Cc: Andi Kleen <ak@muc.de>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Acked-by: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] page migration: sys_move_pages(): support moving of individual pages · 742755a1
      Christoph Lameter committed
      move_pages() is used to move individual pages of a process. The function can
      be used to determine the location of pages and to move them onto the desired
      node. move_pages() returns status information for each page.
      
      long move_pages(pid, number_of_pages_to_move,
      		addresses_of_pages[],
      		nodes[] or NULL,
      		status[],
      		flags);
      
      The addresses of pages is an array of void * pointing to the
      pages to be moved.
      
      The nodes array contains the node numbers that the pages should be moved
      to. If a NULL is passed instead of an array then no pages are moved but
      the status array is updated. The status request may be used to determine
      the page state before issuing another move_pages() to move pages.
      
      The status array will contain the state of all individual page migration
      attempts when the function terminates. The status array is only valid if
      move_pages() completed successfully.
      
      Possible page states in status[]:
      
      0..MAX_NUMNODES	The page is now on the indicated node.
      
      -ENOENT		Page is not present
      
      -EACCES		Page is mapped by multiple processes and can only
      		be moved if MPOL_MF_MOVE_ALL is specified.
      
      -EPERM		The page has been mlocked by a process/driver and
      		cannot be moved.
      
      -EBUSY		Page is busy and cannot be moved. Try again later.
      
      -EFAULT		Invalid address (no VMA or zero page).
      
      -ENOMEM		Unable to allocate memory on target node.
      
      -EIO		Unable to write back page. The page must be written
      		back in order to move it since the page is dirty and the
      		filesystem does not provide a migration function that
      		would allow the moving of dirty pages.
      
      -EINVAL		A dirty page cannot be moved. The filesystem does not provide
      		a migration function and has no ability to write back pages.
      
      The flags parameter indicates what types of pages to move:
      
      MPOL_MF_MOVE	Move pages that are only mapped by the process.
      
      MPOL_MF_MOVE_ALL Also move pages that are mapped by multiple processes.
      		Requires sufficient capabilities.
      
      Possible return codes from move_pages()
      
      -ENOENT		No pages found that would require moving. All pages
      		are either already on the target node, not present, had an
      		invalid address or could not be moved because they were
      		mapped by multiple processes.
      
      -EINVAL		Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt
      		to migrate pages in a kernel thread.
      
      -EPERM		MPOL_MF_MOVE_ALL specified without sufficient privileges,
      		or an attempt to move a process belonging to another user.
      
      -EACCES		One of the target nodes is not allowed by the current cpuset.
      
      -ENODEV		One of the target nodes is not online.
      
      -ESRCH		Process does not exist.
      
      -E2BIG		Too many pages to move.
      
      -ENOMEM		Not enough memory to allocate control array.
      
      -EFAULT		Parameters could not be accessed.
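
      The per-page states listed above can be turned into readable text with a
      small lookup. The following is a minimal userspace sketch, not part of
      the patch; the helper names page_status_text() and pages_on_node() are
      hypothetical, and the strings simply restate the descriptions from this
      commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Describe one status[] entry as documented in the commit message.
 * Non-negative values are node numbers; negative values are -errno codes. */
const char *page_status_text(int status)
{
    if (status >= 0)
        return "page is on the indicated node";

    switch (-status) {
    case ENOENT: return "page is not present";
    case EACCES: return "page is mapped by multiple processes; needs MPOL_MF_MOVE_ALL";
    case EPERM:  return "page is mlocked and cannot be moved";
    case EBUSY:  return "page is busy; try again later";
    case EFAULT: return "invalid address (no VMA or zero page)";
    case ENOMEM: return "unable to allocate memory on target node";
    case EIO:    return "unable to write back dirty page";
    case EINVAL: return "dirty page cannot be moved; no migration or writeback support";
    default:     return "unknown status";
    }
}

/* Count how many entries of a status array report the given node. */
size_t pages_on_node(const int *status, size_t count, int node)
{
    size_t n = 0;

    assert(status != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        if (status[i] == node)
            n++;
    return n;
}
```

      A caller would typically fill the status array by invoking move_pages()
      with a NULL nodes array, as described above, and then pass each entry to
      page_status_text().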
      
      A test program for move_pages() may be found with the patches
      on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3
      
      From: Christoph Lameter <clameter@sgi.com>
      
        Detailed results for sys_move_pages()
      
        Pass a pointer to an integer to get_new_page() that may be used to
        indicate where the completion status of a migration operation should be
        placed.  This allows sys_move_pages() to report back exactly what happened to
        each page.
      
        Wish there would be a better way to do this. Looks a bit hacky.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: Michael Kerrisk <mtk-manpages@gmx.net>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] page migration: use allocator function for migrate_pages() · 95a402c3
      Christoph Lameter committed
      Instead of passing a list of new pages, pass a function to allocate a new
      page.  This allows the correct placement of MPOL_INTERLEAVE pages during page
      migration.  It also further simplifies the callers of migrate_pages().
      migrate_pages() becomes similar to migrate_pages_to() so drop
      migrate_pages_to().  The batching of new page allocations becomes unnecessary.
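
      The idea of passing an allocator callback instead of a pre-built list can
      be sketched in userspace as follows. This is a simulation under stated
      assumptions, not the kernel code: struct page, new_page_fn,
      interleave_target() and migrate_pages_sim() are hypothetical stand-ins,
      and "allocation" is reduced to choosing a target node.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page: only the node matters here. */
struct page {
    int node;
};

/* Callback: return the node on which the replacement for `old` should live.
 * `private_data` is an opaque cookie owned by the caller. */
typedef int (*new_page_fn)(const struct page *old, void *private_data);

/* Caller state for an interleaving allocator. */
struct interleave_state {
    const int *nodes;
    size_t nr_nodes;
    size_t next;
};

/* Hand out target nodes round-robin, one per page being migrated. */
int interleave_target(const struct page *old, void *private_data)
{
    struct interleave_state *st = private_data;
    int node;

    (void)old;
    assert(st != NULL && st->nr_nodes > 0);
    node = st->nodes[st->next];
    st->next = (st->next + 1) % st->nr_nodes;
    return node;
}

/* Ask the callback for a target at the moment each page is processed,
 * so the target is decided per page. Returns the number of pages moved. */
size_t migrate_pages_sim(struct page *pages, size_t count,
                         new_page_fn get_new_page, void *private_data)
{
    size_t moved = 0;

    assert(get_new_page != NULL);
    for (size_t i = 0; i < count; i++) {
        int target = get_new_page(&pages[i], private_data);

        if (pages[i].node != target) {
            pages[i].node = target;
            moved++;
        }
    }
    return moved;
}
```

      Because the target is chosen when each page is handled, the caller can
      implement any placement rule in the callback without preparing a
      matching list of new pages in advance.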
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] page migration: handle freeing of pages in migrate_pages() · aaa994b3
      Christoph Lameter committed
      Do not leave pages on the lists passed to migrate_pages().  Seems that we will
      not need any postprocessing of pages.  This will simplify the handling of
      pages by the callers of migrate_pages().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Jes Sorensen <jes@trained-monkey.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  2. 20 Apr, 2006 1 commit
  3. 29 Mar, 2006 1 commit
  4. 24 Mar, 2006 1 commit
    • P
      [PATCH] cpuset memory spread slab cache optimizations · c61afb18
      Paul Jackson committed
      The hooks in the slab cache allocator code path for support of NUMA
      mempolicies and cpuset memory spreading are in an important code path.  Many
      systems will use neither feature.
      
      This patch optimizes those hooks down to a single check of some bits in the
      current task's task_struct flags.  For non-NUMA systems, this hook and related
      code is already ifdef'd out.
      
      The optimization is done by using another task flag, set if the task is using
      a non-default NUMA mempolicy.  Taking this flag bit along with the
      PF_SPREAD_PAGE and PF_SPREAD_SLAB flag bits added earlier in this 'cpuset
      memory spreading' patch set, one can check for the combination of any of these
      special case memory placement mechanisms with a single test of the current
      task's task_struct flags.
      
      This patch also tightens up the code, to save a few bytes of kernel text
      space, and moves some of it out of line.  Due to the nested inlines called
      from multiple places, we were ending up with three copies of this code, which
      once we get off the main code path (for local node allocation) seems a bit
      wasteful of instruction memory.
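
      The single-test idea can be illustrated with a small sketch. The bit
      values below are made up for the example, and PF_MEMPOLICY is an assumed
      name for the "another task flag" mentioned above; struct task is a
      hypothetical stand-in for task_struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit values; the real kernel values are not given in the text. */
#define PF_SPREAD_PAGE 0x1u
#define PF_SPREAD_SLAB 0x2u
#define PF_MEMPOLICY   0x4u   /* assumed name: task uses a non-default mempolicy */

/* One mask covering every special-case memory placement mechanism. */
#define PF_PLACEMENT_MASK (PF_SPREAD_PAGE | PF_SPREAD_SLAB | PF_MEMPOLICY)

/* Stand-in for the relevant part of task_struct. */
struct task {
    unsigned int flags;
};

/* A single AND decides whether the allocation must leave the fast path. */
bool needs_placement_slow_path(const struct task *t)
{
    assert(t != NULL);
    return (t->flags & PF_PLACEMENT_MASK) != 0;
}
```

      A task that uses none of the mechanisms has all three bits clear, so the
      common case costs one load and one test.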
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  5. 22 Mar, 2006 2 commits
  6. 17 Mar, 2006 1 commit
  7. 15 Mar, 2006 1 commit
  8. 09 Mar, 2006 1 commit
  9. 07 Mar, 2006 1 commit
    • C
      [PATCH] numa_maps update · 397874df
      Christoph Lameter committed
      Change the format of numa_maps to be more compact and contain additional
      information that is useful for managing and troubleshooting memory on a
      NUMA system.  Numa_maps can now also support huge pages.
      
      Fixes:
      
      1. More compact format. Only display fields if they contain additional
      	information.
      
      2. Always display information for all vmas. The old numa_maps did not display
      	vma with no mapped entries. This was a bit confusing because page
      	migration removes ptes for file backed vmas. After page migration
      	a part of the vmas vanished.
      
      3. Rename maxref to mapmax. This is the maximum mapcount of all the pages
      	in a vma and may be used as an indicator as to how many processes
      	may be using a certain vma.
      
      4. Include the ability to scan over huge page vmas.
      
      New items shown:
      
      dirty
      	Number of pages in a vma that have either the dirty bit set in the
      	page_struct or in the pte.
      
      file=<filename>
      	The file backing the pages if any
      
      stack
      	Stack area
      
      heap
      	Heap area
      
      huge
      	Huge page area. The number of pages shown is the number of huge
      	pages, not the number of regular-sized pages.
      
      swapcache
      	Number of pages with swap references. Must be >0 in order to
      	be shown.
      
      active
      	Number of active pages. Only displayed if different from the number
      	of pages mapped.
      
      writeback
      	Number of pages under writeback. Only displayed if >0.
      
      Sample output of a process using huge pages:
      
      00000000 default
      2000000000000000 default file=/lib/ld-2.3.90.so mapped=13 mapmax=30 N0=13
      2000000000044000 default file=/lib/ld-2.3.90.so anon=2 dirty=2 swapcache=2 N2=2
      2000000000064000 default file=/lib/librt-2.3.90.so mapped=2 active=1 N1=1 N3=1
      2000000000074000 default file=/lib/librt-2.3.90.so
      2000000000080000 default file=/lib/librt-2.3.90.so anon=1 swapcache=1 N2=1
      2000000000084000 default
      2000000000088000 default file=/lib/libc-2.3.90.so mapped=52 mapmax=32 active=48 N0=52
      20000000002bc000 default file=/lib/libc-2.3.90.so
      20000000002c8000 default file=/lib/libc-2.3.90.so anon=3 dirty=2 swapcache=3 active=2 N1=1 N2=2
      20000000002d4000 default anon=1 swapcache=1 N1=1
      20000000002d8000 default file=/lib/libpthread-2.3.90.so mapped=8 mapmax=3 active=7 N2=2 N3=6
      20000000002fc000 default file=/lib/libpthread-2.3.90.so
      2000000000308000 default file=/lib/libpthread-2.3.90.so anon=1 dirty=1 swapcache=1 N1=1
      200000000030c000 default anon=1 dirty=1 swapcache=1 N1=1
      2000000000320000 default anon=1 dirty=1 N1=1
      200000000071c000 default
      2000000000720000 default anon=2 dirty=2 swapcache=1 N1=1 N2=1
      2000000000f1c000 default
      2000000000f20000 default anon=2 dirty=2 swapcache=1 active=1 N2=1 N3=1
      200000000171c000 default
      2000000001720000 default anon=1 dirty=1 swapcache=1 N1=1
      2000000001b20000 default
      2000000001b38000 default file=/lib/libgcc_s.so.1 mapped=2 N1=2
      2000000001b48000 default file=/lib/libgcc_s.so.1
      2000000001b54000 default file=/lib/libgcc_s.so.1 anon=1 dirty=1 active=0 N1=1
      2000000001b58000 default file=/lib/libunwind.so.7.0.0 mapped=2 active=1 N1=2
      2000000001b74000 default file=/lib/libunwind.so.7.0.0
      2000000001b80000 default file=/lib/libunwind.so.7.0.0
      2000000001b84000 default
      4000000000000000 default file=/media/huge/test9 mapped=1 N1=1
      6000000000000000 default file=/media/huge/test9 anon=1 dirty=1 active=0 N1=1
      6000000000004000 default heap
      607fffff7fffc000 default anon=1 dirty=1 swapcache=1 N2=1
      607fffffff06c000 default stack anon=1 dirty=1 active=0 N1=1
      8000000060000000 default file=/mnt/huge/test0 huge dirty=3 N1=3
      8000000090000000 default file=/mnt/huge/test1 huge dirty=3 N0=1 N2=2
      80000000c0000000 default file=/mnt/huge/test2 huge dirty=3 N1=1 N3=2
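
      Lines in this format are easy to post-process. The sketch below is a
      hypothetical userspace helper, not part of the patch; the name
      numa_maps_node_counts() and the MAX_NODES limit are assumptions. It
      sums the N<node>=<count> tokens of one line.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_NODES 64   /* arbitrary limit chosen for this sketch */

/* Fill per_node[] with the page counts found in the N<node>=<count> tokens
 * of a single numa_maps line and return their sum. Other tokens (address,
 * policy, file=, mapped=, ...) are ignored. */
unsigned long numa_maps_node_counts(const char *line,
                                    unsigned long per_node[MAX_NODES])
{
    char buf[512];
    unsigned long total = 0;

    assert(line != NULL && per_node != NULL);
    memset(per_node, 0, MAX_NODES * sizeof(per_node[0]));

    /* strtok() modifies its input, so work on a bounded copy. */
    strncpy(buf, line, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, " \t\n"); tok != NULL;
         tok = strtok(NULL, " \t\n")) {
        unsigned int node;
        unsigned long count;
        char extra;

        /* Exactly two conversions means the token is N<node>=<count>
         * with nothing trailing. */
        if (sscanf(tok, "N%u=%lu%c", &node, &count, &extra) == 2 &&
            node < MAX_NODES) {
            per_node[node] += count;
            total += count;
        }
    }
    return total;
}
```

      For vmas marked "huge", the counts are in huge pages, so a caller that
      wants bytes has to scale them by the huge page size.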
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  10. 03 Mar, 2006 1 commit
  11. 01 Mar, 2006 1 commit
  12. 25 Feb, 2006 1 commit
    • C
      [PATCH] page migration: Fix MPOL_INTERLEAVE behavior for migration via mbind() · 1e275d40
      Christoph Lameter committed
      migrate_pages_to() allocates a list of new pages on the intended target
      node or with the intended policy and then uses the list of new pages as
      targets for the migration of a list of pages out of place.
      
      When the pages are allocated it is not clear which of the out of place
      pages will be moved to the new pages.  So we cannot specify an address as
      needed by alloc_page_vma().  This causes a problem for MPOL_INTERLEAVE, which
      will currently allocate the pages on the first node of the set.  If mbind
      is used with a vma that has the policy of MPOL_INTERLEAVE then the
      interleaving of pages may be destroyed.
      
      This patch fixes that by generating a fake address for each alloc_page_vma
      which will result in a distribution of pages as prescribed by
      MPOL_INTERLEAVE.
      
      Lee also noted that the sequence of nodes for the new pages seems to be
      inverted.  So we also invert the way the lists of pages for migration are
      built.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Looks-ok-to: Andi Kleen <ak@suse.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  13. 21 Feb, 2006 2 commits
  14. 18 Feb, 2006 2 commits
  15. 05 Feb, 2006 1 commit
  16. 02 Feb, 2006 1 commit
  17. 19 Jan, 2006 4 commits
    • C
      [PATCH] mm: optimize numa policy handling in slab allocator · 86c562a9
      Christoph Lameter committed
      Move the interrupt check from slab_node into ___cache_alloc and add an
      "unlikely()" to avoid pipeline stalls on some architectures.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] NUMA policies in the slab allocator V2 · dc85da15
      Christoph Lameter committed
      This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
      imbalance in memory allocation during bootup.
      
      The slab allocator in 2.6.13 is not numa aware and simply calls
      alloc_pages().  This means that memory policies may control the behavior of
      alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
      resulting in the spreading out of allocations during bootup over all
      available nodes.  The slab allocator in 2.6.13 has only a single list of
      slab pages.  As a result the per cpu slab cache and the spinlock controlled
      page lists may contain slab entries from off node memory.  The slab
      allocator in 2.6.13 makes no effort to discern the locality of an entry on
      its lists.
      
      The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
      explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
      slab entries by having lists of available slab pages for each node.  The
      per cpu slab cache can only contain slab entries associated with the node
      local to the processor.  This guarantees that the default allocation mode
      of the slab allocator always assigns local memory if available.
      
      Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
      anymore.  In 2.6.14 all node unspecific slab allocations are performed on
      the boot processor.  This means that most of key data structures are
      allocated on one node.  Most processors will have to refer to these
      structures making the boot node a potential bottleneck.  This may reduce
      performance and cause unnecessary memory pressure on the boot node.
      
      This patch implements NUMA policies in the slab layer.  The slab allocator
      itself must apply NUMA memory policies explicitly, since the NUMA slab
      allocator no longer lets the page allocator control locality.
      
      The check for policies is made directly at the beginning of __cache_alloc
      using current->mempolicy.  The memory policy is already frequently checked
      by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
      is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
      kmalloc() will spread out each request to one node after another so that an
      equal distribution of allocations can be obtained during bootup.
      
      It is not possible to push the policy check to lower layers of the NUMA
      slab allocator since the per cpu caches are now only containing slab
      entries from the current node.  If the policy says that the local node is
      not to be preferred or forbidden then there is no point in checking the
      slab cache or local list of slab pages.  The allocation better be directed
      immediately to the lists containing slab entries for the allowed set of
      nodes.
      
      This way of applying policy also fixes another strange behavior in 2.6.13.
      alloc_pages() is controlled by the memory allocation policy of the current
      process.  It could therefore be that one process is running with
      MPOL_INTERLEAVE and would, for example, obtain a new page following that
      policy since no slab entries are in the lists anymore.  A page can typically
      be used for multiple slab entries, but let's say that the current process is
      only using one.  The other entries are then added to the slab lists.  These
      are now non-local entries in the slab lists despite the possible
      availability of local pages that would provide faster access and increase
      the performance of the application.
      
      Another process without MPOL_INTERLEAVE may now run and expect a local slab
      entry from kmalloc().  However, there are still these free slab entries
      from the off node page obtained from the other process via MPOL_INTERLEAVE
      in the cache.  The process will then get an off node slab entry although
      other slab entries may be available that are local to that process.  This
      means that the policy of one process may contaminate the locality of the
      slab caches for other processes.
      
      This patch in effect ensures that a per process policy is followed for the
      allocation of slab entries and that there cannot be a memory policy
      influence from one process to another.  A process with default policy will
      always get a local slab entry if one is available.  And the process using
      memory policies will get its memory arranged as requested.  Off-node slab
      allocation will require the use of spinlocks and will make the use of per
      cpu caches not possible.  A process using memory policies to redirect
      allocations offnode will have to cope with additional lock overhead in
      addition to the latency added by the need to access a remote slab entry.
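
      The policy check at the top of the allocation path can be pictured with
      the following sketch. It is a simplified userspace model, not the
      patch's code: struct task_policy and slab_alloc_node() are hypothetical
      names, and only the default and interleave cases discussed above are
      modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Only the two modes discussed in the text are modelled. */
enum mpol_mode { MPOL_DEFAULT, MPOL_INTERLEAVE };

/* Stand-in for the per-task memory policy. */
struct task_policy {
    enum mpol_mode mode;
    const int *nodes;     /* nodes to interleave over */
    size_t nr_nodes;
    size_t il_next;       /* index of the next interleave node */
};

/* Decide which node a slab allocation should come from. With no policy or
 * the default policy, the local node is used (per-cpu cache fast path).
 * With interleave, each request goes to the next node in turn. */
int slab_alloc_node(struct task_policy *pol, int local_node)
{
    int node;

    if (pol == NULL || pol->mode == MPOL_DEFAULT)
        return local_node;

    assert(pol->nr_nodes > 0);
    node = pol->nodes[pol->il_next];
    pol->il_next = (pol->il_next + 1) % pol->nr_nodes;
    return node;
}
```

      Whenever the returned node differs from the local node, the request
      would bypass the per-cpu cache and take the locked per-node lists, which
      is the extra overhead described above.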
      
      Changes V1->V2
      - Remove #ifdef CONFIG_NUMA by moving forward declaration into
        prior #ifdef CONFIG_NUMA section.
      
      - Give the function determining the node number to use a saner
        name.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] Simplify migrate_page_add · fc301289
      Christoph Lameter committed
      Simplify migrate_page_add after feedback from Hugh.  This also allows us to
      drop one parameter from migrate_page_add.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • N
      [PATCH] mm: migration page refcounting fix · 053837fc
      Nick Piggin committed
      Migration code currently does not take a reference to target page
      properly, so between unlocking the pte and trying to take a new
      reference to the page with isolate_lru_page, anything could happen to
      it.
      
      Fix this by holding the pte lock until we get a chance to elevate the
      refcount.
      
      Other small cleanups while we're here.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  18. 15 Jan, 2006 1 commit
    • R
      [PATCH] Add tmpfs options for memory placement policies · 7339ff83
      Robin Holt committed
      Anything that writes into a tmpfs filesystem is liable to disproportionately
      decrease the available memory on a particular node.  Since there's no telling
      what sort of application (e.g.  dd/cp/cat) might be dropping large files
      there, this lets the admin choose the appropriate default behavior for their
      site's situation.
      
      Introduce a tmpfs mount option which allows specifying a memory policy and
      a second option to specify the nodelist for that policy.  With the default
      policy, tmpfs will behave as it does today.  This patch adds support for
      preferred, bind, and interleave policies.
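
      A nodelist option of this kind has to be turned into a set of nodes. The
      sketch below shows one way to do that in userspace. The syntax
      (comma-separated node numbers and ranges, such as "0-3,5") is an
      assumption, since the commit message does not spell out the format, and
      parse_nodelist() is a hypothetical helper rather than the patch's
      parser.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Parse a node list such as "0-3,5" into a bitmask (bit n set => node n).
 * Returns 0 on success and -1 on malformed input or nodes >= 64. */
int parse_nodelist(const char *s, uint64_t *mask)
{
    uint64_t result = 0;

    assert(s != NULL && mask != NULL);
    if (*s == '\0')
        return -1;

    while (*s != '\0') {
        char *end;
        unsigned long lo, hi;

        if (*s < '0' || *s > '9')
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;

        /* Optional "-<hi>" turns a single node into a range. */
        if (*end == '-') {
            s = end + 1;
            if (*s < '0' || *s > '9')
                return -1;
            hi = strtoul(s, &end, 10);
        }
        if (hi < lo || hi >= 64)
            return -1;

        for (unsigned long n = lo; n <= hi; n++)
            result |= UINT64_C(1) << n;

        /* A comma must be followed by another entry. */
        if (*end == ',') {
            end++;
            if (*end == '\0')
                return -1;
        } else if (*end != '\0') {
            return -1;
        }
        s = end;
    }

    *mask = result;
    return 0;
}
```

      The 64-node limit comes only from using a single uint64_t in this
      sketch.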
      
      The default policy will cause pages to be added to tmpfs files on the node
      which is doing the writing.  Some jobs expect a single process to create
      and manage the tmpfs files.  This results in a node which has a
      significantly reduced number of free pages.
      
      With this patch, the administrator can specify the policy and nodes for
      that policy where they would prefer allocations.
      
      This patch was originally written by Brent Casavant and Hugh Dickins.  I
      added support for the bind and preferred policies and the mpol_nodelist
      mount option.
      Signed-off-by: Brent Casavant <bcasavan@sgi.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  19. 13 Jan, 2006 1 commit
  20. 09 Jan, 2006 12 commits
    • P
      [PATCH] cpuset: rebind vma mempolicies fix · 4225399a
      Paul Jackson committed
      Fix more of a longstanding bug in cpuset/mempolicy interaction.
      
      NUMA mempolicies (mm/mempolicy.c) are constrained by the current task's cpuset
      to just the Memory Nodes allowed by that cpuset.  The kernel maintains
      internal state for each mempolicy, tracking what nodes are used for the
      MPOL_INTERLEAVE, MPOL_BIND or MPOL_PREFERRED policies.

      When a task's cpuset memory placement changes, whether because the cpuset
      changed, or because the task was attached to a different cpuset, then the
      task's mempolicies have to be rebound to the new cpuset placement, so as to
      preserve the cpuset-relative numbering of the nodes in that policy.
      
      An earlier fix handled such mempolicy rebinding for mempolicies attached to a
      task.
      
      This fix rebinds mempolicies attached to vma's (address ranges in a task's
      address space).  Due to the need to hold the task->mm->mmap_sem semaphore while
      updating vma's, the rebinding of vma mempolicies has to be done when the
      cpuset memory placement is changed, at which time mmap_sem can be safely
      acquired.  The task's mempolicy is rebound later, when the task next attempts
      to allocate memory and notices that its task->cpuset_mems_generation is
      out-of-date with its cpuset's mems_generation.
      
      Because walking the tasklist to find all tasks attached to a changing cpuset
      requires holding tasklist_lock, a spinlock, one cannot update the vma's of the
      affected tasks while doing the tasklist scan.  In general, one cannot acquire
      a semaphore (which can sleep) while already holding a spinlock (such as
      tasklist_lock).  So a list of mm references has to be built up during the
      tasklist scan, then the tasklist lock dropped, then for each mm, its mmap_sem
      acquired, and the vma's in that mm rebound.
      
      Once the tasklist lock is dropped, affected tasks may fork new tasks, before
      their mm's are rebound.  A kernel global 'cpuset_being_rebound' is set to
      point to the cpuset being rebound (there can only be one; cpuset modifications
      are done under a global 'manage_sem' semaphore), and the mpol_copy code that
      is used to copy a task's mempolicies during fork catches such forking tasks,
      and ensures their children are also rebound.
      
      When a task is moved to a different cpuset, it is easier, as there is only one
      task involved.  Its mm->vma's are scanned, using the same
      mpol_rebind_policy() as used above.
      
      It may happen that both the mpol_copy hook and the update done via the
      tasklist scan update the same mm twice.  This is ok, as the mempolicies of
      each vma in an mm keep track of what mems_allowed they are relative to, and
      safely no-op a second request to rebind to the same nodes.
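
      The rebinding rule, including the harmless second rebind, can be
      modelled in userspace. The sketch below is an illustration under
      assumptions, not the kernel's implementation: nodemask_sim, struct
      vma_policy and rebind_policy() are hypothetical, node sets are limited
      to 64 nodes, and nodes outside the old placement are simply dropped.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t nodemask_sim;   /* bit n set => node n is in the set */

/* A policy remembers which mems_allowed its node numbers are relative to. */
struct vma_policy {
    nodemask_sim nodes;
    nodemask_sim mems_allowed;
};

/* Number of set bits in mask. */
static int mask_weight(nodemask_sim mask)
{
    int w = 0;

    for (int bit = 0; bit < 64; bit++)
        if (mask & (UINT64_C(1) << bit))
            w++;
    return w;
}

/* Number of set bits in mask below position bit. */
static int ordinal_of_bit(nodemask_sim mask, int bit)
{
    int ord = 0;

    for (int b = 0; b < bit; b++)
        if (mask & (UINT64_C(1) << b))
            ord++;
    return ord;
}

/* Position of the n-th (0-based) set bit in mask, or -1 if there is none. */
static int nth_set_bit(nodemask_sim mask, int n)
{
    for (int bit = 0; bit < 64; bit++) {
        if (mask & (UINT64_C(1) << bit)) {
            if (n == 0)
                return bit;
            n--;
        }
    }
    return -1;
}

/* Rebind pol to new_allowed, preserving cpuset-relative numbering: the k-th
 * node of the old placement becomes the k-th node of the new one (wrapping
 * if the new placement is smaller). A repeated request is a no-op. */
void rebind_policy(struct vma_policy *pol, nodemask_sim new_allowed)
{
    nodemask_sim old_allowed = pol->mems_allowed;
    nodemask_sim remapped = 0;
    int new_weight = mask_weight(new_allowed);

    if (old_allowed == new_allowed)
        return;
    assert(new_weight > 0);

    for (int bit = 0; bit < 64; bit++) {
        nodemask_sim b = UINT64_C(1) << bit;

        if (!(pol->nodes & b) || !(old_allowed & b))
            continue;   /* sketch: nodes outside the old placement are dropped */
        remapped |= UINT64_C(1)
                    << nth_set_bit(new_allowed,
                                   ordinal_of_bit(old_allowed, bit) % new_weight);
    }
    pol->nodes = remapped;
    pol->mems_allowed = new_allowed;
}
```

      Because the policy records the mems_allowed it is relative to, calling
      rebind_policy() twice with the same target leaves the policy unchanged
      the second time.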
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset: numa_policy_rebind cleanup · 74cb2155
      Paul Jackson committed
      Clean up, reorganize and make more robust the mempolicy.c code to rebind
      mempolicies relative to the containing cpuset after a task's memory placement
      changes.
      
      The real motivator for this cleanup patch is to lay more groundwork for the
      upcoming patch to correctly rebind NUMA mempolicies that are attached to vma's
      after the containing cpuset memory placement changes.
      
      NUMA mempolicies are constrained by the cpuset their task is a member of.
      When either (1) a task is moved to a different cpuset, or (2) the 'mems'
      mems_allowed of a cpuset is changed, then the NUMA mempolicies have embedded
      node numbers (for MPOL_BIND, MPOL_INTERLEAVE and MPOL_PREFERRED) that need to
      be recalculated, relative to their new cpuset placement.
      
      The old code used an unreliable method of determining what was the old
      mems_allowed constraining the mempolicy.  It just looked at the task's
      mems_allowed value.  This sort of worked with the present code, that just
      rebinds the -task- mempolicy, and leaves any -vma- mempolicies broken,
      referring to the old nodes.  But in an upcoming patch, the vma mempolicies
      will be rebound as well.  Then the order in which the various task and vma
      mempolicies are updated will no longer be deterministic, and one can no longer
      count on the task->mems_allowed holding the old value for as long as needed.
      It's not even clear if the current code was guaranteed to work reliably for
      task mempolicies.
      
      So I added a mems_allowed field to each mempolicy, stating exactly what
      mems_allowed the policy is relative to, and updated synchronously and reliably
      anytime that the mempolicy is rebound.
      
      Also removed a useless wrapper routine, numa_policy_rebind(), and had its
      caller, cpuset_update_task_memory_state(), call directly to the rewritten
      policy_rebind() routine, and made that rebind routine extern instead of
      static, and added a "mpol_" prefix to its name, making it
      mpol_rebind_policy().
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset: implement cpuset_mems_allowed · 909d75a3
      Paul Jackson committed
      Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
      needed, to obtain the mems_allowed vector of a cpuset, and replace the
      workaround in sys_migrate_pages() with a call to this new method.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset: combine refresh_mems and update_mems · cf2a473c
      Paul Jackson committed
      The important code paths through alloc_pages_current() and alloc_page_vma(),
      by which most kernel page allocations go, both called
      cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
      -Both- of these latter two routines did a tasklock, got the task's cpuset
      pointer, and checked for out of date cpuset->mems_generation.
      
      That was a silly duplication of code and waste of CPU cycles on an important
      code path.
      
      Consolidated those two routines into a single routine, called
      cpuset_update_task_memory_state(), since it updates more than just
      mems_allowed.
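
      The generation check that the consolidated routine performs can be
      sketched as below. This is a simplified model with hypothetical types
      (cpuset_sim, task_sim) and no locking; it only shows the comparison of
      the task's cached generation against the cpuset's and the refresh of the
      cached mems_allowed.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a cpuset: a generation counter bumped on every change. */
struct cpuset_sim {
    int mems_generation;
    unsigned long mems_allowed;
};

/* Stand-in for the per-task cached copy of its cpuset's memory state. */
struct task_sim {
    struct cpuset_sim *cpuset;
    int cpuset_mems_generation;
    unsigned long mems_allowed;
};

/* One check on the allocation path: refresh the task's cached state only if
 * the cpuset changed since the last look. Returns 1 if a refresh happened. */
int update_task_memory_state(struct task_sim *t)
{
    assert(t != NULL && t->cpuset != NULL);

    if (t->cpuset_mems_generation == t->cpuset->mems_generation)
        return 0;   /* common case: nothing changed */

    t->mems_allowed = t->cpuset->mems_allowed;
    t->cpuset_mems_generation = t->cpuset->mems_generation;
    return 1;
}
```

      In the common case the routine returns after a single comparison, which
      is why doing the check once rather than twice matters on this path.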
      
      Changed all callers of either routine to call the new consolidated routine.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • P
      [PATCH] cpuset: mempolicy one more nodemask conversion · 5966514d
      Paul Jackson committed
      Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
      conversion had left one routine using bitmaps, since it involved a
      corresponding change to kernel/cpuset.c
      
      Fix that interface by replacing with a simple macro that calls nodes_subset(),
      or if !CONFIG_CPUSET, returns (1).
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] Move page migration related functions near do_migrate_pages() · 6ce3c4c0
      Christoph Lameter committed
      Group page migration functions in mempolicy.c
      
      Add a forward declaration for migrate_page_add (like gather_stats()) and use
      our newfound mobility to group all page migration related functions around
      do_migrate_pages().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] mempolicies: unexport get_vma_policy() · 48fce342
      Christoph Lameter committed
      Since the numa_maps functionality is now in mempolicy.c we no longer need to
      export get_vma_policy().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
    • C
      [PATCH] Drop page table lock before calling migrate_page_add() · 132beacf
      Christoph Lameter committed
      migrate_page_add() cannot be called with a spinlock held (it calls
      isolate_lru_page(), which calls schedule_on_each_cpu()).  Drop the ptl lock in
      check_pte_range() before calling migrate_page_add().
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      132beacf
    • C
      [PATCH] Fold numa_maps into mempolicies.c · 1a75a6c8
      Christoph Lameter committed
      First discussed at http://marc.theaimsgroup.com/?t=113149255100001&r=1&w=2
      
      - Use the check_range() in mempolicy.c to gather statistics.
      
      - Improve the numa_maps code in general and fix some comments.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      1a75a6c8
    • C
      [PATCH] mempolicies: private pointer in check_range and MPOL_MF_INVERT · 38e35860
      Christoph Lameter committed
      This was first posted at
      http://marc.theaimsgroup.com/?l=linux-mm&m=113149240227584&w=2
      
      (Part of this functionality is also contained in the direct migration
      patchset. The functionality here is more generic and independent of that
      patchset.)
      
      - Add an internal flag, MPOL_MF_INVERT, to control check_range() behavior.
      
      - Replace the pagelist passed through by check_range by a general
        private pointer that may be used for other purposes.
        (The following patches will use that to merge numa_maps into
        mempolicy.c and to better group the page migration code in
        the policy layer)
      
      - Improve some comments.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      38e35860
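The two changes listed in the commit above (an invert flag and a general private pointer) can be sketched in user space as follows. This is a simplified illustration under stated assumptions, not the kernel code: pages are modeled as an array of node numbers, and all `toy_*` / `TOY_*` names are hypothetical.

```c
#include <stddef.h>

#define TOY_MF_INVERT 0x1u   /* hypothetical: visit pages NOT in the node set */
#define TOY_MAX_NODES 8      /* hypothetical upper bound on node numbers */

/* Callback invoked for each selected page; private is opaque to the walker. */
typedef void (*toy_page_fn)(size_t index, int node, void *private);

/* Walk pages (modeled as an array of node numbers) and hand each selected
 * page to fn together with the caller's private pointer. */
static size_t toy_check_range(const int *page_nodes, size_t n,
                              unsigned long nodes, unsigned int flags,
                              toy_page_fn fn, void *private)
{
    size_t visited = 0;
    int invert = (flags & TOY_MF_INVERT) != 0;

    for (size_t i = 0; i < n; i++) {
        int in_set = (int)((nodes >> page_nodes[i]) & 1UL);

        if (in_set == invert)   /* skip pages the caller did not ask for */
            continue;
        fn(i, page_nodes[i], private);
        visited++;
    }
    return visited;
}

/* One use of the private pointer: per-node statistics. */
struct toy_stats { size_t per_node[TOY_MAX_NODES]; };

static void toy_gather_stats(size_t index, int node, void *private)
{
    struct toy_stats *st = private;
    (void)index;
    st->per_node[node]++;
}

/* Another use: collecting pages onto a list for later migration. */
struct toy_pagelist { size_t idx[16]; size_t len; };

static void toy_page_add(size_t index, int node, void *private)
{
    struct toy_pagelist *pl = private;
    (void)node;
    if (pl->len < 16)
        pl->idx[pl->len++] = index;
}
```

The same walker serves both callers because it never interprets `private`; only the callback knows whether it points at a statistics block or a page list.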
    • C
      [PATCH] SwapMig: Extend parameters for migrate_pages() · d4984711
      Christoph Lameter committed
      Extend the parameters of migrate_pages() to allow the caller control over the
      fate of successfully migrated or impossible to migrate pages.
      
      Swap migration and direct migration will have the same interface after this
      patch so that patches can be independently applied to the policy layer and the
      core migration code.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      d4984711
    • C
      [PATCH] Swap Migration V5: sys_migrate_pages interface · 39743889
      Christoph Lameter committed
      sys_migrate_pages implementation using swap based page migration
      
      This is the original API proposed by Ray Bryant in his posts during the first
      half of 2005 on linux-mm@kvack.org and linux-kernel@vger.kernel.org.
      
      The intent of sys_migrate_pages is to migrate the memory of a process.  A
      process may have migrated to another node, while its memory was allocated
      optimally for the prior context.  sys_migrate_pages allows that memory to be
      shifted to the new node.
      
      sys_migrate_pages is also useful for manually moving a process's memory when
      the process's available memory nodes have changed through cpuset operations.
      Paul Jackson is working on a mechanism that will migrate memory automatically
      when the cpuset of a process is changed.  However, a user may decide to control
      the migration manually.
      
      This implementation is put into the policy layer since it uses concepts and
      functions that are also needed for mbind and friends.  The patch also provides
      a do_migrate_pages function that may be useful for cpusets to automatically
      move memory.  sys_migrate_pages does not modify policies in contrast to Ray's
      implementation.
      
      The current code here is based on the swap based page migration capability and
      thus is not able to preserve the physical layout relative to its containing
      nodeset (which may be a cpuset).  When direct page migration becomes available,
      the implementation needs to be changed to do an isomorphic move of pages
      between different nodesets.  The current implementation simply evicts all
      pages in the source nodeset that are not in the target nodeset.
      
      Patch supports ia64, i386 and x86_64.
      Signed-off-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      39743889
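The eviction rule stated in the commit above (evict every page in the source nodeset that is not in the target nodeset) reduces to a set difference. A minimal sketch, assuming node sets are modeled as bitmasks; `toy_nodemask_t`, `toy_nodes_to_evict()` and `toy_should_evict()` are hypothetical names, not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a node set: bit n set means node n is a member. */
typedef unsigned long toy_nodemask_t;

/* Nodes whose pages must leave: in the source set but not in the target set. */
static toy_nodemask_t toy_nodes_to_evict(toy_nodemask_t from, toy_nodemask_t to)
{
    return from & ~to;
}

/* Decide for one page, identified by the node it currently resides on. */
static bool toy_should_evict(int page_node, toy_nodemask_t from,
                             toy_nodemask_t to)
{
    return (toy_nodes_to_evict(from, to) >> page_node) & 1UL;
}
```

For a move from nodes {0, 1} to nodes {1, 2}, only pages on node 0 are evicted; pages already on node 1 stay where they are, which is why this approach cannot preserve the relative layout that an isomorphic move would.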