1. 25 5月, 2010 7 次提交
  2. 30 3月, 2010 1 次提交
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo 提交于
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the followings.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and try to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         wildly available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build test were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable easily on most builds of
      the specific arch.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      5a0e3ad6
  3. 25 3月, 2010 5 次提交
  4. 07 3月, 2010 2 次提交
    • K
      mm/mempolicy.c: fix indentation of the comments of do_migrate_pages · da0aa138
      KOSAKI Motohiro 提交于
      Currently, do_migrate_pages() have very long comment and this is not
      indent properly.  I often misunderstand it is function starting commnents
      and confused it.
      
      this patch fixes it.
      
      note: this patch doesn't break 80 column rule. I guess original
            author intended this indentaion, but an accident corrupted it.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da0aa138
    • K
      mm: fix mbind vma merge problem · 9d8cebd4
      KOSAKI Motohiro 提交于
      Strangely, current mbind() doesn't merge vma with neighbor vma although it's possible.
      Unfortunately, many vma can reduce performance...
      
      This patch fixes it.
      
          reproduced program
          ----------------------------------------------------------------
           #include <numaif.h>
           #include <numa.h>
           #include <sys/mman.h>
           #include <stdio.h>
           #include <unistd.h>
           #include <stdlib.h>
           #include <string.h>
      
          static unsigned long pagesize;
      
          int main(int argc, char** argv)
          {
          	void* addr;
          	int ch;
          	int node;
          	struct bitmask *nmask = numa_allocate_nodemask();
          	int err;
          	int node_set = 0;
          	char buf[128];
      
          	while ((ch = getopt(argc, argv, "n:")) != -1){
          		switch (ch){
          		case 'n':
          			node = strtol(optarg, NULL, 0);
          			numa_bitmask_setbit(nmask, node);
          			node_set = 1;
          			break;
          		default:
          			;
          		}
          	}
          	argc -= optind;
          	argv += optind;
      
          	if (!node_set)
          		numa_bitmask_setbit(nmask, 0);
      
          	pagesize = getpagesize();
      
          	addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
          		    MAP_ANON|MAP_PRIVATE, 0, 0);
          	if (addr == MAP_FAILED)
          		perror("mmap "), exit(1);
      
          	fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);
      
          	/* make page populate */
          	memset(addr, 0, pagesize*3);
      
          	/* first mbind */
          	err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
          		    nmask->size, MPOL_MF_MOVE_ALL);
          	if (err)
          		error("mbind1 ");
      
          	/* second mbind */
          	err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
          	if (err)
          		error("mbind2 ");
      
          	sprintf(buf, "cat /proc/%d/maps", getpid());
          	system(buf);
      
          	return 0;
          }
          ----------------------------------------------------------------
      
      result without this patch
      
      	addr = 0x7fe26ef09000
      	[snip]
      	7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
      	7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
      	7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
      	7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0
      
      	=> 0x7fe26ef09000-0x7fe26ef0c000 have three vmas.
      
      result with this patch
      
      	addr = 0x7fc9ebc76000
      	[snip]
      	7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
      	7fffbe690000-7fffbe6a5000 rw-p 00000000	00:00 0	[stack]
      
      	=> 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma.
      
      [minchan.kim@gmail.com: fix file offset passed to vma_merge()]
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NMinchan Kim <minchan.kim@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d8cebd4
  5. 04 3月, 2010 1 次提交
    • P
      rcu: Suppress __mpol_dup() false positive from RCU lockdep · 99ee4ca7
      Paul E. McKenney 提交于
      Common code is used during task creation and after the task has
      started running.  RCU protection is not needed during task
      creation because no other CPU has access to the
      under-construction task.  Provide the RCU protection anyway to
      suppress the false positive, as there does not appear to be a
      good way for the common code to recognize that the task is only
      accessible to the CPU creating it.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1267667418-32233-2-git-send-email-paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      99ee4ca7
  6. 16 12月, 2009 3 次提交
    • H
      ksm: memory hotremove migration only · 62b61f61
      Hugh Dickins 提交于
      The previous patch enables page migration of ksm pages, but that soon gets
      into trouble: not surprising, since we're using the ksm page lock to lock
      operations on its stable_node, but page migration switches the page whose
      lock is to be used for that.  Another layer of locking would fix it, but
      do we need that yet?
      
      Do we actually need page migration of ksm pages?  Yes, memory hotremove
      needs to offline sections of memory: and since we stopped allocating ksm
      pages with GFP_HIGHUSER, they will tend to be GFP_HIGHUSER_MOVABLE
      candidates for migration.
      
      But KSM is currently unconscious of NUMA issues, happily merging pages
      from different NUMA nodes: at present the rule must be, not to use
      MADV_MERGEABLE where you care about NUMA.  So no, NUMA page migration of
      ksm pages does not make sense yet.
      
      So, to complete support for ksm swapping we need to make hotremove safe.
      ksm_memory_callback() take ksm_thread_mutex when MEM_GOING_OFFLINE and
      release it when MEM_OFFLINE or MEM_CANCEL_OFFLINE.  But if mapped pages
      are freed before migration reaches them, stable_nodes may be left still
      pointing to struct pages which have been removed from the system: the
      stable_node needs to identify a page by pfn rather than page pointer, then
      it can safely prune them when MEM_OFFLINE.
      
      And make NUMA migration skip PageKsm pages where it skips PageReserved.
      But it's only when we reach unmap_and_move() that the page lock is taken
      and we can be sure that raised pagecount has prevented a PageAnon from
      being upgraded: so add offlining arg to migrate_pages(), to migrate ksm
      page when offlining (has sufficient locking) but reject it otherwise.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Izik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62b61f61
    • L
      hugetlb: derive huge pages nodes allowed from task mempolicy · 06808b08
      Lee Schermerhorn 提交于
      This patch derives a "nodes_allowed" node mask from the numa mempolicy of
      the task modifying the number of persistent huge pages to control the
      allocation, freeing and adjusting of surplus huge pages when the pool page
      count is modified via the new sysctl or sysfs attribute
      "nr_hugepages_mempolicy".  The nodes_allowed mask is derived as follows:
      
      * For "default" [NULL] task mempolicy, a NULL nodemask_t pointer
        is produced.  This will cause the hugetlb subsystem to use
        node_online_map as the "nodes_allowed".  This preserves the
        behavior before this patch.
      * For "preferred" mempolicy, including explicit local allocation,
        a nodemask with the single preferred node will be produced.
        "local" policy will NOT track any internode migrations of the
        task adjusting nr_hugepages.
      * For "bind" and "interleave" policy, the mempolicy's nodemask
        will be used.
      * Other than to inform the construction of the nodes_allowed node
        mask, the actual mempolicy mode is ignored.  That is, all modes
        behave like interleave over the resulting nodes_allowed mask
        with no "fallback".
      
      See the updated documentation [next patch] for more information
      about the implications of this patch.
      
      Examples:
      
      Starting with:
      
      	Node 0 HugePages_Total:     0
      	Node 1 HugePages_Total:     0
      	Node 2 HugePages_Total:     0
      	Node 3 HugePages_Total:     0
      
      Default behavior [with or without this patch] balances persistent
      hugepage allocation across nodes [with sufficient contiguous memory]:
      
      	sysctl vm.nr_hugepages[_mempolicy]=32
      
      yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:     8
      	Node 3 HugePages_Total:     8
      
      Of course, we only have nr_hugepages_mempolicy with the patch,
      but with default mempolicy, nr_hugepages_mempolicy behaves the
      same as nr_hugepages.
      
      Applying mempolicy--e.g., with numactl [using '-m' a.k.a.
      '--membind' because it allows multiple nodes to be specified
      and it's easy to type]--we can allocate huge pages on
      individual nodes or sets of nodes.  So, starting from the
      condition above, with 8 huge pages per node, add 8 more to
      node 2 using:
      
      	numactl -m 2 sysctl vm.nr_hugepages_mempolicy=40
      
      This yields:
      
      	Node 0 HugePages_Total:     8
      	Node 1 HugePages_Total:     8
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The incremental 8 huge pages were restricted to node 2 by the
      specified mempolicy.
      
      Similarly, we can use mempolicy to free persistent huge pages
      from specified nodes:
      
      	numactl -m 0,1 sysctl vm.nr_hugepages_mempolicy=32
      
      yields:
      
      	Node 0 HugePages_Total:     4
      	Node 1 HugePages_Total:     4
      	Node 2 HugePages_Total:    16
      	Node 3 HugePages_Total:     8
      
      The 8 huge pages freed were balanced over nodes 0 and 1.
      
      [rientjes@google.com: accomodate reworked NODEMASK_ALLOC]
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndi Kleen <andi@firstfloor.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Eric Whitney <eric.whitney@hp.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06808b08
    • K
      mm: move inc_zone_page_state(NR_ISOLATED) to just isolated place · 6d9c285a
      KOSAKI Motohiro 提交于
      Christoph pointed out inc_zone_page_state(NR_ISOLATED) should be placed
      in right after isolate_page().
      
      This patch does it.
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d9c285a
  7. 29 10月, 2009 2 次提交
  8. 08 8月, 2009 1 次提交
    • K
      mm: make set_mempolicy(MPOL_INTERLEAV) N_HIGH_MEMORY aware · 4bfc4495
      KAMEZAWA Hiroyuki 提交于
      At first, init_task's mems_allowed is initialized as this.
       init_task->mems_allowed == node_state[N_POSSIBLE]
      
      And cpuset's top_cpuset mask is initialized as this
       top_cpuset->mems_allowed = node_state[N_HIGH_MEMORY]
      
      Before 2.6.29:
      policy's mems_allowed is initialized as this.
      
        1. update tasks->mems_allowed by its cpuset->mems_allowed.
        2. policy->mems_allowed = nodes_and(tasks->mems_allowed, user's mask)
      
      Updating task's mems_allowed in reference to top_cpuset's one.
      cpuset's mems_allowed is aware of N_HIGH_MEMORY, always.
      
      In 2.6.30: After commit 58568d2a
      ("cpuset,mm: update tasks' mems_allowed in time"), policy's mems_allowed
      is initialized as this.
      
        1. policy->mems_allowd = nodes_and(task->mems_allowed, user's mask)
      
      Here, if task is in top_cpuset, task->mems_allowed is not updated from
      init's one.  Assume user excutes command as #numactrl --interleave=all
      ,....
      
        policy->mems_allowd = nodes_and(N_POSSIBLE, ALL_SET_MASK)
      
      Then, policy's mems_allowd can includes a possible node, which has no pgdat.
      
      MPOL's INTERLEAVE just scans nodemask of task->mems_allowd and access this
      directly.
      
        NODE_DATA(nid)->zonelist even if NODE_DATA(nid)==NULL
      
      Then, what's we need is making policy->mems_allowed be aware of
      N_HIGH_MEMORY.  This patch does that.  But to do so, extra nodemask will
      be on statck.  Because I know cpumask has a new interface of
      CPUMASK_ALLOC(), I added it to node.
      
      This patch stands on old behavior.  But I feel this fix itself is just a
      Band-Aid.  But to do fundametal fix, we have to take care of memory
      hotplug and it takes time.  (task->mems_allowd should be N_HIGH_MEMORY, I
      think.)
      
      mpol_set_nodemask() should be aware of N_HIGH_MEMORY and policy's nodemask
      should be includes only online nodes.
      
      In old behavior, this is guaranteed by frequent reference to cpuset's
      code.  Now, most of them are removed and mempolicy has to check it by
      itself.
      
      To do check, a few nodemask_t will be used for calculating nodemask.  But,
      size of nodemask_t can be big and it's not good to allocate them on stack.
      
      Now, cpumask_t has CPUMASK_ALLOC/FREE an easy code for get scratch area.
      NODEMASK_ALLOC/FREE shoudl be there.
      
      [akpm@linux-foundation.org: cleanups & tweaks]
      Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bfc4495
  9. 17 6月, 2009 2 次提交
    • M
      page allocator: do not check NUMA node ID when the caller knows the node is valid · 6484eb3e
      Mel Gorman 提交于
      Callers of alloc_pages_node() can optionally specify -1 as a node to mean
      "allocate from the current node".  However, a number of the callers in
      fast paths know for a fact their node is valid.  To avoid a comparison and
      branch, this patch adds alloc_pages_exact_node() that only checks the nid
      with VM_BUG_ON().  Callers that know their node is valid are then
      converted.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Paul Mundt <lethal@linux-sh.org>	[for the SLOB NUMA bits]
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6484eb3e
    • M
      cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie 提交于
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the code of
      memory policy.  Because the memory policy is applied in the process's
      context originally.  After applying this patch, one task directly
      manipulates anothers mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we didn't use lock to protect them, because adding a
      lock may lead to performance regression.  But if we don't add a lock,the
      task might see no nodes when changing cpuset's mems_allowed to some
      non-overlapping set.  In order to avoid it, we set all new allowed nodes,
      then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mpolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58568d2a
  10. 14 1月, 2009 1 次提交
  11. 14 11月, 2008 3 次提交
  12. 07 11月, 2008 1 次提交
    • C
      mm: move migrate_prep out from under mmap_sem · 0aedadf9
      Christoph Lameter 提交于
      Move the migrate_prep outside the mmap_sem for the following system calls
      
      1. sys_move_pages
      2. sys_migrate_pages
      3. sys_mbind()
      
      It really does not matter when we flush the lru.  The system is free to
      add pages onto the lru even during migration which will make the page
      migration either skip the page (mbind, migrate_pages) or return a busy
      state (move_pages).
      
      Fixes this lockdep warning (and potential deadlock):
      
      Some VM place has
            mmap_sem -> kevent_wq via lru_add_drain_all()
      
      net/core/dev.c::dev_ioctl()  has
           rtnl_lock  ->  mmap_sem        (*) the ioctl has copy_from_user() and it can do page fault.
      
      linkwatch_event has
           kevent_wq -> rtnl_lock
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0aedadf9
  13. 20 10月, 2008 2 次提交
    • L
      Unevictable LRU Infrastructure · 894bc310
      Lee Schermerhorn 提交于
      When the system contains lots of mlocked or otherwise unevictable pages,
      the pageout code (kswapd) can spend lots of time scanning over these
      pages.  Worse still, the presence of lots of unevictable pages can confuse
      kswapd into thinking that more aggressive pageout modes are required,
      resulting in all kinds of bad behaviour.
      
      Infrastructure to manage pages excluded from reclaim--i.e., hidden from
      vmscan.  Based on a patch by Larry Woodman of Red Hat.  Reworked to
      maintain "unevictable" pages on a separate per-zone LRU list, to "hide"
      them from vmscan.
      
      Kosaki Motohiro added the support for the memory controller unevictable
      lru list.
      
      Pages on the unevictable list have both PG_unevictable and PG_lru set.
      Thus, PG_unevictable is analogous to and mutually exclusive with
      PG_active--it specifies which LRU list the page is on.
      
      The unevictable infrastructure is enabled by a new mm Kconfig option
      [CONFIG_]UNEVICTABLE_LRU.
      
      A new function 'page_evictable(page, vma)' in vmscan.c tests whether or
      not a page may be evictable.  Subsequent patches will add the various
      !evictable tests.  We'll want to keep these tests light-weight for use in
      shrink_active_list() and, possibly, the fault path.
      
      To avoid races between tasks putting pages [back] onto an LRU list and
      tasks that might be moving the page from non-evictable to evictable state,
      the new function 'putback_lru_page()' -- inverse to 'isolate_lru_page()'
      -- tests the "evictability" of a page after placing it on the LRU, before
      dropping the reference.  If the page has become unevictable,
      putback_lru_page() will redo the 'putback', thus moving the page to the
      unevictable list.  This way, we avoid "stranding" evictable pages on the
      unevictable list.
      
      [akpm@linux-foundation.org: fix fallout from out-of-order merge]
      [riel@redhat.com: fix UNEVICTABLE_LRU and !PROC_PAGE_MONITOR build]
      [nishimura@mxp.nes.nec.co.jp: remove redundant mapping check]
      [kosaki.motohiro@jp.fujitsu.com: unevictable-lru-infrastructure: putback_lru_page()/unevictable page handling rework]
      [kosaki.motohiro@jp.fujitsu.com: kill unnecessary lock_page() in vmscan.c]
      [kosaki.motohiro@jp.fujitsu.com: revert migration change of unevictable lru infrastructure]
      [kosaki.motohiro@jp.fujitsu.com: revert to unevictable-lru-infrastructure-kconfig-fix.patch]
      [kosaki.motohiro@jp.fujitsu.com: restore patch failure of vmstat-unevictable-and-mlocked-pages-vm-events.patch]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Debugged-by: NBenjamin Kidwell <benjkidwell@yahoo.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      894bc310
    • N
      vmscan: move isolate_lru_page() to vmscan.c · 62695a84
      Nick Piggin 提交于
      On large memory systems, the VM can spend way too much time scanning
      through pages that it cannot (or should not) evict from memory.  Not only
      does it use up CPU time, but it also provokes lock contention and can
      leave large systems under memory presure in a catatonic state.
      
      This patch series improves VM scalability by:
      
      1) putting filesystem backed, swap backed and unevictable pages
         onto their own LRUs, so the system only scans the pages that it
         can/should evict from memory
      
      2) switching to two handed clock replacement for the anonymous LRUs,
         so the number of pages that need to be scanned when the system
         starts swapping is bound to a reasonable number
      
      3) keeping unevictable pages off the LRU completely, so the
         VM does not waste CPU time scanning them. ramfs, ramdisk,
         SHM_LOCKED shared memory segments and mlock()ed VMA pages
         are keept on the unevictable list.
      
      This patch:
      
      isolate_lru_page logically belongs to be in vmscan.c than migrate.c.
      
      It is tough, because we don't need that function without memory migration
      so there is a valid argument to have it in migrate.c.  However a
      subsequent patch needs to make use of it in the core mm, so we can happily
      move it to vmscan.c.
      
      Also, make the function a little more generic by not requiring that it
      adds an isolated page to a given list.  Callers can do that.
      
      	Note that we now have '__isolate_lru_page()', that does
      	something quite different, visible outside of vmscan.c
      	for use with memory controller.  Methinks we need to
      	rationalize these names/purposes.	--lts
      
      [akpm@linux-foundation.org: fix mm/memory_hotplug.c build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      62695a84
  14. 13 8月, 2008 1 次提交
  15. 25 7月, 2008 1 次提交
  16. 05 7月, 2008 1 次提交
  17. 28 4月, 2008 6 次提交
    • L
      mempolicy: use struct mempolicy pointer in shmem_sb_info · 71fe804b
      Lee Schermerhorn 提交于
      This patch replaces the mempolicy mode, mode_flags, and nodemask in the
      shmem_sb_info struct with a struct mempolicy pointer, initialized to NULL.
      This removes dependency on the details of mempolicy from shmem.c and hugetlbfs
      inode.c and simplifies the interfaces.
      
      mpol_parse_str() in mempolicy.c is changed to return, via a pointer to a
      pointer arg, a struct mempolicy pointer on success.  For MPOL_DEFAULT, the
      returned pointer is NULL.  Further, mpol_parse_str() now takes a 'no_context'
      argument that causes the input nodemask to be stored in the w.user_nodemask of
      the created mempolicy for use when the mempolicy is installed in a tmpfs inode
      shared policy tree.  At that time, any cpuset contextualization is applied to
      the original input nodemask.  This preserves the previous behavior where the
      input nodemask was stored in the superblock.  We can think of the returned
      mempolicy as "context free".
      
      Because mpol_parse_str() is now calling mpol_new(), we can remove from
      mpol_to_str() the semantic checks that mpol_new() already performs.
      
      Add 'no_context' parameter to mpol_to_str() to specify that it should format
      the nodemask in w.user_nodemask for 'bind' and 'interleave' policies.
      
      Change mpol_shared_policy_init() to take a pointer to a "context free" struct
      mempolicy and to create a new, "contextualized" mempolicy using the mode,
      mode_flags and user_nodemask from the input mempolicy.
      
        Note: we know that the mempolicy passed to mpol_to_str() or
        mpol_shared_policy_init() from a tmpfs superblock is "context free".  This
        is currently the only instance thereof.  However, if we found more uses for
        this concept, and introduced any ambiguity as to whether a mempolicy was
        context free or not, we could add another internal mode flag to identify
        context free mempolicies.  Then, we could remove the 'no_context' argument
        from mpol_to_str().
      
      Added shmem_get_sbmpol() to return a reference counted superblock mempolicy,
      if one exists, to pass to mpol_shared_policy_init().  We must add the
      reference under the sb stat_lock to prevent races with replacement of the mpol
      by remount.  This reference is removed in mpol_shared_policy_init().
      
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: yet another build fix]
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      71fe804b
    • L
      mempolicy: support mpol=local tmpfs mount option · 3f226aa1
      Lee Schermerhorn 提交于
      For tmpfs/shmem shared policies, MPOL_DEFAULT is not necessarily equivalent to
      "local allocation".  Because shared policies are at the same "scope" level
      [see Documentation/vm/numa_memory_policy.txt], as vma policies MPOL_DEFAULT
      means "fall back to current task policy".
      
      This patch extends the memory policy string parsing function to display
      "local" for MPOL_PREFERRED + MPOL_F_LOCAL.  This allows one to specify local
      allocation as the default policy for shared memory areas via the tmpfs mpol
      mount option, regardless of the current task's policy.
      
      Also, "local" is now displayed for this policy.  This patch allows us to
      accept the same input format as the display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f226aa1
    • L
      mempolicy: rework shmem mpol parsing and display · 095f1fc4
      Lee Schermerhorn 提交于
      mm/shmem.c currently contains functions to parse and display memory policy
      strings for the tmpfs 'mpol' mount option.  Move this to mm/mempolicy.c with
      the rest of the mempolicy support.  With subsequent patches, we'll be able to
      remove knowledge of the details [mode, flags, policy, ...] completely from
      shmem.c
      
      1) replace shmem_parse_mpol() in mm/shmem.c with mpol_parse_str() in
         mm/mempolicy.c.  Rework to use the policy_types[] array [used by
         mpol_to_str()] to look up mode by name.
      
      2) use mpol_to_str() to format policy for shmem_show_mpol().  mpol_to_str()
         expects a pointer to a struct mempolicy, so temporarily construct one.
         This will be replaced with a reference to a struct mempolicy in the tmpfs
         superblock in a subsequent patch.
      
         NOTE 1: I changed mpol_to_str() to use a colon ':' rather than an equal
         sign '=' as the nodemask delimiter to match mpol_parse_str() and the
         tmpfs/shmem mpol mount option formatting that now uses mpol_to_str().  This
         is a user visible change to numa_maps, but then the addition of the mode
         flags already changed the display.  It makes sense to me to have the mounts
         and numa_maps display the policy in the same format.  However, if anyone
         objects strongly, I can pass the desired nodemask delimeter as an arg to
         mpol_to_str().
      
         Note 2: Like show_numa_map(), I don't check the return code from
         mpol_to_str().  I do use a longer buffer than the one provided by
         show_numa_map(), which seems to have sufficed so far.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      095f1fc4
    • L
      mempolicy: clean-up mpol-to-str() mempolicy formatting · 2291990a
      Lee Schermerhorn 提交于
      mpol-to-str() formats memory policies into printable strings.  Currently this
      is only used to display "numa_maps".  A subsequent patch will use
      mpol_to_str() for formatting tmpfs [shmem] mpol mount options, allowing us to
      remove essentially duplicate code in mm/shmem.c.  This patch cleans up
      mpol_to_str() generally and in preparation for that patch.
      
      1) show_numa_maps() is not checking the return code from mpol_to_str().
         There's not a lot we can do in this context if mpol_to_str() did return the
         error [insufficient space in buffer].  Proposed "solution": just check,
         under DEBUG_VM, that callers are providing sufficient buffer space for the
         policy, flags, and a few nodes.  This way, we'll get some display.
         show_numa_maps() is providing a 50-byte buffer, so it won't trip this
         check.  50-bytes should be sufficient unless one has a large number of
         nodes in a very sparse nodemask.
      
      2) The display of the new mode flags ["static" & "relative"] was set up to
         display multiple flags, separated by a "bar" '|'.  However, this support is
         incomplete--e.g., need_bar was never incremented; and currently, these two
         flags are mutually exclusive.  So remove the "bar" support, for now, and
         only display one flag.
      
      3) Use snprint() to format flags, so as not to overflow the buffer.  Not
         that it's ever happed, AFAIK.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2291990a
    • L
      mempolicy: use MPOL_F_LOCAL to Indicate Preferred Local Policy · fc36b8d3
      Lee Schermerhorn 提交于
      Now that we're using "preferred local" policy for system default, we need to
      make this as fast as possible.  Because of the variable size of the mempolicy
      structure [based on size of nodemasks], the preferred_node may be in a
      different cacheline from the mode.  This can result in accessing an extra
      cacheline in the normal case of system default policy.  Suspect this is the
      cause of an observed 2-3% slowdown in page fault testing relative to kernel
      without this patch series.
      
      To alleviate this, use an internal mode flag, MPOL_F_LOCAL in the mempolicy
      flags member which is guaranteed [?] to be in the same cacheline as the mode
      itself.
      
      Verified that reworked mempolicy now performs slightly better on 25-rc8-mm1
      for both anon and shmem segments with system default and vma [preferred local]
      policy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc36b8d3
    • L
      mempolicy: mPOL_PREFERRED cleanups for "local allocation" · 53f2556b
      Lee Schermerhorn 提交于
      Here are a couple of "cleanups" for MPOL_PREFERRED behavior when
      v.preferred_node < 0 -- i.e., "local allocation":
      
      1)  [do_]get_mempolicy() calls the now renamed get_policy_nodemask()
          to fetch the nodemask associated with a policy.  Currently,
          get_policy_nodemask() returns the set of nodes with memory, when
          the policy 'mode' is 'PREFERRED, and the preferred_node is < 0.
          Change to return an empty nodemask, as this is what was specified
          to achieve "local allocation".
      
      2)  When a task is moved into a [new] cpuset, mpol_rebind_policy() is
          called to adjust any task and vma policy nodes to be valid in the
          new cpuset.  However, when the policy is MPOL_PREFERRED, and the
          preferred_node is <0, no rebind is necessary.  The "local allocation"
          indication is valid in any cpuset.  Existing code will "do the right
          thing" because node_remap() will just return the argument node when
          it is outside of the valid range of node ids.  However, I think it is
          clearer and cleaner to skip the remap explicitly in this case.
      
      3)  mpol_to_str() produces a printable, "human readable" string from a
          struct mempolicy.  For MPOL_PREFERRED with preferred_node <0,  show
          "local", as this indicates local allocation, as the task migrates
          among nodes.  Note that this matches the usage of "local allocation"
          in libnuma() and numactl.  Without this change, I believe that node_set()
          [via set_bit()] will set bit 31, resulting in a misleading display.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f2556b