1. 09 Jun 2010 (1 commit)
    • sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Authored by Tejun Heo
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
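
      A minimal sketch of the ordering idea for the cpu-up path (the callback
      names and priority values below are illustrative assumptions, not the
      patch's exact symbols):

      	/* callbacks defined elsewhere; shown here only for the priorities */
      	extern int sched_cpu_active(struct notifier_block *, unsigned long, void *);
      	extern int cpuset_cpu_active(struct notifier_block *, unsigned long, void *);

      	/* cpu_active flips in the highest-priority notifier; the cpuset/
      	 * sched-domain update runs second, so every lower-priority
      	 * CPU_ONLINE notifier already sees a consistent state. */
      	static struct notifier_block sched_active_nb = {
      		.notifier_call = sched_cpu_active,   /* sets the cpu_active bit */
      		.priority      = INT_MAX,            /* runs first */
      	};
      	static struct notifier_block cpuset_active_nb = {
      		.notifier_call = cpuset_cpu_active,  /* calls cpuset_update_active_cpus() */
      		.priority      = INT_MAX - 1,        /* runs immediately after */
      	};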
      
      The cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus(), which just updates the configuration and
      is now called from the cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains(), making a separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
      3a101d05
  2. 28 May 2010 (1 commit)
    • cpusets: new round-robin rotor for SLAB allocations · 6adef3eb
      Authored by Jack Steiner
      We have observed several workloads running on multi-node systems where
      memory is assigned unevenly across the nodes in the system.  There are
      numerous reasons for this but one is the round-robin rotor in
      cpuset_mem_spread_node().
      
      For example, a simple test that writes a multi-page file will allocate
      pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
      allocates on odd nodes & skips even nodes).
      
      An example is shown below.  The program "lfile" writes a file consisting
      of 10 pages.  The program then mmaps the file & uses get_mempolicy(...,
      MPOL_F_NODE) to determine the nodes where the file pages were allocated.
      The output is shown below:
      
      	# ./lfile
      	 allocated on nodes: 2 4 6 0 1 2 6 0 2
      
      There is a single rotor that is used for allocating both file pages & slab
      pages.  Writing the file allocates both a data page & a slab page
      (buffer_head).  This advances the RR rotor 2 nodes for each page
      allocated.
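
      A minimal sketch of such a per-purpose rotor (essentially what
      cpuset_mem_spread_node() does; treat the exact shape as illustrative):

      	/* pick the next allowed node, wrapping around the nodemask */
      	static int spread_node(nodemask_t *allowed, int *rotor)
      	{
      		int node = next_node(*rotor, *allowed);   /* node after *rotor */
      		if (node == MAX_NUMNODES)
      			node = first_node(*allowed);      /* wrap around */
      		*rotor = node;
      		return node;
      	}

      With a single shared rotor, each file page written advances this twice
      (one data page plus one buffer_head); a separate slab rotor keeps the
      data-page sequence stepping by one node at a time.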
      
      A quick test confirms that this is the cause of the uneven
      allocation:
      
      	# echo 0 >/dev/cpuset/memory_spread_slab
      	# ./lfile
      	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5
      
      This patch introduces a second rotor that is used for slab allocations.
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6adef3eb
  3. 25 May 2010 (1 commit)
    • cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Authored by Miao Xie
      Before applying this patch, cpuset updates task->mems_allowed and the
      mempolicy by setting all new bits in the nodemask first and clearing all
      old, no-longer-allowed bits later.  But along the way, the allocator may
      find that there is no node left to allocate memory from.
      
      The reason is that while cpuset rebinds the task's mempolicy, it clears
      the nodes which the allocator can allocate pages on.  For example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes the problem by expanding the node range first (setting
      the newly allowed bits) and shrinking it lazily (clearing the newly
      disallowed bits).  A variable tells the write-side task that a read-side
      task is reading the nodemask, and the write-side task clears the newly
      disallowed nodes only after the read-side task finishes its current
      memory allocation.
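
      A minimal sketch of the read-side bracket this describes (the helper
      names are assumptions; only the protocol is taken from the text above):

      	/* reader: tell writers "I am using mems_allowed" for the duration
      	 * of one allocation, so they defer clearing newly disallowed bits */
      	get_mems_allowed();
      	page = __alloc_pages_nodemask(gfp_mask, order, zonelist, nodemask);
      	put_mems_allowed();          /* writer may now shrink the mask */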
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
  4. 03 Apr 2010 (2 commits)
    • sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Authored by Oleg Nesterov
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
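
      A minimal sketch of what such a lockless helper has to look like (the
      task_cs() accessor and the RCU-only locking are assumptions for
      illustration):

      	/* widen the task's affinity to its cpuset without cpuset locks
      	 * or task_lock(), so it is safe from ttwu/select_fallback_rq() */
      	void cpuset_cpus_allowed_fallback_sketch(struct task_struct *tsk)
      	{
      		rcu_read_lock();
      		cpumask_copy(&tsk->cpus_allowed, task_cs(tsk)->cpus_allowed);
      		rcu_read_unlock();
      		/* callers still fall back to cpu_possible_mask if even the
      		 * cpuset contains no online cpu */
      	}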
      
      I am not proud of this patch. Everything which needs such a fat comment
      can't be good even if correct. But I'd prefer to not change the locking
      rules in the code I hardly understand, and in any case I believe this
      simple change makes the code much more correct compared to the
      deadlocks we currently have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9084bb82
    • sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Authored by Oleg Nesterov
      This patch just states the fact that the cpusets/cpu-hotplug interaction
      is broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in the
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
        callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
        stop_machine() preempts T, and then migration_call(CPU_DEAD) tries to
        take cpuset_lock() and hangs forever because CPU is already dead and
        thus T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      897f0b3c
  5. 17 Jun 2009 (1 commit)
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Authored by Miao Xie
      Fix allocation of page cache/slab objects on disallowed nodes when memory
      spread is set, by updating each task's mems_allowed as soon as its
      cpuset's mems is changed.
      
      In order to update tasks' mems_allowed in time, we must modify the
      memory-policy code, because the memory policy was originally applied only
      in the process's own context.  After applying this patch, one task
      directly manipulates another's mems_allowed, and we use alloc_lock in the
      task_struct to protect the task's mems_allowed and memory policy.
      
      In the fast path, however, we do not take that lock, because doing so
      could cause a performance regression.  Without a lock, though, the task
      might see an empty nodemask while its cpuset's mems_allowed is being
      changed to some non-overlapping set.  In order to avoid that, we set all
      newly allowed nodes first, then clear the newly disallowed ones.
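
      A minimal sketch of that two-step update (tsk and newmems stand for the
      writer's task pointer and the new mask; the names are illustrative):

      	/* 1) grow: old | new -- a racing allocator still sees every
      	 *    previously allowed node as allowed */
      	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
      	/* 2) shrink to exactly the new mask */
      	tsk->mems_allowed = *newmems;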
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58568d2a
  6. 03 Apr 2009 (1 commit)
  7. 30 Mar 2009 (1 commit)
  8. 09 Jan 2009 (1 commit)
  9. 07 Jan 2009 (1 commit)
  10. 20 Nov 2008 (1 commit)
  11. 07 Sep 2008 (1 commit)
    • sched: arch_reinit_sched_domains() must destroy domains to force rebuild · dfb512ec
      Authored by Max Krasnyansky
      What I realized recently is that calling rebuild_sched_domains() in
      arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
      partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
      and will not actually rebuild anything if new domain masks match the old ones.
      
      What this means is that doing
           echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
      on a system with cpusets enabled will not take effect until something changes
      in the cpuset setup (i.e. new sets created or deleted).
      
      This patch restores the correct behaviour: domains must be rebuilt in
      order for the MC power-saving flags to take effect.
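
      One way to force that rebuild, sketched under the assumption that tearing
      the domains down first defeats the "new masks match old masks"
      short-circuit (not necessarily the patch's exact shape):

      	int arch_reinit_sched_domains(void)
      	{
      		get_online_cpus();
      		partition_sched_domains(0, NULL, NULL);  /* destroy current domains */
      		rebuild_sched_domains();                 /* rebuild from scratch    */
      		put_online_cpus();
      		return 0;
      	}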
      
      Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS.
      Also tested on a dual-core Core2 laptop. Lockdep is happy and things are
      working as expected.
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dfb512ec
  12. 18 Jul 2008 (1 commit)
    • cpu hotplug, sched: Introduce cpu_active_map and redo sched domain management (take 2) · e761b772
      Authored by Max Krasnyansky
      This is based on Linus' idea of creating cpu_active_map that prevents
      scheduler load balancer from migrating tasks to the cpu that is going
      down.
      
      It allows us to simplify the domain management code and avoid unnecessary
      domain rebuilds during cpu hotplug event handling.
      
      Please ignore the cpusets part for now. It needs some more work in order
      to avoid crazy lock nesting. Although I did simplify and unify the domain
      reinitialization logic. We now simply call partition_sched_domains() in
      all the cases. This means that we're using the exact same code paths as in
      the cpusets case and hence the tests below cover cpusets too.
      Cpuset changes to make rebuild_sched_domains() callable from various
      contexts are in a separate patch (right after this one).
      
      This not only boots but also easily handles
      	while true; do make clean; make -j 8; done
      and
      	while true; do on-off-cpu 1; done
      at the same time.
      (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).
      
      Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
      this on it right now in gnome-terminal and things are moving just fine.
      
      Also this is running with most of the debug features enabled (lockdep,
      mutex debugging, etc) with no BUG_ONs or lockdep complaints so far.
      
      I believe I addressed all of Dmitry's comments on the original Linus
      version. I changed both the fair and rt balancers to mask out non-active
      cpus, and replaced cpu_is_offline() with !cpu_active() in the main
      scheduler code where it made sense (to me).
      Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Gregory Haskins <ghaskins@novell.com>
      Cc: dmitry.adamushko@gmail.com
      Cc: pj@sgi.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e761b772
  13. 28 Apr 2008 (1 commit)
    • mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Authored by Mel Gorman
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
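
      A minimal sketch of what an MPOL_BIND allocation looks like through the
      new nodemask-taking entry point (the helper names and the mempolicy field
      are assumptions for illustration):

      	/* local node's distance-ordered zonelist, filtered by the policy's
      	 * allowed-node mask instead of a custom MPOL_BIND zonelist */
      	struct page *page = __alloc_pages_nodemask(gfp, order,
      				node_zonelist(numa_node_id(), gfp),
      				&pol->v.nodes);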
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      19770b32
  14. 20 Apr 2008 (1 commit)
  15. 12 Feb 2008 (1 commit)
    • mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      Authored by KOSAKI Motohiro
      KOSAKI Motohiro noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mems_allowed--for the same reason:
               the node mask contains disallowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
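
      A minimal sketch of that masking, which the proposed fix below folds into
      mpol_check_policy() for set_mempolicy() as well (user_nodes and was_empty
      are illustrative locals):

      	nodemask_t nodes = *user_nodes;                 /* caller-supplied mask  */
      	int was_empty = nodes_empty(nodes);             /* remembered for checks */
      	nodes_and(nodes, nodes, current->mems_allowed); /* drop disallowed nodes */
      	if (!was_empty && nodes_empty(nodes))
      		return -EINVAL;                 /* only invalid nodes were given */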
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                 set_mempolicy().  Disallowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that the incoming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that the contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if the incoming mask was not empty
            and the resulting mask IS empty, the user specified invalid nodes.
            Return EINVAL.
         d) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by KOSAKI Motohiro, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      31f1de46
  16. 09 Feb 2008 (1 commit)
    • proc: seqfile convert proc_pid_status to properly handle pid namespaces · df5f8314
      Authored by Eric W. Biederman
      Currently we may look up the pid in the wrong pid namespace.  So convert
      proc_pid_status to seq_file, which ensures that the proper pid namespace
      is passed in.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: s390 build fix]
      [akpm@linux-foundation.org: fix task_name() output]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Andrew Morgan <morgan@kernel.org>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df5f8314
  17. 20 Oct 2007 (2 commits)
    • hotplug cpu: migrate a task within its cpuset · 470fd646
      Authored by Cliff Wickman
      When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
      been running on that cpu.
      
      Currently, such a task is migrated:
       1) to any cpu on the same node as the disabled cpu, which is both online
          and among that task's cpus_allowed
       2) to any cpu which is both online and among that task's cpus_allowed
      
      It is typical of a multithreaded application running on a large NUMA system to
      have its tasks confined to a cpuset so as to cluster them near the memory that
      they share.  Furthermore, it is typical to explicitly place such a task on a
      specific cpu in that cpuset.  And in that case the task's cpus_allowed
      includes only a single cpu.
      
      This patch inserts a preference to migrate such a task to some cpu within
      its cpuset (and sets its cpus_allowed to its entire cpuset).
      
      With this patch, the task is migrated:
       1) to any cpu on the same node as the disabled cpu, which is both online
          and among that task's cpus_allowed
       2) to any online cpu within the task's cpuset
       3) to any cpu which is both online and among that task's cpus_allowed
      
      In order to do this, move_task_off_dead_cpu() must make a call to
      cpuset_cpus_allowed_locked(), a new subset of cpuset_cpus_allowed(), that will
      not block.  (name change - per Oleg's suggestion)
      
      Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to
      hold the cpuset mutex during the whole migrate_live_tasks() and
      migrate_dead_tasks() procedure.
      
      [akpm@linux-foundation.org: build fix]
      [pj@sgi.com: Fix indentation and spacing]
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      470fd646
    • Task Control Groups: make cpusets a client of cgroups · 8793d854
      Authored by Paul Menage
      Remove the filesystem support logic from the cpusets system and make
      cpusets a cgroup subsystem.
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8793d854
  18. 17 Oct 2007 (2 commits)
  19. 13 Feb 2007 (1 commit)
  20. 31 Dec 2006 (1 commit)
  21. 14 Dec 2006 (1 commit)
    • [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Authored by Paul Jackson
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occasion.
      
      The hardwall version requires that the current task's mems_allowed allows
      the node of the specified zone (or that you're in interrupt, or that
      __GFP_THISNODE is set, or that you're on a one-cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclosing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
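
      A minimal sketch of the explicit choice a caller now makes (the wrapper
      and its sleep test are illustrative; only the two variant names come from
      the text above):

      	static int zone_ok(struct zone *z, gfp_t gfp_mask)
      	{
      		if (in_interrupt() || !(gfp_mask & __GFP_WAIT))
      			/* atomic context: the never-sleeping variant */
      			return cpuset_zone_allowed_hardwall(z, gfp_mask);
      		/* may sleep (and may take callback_mutex); honours gfp_mask */
      		return cpuset_zone_allowed_softwall(z, gfp_mask);
      	}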
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but it seems worth it, to reduce the chance of hitting
      a sleeping-with-irqs-off complaint again.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      02a0e53d
  22. 08 Dec 2006 (2 commits)
    • [PATCH] struct seq_operations and struct file_operations constification · 15ad7cdc
      Authored by Helge Deller
       - move some file_operations structs into the .rodata section
      
       - move static strings from policy_types[] array into the .rodata section
      
       - fix generic seq_operations usages, so that those structs may be defined
         as "const" as well
      
      [akpm@osdl.org: couple of fixes]
      Signed-off-by: Helge Deller <deller@gmx.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      15ad7cdc
    • [PATCH] memory page_alloc zonelist caching speedup · 9276b1bc
      Authored by Paul Jackson
      Optimize the critical zonelist scanning for free pages in the kernel memory
      allocator by caching the zones that were found to be full recently, and
      skipping them.
      
      It remembers the zones in a zonelist that were short of free memory in the
      last second, and it stashes a zone-to-node table in the zonelist struct to
      optimize that conversion (and minimize its cache footprint).
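
      A minimal sketch of the cached state described above (the field names
      'fullzones' and 'last_full_zap' are taken from the notes further down;
      the rest is illustrative):

      	struct zonelist_cache {
      		unsigned short z_to_n[MAX_ZONES_PER_ZONELIST];      /* zone -> node */
      		DECLARE_BITMAP(fullzones, MAX_ZONES_PER_ZONELIST);  /* zone full?   */
      		unsigned long last_full_zap;  /* jiffies of the last 1-second reset */
      	};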
      
      Recent changes:
      
          This differs in a significant way from a similar patch that I
          posted a week ago.  Now, instead of having a nodemask_t of
          recently full nodes, I have a bitmask of recently full zones.
          This solves a problem that last week's patch had: on systems
          with multiple zones per node (such as a DMA zone), it would
          take seeing any of these zones full as meaning that all zones
          on that node were full.
      
          Also I changed names - from "zonelist faster" to "zonelist cache",
          as that seemed to better convey what we're doing here - caching
          some of the key zonelist state (for faster access.)
      
          See below for some performance benchmark results.  After all that
          discussion with David on why I didn't need them, I went and got
          some ;).  I wanted to verify that I had not hurt the normal case
          of memory allocation noticeably.  At least for my one little
          microbenchmark, I found (1) the normal case wasn't affected, and
          (2) workloads that forced scanning across multiple nodes for
          memory improved, with up to 10% fewer system CPU cycles and lower
          elapsed clock time ('sys' and 'real').  Good.  See details, below.
      
          I didn't have the logic in get_page_from_freelist() for various
          full nodes and zone reclaim failures correct.  That should be
          fixed up now - notice the new goto labels zonelist_scan,
          this_zone_full, and try_next_zone, in get_page_from_freelist().
      
      There are two reasons I pursued this alternative, over some earlier
      proposals that would have focused on optimizing the fake numa
      emulation case by caching the last useful zone:
      
       1) Contrary to what I said before, we (SGI, on large ia64 sn2 systems)
          have seen real customer loads where the cost to scan the zonelist
          was a problem, due to many nodes being full of memory before
          we got to a node we could use.  Or at least, I think we have.
          This was related to me by another engineer, based on experiences
          from some time past.  So this is not guaranteed.  Most likely, though.
      
          The following approach should help such real numa systems just as
          much as it helps fake numa systems, or any combination thereof.
      
       2) The effort to distinguish fake from real numa, using node_distance,
          so that we could cache a fake numa node and optimize choosing
          it over equivalent distance fake nodes, while continuing to
          properly scan all real nodes in distance order, was going to
          require a nasty blob of zonelist and node distance munging.
      
          The following approach has no new dependency on node distances or
          zone sorting.
      
      See comment in the patch below for a description of what it actually does.
      
      Technical details of note (or controversy):
      
       - See the use of "zlc_active" and "did_zlc_setup" below, to delay
         adding any work for this new mechanism until we've looked at the
         first zone in zonelist.  I figured the odds of the first zone
         having the memory we needed were high enough that we should just
         look there, first, then get fancy only if we need to keep looking.
      
       - Some odd hackery was needed to add items to struct zonelist, while
         not tripping up the custom zonelists built by the mm/mempolicy.c
         code for MPOL_BIND.  My usual wordy comments below explain this.
         Search for "MPOL_BIND".
      
       - Some per-node data in the struct zonelist is now modified frequently,
         with no locking.  Multiple CPU cores on a node could hit and mangle
         this data.  The theory is that this is just performance hint data,
         and the memory allocator will work just fine despite any such mangling.
         The fields at risk are the struct 'zonelist_cache' fields 'fullzones'
         (a bitmask) and 'last_full_zap' (unsigned long jiffies).  It should
         all be self correcting after at most a one second delay.
      
       - This still does a linear scan of the same lengths as before.  All
         I've optimized is making the scan faster, not algorithmically
         shorter.  It is now able to scan a compact array of 'unsigned
         short' in the case of many full nodes, so one cache line should
         cover quite a few nodes, rather than each node hitting another
         one or two new and distinct cache lines.
      
       - If both Andi and Nick don't find this too complicated, I will be
         (pleasantly) flabbergasted.
      
       - I removed the comment claiming we only use one cacheline's worth of
         zonelist.  We seem, at least in the fake numa case, to have put the
         lie to that claim.
      
       - I pay no attention to the various watermarks and such in this performance
         hint.  A node could be marked full for one watermark, and then skipped
         over when searching for a page using a different watermark.  I think
         that's actually quite ok, as it will tend to slightly increase the
         spreading of memory over other nodes, away from a memory stressed node.
      
      ===============
      
      Performance - some benchmark results and analysis:
      
      This benchmark runs a memory hog program that uses multiple
      threads to touch a lot of memory as quickly as it can.
      
      Multiple runs were made, touching 12, 38, 64 or 90 GBytes out of
      the total 96 GBytes on the system, and using 1, 19, 37, or 55
      threads (on a 56 CPU system.)  System, user and real (elapsed)
      timings were recorded for each run, shown in units of seconds,
      in the table below.
      
      Two kernels were tested - 2.6.18-mm3 and the same kernel with
      this zonelist caching patch added.  The table also shows the
      percentage improvement the zonelist caching sys time is over
      (lower than) the stock *-mm kernel.
      
            number     2.6.18-mm3	   zonelist-cache    delta (< 0 good)	percent
       GBs    N  	------------	   --------------    ----------------	systime
       mem threads   sys user  real	  sys  user  real     sys  user  real	 better
        12	 1     153   24   177	  151	 24   176      -2     0    -1	   1%
        12	19	99   22     8	   99	 22	8	0     0     0	   0%
        12	37     111   25     6	  112	 25	6	1     0     0	  -0%
        12	55     115   25     5	  110	 23	5      -5    -2     0	   4%
        38	 1     502   74   576	  497	 73   570      -5    -1    -6	   0%
        38	19     426   78    48	  373	 76    39     -53    -2    -9	  12%
        38	37     544   83    36	  547	 82    36	3    -1     0	  -0%
        38	55     501   77    23	  511	 80    24      10     3     1	  -1%
        64	 1     917  125  1042	  890	124  1014     -27    -1   -28	   2%
        64	19    1118  138   119	  965	141   103    -153     3   -16	  13%
        64	37    1202  151    94	 1136	150    81     -66    -1   -13	   5%
        64	55    1118  141    61	 1072	140    58     -46    -1    -3	   4%
        90	 1    1342  177  1519	 1275	174  1450     -67    -3   -69	   4%
        90	19    2392  199   192	 2116	189   176    -276   -10   -16	  11%
        90	37    3313  238   175	 2972	225   145    -341   -13   -30	  10%
        90	55    1948  210   104	 1843	213   100    -105     3    -4	   5%
      
      Notes:
       1) This test ran a memory hog program that started a specified number N of
          threads, and had each thread allocate and touch 1/N'th of
          the total memory to be used in the test run in a single loop,
          writing a constant word to memory, one store every 4096 bytes.
          Watching this test during some earlier trial runs, I would see
          each of these threads sit down on one CPU and stay there, for
          the remainder of the pass, a different CPU for each thread.
      
       2) The 'real' column is not comparable to the 'sys' or 'user' columns.
          The 'real' column is seconds wall clock time elapsed, from beginning
          to end of that test pass.  The 'sys' and 'user' columns are total
          CPU seconds spent on that test pass.  For a 19 thread test run,
          for example, the sum of 'sys' and 'user' could be up to 19 times the
          number of 'real' elapsed wall clock seconds.
      
       3) Tests were run on a fresh, single-user boot, to minimize the amount
          of memory already in use at the start of the test, and to minimize
          the amount of background activity that might interfere.
      
       4) Tests were done on a 56 CPU, 28 Node system with 96 GBytes of RAM.
      
       5) Notice that the 'real' time gets large for the single thread runs, even
          though the measured 'sys' and 'user' times are modest.  I'm not sure what
          that means - probably something to do with it being slow for one thread to
          be accessing memory a long ways away.  Perhaps the fake numa system, running
          ostensibly the same workload, would not show this substantial degradation
          of 'real' time for one thread on many nodes -- let's hope not.
      
       6) The high thread count passes (one thread per CPU - on 55 of 56 CPUs)
          ran quite efficiently, as one might expect.  Each pair of threads needed
          to allocate and touch the memory on the node the two threads shared, a
          pleasantly parallelizable workload.
      
       7) The intermediate thread count passes, when asking for a lot of memory,
          forcing them to go to a few neighboring nodes, improved the most with this
          zonelist caching patch.
      
      Conclusions:
       * This zonelist cache patch probably makes little difference one way or the
         other for most workloads on real numa hardware, if those workloads avoid
         heavy off node allocations.
       * For memory intensive workloads requiring substantial off-node allocations
         on real numa hardware, this patch improves both kernel and elapsed timings
         by up to ten percent.
       * For fake numa systems, I'm optimistic, but will have to leave that up to
         Rohit Seth to actually test (once I get him a 2.6.18 backport.)
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Rohit Seth <rohitseth@google.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Cc: David Rientjes <rientjes@cs.washington.edu>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9276b1bc
  23. 30 Sep 2006 (1 commit)
    • [PATCH] cpuset: top_cpuset tracks hotplug changes to node_online_map · 38837fc7
      Authored by Paul Jackson
      Change the list of memory nodes allowed to tasks in the top (root) cpuset
      to dynamically track which memory nodes are online, using a call to a
      cpuset hook from the memory hotplug code.  Make this top 'mems' file
      read-only.
      
      On systems that have cpusets configured in their kernel, but that aren't
      actively using cpusets (for some distros, this covers the majority of
      systems) all tasks end up in the top cpuset.
      
      If that system does support memory hotplug, then these tasks cannot make
      use of memory nodes that are added after system boot, because the memory
      nodes are not allowed in the top cpuset.  This is a surprising regression
      over earlier kernels that didn't have cpusets enabled.
      
      One key motivation for this change is to remain consistent with the
      behaviour for the top_cpuset's 'cpus', which is also read-only, and which
      automatically tracks the cpu_online_map.
      
      This change also has the minor benefit that it fixes a long standing,
      little noticed, minor bug in cpusets.  The cpuset performance tweak to
      short circuit the cpuset_zone_allowed() check on systems with just a single
      cpuset (see 'number_of_cpusets', in linux/cpuset.h) meant that simply
      changing the 'mems' of the top_cpuset had no effect, even though the change
      (the write system call) appeared to succeed.  With the following change,
      that write to the 'mems' file fails -EACCES, and the 'mems' file stubbornly
      refuses to be changed via user space writes.  Thus no one should be misled
      into thinking they've changed the top_cpuset's 'mems' when in fact they
      haven't.
      
      In order to keep the behaviour of cpusets consistent between systems
      actively making use of them and systems not using them, this patch changes
      the behaviour of the 'mems' file in the top (root) cpuset, making it read
      only, and making it automatically track the value of node_online_map.  Thus
      tasks in the top cpuset will have automatic use of hot plugged memory nodes
      allowed by their cpuset.
      
      [akpm@osdl.org: build fix]
      [bunk@stusta.de: build fix]
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Adrian Bunk <bunk@stusta.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      38837fc7
  24. 24 Mar 2006 (1 commit)
    • [PATCH] cpuset memory spread basic implementation · 825a46af
      Authored by Paul Jackson
      This patch provides the implementation and cpuset interface for an alternative
      memory allocation policy that can be applied to certain kinds of memory
      allocations, such as the page cache (file system buffers) and some slab caches
      (such as inode caches).
      
      The policy is called "memory spreading." If enabled, it spreads out these
      kinds of memory allocations over all the nodes allowed to a task, instead of
      preferring to place them on the node where the task is executing.
      
      All other kinds of allocations, including anonymous pages for a task's stack
      and data regions, are not affected by this policy choice, and continue to be
      allocated preferring the node local to execution, as modified by the NUMA
      mempolicy.
      
      There are two boolean flag files per cpuset that control where the kernel
      allocates pages for the file system buffers and related in kernel data
      structures.  They are called 'memory_spread_page' and 'memory_spread_slab'.
      
      If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
      kernel will spread the file system buffers (page cache) evenly over all the
      nodes that the faulting task is allowed to use, instead of preferring to put
      those pages on the node where the task is running.
      
      If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
      kernel will spread some file system related slab caches, such as for inodes
      and dentries evenly over all the nodes that the faulting task is allowed to
      use, instead of preferring to put those pages on the node where the task is
      running.
      
      The implementation is simple.  Setting the cpuset flags 'memory_spread_page'
      or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
      PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
      subsequently joins that cpuset.  In subsequent patches, the page allocation
      calls for the affected page cache and slab caches are modified to perform an
      inline check for these flags, and if set, a call to a new routine
      cpuset_mem_spread_node() returns the node to prefer for the allocation.
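
      A minimal sketch of that inline check on the page-cache path
      (cpuset_mem_spread_node() is named above; cpuset_do_page_mem_spread() and
      the wrapper itself are illustrative assumptions):

      	struct page *page_cache_alloc_sketch(gfp_t gfp)
      	{
      		if (cpuset_do_page_mem_spread()) {        /* PF_SPREAD_PAGE set?  */
      			int nid = cpuset_mem_spread_node();   /* next node from rotor */
      			return alloc_pages_node(nid, gfp, 0);
      		}
      		return alloc_pages(gfp, 0);               /* default: local node  */
      	}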
      
      The cpuset_mem_spread_node() routine is also simple.  It uses the value of a
      per-task rotor, cpuset_mem_spread_rotor, to select the next node in the
      current task's mems_allowed to prefer for the allocation.
      
      This policy can provide substantial improvements for jobs that need to place
      thread local data on the corresponding node, but that need to access large
      file system data sets that must be spread across the several nodes in the
      job's cpuset in order to fit.  Without this patch, especially for jobs that
      might have one thread reading in the data set, the memory allocation across
      the nodes in the job's cpuset can become very uneven.
      
      A couple of Copyright year ranges are updated as well.  And a couple of email
      addresses that can be found in the MAINTAINERS file are removed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      825a46af
  25. 15 Jan 2006 (1 commit)
    • [PATCH] cpuset oom lock fix · 505970b9
      Authored by Paul Jackson
      The problem, reported in:
      
        http://bugzilla.kernel.org/show_bug.cgi?id=5859
      
      and by various other email messages and lkml posts is that the cpuset hook
      in the oom (out of memory) code can try to take a cpuset semaphore while
      holding the tasklist_lock (a spinlock).
      
      One must not sleep while holding a spinlock.
      
      The fix seems easy enough - move the cpuset semaphore region outside the
      tasklist_lock region.
      
      This required a few lines of mechanism to implement.  The oom code where
      the locking needs to be changed does not have access to the cpuset locks,
      which are internal to kernel/cpuset.c only.  So I provided a couple more
      cpuset interface routines, available to the rest of the kernel, which
      simply take and drop the lock needed here (cpuset's callback_sem).
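
      A minimal sketch of the resulting lock ordering in the oom path, assuming
      the new helpers are the cpuset_lock()/cpuset_unlock() pair referred to in
      the hotplug migration entry above (470fd646):

      	cpuset_lock();                  /* may sleep: takes callback_sem    */
      	read_lock(&tasklist_lock);      /* spinlock: no sleeping below here */
      	/* ... pick a victim task and kill it ... */
      	read_unlock(&tasklist_lock);
      	cpuset_unlock();                /* drop callback_sem                */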
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      505970b9
  26. 09 Jan 2006 (6 commits)
    • [PATCH] cpuset: remove test for null cpuset from alloc code path · c417f024
      Authored by Paul Jackson
      Remove a couple of more lines of code from the cpuset hooks in the page
      allocation code path.
      
      There was a check for a NULL cpuset pointer in the routine
      cpuset_update_task_memory_state() that was only needed during system boot,
      after the memory subsystem was initialized, before the cpuset subsystem was
      initialized, to catch a NULL task->cpuset pointer.
      
      Add a cpuset_init_early() routine, just before the mem_init() call in
      init/main.c, that sets up just enough of the init tasks cpuset structure to
      render cpuset_update_task_memory_state() calls harmless.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      c417f024
    • [PATCH] cpuset: number_of_cpusets optimization · 202f72d5
      Authored by Paul Jackson
      Easy little optimization hack to avoid actually having to call
      cpuset_zone_allowed() and check mems_allowed, in the main page allocation
      routine, __alloc_pages().  This saves several CPU cycles per page allocation
      on systems not using cpusets.
      
      A counter is updated each time a cpuset is created or removed, and whenever
      there is only one cpuset in the system, it must be the root cpuset, which
      contains all CPUs and all Memory Nodes.  In that case, when the counter is
      one, all allocations are allowed.
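
      A minimal sketch of the resulting fast path ('number_of_cpusets' is the
      counter described above; the slow-path name is an assumption):

      	static inline int cpuset_zone_allowed_sketch(struct zone *z, gfp_t gfp_mask)
      	{
      		if (number_of_cpusets <= 1)    /* only the root cpuset exists,   */
      			return 1;              /* so every allocation is allowed */
      		return __cpuset_zone_allowed(z, gfp_mask);  /* full check */
      	}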
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      202f72d5
    • [PATCH] cpuset: implement cpuset_mems_allowed · 909d75a3
      Authored by Paul Jackson
      Provide a cpuset_mems_allowed() method, which the sys_migrate_pages() code
      needed, to obtain the mems_allowed vector of a cpuset, and replace the
      workaround in sys_migrate_pages() with a call to this new method.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      909d75a3
    • [PATCH] cpuset: combine refresh_mems and update_mems · cf2a473c
      Authored by Paul Jackson
      The important code paths through alloc_pages_current() and alloc_page_vma(),
      by which most kernel page allocations go, both called
      cpuset_update_current_mems_allowed(), which in turn called refresh_mems().
      -Both- of these latter two routines did a task_lock, got the task's cpuset
      pointer, and checked for an out-of-date cpuset->mems_generation.
      
      That was a silly duplication of code and waste of CPU cycles on an important
      code path.
      
      Consolidated those two routines into a single routine, called
      cpuset_update_task_memory_state(), since it updates more than just
      mems_allowed.
      
      Changed all callers of either routine to call the new consolidated routine.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      cf2a473c
    • [PATCH] cpuset: memory pressure meter · 3e0d98b9
      Authored by Paul Jackson
      Provide a simple per-cpuset metric of memory pressure, tracking the -rate-
      that the tasks in a cpuset call try_to_free_pages(), the synchronous
      (direct) memory reclaim code.
      
      This enables batch managers monitoring jobs running in dedicated cpusets to
      efficiently detect what level of memory pressure that job is causing.
      
      This is useful both on tightly managed systems running a wide mix of
      submitted jobs, which may choose to terminate or reprioritize jobs that are
      trying to use more memory than allowed on the nodes assigned them, and with
      tightly coupled, long running, massively parallel scientific computing jobs
      that will dramatically fail to meet required performance goals if they
      start to use more memory than allowed to them.
      
      This patch just provides a very economical way for the batch manager to
      monitor a cpuset for signs of memory pressure.  It's up to the batch
      manager or other user code to decide what to do about it and take action.
      
      ==> Unless this feature is enabled by writing "1" to the special file
          /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
          code of __alloc_pages() for this metric reduces to simply noticing
          that the cpuset_memory_pressure_enabled flag is zero.  So only
          systems that enable this feature will compute the metric.
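
      A minimal sketch of that cheap hook in the reclaim path
      (cpuset_memory_pressure_enabled is named above; the bump helper's exact
      name is an assumption):

      	/* called from the synchronous (direct) reclaim path */
      	#define cpuset_memory_pressure_bump()                           \
      		do {                                                    \
      			if (cpuset_memory_pressure_enabled)             \
      				__cpuset_memory_pressure_bump();        \
      		} while (0)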
      
      Why a per-cpuset, running average:
      
          Because this meter is per-cpuset, rather than per-task or mm, the
          system load imposed by a batch scheduler monitoring this metric is
          sharply reduced on large systems, because a scan of the tasklist can be
          avoided on each set of queries.
      
          Because this meter is a running average, instead of an accumulating
          counter, a batch scheduler can detect memory pressure with a single
          read, instead of having to read and accumulate results for a period of
          time.
      
          Because this meter is per-cpuset rather than per-task or mm, the
          batch scheduler can obtain the key information, memory pressure in a
          cpuset, with a single read, rather than having to query and accumulate
          results over all the (dynamically changing) set of tasks in the cpuset.
      
      A per-cpuset simple digital filter (requires a spinlock and 3 words of data
      per-cpuset) is kept, and updated by any task attached to that cpuset, if it
      enters the synchronous (direct) page reclaim code.
      
      A per-cpuset file provides an integer number representing the recent
      (half-life of 10 seconds) rate of direct page reclaims caused by the tasks
      in the cpuset, in units of reclaims attempted per second, times 1000.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      3e0d98b9
    • [PATCH] cpuset: mempolicy one more nodemask conversion · 5966514d
      Authored by Paul Jackson
      Finish converting mm/mempolicy.c from bitmaps to nodemasks.  The previous
      conversion had left one routine using bitmaps, since it involved a
      corresponding change to kernel/cpuset.c
      
      Fix that interface by replacing it with a simple macro that calls
      nodes_subset(), or, if !CONFIG_CPUSETS, returns (1).
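
      A minimal sketch of such a macro (the name matches the helper that a
      later cleanup, quoted earlier in this log, reports removing from cpuset.h):

      	#ifdef CONFIG_CPUSETS
      	#define cpuset_nodes_subset_current_mems_allowed(nodes)         \
      			nodes_subset((nodes), current->mems_allowed)
      	#else
      	#define cpuset_nodes_subset_current_mems_allowed(nodes)  (1)
      	#endif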
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <christoph@lameter.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      5966514d
  27. 09 Oct 2005 (1 commit)
  28. 08 Sep 2005 (2 commits)
    • [PATCH] cpusets: confine oom_killer to mem_exclusive cpuset · ef08e3b4
      Authored by Paul Jackson
      Now the real motivation for this cpuset mem_exclusive patch series seems
      trivial.
      
      This patch keeps a task in or under one mem_exclusive cpuset from provoking an
      oom kill of a task under a non-overlapping mem_exclusive cpuset.  Since only
      interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive
      containment, there is little to gain from oom killing a task under a
      non-overlapping mem_exclusive cpuset, as almost all kernel and user memory
      allocation must come from disjoint memory nodes.
      
      This patch enables configuring a system so that a runaway job under one
      mem_exclusive cpuset cannot cause the killing of a job in another such cpuset
      that might be using very high compute and memory resources for a prolonged
      time.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      ef08e3b4
    • [PATCH] cpusets: formalize intermediate GFP_KERNEL containment · 9bf2229f
      Authored by Paul Jackson
      This patch makes use of the previously underutilized cpuset flag
      'mem_exclusive' to provide what amounts to another layer of memory placement
      resolution.  With this patch, there are now the following four layers of
      memory placement available:
      
       1) The whole system (interrupt and GFP_ATOMIC allocations can use this),
       2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use),
       3) The current tasks cpuset (GFP_USER allocations constrained to here), and
       4) Specific node placement, using mbind and set_mempolicy.
      
      These nest - each layer is a subset (same or within) of the previous.
      
      Layer (2) above is new, with this patch.  The call used to check whether a
      zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is
      extended to take a gfp_mask argument, and its logic is extended, in the case
      that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset
      hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if
      placement is allowed.  The definition of GFP_USER, which used to be identical
      to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous
      cpuset_gfp_hardwall_flag patch.
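
      A minimal sketch of that layer-2 lookup (the parent walk and the
      is_mem_exclusive() helper are illustrative; only the hierarchy-walk idea
      comes from the text above):

      	/* without __GFP_HARDWALL: is this node allowed in the nearest
      	 * enclosing mem_exclusive cpuset? */
      	static int node_allowed_softwall(int node)
      	{
      		const struct cpuset *cs = current->cpuset;
      		while (cs->parent && !is_mem_exclusive(cs))
      			cs = cs->parent;
      		return node_isset(node, cs->mems_allowed);
      	}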
      
      GFP_ATOMIC and GFP_KERNEL allocations will stay within the current task's
      cpuset, so long as any node therein is not too tight on memory, but will
      escape to the larger layer, if need be.
      
      The intended use is to allow something like a batch manager to handle several
      jobs, each job in its own cpuset, but using common kernel memory for caches
      and such.  Swapper and oom_kill activity is also constrained to Layer (2).  A
      task in or below one mem_exclusive cpuset should not cause swapping on nodes
      in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a
      task in another such cpuset.  Heavy use of kernel memory for i/o caching and
      such by one job should not impact the memory available to jobs in other
      non-overlapping mem_exclusive cpusets.
      
      This patch enables providing hardwall, inescapable cpusets for memory
      allocations of each job, while sharing kernel memory allocations between
      several jobs, in an enclosing mem_exclusive cpuset.
      
      Like Dinakar's patch earlier to enable administering sched domains using the
      cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag
      that had previously done nothing much useful other than restrict what cpuset
      configurations were allowed.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9bf2229f
  29. 17 Apr 2005 (2 commits)
    • [PATCH] cpuset: remove function attribute const · 9a848896
      Authored by Benoit Boissinot
      gcc-4 warns with
      include/linux/cpuset.h:21: warning: type qualifiers ignored on function
      return type
      
      cpuset_cpus_allowed is declared with const
      extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
      
      First const should be __attribute__((const)), but the gcc manual
      explains that:
      
      "Note that a function that has pointer arguments and examines the data
      pointed to must not be declared const. Likewise, a function that calls a
      non-const function usually must not be const. It does not make sense for
      a const function to return void."
      
      The following patch removes const from the function declaration.
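
      A minimal before/after sketch of the declaration change (the "before"
      line is quoted above):

      	/* before: gcc-4 warns that the qualifier is ignored */
      	extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
      	/* after */
      	extern cpumask_t cpuset_cpus_allowed(const struct task_struct *p);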
      Signed-off-by: Benoit Boissinot <benoit.boissinot@ens-lyon.org>
      Acked-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
      9a848896
    • Linux-2.6.12-rc2 · 1da177e4
      Authored by Linus Torvalds
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4