1. 27 Mar 2012 (1 commit)
    • sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
      Authored by Peter Zijlstra
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
      Since CPU_STARTING is run before setting the cpu online, there's a
      (small) window where the cpu is active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu, select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on, since it finds the cpu !online.
      
      select_fallback_rq(), however, computes the new cpu against cpu_active;
      this means it can return the same cpu it started out with, the !online
      one, since that cpu is in fact marked active.
      
      This results in us trying to schedule a task on an offline cpu and
      triggering a WARN in the IPI code.
      
      The solution proposed by Chuansheng Liu of setting cpu_active in
      set_cpu_online() is buggy: firstly, not all archs actually use
      set_cpu_online(); secondly, not all archs call set_cpu_online() with
      IRQs disabled. This means we would introduce either the same race or
      the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
      wrong CPU") -- albeit a much narrower one.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
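      
      A minimal sketch of the reworked candidate test (simplified; the real
      select_fallback_rq() also falls back to the cpuset-allowed and possible
      masks in later passes):
      
      	for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
      		/* require active _and_ online: rejects the active,!online
      		 * CPU_STARTING window as well as online,!active bring-up */
      		if (!cpu_online(dest_cpu))
      			continue;
      		if (!cpu_active(dest_cpu))
      			continue;
      		goto out;	/* dest_cpu is a usable fallback */
      	}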
      Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2baab4e9
  2. 21 Dec 2011 (1 commit)
    • cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask · b246272e
      Authored by David Rientjes
      Kernels where MAX_NUMNODES > BITS_PER_LONG may temporarily see an empty
      nodemask in a tsk's mempolicy if its previous nodemask is remapped onto a
      new set of allowed cpuset nodes where the two nodemasks, as a result of
      the remap, are now disjoint.
      
      c0ff7453 ("cpuset,mm: fix no node to alloc memory when changing
      cpuset's mems") adds get_mems_allowed() to prevent the set of allowed
      nodes from changing for a thread.  This causes any update to a set of
      allowed nodes to stall until put_mems_allowed() is called.
      
      This stall is unnecessary, however, if at least one node remains unchanged
      in the update to the set of allowed nodes.  This was addressed by
      89e8a244 ("cpusets: avoid looping when storing to mems_allowed if one
      node remains set"), but it's still possible that an empty nodemask may be
      read from a mempolicy because the old nodemask may be remapped to the new
      nodemask during rebind.  To prevent this, only avoid the stall if there is
      no mempolicy for the thread being changed.
      
      This is a temporary solution until all reads from mempolicy nodemasks can
      be guaranteed to not be empty without the get_mems_allowed()
      synchronization.
      
      Also moves the check for nodemask intersection inside task_lock() so that
      tsk->mems_allowed cannot change.  This ensures that nothing can set this
      tsk's mems_allowed out from under us and also protects tsk->mempolicy.
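      
      A hedged sketch of the resulting check in cpuset_change_task_nodemask()
      (close to my reading of the patch; simplified):
      
      	task_lock(tsk);	/* tsk->mems_allowed and tsk->mempolicy are stable */
      	/*
      	 * Only take the slow, synchronized path when it is really needed:
      	 * when the task has a mempolicy (a rebind may transiently empty it)
      	 * or when the old and new nodemasks are disjoint.
      	 */
      	need_loop = tsk->mempolicy ||
      		    !nodes_intersects(*newmems, tsk->mems_allowed);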
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <paul@paulmenage.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b246272e
  3. 13 Dec 2011 (3 commits)
  4. 03 Nov 2011 (1 commit)
    • cpusets: avoid looping when storing to mems_allowed if one node remains set · 89e8a244
      Authored by David Rientjes
      {get,put}_mems_allowed() exist so that general kernel code may locklessly
      access a task's set of allowable nodes without having the chance that a
      concurrent write will cause the nodemask to be empty on configurations
      where MAX_NUMNODES > BITS_PER_LONG.
      
      This could incur a significant delay, however, especially in low memory
      conditions because the page allocator is blocking and reclaim requires
      get_mems_allowed() itself.  It is not atypical to see writes to
      cpuset.mems take over 2 seconds to complete, for example.  In low memory
      conditions, this is problematic because it's one of the most important
      times to change cpuset.mems in the first place!
      
      A task's set of allowable nodes may change only through cpusets, by
      writing to cpuset.mems or by attaching the task to a different cpuset.
      The update is done while generic code is not reading the nodemask with
      get_mems_allowed(): the new nodes are stored and then all the old nodes
      are cleared.  This prevents the possibility that a reader will see an
      empty nodemask at the same time the writer is storing a new nodemask.
      
      If at least one node remains unchanged, though, it's possible to simply
      set all new nodes and then clear all the old nodes.  Changing a task's
      nodemask is protected by cgroup_mutex so it's guaranteed that two threads
      are not changing the same task's nodemask at the same time, so the
      nodemask is guaranteed to be stored before another thread changes it and
      determines whether a node remains set or not.
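      
      A hedged sketch of the resulting fast path (set the new nodes, then
      store the new mask; as long as the two masks intersect, a reader never
      observes an empty nodemask):
      
      	/* grow: old | new, so the surviving bits stay set throughout */
      	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
      	/* shrink: store the final mask, clearing the old-only bits */
      	tsk->mems_allowed = *newmems;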
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Paul Menage <paul@paulmenage.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89e8a244
  5. 31 Oct 2011 (1 commit)
  6. 27 Jul 2011 (2 commits)
  7. 28 May 2011 (1 commit)
  8. 27 May 2011 (2 commits)
    • cgroup: remove the ns_cgroup · a77aea92
      Authored by Daniel Lezcano
      The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and
      leads to some problems:
      
        * cgroup creation is out-of-control
        * cgroup name can conflict when pids are looping
        * it is not possible to have a single process handling a lot of
          namespaces without falling into exponential creation time
        * we may want to create a namespace without creating a cgroup
      
        The ns_cgroup was replaced by a compatibility flag 'clone_children',
        where a newly created cgroup will copy the parent cgroup values.
        The userspace has to manually create a cgroup and add a task to
        the 'tasks' file.
      
      This patch removes the ns_cgroup as suggested in the following thread:
      
      https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html
      
      The 'cgroup_clone' function is removed because it is no longer used.
      
      This is a userspace-visible change.  Commit 45531757 ("cgroup: notify
      ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a
      printk warning users that the feature is planned for removal.  Since that
      time we have heard from XXX users who were affected by this.
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Jamal Hadi Salim <hadi@cyberus.ca>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Matt Helsley <matthltc@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a77aea92
    • cgroups: add per-thread subsystem callbacks · f780bdb7
      Authored by Ben Blum
      Add cgroup subsystem callbacks for per-thread attachment in atomic contexts
      
      Add can_attach_task(), pre_attach(), and attach_task() as new callbacks
      for cgroups's subsystem interface.  Unlike can_attach and attach, these
      are for per-thread operations, to be called potentially many times when
      attaching an entire threadgroup.
      
      Also, the old "bool threadgroup" interface is removed, replaced by
      this.  All subsystems are modified for the new interface; of note is
      cpuset, which requires from/to nodemasks for attach to be globally scoped
      (though per-cpuset would work too) to persist from its pre_attach to
      attach_task and attach.
      
      This is a pre-patch for cgroup-procs-writable.patch.
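      
      A hedged sketch of how a controller might wire up the new hooks (the
      three new callback names are from this patch; everything else is
      illustrative):
      
      	struct cgroup_subsys foo_subsys = {
      		.name		 = "foo",
      		.can_attach	 = foo_can_attach,	/* whole group, may sleep        */
      		.can_attach_task = foo_can_attach_task,	/* per thread, atomic context    */
      		.pre_attach	 = foo_pre_attach,	/* once, before the thread loop  */
      		.attach_task	 = foo_attach_task,	/* per thread, atomic context    */
      		.attach		 = foo_attach,		/* once, after all threads moved */
      	};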
      Signed-off-by: Ben Blum <bblum@andrew.cmu.edu>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Reviewed-by: Paul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f780bdb7
  9. 11 Apr 2011 (1 commit)
  10. 24 Mar 2011 (4 commits)
  11. 05 Mar 2011 (1 commit)
  12. 29 Oct 2010 (1 commit)
  13. 21 Oct 2010 (1 commit)
  14. 22 Jun 2010 (1 commit)
  15. 17 Jun 2010 (1 commit)
  16. 09 Jun 2010 (1 commit)
    • sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Authored by Tejun Heo
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
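      
      The cpu notifier priorities (values as I recall them from
      include/linux/cpu.h of this era; treat as illustrative) encode that
      ordering:
      
      	enum {
      		CPU_PRI_SCHED_ACTIVE	= INT_MAX,	/* up: mark active before anyone else */
      		CPU_PRI_CPUSET_ACTIVE	= INT_MAX - 1,	/* up: then update cpuset config      */
      		CPU_PRI_SCHED_INACTIVE	= INT_MIN + 1,	/* down: mark inactive after everyone */
      		CPU_PRI_CPUSET_INACTIVE	= INT_MIN,	/* down: then update cpuset config    */
      	};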
      
      cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus() which just updates the configuration and
      now called from cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains() making separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
      3a101d05
  17. 28 May 2010 (1 commit)
    • cpusets: new round-robin rotor for SLAB allocations · 6adef3eb
      Authored by Jack Steiner
      We have observed several workloads running on multi-node systems where
      memory is assigned unevenly across the nodes in the system.  There are
      numerous reasons for this but one is the round-robin rotor in
      cpuset_mem_spread_node().
      
      For example, a simple test that writes a multi-page file will allocate
      pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
      allocates on odd nodes & skips even nodes).
      
      An example is shown below.  The program "lfile" writes a file consisting
      of 10 pages.  The program then mmaps the file & uses get_mempolicy(...,
      MPOL_F_NODE) to determine the nodes where the file pages were allocated.
      The output is shown below:
      
      	# ./lfile
      	 allocated on nodes: 2 4 6 0 1 2 6 0 2
      
      There is a single rotor that is used for allocating both file pages & slab
      pages.  Writing the file allocates both a data page & a slab page
      (buffer_head).  This advances the RR rotor 2 nodes for each page
      allocated.
      
      A quick test seems to confirm this is the cause of the uneven
      allocation:
      
      	# echo 0 >/dev/cpuset/memory_spread_slab
      	# ./lfile
      	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5
      
      This patch introduces a second rotor that is used for slab allocations.
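      
      A hedged sketch of the split (simplified from the patch; field names
      approximate): each rotor now advances independently through the task's
      mems_allowed.
      
      	static int cpuset_spread_node(int *rotor)
      	{
      		int node = next_node(*rotor, current->mems_allowed);
      
      		if (node == MAX_NUMNODES)
      			node = first_node(current->mems_allowed);
      		*rotor = node;
      		return node;
      	}
      
      	int cpuset_mem_spread_node(void)	/* page cache, as before */
      	{
      		return cpuset_spread_node(&current->cpuset_mem_spread_rotor);
      	}
      
      	int cpuset_slab_spread_node(void)	/* slab objects, the new rotor */
      	{
      		return cpuset_spread_node(&current->cpuset_slab_spread_rotor);
      	}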
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6adef3eb
  18. 25 May 2010 (2 commits)
    • cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Authored by Miao Xie
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But along the way, the allocator may find
      that there is no node to alloc memory from.
      
      The reason is that cpuset rebinds the task's mempolicy; it clears the
      nodes which the allocator can alloc pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes this problem by expanding the nodes range first (set
      newly allowed bits) and shrinking it lazily (clear newly disallowed
      bits).  So we use a variable to tell the write-side task that a
      read-side task is reading the nodemask, and the write-side task clears
      newly disallowed nodes after the read-side task ends the current memory
      allocation.
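      
      A hedged sketch of the protocol (helper and field names as I remember
      them from this patch; simplified):
      
      	/* read side, around an allocation that samples mems_allowed/mempolicy */
      	static inline void get_mems_allowed(void)
      	{
      		current->mems_allowed_change_disable++;
      		smp_mb();	/* paired with the writer's re-check */
      	}
      
      	static inline void put_mems_allowed(void)
      	{
      		smp_mb();
      		current->mems_allowed_change_disable--;
      	}
      
      	/* write side: grow the mask, wait for readers to finish, then shrink */
      	nodes_or(tsk->mems_allowed, tsk->mems_allowed, *newmems);
      	while (tsk->mems_allowed_change_disable)
      		yield();	/* sketch; the real code also juggles task_lock() */
      	tsk->mems_allowed = *newmems;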
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0ff7453
    • mempolicy: restructure rebinding-mempolicy functions · 708c1bbc
      Authored by Miao Xie
      Nick Piggin reported that the allocator may see an empty nodemask when
      changing cpuset's mems[1].  It happens only on kernels that do not do
      atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).
      
      But I found that there is also a problem on kernels that can do atomic
      nodemask_t stores.  The problem is that the allocator can't find a node
      to alloc a page from when changing cpuset's mems, though there is a lot
      of free memory.  The reason is like this:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      I can use the attached program to reproduce it by the following steps:
      
      # mkdir /dev/cpuset
      # mount -t cpuset cpuset /dev/cpuset
      # mkdir /dev/cpuset/1
      # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
      # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
      # echo $$ > /dev/cpuset/1/tasks
      # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
         <nr_tasks> = max(nr_cpus - 1, 1)
      # killall -s SIGUSR1 cpuset_mem_hog
      # ./change_mems.sh
      
      several hours later, oom will happen though there is a lot of free memory.
      
      This patchset fixes this problem by expanding the nodes range first
      (set newly allowed bits) and shrinking it lazily (clear newly
      disallowed bits).  So we use a variable to tell the write-side task
      that a read-side task is reading the nodemask, and the write-side task
      clears newly disallowed nodes after the read-side task ends the current
      memory allocation.
      
      This patch:
      
      In order to fix the no-node-to-alloc-memory problem, when we want to
      update mempolicy and mems_allowed, we expand the set of nodes first
      (set all the newly allowed nodes) and shrink the set of nodes lazily
      (clear newly disallowed nodes).  But the mempolicy's rebind functions
      may break the expanding.
      
      So we restructure the mempolicy's rebind functions and split the rebind
      work into two steps, just like the update of cpuset's mems: the 1st
      step expands the set of the mempolicy's nodes, and the 2nd step shrinks
      it.  The two-step rebind is used when there is no real lock to protect
      the mempolicy on the read side; otherwise we can do the rebind work at
      once.
      
      In order to implement it, we define
      
      	enum mpol_rebind_step {
      		MPOL_REBIND_ONCE,
      		MPOL_REBIND_STEP1,
      		MPOL_REBIND_STEP2,
      		MPOL_REBIND_NSTEP,
      	};
      
      If the mempolicy needn't be updated by two steps, we can pass
      MPOL_REBIND_ONCE to the rebind functions.  Or we can pass
      MPOL_REBIND_STEP1 to do the first step of the rebind work and pass
      MPOL_REBIND_STEP2 to do the second step work.
      
      Besides that, it may be a long time between these two steps, and we
      have to release the lock that protects mempolicy and mems_allowed.  If
      we take the lock once again, we must check whether the current
      mempolicy is in the middle of a rebind (the first step has been done)
      or not, because the task may have allocated a new mempolicy while we
      didn't hold the lock.  So we define the following flag to identify it:
      
      #define MPOL_F_REBINDING (1 << 2)
      
      The new functions will be used in the next patch.
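      
      A hedged sketch of how a rebind helper consumes the steps, where
      pol->v.nodes is the policy's nodemask and tmp the freshly remapped
      target mask (illustrative, not the literal code):
      
      	if (step == MPOL_REBIND_STEP1)
      		nodes_or(pol->v.nodes, pol->v.nodes, tmp);	/* 1st step: expand only    */
      	else if (step == MPOL_REBIND_ONCE || step == MPOL_REBIND_STEP2)
      		pol->v.nodes = tmp;				/* shrink to the final mask */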
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      708c1bbc
  19. 03 Apr 2010 (2 commits)
    • sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Authored by Oleg Nesterov
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
      
      I am not proud of this patch. Everything which needs such a fat comment
      can't be good even if correct. But I'd prefer to not change the locking
      rules in the code I hardly understand, and in any case I believe this
      simple change makes the code much more correct compared to the
      deadlocks we currently have.
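      
      A hedged sketch of the resulting fallback order in select_fallback_rq()
      (simplified; the exact code differs):
      
      	static int select_fallback_rq(int cpu, struct task_struct *p)
      	{
      		int dest_cpu;
      
      		/* Look for an allowed, active cpu in the same node. */
      		for_each_cpu_and(dest_cpu, cpumask_of_node(cpu_to_node(cpu)),
      				 cpu_active_mask)
      			if (cpumask_test_cpu(dest_cpu, &p->cpus_allowed))
      				return dest_cpu;
      
      		/* Any allowed, active cpu? */
      		dest_cpu = cpumask_any_and(&p->cpus_allowed, cpu_active_mask);
      		if (dest_cpu < nr_cpu_ids)
      			return dest_cpu;
      
      		/* No more Mr. Nice Guy: let cpusets widen ->cpus_allowed. */
      		return cpuset_cpus_allowed_fallback(p);
      	}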
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9084bb82
    • sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Authored by Oleg Nesterov
      This patch just states the fact the cpusets/cpuhotplug interaction is
      broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU
        takes callback_mutex. If cpu_down(CPU) happens before T drops
        callback_mutex, stop_machine() preempts T, then
        migration_call(CPU_DEAD) tries to take cpuset_lock() and hangs
        forever because the CPU is already dead and thus T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      897f0b3c
  20. 25 Mar 2010 (2 commits)
  21. 07 Dec 2009 (2 commits)
    • sched: Fix balance vs hotplug race · 6ad4c188
      Authored by Peter Zijlstra
      Since (e761b772: cpu hotplug, sched: Introduce cpu_active_map and redo
      sched domain managment) we have cpu_active_mask which is supposed to rule
      scheduler migration and load-balancing, except it never (fully) did.
      
      The particular problem being solved here is a crash in try_to_wake_up()
      where select_task_rq() ends up selecting an offline cpu because
      select_task_rq_fair() trusts the sched_domain tree to reflect the
      current state of affairs, similarly select_task_rq_rt() trusts the
      root_domain.
      
      However, the sched_domains are updated from CPU_DEAD, which is after the
      cpu is taken offline and after stop_machine is done. Therefore it can
      race perfectly well with code assuming the domains are right.
      
      Cure this by building the domains from cpu_active_mask on
      CPU_DOWN_PREPARE.
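      
      A hedged sketch of the idea (illustrative, not the literal diff): the
      domains are regenerated from cpu_active_mask, and the rebuild happens
      at CPU_DOWN_PREPARE, while the cpu is still online but already
      inactive:
      
      	/* cpu hotplug notifier (sketch) */
      	switch (action & ~CPU_TASKS_FROZEN) {
      	case CPU_ONLINE:
      	case CPU_DOWN_FAILED:
      	case CPU_DOWN_PREPARE:	/* cpu_active already cleared here */
      		/* domains are regenerated from cpu_active_mask */
      		partition_sched_domains(1, NULL, NULL);
      		return NOTIFY_OK;
      	default:
      		return NOTIFY_DONE;
      	}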
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ad4c188
    • cpumask: Fix generate_sched_domains() for UP · e1b8090b
      Authored by Geert Uytterhoeven
      Commit acc3f5d7 ("cpumask:
      Partition_sched_domains takes array of cpumask_var_t") changed
      the function signature of generate_sched_domains() for the
      CONFIG_SMP=y case, but forgot to update the corresponding
      function for the CONFIG_SMP=n case, causing:
      
        kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type
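      
      A hedged sketch of the kind of fix (bring the CONFIG_SMP=n stub in line
      with the new signature; body illustrative):
      
      	#else /* !CONFIG_SMP */
      	static int generate_sched_domains(cpumask_var_t **domains,
      					  struct sched_domain_attr **attributes)
      	{
      		*domains = NULL;	/* UP: nothing to partition */
      		return 1;
      	}
      	#endif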
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e1b8090b
  22. 04 Nov 2009 (1 commit)
    • cpumask: Partition_sched_domains takes array of cpumask_var_t · acc3f5d7
      Authored by Rusty Russell
      Currently partition_sched_domains() takes a 'struct cpumask
      *doms_new' which is a kmalloc'ed array of cpumask_t.  You can't
      have such an array if 'struct cpumask' is undefined, as we plan
      for CONFIG_CPUMASK_OFFSTACK=y.
      
      So, we make this an array of cpumask_var_t instead: this is the
      same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
      multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
      Hence we add alloc_sched_domains() and free_sched_domains()
      functions.
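      
      A hedged sketch of the new helpers (close to what I would expect the
      implementation to look like):
      
      	void free_sched_domains(cpumask_var_t doms[], unsigned int ndoms)
      	{
      		unsigned int i;
      
      		for (i = 0; i < ndoms; i++)
      			free_cpumask_var(doms[i]);
      		kfree(doms);
      	}
      
      	cpumask_var_t *alloc_sched_domains(unsigned int ndoms)
      	{
      		unsigned int i;
      		cpumask_var_t *doms;
      
      		doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL);
      		if (!doms)
      			return NULL;
      		for (i = 0; i < ndoms; i++) {
      			if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) {
      				free_sched_domains(doms, i);	/* free what we got so far */
      				return NULL;
      			}
      		}
      		return doms;
      	}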
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      acc3f5d7
  23. 24 Sep 2009 (1 commit)
  24. 21 Sep 2009 (1 commit)
  25. 17 Jun 2009 (3 commits)
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Authored by Miao Xie
      Fix allocating page cache/slab objects on an unallowed node when memory
      spread is set, by updating tasks' mems_allowed after their cpuset's
      mems is changed.
      
      In order to update tasks' mems_allowed in time, we must modify the code of
      memory policy.  Because the memory policy is applied in the process's
      context originally.  After applying this patch, one task directly
      manipulates another's mems_allowed, and we use alloc_lock in the
      task_struct to protect mems_allowed and memory policy of the task.
      
      But in the fast path, we don't use a lock to protect them, because
      adding a lock may lead to a performance regression.  But if we don't
      add a lock, the task might see no nodes when changing cpuset's
      mems_allowed to some non-overlapping set.  In order to avoid it, we set
      all new allowed nodes, then clear newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58568d2a
    • cpusets: update tasks' page/slab spread flags in time · 950592f7
      Authored by Miao Xie
      Fix the bug that the kernel didn't spread page cache/slab objects
      evenly over all the allowed nodes when spread flags were set, by
      updating tasks' page/slab spread flags in time.
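      
      A hedged sketch of the update path (PF flag and helper names as I
      recall them; illustrative):
      
      	static void cpuset_update_task_spread_flag(struct cpuset *cs,
      						   struct task_struct *tsk)
      	{
      		if (is_spread_page(cs))
      			tsk->flags |= PF_SPREAD_PAGE;	/* spread page cache   */
      		else
      			tsk->flags &= ~PF_SPREAD_PAGE;
      		if (is_spread_slab(cs))
      			tsk->flags |= PF_SPREAD_SLAB;	/* spread slab objects */
      		else
      			tsk->flags &= ~PF_SPREAD_SLAB;
      	}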
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      950592f7
    • cpusets: restructure the function cpuset_update_task_memory_state() · f3b39d47
      Authored by Miao Xie
      The kernel still allocates the page caches on the old node after
      modifying its cpuset's mems when 'memory_spread_page' was set, or it
      doesn't spread the page cache evenly over all the nodes that the
      faulting task is allowed to use after memory_spread_page was set.  This
      is caused by the stale mems_allowed and flags of the task; the current
      kernel doesn't update them unless some function invokes
      cpuset_update_task_memory_state(), which is sometimes too late.  We
      must update the mems_allowed and the flags of the tasks in time.
      
      Slab has the same problem.
      
      The following patches fix this bug by updating tasks' mems_allowed and
      spread flags after their cpuset's mems or spread flag is changed.
      
      This patch:
      
      Extract a function from cpuset_update_task_memory_state().  It will be
      used later to update tasks' page/slab spread flags after its cpuset's
      flag is set.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3b39d47
  26. 12 Jun 2009 (1 commit)
  27. 03 Apr 2009 (1 commit)