1. 25 May, 2010 (1 commit)
    • mempolicy: restructure rebinding-mempolicy functions · 708c1bbc
      Authored by Miao Xie
      Nick Piggin reported that the allocator may see an empty nodemask when
      changing cpuset's mems[1].  It happens only on kernels that do not do
      atomic nodemask_t stores (MAX_NUMNODES > BITS_PER_LONG).
      
      But I found that there is also a problem on kernels that can do atomic
      nodemask_t stores.  The problem is that the allocator can't find a node
      to allocate a page from while the cpuset's mems is being changed, even
      though there is a lot of free memory.  The reason is as follows:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      I can reproduce it with the attached program by the following steps:
      
      # mkdir /dev/cpuset
      # mount -t cpuset cpuset /dev/cpuset
      # mkdir /dev/cpuset/1
      # echo `cat /dev/cpuset/cpus` > /dev/cpuset/1/cpus
      # echo `cat /dev/cpuset/mems` > /dev/cpuset/1/mems
      # echo $$ > /dev/cpuset/1/tasks
      # numactl --membind=`cat /dev/cpuset/mems` ./cpuset_mem_hog <nr_tasks> &
         <nr_tasks> = max(nr_cpus - 1, 1)
      # killall -s SIGUSR1 cpuset_mem_hog
      # ./change_mems.sh
      
      Several hours later, an OOM happens even though there is a lot of free memory.
      
      This patchset fixes the problem by expanding the node range first (setting
      the newly allowed bits) and shrinking it lazily (clearing the newly
      disallowed bits).  A variable tells the write-side task that a read-side
      task is reading the nodemask, and the write-side task clears the newly
      disallowed nodes only after the read-side task finishes its current memory
      allocation.
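
      A minimal sketch of that expand-then-shrink idea on a bare nodemask_t
      (the helper name and the waiting step are assumptions for illustration,
      not code from this patchset):

      	static void rebind_nodemask_two_step(nodemask_t *mask,
      					     const nodemask_t *newmems)
      	{
      		/* step 1: expand, so concurrent readers never see an empty mask */
      		nodes_or(*mask, *mask, *newmems);

      		/* ... wait until readers finish their current allocation ... */

      		/* step 2: shrink, dropping nodes that are no longer allowed */
      		nodes_and(*mask, *mask, *newmems);
      	}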
      
      This patch:
      
      In order to avoid having no node to allocate memory from, when we want to
      update mempolicy and mems_allowed we expand the set of nodes first (set all
      the newly allowed nodes) and shrink the set of nodes lazily (clear the
      disallowed nodes).  But the mempolicy's rebind functions may break the
      expanding.
      
      So we restructure the mempolicy's rebind functions and split the rebind
      work into two steps, just like the update of cpuset's mems: the 1st step
      expands the set of the mempolicy's nodes, the 2nd step shrinks it.  The
      two-step form is used when there is no real lock to protect the mempolicy
      on the read side.  Otherwise we can do the rebind work at once.
      
      In order to implement it, we define
      
      	enum mpol_rebind_step {
      		MPOL_REBIND_ONCE,
      		MPOL_REBIND_STEP1,
      		MPOL_REBIND_STEP2,
      		MPOL_REBIND_NSTEP,
      	};
      
      If the mempolicy needn't be updated in two steps, we can pass
      MPOL_REBIND_ONCE to the rebind functions.  Otherwise we pass
      MPOL_REBIND_STEP1 to do the first step of the rebind work and
      MPOL_REBIND_STEP2 to do the second step.
      
      Besides that, a long time may pass between these two steps, and we have to
      release the lock that protects mempolicy and mems_allowed.  When we take
      the lock again, we must check whether the current mempolicy is in the
      middle of rebinding (the first step has been done) or not, because the
      task may have allocated a new mempolicy while we did not hold the lock.
      So we define the following flag to identify it:
      
      #define MPOL_F_REBINDING (1 << 2)
      
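      A hypothetical caller sketch of the two-step interface (the locking shown
      and the mempolicy field access are assumptions; only the step values and
      MPOL_F_REBINDING come from this patch, and mpol_rebind_task() is assumed
      to take the step as its last argument):

      	task_lock(tsk);
      	mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP1);	/* expand */
      	task_unlock(tsk);

      	/* readers may allocate here, using the expanded nodemask */

      	task_lock(tsk);
      	if (tsk->mempolicy && (tsk->mempolicy->flags & MPOL_F_REBINDING))
      		mpol_rebind_task(tsk, newmems, MPOL_REBIND_STEP2);	/* shrink */
      	task_unlock(tsk);
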
      The new functions will be used in the next patch.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      708c1bbc
  2. 03 Apr, 2010 (2 commits)
    • sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Authored by Oleg Nesterov
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
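
      A rough sketch of the intended use (assumed shape, not the actual
      scheduler code):

      	static int select_fallback_cpu_sketch(struct task_struct *p)
      	{
      		int cpu;

      		/* any allowed CPU that is still online? */
      		cpu = cpumask_any_and(&p->cpus_allowed, cpu_online_mask);
      		if (cpu < nr_cpu_ids)
      			return cpu;

      		/* none left: ask the cpuset layer for a usable CPU */
      		return cpuset_cpus_allowed_fallback(p);
      	}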
      
      I am not proud of this patch.  Everything that needs such a fat comment
      can't be good even if correct.  But I'd prefer not to change the locking
      rules in code I hardly understand, and in any case I believe this simple
      change makes the code much more correct compared to the deadlocks we
      currently have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9084bb82
    • sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Authored by Oleg Nesterov
      This patch just states the fact that the cpusets/cpu-hotplug interaction is
      broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in the
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to a CPU
        takes callback_mutex. If cpu_down(CPU) happens before T drops
        callback_mutex, stop_machine() preempts T; then migration_call(CPU_DEAD)
        tries to take cpuset_lock() and hangs forever because the CPU is already
        dead and thus T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
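
      The last-resort path then reduces to something like this (a sketch of
      what select_fallback_rq() does without cpuset help; variable names are
      illustrative):

      	/* No allowed online CPU left: fall back to any possible CPU. */
      	cpumask_copy(&p->cpus_allowed, cpu_possible_mask);
      	dest_cpu = cpumask_any(cpu_possible_mask);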
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      897f0b3c
  3. 25 Mar, 2010 (2 commits)
  4. 07 Dec, 2009 (2 commits)
    • sched: Fix balance vs hotplug race · 6ad4c188
      Authored by Peter Zijlstra
      Since (e761b772: cpu hotplug, sched: Introduce cpu_active_map and redo
      sched domain managment) we have cpu_active_mask, which is supposed to rule
      scheduler migration and load-balancing, except it never (fully) did.
      
      The particular problem being solved here is a crash in try_to_wake_up()
      where select_task_rq() ends up selecting an offline cpu because
      select_task_rq_fair() trusts the sched_domain tree to reflect the
      current state of affairs; similarly, select_task_rq_rt() trusts the
      root_domain.
      
      However, the sched_domains are updated from CPU_DEAD, which is after the
      cpu is taken offline and after stop_machine is done. Therefore it can
      race perfectly well with code assuming the domains are right.
      
      Cure this by building the domains from cpu_active_mask on
      CPU_DOWN_PREPARE.
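
      A minimal sketch of the idea (the patch actually touches the scheduler's
      and cpuset's hotplug handling; the notifier fragment below is only
      illustrative):

      	switch (action & ~CPU_TASKS_FROZEN) {
      	case CPU_DOWN_PREPARE:
      		/* the CPU leaves cpu_active_mask before stop_machine runs */
      		set_cpu_active(cpu, false);
      		/* rebuild the domains from the remaining active CPUs */
      		partition_sched_domains(1, NULL, NULL);
      		return NOTIFY_OK;
      	}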
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ad4c188
    • cpumask: Fix generate_sched_domains() for UP · e1b8090b
      Authored by Geert Uytterhoeven
      Commit acc3f5d7 ("cpumask:
      Partition_sched_domains takes array of cpumask_var_t") changed
      the function signature of generate_sched_domains() for the
      CONFIG_SMP=y case, but forgot to update the corresponding
      function for the CONFIG_SMP=n case, causing:
      
        kernel/cpuset.c:2073: warning: passing argument 1 of 'generate_sched_domains' from incompatible pointer type
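
      The presumed shape of the fix is to bring the CONFIG_SMP=n stub in line
      with the SMP prototype (a sketch; the exact stub body is assumed):

      	static inline int generate_sched_domains(cpumask_var_t **domains,
      					struct sched_domain_attr **attributes)
      	{
      		*domains = NULL;
      		return 1;
      	}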
      Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <alpine.DEB.2.00.0912062038070.5693@ayla.of.borg>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e1b8090b
  5. 04 Nov, 2009 (1 commit)
    • cpumask: Partition_sched_domains takes array of cpumask_var_t · acc3f5d7
      Authored by Rusty Russell
      Currently partition_sched_domains() takes a 'struct cpumask
      *doms_new' which is a kmalloc'ed array of cpumask_t.  You can't
      have such an array if 'struct cpumask' is undefined, as we plan
      for CONFIG_CPUMASK_OFFSTACK=y.
      
      So, we make this an array of cpumask_var_t instead: this is the
      same for the CONFIG_CPUMASK_OFFSTACK=n case, but requires
      multiple allocations for the CONFIG_CPUMASK_OFFSTACK=y case.
      Hence we add alloc_sched_domains() and free_sched_domains()
      functions.
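
      A sketch of what such helpers look like, assumed from the description
      above (error handling kept minimal):

      	cpumask_var_t *alloc_sched_domains(unsigned int ndoms)
      	{
      		int i;
      		cpumask_var_t *doms;

      		doms = kmalloc(sizeof(*doms) * ndoms, GFP_KERNEL);
      		if (!doms)
      			return NULL;
      		for (i = 0; i < ndoms; i++) {
      			if (!alloc_cpumask_var(&doms[i], GFP_KERNEL)) {
      				free_sched_domains(doms, i);
      				return NULL;
      			}
      		}
      		return doms;
      	}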
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <200911031453.40668.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      acc3f5d7
  6. 24 Sep, 2009 (1 commit)
  7. 21 Sep, 2009 (1 commit)
  8. 17 Jun, 2009 (3 commits)
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Authored by Miao Xie
      Fix allocating page cache/slab objects on a disallowed node when memory
      spread is set, by updating tasks' mems_allowed right after their cpuset's
      mems is changed.
      
      In order to update tasks' mems_allowed in time, we must modify the memory
      policy code, because the memory policy was originally applied in the
      process's own context.  After applying this patch, one task directly
      manipulates another's mems_allowed, and we use alloc_lock in the
      task_struct to protect the task's mems_allowed and memory policy.
      
      But in the fast path we do not use a lock to protect them, because adding
      a lock may lead to a performance regression.  Without a lock, however, the
      task might see no nodes when its cpuset's mems_allowed is changed to some
      non-overlapping set.  In order to avoid that, we set all the newly allowed
      nodes first, then clear the newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
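
      The check described in the first bracketed note amounts to roughly the
      following inside mpol_set_nodemask() (a sketch; surrounding code and the
      exact return value are assumptions):

      	VM_BUG_ON(!nodes);
      	if (pol->mode == MPOL_PREFERRED && nodes_empty(*nodes))
      		return 0;	/* explicit local allocation: nothing to adjust */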
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58568d2a
    • cpusets: update tasks' page/slab spread flags in time · 950592f7
      Authored by Miao Xie
      Fix the bug that the kernel didn't spread page cache/slab objects evenly
      over all the allowed nodes when the spread flags were set, by updating
      tasks' page/slab spread flags in time.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      950592f7
    • cpusets: restructure the function cpuset_update_task_memory_state() · f3b39d47
      Authored by Miao Xie
      The kernel still allocates page cache on the old nodes after a cpuset's
      mems is modified when 'memory_spread_page' is set, and it doesn't spread
      the page cache evenly over all the nodes that the faulting task is allowed
      to use after memory_spread_page is set.  This is caused by the task's
      stale mems_allowed and flags: the current kernel doesn't update them
      unless some function invokes cpuset_update_task_memory_state(), which is
      sometimes too late.  We must update the tasks' mems_allowed and flags in
      time.
      
      Slab has the same problem.
      
      The following patches fix this bug by updating tasks' mems_allowed and
      spread flags after their cpuset's mems or spread flags are changed.
      
      This patch:
      
      Extract a function from cpuset_update_task_memory_state().  It will be
      used later to update tasks' page/slab spread flags after their cpuset's
      flag is set.
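
      A sketch of what the extracted helper might look like (name and body
      assumed from the description; the actual patch may differ in detail):

      	static void cpuset_update_task_spread_flag(struct cpuset *cs,
      						   struct task_struct *tsk)
      	{
      		if (is_spread_page(cs))
      			tsk->flags |= PF_SPREAD_PAGE;
      		else
      			tsk->flags &= ~PF_SPREAD_PAGE;

      		if (is_spread_slab(cs))
      			tsk->flags |= PF_SPREAD_SLAB;
      		else
      			tsk->flags &= ~PF_SPREAD_SLAB;
      	}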
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3b39d47
  9. 12 Jun, 2009 (1 commit)
  10. 03 Apr, 2009 (8 commits)
  11. 19 Jan, 2009 (1 commit)
    • cpuset: fix possible deadlock in async_rebuild_sched_domains · f90d4118
      Authored by Miao Xie
      Lockdep reported a possible circular locking dependency when we tested
      cpuset on a NUMA/fake-NUMA box.
      
      =======================================================
      [ INFO: possible circular locking dependency detected ]
      2.6.29-rc1-00224-ga6525042 #111
      -------------------------------------------------------
      bash/2968 is trying to acquire lock:
       (events){--..}, at: [<ffffffff8024c8cd>] flush_work+0x24/0xd8
      
      but task is already holding lock:
       (cgroup_mutex){--..}, at: [<ffffffff8026ad1e>] cgroup_lock_live_group+0x12/0x29
      
      which lock already depends on the new lock.
      ......
      -------------------------------------------------------
      
      Steps to reproduce:
      # mkdir /dev/cpuset
      # mount -t cpuset xxx /dev/cpuset
      # mkdir /dev/cpuset/0
      # echo 0 > /dev/cpuset/0/cpus
      # echo 0 > /dev/cpuset/0/mems
      # echo 1 > /dev/cpuset/0/memory_migrate
      # cat /dev/zero > /dev/null &
      # echo $! > /dev/cpuset/0/tasks
      
      This is because async_rebuild_sched_domains has the following lock sequence:
      run_workqueue(async_rebuild_sched_domains)
      	-> do_rebuild_sched_domains -> cgroup_lock
      
      But attaching tasks when memory_migrate is set has the following:
      cgroup_lock_live_group(cgroup_tasks_write)
      	-> do_migrate_pages -> flush_work
      
      This patch fixes it by using a separate workqueue thread.
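
      The shape of the fix, sketched (assuming the dedicated workqueue is
      created during cpuset initialization; identifiers other than
      do_rebuild_sched_domains are illustrative):

      	static struct workqueue_struct *cpuset_wq;
      	static DECLARE_WORK(rebuild_sched_domains_work, do_rebuild_sched_domains);

      	static void async_rebuild_sched_domains(void)
      	{
      		/* no longer queued on the shared "events" workqueue */
      		queue_work(cpuset_wq, &rebuild_sched_domains_work);
      	}

      	/* at init time: */
      	cpuset_wq = create_singlethread_workqueue("cpuset");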
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f90d4118
  12. 16 Jan, 2009 (1 commit)
  13. 09 Jan, 2009 (8 commits)
  14. 07 Jan, 2009 (1 commit)
  15. 13 Dec, 2008 (1 commit)
    • cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers · 29c0177e
      Authored by Rusty Russell
      
      Impact: change calling convention of existing cpumask APIs
      
      Most cpumask functions started with cpus_: these have been replaced by
      cpumask_ ones which take struct cpumask pointers as expected.
      
      These four functions don't have good replacement names; fortunately
      they're rarely used, so we just change them over.
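
      After the change the four helpers take struct cpumask pointers, so a
      call site looks like this (a sketch; buffer size chosen arbitrarily):

      	char buf[128];

      	/* pass a pointer such as cpu_online_mask instead of a mask by value */
      	cpumask_scnprintf(buf, sizeof(buf), cpu_online_mask);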
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: paulus@samba.org
      Cc: mingo@redhat.com
      Cc: tony.luck@intel.com
      Cc: ralf@linux-mips.org
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: cl@linux-foundation.org
      Cc: srostedt@redhat.com
      29c0177e
  16. 30 Nov, 2008 (1 commit)
    • sched, cpusets: fix warning in kernel/cpuset.c · 1583715d
      Authored by Ingo Molnar
      this warning:
      
        kernel/cpuset.c: In function ‘generate_sched_domains’:
        kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function
      
      triggers because GCC does not recognize that ndoms stays uninitialized
      only if doms is NULL - but that flow is covered at the end of
      generate_sched_domains().
      
      Help out GCC by initializing this variable to 0. (that's prudent anyway)
      
      Also, this function needs a splitup and code flow simplification:
      at 160 lines it's clearly too long.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1583715d
  17. 20 Nov, 2008 (1 commit)
  18. 18 Nov, 2008 (1 commit)
    • cpuset: fix regression when failed to generate sched domains · 700018e0
      Authored by Li Zefan
      Impact: properly rebuild sched-domains on kmalloc() failure
      
      When cpuset fails to generate sched domains due to a kmalloc()
      failure, the scheduler should fall back to the single partition
      'fallback_doms' and rebuild the sched domains, but currently it only
      destroys them without rebuilding.
      
      The regression was introduced by:
      
      | commit dfb512ec
      | Author: Max Krasnyansky <maxk@qualcomm.com>
      | Date:   Fri Aug 29 13:11:41 2008 -0700
      |
      |    sched: arch_reinit_sched_domains() must destroy domains to force rebuild
      
      After the above commit, partition_sched_domains(0, NULL, NULL) will
      only destroy sched domains and partition_sched_domains(1, NULL, NULL)
      will create the default sched domain.
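
      With the fix, the cpuset side can be thought of as doing roughly this when
      domain generation fails (a sketch based on the description above, not the
      literal diff; 'doms', 'ndoms' and 'attr' are illustrative):

      	if (!doms)
      		/* kmalloc() failed: fall back to the single default domain */
      		partition_sched_domains(1, NULL, NULL);
      	else
      		partition_sched_domains(ndoms, doms, attr);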
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      700018e0
  19. 20 Oct, 2008 (2 commits)
  20. 03 Oct, 2008 (1 commit)