1. 07 Nov 2015 (1 commit)
  2. 06 Nov 2015 (1 commit)
  3. 27 Oct 2014 (1 commit)
    • cpuset: simplify cpuset_node_allowed API · 344736f2
      Committed by Vladimir Davydov
      The current cpuset API for checking whether a zone/node is allowed to
      allocate from looks rather awkward. We have hardwall and softwall versions
      of cpuset_node_allowed, with the softwall version doing literally the same
      as the hardwall version if __GFP_HARDWALL is passed to it in the gfp flags.
      If it isn't, the softwall version may check the given node against the
      enclosing hardwall cpuset, which requires taking the callback lock.
      
      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api"). Before that, there was a single version, with
      the __GFP_HARDWALL flag determining its behavior. The purpose of the
      commit was to avoid sleep-in-atomic bugs when someone would mistakenly
      call the function without the __GFP_HARDWALL flag for an atomic
      allocation. The suffixes introduced were intended to make the callers
      think before using the function.
      
      However, since the callback lock was converted from mutex to spinlock by
      the previous patch, the softwall check function cannot sleep, and these
      precautions are no longer necessary.
      
      So let's simplify the API back to the single check.
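
      For illustration, a minimal sketch of what such a single check can look
      like (simplified; the hardwall-ancestor walk is elided and the helper
      below is purely illustrative, not the actual kernel code):

        static bool cpuset_node_allowed(int node, gfp_t gfp_mask)
        {
                if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
                        return true;
                if (node_isset(node, current->mems_allowed))
                        return true;
                if (gfp_mask & __GFP_HARDWALL)
                        return false;   /* hardwall semantics: stop here */
                /*
                 * Softwall: consult the nearest mem_exclusive ancestor,
                 * which is now safe under the spinlock-based callback lock.
                 */
                return node_allowed_in_hardwall_ancestor(node);  /* illustrative */
        }
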
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  4. 25 Sep 2014 (1 commit)
    • cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags · 2ad654bc
      Committed by Zefan Li
      When we change cpuset.memory_spread_{page,slab}, cpuset will flip
      PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
      This should be done using atomic bitops, but currently it isn't,
      which is broken.
      
      Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
      when one thread tried to clear PF_USED_MATH while at the same time another
      thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
      the same task.
      
      Here's the full report:
      https://lkml.org/lkml/2014/9/19/230
      
      To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
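
      As a sketch of the difference (the atomic-flags field and the PFA_* name
      are meant to be illustrative of the approach, not the exact interface):

        /* Non-atomic read-modify-write on tsk->flags: racy when another
         * thread concurrently updates a different bit in the same word. */
        tsk->flags |= PF_SPREAD_PAGE;

        /* Atomic per-bit updates on a separate flags word: safe. */
        set_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);
        clear_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);
        spread = test_bit(PFA_SPREAD_PAGE, &tsk->atomic_flags);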
      
      v4:
      - updated mm/slab.c. (Fengguang Wu)
      - updated Documentation.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Kees Cook <keescook@chromium.org>
      Fixes: 950592f7 ("cpusets: update tasks' page/slab spread flags in time")
      Cc: <stable@vger.kernel.org> # 2.6.31+
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  5. 19 Sep 2014 (1 commit)
  6. 05 Jun 2014 (1 commit)
  7. 04 Apr 2014 (1 commit)
  8. 06 Nov 2013 (1 commit)
    • cpuset: Fix potential deadlock w/ set_mems_allowed · db751fe3
      Committed by John Stultz
      After adding lockdep support to seqlock/seqcount structures,
      I started seeing the following warning:
      
      [    1.070907] ======================================================
      [    1.072015] [ INFO: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected ]
      [    1.073181] 3.11.0+ #67 Not tainted
      [    1.073801] ------------------------------------------------------
      [    1.074882] kworker/u4:2/708 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
      [    1.076088]  (&p->mems_allowed_seq){+.+...}, at: [<ffffffff81187d7f>] new_slab+0x5f/0x280
      [    1.077572]
      [    1.077572] and this task is already holding:
      [    1.078593]  (&(&q->__queue_lock)->rlock){..-...}, at: [<ffffffff81339f03>] blk_execute_rq_nowait+0x53/0xf0
      [    1.080042] which would create a new lock dependency:
      [    1.080042]  (&(&q->__queue_lock)->rlock){..-...} -> (&p->mems_allowed_seq){+.+...}
      [    1.080042]
      [    1.080042] but this new dependency connects a SOFTIRQ-irq-safe lock:
      [    1.080042]  (&(&q->__queue_lock)->rlock){..-...}
      [    1.080042] ... which became SOFTIRQ-irq-safe at:
      [    1.080042]   [<ffffffff810ec179>] __lock_acquire+0x5b9/0x1db0
      [    1.080042]   [<ffffffff810edfe5>] lock_acquire+0x95/0x130
      [    1.080042]   [<ffffffff818968a1>] _raw_spin_lock+0x41/0x80
      [    1.080042]   [<ffffffff81560c9e>] scsi_device_unbusy+0x7e/0xd0
      [    1.080042]   [<ffffffff8155a612>] scsi_finish_command+0x32/0xf0
      [    1.080042]   [<ffffffff81560e91>] scsi_softirq_done+0xa1/0x130
      [    1.080042]   [<ffffffff8133b0f3>] blk_done_softirq+0x73/0x90
      [    1.080042]   [<ffffffff81095dc0>] __do_softirq+0x110/0x2f0
      [    1.080042]   [<ffffffff81095fcd>] run_ksoftirqd+0x2d/0x60
      [    1.080042]   [<ffffffff810bc506>] smpboot_thread_fn+0x156/0x1e0
      [    1.080042]   [<ffffffff810b3916>] kthread+0xd6/0xe0
      [    1.080042]   [<ffffffff818980ac>] ret_from_fork+0x7c/0xb0
      [    1.080042]
      [    1.080042] to a SOFTIRQ-irq-unsafe lock:
      [    1.080042]  (&p->mems_allowed_seq){+.+...}
      [    1.080042] ... which became SOFTIRQ-irq-unsafe at:
      [    1.080042] ...  [<ffffffff810ec1d3>] __lock_acquire+0x613/0x1db0
      [    1.080042]   [<ffffffff810edfe5>] lock_acquire+0x95/0x130
      [    1.080042]   [<ffffffff810b3df2>] kthreadd+0x82/0x180
      [    1.080042]   [<ffffffff818980ac>] ret_from_fork+0x7c/0xb0
      [    1.080042]
      [    1.080042] other info that might help us debug this:
      [    1.080042]
      [    1.080042]  Possible interrupt unsafe locking scenario:
      [    1.080042]
      [    1.080042]        CPU0                    CPU1
      [    1.080042]        ----                    ----
      [    1.080042]   lock(&p->mems_allowed_seq);
      [    1.080042]                                local_irq_disable();
      [    1.080042]                                lock(&(&q->__queue_lock)->rlock);
      [    1.080042]                                lock(&p->mems_allowed_seq);
      [    1.080042]   <Interrupt>
      [    1.080042]     lock(&(&q->__queue_lock)->rlock);
      [    1.080042]
      [    1.080042]  *** DEADLOCK ***
      
      The issue stems from the kthreadd() function calling set_mems_allowed()
      with irqs enabled. While it's probably unlikely for the actual deadlock
      to trigger, the fix is fairly simple: disable irqs before taking the
      mems_allowed_seq lock.
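
      Roughly, the write side then takes the shape below (a sketch of the fix,
      not the exact diff):

        static inline void set_mems_allowed(nodemask_t nodemask)
        {
                unsigned long flags;

                task_lock(current);
                local_irq_save(flags);          /* keep softirqs/irqs out */
                write_seqcount_begin(&current->mems_allowed_seq);
                current->mems_allowed = nodemask;
                write_seqcount_end(&current->mems_allowed_seq);
                local_irq_restore(flags);
                task_unlock(current);
        }
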
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: netdev@vger.kernel.org
      Link: http://lkml.kernel.org/r/1381186321-4906-4-git-send-email-john.stultz@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  9. 02 May 2013 (1 commit)
  10. 06 Mar 2013 (1 commit)
  11. 13 Dec 2012 (1 commit)
  12. 24 Jul 2012 (1 commit)
  13. 27 Mar 2012 (1 commit)
    • sched: Fix select_fallback_rq() vs cpu_active/cpu_online · 2baab4e9
      Committed by Peter Zijlstra
      Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was
      supposed to finally sort the cpu_active mess, instead uncovered more.
      
      Since CPU_STARTING is run before setting the cpu online, there's a
      (small) window where the cpu is active,!online.
      
      If during this time there's a wakeup of a task that used to reside on
      that cpu select_task_rq() will use select_fallback_rq() to compute an
      alternative cpu to run on since we find !online.
      
      select_fallback_rq(), however, will compute the new cpu against
      cpu_active. This means it can return the same cpu it started out with
      (the !online one), since that cpu is in fact marked active.
      
      This results in us trying to schedule a task on an offline cpu and
      triggering a WARN in the IPI code.
      
      The solution proposed by Chuansheng Liu of setting cpu_active in
      set_cpu_online() is buggy: firstly, not all archs actually use
      set_cpu_online(); secondly, not all archs call set_cpu_online() with
      IRQs disabled. This means we would introduce either the same race or
      the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on
      wrong CPU") -- albeit much narrower.
      
      [ By setting online first and active later we have a window of
        online,!active, fresh and bound kthreads have task_cpu() of 0 and
        since cpu0 isn't in tsk_cpus_allowed() we end up in
        select_fallback_rq() which excludes !active, resulting in a reset
        of ->cpus_allowed and the thread running all over the place. ]
      
      The solution is to re-work select_fallback_rq() to require active
      _and_ online. This makes the active,!online case work as expected,
      OTOH archs running CPU_STARTING after setting online are now
      vulnerable to the issue from fd8a7de1 -- these are alpha and
      blackfin.
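
      In outline, the fallback selection becomes something like (simplified
      sketch of the idea, not the actual implementation):

        /* Only consider cpus that are both online and active. */
        for_each_cpu(dest_cpu, tsk_cpus_allowed(p)) {
                if (!cpu_online(dest_cpu) || !cpu_active(dest_cpu))
                        continue;
                return dest_cpu;
        }
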
      Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: linux-alpha@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 22 Mar 2012 (1 commit)
    • cpuset: mm: reduce large amounts of memory barrier related damage v3 · cc9a6c87
      Committed by Mel Gorman
      Commit c0ff7453 ("cpuset,mm: fix no node to alloc memory when
      changing cpuset's mems") wins a super prize for the largest number of
      memory barriers entered into fast paths for one commit.
      
      [get|put]_mems_allowed is incredibly heavy with pairs of full memory
      barriers inserted into a number of hot paths.  This was detected while
      investigating a large page allocator slowdown introduced some time
      after 2.6.32.  The largest portion of this overhead was shown by
      oprofile to be at an mfence introduced by this commit into the page
      allocator hot path.
      
      For extra style points, the commit introduced the use of yield() in an
      implementation of what looks like a spinning mutex.
      
      This patch replaces the full memory barriers on both read and write
      sides with a sequence counter with just read barriers on the fast path
      side.  This is much cheaper on some architectures, including x86.  The
      main bulk of the patch is the retry logic if the nodemask changes in a
      manner that can cause a false failure.
      
      While updating the nodemask, a check is made to see if a false failure
      is a risk.  If it is, the sequence number gets bumped and parallel
      allocators will briefly stall while the nodemask update takes place.
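
      The fast-path read side is essentially a seqcount retry loop, roughly
      (a sketch of the idea rather than the exact helpers added here):

        unsigned int seq;
        struct page *page;

        do {
                seq = read_seqcount_begin(&current->mems_allowed_seq);
                page = alloc_within(&current->mems_allowed);   /* illustrative */
        } while (!page && read_seqcount_retry(&current->mems_allowed_seq, seq));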
      
      In a page fault test microbenchmark, oprofile samples from
      __alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
      actual results were
      
                                   3.3.0-rc3          3.3.0-rc3
                                   rc3-vanilla        nobarrier-v2r1
          Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
          Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
          Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
          Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
          Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
          Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
          Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
          Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
          Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
          Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
          Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
          Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
          Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
          Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
          Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
          MMTests Statistics: duration
          Sys Time Running Test (seconds)             135.68    132.17
          User+Sys Time Running Test (seconds)         164.2    160.13
          Total Elapsed Time (seconds)                123.46    120.87
      
      The overall improvement is small but the System CPU time is much
      improved and roughly in correlation to what oprofile reported (these
      performance figures are without profiling so skew is expected).  The
      actual number of page faults is noticeably improved.
      
      For benchmarks like kernel builds, the overall benefit is marginal but
      the system CPU time is slightly reduced.
      
      To test the actual bug the commit fixed I opened two terminals.  The
      first ran within a cpuset and continually ran a small program that
      faulted 100M of anonymous data.  In a second window, the nodemask of the
      cpuset was continually randomised in a loop.
      
      Without the commit, the program would fail every so often (usually
      within 10 seconds) and obviously with the commit everything worked fine.
      With this patch applied, it also worked fine so the fix should be
      functionally equivalent.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 28 May 2011 (1 commit)
  16. 09 Jun 2010 (1 commit)
    • sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining · 3a101d05
      Committed by Tejun Heo
      Currently, when a cpu goes down, cpu_active is cleared before
      CPU_DOWN_PREPARE starts and cpuset configuration is updated from a
      default priority cpu notifier.  When a cpu is coming up, it's set
      before CPU_ONLINE but cpuset configuration again is updated from the
      same cpu notifier.
      
      For cpu notifiers, this presents an inconsistent state.  Threads which
      a CPU_DOWN_PREPARE notifier expects to be bound to the CPU can be
      migrated to other cpus because the cpu is no longer active.
      
      Fix it by updating cpu_active in the highest priority cpu notifier and
      cpuset configuration in the second highest when a cpu is coming up.
      Down path is updated similarly.  This guarantees that all other cpu
      notifiers see consistent cpu_active and cpuset configuration.
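
      A rough sketch of the ordering idea using notifier priorities (callback
      names and priority values are illustrative; notifier callbacks run in
      descending .priority order):

        static struct notifier_block cpu_active_nb = {
                .notifier_call  = sched_cpu_active_callback, /* flip cpu_active first */
                .priority       = 20,
        };

        static struct notifier_block cpuset_nb = {
                .notifier_call  = cpuset_cpu_active,         /* then update cpuset config */
                .priority       = 10,
        };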
      
      cpuset_track_online_cpus() notifier is converted to
      cpuset_update_active_cpus() which just updates the configuration and
      now called from cpuset_cpu_[in]active() notifiers registered from
      sched_init_smp().  If cpuset is disabled, cpuset_update_active_cpus()
      degenerates into partition_sched_domains() making separate notifier
      for !CONFIG_CPUSETS unnecessary.
      
      This problem is triggered by cmwq.  During CPU_DOWN_PREPARE, hotplug
      callback creates a kthread and kthread_bind()s it to the target cpu,
      and the thread is expected to run on that cpu.
      
      * Ingo's test discovered __cpuinit/exit markups were incorrect.
        Fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Menage <menage@google.com>
  17. 28 May 2010 (1 commit)
    • cpusets: new round-robin rotor for SLAB allocations · 6adef3eb
      Committed by Jack Steiner
      We have observed several workloads running on multi-node systems where
      memory is assigned unevenly across the nodes in the system.  There are
      numerous reasons for this but one is the round-robin rotor in
      cpuset_mem_spread_node().
      
      For example, a simple test that writes a multi-page file will allocate
      pages on nodes 0 2 4 6 ...  Odd nodes are skipped.  (Sometimes it
      allocates on odd nodes & skips even nodes).
      
      An example is shown below.  The program "lfile" writes a file consisting
      of 10 pages.  The program then mmaps the file & uses get_mempolicy(...,
      MPOL_F_NODE) to determine the nodes where the file pages were allocated.
      The output is shown below:
      
      	# ./lfile
      	 allocated on nodes: 2 4 6 0 1 2 6 0 2
      
      There is a single rotor that is used for allocating both file pages & slab
      pages.  Writing the file allocates both a data page & a slab page
      (buffer_head).  This advances the RR rotor 2 nodes for each page
      allocated.
      
      A quick test seems to confirm this is the cause of the uneven
      allocation:
      
      	# echo 0 >/dev/cpuset/memory_spread_slab
      	# ./lfile
      	 allocated on nodes: 6 7 8 9 0 1 2 3 4 5
      
      This patch introduces a second rotor that is used for slab allocations.
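
      In outline, the result is two independent per-task rotors advanced the
      same way (a sketch; helper and field names are illustrative):

        static int advance_spread_rotor(int *rotor)
        {
                int node = next_node(*rotor, current->mems_allowed);

                if (node == MAX_NUMNODES)
                        node = first_node(current->mems_allowed);
                *rotor = node;
                return node;
        }

        int cpuset_mem_spread_node(void)        /* page cache spreading */
        {
                return advance_spread_rotor(&current->cpuset_mem_spread_rotor);
        }

        int cpuset_slab_spread_node(void)       /* new: slab spreading */
        {
                return advance_spread_rotor(&current->cpuset_slab_spread_rotor);
        }
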
      Signed-off-by: Jack Steiner <steiner@sgi.com>
      Acked-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Menage <menage@google.com>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 25 May 2010 (1 commit)
    • cpuset,mm: fix no node to alloc memory when changing cpuset's mems · c0ff7453
      Committed by Miao Xie
      Before applying this patch, cpuset updates task->mems_allowed and
      mempolicy by setting all new bits in the nodemask first, and clearing all
      old unallowed bits later.  But along the way, the allocator may find that
      there is no node to allocate memory from.
      
      The reason is that when cpuset rebinds the task's mempolicy, it clears the
      nodes which the allocator can allocate pages on, for example:
      
      (mpol: mempolicy)
      	task1			task1's mpol	task2
      	alloc page		1
      	  alloc on node0? NO	1
      				1		change mems from 1 to 0
      				1		rebind task1's mpol
      				0-1		  set new bits
      				0	  	  clear disallowed bits
      	  alloc on node1? NO	0
      	  ...
      	can't alloc page
      	  goto oom
      
      This patch fixes the problem by expanding the node range first (set newly
      allowed bits) and shrinking it lazily (clear newly disallowed bits).  We
      use a variable to tell the write-side task that a read-side task is reading
      the nodemask, and the write-side task clears the newly disallowed nodes
      only after the read-side task ends its current memory allocation.
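
      A sketch of the writer side of that idea (synchronisation detail elided;
      newmems is the nodemask being switched to):

        /* Step 1: grow -- old and new nodes are both allowed. */
        nodes_or(tsk->mems_allowed, tsk->mems_allowed, newmems);

        /*
         * Wait until the task is no longer inside an allocation that is
         * reading mems_allowed (tracked via a per-task flag/counter).
         */

        /* Step 2: shrink lazily -- drop the no-longer-allowed nodes. */
        tsk->mems_allowed = newmems;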
      
      [akpm@linux-foundation.org: fix spello]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Menage <menage@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ravikiran Thirumalai <kiran@scalex86.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 03 Apr 2010 (2 commits)
    • sched: Make select_fallback_rq() cpuset friendly · 9084bb82
      Committed by Oleg Nesterov
      Introduce cpuset_cpus_allowed_fallback() helper to fix the cpuset problems
      with select_fallback_rq(). It can be called from any context and can't use
      any cpuset locks including task_lock(). It is called when the task doesn't
      have online cpus in ->cpus_allowed but ttwu/etc must be able to find a
      suitable cpu.
      
      I am not proud of this patch. Everything which needs such a fat comment
      can't be good even if correct. But I'd prefer to not change the locking
      rules in the code I hardly understand, and in any case I believe this
      simple change makes the code much more correct compared to the deadlocks we
      currently have.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091027.GA9155@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Kill the broken and deadlockable cpuset_lock/cpuset_cpus_allowed_locked code · 897f0b3c
      Committed by Oleg Nesterov
      This patch just states the fact that the cpusets/cpuhotplug interaction is
      broken and removes the deadlockable code which only pretends to work.
      
      - cpuset_lock() doesn't really work. It is needed for
        cpuset_cpus_allowed_locked() but we can't take this lock in
        try_to_wake_up()->select_fallback_rq() path.
      
      - cpuset_lock() is deadlockable. Suppose that a task T bound to CPU takes
        callback_mutex. If cpu_down(CPU) happens before T drops callback_mutex,
        stop_machine() preempts T, then migration_call(CPU_DEAD) tries to take
        cpuset_lock() and hangs forever because CPU is already dead and thus
        T can't be scheduled.
      
      - cpuset_cpus_allowed_locked() is deadlockable too. It takes task_lock()
        which is not irq-safe, but try_to_wake_up() can be called from irq.
      
      Kill them, and change select_fallback_rq() to use cpu_possible_mask, like
      we currently do without CONFIG_CPUSETS.
      
      Also, with or without this patch, with or without CONFIG_CPUSETS, the
      callers of select_fallback_rq() can race with each other or with
      set_cpus_allowed() paths.
      
      The subsequent patches try to fix these problems.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20100315091003.GA9123@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  20. 17 Jun 2009 (1 commit)
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Committed by Miao Xie
      Fix allocation of page cache/slab objects on a disallowed node when memory
      spread is set, by updating tasks' mems_allowed as soon as their cpuset's
      mems is changed.
      
      In order to update tasks' mems_allowed in time, we must modify the memory
      policy code, because the memory policy was originally applied in the
      process's own context.  After applying this patch, one task directly
      manipulates another's mems_allowed, and we use alloc_lock in the
      task_struct to protect the task's mems_allowed and memory policy.
      
      In the fast path, however, we don't take a lock to protect them, because
      doing so may lead to a performance regression.  Without a lock, the task
      might see no nodes when the cpuset's mems_allowed is changed to some
      non-overlapping set.  In order to avoid that, we set all new allowed nodes
      first, then clear the newly disallowed ones.
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 03 Apr 2009 (1 commit)
  22. 30 Mar 2009 (1 commit)
  23. 09 Jan 2009 (1 commit)
  24. 07 Jan 2009 (1 commit)
  25. 20 Nov 2008 (1 commit)
  26. 07 Sep 2008 (1 commit)
    • sched: arch_reinit_sched_domains() must destroy domains to force rebuild · dfb512ec
      Committed by Max Krasnyansky
      What I realized recently is that calling rebuild_sched_domains() in
      arch_reinit_sched_domains() by itself is not enough when cpusets are enabled.
      partition_sched_domains() code is trying to avoid unnecessary domain rebuilds
      and will not actually rebuild anything if new domain masks match the old ones.
      
      What this means is that doing
           echo 1 > /sys/devices/system/cpu/sched_mc_power_savings
      on a system with cpusets enabled will not take effect until something changes
      in the cpuset setup (i.e. new sets created or deleted).
      
      This patch restores the correct behaviour: domains must be rebuilt in
      order to enable the MC powersaving flags.
      
      Tested on a quad-core Core2 box with both CONFIG_CPUSETS and !CONFIG_CPUSETS,
      and on a dual-core Core2 laptop.  Lockdep is happy and things are working
      as expected.
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Tested-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  27. 18 Jul 2008 (1 commit)
    • cpu hotplug, sched: Introduce cpu_active_map and redo sched domain managment (take 2) · e761b772
      Committed by Max Krasnyansky
      This is based on Linus' idea of creating cpu_active_map that prevents
      scheduler load balancer from migrating tasks to the cpu that is going
      down.
      
      It allows us to simplify domain management code and avoid unnecessary
      domain rebuilds during cpu hotplug event handling.
      
      Please ignore the cpusets part for now. It needs some more work in order
      to avoid crazy lock nesting. Although I did simplify and unify domain
      reinitialization logic. We now simply call partition_sched_domains() in
      all the cases. This means that we're using the exact same code paths as in
      the cpusets case and hence the tests below cover cpusets too.
      Cpuset changes to make rebuild_sched_domains() callable from various
      contexts are in the separate patch (right next after this one).
      
      This not only boots but also easily handles
      	while true; do make clean; make -j 8; done
      and
      	while true; do on-off-cpu 1; done
      at the same time.
      (on-off-cpu 1 simply does the echo 0/1 > /sys/.../cpu1/online thing).
      
      Surprisingly the box (dual-core Core2) is quite usable. In fact I'm typing
      this on it right now in gnome-terminal and things are moving just fine.
      
      Also this is running with most of the debug features enabled (lockdep,
      mutex, etc) no BUG_ONs or lockdep complaints so far.
      
      I believe I addressed all of Dmitry's comments on the original Linus'
      version. I changed both fair and rt balancer to mask out non-active cpus.
      And replaced cpu_is_offline() with !cpu_active() in the main scheduler
      code where it made sense (to me).
      Signed-off-by: Max Krasnyanskiy <maxk@qualcomm.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Gregory Haskins <ghaskins@novell.com>
      Cc: dmitry.adamushko@gmail.com
      Cc: pj@sgi.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  28. 28 Apr 2008 (1 commit)
    • mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Committed by Mel Gorman
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
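
      Conceptually, the allocation entry point grows a nodemask parameter,
      roughly as below (a sketch; the exact signature has changed across
      releases, and policy_nodemask here is just an illustrative variable):

        /* A NULL nodemask means "no extra filtering beyond the zonelist". */
        struct page *page = __alloc_pages_nodemask(gfp_mask, order,
                                node_zonelist(numa_node_id(), gfp_mask),
                                policy_nodemask);       /* nodemask_t *, may be NULL */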
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 20 Apr 2008 (1 commit)
  30. 12 Feb 2008 (1 commit)
    • mempolicy: silently restrict nodemask to allowed nodes · 31f1de46
      Committed by KOSAKI Motohiro
      KOSAKI Motohiro noted that "numactl --interleave=all ..." failed in the
      presence of memoryless nodes.  This patch attempts to fix that problem.
      
      Some background:
      
      numactl --interleave=all calls set_mempolicy(2) with a fully populated
      [out to MAXNUMNODES] nodemask.  set_mempolicy() [in do_set_mempolicy()]
      calls contextualize_policy() which requires that the nodemask be a
      subset of the current task's mems_allowed; else EINVAL will be returned.
      
      A task's mems_allowed will always be a subset of node_states[N_HIGH_MEMORY]
      i.e., nodes with memory.  So, a fully populated nodemask will be
      declared invalid if it includes memoryless nodes.
      
        NOTE:  the same thing will occur when running in a cpuset
               with restricted mems_allowed--for the same reason:
               node mask contains dis-allowed nodes.
      
      mbind(2), on the other hand, just masks off any nodes in the nodemask
      that are not included in the caller's mems_allowed.
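
      For illustration, masking a user-supplied nodemask against the caller's
      allowed nodes looks roughly like this (variable names illustrative):

        nodemask_t ctx;

        nodes_and(ctx, *user_nodes, cpuset_current_mems_allowed);
        if (nodes_empty(ctx) && !nodes_empty(*user_nodes))
                return -EINVAL; /* only dis-allowed nodes were requested */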
      
      In each case [mbind() and set_mempolicy()], mpol_check_policy() will
      complain [again, resulting in EINVAL] if the nodemask contains any
      memoryless nodes.  This is somewhat redundant as mpol_new() will remove
      memoryless nodes for interleave policy, as will bind_zonelist()--called
      by mpol_new() for BIND policy.
      
      Proposed fix:
      
      1) modify contextualize_policy logic to:
         a) remember whether the incoming node mask is empty.
         b) if not, restrict the nodemask to allowed nodes, as is
            currently done in-line for mbind().  This guarantees
            that the resulting mask includes only nodes with memory.
      
            NOTE:  this is a [benign, IMO] change in behavior for
                   set_mempolicy().  Dis-allowed nodes will be
                   silently ignored, rather than returning an error.
      
         c) fold this code into mpol_check_policy(), replace 2 calls to
            contextualize_policy() to call mpol_check_policy() directly
            and remove contextualize_policy().
      
      2) In existing mpol_check_policy() logic, after "contextualization":
         a) MPOL_DEFAULT:  require that the incoming mask "was_empty"
         b) MPOL_{BIND|INTERLEAVE}:  require that contextualized nodemask
            contains at least one node.
         c) add a case for MPOL_PREFERRED:  if the incoming mask was not empty
            and the resulting mask IS empty, the user specified invalid nodes.
            Return EINVAL.
         d) remove the now redundant check for memoryless nodes
      
      3) remove the now redundant masking of policy nodes for interleave
         policy from mpol_new().
      
      4) Now that mpol_check_policy() contextualizes the nodemask, remove
         the in-line nodes_and() from sys_mbind().  I believe that this
         restores mbind() to the behavior before the memoryless-nodes
         patch series.  E.g., we'll no longer treat an invalid nodemask
         with MPOL_PREFERRED as local allocation.
      
      [ Patch history:
      
        v1 -> v2:
         - Communicate whether or not incoming node mask was empty to
           mpol_check_policy() for better error checking.
         - As suggested by David Rientjes, remove the now unused
           cpuset_nodes_subset_current_mems_allowed() from cpuset.h
      
        v2 -> v3:
         - As suggested by KOSAKI Motohiro, fold the "contextualization"
           of policy nodemask into mpol_check_policy().  Looks a little
           cleaner. ]
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  31. 09 Feb 2008 (1 commit)
    • proc: seqfile convert proc_pid_status to properly handle pid namespaces · df5f8314
      Committed by Eric W. Biederman
      Currently we may look up the pid in the wrong pid namespace.  So convert
      proc_pid_status to a seq_file, which ensures the proper pid namespace is
      passed in.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: build fix]
      [akpm@linux-foundation.org: another build fix]
      [akpm@linux-foundation.org: s390 build fix]
      [akpm@linux-foundation.org: fix task_name() output]
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Andrew Morgan <morgan@kernel.org>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  32. 20 Oct 2007 (2 commits)
    • hotplug cpu: migrate a task within its cpuset · 470fd646
      Committed by Cliff Wickman
      When a cpu is disabled, move_task_off_dead_cpu() is called for tasks that have
      been running on that cpu.
      
      Currently, such a task is migrated:
       1) to any cpu on the same node as the disabled cpu, which is both online
          and among that task's cpus_allowed
       2) to any cpu which is both online and among that task's cpus_allowed
      
      It is typical of a multithreaded application running on a large NUMA system to
      have its tasks confined to a cpuset so as to cluster them near the memory that
      they share.  Furthermore, it is typical to explicitly place such a task on a
      specific cpu in that cpuset.  And in that case the task's cpus_allowed
      includes only a single cpu.
      
      This patch would insert a preference to migrate such a task to some cpu within
      its cpuset (and set its cpus_allowed to its entire cpuset).
      
      With this patch, migrate the task to:
       1) to any cpu on the same node as the disabled cpu, which is both online
          and among that task's cpus_allowed
       2) to any online cpu within the task's cpuset
       3) to any cpu which is both online and among that task's cpus_allowed
      
      In order to do this, move_task_off_dead_cpu() must make a call to
      cpuset_cpus_allowed_locked(), a new variant of cpuset_cpus_allowed() that
      will not block.  (name change per Oleg's suggestion)
      
      Calls are made to cpuset_lock() and cpuset_unlock() in migration_call() to hold
      the cpuset mutex during the whole migrate_live_tasks() and
      migrate_dead_tasks() procedure.
      
      [akpm@linux-foundation.org: build fix]
      [pj@sgi.com: Fix indentation and spacing]
      Signed-off-by: Cliff Wickman <cpw@sgi.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Task Control Groups: make cpusets a client of cgroups · 8793d854
      Committed by Paul Menage
      Remove the filesystem support logic from the cpusets system and make
      cpusets a cgroup subsystem.
      
      The "cpuset" filesystem becomes a dummy filesystem; attempts to mount it get
      passed through to the cgroup filesystem with the appropriate options to
      emulate the old cpuset filesystem behaviour.
      Signed-off-by: Paul Menage <menage@google.com>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Kirill Korotaev <dev@openvz.org>
      Cc: Herbert Poetzl <herbert@13thfloor.at>
      Cc: Srivatsa Vaddagiri <vatsa@in.ibm.com>
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  33. 17 Oct 2007 (2 commits)
  34. 13 Feb 2007 (1 commit)
  35. 31 Dec 2006 (1 commit)
  36. 14 Dec 2006 (1 commit)
    • [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Committed by Paul Jackson
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occasion.
      
      The hardwall version requires that the current task's mems_allowed allows
      the node of the specified zone (or that you're in interrupt, or that
      __GFP_THISNODE is set, or that you're on a single-cpuset system).
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclosing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
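
      For example, callers would now be expected to spell out the choice
      (illustrative fragments, not actual call sites from the patch):

        /* Atomic context: only the hardwall check is safe (never sleeps). */
        if (!cpuset_zone_allowed_hardwall(zone, gfp_mask))
                goto next_zone;

        /* Sleepable context (or __GFP_HARDWALL set in gfp_mask): the
         * softwall check may consult the nearest mem_exclusive ancestor. */
        if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
                goto next_zone;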
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but it seems worth it, to reduce the chance of hitting a
      sleeping-with-irqs-off complaint again.
      Signed-off-by: Paul Jackson <pj@sgi.com>
      Signed-off-by: Andrew Morton <akpm@osdl.org>
      Signed-off-by: Linus Torvalds <torvalds@osdl.org>
  37. 08 Dec 2006 (1 commit)