1. 27 Jun, 2008 12 commits
  2. 20 Jun, 2008 3 commits
    • sched: refactor wait_for_completion_timeout() · ea71a546
      Oleg Nesterov committed
      Simplify the code and fix the boundary condition of
      wait_for_completion_timeout(,0).
      
      We can kill the first __remove_wait_queue() as well.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • sched: fix wait_for_completion_timeout() spurious failure under heavy load · bb10ed09
      Roland Dreier committed
      It seems that the current implementation of wait_for_completion_timeout()
      has a small problem under very high load for the common pattern:
      
      	if (!wait_for_completion_timeout(&done, timeout))
      		/* handle failure */
      
      because the implementation very roughly does (lots of code deleted to
      show the basic flow):
      
      	static inline long __sched
      	do_wait_for_common(struct completion *x, long timeout, int state)
      	{
      		if (x->done)
      			return timeout;
      
      		do {
      			timeout = schedule_timeout(timeout);
      
      			if (!timeout)
      				return timeout;
      
      		} while (!x->done);
      
      		return timeout;
      	}
      
      so if the system is very busy and x->done is not set when
      do_wait_for_common() is entered, it is possible that the first call to
      schedule_timeout() returns 0 because the task doing wait_for_completion
      doesn't get rescheduled for a long time, even if it is woken up early
      enough.
      
      In this case, wait_for_completion_timeout() returns 0 without even
      checking x->done again, and the code above falls into its failure case
      purely for scheduler reasons, even if the hardware event or whatever was
      being waited for happened early enough.
      
      It would make sense to add an extra test to do_wait_for_common() in the
      timeout case and return 1 if x->done is actually set.
      
      A quick audit (not exhaustive) of wait_for_completion_timeout() callers
      seems to indicate that no one actually cares about the return value in
      the success case -- they just test for 0 (timed out) versus non-zero
      (wait succeeded).
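The spurious failure and the proposed extra test can be modelled in userspace. This is only a sketch: `completion_model`, `late_wakeup()`, `wait_before_fix()` and `wait_after_fix()` are hypothetical names, and `late_wakeup()` is an assumed stand-in for schedule_timeout() that simulates a waiter being rescheduled only after the whole timeout has elapsed.

```c
/* Userspace model of struct completion: only the 'done' counter matters here. */
struct completion_model {
	unsigned int done;
};

/*
 * Hypothetical stand-in for schedule_timeout(). It models the bad case:
 * the event is signalled in time, but the waiter only runs again after
 * the whole timeout has elapsed, so 0 is returned as the remaining time.
 */
static long late_wakeup(struct completion_model *x, long timeout)
{
	(void)timeout;
	x->done = 1;
	return 0;
}

/* Flow before the fix: a zero timeout is returned without rechecking x->done. */
static long wait_before_fix(struct completion_model *x, long timeout)
{
	if (x->done)
		return timeout;
	do {
		timeout = late_wakeup(x, timeout);
		if (!timeout)
			return timeout;
	} while (!x->done);
	return timeout;
}

/* Flow with the extra test: report success (1) if x->done is set at timeout. */
static long wait_after_fix(struct completion_model *x, long timeout)
{
	if (x->done)
		return timeout;
	do {
		timeout = late_wakeup(x, timeout);
		if (!timeout)
			return x->done ? 1 : 0;
	} while (!x->done);
	return timeout;
}
```

With a completion that starts out not done, the first variant reports a timeout even though the event was signalled, while the second reports success. Callers that only test for zero versus non-zero are unaffected by the exact non-zero value.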
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt: fix the bandwidth constraint computations · 10b612f4
      Peter Zijlstra committed
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: "Daniel K." <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 19 Jun, 2008 4 commits
    • cpuset: limit the input of cpuset.sched_relax_domain_level · 30e0e178
      Li Zefan committed
      We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
      for inputs outside this range.
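A minimal sketch of such a half-open range check follows. `SD_LV_MAX_MODEL` and `update_relax_domain_level_model()` are hypothetical names, and the value 7 is a placeholder chosen for illustration, not the kernel's SD_LV_MAX.

```c
#include <errno.h>

/* Placeholder upper bound for this sketch; not the kernel's SD_LV_MAX. */
#define SD_LV_MAX_MODEL 7

/*
 * Accept values in [-1, SD_LV_MAX_MODEL) and store them in *level.
 * Anything outside that range is rejected with -EINVAL and *level is
 * left untouched.
 */
static int update_relax_domain_level_model(int *level, long val)
{
	if (val < -1 || val >= SD_LV_MAX_MODEL)
		return -EINVAL;
	*level = (int)val;
	return 0;
}
```

Note that the upper bound is exclusive, so the largest accepted value is one less than the maximum.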
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: CPU hotplug events must not destroy scheduler domains created by the cpusets · f18f982a
      Max Krasnyansky committed
      The first issue is not related to cpusets: we are simply leaking doms_cur.
      It is allocated in arch_init_sched_domains(), which is called for every
      hotplug event, so we keep reallocating doms_cur without freeing it.
      I introduced a free_sched_domains() function that cleans things up.
      
      The second issue is that sched domains created by cpusets are
      completely destroyed by CPU hotplug events. For every CPU hotplug
      event the scheduler attaches all CPUs to the NULL domain and then puts
      them all into a single domain, thereby destroying the domains created
      by cpusets (partition_sched_domains).
      The solution is simple: when cpusets are enabled, the scheduler should not
      create the default domain and should instead let cpusets do that, which is
      exactly what the patch does.
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Cc: pj@sgi.com
      Cc: menage@google.com
      Cc: rostedt@goodmis.org
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: rt-group: fix hierarchy · 7ea56616
      Peter Zijlstra committed
      Don't re-set the entity's runqueue to the wrong rq after we've set it
      to the right one.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Daniel K. <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: NULL pointer dereference while setting sched_rt_period_us · 49307fd6
      Dario Faggioli committed
      When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:
      
       echo 10000 > /proc/sys/kernel/sched_rt_period_us
      
      We get this:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000008c
       [  947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160
       [  947.683123] *pde = 00000000
       [  947.683782] Oops: 0000 [#1]
       [  947.684307] Modules linked in:
       [  947.684308]
       [  947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
       [  947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0
       [  947.684308] EIP is at __rt_schedulable+0x12/0x160
       [  947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
       [  947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
       [  947.684308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
       [  947.684308] Process bash (pid: 2359, ti=c6cc8000 task=c7a54f00 task.ti=c6cc8000)
       [  947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
       [  947.684308]        c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
       [  947.684308]        c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
       [  947.684308] Call Trace:
       [  947.684308]  [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50
       [  947.684308]  [<c0216d40>] ? sched_rt_handler+0x80/0x110
       [  947.684308]  [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0
       [  947.684308]  [<c02af2fa>] ? proc_sys_write+0x1a/0x20
       [  947.684308]  [<c0273c36>] ? vfs_write+0x96/0x160
       [  947.684308]  [<c02af2e0>] ? proc_sys_write+0x0/0x20
       [  947.684308]  [<c027423d>] ? sys_write+0x3d/0x70
       [  947.684308]  [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91
       [  947.684308]  =======================
       [  947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75
       f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4
       89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8
       8b
       [  947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP  0068:c6cc9ed0
      
      We think the following patch solves the issue.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 18 Jun, 2008 1 commit
    • sched: rework of "prioritize non-migratable tasks over migratable ones" · 20b6331b
      Dmitry Adamushko committed
      regarding this commit: 45c01e82
      
      I think we can do it simpler. Please take a look at the patch below.
      
      Instead of having 2 separate arrays (which is + ~800 bytes on x86_32 and
      twice so on x86_64), let's add "exclusive" (the ones that are bound to
      this CPU) tasks to the head of the queue and "shared" ones -- to the
      end.
      
      In case of a few newly woken up "exclusive" tasks, they are 'stacked'
      (not queued as now), meaning that a task {i+1} is being placed in front
      of the previously woken up task {i}. But I don't think that this
      behavior may cause any realistic problems.
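The head/tail placement and the resulting "stacking" of exclusive tasks can be illustrated with a small array-backed queue. This is a sketch under stated assumptions: `rt_queue_model`, `QCAP` and `enqueue_model()` are hypothetical, and a task is treated as "exclusive" when its allowed-CPU count is 1.

```c
#include <stddef.h>

#define QCAP 16	/* arbitrary capacity for this sketch */

/* A single queue of task ids; index 0 is the head. */
struct rt_queue_model {
	int ids[QCAP];
	size_t len;
};

/*
 * "Exclusive" tasks (bound to one CPU) go to the head of the queue,
 * "shared" tasks go to the tail. Returns 0 on success, -1 if full.
 */
static int enqueue_model(struct rt_queue_model *q, int id, int nr_cpus_allowed)
{
	size_t i;

	if (q->len == QCAP)
		return -1;

	if (nr_cpus_allowed == 1) {
		/* Shift everything right by one slot and insert at the head. */
		for (i = q->len; i > 0; i--)
			q->ids[i] = q->ids[i - 1];
		q->ids[0] = id;
	} else {
		q->ids[q->len] = id;
	}
	q->len++;
	return 0;
}
```

Enqueuing a shared task 1, then exclusive tasks 2 and 3, then a shared task 4 leaves the queue as 3, 2, 1, 4: the later exclusive task ends up in front of the earlier one, which is the stacking behaviour described above.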
      
      There are a couple of changes on top of this one.
      
      (1) in check_preempt_curr_rt()
      
      I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt)
      != &rq->curr->rt" check.
      
      enqueue_task_rt(p) and check_preempt_curr_rt() are always called one
      after another with rq->lock being held so the following check
      "p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should
      be enough (well, just its left part) to guarantee that 'p' has been
      queued in front of the 'curr'.
      
      (2) in set_cpus_allowed_rt()
      
      I don't think there is a need for requeue_task_rt() here.
      
      Perhaps, the only case when 'requeue' (+ reschedule) might be useful is
      as follows:
      
      i) weight == 1 && cpu_isset(task_cpu(p), *new_mask)
      
      (i.e. a task is being bound to this CPU);
      
      ii) 'p' != rq->curr
      
      but here, 'p' has already been on this CPU for a while and was not
      migrated, i.e. it's possible that 'rq->curr' would not have a high chance
      of being migrated right at this particular moment (although it has a
      chance in the slightly longer term), should we allow it to be preempted.
      
      Anyway, I think we should not make it more complex by trying to
      address some rare corner cases. For instance, that's why a single queue
      approach would be preferable. Unless I'm missing something obvious, this
      approach gives us similar functionality at lower cost.
      
      Verified only compilation-wise.
      
      (Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 17 Jun, 2008 1 commit
  6. 12 Jun, 2008 2 commits
    • sched: 64-bit: fix arithmetic overflow · 7a232e03
      Lai Jiangshan committed
      (overflow means weight >= 2^32 here, because inv_weight = 2^32/weight)
      
      The weight of a cfs_rq is the sum of the weights of the entities
      queued on it, so it will overflow when there are too many
      entities.
      
      Although overflow occurs very rarely, it breaks fairness when
      it does occur. 64-bit systems have more memory than 32-bit systems
      and can usually create more processes, so overflow may
      occur more frequently on them.
      
      This patch guarantees fairness when overflow happens on 64-bit systems.
      Thanks to compiler optimization, it changes nothing on 32-bit systems.
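The arithmetic behind the overflow can be sketched as follows. The helper names and `WMULT_MODEL` are hypothetical, and the guard shown is only one possible way to keep the inverse non-zero; the actual patch may compute it differently.

```c
#include <stdint.h>

/* Models the 2^32 constant mentioned in the commit message. */
#define WMULT_MODEL (UINT64_C(1) << 32)

/*
 * Naive inverse weight: 2^32 / weight, using integer division.
 * Precondition: weight > 0. Once weight exceeds 2^32 the result is 0,
 * so any share computed by multiplying with it collapses to 0.
 */
static uint64_t inv_weight_naive(uint64_t weight)
{
	return WMULT_MODEL / weight;
}

/*
 * Guarded inverse (assumed approach for this sketch): clamp to 1 for
 * weights of 2^32 or more so that the inverse never becomes 0.
 */
static uint64_t inv_weight_guarded(uint64_t weight)
{
	if (weight >= WMULT_MODEL)
		return 1;
	return WMULT_MODEL / weight;
}
```

For an ordinary weight such as 1024 both helpers agree; they only differ once the summed weight no longer fits in 32 bits, which can only happen where the weight type is wider than 32 bits.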
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fair group: fix overflow (was: fix divide by zero) · 2e084786
      Lai Jiangshan committed
      I found a bug which can be reproduced this way (linux-2.6.26-rc5, x86-64):
      (use 2^32, 2^33, ...., 2^63 as shares value)
      
      # mkdir /dev/cpuctl
      # mount -t cgroup -o cpu cpuctl /dev/cpuctl
      # cd /dev/cpuctl
      # mkdir sub
      # echo 0x8000000000000000 > sub/cpu.shares
      # echo $$ > sub/tasks
      oops here! divide by zero.
      
      This is because do_div() expects its second parameter to be 32 bits wide,
      but unsigned long is 64 bits on x86_64.
      
      Peter Zijlstra pointed out that the sane thing to do is to limit the
      shares value to something smaller instead of using an even more
      expensive divide.
      
      Also, I found another bug about "the shares value is too large":
      
      pid1 and pid2 both have their affinity set to cpu#0
      pid1 is attached to cg1 and pid2 is attached to cg2
      
      if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
      then pid2 got 100% usage of cpu, and pid1 0%
      
      if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
      then pid2 got 0% usage of cpu, and pid1 100%
      
      Also, the weight of a cfs_rq is the sum of the weights of the entities
      queued on it, so the shares value should be limited
      to a smaller value.
      
      I think that (1UL << 18) is a good limit:
      
      1) it's not too large, so we can create a lot of groups before overflow
      2) it's several times the weight value for nice=-19 (not too small)
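A small sketch of why the divide by zero appears and how an upper limit avoids it. `narrowed_divisor()` and `clamp_shares()` are hypothetical helpers; only the (1UL << 18) limit comes from the message above, and no lower bound is modelled here.

```c
#include <stdint.h>

/* The upper limit proposed in the commit message. */
#define MAX_SHARES_MODEL (UINT64_C(1) << 18)

/*
 * What a 64-bit divisor looks like to a division helper that only
 * honours the low 32 bits: the high bits are silently dropped.
 */
static uint32_t narrowed_divisor(uint64_t divisor)
{
	return (uint32_t)divisor;
}

/* Limit a requested shares value to the proposed maximum. */
static uint64_t clamp_shares(uint64_t shares)
{
	return shares > MAX_SHARES_MODEL ? MAX_SHARES_MODEL : shares;
}
```

The value 0x8000000000000000 from the reproduction steps has all of its low 32 bits clear, so the narrowed divisor is 0. After clamping, the same request becomes 262144, which survives the narrowing unchanged.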
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 10 Jun, 2008 4 commits
    • sched: kill off dead cfs_rq_set_shares() · e9886ca3
      Paul Mundt committed
      Building with CONFIG_FAIR_GROUP_SCHED=y on UP results in an unused
      cfs_rq_set_shares() reference. As nothing is using this dummy function
      in the first place, just kill it off.
      Signed-off-by: Paul Mundt <lethal@linux-sh.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: prevent bound kthreads from changing cpus_allowed · 9985b0ba
      David Rientjes committed
      Kthreads that have called kthread_bind() are bound to specific cpus, so
      other tasks should not be able to change their cpus_allowed from under
      them.  Otherwise, it is possible to move kthreads, such as the migration
      or software watchdog threads, so they are not allowed access to the cpu
      they work on.
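One way to picture the restriction is a per-task flag that is set when the thread is bound and checked whenever the allowed-CPU mask is changed. Everything in this sketch is an assumption made for illustration: the flag `PF_BOUND_MODEL`, the `bound_task_model` struct and both helpers are hypothetical and do not reproduce the kernel's code.

```c
#include <errno.h>

/* Hypothetical "was bound with kthread_bind()" flag for this sketch. */
#define PF_BOUND_MODEL 0x1u

struct bound_task_model {
	unsigned int flags;
	unsigned long cpus_allowed;	/* one bit per CPU */
};

/* Bind the task to a single CPU and remember that it is bound. */
static void kthread_bind_model(struct bound_task_model *t, unsigned int cpu)
{
	t->cpus_allowed = 1UL << cpu;
	t->flags |= PF_BOUND_MODEL;
}

/*
 * Change the allowed-CPU mask. A bound task refuses any mask that
 * differs from the one it was bound to and returns -EINVAL.
 */
static int set_cpus_allowed_model(struct bound_task_model *t,
				  unsigned long new_mask)
{
	if ((t->flags & PF_BOUND_MODEL) && new_mask != t->cpus_allowed)
		return -EINVAL;
	t->cpus_allowed = new_mask;
	return 0;
}
```

Before binding, the mask can be changed freely. After binding to CPU 2, an attempt to move the task to CPU 0 fails and the mask stays at the bound CPU.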
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix hotplug cpus on ia64 · 7def2be1
      Peter Zijlstra committed
      Cliff Wickman wrote:
      
      > I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
      > and get a very predictable hotplug cpu problem.
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      >
      > The script that disables the cpu always hangs (unkillable)
      > on the 3rd attempt.
      >
      > And a bit further:
      > The kstopmachine thread always sits on the run queue (real time) for about
      > 30 minutes before running.
      
      This fix solves some (but not all) issues between CPU hotplug and
      RT bandwidth throttling.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix TASK_WAKEKILL vs SIGKILL race · 16882c1e
      Oleg Nesterov committed
      schedule() has a special "TASK_INTERRUPTIBLE && signal_pending()" case,
      which allows us to do
      
      	current->state = TASK_INTERRUPTIBLE;
      	schedule();
      
      without fear of sleeping with a pending signal.
      
      However, code like
      
      	current->state = TASK_KILLABLE;
      	schedule();
      
      is not right: schedule() doesn't take TASK_WAKEKILL into account. This means
      that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
      and schedule_timeout_killable() can miss SIGKILL (and, by the way, a second
      SIGKILL has no effect).
      
      Introduce a new helper, signal_pending_state(), and change schedule() to
      use it. Hopefully it will gain more users, which is why the task's state is
      passed separately.
      
      Note the "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
      It is needed to preserve the current behaviour (ptrace_notify). I hope
      this check will be removed soon, but that change (good, as far as I can see)
      needs a separate discussion.
      
      The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
      basically the same as what schedule() does now. However, this patch of course
      bloats schedule().
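The decision logic described above can be sketched as a userspace model. The `M_`-prefixed bit values, the `sig_task_model` struct and `signal_pending_state_model()` are assumptions for illustration: the two booleans stand in for the kernel's pending-signal and fatal-signal checks, and the bit values are not taken from kernel headers.

```c
/* Illustrative state bits for this model only. */
#define M_TASK_INTERRUPTIBLE	0x01
#define M_TASK_UNINTERRUPTIBLE	0x02
#define M_TASK_STOPPED_BIT	0x04
#define M_TASK_TRACED_BIT	0x08
#define M_TASK_WAKEKILL		0x80
#define M_TASK_KILLABLE		(M_TASK_WAKEKILL | M_TASK_UNINTERRUPTIBLE)

/* Stand-ins for "any signal pending" and "a fatal signal pending". */
struct sig_task_model {
	int signal_pending;
	int fatal_signal_pending;
};

/*
 * Returns 1 if a task about to sleep in 'state' should not sleep
 * because of a pending signal, 0 otherwise.
 */
static int signal_pending_state_model(long state,
				      const struct sig_task_model *p)
{
	/* Fast path: the state ignores signals, or nothing is pending. */
	if (!(state & (M_TASK_INTERRUPTIBLE | M_TASK_WAKEKILL)))
		return 0;
	if (!p->signal_pending)
		return 0;

	/* Stopped or traced tasks keep their current behaviour. */
	if (state & (M_TASK_STOPPED_BIT | M_TASK_TRACED_BIT))
		return 0;

	/* Interruptible: any signal counts. Killable: only a fatal one. */
	return (state & M_TASK_INTERRUPTIBLE) || p->fatal_signal_pending;
}
```

In this model a killable sleeper ignores an ordinary pending signal but is prevented from sleeping by a fatal one, which is the case the commit message says schedule() previously missed.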
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 06 Jun, 2008 13 commits