1. 29 June 2008, 1 commit
    • sched: fix cpu hotplug · 79c53799
      Dmitry Adamushko authored
      The CPU hotplug problems (crashes under high-volume unplug+replug
      tests) seem to be related to migrate_dead_tasks().
      
      Firstly I added traces to see all tasks being migrated with
      migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
      pops up (the one with "se == NULL" in the loop of
      pick_next_task_fair()) shortly after the traces indicate that some
      tasks have been migrated with migrate_dead_tasks(). Btw., I can
      reproduce it much faster now with just a plain cpu down/up loop.
      
      [disclaimer] Well, unless I'm really missing something important in
      this late hour [/disclaimer], pick_next_task() is not something
      appropriate for migrate_dead_tasks() :-)
      
      The following change seems to eliminate the problem on my setup
      (although I kept it running only for a few minutes, enough to get a
      few messages indicating that migrate_dead_tasks() does move tasks
      while the system stays ok).
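      
      A plausible reconstruction of the fix (a hedged sketch, not the verbatim
      diff): pick_next_task() marks the returned task as its class's current
      task, so the migration loop has to pair each pick with put_prev_task()
      before handing the task to migrate_dead(); otherwise per-class state is
      left dangling and the next pick_next_task_fair() trips over "se == NULL".
      
      	/* migrate_dead_tasks() loop, heavily simplified; the pairing of
      	 * pick_next_task() with put_prev_task() is the point */
      	for ( ; ; ) {
      		if (!rq->nr_running)
      			break;
      		update_rq_clock(rq);
      		next = pick_next_task(rq, rq->curr);
      		if (!next)
      			break;
      		/* undo the class-internal "this task is running" state */
      		next->sched_class->put_prev_task(rq, next);
      		migrate_dead(dead_cpu, next);
      	}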
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 20 June 2008, 2 commits
    • sched: refactor wait_for_completion_timeout() · ea71a546
      Oleg Nesterov authored
      Simplify the code and fix the boundary condition of
      wait_for_completion_timeout(..., 0), i.e. a zero timeout.
      
      We can kill the first __remove_wait_queue() as well.
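      
      The restructured flow looks roughly like this (a simplified sketch of
      the post-refactor code; locking and signal handling are elided as
      comments, so treat the details as approximate):
      
      	static inline long __sched
      	do_wait_for_common(struct completion *x, long timeout, int state)
      	{
      		if (!x->done) {
      			DECLARE_WAITQUEUE(wait, current);
      
      			wait.flags |= WQ_FLAG_EXCLUSIVE;
      			__add_wait_queue_tail(&x->wait, &wait);
      			do {
      				/* ... set task state, drop x->wait.lock,
      				 * sleep, re-acquire the lock ... */
      				timeout = schedule_timeout(timeout);
      				if (!timeout)
      					break;
      			} while (!x->done);
      			__remove_wait_queue(&x->wait, &wait);
      			if (!x->done)
      				return timeout;
      		}
      		x->done--;
      		/* timed out but done: report success, not timeout */
      		return timeout ?: 1;
      	}
      
      With the wait-queue handling inside the single if (!x->done) block, one
      __remove_wait_queue() call suffices, and a zero remaining timeout with
      the completion already done maps to a non-zero success return.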
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • sched: fix wait_for_completion_timeout() spurious failure under heavy load · bb10ed09
      Roland Dreier authored
      It seems that the current implementation of wait_for_completion_timeout()
      has a small problem under very high load for the common pattern:
      
      	if (!wait_for_completion_timeout(&done, timeout))
      		/* handle failure */
      
      because the implementation very roughly does (lots of code deleted to
      show the basic flow):
      
      	static inline long __sched
      	do_wait_for_common(struct completion *x, long timeout, int state)
      	{
      		if (x->done)
      			return timeout;
      
      		do {
      			timeout = schedule_timeout(timeout);
      
      			if (!timeout)
      				/* timed out: returns 0 without
      				 * re-checking x->done */
      				return timeout;
      
      		} while (!x->done);
      
      		return timeout;
      	}
      
      so if the system is very busy and x->done is not set when
      do_wait_for_common() is entered, it is possible that the first call to
      schedule_timeout() returns 0 because the task doing wait_for_completion
      doesn't get rescheduled for a long time, even if it is woken up early
      enough.
      
      In this case, wait_for_completion_timeout() returns 0 without even
      checking x->done again, and the code above falls into its failure case
      purely for scheduler reasons, even if the hardware event or whatever was
      being waited for happened early enough.
      
      It would make sense to add an extra test to do_wait_for() in the timeout
      case and return 1 if x->done is actually set.
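      
      Concretely, against the simplified flow above, the fix amounts to
      re-checking the flag when schedule_timeout() returns 0 (a sketch; the
      real function also juggles the wait-queue lock):
      
      	do {
      		timeout = schedule_timeout(timeout);
      
      		if (!timeout)
      			/* out of time, but the event may have fired
      			 * while we waited to run: report success
      			 * rather than timeout */
      			return x->done ? 1 : 0;
      
      	} while (!x->done);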
      
      A quick audit (not exhaustive) of wait_for_completion_timeout() callers
      seems to indicate that no one actually cares about the return value in
      the success case -- they just test for 0 (timed out) versus non-zero
      (wait succeeded).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 19 June 2008, 4 commits
    • cpuset: limit the input of cpuset.sched_relax_domain_level · 30e0e178
      Li Zefan authored
      We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
      for inputs outside this range.
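      
      A minimal sketch of the intended check (the handler name and its
      surroundings are assumptions based on the commit's area, not quoted
      from the patch):
      
      	/* cpuset write handler for sched_relax_domain_level (assumed name) */
      	static int update_relax_domain_level(struct cpuset *cs, s64 val)
      	{
      		if (val < -1 || val >= SD_LV_MAX)
      			return -EINVAL;
      
      		/* ... store the level and rebuild sched domains ... */
      		return 0;
      	}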
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: CPU hotplug events must not destroy scheduler domains created by the cpusets · f18f982a
      Max Krasnyansky authored
      The first issue is not related to the cpusets: we're simply leaking
      doms_cur. It's allocated in arch_init_sched_domains(), which is called
      for every hotplug event, so we just keep reallocating doms_cur without
      ever freeing it. I introduced a free_sched_domains() function that
      cleans things up.
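      
      A sketch of what such a cleanup helper could look like (a hedged
      reconstruction; it assumes the 2.6.26-era convention that doms_cur may
      point at a static fallback_doms that must not be freed):
      
      	static void free_sched_domains(void)
      	{
      		ndoms_cur = 0;
      		/* the static fallback domain mask is not heap-allocated */
      		if (doms_cur != &fallback_doms)
      			kfree(doms_cur);
      		doms_cur = &fallback_doms;
      	}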
      
      The second issue is that sched domains created by the cpusets are
      completely destroyed by CPU hotplug events: for every hotplug event the
      scheduler attaches all CPUs to the NULL domain and then puts them all
      into a single domain, thereby destroying the domains created by the
      cpusets (partition_sched_domains).
      The solution is simple: when cpusets are enabled, the scheduler should
      not create the default domain and should instead let the cpusets do it.
      Which is exactly what the patch does.
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Cc: pj@sgi.com
      Cc: menage@google.com
      Cc: rostedt@goodmis.org
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: rt-group: fix hierarchy · 7ea56616
      Peter Zijlstra authored
      Don't re-set the entity's runqueue to the wrong rq after we've set it
      to the right one.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Daniel K. <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: NULL pointer dereference while setting sched_rt_period_us · 49307fd6
      Dario Faggioli authored
      When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:
      
       echo 10000 > /proc/sys/kernel/sched_rt_period_us
      
      We get this:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000008c
       [  947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160
       [  947.683123] *pde = 00000000
       [  947.683782] Oops: 0000 [#1]
       [  947.684307] Modules linked in:
       [  947.684308]
       [  947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
       [  947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0
       [  947.684308] EIP is at __rt_schedulable+0x12/0x160
       [  947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
       [  947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
       [  947.684308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
       [  947.684308] Process bash (pid: 2359, ti=c6cc8000 task=c7a54f00 task.ti=c6cc8000)
       [  947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
       [  947.684308]        c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
       [  947.684308]        c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
       [  947.684308] Call Trace:
       [  947.684308]  [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50
       [  947.684308]  [<c0216d40>] ? sched_rt_handler+0x80/0x110
       [  947.684308]  [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0
       [  947.684308]  [<c02af2fa>] ? proc_sys_write+0x1a/0x20
       [  947.684308]  [<c0273c36>] ? vfs_write+0x96/0x160
       [  947.684308]  [<c02af2e0>] ? proc_sys_write+0x0/0x20
       [  947.684308]  [<c027423d>] ? sys_write+0x3d/0x70
       [  947.684308]  [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91
       [  947.684308]  =======================
       [  947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75 f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4 89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8 8b
       [  947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP  0068:c6cc9ed0
      
      We think the following patch solves the issue.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 17 June 2008, 1 commit
  5. 12 June 2008, 2 commits
    • sched: 64-bit: fix arithmetics overflow · 7a232e03
      Lai Jiangshan authored
      (overflow here means weight >= 2^32, because inv_weight = 2^32/weight)
      
      The weight of a cfs_rq is the sum of the weights of the entities
      queued on it, so it will overflow when too many entities are queued.
      
      Although overflow occurs very rarely, it breaks fairness when it does.
      64-bit systems have more memory than 32-bit systems and can usually
      run more processes, so overflow may occur more frequently there.
      
      This patch guarantees fairness when overflow happens on 64-bit systems.
      Thanks to compiler optimization, it changes nothing on 32-bit systems.
      A sketch of the guard follows.
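      
      A hedged reconstruction of the guard this implies in calc_delta_mine()
      (treat the exact form as approximate; on 64-bit, WMULT_CONST is 2^32 in
      this era's kernel/sched.c):
      
      	if (!lw->inv_weight) {
      		if (BITS_PER_LONG > 32 && unlikely(lw->weight >= WMULT_CONST))
      			/* weight >= 2^32: 2^32/weight would truncate
      			 * to 0, so saturate the inverse at 1 */
      			lw->inv_weight = 1;
      		else
      			lw->inv_weight = 1 + (WMULT_CONST - lw->weight/2)
      						/ (lw->weight + 1);
      	}
      
      On 32-bit, BITS_PER_LONG > 32 is constant-false, so the compiler drops
      the extra branch entirely, which is why nothing changes there.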
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fair group: fix overflow(was: fix divide by zero) · 2e084786
      Lai Jiangshan authored
      I found a bug which can be reproduced this way (linux-2.6.26-rc5,
      x86-64; use 2^32, 2^33, ..., 2^63 as the shares value):
      
      # mkdir /dev/cpuctl
      # mount -t cgroup -o cpu cpuctl /dev/cpuctl
      # cd /dev/cpuctl
      # mkdir sub
      # echo 0x8000000000000000 > sub/cpu.shares
      # echo $$ > sub/tasks
      oops here! divide by zero.
      
      This is because do_div() expects its second parameter to be 32 bits,
      but unsigned long is 64 bits on x86_64.
      
      Peter Zijlstra pointed out that the sane thing to do is to limit the
      shares value to something smaller instead of using an even more
      expensive divide.
      
      Also, I found another bug about "the shares value is too large":
      
      pid1 and pid2 have their affinity set to cpu#0;
      pid1 is attached to cg1 and pid2 is attached to cg2
      
      if cg1/cpu.shares = 1024 and cg2/cpu.shares = 2000000000,
      then pid2 gets 100% cpu usage and pid1 gets 0%;
      
      if cg1/cpu.shares = 1024 and cg2/cpu.shares = 20000000000,
      then pid2 gets 0% cpu usage and pid1 gets 100%.
      
      And since the weight of a cfs_rq is the sum of the weights of the
      entities queued on it, the shares value should be limited to something
      smaller.
      
      I think (1UL << 18) is a good limit:
      
      1) it's not too large: we can create a lot of groups before overflow;
      2) it's several times the weight value for nice=-19 (so not too small).
      
      A sketch of the clamp follows.
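      
      A minimal sketch of the clamp (a hedged reconstruction; the real change
      lives in the group-shares setter in kernel/sched.c):
      
      	#define MIN_SHARES	2
      	#define MAX_SHARES	(1UL << 18)
      
      	/* clamp user-supplied shares before they become a load weight */
      	if (shares < MIN_SHARES)
      		shares = MIN_SHARES;
      	else if (shares > MAX_SHARES)
      		shares = MAX_SHARES;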
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 10 June 2008, 1 commit
    • sched: fix TASK_WAKEKILL vs SIGKILL race · 16882c1e
      Oleg Nesterov authored
      schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case;
      this allows us to do
      
      	current->state = TASK_INTERRUPTIBLE;
      	schedule();
      
      without fear of sleeping with a pending signal.
      
      However, the code like
      
      	current->state = TASK_KILLABLE;
      	schedule();
      
      is not right: schedule() doesn't take TASK_WAKEKILL into account. This means
      that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
      and schedule_timeout_killable() can miss SIGKILL (and, btw, a second SIGKILL
      has no effect).
      
      Introduce the new helper, signal_pending_state(), and change schedule() to
      use it. Hopefully it will have more users, that is why the task's state is
      passed separately.
      
      Note the "__TASK_STOPPED | __TASK_TRACED" check in signal_pending_state().
      It is needed to preserve the current behaviour (ptrace_notify). I hope this
      check will be removed soon, but that (afaics good) change needs a separate
      discussion.
      
      The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
      basically the same thing schedule() does now. This patch does, of course,
      bloat schedule() a little. A sketch of the helper follows.
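      
      For illustration, the helper could look like this (a hedged sketch
      consistent with the description above; treat the exact form as
      approximate):
      
      	static inline int signal_pending_state(long state, struct task_struct *p)
      	{
      		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
      			return 0;
      		if (!signal_pending(p))
      			return 0;
      
      		/* preserve current ptrace_notify() behaviour */
      		if (state & (__TASK_STOPPED | __TASK_TRACED))
      			return 0;
      
      		return (state & TASK_INTERRUPTIBLE) ||
      			__fatal_signal_pending(p);
      	}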
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 29 May 2008, 4 commits
  8. 15 May 2008, 1 commit
  9. 12 May 2008, 1 commit
    • Add new 'cond_resched_bkl()' helper function · c3921ab7
      Linus Torvalds authored
      It acts exactly like a regular 'cond_resched()', but will not get
      optimized away when CONFIG_PREEMPT is set.
      
      Normal kernel code is already preemptible in the presence of
      CONFIG_PREEMPT, so cond_resched() is optimized away (see commit
      02b67cc3 "sched: do not do
      cond_resched() when CONFIG_PREEMPT").
      
      But when you want to conditionally reschedule while holding a lock, you
      need to use cond_resched_lock(lock), and the new function is the BKL
      equivalent of that.
      
      Also make fs/locks.c use it.
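      
      Usage is the natural analogue of cond_resched_lock() (an illustrative
      snippet only, not taken from the patch; the helpers in the loop body are
      placeholders):
      
      	lock_kernel();
      	while (more_work_to_do()) {		/* placeholder */
      		do_a_chunk_of_work();		/* placeholder */
      		/* yields if needed; unlike cond_resched(), this is
      		 * not compiled away when CONFIG_PREEMPT is set */
      		cond_resched_bkl();
      	}
      	unlock_kernel();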
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 11 May 2008, 1 commit
    • BKL: revert back to the old spinlock implementation · 8e3e076c
      Linus Torvalds authored
      The generic semaphore rewrite had a huge performance regression on AIM7
      (and potentially other BKL-heavy benchmarks) because the generic
      semaphores had been rewritten to be simple to understand and fair.  The
      latter, in particular, turns a semaphore-based BKL implementation into a
      mess of scheduling.
      
      The attempt to fix the performance regression failed miserably (see the
      previous commit 00b41ec2 'Revert
      "semaphore: fix"'), and so for now the simple and sane approach is to
      instead just go back to the old spinlock-based BKL implementation that
      never had any issues like this.
      
      This patch also has the advantage of being reported by Yanmin Zhang to
      fix the regression completely, unlike the semaphore hack, which still
      left a regression of a couple of percentage points.
      
      As a spinlock, the BKL obviously has the potential to be a latency
      issue, but it's not really any different from any other spinlock in that
      respect.  We do want to get rid of the BKL asap, but that has been the
      plan for several years.
      
      These days, the biggest users are in the tty layer (open/release in
      particular) and Alan holds out some hope:
      
        "tty release is probably a few months away from getting cured - I'm
         afraid it will almost certainly be the very last user of the BKL in
         tty to get fixed as it depends on everything else being sanely locked."
      
      so while we're not there yet, we do have a plan of action.
      Tested-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Alexander Viro <viro@ftp.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 06 May 2008, 10 commits
  12. 01 May 2008, 1 commit
    • rename div64_64 to div64_u64 · 6f6d6a1a
      Roman Zippel authored
      Rename div64_64 to div64_u64 to make it consistent with the other divide
      functions, so it clearly includes the type of the divide. Move its
      definition to math64.h, as currently no architecture overrides the
      generic implementation. They can still override it of course, but the
      duplicated declarations are avoided.
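      
      A small usage sketch (hedged; div64_u64() divides two u64 values and,
      after this move, is available via <linux/math64.h>):
      
      	#include <linux/math64.h>
      
      	u64 total_ns = 123456789ULL, nr_events = 42ULL;	/* example values */
      	/* average nanoseconds per event, with a 64-bit divisor */
      	u64 avg_ns = div64_u64(total_ns, nr_events);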
      Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Patrick McHardy <kaber@trash.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 29 April 2008, 2 commits
  14. 25 April 2008, 4 commits
  15. 23 April 2008, 1 commit
  16. 20 April 2008, 4 commits