1. 12 12月, 2008 2 次提交
  2. 10 12月, 2008 1 次提交
    • B
      sched: CPU remove deadlock fix · 9a2bd244
      Brian King 提交于
      Impact: fix possible deadlock in CPU hot-remove path
      
      This patch fixes a possible deadlock scenario in the CPU remove path.
      migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
      with the lock held. Then one of the tasks on the migration queue ends up
      calling tg_shares_up which then also tries to acquire the same rq->lock.
      
      [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
      [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
      [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
      [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
      [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
      [c000000058eab640] c00000000007ada4 .complete+0x54/0x80
      [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
      [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
      [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
      [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
      [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
      [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
      [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
      [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
      [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
      [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
      Signed-off-by: NBrian King <brking@linux.vnet.ibm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9a2bd244
  3. 08 12月, 2008 2 次提交
  4. 02 12月, 2008 1 次提交
  5. 30 11月, 2008 1 次提交
    • I
      sched: prevent divide by zero error in cpu_avg_load_per_task, update · af6d596f
      Ingo Molnar 提交于
      Regarding the bug addressed in:
      
        4cd42620: sched: prevent divide by zero error in cpu_avg_load_per_task
      
      Linus points out that the fix is not complete:
      
      > There's nothing that keeps gcc from deciding not to reload
      > rq->nr_running.
      >
      > Of course, in _practice_, I don't think gcc ever will (if it decides
      > that it will spill, gcc is likely going to decide that it will
      > literally spill the local variable to the stack rather than decide to
      > reload off the pointer), but it's a valid compiler optimization, and
      > it even has a name (rematerialization).
      >
      > So I suspect that your patch does fix the bug, but it still leaves the
      > fairly unlikely _potential_ for it to re-appear at some point.
      >
      > We have ACCESS_ONCE() as a macro to guarantee that the compiler
      > doesn't rematerialize a pointer access. That also would clarify
      > the fact that we access something unsafe outside a lock.
      
      So make sure our nr_running value is immutable and cannot change
      after we check it for nonzero.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      af6d596f
  6. 29 11月, 2008 1 次提交
    • A
      sched: move double_unlock_balance() higher · 70574a99
      Alexey Dobriyan 提交于
      Move double_lock_balance()/double_unlock_balance() higher to fix the following
      with gcc-3.4.6:
      
         CC      kernel/sched.o
       In file included from kernel/sched.c:1605:
       kernel/sched_rt.c: In function `find_lock_lowest_rq':
       kernel/sched_rt.c:914: sorry, unimplemented: inlining failed in call to 'double_unlock_balance': function body not available
       kernel/sched_rt.c:1077: sorry, unimplemented: called from here
       make[2]: *** [kernel/sched.o] Error 1
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      70574a99
  7. 27 11月, 2008 1 次提交
    • S
      sched: prevent divide by zero error in cpu_avg_load_per_task · 4cd42620
      Steven Rostedt 提交于
      Impact: fix divide by zero crash in scheduler rebalance irq
      
      While testing the branch profiler, I hit this crash:
      
      divide error: 0000 [#1] PREEMPT SMP
      [...]
      RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
      [...]
      Call Trace:
       <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
       [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
       [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
       [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
       [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
       [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
       [<ffffffff8021047a>] do_softirq+0x94/0x1cd
       [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
       [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
       [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20
      
      The code for cpu_avg_load_per_task has:
      
      	if (rq->nr_running)
      		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
      
      The runqueue lock is not held here, and there is nothing that prevents
      the rq->nr_running from going to zero after it passes the if condition.
      
      The branch profiler simply made the race window bigger.
      
      This patch saves off the rq->nr_running to a local variable and uses that
      for both the condition and the division.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4cd42620
  8. 21 11月, 2008 1 次提交
    • V
      sched: update comment for move_task_off_dead_cpu · 957ad016
      Vegard Nossum 提交于
      Impact: cleanup
      
      This commit:
      
      commit f7b4cddc
      Author: Oleg Nesterov <oleg@tv-sign.ru>
      Date:   Tue Oct 16 23:30:56 2007 -0700
      
          do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(ta
      
          Currently move_task_off_dead_cpu() is called under
          write_lock_irq(tasklist).  This means it can't use task_lock() which is
          needed to improve migrating to take task's ->cpuset into account.
      
          Change the code to call move_task_off_dead_cpu() with irqs enabled, and
          change migrate_live_tasks() to use read_lock(tasklist).
      
      ...forgot to update the comment in front of move_task_off_dead_cpu.
      
      Reference: http://lkml.org/lkml/2008/6/23/135Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      957ad016
  9. 20 11月, 2008 1 次提交
    • K
      sched: fix inconsistency when redistribute per-cpu tg->cfs_rq shares · ec4e0e2f
      Ken Chen 提交于
      Impact: make load-balancing more consistent
      
      In the update_shares() path leading to tg_shares_up(), the calculation of
      per-cpu cfs_rq shares is rather erratic even under moderate task wake up
      rate.  The problem is that the per-cpu tg->cfs_rq load weight used in the
      sd_rq_weight aggregation and actual redistribution of the cfs_rq->shares
      are collected at different time.  Under moderate system load, we've seen
      quite a bit of variation on the cfs_rq->shares and ultimately wildly
      affects sched_entity's load weight.
      
      This patch caches the result of initial per-cpu load weight when doing the
      sum calculation, and then pass it down to update_group_shares_cpu() for
      redistributing per-cpu cfs_rq shares.  This allows consistent total cfs_rq
      shares across all CPUs. It also simplifies the rounding and zero load
      weight check.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ec4e0e2f
  10. 18 11月, 2008 1 次提交
    • L
      cpuset: fix regression when failed to generate sched domains · 700018e0
      Li Zefan 提交于
      Impact: properly rebuild sched-domains on kmalloc() failure
      
      When cpuset failed to generate sched domains due to kmalloc()
      failure, the scheduler should fallback to the single partition
      'fallback_doms' and rebuild sched domains, but now it only
      destroys but not rebuilds sched domains.
      
      The regression was introduced by:
      
      | commit dfb512ec
      | Author: Max Krasnyansky <maxk@qualcomm.com>
      | Date:   Fri Aug 29 13:11:41 2008 -0700
      |
      |    sched: arch_reinit_sched_domains() must destroy domains to force rebuild
      
      After the above commit, partition_sched_domains(0, NULL, NULL) will
      only destroy sched domains and partition_sched_domains(1, NULL, NULL)
      will create the default sched domain.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      700018e0
  11. 13 11月, 2008 1 次提交
    • I
      sched: fix init_idle()'s use of sched_clock() · 5cbd54ef
      Ingo Molnar 提交于
      Maciej Rutecki reported:
      
      > I have this bug during suspend to disk:
      >
      > [  188.592151] Enabling non-boot CPUs ...
      > [  188.592151] SMP alternatives: switching to SMP code
      > [  188.666058] BUG: using smp_processor_id() in preemptible
      > [00000000]
      > code: suspend_to_disk/2934
      > [  188.666064] caller is native_sched_clock+0x2b/0x80
      
      Which, as noted by Linus, was caused by me, via:
      
        7cbaef9c "sched: optimize sched_clock() a bit"
      
      Move the rq locking a bit earlier in the initialization sequence,
      that will make the sched_clock() call in init_idle() non-preemptible.
      Reported-by: NMaciej Rutecki <maciej.rutecki@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5cbd54ef
  12. 12 11月, 2008 1 次提交
  13. 11 11月, 2008 2 次提交
  14. 10 11月, 2008 1 次提交
  15. 07 11月, 2008 4 次提交
  16. 05 11月, 2008 1 次提交
    • P
      sched: backward looking buddy · 4793241b
      Peter Zijlstra 提交于
      Impact: improve/change/fix wakeup-buddy scheduling
      
      Currently we only have a forward looking buddy, that is, we prefer to
      schedule to the task we last woke up, under the presumption that its
      going to consume the data we just produced, and therefore will have
      cache hot benefits.
      
      This allows co-waking producer/consumer task pairs to run ahead of the
      pack for a little while, keeping their cache warm. Without this, we
      would interleave all pairs, utterly trashing the cache.
      
      This patch introduces a backward looking buddy, that is, suppose that
      in the above scenario, the consumer preempts the producer before it
      can go to sleep, we will therefore miss the wakeup from consumer to
      producer (its already running, after all), breaking the cycle and
      reverting to the cache-trashing interleaved schedule pattern.
      
      The backward buddy will try to schedule back to the task that woke us
      up in case the forward buddy is not available, under the assumption
      that the last task will be the one with the most cache hot task around
      barring current.
      
      This will basically allow a task to continue after it got preempted.
      
      In order to avoid starvation, we allow either buddy to get wakeup_gran
      ahead of the pack.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4793241b
  17. 04 11月, 2008 3 次提交
  18. 30 10月, 2008 1 次提交
  19. 29 10月, 2008 1 次提交
  20. 24 10月, 2008 2 次提交
  21. 23 10月, 2008 1 次提交
  22. 20 10月, 2008 1 次提交
    • P
      sched: optimize group load balancer · ffda12a1
      Peter Zijlstra 提交于
      I noticed that tg_shares_up() unconditionally takes rq-locks for all cpus
      in the sched_domain. This hurts.
      
      We need the rq-locks whenever we change the weight of the per-cpu group sched
      entities. To allevate this a little, only change the weight when the new
      weight is at least shares_thresh away from the old value.
      
      This avoids the rq-lock for the top level entries, since those will never
      be re-weighted, and fuzzes the lower level entries a little to gain performance
      in semi-stable situations.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ffda12a1
  23. 16 10月, 2008 3 次提交
  24. 14 10月, 2008 1 次提交
    • M
      tracing, sched: LTTng instrumentation - scheduler · 0a16b607
      Mathieu Desnoyers 提交于
      Instrument the scheduler activity (sched_switch, migration, wakeups,
      wait for a task, signal delivery) and process/thread
      creation/destruction (fork, exit, kthread stop). Actually, kthread
      creation is not instrumented in this patch because it is architecture
      dependent. It allows to connect tracers such as ftrace which detects
      scheduling latencies, good/bad scheduler decisions. Tools like LTTng can
      export this scheduler information along with instrumentation of the rest
      of the kernel activity to perform post-mortem analysis on the scheduler
      activity.
      
      About the performance impact of tracepoints (which is comparable to
      markers), even without immediate values optimizations, tests done by
      Hideo Aoki on ia64 show no regression. His test case was using hackbench
      on a kernel where scheduler instrumentation (about 5 events in code
      scheduler code) was added. See the "Tracepoints" patch header for
      performance result detail.
      
      Changelog :
      
      - Change instrumentation location and parameter to match ftrace
        instrumentation, previously done with kernel markers.
      
      [ mingo@elte.hu: conflict resolutions ]
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: N'Peter Zijlstra' <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0a16b607
  25. 09 10月, 2008 1 次提交
    • I
      sched debug: add name to sched_domain sysctl entries · a5d8c348
      Ingo Molnar 提交于
      add /proc/sys/kernel/sched_domain/cpu0/domain0/name, to make
      it easier to see which specific scheduler domain remained at
      that entry.
      
      Since we process the scheduler domain tree and
      simplify it, it's not always immediately clear during debugging
      which domain came from where.
      
      depends on CONFIG_SCHED_DEBUG=y.
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a5d8c348
  26. 06 10月, 2008 1 次提交
  27. 30 9月, 2008 1 次提交
  28. 29 9月, 2008 1 次提交
    • T
      hrtimer: prevent migration of per CPU hrtimers · ccc7dadf
      Thomas Gleixner 提交于
      Impact: per CPU hrtimers can be migrated from a dead CPU
      
      The hrtimer code has no knowledge about per CPU timers, but we need to
      prevent the migration of such timers and warn when such a timer is
      active at migration time.
      
      Explicitely mark the timers as per CPU and use a more understandable
      mode descriptor for the interrupts safe unlocked callback mode, which
      is used by hrtimer_sleeper and the scheduler code.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      ccc7dadf
  29. 28 9月, 2008 1 次提交