1. 20 Apr 2008, 4 commits
  2. 26 Mar 2008, 1 commit
    • NOHZ: reevaluate idle sleep length after add_timer_on() · 06d8308c
      Committed by Thomas Gleixner
      add_timer_on() can add a timer on a CPU which is currently in a long
      idle sleep, but the timer wheel is not reevaluated by the nohz code on
      that CPU. So a timer can be delayed for quite a long time. This
      triggered a false positive in the clocksource watchdog code.
      
      To avoid this we need to wake up the idle CPU and enforce the
      reevaluation of the timer wheel for the next timer event.
      
      Add a function that checks a given CPU for idle state, marks the
      idle task with NEED_RESCHED and sends a reschedule IPI to notify the
      other CPU of the change in the timer wheel.
      
      Call this function from add_timer_on().
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org
      
      --
       include/linux/sched.h |    6 ++++++
       kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
       kernel/timer.c        |   10 +++++++++-
       3 files changed, 58 insertions(+), 1 deletion(-)
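
      A minimal sketch of the idea, assuming a helper along the lines described
      above that add_timer_on() calls for the target CPU (the name and the exact
      checks are illustrative, not the verbatim mainline code):

      void wake_up_idle_cpu(int cpu)
      {
              struct rq *rq = cpu_rq(cpu);

              if (cpu == smp_processor_id())
                      return;

              /* Only an idle CPU needs the kick; a busy CPU will pick up
               * the new timer on its own. */
              if (rq->curr != rq->idle)
                      return;

              /* Mark the idle task as needing a reschedule and send the
               * reschedule IPI, so the CPU leaves its nohz sleep and
               * reevaluates the timer wheel for the next event. */
              set_tsk_need_resched(rq->idle);
              smp_send_reschedule(cpu);
      }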
  3. 21 Mar 2008, 4 commits
  4. 19 Mar 2008, 2 commits
    • sched: wakeup-buddy tasks are cache-hot · f540a608
      Committed by Ingo Molnar
      Wakeup-buddy tasks are cache-hot - this makes it a bit harder
      for the load-balancer to tear them apart. (It is still possible,
      if the load is sufficiently asymmetric.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
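
      A rough sketch of how such a check could look in the load balancer's
      task_hot() test. It is illustrative only: it assumes the 'next' buddy
      pointer introduced by the buddy-wakeup patch further down this list, and
      the real condition differs in detail.

      static int task_hot(struct task_struct *p, u64 now)
      {
              s64 delta;

              /* A wakeup buddy shares cache with its waker: report it as
               * cache-hot so the load balancer keeps the pair together
               * unless the imbalance really forces a migration. */
              if (&p->se == task_cfs_rq(p)->next)
                      return 1;

              /* Otherwise fall back to the usual recent-execution check. */
              delta = now - p->se.exec_start;
              return delta < (s64)sysctl_sched_migration_cost;
      }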
    • sched: improve affine wakeups · 4ae7d5ce
      Committed by Ingo Molnar
      Improve affine wakeups. Maintain an 'overlap' metric based on CFS's
      sum_exec_runtime: the amount of time a task keeps executing after it
      wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      it means there's strong workload coupling between this task and the
      woken up task. If the 'overlap' is large then the workload is decoupled
      and the scheduler will move them to separate CPUs more easily.
      
      ( Also slightly move the preempt_check within try_to_wake_up() - this has
        no effect on functionality but allows 'early wakeups' (for still-on-rq
        tasks) to be correctly accounted as well.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
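
      A sketch of the 'overlap' bookkeeping and of how it could feed the affine
      wakeup decision. Field and helper names follow the description above and
      are illustrative rather than the exact mainline code.

      /* Called when 'se' wakes another task: how long has it been running
       * since its own last wakeup?  Keep a crude running average. */
      static void update_overlap(struct sched_entity *se)
      {
              u64 overlap = se->sum_exec_runtime - se->last_wakeup;

              se->avg_overlap = (se->avg_overlap * 7 + overlap) / 8;
              se->last_wakeup = se->sum_exec_runtime;
      }

      /* Affine wakeup: pull the wakee to the waker's CPU only when both
       * show a short overlap, i.e. the workloads are tightly coupled. */
      static int affine_wakeup_ok(struct sched_entity *waker,
                                  struct sched_entity *wakee)
      {
              return waker->avg_overlap < sysctl_sched_migration_cost &&
                     wakee->avg_overlap < sysctl_sched_migration_cost;
      }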
  5. 15 Mar 2008, 4 commits
    • sched: fix overload performance: buddy wakeups · aa2ac252
      Committed by Peter Zijlstra
      Currently we schedule to the leftmost task in the runqueue. When the
      runtimes are very short because of some server/client ping-pong,
      especially in over-saturated workloads, this will cycle through all
      tasks, thrashing the cache.

      Reduce cache thrashing by keeping dependent tasks together, running
      newly woken tasks first. However, by not running the leftmost task first
      we could starve tasks, because the wakee can gain unlimited runtime.

      Therefore we only run the wakee if it is within a small
      (wakeup_granularity) window of the leftmost task. This preserves
      fairness, but does alternate between the server/client task groups.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
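
      A sketch of the pick logic described above (illustrative; 'next' stands
      for the most recently woken buddy and the granularity check is
      simplified):

      static struct sched_entity *pick_next(struct cfs_rq *cfs_rq)
      {
              struct sched_entity *leftmost = __pick_next_entity(cfs_rq);
              struct sched_entity *buddy = cfs_rq->next;

              /* Run the freshly woken buddy, but only if it has not drifted
               * more than the wakeup granularity ahead of the leftmost
               * task -- otherwise fairness wins and the leftmost runs. */
              if (buddy && (s64)(buddy->vruntime - leftmost->vruntime) <
                           (s64)sysctl_sched_wakeup_granularity)
                      return buddy;

              return leftmost;
      }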
    • sched: fix calc_delta_mine() · 27d11726
      Committed by Ingo Molnar
      lw->weight can be 0 for a short time during bootup.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • sched: fix update_load_add()/sub() · e89996ae
      Committed by Ingo Molnar
      Clear the cached inverse value when updating load. This is needed for
      calc_delta_mine() to work correctly when using the rq load.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
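
      The fix boils down to resetting the cached inverse whenever the weight
      changes, roughly like this (sketch):

      static inline void update_load_add(struct load_weight *lw, unsigned long inc)
      {
              lw->weight += inc;
              lw->inv_weight = 0;     /* force calc_delta_mine() to recompute 1/weight */
      }

      static inline void update_load_sub(struct load_weight *lw, unsigned long dec)
      {
              lw->weight -= dec;
              lw->inv_weight = 0;
      }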
    • sched: fix race in schedule() · 0e1f3483
      Committed by Hiroshi Shimamoto
      Fix a hard to trigger crash seen in the -rt kernel that also affects
      the vanilla scheduler.
      
      There is a race condition between schedule() and some dequeue/enqueue
      functions; rt_mutex_setprio(), __setscheduler() and sched_move_task().
      
      When scheduling to idle, idle_balance() is called to pull tasks from
      other busy processors, and it might drop the rq lock. This means those
      three functions can see a task with on_rq=0 and running=1, and a task
      that is still running must be properly put.
      
      Here is a possible scenario:
      
         CPU0                               CPU1
          |                              schedule()
          |                              ->deactivate_task()
          |                              ->idle_balance()
          |                              -->load_balance_newidle()
      rt_mutex_setprio()                     |
          |                              --->double_lock_balance()
          *get lock                          *rel lock
          * on_rq=0, running=1               |
          * sched_class is changed           |
          *rel lock                          *get lock
          :                                  |
                                             :
                                         ->put_prev_task_rt()
                                         ->pick_next_task_fair()
                                             => panic
      
      The current task of CPU1 (P1) is being scheduled out. P1 is
      deactivated, and the scheduler looks for another task on other CPUs'
      runqueues because CPU1 will become idle. idle_balance(),
      load_balance_newidle() and double_lock_balance() are called, and
      double_lock_balance() can drop the rq lock. Meanwhile, CPU0 is
      boosting the priority of P1. As a result, only P1's prio and
      sched_class are changed to RT; the sched entities of P1 and P1's
      group are never put. This leaves the cfs_rq invalid, because it has
      curr but no leaf, so when pick_next_task_fair() is called the kernel
      panics.
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
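
      The resulting pattern in rt_mutex_setprio(), __setscheduler() and
      sched_move_task() looks roughly like the fragment below (a sketch: the
      point is that 'running' is handled independently of 'on_rq', so a task
      that was already deactivated but is still running gets put properly
      before its class/prio changes and restored afterwards):

              on_rq = p->se.on_rq;
              running = task_current(rq, p);
              if (on_rq)
                      dequeue_task(rq, p, 0);
              if (running)
                      p->sched_class->put_prev_task(rq, p);

              /* ... change p->prio / p->sched_class here ... */

              if (running)
                      p->sched_class->set_curr_task(rq);
              if (on_rq)
                      enqueue_task(rq, p, 0);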
  6. 11 Mar 2008, 2 commits
    • keep rd->online and cpu_online_map in sync · 08f503b0
      Committed by Gregory Haskins
      The root-domain cache of online cpus can get out of sync with the
      global cpu_online_map, because we currently trigger removal of cpus
      too early in the notifier chain. Other DOWN_PREPARE handlers may in
      fact run and reconfigure the root-domain topology, thereby stomping
      on our own offline handling.
      
      The end result is that rd->online may become out of sync with
      cpu_online_map, which results in potential task misrouting.
      
      So change the offline handling to be more tightly coupled with the
      global offline process by triggering on CPU_DYING instead of
      CPU_DOWN_PREPARE.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
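
      A sketch of what the CPU_DYING handling amounts to (illustrative; the
      real patch wires this into the scheduler's existing hotplug notifier and
      its root-domain helpers):

      static int sched_cpu_dying(unsigned long action, int cpu)
      {
              struct rq *rq = cpu_rq(cpu);
              unsigned long flags;

              if (action != CPU_DYING && action != CPU_DYING_FROZEN)
                      return NOTIFY_DONE;

              /* The machine is stopped at CPU_DYING time, so no other
               * notifier can rebuild the domains underneath us: drop the
               * cpu from its root domain's online mask in lockstep with
               * cpu_online_map. */
              spin_lock_irqsave(&rq->lock, flags);
              if (rq->rd) {
                      BUG_ON(!cpu_isset(cpu, rq->rd->span));
                      cpu_clear(cpu, rq->rd->online);
              }
              spin_unlock_irqrestore(&rq->lock, flags);

              return NOTIFY_OK;
      }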
    • Revert "cpu hotplug: adjust root-domain->online span in response to hotplug event" · 1f94ef59
      Committed by Gregory Haskins
      This reverts commit 393d94d9.
      
      Let's fix this right.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Cc: Gautham R Shenoy <ego@in.ibm.com>
      Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 10 Mar 2008, 1 commit
  8. 07 Mar 2008, 5 commits
  9. 05 Mar 2008, 1 commit
    • sched: revert load_balance_monitor() changes · 62fb1851
      Committed by Peter Zijlstra
      The following commits cause a number of regressions:
      
        commit 58e2d4ca
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduling, change how cpu load is calculated
      
        commit 6b2d7700
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
      
      Namely:
       - very frequent wakeups on SMP, reported by PowerTop users.
       - cacheline thrashing on (large) SMP
       - some latencies larger than 500ms
      
      While there is a mergeable patch to fix the latter, the former issues
      are not fixable in a manner suitable for .25 (we're at -rc3 now).
      
      Hence we revert them and try again in v2.6.26.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 25 Feb 2008, 2 commits
  11. 24 Feb 2008, 2 commits
    • Add memory barrier semantics to wake_up() & co · 04e2f174
      Committed by Linus Torvalds
      Oleg Nesterov and others have pointed out that on some architectures,
      the traditional sequence of
      
      	set_current_state(TASK_INTERRUPTIBLE);
      	if (CONDITION)
      		return;
      	schedule();
      
      is racy wrt another CPU doing
      
      	CONDITION = 1;
      	wake_up_process(p);
      
      because while set_current_state() has a memory barrier separating
      setting of the TASK_INTERRUPTIBLE state from reading of the CONDITION
      variable, there is no such memory barrier on the wakeup side.
      
      Now, wake_up_process() does actually take a spinlock before it reads and
      sets the task state on the waking side, and on x86 (and many other
      architectures) that spinlock is in fact equivalent to a memory barrier,
      but that is not generally guaranteed.  The write that sets CONDITION
      could move into the critical region protected by the runqueue spinlock.
      
      However, adding an smp_wmb() before the spinlock should now order the
      writing of CONDITION wrt the lock itself, which in turn is ordered wrt
      the accesses within the spinlock (which includes the reading of the old
      state).
      
      This should thus close the race (which probably has never been seen in
      practice, but since smp_wmb() is a no-op on x86, it's not like this will
      make anything worse either on the most common architecture where the
      spinlock already gave the required protection).
      Acked-by: Oleg Nesterov <oleg@tv-sign.ru>
      Acked-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
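
      The fix itself is essentially a one-liner in the wakeup path; a sketch of
      where the barrier sits (an illustrative fragment, not the full function):

      static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
      {
              unsigned long flags;
              struct rq *rq;

              /* Order the caller's "CONDITION = 1" store before we take the
               * runqueue lock and inspect p->state, closing the race with
               * the sleeper's set_current_state()/schedule() sequence. */
              smp_wmb();
              rq = task_rq_lock(p, &flags);

              /* ... check p->state, activate and wake the task as usual ... */

              task_rq_unlock(rq, &flags);
              return 1;
      }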
    • kprobes: refuse kprobe insertion on add/sub_preempt_counter() · 43627582
      Committed by Srinivasa Ds
      Kprobes makes use of preempt_disable() and preempt_enable_noresched(),
      and these functions in turn call add/sub_preempt_count(). So we need
      to prevent users from inserting probes into these functions.

      This patch disallows users from probing add/sub_preempt_count().
      Signed-off-by: Srinivasa DS <srinivasa@in.ibm.com>
      Acked-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
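
      The usual mechanism for this is the __kprobes annotation, which places a
      function in the .kprobes.text section that register_kprobe() refuses to
      probe. A sketch assuming that mechanism (the real functions also carry
      their debugging checks):

      void __kprobes add_preempt_count(int val)
      {
              preempt_count() += val;
      }

      void __kprobes sub_preempt_count(int val)
      {
              preempt_count() -= val;
      }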
  12. 13 Feb 2008, 7 commits
  13. 09 Feb 2008, 1 commit
  14. 01 Feb 2008, 1 commit
  15. 30 Jan 2008, 1 commit
    • spinlock: lockbreak cleanup · 95c354fe
      Committed by Nick Piggin
      The break_lock data structure and code for spinlocks is quite nasty.
      Not only does it double the size of a spinlock but it changes locking to
      a potentially less optimal trylock.
      
      Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
      __raw_spin_is_contended that uses the lock data itself to determine whether
      there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
      not set.
      
      Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
      decouple it from the spinlock implementation, and make it typesafe (rwlocks
      do not have any need_lockbreak sites -- why do they even get bloated up
      with that break_lock then?).
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
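
      A typical use of the renamed helper (a sketch; struct my_list and
      do_one_item() are hypothetical): break out of a heavily contended lock at
      a safe point so that waiters, and a pending reschedule, can get in.

      static void process_all(struct my_list *list, spinlock_t *lock)
      {
              spin_lock(lock);
              while (!list_empty(&list->head)) {
                      do_one_item(list);      /* hypothetical per-item work */

                      /* spin_needbreak() uses spin_is_contended(): only drop
                       * the lock if somebody is actually spinning on it, or
                       * if we ourselves need to reschedule. */
                      if (spin_needbreak(lock) || need_resched())
                              cond_resched_lock(lock);
              }
              spin_unlock(lock);
      }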
  16. 26 Jan 2008, 2 commits