1. 22 Feb 2014 (1 commit)
    • sched: Fix hotplug task migration · 3f1d2a31
      Authored by Peter Zijlstra
      Dan Carpenter reported:
      
      > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338)
      > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005)
      
      Kirill also spotted that migrate_tasks() will have an instant NULL
      deref because pick_next_task() will immediately deref prev.
      
      Instead of fixing all the corner cases that arise because
      migrate_tasks() can pass in a NULL prev task in the unlikely case
      of hot-unplug, provide a fake task so that we can remove all the
      NULL checks from the far more common paths.
      
      A further problem, not previously spotted, is that because we pushed
      pre_schedule() and idle_balance() into pick_next_task() we now need to
      avoid those getting called and pulling more tasks onto our dying CPU.
      
      We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1.
      We also note that since we call pick_next_task() exactly as many times
      as we have runnable tasks present, we should never land in
      idle_balance().
      
      Fixes: 38033c37 ("sched: Push down pre_schedule() and idle_balance()")
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Reported-by: Kirill Tkhai <tkhai@yandex.ru>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
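
      For reference, the shape of the fix (a sketch reconstructed from the
      changelog; the actual kernel code may differ in detail):

          /*
           * Hand migrate_tasks() a fake prev task instead of NULL, so
           * pick_next_task() and the per-class methods never need NULL
           * checks for prev.
           */
          static void put_prev_task_fake(struct rq *rq, struct task_struct *prev)
          {
              /* nothing to put back: the fake task never really ran */
          }

          static const struct sched_class fake_sched_class = {
              .put_prev_task = put_prev_task_fake,
          };

          static struct task_struct fake_task = {
              /* MAX_PRIO + 1 keeps pull_{dl,rt}_task() from triggering */
              .prio        = MAX_PRIO + 1,
              .sched_class = &fake_sched_class,
          };
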
  2. 11 Feb 2014 (1 commit)
    • sched: Push down pre_schedule() and idle_balance() · 38033c37
      Authored by Peter Zijlstra
      This patch merges idle_balance() and pre_schedule() and pushes
      both of them into pick_next_task().
      
      Conceptually pre_schedule() and idle_balance() are rather similar,
      both are used to pull more work onto the current CPU.
      
      We cannot however first move idle_balance() into pre_schedule_fair()
      since there is no guarantee the last runnable task is a fair task, and
      thus we would miss newidle balances.
      
      Similarly, the dl and rt pre_schedule calls must be run before
      idle_balance() since their respective tasks have higher priority and
      it would not do to delay their execution by searching for less
      important tasks first.
      
      However, by noticing that pick_next_task() already traverses the
      sched_class hierarchy in the right order, we can get the right
      behaviour and do away with both calls.
      
      We must however change the special case optimization to also require
      that prev is of the fair sched_class, otherwise we can miss doing a
      dl or rt pull where we needed one.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
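
      A simplified sketch of the resulting pick_next_task() (names follow
      the changelog; the details are assumptions):

          static inline struct task_struct *
          pick_next_task(struct rq *rq, struct task_struct *prev)
          {
              const struct sched_class *class = &fair_sched_class;
              struct task_struct *p;

              /*
               * Optimization: if prev is a fair task and every runnable
               * task is fair, ask the fair class directly. Requiring prev
               * to be fair is what keeps a needed dl/rt pull from being
               * missed.
               */
              if (likely(prev->sched_class == class &&
                         rq->nr_running == rq->cfs.h_nr_running)) {
                  p = fair_sched_class.pick_next_task(rq, prev);
                  if (likely(p))
                      return p;
              }

              /*
               * Walking the classes in priority order (highest first) lets
               * each class pull work before a lower class is consulted,
               * subsuming both pre_schedule() and idle_balance().
               */
              for_each_class(class) {
                  p = class->pick_next_task(rq, prev);
                  if (p)
                      return p;
              }

              BUG(); /* the idle class should always have a runnable task */
          }
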
  3. 10 Feb 2014 (2 commits)
  4. 09 Oct 2013 (1 commit)
  5. 04 May 2013 (1 commit)
    • sched: Keep at least 1 tick per second for active dynticks tasks · 265f22a9
      Authored by Frederic Weisbecker
      The scheduler doesn't yet fully support environments
      with a single task running without a periodic tick.
      
      In order to ensure we still maintain the duties of scheduler_tick(),
      keep at least 1 tick per second.
      
      This makes sure that we keep the progression of various scheduler
      accounting and background maintenance even with a very low granularity.
      Examples include cpu load, sched average, CFS entity vruntime,
      avenrun and events such as load balancing, amongst other details
      handled in sched_class::task_tick().
      
      This limitation will be removed in the future once we get
      these individual items to work in full dynticks CPUs.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
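
      One plausible shape for the 1-second cap (a sketch; the function name
      and the rq field are modeled on the kernel's NO_HZ_FULL work and are
      assumptions, not quoted from the patch):

          #ifdef CONFIG_NO_HZ_FULL
          /*
           * Bound how long the tick may stay stopped: at most one second
           * past the last time scheduler_tick() ran on this runqueue.
           */
          u64 scheduler_tick_max_deferment(void)
          {
              struct rq *rq = this_rq();
              unsigned long next, now = READ_ONCE(jiffies);

              next = rq->last_sched_tick + HZ;    /* one second of jiffies */

              if (time_before_eq(next, now))
                  return 0;                       /* a tick is already due */

              return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
          }
          #endif
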
  6. 21 Apr 2013 (1 commit)
    • sched: Fix wrong rq's runnable_avg update with rt tasks · 642dbc39
      Authored by Vincent Guittot
      The current update of the rq's load can be erroneous when RT
      tasks are involved.
      
      The load of a rq that becomes idle is updated only if avg_idle is
      less than sysctl_sched_migration_cost. If RT tasks and short idle
      durations alternate, the runnable_avg will not be updated correctly
      and the time will be accounted as idle time when a CFS task wakes up.

      A new idle_enter function is called when the next task is the idle
      task, so the elapsed time will be accounted as run time in the load
      of the rq, whatever the average idle time is. The function
      update_rq_runnable_avg is removed from idle_balance.

      When a RT task is scheduled on an idle CPU, the update of the rq's
      load is not done when the rq exits the idle state, because CFS's
      functions are not called. Then idle_balance, which is called just
      before entering the idle task, updates the rq's load on the
      assumption that the elapsed time since the last update was only
      running time.

      As a consequence, the rq's load of a CPU that only runs a periodic
      RT task is close to LOAD_AVG_MAX, whatever the running duration of
      the RT task is.

      A new idle_exit function is called when the prev task is the idle
      task, so the elapsed time will be accounted as idle time in the
      rq's load.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: linaro-kernel@lists.linaro.org
      Cc: peterz@infradead.org
      Cc: pjt@google.com
      Cc: fweisbec@gmail.com
      Cc: efault@gmx.de
      Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
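
      A sketch of the two hooks (modeled directly on the changelog; it
      assumes the existing update_rq_runnable_avg() helper):

          /* called when the rq is about to switch to the idle task */
          void idle_enter_fair(struct rq *this_rq)
          {
              /* account the elapsed time as run time in the rq's load */
              update_rq_runnable_avg(this_rq, 1);
          }

          /* called when the rq leaves the idle task */
          void idle_exit_fair(struct rq *this_rq)
          {
              /* account the elapsed time as idle time in the rq's load */
              update_rq_runnable_avg(this_rq, 0);
          }
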
  7. 06 Jul 2012 (1 commit)
  8. 07 May 2012 (1 commit)
  9. 17 Nov 2011 (2 commits)
  10. 14 Apr 2011 (1 commit)
  11. 23 Mar 2011 (1 commit)
  12. 26 Jan 2011 (2 commits)
  13. 23 Apr 2010 (1 commit)
  14. 03 Apr 2010 (2 commits)
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Authored by Peter Zijlstra
      In order to reduce the dependency on TASK_WAKING, rework the enqueue
      interface to support a proper flags field.
      
      Replace the int wakeup, bool head arguments with an int flags argument
      and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed at the head
                       of the priority queue (where appropriate).
      
      For symmetry also convert sched_class::dequeue() to a flags scheme.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
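
      The resulting interface, as described above (the flag values are
      illustrative; the changelog does not list them):

          #define ENQUEUE_WAKEUP  1   /* a wakeup of a sleeping task */
          #define ENQUEUE_WAKING  2   /* vruntime relative: task_waking() ran */
          #define ENQUEUE_HEAD    4   /* place at the head of the queue */

          /* in struct sched_class, replacing "int wakeup" / "bool head": */
          void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
          void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
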
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Authored by Peter Zijlstra
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, it could
         be that there still is a rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call;
      this will disable IRQs, but it requires that we push the rq->lock
      release down into select_task_rq_fair()'s cgroup code.
      
      Because select_task_rq_fair() still needs to drop the rq->lock we
      cannot fully get rid of TASK_WAKING.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
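
      A simplified sketch of the reworked wake_up_new_task() flow (an
      assumption-laden reconstruction from the changelog, not the patch;
      error paths and the re-lock of the new CPU's rq after set_task_cpu()
      are omitted):

          void wake_up_new_task(struct task_struct *p, unsigned long clone_flags)
          {
              unsigned long flags;
              struct rq *rq;
              int cpu;

              rq = task_rq_lock(p, &flags);   /* rq->lock held, IRQs disabled */
              p->state = TASK_WAKING;         /* set here, not in sched_fork() */
              /*
               * select_task_rq_fair()'s cgroup path may drop and retake
               * rq->lock internally; TASK_WAKING covers that window.
               */
              cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
              set_task_cpu(p, cpu);
              p->state = TASK_RUNNING;
              activate_task(rq, p, 0);        /* not a wakeup of a sleeper */
              task_rq_unlock(rq, &flags);
          }
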
  15. 21 Jan 2010 (1 commit)
  16. 17 Jan 2010 (1 commit)
  17. 21 Dec 2009 (1 commit)
    • sched: Restore printk sanity · 3df0fc5b
      Authored by Peter Zijlstra
      Revert the braindead pr_* crap. (Commit 663997d4 "sched: Use
      pr_fmt() and pr_<level>()")
      
      It's dumb and causes stupid "sched: " strings all over the place.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1261315437.4314.6.camel@laptop>
      [ I don't mind the pr_*() patterns that much, but Peter dislikes them with a vengeance. ]
      [ - v2: remove spurious diffstat from changelog :-/ ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 15 Dec 2009 (1 commit)
  19. 13 Dec 2009 (1 commit)
    • sched: Use pr_fmt() and pr_<level>() · 663997d4
      Authored by Joe Perches
      - Convert printk(KERN_<level> ...) to pr_<level>() (except KERN_DEBUG)
      - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
      - Coalesce long format strings
      - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent"
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
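
      The pattern in miniature (an illustrative snippet, not the patch):

          /* must come before any includes that pull in printk */
          #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
          #include <linux/kernel.h>

          static void example(void)
          {
              /* before: printk(KERN_ERR "ERROR: ...\n"); */
              pr_err("ERROR: !SD_LOAD_BALANCE domain has parent\n");
              /* printed as: "sched: ERROR: !SD_LOAD_BALANCE domain has parent" */
          }
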
  20. 09 Dec 2009 (1 commit)
  21. 21 Sep 2009 (1 commit)
  22. 15 Sep 2009 (3 commits)
  23. 15 May 2009 (1 commit)
    • sched, timers: move calc_load() to scheduler · dce48a84
      Authored by Thomas Gleixner
      Dimitri Sivanich noticed that xtime_lock is held write locked across
      calc_load() which iterates over all online CPUs. That can cause long
      latencies for xtime_lock readers on large SMP systems. 
      
      The load average calculation is a rough estimate anyway, so there is
      no real need to protect the readers against the update. It is not a
      problem if the avenrun array is updated while a reader copies the
      values.

      Instead of iterating over all online CPUs, let the scheduler_tick
      code update the number of active tasks shortly before the avenrun
      update happens. The avenrun update itself is handled by the CPU
      which calls do_timer().
      
      [ Impact: reduce xtime_lock write locked section ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
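
      For context, the fixed-point arithmetic behind avenrun (the standard
      kernel load-average math, shown as a sketch; the exact form in the
      patch may differ):

          #define FSHIFT   11                 /* bits of fixed-point precision */
          #define FIXED_1  (1 << FSHIFT)      /* 1.0 in fixed point */
          #define EXP_1    1884               /* 1/exp(5sec/1min), fixed point */

          /* avenrun update: load = load*exp + active*(1 - exp), fixed point */
          static unsigned long
          calc_load(unsigned long load, unsigned long exp, unsigned long active)
          {
              load *= exp;
              load += active * (FIXED_1 - exp);
              return load >> FSHIFT;
          }
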
  24. 22 Oct 2008 (1 commit)
  25. 22 Sep 2008 (1 commit)
  26. 06 May 2008 (1 commit)
  27. 26 Jan 2008 (3 commits)
    • sched: high-res preemption tick · 8f4d37ec
      Authored by Peter Zijlstra
      Use HR-timers (when available) to deliver an accurate preemption tick.
      
      The regular scheduler tick that runs at 1/HZ can be too coarse when
      nice levels are used. The fairness system will still keep the cpu
      utilisation 'fair' by later delaying a task that got an excessive
      amount of CPU time, but it tries to minimize this by delivering
      preemption points spot-on.

      The average frequency of this extra interrupt is sched_latency /
      nr_latency. This need not be higher than 1/HZ; it is just that the
      distribution within the sched_latency period is important.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
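
      A sketch of the CFS side (names modeled on the kernel's hrtick code;
      treat the details as assumptions):

          /*
           * Arm a one-shot hrtimer to fire exactly when the current task's
           * fair slice runs out, instead of waiting for the next 1/HZ tick.
           */
          static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
          {
              struct sched_entity *se = &p->se;
              u64 slice = sched_slice(cfs_rq_of(se), se);
              u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
              s64 delta = slice - ran;

              if (delta < 0) {
                  /* slice already exhausted: preempt right away */
                  if (rq->curr == p)
                      resched_task(p);
                  return;
              }

              hrtick_start(rq, delta);    /* spot-on preemption point */
          }
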
    • sched: RT-balance, add new methods to sched_class · cb469845
      Authored by Steven Rostedt
      Dmitry Adamushko found that the current implementation of the RT
      balancing code left out changes to sched_setscheduler() and
      rt_mutex_setprio().
      
      This patch addresses this issue by adding methods to the sched classes
      to handle being switched out of (switched_from) and being switched into
      (switched_to) a sched_class. A method for priority changes
      (prio_changed) is also added.
      
      This patch also removes some duplicate logic between rt_mutex_setprio and
      sched_setscheduler.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
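
      The hooks as the changelog names them (the signatures are a sketch,
      not quoted from the patch):

          /* additions to struct sched_class (sketch): */
          void (*switched_from)(struct rq *this_rq, struct task_struct *task,
                                int running);
          void (*switched_to)(struct rq *this_rq, struct task_struct *task,
                              int running);
          void (*prio_changed)(struct rq *this_rq, struct task_struct *task,
                               int oldprio, int running);
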
    • sched: de-SCHED_OTHER-ize the RT path · e7693a36
      Authored by Gregory Haskins
      The current wake-up code path tries to determine if it can optimize the
      wake-up to "this_cpu" by performing load calculations.  The problem is
      that these calculations are only relevant to SCHED_OTHER tasks, where
      load is king.  For RT tasks, priority is king, so the load calculation
      is completely wasted bandwidth.
      
      Therefore, we create a new sched_class interface to help with
      pre-wakeup routing decisions and move the load calculation into the
      CFS class, where it belongs.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
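
      The shape of the hook this introduces (the name select_task_rq matches
      later kernels; the exact signature here is an assumption):

          /*
           * Pre-wakeup routing moves behind the class interface: CFS keeps
           * its load-based placement, while RT can route purely on priority.
           */
          int (*select_task_rq)(struct task_struct *p, int sync);
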
  28. 25 Oct 2007 (2 commits)
    • sched: isolate SMP balancing code a bit more · 681f3e68
      Authored by Peter Williams
      At the moment, a lot of load balancing code that is irrelevant to
      non-SMP systems gets included in non-SMP builds.
      
      This patch addresses this issue and reduces the binary size on
      non-SMP systems:
      
         text    data     bss     dec     hex filename
        10983      28    1192   12203    2fab sched.o.before
        10739      28    1192   11959    2eb7 sched.o.after
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
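
      The mechanism is plain conditional compilation (an illustrative
      sketch, not the patch itself):

          /* in struct sched_class: balancing hooks exist only on SMP builds */
          #ifdef CONFIG_SMP
              int (*move_one_task)(struct rq *this_rq, int this_cpu,
                                   struct rq *busiest,
                                   struct sched_domain *sd,
                                   enum cpu_idle_type idle);
          #endif /* CONFIG_SMP */
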
    • sched: reduce balance-tasks overhead · e1d1484f
      Authored by Peter Williams
      At the moment, balance_tasks() provides low level functionality for
      both move_tasks() and move_one_task() (indirectly) via the
      load_balance() function (in the sched_class interface), which also
      provides dual functionality.  This dual functionality complicates the
      interfaces and internal mechanisms and increases the run time overhead
      of operations that are called with two run queue locks held.
      
      This patch addresses this issue and reduces the overhead of these
      operations.
      Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
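
      A sketch of the separated operations (signatures modeled on kernels
      of that era; treat them as assumptions):

          /*
           * Instead of one hook doing double duty, the bulk-move and
           * single-task-move operations are exposed separately:
           */
          unsigned long (*load_balance)(struct rq *this_rq, int this_cpu,
                                        struct rq *busiest,
                                        unsigned long max_load_move,
                                        struct sched_domain *sd,
                                        enum cpu_idle_type idle,
                                        int *all_pinned, int *this_best_prio);

          int (*move_one_task)(struct rq *this_rq, int this_cpu,
                               struct rq *busiest, struct sched_domain *sd,
                               enum cpu_idle_type idle);
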
  29. 15 Oct 2007 (3 commits)