1. 12 Aug 2015, 1 commit
  2. 24 Nov 2014, 1 commit
    • sched: Provide update_curr callbacks for stop/idle scheduling classes · 90e362f4
      Thomas Gleixner authored
      Chris bisected a NULL pointer dereference in task_sched_runtime() to
      commit 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
      inconsistency'.
      
      Chris observed crashes in atop or other /proc walking programs when he
      started fork bombs on his machine.  He assumed that this is a new exit
      race, but that does not make any sense when looking at that commit.
      
      What's interesting is that the commit provides update_curr callbacks
      for all scheduling classes except stop_task and idle_task.
      
      While nothing can ever hit that via the clock_nanosleep() and
      clock_gettime() interfaces, which have been the target of the commit in
      question, the author obviously forgot that there are other code paths
      which invoke task_sched_runtime():
      
      do_task_stat()
       thread_group_cputime_adjusted()
         thread_group_cputime()
           task_cputime()
             task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                update_rq_clock(rq);
        p->sched_class->update_curr(rq);
              }
      
      If the stats are read for a stomp machine task, aka 'migration/N', and
      that task is current on its CPU, this will happily call through the
      NULL pointer of stop_task->update_curr.  Oops.
      
      Chris' observation that this happens faster when he runs the fork bomb
      makes sense, as the fork bomb kicks the migration threads more often,
      so the probability of hitting the issue increases.
      
      Add the missing update_curr callbacks to the stop_task and idle_task
      scheduling classes.  While idle tasks cannot be monitored via /proc, we
      have other means to hit the idle case.
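      
      For illustration, a sketch of the idle_task side of the fix (stop_task
      gets the same treatment): a no-op callback is enough to make the call
      above safe.
      
      	static void update_curr_idle(struct rq *rq)
      	{
      	}
      
      	const struct sched_class idle_sched_class = {
      		/* ... existing callbacks ... */
      		.update_curr	= update_curr_idle,	/* previously missing */
      	};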
      
      Fixes: 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
      Reported-by: Chris Mason <clm@fb.com>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 16 Jul 2014, 1 commit
    • sched: Transform resched_task() into resched_curr() · 8875125e
      Kirill Tkhai authored
      We always use resched_task() with the rq->curr argument; it is not
      possible to reschedule any task but the rq's current one.
      
      The patch introduces resched_curr(struct rq *) to replace all of the
      repeated patterns. The main aim is cleanup, but there is a small size
      benefit too:
      
        (before)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155274	  16445	   7042	 178761	  2ba49	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411490	1178376	 991232	9581098	 92322a	vmlinux
      
        (after)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155130	  16445	   7042	 178617	  2b9b9	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411362	1178376	 991232	9580970	 9231aa	vmlinux
      
      	I was choosing between resched_curr() and resched_rq(),
      	and the first name looks better to me.
      
      There is a small lie in Documentation/trace/ftrace.txt: I have not
      actually re-collected the tracing output there, in the hope that the
      patch won't make execution times much worse :)
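      
      A minimal sketch of the conversion at a typical call site:
      
      	/* before: every call site spells out the only legal argument */
      	resched_task(rq->curr);
      
      	/* after: the runqueue itself identifies the task to preempt */
      	resched_curr(rq);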
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 11 Mar 2014, 1 commit
  5. 22 Feb 2014, 2 commits
  6. 11 Feb 2014, 1 commit
    • sched: Push down pre_schedule() and idle_balance() · 38033c37
      Peter Zijlstra authored
      This patch merges idle_balance() and pre_schedule() and pushes both
      of them into pick_next_task().
      
      Conceptually pre_schedule() and idle_balance() are rather similar,
      both are used to pull more work onto the current CPU.
      
      We cannot however first move idle_balance() into pre_schedule_fair()
      since there is no guarantee the last runnable task is a fair task, and
      thus we would miss newidle balances.
      
      Similarly, the dl and rt pre_schedule calls must be run before
      idle_balance(), since their respective tasks have higher priority and
      it would not do to delay their execution by searching for less
      important tasks first.
      
      However, by noticing that pick_next_task() already traverses the
      sched_class hierarchy in the right order, we can get the right
      behaviour and do away with both calls.
      
      We must, however, change the special-case optimization to also require
      that prev is of the fair sched_class, otherwise we can miss doing a dl
      or rt pull where we needed one.
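      
      A sketch of the resulting pick_next_task() core, assuming the usual
      for_each_class() iterator (stop, dl, rt, fair, idle, in priority
      order):
      
      	static inline struct task_struct *
      	pick_next_task(struct rq *rq, struct task_struct *prev)
      	{
      		const struct sched_class *class = &fair_sched_class;
      		struct task_struct *p;
      
      		/* optimization: only valid when prev is fair too,
      		 * otherwise a dl/rt pull could be missed */
      		if (likely(prev->sched_class == class &&
      			   rq->nr_running == rq->cfs.h_nr_running)) {
      			p = fair_sched_class.pick_next_task(rq, prev);
      			if (likely(p))
      				return p;
      		}
      
      		for_each_class(class) {
      			p = class->pick_next_task(rq, prev);
      			if (p)
      				return p;
      		}
      
      		BUG(); /* the idle class always has a runnable task */
      	}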
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 10 Feb 2014, 2 commits
  8. 09 Oct 2013, 1 commit
  9. 04 May 2013, 1 commit
    • sched: Keep at least 1 tick per second for active dynticks tasks · 265f22a9
      Frederic Weisbecker authored
      The scheduler doesn't yet fully support environments
      with a single task running without a periodic tick.
      
      In order to ensure we still maintain the duties of scheduler_tick(),
      keep at least 1 tick per second.
      
      This makes sure that we keep the progression of various scheduler
      accounting and background maintenance even with a very low granularity.
      Examples include CPU load, the sched average, CFS entity vruntime,
      avenrun, and events such as load balancing, amongst other details
      handled in sched_class::task_tick().
      
      This limitation will be removed in the future once we get these
      individual items to work on full dynticks CPUs.
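      
      The mechanism is a cap on how long the tick may be deferred; a sketch
      along the lines of the patch:
      
      	u64 scheduler_tick_max_deferment(void)
      	{
      		struct rq *rq = this_rq();
      		unsigned long next, now = ACCESS_ONCE(jiffies);
      
      		/* defer at most one second past the last tick */
      		next = rq->last_sched_tick + HZ;
      
      		if (time_before_eq(next, now))
      			return 0;
      
      		return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
      	}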
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
  10. 21 Apr 2013, 1 commit
    • sched: Fix wrong rq's runnable_avg update with rt tasks · 642dbc39
      Vincent Guittot authored
      The current update of the rq's load can be erroneous when RT
      tasks are involved.
      
      The update of the load of a rq that becomes idle is done only if
      avg_idle is less than sysctl_sched_migration_cost. If RT tasks and
      short idle durations alternate, the runnable_avg will not be updated
      correctly, and the time will be accounted as idle time when a CFS task
      wakes up.
      
      A new idle_enter function is called when the next task is the idle
      function, so that the elapsed time will be accounted as run time in
      the load of the rq, whatever the average idle time is. The function
      update_rq_runnable_avg() is removed from idle_balance().
      
      When an RT task is scheduled on an idle CPU, the update of the rq's
      load is not done when the rq exits the idle state, because CFS's
      functions are not called. Then idle_balance(), which is called just
      before entering the idle function, updates the rq's load and assumes
      that the elapsed time since the last update was only running time.
      
      As a consequence, the rq's load of a CPU that only runs a periodic RT
      task is close to LOAD_AVG_MAX, whatever the running duration of the RT
      task is.
      
      A new idle_exit function is called when the prev task is the idle
      function, so that the elapsed time will be accounted as idle time in
      the rq's load.
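      
      Both hooks reduce to updating the rq's runnable average at the idle
      boundary; a sketch:
      
      	/* next task is the idle task: account the elapsed time as
      	 * run time in the rq's load */
      	void idle_enter_fair(struct rq *this_rq)
      	{
      		update_rq_runnable_avg(this_rq, 1);
      	}
      
      	/* prev task was the idle task: account the elapsed time as
      	 * idle time in the rq's load */
      	void idle_exit_fair(struct rq *this_rq)
      	{
      		update_rq_runnable_avg(this_rq, 0);
      	}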
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: linaro-kernel@lists.linaro.org
      Cc: peterz@infradead.org
      Cc: pjt@google.com
      Cc: fweisbec@gmail.com
      Cc: efault@gmx.de
      Link: http://lkml.kernel.org/r/1366302867-5055-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 06 Jul 2012, 1 commit
  12. 07 May 2012, 1 commit
  13. 17 Nov 2011, 2 commits
  14. 14 Apr 2011, 1 commit
  15. 23 Mar 2011, 1 commit
  16. 26 Jan 2011, 2 commits
  17. 23 Apr 2010, 1 commit
  18. 03 Apr 2010, 2 commits
    • sched: Add enqueue/dequeue flags · 371fd7e7
      Peter Zijlstra authored
      In order to reduce the dependency on TASK_WAKING, rework the enqueue
      interface to support a proper flags field.
      
      Replace the 'int wakeup, bool head' arguments with an 'int flags'
      argument and create the following flags:
      
        ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
        ENQUEUE_WAKING - the enqueue has relative vruntime due to
                         having sched_class::task_waking() called,
        ENQUEUE_HEAD - the waking task should be placed on the head
                       of the priority queue (where appropriate).
      
      For symmetry, also convert sched_class::dequeue() to a flags scheme.
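      
      A sketch of the resulting flag definitions and class interface:
      
      	#define ENQUEUE_WAKEUP	1
      	#define ENQUEUE_WAKING	2
      	#define ENQUEUE_HEAD	4
      
      	struct sched_class {
      		/* ... */
      		void (*enqueue_task)(struct rq *rq, struct task_struct *p,
      				     int flags);
      		void (*dequeue_task)(struct rq *rq, struct task_struct *p,
      				     int flags);
      	};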
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix TASK_WAKING vs fork deadlock · 0017d735
      Peter Zijlstra authored
      Oleg noticed a few races with the TASK_WAKING usage on fork.
      
       - since TASK_WAKING is basically a spinlock, it should be IRQ safe
       - since we set TASK_WAKING (*) without holding rq->lock, it could be
         that there still is an rq->lock holder, thereby not actually
         providing full serialization.
      
      (*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.
      
      Cure the second issue by not setting TASK_WAKING in sched_fork(), but
      only temporarily in wake_up_new_task() while calling select_task_rq().
      
      Cure the first by holding rq->lock around the select_task_rq() call;
      this will disable IRQs. This, however, requires that we push the
      rq->lock release down into select_task_rq_fair()'s cgroup code.
      
      Because select_task_rq_fair() still needs to drop the rq->lock, we
      cannot fully get rid of TASK_WAKING.
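      
      A conceptual sketch of the reworked wake_up_new_task() ordering
      (illustrative only, not the verbatim patch):
      
      	rq = task_rq_lock(p, &flags);	/* rq->lock held, IRQs off */
      	p->state = TASK_WAKING;		/* set here, not in sched_fork() */
      	/* select_task_rq() now runs under rq->lock; the fair path
      	 * may drop and retake it for the cgroup walk */
      	cpu = select_task_rq(p, SD_BALANCE_FORK, 0);
      	set_task_cpu(p, cpu);
      	p->state = TASK_RUNNING;
      	activate_task(rq, p, 0);
      	task_rq_unlock(rq, &flags);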
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  19. 21 1月, 2010 1 次提交
  20. 17 1月, 2010 1 次提交
  21. 21 12月, 2009 1 次提交
    • sched: Restore printk sanity · 3df0fc5b
      Peter Zijlstra authored
      Revert the braindead pr_* crap. (Commit 663997d4 "sched: Use
      pr_fmt() and pr_<level>()")
      
      It's dumb and causes stupid "sched: " strings all over the place.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Cc: Joe Perches <joe@perches.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1261315437.4314.6.camel@laptop>
      [ I don't mind the pr_*() patterns that much - but Peter dislikes them with a vengeance. ]
      [ - v2: remove spurious diffstat from changelog :-/ ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  22. 15 12月, 2009 1 次提交
  23. 13 12月, 2009 1 次提交
    • sched: Use pr_fmt() and pr_<level>() · 663997d4
      Joe Perches authored
      - Convert printk(KERN_<level> ...) calls to pr_<level>() (except KERN_DEBUG)
       - Add #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
       - Coalesce long format strings
       - Add missing \n to "ERROR: !SD_LOAD_BALANCE domain has parent"
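      
      For illustration, the pattern looks like this (using the message from
      the last bullet):
      
      	#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
      
      	/* before */
      	printk(KERN_ERR "ERROR: !SD_LOAD_BALANCE domain has parent");
      
      	/* after: gains both the "sched: " prefix and the missing \n */
      	pr_err("ERROR: !SD_LOAD_BALANCE domain has parent\n");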
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1260655047.2637.7.camel@Joe-Laptop.home>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  24. 09 Dec 2009, 1 commit
  25. 21 Sep 2009, 1 commit
  26. 15 Sep 2009, 3 commits
  27. 15 May 2009, 1 commit
    • sched, timers: move calc_load() to scheduler · dce48a84
      Thomas Gleixner authored
      Dimitri Sivanich noticed that xtime_lock is held write-locked across
      calc_load(), which iterates over all online CPUs. That can cause long
      latencies for xtime_lock readers on large SMP systems.
      
      The load average calculation is a rough estimate anyway, so there is
      no real need to protect the readers against the update: it's not a
      problem if the avenrun array is updated while a reader copies the
      values.
      
      Instead of iterating over all online CPUs, let the scheduler_tick()
      code update the number of active tasks shortly before the avenrun
      update happens. The avenrun update itself is handled by the CPU which
      calls do_timer().
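      
      A sketch of the per-CPU side, along the lines of the patch: each
      scheduler_tick() folds its rq's delta of active tasks into a global
      counter that the do_timer() CPU later reads for the avenrun update.
      
      	static atomic_long_t calc_load_tasks;
      
      	static void calc_load_account_active(struct rq *this_rq)
      	{
      		long nr_active, delta;
      
      		nr_active = this_rq->nr_running;
      		nr_active += (long)this_rq->nr_uninterruptible;
      
      		if (nr_active != this_rq->calc_load_active) {
      			delta = nr_active - this_rq->calc_load_active;
      			this_rq->calc_load_active = nr_active;
      			atomic_long_add(delta, &calc_load_tasks);
      		}
      	}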
      
      [ Impact: reduce xtime_lock write locked section ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
  28. 22 Oct 2008, 1 commit
  29. 22 Sep 2008, 1 commit
  30. 06 May 2008, 1 commit
  31. 26 Jan 2008, 3 commits
    • sched: high-res preemption tick · 8f4d37ec
      Peter Zijlstra authored
      Use HR-timers (when available) to deliver an accurate preemption tick.
      
      The regular scheduler tick that runs at 1/HZ can be too coarse when
      nice levels are used. The fairness system will still keep the CPU
      utilisation 'fair' by delaying the task that got an excessive amount
      of CPU time, but tries to minimize this by delivering preemption
      points spot-on.
      
      The average frequency of this extra interrupt is sched_latency /
      nr_latency, which need not be higher than 1/HZ; it's just that the
      distribution within the sched_latency period is important.
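      
      A sketch of the CFS side, assuming the per-rq hrtimer introduced by
      the patch: arm the timer for the remainder of the current task's
      slice.
      
      	static void hrtick_start_fair(struct rq *rq, struct task_struct *p)
      	{
      		struct sched_entity *se = &p->se;
      		u64 slice = sched_slice(cfs_rq_of(se), se);
      		u64 ran = se->sum_exec_runtime - se->prev_sum_exec_runtime;
      		s64 delta = slice - ran;
      
      		if (delta < 0) {
      			/* slice already used up: preempt right away */
      			if (rq->curr == p)
      				resched_task(p);
      			return;
      		}
      		hrtimer_start(&rq->hrtick_timer, ns_to_ktime(delta),
      			      HRTIMER_MODE_REL);
      	}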
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: RT-balance, add new methods to sched_class · cb469845
      Steven Rostedt authored
      Dmitry Adamushko found that the current implementation of the RT
      balancing code left out changes to sched_setscheduler() and
      rt_mutex_setprio().
      
      This patch addresses the issue by adding methods to the scheduling
      classes to handle being switched out of (switched_from) and into
      (switched_to) a sched_class. A method for priority changes
      (prio_changed) is also added.
      
      This patch also removes some duplicate logic between rt_mutex_setprio and
      sched_setscheduler.
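      
      The new hooks look roughly like this (the running argument tells the
      class whether the task is currently executing on this rq):
      
      	struct sched_class {
      		/* ... */
      		void (*switched_from)(struct rq *this_rq, struct task_struct *task,
      				      int running);
      		void (*switched_to)(struct rq *this_rq, struct task_struct *task,
      				    int running);
      		void (*prio_changed)(struct rq *this_rq, struct task_struct *task,
      				     int oldprio, int running);
      	};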
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: de-SCHED_OTHER-ize the RT path · e7693a36
      Gregory Haskins authored
      The current wake-up code path tries to determine whether it can
      optimize the wake-up to "this_cpu" by performing load calculations.
      The problem is that these calculations are only relevant to
      SCHED_OTHER tasks, where load is king. For RT tasks, priority is king,
      so the load calculation is completely wasted bandwidth.
      
      Therefore, we create a new sched_class interface to help with
      pre-wakeup routing decisions and move the load calculation so that it
      is a function of the CFS task class.
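      
      A sketch of the new hook: each class can route its own wakeups, so RT
      can decide by priority while CFS keeps the load-based logic.
      
      	struct sched_class {
      		/* ... */
      	#ifdef CONFIG_SMP
      		int (*select_task_rq)(struct task_struct *p, int sync);
      	#endif
      	};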
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>