1. 18 May, 2015 1 commit
    • P
      sched,perf: Fix periodic timers · 4cfafd30
      Peter Zijlstra committed
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward such that it will always correctly deal with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect if you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
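The external timer_active approach can be modelled in user space. The following is a minimal sketch, not the kernel code: struct periodic_timer, timer_forward(), timer_handler() and timer_restart() are hypothetical names, time is a plain nanosecond counter, and the lock that would serialize handler and restart is only indicated in comments.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; these are not kernel types or APIs. */
struct periodic_timer {
	uint64_t expires;      /* next expiry, in ns */
	uint64_t period;       /* period, in ns (must be non-zero) */
	bool     timer_active; /* external state: true while the timer is logically running */
};

/* Move the expiry past 'now' in whole periods; returns the overrun count. */
static uint64_t timer_forward(struct periodic_timer *t, uint64_t now)
{
	uint64_t overruns;

	if (now < t->expires)
		return 0;
	overruns = (now - t->expires) / t->period + 1;
	t->expires += overruns * t->period;
	return overruns;
}

/* Handler side: always forward, then stop when idle.
 * In real code this would run under the same lock as timer_restart(). */
static bool timer_handler(struct periodic_timer *t, uint64_t now, bool idle)
{
	timer_forward(t, now);
	if (idle) {
		t->timer_active = false; /* publish the decision to stop */
		return false;            /* do not re-arm */
	}
	return true;                     /* re-arm at t->expires */
}

/* Restart side: only touch the timer once the handler has stopped it. */
static bool timer_restart(struct periodic_timer *t, uint64_t now)
{
	if (t->timer_active)
		return false;            /* handler still owns the forward */
	t->timer_active = true;
	timer_forward(t, now);
	return true;                     /* caller would arm the timer here */
}
```

In this model a restart that arrives while timer_active is still set does nothing, so the handler's forward never sees a timer that was re-enqueued behind its back.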
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
  2. 22 Apr, 2015 1 commit
    • P
      sched: Cleanup bandwidth timers · 77a4d1a1
      Peter Zijlstra committed
      Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
      
      The more I look at that code the more I'm convinced it's crack, that
      entire __start_cfs_bandwidth() thing is brain melting, we don't need to
      cancel a timer before starting it, *hrtimer_start*() will happily remove
      the timer for you if it's still enqueued.
      
      Removing that, removes a big part of the problem, no more ugly cancel
      loop to get stuck in.
      
      So now, if I understand things right, the entire reason you have this
      cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
      accidentally lose the timer.
      
      It appears to me that it should be possible to guarantee the same by
      unconditionally (re)starting the timer when !queued. Because regardless
      what hrtimer::function will return, if we beat it to (re)enqueue the
      timer, it doesn't matter.
      
      Now, because hrtimers don't come with any serialization guarantees we
      must ensure both handler and (re)start loop serialize their access to
      the hrtimer to avoid both trying to forward the timer at the same
      time.
      
      Update the rt bandwidth timer to match.
      
      This effectively reverts: 09dc4ab0 ("sched/fair: Fix
      tg_set_cfs_bandwidth() deadlock on rq->lock").
      Reported-by: Roman Gushchin <klamm@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  3. 02 Apr, 2015 1 commit
  4. 23 Mar, 2015 1 commit
    • S
      sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Steven Rostedt committed
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balancing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, to only find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
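The traversal order described above can be sketched as a circular walk over a CPU bitmask. This is an illustrative model only: next_overloaded_cpu(), NR_CPUS_MODEL and the 64-bit mask are assumptions, and the priority checks and the restart flag are left out.

```c
#include <stdint.h>

#define NR_CPUS_MODEL 64

/*
 * Hypothetical helper: return the next CPU set in 'overloaded' after 'prev',
 * walking circularly and stopping before the source CPU 'src' is reached
 * again. Start the walk with prev == src. Returns -1 once every overloaded
 * CPU other than 'src' has been visited.
 */
static int next_overloaded_cpu(uint64_t overloaded, int src, int prev)
{
	int cpu = prev;

	for (;;) {
		cpu = (cpu + 1) % NR_CPUS_MODEL;
		if (cpu == src)
			return -1;          /* wrapped around: traversal done */
		if (overloaded & (UINT64_C(1) << cpu))
			return cpu;         /* next IPI target */
	}
}
```

Each visited CPU would try to push to the source CPU and then hand the IPI on to the CPU returned by the next call, so only one IPI is in flight at a time.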
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 31 Jan, 2015 1 commit
  6. 14 Jan, 2015 1 commit
  7. 16 Nov, 2014 2 commits
    • W
      sched: Move p->nr_cpus_allowed check to select_task_rq() · 6c1d9410
      Wanpeng Li committed
      Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
      This change will make fair.c, rt.c, and deadline.c all start with the
      same logic.
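The shape of the change can be sketched as follows. This is a simplified model under assumptions: struct task_model and the class_select_rq hook are hypothetical stand-ins for the task structure and the per-class placement callback, and example_class_select_rq() exists only to make the sketch self-contained.

```c
/* Hypothetical stand-ins for the task structure and a scheduling class hook. */
struct task_model {
	int nr_cpus_allowed;                  /* number of CPUs the task may run on */
	int (*class_select_rq)(int prev_cpu); /* per-class placement policy */
};

/* Common entry point: the affinity check lives here, once, for all classes. */
static int select_task_rq_model(const struct task_model *p, int prev_cpu)
{
	int cpu = prev_cpu;

	/* A task pinned to a single CPU has nothing to choose from. */
	if (p->nr_cpus_allowed > 1)
		cpu = p->class_select_rq(prev_cpu);
	return cpu;
}

/* Example class hook used only for illustration: prefer the next CPU. */
static int example_class_select_rq(int prev_cpu)
{
	return prev_cpu + 1;
}
```

With the check hoisted into the common path, each class hook can assume it is only called when there is more than one candidate CPU.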
      Suggested-and-Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "pang.xunlei" <pang.xunlei@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Stanislaw Gruszka committed
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case at the cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      in Y being smaller than X.
      
      A reproducer/tester can be found further below; it can be compiled and run with:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      The issue happens because on start, in thread_group_cputimer(), we initialize
      the cputimer's sum_exec_runtime with thread runtime that is not yet accounted,
      and then add that runtime to the running cputimer again on the scheduler
      tick, making its sum_exec_runtime bigger than the actual thread runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes a different approach to cure the problem. It calls
      update_curr() when the cputimer starts, which ensures we have updated
      stats for the running threads, so that on the next scheduler tick we account
      only the runtime that elapsed since the cputimer started. That also ensures
      a consistent state between the CPU times of the individual threads and the
      CPU time of the process consisting of those threads.
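The double accounting can be illustrated with a small arithmetic model. This is a sketch under assumptions, not the kernel code: struct thread_model, fold_pending() and the two cputimer_after_tick_*() helpers are hypothetical, and a single thread stands in for the whole thread group.

```c
#include <stdint.h>

struct thread_model {
	uint64_t accounted; /* runtime already folded into the thread's total */
	uint64_t pending;   /* runtime executed since the last accounting point */
};

/* Analogue of update_curr(): fold pending runtime in and return the delta. */
static uint64_t fold_pending(struct thread_model *t)
{
	uint64_t delta = t->pending;

	t->accounted += delta;
	t->pending = 0;
	return delta;
}

/* Buggy start: the snapshot already includes pending runtime, and the next
 * tick adds that same pending runtime to the timer a second time. */
static uint64_t cputimer_after_tick_buggy(struct thread_model *t, uint64_t ran)
{
	uint64_t timer = t->accounted + t->pending; /* start of the cputimer */

	t->pending += ran;                          /* thread keeps running */
	timer += fold_pending(t);                   /* scheduler tick */
	return timer;
}

/* Fixed start: account pending runtime first, then take the snapshot. */
static uint64_t cputimer_after_tick_fixed(struct thread_model *t, uint64_t ran)
{
	uint64_t timer;

	fold_pending(t);                            /* update_curr() at start */
	timer = t->accounted;
	t->pending += ran;
	timer += fold_pending(t);                   /* scheduler tick */
	return timer;
}
```

Starting from 1000 ns accounted and 300 ns pending, then running 200 ns more, the buggy variant reports 1800 ns while the thread has actually run 1500 ns; the fixed variant reports 1500 ns.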
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
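      			/* tv_sec and tv_nsec are not unsigned long long; cast to match %llu. */
      			printf("Before %llu.%09llu\n", (unsigned long long)before.tv_sec, (unsigned long long)before.tv_nsec);
      			printf("After  %llu.%09llu\n", (unsigned long long)after.tv_sec, (unsigned long long)after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", (unsigned long long)sleeptimeabs.tv_sec, (unsigned long long)sleeptimeabs.tv_nsec);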
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  8. 04 Nov, 2014 1 commit
  9. 24 Sep, 2014 1 commit
  10. 19 Sep, 2014 1 commit
  11. 28 Aug, 2014 1 commit
    • C
      percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t · 4ba29684
      Christoph Lameter committed
      __get_cpu_var can paper over differences in the definitions of
      cpumask_var_t and either use the address of the cpumask variable
      directly or perform a fetch of the address of the struct cpumask
      allocated elsewhere. This is important particularly when using per cpu
      cpumask_var_t declarations because in one case we have an offset into
      a per cpu area to handle and in the other case we need to fetch a
      pointer from the offset.
      
      This patch introduces a new macro
      
      this_cpu_cpumask_var_ptr()
      
      that is defined where cpumask_var_t is defined and performs the proper
      actions. All use cases where __get_cpu_var is used with cpumask_var_t
      are converted to the use of this_cpu_cpumask_var_ptr().
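The ambiguity comes from the variable type having two possible layouts. The sketch below models that in plain C without any per-CPU machinery: struct mask_model, mask_var_t, mask_var_ptr() and the MODEL_OFFSTACK switch are hypothetical names chosen for illustration, not the kernel definitions.

```c
struct mask_model {
	unsigned long bits[1];
};

#ifdef MODEL_OFFSTACK
/* The variable holds a pointer to a mask allocated elsewhere. */
typedef struct mask_model *mask_var_t;
#define mask_var_ptr(v) (v)          /* fetch the stored pointer */
#else
/* The variable is the storage itself, as a one-element array. */
typedef struct mask_model mask_var_t[1];
#define mask_var_ptr(v) (&(v)[0])    /* take the address of the storage */
#endif

/* Callers use one spelling for both layouts. */
static void mask_set_bit(struct mask_model *m, unsigned int bit)
{
	m->bits[0] |= 1UL << bit;
}
```

A caller writes mask_set_bit(mask_var_ptr(m), 3) in either configuration; the macro decides whether that means "use the address of m" or "use the pointer stored in m".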
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  12. 20 Aug, 2014 1 commit
    • K
      sched: Add wrapper for checking task_struct::on_rq · da0c1e65
      Kirill Tkhai committed
      Implement task_on_rq_queued() and use it everywhere instead of
      on_rq check. No functional changes.
      
      The only exception is that we do not use the wrapper in
      check_for_tasks(), because that would require exporting
      task_on_rq_queued() in global header files. The next patch in the series
      restores it there, so we do not shuffle it back and forth here.
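A wrapper of this kind might look like the sketch below. It is a model under assumptions: struct task_model stands in for task_struct, and the TASK_ON_RQ_QUEUED constant and its value are illustrative rather than quoted from the patch.

```c
/* Illustrative state value; on_rq is 0 when the task is not queued. */
#define TASK_ON_RQ_QUEUED 1

/* Hypothetical stand-in for task_struct, reduced to the one field used here. */
struct task_model {
	int on_rq;
};

/* Name the check once so call sites stop open-coding "if (p->on_rq)". */
static inline int task_on_rq_queued(const struct task_model *p)
{
	return p->on_rq == TASK_ON_RQ_QUEUED;
}
```

Hiding the field behind a helper keeps behaviour unchanged for the two existing values while leaving room for further on_rq states later.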
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 16 Jul, 2014 1 commit
    • K
      sched: Transform resched_task() into resched_curr() · 8875125e
      Kirill Tkhai committed
      We always use resched_task() with rq->curr argument.
      It's not possible to reschedule any task but rq's current.
      
      The patch introduces resched_curr(struct rq *) to
      replace all of the repeating patterns. The main aim
      is cleanup, but there is a little size profit too:
      
        (before)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155274	  16445	   7042	 178761	  2ba49	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411490	1178376	 991232	9581098	 92322a	vmlinux
      
        (after)
      	$ size kernel/sched/built-in.o
      	   text	   data	    bss	    dec	    hex	filename
      	155130	  16445	   7042	 178617	  2b9b9	kernel/sched/built-in.o
      
      	$ size vmlinux
      	   text	   data	    bss	    dec	    hex	filename
      	7411362	1178376	 991232	9580970	 9231aa	vmlinux
      
      	I was choosing between resched_curr() and resched_rq(),
      	and the first name looks better for me.
      
      A little lie in Documentation/trace/ftrace.txt. I have not
      actually collected the tracing again. With a hope the patch
      won't make execution times much worse :)
      Signed-off-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140628200219.1778.18735.stgit@localhost
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 05 Jul, 2014 1 commit
  15. 05 Jun, 2014 3 commits
  16. 22 May, 2014 1 commit
  17. 18 Apr, 2014 5 commits
  18. 17 Apr, 2014 1 commit
  19. 11 Mar, 2014 2 commits
  20. 27 Feb, 2014 2 commits
    • P
      sched: Guarantee task priority in pick_next_task() · 37e117c0
      Peter Zijlstra committed
      Michael spotted that the idle_balance() push down created a task
      priority problem.
      
      Previously, when we called idle_balance() before pick_next_task() it
      wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
      task slipped in.
      
      Similarly for pre_schedule(), rt pre-schedule could have a dl task
      slip in.
      
      But by pulling it into the pick_next_task() loop, we'll not try a
      higher task priority again.
      
      Cure this by creating a re-start condition in pick_next_task(); and
      triggering this from pick_next_task_{rt,fair}().
      
      It also fixes a live-lock where we get stuck in pick_next_task_fair()
      due to idle_balance() seeing !0 nr_running but there not actually
      being any fair tasks about.
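The re-start condition can be sketched as a class loop with a retry sentinel. This is a simplified model under assumptions: RETRY_TASK is an illustrative sentinel name, struct rq_model and the two pick functions are hypothetical, and the rt_arrives_during_balance field simulates a higher-priority task slipping in while the lock is dropped.

```c
#include <stddef.h>
#include <stdint.h>

struct task_model { const char *name; };

/* Illustrative sentinel: "a higher class may have work now, start over". */
#define RETRY_TASK ((struct task_model *)(intptr_t)-1)

struct rq_model {
	int rt_queued;                 /* an rt task is runnable */
	int fair_queued;               /* a fair task is runnable */
	int rt_arrives_during_balance; /* simulates an rt task slipping in */
};

static struct task_model rt_task = { "rt" };
static struct task_model fair_task = { "fair" };

static struct task_model *pick_rt(struct rq_model *rq)
{
	return rq->rt_queued ? &rt_task : NULL;
}

static struct task_model *pick_fair(struct rq_model *rq)
{
	if (!rq->fair_queued && rq->rt_arrives_during_balance) {
		/* Idle-balance analogue: lock was dropped, an rt task appeared. */
		rq->rt_arrives_during_balance = 0;
		rq->rt_queued = 1;
		return RETRY_TASK;
	}
	return rq->fair_queued ? &fair_task : NULL;
}

typedef struct task_model *(*pick_fn)(struct rq_model *);

/* Classes in priority order, highest first. */
static const pick_fn classes[] = { pick_rt, pick_fair };

static struct task_model *pick_next_task_model(struct rq_model *rq)
{
	size_t i;

again:
	for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++) {
		struct task_model *p = classes[i](rq);

		if (p == RETRY_TASK)
			goto again;    /* re-check the higher classes */
		if (p)
			return p;
	}
	return NULL;                   /* nothing runnable in this model */
}
```

Without the retry, the loop would continue past the fair class and never look at the rt class again, which is the priority problem described above.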
      Reported-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Fixes: 38033c37 ("sched: Push down pre_schedule() and idle_balance()")
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • J
      sched/deadline: Prevent rt_time growth to infinity · faa59937
      Juri Lelli committed
      Kirill Tkhai noted:
      
        Since deadline tasks share rt bandwidth, we must care about
        bandwidth timer set. Otherwise rt_time may grow up to infinity
        in update_curr_dl(), if there are no other available RT tasks
        on top level bandwidth.
      
      RT tasks were in fact throttled right after they got enqueued,
      and never executed again (rt_time never again went below rt_runtime).
      
      Peter then proposed to accrue DL execution on rt_time only when
      the rt timer is active, and proposed a patch (this patch is a slight
      modification of that) to implement that behavior. While this
      solves Kirill's problem, it has a drawback.
      
      Indeed, Kirill noted again:
      
        It looks we may get into a situation, when all CPU time is shared
        between RT and DL tasks:
      
        rt_runtime = n
        rt_period  = 2n
      
        | RT working, DL sleeping  | DL working, RT sleeping      |
        -----------------------------------------------------------
        | (1)     duration = n     | (2)     duration = n         | (repeat)
        |--------------------------|------------------------------|
        | (rt_bw timer is running) | (rt_bw timer is not running) |
      
        No time for fair tasks at all.
      
      While this can happen during the first period, if rq is always backlogged,
      RT tasks won't have the opportunity to execute anymore: rt_time reached
      rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
      throttled since rt timer didn't fire, replenishment is from now on eaten up
      by DL tasks that accrue their execution on rt_time (while rt timer is
      active - we have an RT task waiting for replenishment). FAIR tasks are
      not touched after this first period. Ok, this is not ideal, and the situation
      is even worse!
      
      The above (the nice case) practically never happens in reality, where
      your rt timer is not aligned to task periods, tasks are in general not
      periodic, etc. Long story short, you always risk overloading your system.
      
      This patch is based on Peter's idea, but exploits an additional fact:
      if you don't have RT tasks enqueued, it makes little sense to continue
      incrementing rt_time once you reached the upper limit (DL tasks have their
      own mechanism for throttling).
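One reading of this rule can be sketched as follows. It is a model under assumptions: struct rt_bw_model and both helpers are hypothetical, and "the rt period timer is active" is used as a stand-in for "an RT task is waiting for replenishment".

```c
#include <stdbool.h>
#include <stdint.h>

struct rt_bw_model {
	bool     timer_active; /* rt period timer is running */
	uint64_t rt_time;      /* runtime charged in this period */
	uint64_t rt_runtime;   /* upper limit per period */
};

/* Charge DL execution while the timer runs, or until the limit is reached. */
static bool should_account_dl(const struct rt_bw_model *b)
{
	return b->timer_active || b->rt_time < b->rt_runtime;
}

static void accrue_dl_runtime(struct rt_bw_model *b, uint64_t delta_exec)
{
	if (should_account_dl(b))
		b->rt_time += delta_exec;
}
```

With the timer inactive, rt_time stops growing once it crosses rt_runtime, so it ends up only slightly above the limit instead of growing without bound.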
      
      This cures both problems:
      
       - no matter how many DL instances in the past, you'll have an rt_time
         slightly above rt_runtime when an RT task is enqueued, and from that
         point on (after the first replenishment), the task will normally execute;
      
       - you can still eat up all bandwidth during the first period, but not
         anymore after that, remember that DL execution will increment rt_time
         till the upper limit is reached.
      
      The situation is still not perfect! But, we have a simple solution for now,
      that limits how much you can jeopardize your system, as we keep working
      towards the right answer: RT groups scheduled using deadline servers.
      Reported-by: Kirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 23 Feb, 2014 1 commit
  22. 22 Feb, 2014 2 commits
  23. 11 Feb, 2014 1 commit
    • P
      sched: Push down pre_schedule() and idle_balance() · 38033c37
      Peter Zijlstra committed
      This patch merges idle_balance() and pre_schedule() and pushes
      both of them into pick_next_task().
      
      Conceptually pre_schedule() and idle_balance() are rather similar,
      both are used to pull more work onto the current CPU.
      
      We cannot however first move idle_balance() into pre_schedule_fair()
      since there is no guarantee the last runnable task is a fair task, and
      thus we would miss newidle balances.
      
      Similarly, the dl and rt pre_schedule calls must be run before
      idle_balance() since their respective tasks have higher priority and
      it would not do to delay their execution searching for less important
      tasks first.
      
      However, by noticing that pick_next_task() already traverses the
      sched_class hierarchy in the right order, we can get the right
      behaviour and do away with both calls.
      
      We must however change the special case optimization to also require
      that prev is of sched_class_fair, otherwise we can miss doing a dl or
      rt pull where we needed one.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  24. 10 Feb, 2014 1 commit
  25. 13 Jan, 2014 1 commit
    • J
      sched/deadline: Add SCHED_DEADLINE SMP-related data structures & logic · 1baca4ce
      Juri Lelli committed
      Introduces the data structures needed to implement dynamic
      migration of -deadline tasks, along with the logic for checking whether
      runqueues are overloaded with -deadline tasks and for choosing
      where a task should migrate when that is the case.
      
      Also adds dynamic migrations to SCHED_DEADLINE, so that tasks can
      be moved among CPUs when necessary. It is also possible to bind a
      task to a (set of) CPU(s), thus restricting its ability to
      migrate, or forbidding migrations altogether.
      
      The very same approach used in sched_rt is utilised:
       - -deadline tasks are kept in CPU-specific runqueues,
       - -deadline tasks are migrated among runqueues to achieve the
         following:
          * on an M-CPU system the M earliest deadline ready tasks
            are always running;
          * affinity/cpusets settings of all the -deadline tasks are
            always respected.
      
      Therefore, this very special form of "load balancing" is done with
      an active method, i.e., the scheduler pushes or pulls tasks between
      runqueues when they are woken up and/or (de)scheduled.
      IOW, every time a preemption occurs, the descheduled task might be sent
      to some other CPU (depending on its deadline) to continue executing
      (push). On the other hand, every time a CPU becomes idle, it might pull
      the second earliest deadline ready task from some other CPU.
      
      To enforce this, a pull operation is always attempted before taking any
      scheduling decision (pre_schedule()), as well as a push one after each
      scheduling decision (post_schedule()). In addition, when a task arrives
      or wakes up, the best CPU where to resume it is selected taking into
      account its affinity mask, the system topology, but also its deadline.
      E.g., from the scheduling point of view, the best CPU where to wake
      up (and also where to push) a task is the one which is running the task
      with the latest deadline among the M executing ones.
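The CPU selection rule in the last sentence can be sketched as a scan over per-CPU deadlines. This is a simplified model under assumptions: find_latest_deadline_cpu() and its parameters are hypothetical, the affinity mask is a 64-bit word, and deadline wraparound, idle CPUs and system topology are ignored.

```c
#include <stdint.h>

/*
 * Hypothetical model: curr_deadline[cpu] is the absolute deadline of the task
 * running on that CPU. Returns the allowed CPU running the latest deadline,
 * provided it is later than task_deadline; -1 if no such CPU exists.
 * nr_cpus must be <= 64.
 */
static int find_latest_deadline_cpu(const uint64_t *curr_deadline, int nr_cpus,
				    uint64_t allowed, uint64_t task_deadline)
{
	uint64_t best_deadline = task_deadline;
	int best_cpu = -1;
	int cpu;

	for (cpu = 0; cpu < nr_cpus; cpu++) {
		if (!(allowed & (UINT64_C(1) << cpu)))
			continue;            /* outside the affinity mask */
		if (curr_deadline[cpu] > best_deadline) {
			best_deadline = curr_deadline[cpu];
			best_cpu = cpu;
		}
	}
	return best_cpu;
}
```

A linear scan like this is what the per-runqueue deadline caching mentioned next is meant to make cheap; the sketch shows only the decision, not the caching.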
      
      In order to facilitate these decisions, per-runqueue "caching" of the
      deadlines of the currently running and of the first ready task is used.
      Queued but not running tasks are also parked in another rb-tree to
      speed-up pushes.
      Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-5-git-send-email-juri.lelli@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  26. 17 Dec, 2013 1 commit
  27. 26 Oct, 2013 1 commit
  28. 16 Oct, 2013 1 commit
    • P
      sched/rt: Add missing rmb() · 7c3f2ab7
      Peter Zijlstra committed
      While discussing the proposed SCHED_DEADLINE patches which in parts
      mimic the existing FIFO code it was noticed that the wmb in
      rt_set_overloaded() didn't have a matching barrier.
      
      The only site using rt_overloaded() to test the rto_count is
      pull_rt_task() and we should issue a matching rmb before then assuming
      there's an rto_mask bit set.
      
      Without that smp_rmb() in there we could actually miss seeing the
      rto_mask bit.
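The barrier pairing can be modelled with C11 atomics. This is an analogue under assumptions, not the kernel code: set_overload() and read_overload_mask() are hypothetical helpers, the mask is a 64-bit word, and C11 release/acquire fences are used in place of smp_wmb()/smp_rmb() (they are somewhat stronger than the kernel barriers they stand in for).

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t rto_mask; /* which CPUs are overloaded */
static atomic_int rto_count;      /* how many CPUs are overloaded */

/* Writer: publish the mask bit first, then the count. */
static void set_overload(int cpu)
{
	atomic_fetch_or_explicit(&rto_mask, UINT64_C(1) << cpu,
				 memory_order_relaxed);
	atomic_thread_fence(memory_order_release); /* smp_wmb() analogue */
	atomic_fetch_add_explicit(&rto_count, 1, memory_order_relaxed);
}

/* Reader: check the count first, then read the mask after a matching fence. */
static uint64_t read_overload_mask(void)
{
	if (atomic_load_explicit(&rto_count, memory_order_relaxed) == 0)
		return 0;                          /* nobody is overloaded */
	atomic_thread_fence(memory_order_acquire); /* smp_rmb() analogue */
	return atomic_load_explicit(&rto_mask, memory_order_relaxed);
}
```

If the reader's fence is dropped, nothing orders its mask load after its count load, so it may observe a non-zero count and still read a stale mask.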
      
      Also, change to using smp_[wr]mb(), even though this is SMP only code;
      memory barriers without smp_ always make me think they're against
      hardware of some sort.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: vincent.guittot@linaro.org
      Cc: luca.abeni@unitn.it
      Cc: bruce.ashfield@windriver.com
      Cc: dhaval.giani@gmail.com
      Cc: rostedt@goodmis.org
      Cc: hgu1972@gmail.com
      Cc: oleg@redhat.com
      Cc: fweisbec@gmail.com
      Cc: darren@dvhart.com
      Cc: johan.eker@ericsson.com
      Cc: p.faure@akatech.ch
      Cc: paulmck@linux.vnet.ibm.com
      Cc: raistlin@linux.it
      Cc: claudio@evidence.eu.com
      Cc: insop.song@gmail.com
      Cc: michael@amarulasolutions.com
      Cc: liming.wang@windriver.com
      Cc: fchecconi@gmail.com
      Cc: jkacur@redhat.com
      Cc: tommaso.cucinotta@sssup.it
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: harald.gustafsson@ericsson.com
      Cc: nicola.manica@disi.unitn.it
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  29. 09 Oct, 2013 1 commit
  30. 06 Oct, 2013 1 commit