1. 04 2月, 2015 1 次提交
    • P
      sched/deadline: Fix deadline parameter modification handling · 40767b0d
      Peter Zijlstra 提交于
      Commit 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to
      use in switched_from_dl()") removed the hrtimer_try_cancel() function
      call out from init_dl_task_timer(), which gets called from
      __setparam_dl().
      
      The result is that we can now re-init the timer while its active --
      this is bad and corrupts timer state.
      
      Furthermore; changing the parameters of an active deadline task is
      tricky in that you want to maintain guarantees, while immediately
      effective change would allow one to circumvent the CBS guarantees --
      this too is bad, as one (bad) task should not be able to affect the
      others.
      
      Rework things to avoid both problems. We only need to initialize the
      timer once, so move that to __sched_fork() for new tasks.
      
      Then make sure __setparam_dl() doesn't affect the current running
      state but only updates the parameters used to calculate the next
      scheduling period -- this guarantees the CBS functions as expected
      (albeit slightly pessimistic).
      
      This however means we need to make sure __dl_clear_params() needs to
      reset the active state otherwise new (and tasks flipping between
      classes) will not properly (re)compute their first instance.
      
      Todo: close class flipping CBS hole.
      Todo: implement delayed BW release.
      Reported-by: NLuca Abeni <luca.abeni@unitn.it>
      Acked-by: NJuri Lelli <juri.lelli@arm.com>
      Tested-by: NLuca Abeni <luca.abeni@unitn.it>
      Fixes: 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()")
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      40767b0d
  2. 31 1月, 2015 1 次提交
  3. 09 1月, 2015 2 次提交
    • L
      sched/deadline: Avoid double-accounting in case of missed deadlines · 269ad801
      Luca Abeni 提交于
      The dl_runtime_exceeded() function is supposed to ckeck if
      a SCHED_DEADLINE task must be throttled, by checking if its
      current runtime is <= 0. However, it also checks if the
      scheduling deadline has been missed (the current time is
      larger than the current scheduling deadline), further
      decreasing the runtime if this happens.
      This "double accounting" is wrong:
      
      - In case of partitioned scheduling (or single CPU), this
        happens if task_tick_dl() has been called later than expected
        (due to small HZ values). In this case, the current runtime is
        also negative, and replenish_dl_entity() can take care of the
        deadline miss by recharging the current runtime to a value smaller
        than dl_runtime
      
      - In case of global scheduling on multiple CPUs, scheduling
        deadlines can be missed even if the task did not consume more
        runtime than expected, hence penalizing the task is wrong
      
      This patch fix this problem by throttling a SCHED_DEADLINE task
      only when its runtime becomes negative, and not modifying the runtime
      Signed-off-by: NLuca Abeni <luca.abeni@unitn.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJuri Lelli <juri.lelli@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      269ad801
    • L
      sched/deadline: Fix migration of SCHED_DEADLINE tasks · 6a503c3b
      Luca Abeni 提交于
      According to global EDF, tasks should be migrated between runqueues
      without checking if their scheduling deadlines and runtimes are valid.
      However, SCHED_DEADLINE currently performs such a check:
      a migration happens doing:
      
      	deactivate_task(rq, next_task, 0);
      	set_task_cpu(next_task, later_rq->cpu);
      	activate_task(later_rq, next_task, 0);
      
      which ends up calling dequeue_task_dl(), setting the new CPU, and then
      calling enqueue_task_dl().
      
      enqueue_task_dl() then calls enqueue_dl_entity(), which calls
      update_dl_entity(), which can modify scheduling deadline and runtime,
      breaking global EDF scheduling.
      
      As a result, some of the properties of global EDF are not respected:
      for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on
      two cores can have unbounded response times for the third task even
      if 30/80+40/80+120/170 = 1.5809 < 2
      
      This can be fixed by invoking update_dl_entity() only in case of
      wakeup, or if this is a new SCHED_DEADLINE task.
      Signed-off-by: NLuca Abeni <luca.abeni@unitn.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NJuri Lelli <juri.lelli@gmail.com>
      Cc: <stable@vger.kernel.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6a503c3b
  4. 16 11月, 2014 4 次提交
    • W
      sched/deadline: Introduce start_hrtick_dl() for !CONFIG_SCHED_HRTICK · 36ce9881
      Wanpeng Li 提交于
      Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with
      the fair class.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      36ce9881
    • W
      sched/deadline: Fix rq->dl.pushable_tasks bug in push_dl_task() · c51b8ab5
      Wanpeng Li 提交于
      Do not call dequeue_pushable_dl_task() when failing to push an eligible
      task, as it remains pushable, merely not at this particular moment.
      
      Actually the patch is the same behavior as commit 311e800e ("sched,
      rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side.
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c51b8ab5
    • W
      sched: Move p->nr_cpus_allowed check to select_task_rq() · 6c1d9410
      Wanpeng Li 提交于
      Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq().
      This change will make fair.c, rt.c, and deadline.c all start with the
      same logic.
      Suggested-and-Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: "pang.xunlei" <pang.xunlei@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6c1d9410
    • S
      sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Stanislaw Gruszka 提交于
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case in cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      of Y time being smaller than X time.
      
      Reproducer/tester can be found further below, it can be compiled and ran by:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      Issue happens because on start in thread_group_cputimer() we initialize
      sum_exec_runtime of cputimer with threads runtime not yet accounted and
      then add the threads runtime to running cputimer again on scheduler
      tick, making it's sum_exec_runtime bigger than actual threads runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes different approach to cure the problem. It calls
      update_curr() when cputimer starts, that assure we will have updated
      stats of running threads and on the next schedule tick we will account
      only the runtime that elapsed from cputimer start. That also assure we
      have consistent state between cpu times of individual threads and cpu
      time of the process consisted by those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      6e998916
  5. 04 11月, 2014 6 次提交
  6. 28 10月, 2014 7 次提交
  7. 24 9月, 2014 2 次提交
  8. 19 9月, 2014 1 次提交
  9. 07 9月, 2014 1 次提交
    • X
      sched/deadline: Fix a precision problem in the microseconds range · 177ef2a6
      xiaofeng.yan 提交于
      An overrun could happen in function start_hrtick_dl()
      when a task with SCHED_DEADLINE runs in the microseconds
      range.
      
      For example, if a task with SCHED_DEADLINE has the following parameters:
      
        Task  runtime  deadline  period
         P1   200us     500us    500us
      
      The deadline and period from task P1 are less than 1ms.
      
      In order to achieve microsecond precision, we need to enable HRTICK feature
      by the next command:
      
        PC#echo "HRTICK" > /sys/kernel/debug/sched_features
        PC#trace-cmd record -e sched_switch &
        PC#./schedtool -E -t 200000:500000:500000 -e ./test
      
      The binary test is in an endless while(1) loop here.
      Some pieces of trace.dat are as follows:
      
        <idle>-0   157.603157: sched_switch: :R ==> 2481:4294967295: test
        test-2481  157.603203: sched_switch:  2481:R ==> 0:120: swapper/2
        <idle>-0   157.605657: sched_switch:  :R ==> 2481:4294967295: test
        test-2481  157.608183: sched_switch:  2481:R ==> 2483:120: trace-cmd
        trace-cmd-2483 157.609656: sched_switch:2483:R==>2481:4294967295: test
      
      We can get the runtime of P1 from the information above:
      
        runtime = 157.608183 - 157.605657
        runtime = 0.002526(2.526ms)
      
      The correct runtime should be less than or equal to 200us at some point.
      
      The problem is caused by a conditional judgment "delta > 10000"
      in function start_hrtick_dl().
      
      Because no hrtimer start up to control the rest of runtime
      when the reset of runtime is less than 10us.
      
      So the process will continue to run until tick-period is coming.
      
      Move the code with the limit of the least time slice
      from hrtick_start_fair() to hrtick_start() because the
      EDF schedule class also needs this function in start_hrtick_dl().
      
      To fix this problem, we call hrtimer_start() unconditionally in
      start_hrtick_dl(), and make sure the scheduling slice won't be smaller
      than 10us in hrtimer_start().
      Signed-off-by: NXiaofeng Yan <xiaofeng.yan@huawei.com>
      Reviewed-by: NLi Zefan <lizefan@huawei.com>
      Acked-by: NJuri Lelli <juri.lelli@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1409022941-5880-1-git-send-email-xiaofeng.yan@huawei.com
      [ Massaged the changelog and the code. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      177ef2a6
  10. 28 8月, 2014 1 次提交
    • C
      percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t · 4ba29684
      Christoph Lameter 提交于
      __get_cpu_var can paper over differences in the definitions of
      cpumask_var_t and either use the address of the cpumask variable
      directly or perform a fetch of the address of the struct cpumask
      allocated elsewhere. This is important particularly when using per cpu
      cpumask_var_t declarations because in one case we have an offset into
      a per cpu area to handle and in the other case we need to fetch a
      pointer from the offset.
      
      This patch introduces a new macro
      
      this_cpu_cpumask_var_ptr()
      
      that is defined where cpumask_var_t is defined and performs the proper
      actions. All use cases where __get_cpu_var is used with cpumask_var_t
      are converted to the use of this_cpu_cpumask_var_ptr().
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      4ba29684
  11. 20 8月, 2014 1 次提交
    • K
      sched: Add wrapper for checking task_struct::on_rq · da0c1e65
      Kirill Tkhai 提交于
      Implement task_on_rq_queued() and use it everywhere instead of
      on_rq check. No functional changes.
      
      The only exception is we do not use the wrapper in
      check_for_tasks(), because it requires to export
      task_on_rq_queued() in global header files. Next patch in series
      would return it back, so we do not twist it from here to there.
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
      da0c1e65
  12. 16 7月, 2014 2 次提交
  13. 05 6月, 2014 4 次提交
  14. 22 5月, 2014 2 次提交
  15. 07 5月, 2014 1 次提交
  16. 17 4月, 2014 1 次提交
  17. 11 3月, 2014 1 次提交
  18. 27 2月, 2014 2 次提交
    • J
      sched/deadline: Prevent rt_time growth to infinity · faa59937
      Juri Lelli 提交于
      Kirill Tkhai noted:
      
        Since deadline tasks share rt bandwidth, we must care about
        bandwidth timer set. Otherwise rt_time may grow up to infinity
        in update_curr_dl(), if there are no other available RT tasks
        on top level bandwidth.
      
      RT task were in fact throttled right after they got enqueued,
      and never executed again (rt_time never again went below rt_runtime).
      
      Peter then proposed to accrue DL execution on rt_time only when
      rt timer is active, and proposed a patch (this patch is a slight
      modification of that) to implement that behavior. While this
      solves Kirill problem, it has a drawback.
      
      Indeed, Kirill noted again:
      
        It looks we may get into a situation, when all CPU time is shared
        between RT and DL tasks:
      
        rt_runtime = n
        rt_period  = 2n
      
        | RT working, DL sleeping  | DL working, RT sleeping      |
        -----------------------------------------------------------
        | (1)     duration = n     | (2)     duration = n         | (repeat)
        |--------------------------|------------------------------|
        | (rt_bw timer is running) | (rt_bw timer is not running) |
      
        No time for fair tasks at all.
      
      While this can happen during the first period, if rq is always backlogged,
      RT tasks won't have the opportunity to execute anymore: rt_time reached
      rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
      throttled since rt timer didn't fire, replenishment is from now on eaten up
      by DL tasks that accrue their execution on rt_time (while rt timer is
      active - we have an RT task waiting for replenishment). FAIR tasks are
      not touched after this first period. Ok, this is not ideal, and the situation
      is even worse!
      
      What above (the nice case), practically never happens in reality, where
      your rt timer is not aligned to tasks periods, tasks are in general not
      periodic, etc.. Long story short, you always risk to overload your system.
      
      This patch is based on Peter's idea, but exploits an additional fact:
      if you don't have RT tasks enqueued, it makes little sense to continue
      incrementing rt_time once you reached the upper limit (DL tasks have their
      own mechanism for throttling).
      
      This cures both problems:
      
       - no matter how many DL instances in the past, you'll have an rt_time
         slightly above rt_runtime when an RT task is enqueued, and from that
         point on (after the first replenishment), the task will normally execute;
      
       - you can still eat up all bandwidth during the first period, but not
         anymore after that, remember that DL execution will increment rt_time
         till the upper limit is reached.
      
      The situation is still not perfect! But, we have a simple solution for now,
      that limits how much you can jeopardize your system, as we keep working
      towards the right answer: RT groups scheduled using deadline servers.
      Reported-by: NKirill Tkhai <tkhai@yandex.ru>
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      faa59937
    • K
      sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration · 3908ac13
      Kirill Tkhai 提交于
      In deadline class we do not have group scheduling.
      
      So, let's remove unnecessary
      
      	X = X;
      
      equations.
      Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3908ac13