1. 03 Aug 2015, 2 commits
    • sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Committed by Yuyang Du
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of one entity at a time, which results in the
         cfs_rq's load average being stale or partially updated: at any time, only
         one entity is up to date, while all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in the cfs_rq's avg and updating the cfs_rq's avg as a whole. This is based
         on the fact that the update, regarded as a function, is linear:

         w * update(e) = update(w * e), and

         update(e1) + update(e2) = update(e1 + e2)

         and hence:

         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average differs between the top-level rq->cfs_rq and a
         task_group's per-CPU cfs_rqs in whether or not blocked_load_avg
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. The way task_group entities' shares are calculated is complex and imprecise.

         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per-CPU cfs_rqs'
         load_avgs. Then a group entity's weight is simply proportional to its
         own cfs_rq's load_avg / the task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
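
      The following is a rough sketch of this aggregation and of the share rule
      above (illustrative C only, not the kernel implementation; the struct and
      helper names are made up, and the real code uses the PELT geometric series
      and fixed-point scaling):

      	/* A cfs_rq keeps one combined (runnable + blocked) load average. */
      	struct cfs_rq_avg {
      		unsigned long load_avg;
      	};

      	/* task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg */
      	static unsigned long task_group_load_avg(const struct cfs_rq_avg *cfs_rq, int nr)
      	{
      		unsigned long sum = 0;
      		int i;

      		for (i = 0; i < nr; i++)
      			sum += cfs_rq[i].load_avg;
      		return sum;
      	}

      	/* cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share */
      	static unsigned long group_entity_share(const struct cfs_rq_avg *cfs_rqx,
      						unsigned long tg_load_avg,
      						unsigned long tg_shares)
      	{
      		if (!tg_load_avg)
      			return tg_shares;	/* degenerate case: no load anywhere */
      		return cfs_rqx->load_avg * tg_shares / tg_load_avg;
      	}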
      
      To sum up, this rewrite is in principle equivalent to the existing code, but
      fixes the issues described above. It turns out to significantly reduce the
      code complexity and hence increase clarity and efficiency. In addition,
      the new averages are smoother and more continuous (no spurious spikes and
      valleys) and are updated more consistently and promptly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9d89c257
    • sched/fair: Remove rq's runnable avg · cd126afe
      Committed by Yuyang Du
      The current rq->avg has not been used at all since it was merged into the
      kernel, and the code is in the scheduler's hot path, so remove it.
      Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cd126afe
  2. 04 Jul 2015, 2 commits
  3. 19 Jun 2015, 3 commits
  4. 18 May 2015, 1 commit
    • sched,perf: Fix periodic timers · 4cfafd30
      Committed by Peter Zijlstra
      In the below two commits (see Fixes) we have periodic timers that can
      stop themselves when they're no longer required, but need to be
      (re)-started when their idle condition changes.
      
      A further complication is that we want the timer handler to always do
      the forward, so that it always correctly deals with the overruns,
      and we do not want to race such that the handler has already decided
      to stop, but the (external) restart sees the timer as still active and we
      end up with a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect whether you're before or after the fwd
      by comparing the expiration time against the current time. Of course,
      that's expensive (and racy) because we don't have the current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
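
      A rough sketch of that pattern, with an external 'active' flag serialized
      by the same lock as the handler (illustrative names only; the actual patch
      applies this to the perf mux timer and the sched bandwidth timers):

      	struct my_ctx {
      		raw_spinlock_t	lock;
      		struct hrtimer	timer;
      		ktime_t		period;
      		int		active;
      		int		idle;		/* the condition under which we stop */
      	};

      	static enum hrtimer_restart my_timer_fn(struct hrtimer *timer)
      	{
      		struct my_ctx *ctx = container_of(timer, struct my_ctx, timer);
      		int restart;

      		raw_spin_lock(&ctx->lock);
      		/* The handler always does the forward, so overruns are handled here. */
      		hrtimer_forward_now(timer, ctx->period);
      		restart = !ctx->idle;
      		if (!restart)
      			ctx->active = 0;	/* we are stopping ourselves */
      		raw_spin_unlock(&ctx->lock);

      		return restart ? HRTIMER_RESTART : HRTIMER_NORESTART;
      	}

      	/* External (re)start: only arm the timer if it is not marked active. */
      	static void my_timer_start(struct my_ctx *ctx)
      	{
      		raw_spin_lock(&ctx->lock);
      		if (!ctx->active) {
      			ctx->active = 1;
      			hrtimer_forward_now(&ctx->timer, ctx->period);
      			hrtimer_start_expires(&ctx->timer, HRTIMER_MODE_ABS_PINNED);
      		}
      		raw_spin_unlock(&ctx->lock);
      	}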
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
  5. 08 May 2015, 2 commits
  6. 22 Apr 2015, 1 commit
    • sched: Cleanup bandwidth timers · 77a4d1a1
      Committed by Peter Zijlstra
      Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth().
      
      The more I look at that code the more I'm convinced it's crack; that
      entire __start_cfs_bandwidth() thing is brain melting. We don't need to
      cancel a timer before starting it, *hrtimer_start*() will happily remove
      the timer for you if it's still enqueued.
      
      Removing that removes a big part of the problem: no more ugly cancel
      loop to get stuck in.
      
      So now, if I understand things right, the entire reason you have this
      cfs_b->lock guarded ->timer_active nonsense is to make sure we don't
      accidentally lose the timer.
      
      It appears to me that it should be possible to guarantee the same by
      unconditionally (re)starting the timer when !queued (see the sketch further
      below). Because regardless of what hrtimer::function returns, if we beat it
      to (re)enqueue the timer, it doesn't matter.
      
      Now, because hrtimers don't come with any serialization guarantees we
      must ensure both handler and (re)start loop serialize their access to
      the hrtimer to avoid both trying to forward the timer at the same
      time.
      
      Update the rt bandwidth timer to match.
      
      This effectively reverts: 09dc4ab0 ("sched/fair: Fix
      tg_set_cfs_bandwidth() deadlock on rq->lock").
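
      A sketch of the resulting (re)start rule, with both the handler and this
      path serialized by cfs_b->lock (close to, but not necessarily identical
      to, the patch):

      	/* Called with cfs_b->lock held. */
      	void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
      	{
      		lockdep_assert_held(&cfs_b->lock);

      		/* If the timer is still enqueued, its handler will do the forward. */
      		if (hrtimer_is_queued(&cfs_b->period_timer))
      			return;

      		/* Otherwise forward it past now and (re)arm it unconditionally. */
      		hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
      		hrtimer_start_expires(&cfs_b->period_timer, HRTIMER_MODE_ABS_PINNED);
      	}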
      Reported-by: Roman Gushchin <klamm@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Paul Turner <pjt@google.com>
      Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      77a4d1a1
  7. 02 Apr 2015, 1 commit
  8. 27 Mar 2015, 5 commits
  9. 23 Mar 2015, 1 commit
    • sched/rt: Use IPI to trigger RT task push migration instead of pulling · b6366f04
      Committed by Steven Rostedt
      When debugging the latencies on a 40 core box, where we hit 300 to
      500 microsecond latencies, I found there was a huge contention on the
      runqueue locks.
      
      Investigating it further, running ftrace, I found that it was due to
      the pulling of RT tasks.
      
      The test that was run was the following:
      
       cyclictest --numa -p95 -m -d0 -i100
      
      This created a thread on each CPU, that would set its wakeup in iterations
      of 100 microseconds. The -d0 means that all the threads had the same
      interval (100us). Each thread sleeps for 100us and wakes up and measures
      its latencies.
      
      cyclictest is maintained at:
       git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git
      
      What happened was another RT task would be scheduled on one of the CPUs
      that was running our test, when the other CPU tests went to sleep and
      scheduled idle. This caused the "pull" operation to execute on all
      these CPUs. Each one of these saw the RT task that was overloaded on
      the CPU of the test that was still running, and each one tried
      to grab that task in a thundering herd way.
      
      To grab the task, each thread would do a double rq lock grab, grabbing
      its own lock as well as the rq of the overloaded CPU. As the sched
      domains on this box were rather flat for its size, I saw up to 12 CPUs
      block on this lock at once. This caused a ripple effect with the
      rq locks, especially since the taking was done via a double rq lock, which
      means that several of the CPUs had their own rq locks held while trying
      to take this rq lock. As these locks were blocked, any wakeups or load
      balancing on these CPUs would also block on these locks, and the wait
      time escalated.
      
      I've tried various methods to lessen the load, but things like an
      atomic counter to only let one CPU grab the task won't work, because
      the task may have a limited affinity, and we may pick the wrong
      CPU to take that lock and do the pull, only to find out that the
      CPU we picked isn't in the task's affinity.
      
      Instead of doing the PULL, I now have the CPUs that want the pull to
      send over an IPI to the overloaded CPU, and let that CPU pick what
      CPU to push the task to. No more need to grab the rq lock, and the
      push/pull algorithm still works fine.
      
      With this patch, the latency dropped to just 150us over a 20 hour run.
      Without the patch, the huge latencies would trigger in seconds.
      
      I've created a new sched feature called RT_PUSH_IPI, which is enabled
      by default.
      
      When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks
      and having the pulling CPU do the work is implemented. When RT_PUSH_IPI
      is enabled, the IPI is sent to the overloaded CPU to do a push.
      
      To enable or disable this at run time:
      
       # mount -t debugfs nodev /sys/kernel/debug
       # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features
      or
       # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features
      
      Update: This original patch would send an IPI to all CPUs in the RT overload
      list. But that could theoretically cause the reverse issue. That is, there
      could be lots of overloaded RT queues and one CPU lowers its priority. It would
      then send an IPI to all the overloaded RT queues and they could then all try
      to grab the rq lock of the CPU lowering its priority, and then we have the
      same problem.
      
      The latest design sends out only one IPI to the first overloaded CPU. It tries to
      push any tasks that it can, and then looks for the next overloaded CPU that can
      push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable
      tasks that have priorities greater than the source CPU are covered. In case the
      source CPU lowers its priority again, a flag is set to tell the IPI traversal to
      restart with the first RT overloaded CPU after the source CPU.
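
      A heavily simplified sketch of that IPI chain (illustrative; helper names
      such as rto_next_cpu() are stand-ins, and the real patch threads the
      irq_work through the rt_rq and handles the restart flag):

      	static DEFINE_PER_CPU(struct irq_work, rto_push_work);

      	/* Source CPU (just lowered its prio): kick only the first overloaded CPU. */
      	static void tell_cpu_to_push(struct rq *this_rq)
      	{
      		/* rto_next_cpu(): next CPU set in this_rq->rd->rto_mask after 'cpu'. */
      		int cpu = rto_next_cpu(this_rq->rd, cpu_of(this_rq));

      		if (cpu < nr_cpu_ids)
      			irq_work_queue_on(&per_cpu(rto_push_work, cpu), cpu);
      	}

      	/* Runs on an overloaded CPU, in IPI (hard irq) context. */
      	static void rto_push_irq_work_func(struct irq_work *work)
      	{
      		struct rq *rq = this_rq();
      		int next;

      		/* Push pushable RT tasks using only our own rq lock. */
      		raw_spin_lock(&rq->lock);
      		push_rt_tasks(rq);
      		raw_spin_unlock(&rq->lock);

      		/* Pass the IPI along to the next RT-overloaded CPU, if any. */
      		next = rto_next_cpu(rq->rd, cpu_of(rq));
      		if (next < nr_cpu_ids)
      			irq_work_queue_on(&per_cpu(rto_push_work, next), next);
      	}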
      Parts-suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joern Engel <joern@purestorage.com>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b6366f04
  10. 18 Feb 2015, 1 commit
  11. 14 Jan 2015, 2 commits
  12. 16 Nov 2014, 1 commit
    • sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Committed by Stanislaw Gruszka
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case at the cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      in the Y time being smaller than the X time.
      
      A reproducer/tester can be found further below; it can be compiled and run by:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      The issue happens because, on start, thread_group_cputimer() initializes the
      cputimer's sum_exec_runtime with thread runtime that has not yet been accounted,
      and then the threads' runtime is added to the running cputimer again on the
      scheduler tick, making its sum_exec_runtime bigger than the actual threads' runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes a different approach to cure the problem. It calls
      update_curr() when the cputimer starts, which assures we have updated
      stats for the running threads, and on the next scheduler tick we will account
      only the runtime that elapsed since the cputimer started. That also assures we
      have a consistent state between the cpu times of the individual threads and the
      cpu time of the process consisting of those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e998916
  13. 04 Nov 2014, 3 commits
    • sched: Refactor task_struct to use numa_faults instead of numa_* pointers · 44dba3d5
      Committed by Iulia Manda
      This patch simplifies task_struct by removing the four numa_* pointers that
      pointed into the same array and replacing them with a single array pointer. By
      doing this, on x86_64, the size of task_struct is reduced by 3 pointers (24 bytes).
      
      A new parameter is added to the task_faults_idx() function so that it can return
      an index at the correct offset, corresponding to the old precalculated
      pointers (see the sketch below).
      
      All of the code in sched/ that depended on task_faults_idx and numa_* was
      changed in order to match the new logic.
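
      A sketch of what the flattened indexing looks like (close to what the patch
      introduces, but illustrative; the enum values name the four old arrays):

      	enum numa_faults_stats {
      		NUMA_MEM = 0,	/* was numa_faults_memory */
      		NUMA_CPU,	/* was numa_faults_cpu */
      		NUMA_MEMBUF,	/* was numa_faults_buffer_memory */
      		NUMA_CPUBUF	/* was numa_faults_buffer_cpu */
      	};

      	/* Index into the single p->numa_faults array: 2 slots (priv/shared) per node. */
      	static inline int task_faults_idx(enum numa_faults_stats s, int nid, int priv)
      	{
      		return NR_NUMA_HINT_FAULT_TYPES * (s * nr_node_ids + nid) + priv;
      	}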
      Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: dave@stgolabs.net
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      44dba3d5
    • sched/deadline: Add deadline rq status print · acb32132
      Committed by Wanpeng Li
      This patch adds a print of the deadline rq status.
      Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      acb32132
    • sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl() · 67dfa1b7
      Committed by Kirill Tkhai
      The currently used hrtimer_try_to_cancel() is racy:
      
      raw_spin_lock(&rq->lock)
      ...                            dl_task_timer                 raw_spin_lock(&rq->lock)
      ...                               raw_spin_lock(&rq->lock)   ...
         switched_from_dl()             ...                        ...
            hrtimer_try_to_cancel()     ...                        ...
         switched_to_fair()             ...                        ...
      ...                               ...                        ...
      ...                               ...                        ...
      raw_spin_unlock(&rq->lock)        ...                        (acquired)
      ...                               ...                        ...
      ...                               ...                        ...
      do_exit()                         ...                        ...
         schedule()                     ...                        ...
            raw_spin_lock(&rq->lock)    ...                        raw_spin_unlock(&rq->lock)
            ...                         ...                        ...
            raw_spin_unlock(&rq->lock)  ...                        raw_spin_lock(&rq->lock)
            ...                         ...                        (acquired)
            put_task_struct()           ...                        ...
                free_task_struct()      ...                        ...
            ...                         ...                        raw_spin_unlock(&rq->lock)
      ...                               (acquired)                 ...
      ...                               ...                        ...
      ...                               (use after free)           ...
      
      So, let's implement 100% guaranteed way to cancel the timer and let's
      be sure we are safe even in very unlikely situations.
      
      rq unlocking does not limit the area of switched_from_dl() use, because
      this has already been possible in pull_dl_task() below.
      
      Let's consider the safety of this unlocking. The new code in the patch
      comes into play when hrtimer_try_to_cancel() fails. This means the callback
      is running. In this case hrtimer_cancel() is just waiting till the
      callback is finished. Two cases are possible:
      
      1) Since we are in switched_from_dl(), the new class is not dl_sched_class and
      the new prio is not less than MAX_DL_PRIO. So, the callback returns early, right
      after the !dl_task() check. After that, hrtimer_cancel() returns back too.
      
      The above is:
      
      raw_spin_lock(rq->lock);                  ...
      ...                                       dl_task_timer()
      ...                                          raw_spin_lock(rq->lock);
         switched_from_dl()                        ...
             hrtimer_try_to_cancel()               ...
                raw_spin_unlock(rq->lock);         ...
                hrtimer_cancel()                   ...
                ...                                raw_spin_unlock(rq->lock);
                ...                                return HRTIMER_NORESTART;
                ...                             ...
                raw_spin_lock(rq->lock);        ...
      
      2) But the below is also possible:
                                         dl_task_timer()
                                            raw_spin_lock(rq->lock);
                                            ...
                                            raw_spin_unlock(rq->lock);
      raw_spin_lock(rq->lock);              ...
         switched_from_dl()                 ...
             hrtimer_try_to_cancel()        ...
             ...                            return HRTIMER_NORESTART;
             raw_spin_unlock(rq->lock);  ...
             hrtimer_cancel();           ...
             raw_spin_lock(rq->lock);    ...
      
      In this case hrtimer_cancel() returns immediately. Very unlikely case,
      just to mention.
      
      Nobody can manipulate the task, because check_class_changed() is
      always called with pi_lock locked. Nobody can force the task to
      participate in (concurrent) priority inheritance schemes (the same reason).
      
      All concurrent task operations require pi_lock, which is held by us.
      No deadlocks with dl_task_timer() are possible, because it returns
      right after !dl_task() check (it does nothing).
      
      If we receive a new dl_task while the rq is unlocked, we just don't
      have to do pull_dl_task() in switched_from_dl() any further.
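
      A sketch of the resulting cancel helper (simplified; assumes rq->lock is
      held on entry and p->pi_lock is held by the caller, as argued above):

      	static void cancel_dl_timer(struct rq *rq, struct task_struct *p)
      	{
      		struct hrtimer *dl_timer = &p->dl.dl_timer;

      		if (hrtimer_active(dl_timer)) {
      			/* -1 means the callback is currently running. */
      			if (hrtimer_try_to_cancel(dl_timer) == -1) {
      				raw_spin_unlock(&rq->lock);
      				hrtimer_cancel(dl_timer);	/* wait for the callback */
      				raw_spin_lock(&rq->lock);
      			}
      		}
      	}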
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      [ Added comments]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      67dfa1b7
  14. 28 Oct 2014, 3 commits
  15. 24 Sep 2014, 3 commits
  16. 21 Sep 2014, 1 commit
  17. 27 Aug 2014, 1 commit
  18. 20 Aug 2014, 3 commits
    • sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state · cca26e80
      Committed by Kirill Tkhai
      This is a new p->on_rq state which will be used to indicate that a task
      is in the process of migrating between two RQs. It allows us to get
      rid of double_rq_lock(), which we previously used to change the rq of
      a queued task.
      
      Let's consider an example. To move a task between src_rq and
      dst_rq we will do the following:
      
      	raw_spin_lock(&src_rq->lock);
      	/* p is a task which is queued on src_rq */
      	p = ...;
      
      	dequeue_task(src_rq, p, 0);
      	p->on_rq = TASK_ON_RQ_MIGRATING;
      	set_task_cpu(p, dst_cpu);
      	raw_spin_unlock(&src_rq->lock);
      
          	/*
          	 * Both RQs are unlocked here.
          	 * Task p is dequeued from src_rq
          	 * but its on_rq value is not zero.
          	 */
      
      	raw_spin_lock(&dst_rq->lock);
      	p->on_rq = TASK_ON_RQ_QUEUED;
      	enqueue_task(dst_rq, p, 0);
      	raw_spin_unlock(&dst_rq->lock);
      
      While p->on_rq is TASK_ON_RQ_MIGRATING, the task is considered to be
      "migrating", and other parallel scheduler actions on it are
      not available to parallel callers. A parallel caller spins
      until the migration is completed.

      The unavailable actions are changing of cpu affinity, changing
      of priority, etc., in other words all the functionality which used
      to require task_rq(p)->lock before (and is related to the task).
      
      To implement TASK_ON_RQ_MIGRATING support we primarily use
      the following fact. Most scheduler users (from which we are
      protecting a migrating task) use task_rq_lock() and
      __task_rq_lock() to get the lock of task_rq(p). These primitives
      know that the task's cpu may change, and they spin while the
      lock of the right RQ is not held. We add one more condition to
      them, so they will also spin until the migration is
      finished (see the sketch below).
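
      A simplified sketch of that extra condition in the rq-locking helpers (the
      real __task_rq_lock()/task_rq_lock() also deal with p->pi_lock and irq state):

      	static struct rq *__task_rq_lock(struct task_struct *p)
      	{
      		struct rq *rq;

      		for (;;) {
      			rq = task_rq(p);
      			raw_spin_lock(&rq->lock);
      			/* Stable only if the rq is still ours and no migration is in flight. */
      			if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
      				return rq;
      			raw_spin_unlock(&rq->lock);

      			/* Wait for the migration to finish before retrying. */
      			while (unlikely(task_on_rq_migrating(p)))
      				cpu_relax();
      		}
      	}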
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1408528062.23412.88.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cca26e80
    • sched: Add wrapper for checking task_struct::on_rq · da0c1e65
      Committed by Kirill Tkhai
      Implement task_on_rq_queued() and use it everywhere instead of the
      open-coded on_rq check. No functional changes.

      The only exception is that we do not use the wrapper in
      check_for_tasks(), because that would require exporting
      task_on_rq_queued() in global header files. The next patch in the series
      brings it back, so we do not move it back and forth.
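
      The wrapper itself is a one-liner; a sketch of what it looks like:

      	#define TASK_ON_RQ_QUEUED	1

      	static inline int task_on_rq_queued(struct task_struct *p)
      	{
      		return p->on_rq == TASK_ON_RQ_QUEUED;
      	}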
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      da0c1e65
    • sched: Match declaration with definition · 8b06c55b
      Committed by Pranith Kumar
      Match the declaration of runqueues with the definition.
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1407950893-32731-1-git-send-email-bobby.prani@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8b06c55b
  19. 16 Jul 2014, 2 commits
  20. 05 Jul 2014, 1 commit
    • sched/fair: Implement fast idling of CPUs when the system is partially loaded · 4486edd1
      Committed by Tim Chen
      When a system is lightly loaded (i.e. no more than 1 job per cpu),
      attempting to pull a job to a cpu before putting it to idle is unnecessary and
      can be skipped.  This patch adds an indicator so the scheduler can know
      when there is no more than 1 active job on any CPU in the system, and
      skip needless job pulls.
      
      On a 4 socket machine with a request/response kind of workload from
      clients, we saw about 0.13 msec of delay when we go through a full load
      balance to try to pull a job from all the other cpus.  While 0.1 msec was
      spent on processing the request and generating a response, the 0.13 msec
      load balance overhead was actually more than the actual work being done.
      This overhead can be skipped much of the time for lightly loaded systems.
      
      With this patch, we tested with a netperf request/response workload that
      has the server busy with half the cpus in a 4 socket system.  We found
      the patch eliminated 75% of the load balance attempts before idling a cpu.
      
      The overhead of setting/clearing the indicator is low, as we already gather
      the necessary info while we call add_nr_running() and update_sd_lb_stats().
      We switch to full load balancing immediately if any cpu gets more than
      one job on its run queue in add_nr_running().  We clear the indicator
      to avoid load balancing when we detect that no cpu has more than one job
      while we scan the run queues in update_sg_lb_stats().  We are aggressive
      in turning on the load balancing and opportunistic in skipping it; a sketch
      of the set side follows.
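
      A sketch of the set side of that indicator (simplified; the clear side lives
      in the update_sg_lb_stats()/update_sd_lb_stats() path):

      	static inline void add_nr_running(struct rq *rq, unsigned count)
      	{
      		unsigned prev_nr = rq->nr_running;

      		rq->nr_running = prev_nr + count;

      		/* Crossed from at most one to more than one runnable task. */
      		if (prev_nr < 2 && rq->nr_running >= 2) {
      			if (!rq->rd->overload)
      				rq->rd->overload = true;
      		}
      	}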
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Jason Low <jason.low2@hp.com>
      Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4486edd1
  21. 16 Jun 2014, 1 commit
    • nohz: Use IPI implicit full barrier against rq->nr_running r/w · 3882ec64
      Committed by Frederic Weisbecker
      A full dynticks CPU is allowed to stop its tick when a single task runs.
      Meanwhile when a new task gets enqueued, the CPU must be notified so that
      it can restart its tick to maintain local fairness and other accounting
      details.
      
      This notification is performed by way of an IPI. Then when the target
      receives the IPI, we expect it to see the new value of rq->nr_running.
      
      Hence the following ordering scenario:
      
         CPU 0                   CPU 1
      
         write rq->nr_running    get IPI
         smp_wmb()               smp_rmb()
         send IPI                read rq->nr_running
      
      But Paul McKenney says that nowadays IPIs imply a full barrier on
      all architectures. So we can safely remove this pair and rely on the
      implicit barriers that come along with IPI send/receive. Let's
      just comment on this new assumption.
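
      A sketch of the enqueue-side ordering this relies on (illustrative;
      smp_send_reschedule() stands in for the actual nohz kick machinery):

      	/* CPU 0: enqueue, then notify the full-dynticks CPU. */
      	add_nr_running(rq, 1);
      	/*
      	 * No explicit smp_wmb() is needed here: sending and receiving the IPI
      	 * imply a full memory barrier, so the IPI handler on the target CPU is
      	 * guaranteed to observe the updated rq->nr_running.
      	 */
      	smp_send_reschedule(cpu_of(rq));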
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      3882ec64