1. 14 Jan 2015, 1 commit
2. 16 Nov 2014, 1 commit
• sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
Stanislaw Gruszka committed
Commit d670ec13 ("posix-cpu-timers: Cure SMP wobbles") fixes one glibc
test case at the cost of breaking another one. After that commit, calling
clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
in Y being smaller than X.
      
A reproducer/tester can be found further below; it can be compiled and run with:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
When run on a buggy kernel, this reproducer will complain
about "clock_gettime difference too small".
      
The issue happens because, at start, thread_group_cputimer() initializes
the cputimer's sum_exec_runtime with thread runtime that has not yet been
accounted, and then the scheduler tick adds that thread runtime to the
running cputimer again, making its sum_exec_runtime bigger than the
threads' actual runtime.
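
As a rough illustration of that double accounting (a toy userspace model
with made-up numbers, not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t accounted = 900;   /* runtime already accounted to the threads */
		uint64_t unaccounted = 100; /* runtime used since the last tick */

		/* Buggy start: seed the cputimer with accounted + unaccounted time. */
		uint64_t sum_exec_runtime = accounted + unaccounted;    /* 1000 */

		/* Next tick: the same 100 units are accounted to the cputimer again. */
		sum_exec_runtime += unaccounted;                        /* 1100 */

		printf("cputimer %llu vs. real runtime %llu\n",
		       (unsigned long long)sum_exec_runtime,
		       (unsigned long long)(accounted + unaccounted));
		return 0;
	}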
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
This patch takes a different approach to cure the problem. It calls
update_curr() when the cputimer starts, which assures we have updated
stats for the running threads, so on the next scheduler tick we account
only the runtime that elapsed since the cputimer started. That also
assures a consistent state between the CPU times of the individual
threads and the CPU time of the process composed of those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e998916
3. 04 Nov 2014, 3 commits
• sched: Refactor task_struct to use numa_faults instead of numa_* pointers · 44dba3d5
Iulia Manda committed
This patch simplifies task_struct by removing the four numa_* pointers
that pointed into the same array and replacing them with a single array
pointer. By doing this, the size of task_struct is reduced by 3 ulong
pointers (24 bytes on x86_64).

A new parameter is added to the task_faults_idx() function so that it can
return an index at the correct offset, corresponding to the old
precalculated pointers.

All of the code in sched/ that depended on task_faults_idx() and the
numa_* pointers was changed to match the new logic, as sketched below.
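
A rough sketch of the resulting layout (the enum values, field names and
the index formula below are illustrative assumptions, not the exact
kernel code):

	/* The old per-kind pointers (memory, cpu and their buffer variants)
	 * collapse into one array; the stat kind becomes an explicit
	 * parameter of the index helper. */
	enum numa_faults_stats {
		NUMA_MEM, NUMA_CPU, NUMA_MEMBUF, NUMA_CPUBUF
	};

	struct numa_state {
		unsigned long *numa_faults;	/* single backing array */
	};

	/* Two entries (private/shared) per node, one block per stat kind. */
	static inline int task_faults_idx(enum numa_faults_stats s,
					  int nr_node_ids, int nid, int priv)
	{
		return 2 * (s * nr_node_ids + nid) + priv;
	}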
Signed-off-by: Iulia Manda <iulia.manda21@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: mgorman@suse.de
      Cc: dave@stgolabs.net
      Cc: riel@redhat.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfell
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      44dba3d5
• sched/deadline: Add deadline rq status print · acb32132
Wanpeng Li committed
This patch adds a deadline rq status print.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      acb32132
• sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl() · 67dfa1b7
Kirill Tkhai committed
The currently used hrtimer_try_to_cancel() is racy:
      
      raw_spin_lock(&rq->lock)
      ...                            dl_task_timer                 raw_spin_lock(&rq->lock)
      ...                               raw_spin_lock(&rq->lock)   ...
         switched_from_dl()             ...                        ...
            hrtimer_try_to_cancel()     ...                        ...
         switched_to_fair()             ...                        ...
      ...                               ...                        ...
      ...                               ...                        ...
raw_spin_unlock(&rq->lock)        ...                        (acquired)
      ...                               ...                        ...
      ...                               ...                        ...
      do_exit()                         ...                        ...
         schedule()                     ...                        ...
            raw_spin_lock(&rq->lock)    ...                        raw_spin_unlock(&rq->lock)
            ...                         ...                        ...
            raw_spin_unlock(&rq->lock)  ...                        raw_spin_lock(&rq->lock)
      ...                         ...                        (acquired)
            put_task_struct()           ...                        ...
                free_task_struct()      ...                        ...
            ...                         ...                        raw_spin_unlock(&rq->lock)
...                               (acquired)                  ...
      ...                               ...                        ...
      ...                               (use after free)           ...
      
So, let's implement a 100% guaranteed way to cancel the timer and be
sure we are safe even in very unlikely situations.
      
Unlocking the rq does not limit where switched_from_dl() can be used,
because this has already been possible in pull_dl_task() below.

Let's consider the safety of this unlocking. The new code in the patch
only runs when hrtimer_try_to_cancel() fails, which means the callback
is running. In this case hrtimer_cancel() is just waiting until the
callback finishes. Two cases are possible:

1) Since we are in switched_from_dl(), the new class is not dl_sched_class and
the new prio is not less than MAX_DL_PRIO. So, the callback returns early, right
after the !dl_task() check, and after that hrtimer_cancel() returns too.
      
      The above is:
      
      raw_spin_lock(rq->lock);                  ...
      ...                                       dl_task_timer()
      ...                                          raw_spin_lock(rq->lock);
         switched_from_dl()                        ...
             hrtimer_try_to_cancel()               ...
                raw_spin_unlock(rq->lock);         ...
                hrtimer_cancel()                   ...
                ...                                raw_spin_unlock(rq->lock);
                ...                                return HRTIMER_NORESTART;
                ...                             ...
                raw_spin_lock(rq->lock);        ...
      
      2) But the below is also possible:
                                         dl_task_timer()
                                            raw_spin_lock(rq->lock);
                                            ...
                                            raw_spin_unlock(rq->lock);
      raw_spin_lock(rq->lock);              ...
         switched_from_dl()                 ...
             hrtimer_try_to_cancel()        ...
             ...                            return HRTIMER_NORESTART;
             raw_spin_unlock(rq->lock);  ...
             hrtimer_cancel();           ...
             raw_spin_lock(rq->lock);    ...
      
In this case hrtimer_cancel() returns immediately. This is a very
unlikely case, mentioned just for completeness.
      
Nobody can manipulate the task, because check_class_changed() is
always called with pi_lock held. For the same reason, nobody can force
the task to participate in (concurrent) priority inheritance schemes.

All concurrent task operations require pi_lock, which we hold. No
deadlocks with dl_task_timer() are possible, because it returns
right after the !dl_task() check (it does nothing).

If a new dl_task arrives while the rq is unlocked, we simply no longer
have to do pull_dl_task() in switched_from_dl(); a sketch of the
resulting helper follows.
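
A sketch of the cancellation helper (along the lines described above;
not necessarily the exact code that was merged):

	static void cancel_dl_timer(struct rq *rq, struct task_struct *p)
	{
		struct hrtimer *dl_timer = &p->dl.dl_timer;

		/* Nobody changes the task's class while pi_lock is held. */
		lockdep_assert_held(&p->pi_lock);

		if (hrtimer_active(dl_timer) &&
		    hrtimer_try_to_cancel(dl_timer) == -1) {
			/*
			 * The callback is running: drop rq->lock so it can
			 * finish, wait for it, then retake the lock. Tasks may
			 * migrate and new dl tasks may appear on the rq while
			 * it is unlocked; callers must cope with that.
			 */
			raw_spin_unlock(&rq->lock);
			hrtimer_cancel(dl_timer);
			raw_spin_lock(&rq->lock);
		}
	}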
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
[ Added comments ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      67dfa1b7
4. 28 Oct 2014, 3 commits
5. 24 Sep 2014, 3 commits
6. 21 Sep 2014, 1 commit
7. 27 Aug 2014, 1 commit
8. 20 Aug 2014, 3 commits
• sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state · cca26e80
Kirill Tkhai committed
This is a new p->on_rq state which will be used to indicate that a task
is in the process of migrating between two RQs. It allows us to get
rid of double_rq_lock(), which we previously used to change the rq of
a queued task.
      
      Let's consider an example. To move a task between src_rq and
      dst_rq we will do the following:
      
      	raw_spin_lock(&src_rq->lock);
      	/* p is a task which is queued on src_rq */
      	p = ...;
      
      	dequeue_task(src_rq, p, 0);
      	p->on_rq = TASK_ON_RQ_MIGRATING;
      	set_task_cpu(p, dst_cpu);
      	raw_spin_unlock(&src_rq->lock);
      
          	/*
          	 * Both RQs are unlocked here.
          	 * Task p is dequeued from src_rq
          	 * but its on_rq value is not zero.
          	 */
      
      	raw_spin_lock(&dst_rq->lock);
      	p->on_rq = TASK_ON_RQ_QUEUED;
      	enqueue_task(dst_rq, p, 0);
      	raw_spin_unlock(&dst_rq->lock);
      
While p->on_rq is TASK_ON_RQ_MIGRATING, the task is considered to be
"migrating", and other scheduler actions on it are not available to
parallel callers. A parallel caller spins until the migration is
completed.

The unavailable actions are changing CPU affinity, changing priority,
etc.; in other words, all the functionality related to the task which
used to require task_rq(p)->lock.
      
To implement TASK_ON_RQ_MIGRATING support we primarily rely on the
following fact. Most scheduler users (from which we are protecting a
migrating task) use task_rq_lock() and __task_rq_lock() to take the lock
of task_rq(p). These primitives know that the task's CPU may change, and
they spin until they hold the lock of the right RQ. We add one more
condition to them, so they also keep spinning until the migration is
finished, as sketched below.
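
A sketch of that extra condition (simplified; the real primitives also
deal with pi_lock and IRQ flags):

	static struct rq *__task_rq_lock(struct task_struct *p)
	{
		struct rq *rq;

		for (;;) {
			rq = task_rq(p);
			raw_spin_lock(&rq->lock);
			/* retry if the task moved OR is still marked migrating */
			if (likely(rq == task_rq(p) && !task_on_rq_migrating(p)))
				return rq;
			raw_spin_unlock(&rq->lock);

			while (unlikely(task_on_rq_migrating(p)))
				cpu_relax();
		}
	}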
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1408528062.23412.88.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cca26e80
• sched: Add wrapper for checking task_struct::on_rq · da0c1e65
Kirill Tkhai committed
Implement task_on_rq_queued() and use it everywhere instead of the raw
on_rq check. No functional changes.

The only exception is that we do not use the wrapper in
check_for_tasks(), because that would require exporting
task_on_rq_queued() in global header files. The next patch in the
series brings it back, so we do not move it back and forth.
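
The wrapper itself is a one-liner; a sketch (the constant name and value
are assumptions taken from the companion TASK_ON_RQ_MIGRATING patch):

	#define TASK_ON_RQ_QUEUED	1

	static inline int task_on_rq_queued(struct task_struct *p)
	{
		return p->on_rq == TASK_ON_RQ_QUEUED;
	}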
Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1408528052.23412.87.camel@tkhai
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      da0c1e65
• sched: Match declaration with definition · 8b06c55b
Pranith Kumar committed
      Match the declaration of runqueues with the definition.
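
Presumably along these lines (a sketch; matching the per-CPU attribute
between the two sites is the point):

	/* kernel/sched/core.c -- the definition */
	DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);

	/* kernel/sched/sched.h -- the declaration, now using the same attribute */
	DECLARE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);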
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1407950893-32731-1-git-send-email-bobby.prani@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8b06c55b
9. 16 Jul 2014, 2 commits
10. 05 Jul 2014, 1 commit
• sched/fair: Implement fast idling of CPUs when the system is partially loaded · 4486edd1
Tim Chen committed
When a system is lightly loaded (i.e. no more than 1 job per CPU),
attempting to pull a job to a CPU before putting it to idle is unnecessary
and can be skipped.  This patch adds an indicator so the scheduler can know
when there is no more than 1 active job on any CPU in the system, and skip
needless job pulls.
      
On a 4 socket machine with a request/response kind of workload from
clients, we saw about a 0.13 msec delay when going through a full load
balance to try to pull a job from all the other CPUs.  While 0.1 msec was
spent on processing the request and generating a response, the 0.13 msec
load balance overhead was more than the actual work being done.
This overhead can be skipped much of the time on lightly loaded systems.
      
With this patch, we tested a netperf request/response workload that
keeps the server busy on half the CPUs of a 4 socket system.  We found
the patch eliminated 75% of the load balance attempts before idling a CPU.
      
The overhead of setting/clearing the indicator is low, as we already gather
the necessary info when we call add_nr_running() and update_sd_lb_stats().
We switch back to full load balancing immediately if any CPU gets more than
one job on its run queue in add_nr_running().  We clear the indicator, to
skip load balancing, when we detect that no CPU has more than one job while
scanning the run queues in update_sg_lb_stats().  In short, we are
aggressive in turning load balancing on and opportunistic in skipping it;
a sketch of the add_nr_running() side follows.
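
A minimal sketch of the indicator's set side (assuming a root-domain
'overload' flag; the clearing side in update_sg_lb_stats() is analogous):

	static inline void add_nr_running(struct rq *rq, unsigned count)
	{
		unsigned prev_nr = rq->nr_running;

		rq->nr_running = prev_nr + count;

		/*
		 * This rq now has more than one runnable job: make sure idle
		 * CPUs go back to doing full idle load balancing.
		 */
		if (prev_nr < 2 && rq->nr_running >= 2) {
			if (!rq->rd->overload)
				rq->rd->overload = true;
		}
	}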
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Jason Low <jason.low2@hp.com>
      Cc: "Paul E.McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Alex Shi <alex.shi@linaro.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1403551009.2970.613.camel@schen9-DESK
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4486edd1
11. 16 Jun 2014, 2 commits
• nohz: Use IPI implicit full barrier against rq->nr_running r/w · 3882ec64
Frederic Weisbecker committed
      A full dynticks CPU is allowed to stop its tick when a single task runs.
      Meanwhile when a new task gets enqueued, the CPU must be notified so that
      it can restart its tick to maintain local fairness and other accounting
      details.
      
      This notification is performed by way of an IPI. Then when the target
      receives the IPI, we expect it to see the new value of rq->nr_running.
      
      Hence the following ordering scenario:
      
         CPU 0                   CPU 1
      
write rq->nr_running    get IPI
         smp_wmb()               smp_rmb()
         send IPI                read rq->nr_running
      
But Paul McKenney says that nowadays IPIs imply a full barrier on
all architectures. So we can safely remove this barrier pair and rely
on the implicit barriers that come with IPI send/receive. Let's
just add a comment documenting this new assumption.
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      3882ec64
• nohz: Use nohz own full kick on 2nd task enqueue · fd2ac4f4
Frederic Weisbecker committed
Now that we have a nohz full remote kick based on irq work, let's use
it to notify a CPU that it's exiting single-task mode.

This unbloats the scheduler IPI a bit, which the nohz code was abusing
for its cool "callable anywhere/anytime" properties.
Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      fd2ac4f4
12. 05 Jun 2014, 4 commits
• sched/idle: Optimize try-to-wake-up IPI · e3baac47
Peter Zijlstra committed
      [ This series reduces the number of IPIs on Andy's workload by something like
        99%. It's down from many hundreds per second to very few.
      
        The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
        reliable indication that the idle task is polling.  Once that's done,
        the rest is reasonably straightforward. ]
      
      When enqueueing tasks on remote LLC domains, we send an IPI to do the
      work 'locally' and avoid bouncing all the cachelines over.
      
      However, when the remote CPU is idle (and polling, say x86 mwait), we
      don't need to send an IPI, we can simply kick the TIF word to wake it
      up and have the 'idle' loop do the work.
      
      So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
      set, set _TIF_NEED_RESCHED and avoid sending the IPI.
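
A sketch of that check-and-set step (simplified; the real helper also
has to order itself against the idle loop clearing the polling flag):

	static bool set_nr_if_polling(struct task_struct *p)
	{
		struct thread_info *ti = task_thread_info(p);
		typeof(ti->flags) old, val = ACCESS_ONCE(ti->flags);

		for (;;) {
			if (!(val & _TIF_POLLING_NRFLAG))
				return false;	/* not polling: send the IPI */
			if (val & _TIF_NEED_RESCHED)
				return true;	/* already set: nothing to do */
			old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
			if (old == val)
				break;
			val = old;
		}
		return true;
	}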
Much-requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
[Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: nicolas.pitre@linaro.org
      Cc: daniel.lezcano@linaro.org
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: umgwanakikbuti@gmail.com
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3baac47
• sched: Remove remaining dubious usage of "power" · ced549fa
Nicolas Pitre committed
      It is better not to think about compute capacity as being equivalent
      to "CPU power".  The upcoming "power aware" scheduler work may create
      confusion with the notion of energy consumption if "power" is used too
      liberally.
      
      This is the remaining "power" -> "capacity" rename for local symbols.
      Those symbols visible to the rest of the kernel are not included yet.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: linaro-kernel@lists.linaro.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-yyyhohzhkwnaotr3lx8zd5aa@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ced549fa
• sched: Let 'struct sched_group_power' care about CPU capacity · 63b2ca30
Nicolas Pitre committed
      It is better not to think about compute capacity as being equivalent
      to "CPU power".  The upcoming "power aware" scheduler work may create
      confusion with the notion of energy consumption if "power" is used too
      liberally.
      
      Since struct sched_group_power is really about compute capacity of sched
      groups, let's rename it to struct sched_group_capacity. Similarly sgp
      becomes sgc. Related variables and functions dealing with groups are also
      adjusted accordingly.
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: linaro-kernel@lists.linaro.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-5yeix833vvgf2uyj5o36hpu9@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      63b2ca30
• sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock · 09dc4ab0
Roman Gushchin committed
tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
force a period timer restart. That is not safe, because it
can lead to the deadlock described in commit 927b54fc:
"__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock."
      
      Three CPUs must be involved:
      
        CPU0               CPU1                         CPU2
        take rq->lock      period timer fired
        ...                take cfs_b lock
        ...                ...                          tg_set_cfs_bandwidth()
        throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
        ...                distribute_cfs_runtime()     timer_active = 0
        take cfs_b->lock   wait for rq->lock            ...
        __start_cfs_bandwidth()
        {wait for timer callback
         break if timer_active == 1}
      
      So, CPU0 and CPU1 are deadlocked.
      
      Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
      wait for period timer callbacks (ignoring cfs_b->timer_active) and
      restart the timer explicitly.
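
A sketch of that approach (a simplified rendering with an assumed
'force' parameter; not necessarily the exact merged code):

	/*
	 * tg_set_cfs_bandwidth() passes force=true: keep waiting for a running
	 * period-timer callback instead of trusting cfs_b->timer_active, then
	 * restart the timer unconditionally.
	 */
	void __start_cfs_bandwidth(struct cfs_bandwidth *cfs_b, bool force)
	{
		while (unlikely(hrtimer_active(&cfs_b->period_timer)) &&
		       hrtimer_try_to_cancel(&cfs_b->period_timer) < 0) {
			/* bounce the lock so the timer callback can finish */
			raw_spin_unlock(&cfs_b->lock);
			cpu_relax();
			raw_spin_lock(&cfs_b->lock);
			/* if someone else restarted the timer, we are done */
			if (!force && cfs_b->timer_active)
				return;
		}

		cfs_b->timer_active = 1;
		start_bandwidth_timer(&cfs_b->period_timer, cfs_b->period);
	}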
Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
Reviewed-by: Ben Segall <bsegall@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/87wqdi9g8e.wl%klamm@yandex-team.ru
      Cc: pjt@google.com
      Cc: chris.j.arges@canonical.com
      Cc: gregkh@linuxfoundation.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      09dc4ab0
13. 22 May 2014, 1 commit
14. 18 Apr 2014, 2 commits
15. 11 Apr 2014, 1 commit
16. 13 Mar 2014, 1 commit
17. 11 Mar 2014, 1 commit
18. 27 Feb 2014, 2 commits
19. 23 Feb 2014, 1 commit
20. 22 Feb 2014, 4 commits
21. 11 Feb 2014, 1 commit
• sched: Push down pre_schedule() and idle_balance() · 38033c37
Peter Zijlstra committed
This patch merges idle_balance() and pre_schedule() and pushes both of
them into pick_next_task().
      
Conceptually, pre_schedule() and idle_balance() are rather similar;
both are used to pull more work onto the current CPU.
      
      We cannot however first move idle_balance() into pre_schedule_fair()
      since there is no guarantee the last runnable task is a fair task, and
      thus we would miss newidle balances.
      
Similarly, the dl and rt pre_schedule calls must be run before
idle_balance(), since their respective tasks have higher priority and
it would not do to delay their execution by searching for less
important tasks first.
      
However, by noticing that pick_next_task() already traverses the
sched_class hierarchy in the right order, we can get the right
behaviour and do away with both calls.
      
We must however change the special-case optimization to also require
that prev is of the fair sched_class, otherwise we can miss doing a dl
or rt pull where we needed one. A sketch of the resulting structure
follows.
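
A sketch of the resulting pick_next_task() shape (simplified; the exact
conditions and retry handling in the merged code may differ):

	static struct task_struct *
	pick_next_task(struct rq *rq, struct task_struct *prev)
	{
		const struct sched_class *class;
		struct task_struct *p;

		/*
		 * Optimization: short-circuit to the fair class only when prev
		 * was fair and all runnable tasks are fair, so that dl/rt
		 * pulls are never skipped.
		 */
		if (likely(prev->sched_class == &fair_sched_class &&
			   rq->nr_running == rq->cfs.h_nr_running)) {
			p = fair_sched_class.pick_next_task(rq, prev);
			if (likely(p))
				return p;
		}

		for_each_class(class) {
			p = class->pick_next_task(rq, prev);
			if (p)
				return p;
		}

		BUG();	/* the idle class should always have a runnable task */
	}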
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      38033c37
22. 10 Feb 2014, 1 commit