Commits · 43fa5460fe60dea5c610490a1d263415419c60f6 · openeuler / raspberrypi-kernel

21 Sep, 2010 1 commit

sched: Try not to migrate higher priority RT tasks · 43fa5460

Committed by Steven Rostedt on Sep 20, 2010

When first working on the RT scheduler design, we concentrated on
keeping all CPUs running RT tasks instead of having multiple RT
tasks on a single CPU waiting for the migration thread to move
them. Instead we take a more proactive stance and push or pull RT
tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task,
instead of preempting it and killing the cache of the running RT
task, we look to see if we can migrate the RT task that is waking
up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in
migration by the user anyway. But in practice, people do not do
this, which causes high prio RT tasks to bounce around the CPUs.
This becomes even worse when we have priority inheritance, because
a high prio task can block on a lower prio task and boost its
priority. When the lower prio task wakes up the high prio task, if
it happens to be on the same CPU it will migrate off of it.

In practice this does not happen often either, because the lower
prio task has already been boosted by the time it wakes up: if it
were on the same CPU as the higher prio task, it would itself
migrate off of that CPU. Either way, we do not want these tasks to
migrate.

To examine the scheduling, I created a test program and examined it
under kernelshark. The test program created CPU * 2 threads, where
each thread had a different priority. The program takes different
options. The option used in this change log selects whether or not
to use priority inheritance mutexes.
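
As a rough illustration of that option, the sketch below initializes a
lock with or without the POSIX priority inheritance protocol. The
helper name init_test_lock() and the use_pi parameter are assumptions
made for this example; the test program's real option handling is not
shown in this change log.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <pthread.h>

/*
 * Initialize one test lock. When use_pi is non-zero the mutex uses the
 * priority inheritance protocol, so a low priority owner is boosted to
 * the priority of its highest priority waiter.
 * Returns 0 on success or a pthread error number.
 * (Hypothetical helper, not taken from the test program.)
 */
static int init_test_lock(pthread_mutex_t *lock, int use_pi)
{
	pthread_mutexattr_t attr;
	int ret;

	assert(lock != NULL);

	ret = pthread_mutexattr_init(&attr);
	if (ret)
		return ret;

	if (use_pi) {
		ret = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
		if (ret)
			goto out;
	}

	ret = pthread_mutex_init(lock, &attr);
out:
	pthread_mutexattr_destroy(&attr);
	return ret;
}
```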

All threads did the following loop:

static void grab_lock(long id, int iter, int l)
{
	ftrace_write("thread %ld iter %d, taking lock %d\n",
		     id, iter, l);
	pthread_mutex_lock(&locks[l]);
	ftrace_write("thread %ld iter %d, took lock %d\n",
		     id, iter, l);
	busy_loop(nr_tasks - id);
	ftrace_write("thread %ld iter %d, unlock lock %d\n",
		     id, iter, l);
	pthread_mutex_unlock(&locks[l]);
}

void *start_task(void *id)
{
	[...]
	while (!done) {
		for (l = 0; l < nr_locks; l++) {
			grab_lock(id, i, l);
			ftrace_write("thread %ld iter %d sleeping\n",
				     id, i);
			ms_sleep(id);
		}
		i++;
	}
	[...]
}

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The
ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes
to the ftrace buffer to help analyze via ftrace.
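
The helpers themselves are not shown in this change log. A minimal
sketch of how they could look, assuming a monotonic clock for the spin,
nanosleep() for the sleep, and the ftrace trace_marker file under a
debugfs mount at /sys/kernel/debug (ftrace_open() and now_ns() are
names invented for this example):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* fd of the ftrace marker file; stays -1 when tracing is unavailable */
static int marker_fd = -1;

/* Assumed debugfs mount point; adjust for the system under test. */
static void ftrace_open(void)
{
	marker_fd = open("/sys/kernel/debug/tracing/trace_marker", O_WRONLY);
}

/* printf-style write into the ftrace buffer; a no-op without marker_fd */
static void ftrace_write(const char *fmt, ...)
{
	char buf[256];
	va_list ap;
	int n;

	if (marker_fd < 0)
		return;
	va_start(ap, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	if (n >= (int)sizeof(buf))
		n = sizeof(buf) - 1;	/* output was truncated */
	if (n > 0 && write(marker_fd, buf, n) < 0)
		marker_fd = -1;		/* stop trying after a failed write */
}

/* Monotonic clock in nanoseconds. */
static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Keep the CPU spinning, without sleeping, for about ms milliseconds. */
static void busy_loop(long ms)
{
	long long end;

	assert(ms >= 0);
	end = now_ns() + ms * 1000000LL;
	while (now_ns() < end)
		;
}

/* Sleep (off the CPU) for at least ms milliseconds. */
static void ms_sleep(long ms)
{
	struct timespec ts = {
		.tv_sec = ms / 1000,
		.tv_nsec = (ms % 1000) * 1000000L,
	};

	/* if a signal interrupts the sleep, continue with the remainder */
	while (nanosleep(&ts, &ts) == -1 && errno == EINTR)
		;
}
```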

The higher the id, the higher the prio, the shorter it does the
busy loop, but the longer it sleeps. This is usually the case with
RT tasks: the lower priority tasks usually run longer than higher
priority tasks.

At the end of the test, it records the number of loops each thread
took, as well as the number of voluntary preemptions, non-voluntary
preemptions, and number of migrations each thread took, taking the
information from /proc/$$/sched and /proc/$$/status.
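
One way such counters can be collected is sketched below. The function
name read_proc_counter() is made up for this example. The keys
voluntary_ctxt_switches and nonvoluntary_ctxt_switches are fields of
/proc/<pid>/status; reading a migration count from /proc/<pid>/sched
with a key such as se.nr_migrations is an assumption about that file's
layout, which depends on the kernel version and configuration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up "key" in a /proc style "key : value" file and return the
 * numeric value, or -1 if the file or the key cannot be found.
 * Whitespace between the key and the colon is tolerated.
 */
static long read_proc_counter(const char *path, const char *key)
{
	char line[256];
	size_t klen;
	long val = -1;
	FILE *fp;

	assert(path != NULL && key != NULL);
	klen = strlen(key);
	fp = fopen(path, "r");
	if (!fp)
		return -1;

	while (fgets(line, sizeof(line), fp)) {
		const char *p;

		if (strncmp(line, key, klen) != 0)
			continue;
		p = line + klen;
		while (*p == ' ' || *p == '\t')
			p++;
		if (*p != ':')
			continue;	/* key was only a prefix of a longer name */
		if (sscanf(p + 1, "%ld", &val) != 1)
			val = -1;
		break;
	}
	fclose(fp);
	return val;
}
```

A thread could call read_proc_counter("/proc/self/status",
"voluntary_ctxt_switches") at exit and print the result next to its
iteration count.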

Running this on a 4 CPU processor, the results without changes to
the kernel looked like this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         53      3220       1470             98
  1:        562       773        724             98
  2:        752       933       1375             98
  3:        749        39        697             98
  4:        758         5        515             98
  5:        764         2        679             99
  6:        761         2        535             99
  7:        757         3        346             99

total:     5156       4977      6341            787

Each thread, regardless of priority, migrated a few hundred times.
The higher priority tasks were a little better but still took
quite an impact.

By letting higher priority tasks bump the lower prio task from the
CPU, things changed a bit:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         37      2835       1937             98
  1:        666      1821       1865             98
  2:        654      1003       1385             98
  3:        664       635        973             99
  4:        698       197        352             99
  5:        703       101        159             99
  6:        708         1         75             99
  7:        713         1          2             99

total:     4843       6594      6748            789

The total # of migrations did not change (several runs showed the
difference all within the noise). But we now see a dramatic
improvement to the higher priority tasks. (kernelshark showed that
the watchdog timer bumped the highest priority task to give it the
2 count. This was actually consistent with every run).

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the
higher priority task blocked on a lower priority task, the lower
priority task would inherit the priority of the higher priority task
(which shows why task 6 was bumped so many times). When not using
priority inheritance mutexes, the current kernel shows this:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:         56      3101       1892             95
  1:        594       713        937             95
  2:        625       188        618             95
  3:        628         4        491             96
  4:        640         7        468             96
  5:        631         2        501             96
  6:        641         1        466             96
  7:        643         2        497             96

total:     4458       4018      5870            765

Not much changed with or without priority inheritance mutexes. But
if we let the high priority task bump lower priority tasks on
wakeup we see:

Task        vol    nonvol   migrated     iterations
----        ---    ------   --------     ----------
  0:        115      3439       2782             98
  1:        633      1354       1583             99
  2:        652       919       1218             99
  3:        645       713        934             99
  4:        690         3          3             99
  5:        694         1          4             99
  6:        720         3          4             99
  7:        747         0          1            100

Which shows an even bigger change. The big difference between task 3
and task 4 is because we have only 4 CPUs on the machine, causing
the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would
be little to measure since the test was not data intensive, I could
imagine large improvements for higher priority tasks when dealing
with lower priority tasks. Thus, I'm satisfied with making the
change and agreeing with what Gregory Haskins argued a few years
ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT
task will always preempt a non RT task that is running on the CPU
the RT task wants to run on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

18 Jun, 2010 1 commit

sched: task_tick_rt: Remove the obsolete ->signal != NULL check · c32b4fce

Committed by Oleg Nesterov on Jun 11, 2010

Remove the obsolete ->signal != NULL check in watchdog().
Since ea6d290c ->signal can't be NULL.

Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100610230948.GA25911@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

03 Apr, 2010 2 commits

sched: Add enqueue/dequeue flags · 371fd7e7

Committed by Peter Zijlstra on Mar 24, 2010

In order to reduce the dependency on TASK_WAKING rework the enqueue
interface to support a proper flags field.

Replace the int wakeup, bool head arguments with an int flags argument
and create the following flags:

  ENQUEUE_WAKEUP - the enqueue is a wakeup of a sleeping task,
  ENQUEUE_WAKING - the enqueue has relative vruntime due to
                   having sched_class::task_waking() called,
  ENQUEUE_HEAD - the waking task should be placed on the head
                 of the priority queue (where appropriate).

For symmetry also convert sched_class::dequeue() to a flags scheme.
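
To illustrate the shape of the new interface, here is a user-space
sketch. Only the three flag names come from this change log; the
numeric values, the toy_queue type and both helper functions are
assumptions made for the example and do not reproduce the kernel code.

```c
#include <assert.h>

/* Flag names from the change log; the values here are illustrative. */
#define ENQUEUE_WAKEUP	0x1	/* the enqueue is a wakeup of a sleeping task */
#define ENQUEUE_WAKING	0x2	/* vruntime is relative, task_waking() was called */
#define ENQUEUE_HEAD	0x4	/* place the task at the head of its queue */

/* Map the old (int wakeup, bool head) argument pair onto one flags word. */
static int enqueue_flags_from_legacy(int wakeup, int head)
{
	int flags = 0;

	if (wakeup)
		flags |= ENQUEUE_WAKEUP;
	if (head)
		flags |= ENQUEUE_HEAD;
	return flags;
}

/* Toy stand-in for one priority queue of a run-queue. */
#define TOY_QUEUE_MAX 8
struct toy_queue {
	int pid[TOY_QUEUE_MAX];
	int nr;
};

/* Enqueue at the tail by default, at the head when ENQUEUE_HEAD is set. */
static void toy_enqueue(struct toy_queue *q, int pid, int flags)
{
	int i;

	assert(q->nr < TOY_QUEUE_MAX);
	if (flags & ENQUEUE_HEAD) {
		for (i = q->nr; i > 0; i--)
			q->pid[i] = q->pid[i - 1];
		q->pid[0] = pid;
	} else {
		q->pid[q->nr] = pid;
	}
	q->nr++;
}
```

A single flags word lets later changes add new behaviours without
touching every caller's argument list again.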

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Fix TASK_WAKING vs fork deadlock · 0017d735

Committed by Peter Zijlstra on Mar 24, 2010

Oleg noticed a few races with the TASK_WAKING usage on fork.

 - since TASK_WAKING is basically a spinlock, it should be IRQ safe
 - since we set TASK_WAKING (*) without holding rq->lock it could
   be there still is a rq->lock holder, thereby not actually
   providing full serialization.

(*) in fact we clear PF_STARTING, which in effect enables TASK_WAKING.

Cure the second issue by not setting TASK_WAKING in sched_fork(), but
only temporarily in wake_up_new_task() while calling select_task_rq().

Cure the first by holding rq->lock around the select_task_rq() call,
this will disable IRQs, this however requires that we push down the
rq->lock release into select_task_rq_fair()'s cgroup stuff.

Because select_task_rq_fair() still needs to drop the rq->lock we
cannot fully get rid of TASK_WAKING.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

11 Mar, 2010 2 commits

sched: Implement group scheduler statistics in one struct · 41acab88

Committed by Lucas De Marchi on Mar 10, 2010

Put all statistic fields of sched_entity in one struct, sched_statistics,
and embed it into sched_entity.

This change allows the sched_statistics to be memset to 0 when needed
(for instance when forking), avoiding bugs caused by uninitialized fields.
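
The benefit can be sketched as follows. The struct and type names
follow the change log, but the individual fields, the member name
statistics and the helper reset_statistics() are placeholders chosen
for this example and do not reproduce the kernel's layout.

```c
#include <assert.h>
#include <string.h>

/* All statistic fields live in one struct (field names are placeholders). */
struct sched_statistics {
	unsigned long long wait_start;
	unsigned long long wait_max;
	unsigned long long nr_migrations;
};

struct sched_entity {
	unsigned long long vruntime;		/* scheduling state, not a statistic */
	struct sched_statistics statistics;	/* embedded, so one memset clears it */
};

/* Clear every statistic at once, e.g. for a freshly forked task. */
static void reset_statistics(struct sched_entity *se)
{
	assert(se != NULL);
	memset(&se->statistics, 0, sizeof(se->statistics));
}
```

A field added to sched_statistics later is cleared by the same memset,
so it cannot be forgotten in the fork path.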

Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1268275065-18542-1-git-send-email-lucas.de.marchi@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Fix pick_next_highest_task_rt() for cgroups · 3d07467b

Committed by Peter Zijlstra on Mar 10, 2010

Since pick_next_highest_task_rt() already iterates all the cgroups and
is really only interested in tasks, skip over the !task entries.

Reported-by: Dhaval Giani <dhaval.giani@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Dhaval Giani <dhaval.giani@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

07 Mar, 2010 1 commit

kernel core: use helpers for rlimits · 78d7d407

Committed by Jiri Slaby on Mar 05, 2010

Make sure compiler won't do weird things with limits.  E.g.  fetching them
twice may return 2 different values after writable limits are implemented.

I.e.  either use rlimit helpers added in commit 3e10e716 ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
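
The hazard can be sketched in user space as below. The macro
READ_ONCE_SKETCH(), the toy_rlimit type and over_soft_limit() are
invented for this example. The macro only mirrors the idea behind
ACCESS_ONCE (a single volatile load) and relies on the GCC/Clang
__typeof__ extension; it is not the kernel's definition.

```c
#include <assert.h>

/* Force exactly one load of x so the compiler cannot re-fetch it. */
#define READ_ONCE_SKETCH(x) (*(volatile __typeof__(x) *)&(x))

/* Toy stand-in for a resource limit that another context may rewrite. */
struct toy_rlimit {
	unsigned long rlim_cur;	/* soft limit */
	unsigned long rlim_max;	/* hard limit */
};

/*
 * Check usage against the soft limit. The limit is fetched once into a
 * local, so the comparison and the value reported through limit_seen
 * are guaranteed to agree even if rl->rlim_cur changes concurrently.
 */
static int over_soft_limit(struct toy_rlimit *rl, unsigned long usage,
			   unsigned long *limit_seen)
{
	unsigned long soft = READ_ONCE_SKETCH(rl->rlim_cur);

	assert(limit_seen != 0);
	*limit_seen = soft;
	return usage > soft;
}
```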

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john stultz <johnstul@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

04 Feb, 2010 1 commit

sched: Change usage of rt_rq->rt_se to rt_rq->tg->rt_se[cpu] · 74b7eb58

Committed by Yong Zhang on Jan 29, 2010

This is the first step to remove the rt_rq member rt_se, because it has
the same meaning as tg->rt_se[cpu]. The latter style is also used by
the fair scheduling class.

Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <2674af741001282257r28c97a92o9f90cf16fe8d3d84@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

23 Jan, 2010 2 commits

sched: Implement head queueing for sched_rt · 37dad3fc

Committed by Thomas Gleixner on Jan 20, 2010

The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.

Implement the functionality in sched_rt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.772169931@linutronix.de>

sched: Extend enqueue_task to allow head queueing · ea87bb78

Committed by Thomas Gleixner on Jan 20, 2010

The ability of enqueueing a task to the head of a SCHED_FIFO priority
list is required to fix some violations of POSIX scheduling policy.

Extend the related functions with a "head" argument.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Carsten Emde <cbe@osadl.org>
Tested-by: Mathias Weber <mathias.weber.mw1@roche.com>
LKML-Reference: <20100120171629.734886007@linutronix.de>

21 Jan, 2010 1 commit

sched: Remove the sched_class load_balance methods · 3d45fd80

Committed by Peter Zijlstra on Dec 17, 2009

Take out the sched_class methods for load-balancing.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

17 Jan, 2010 1 commit

sched: Don't expose local functions · 6d686f45

Committed by H Hartley Sweeten on Jan 13, 2010

kernel/sched: don't expose local functions

The get_rr_interval_* functions are all class methods of
struct sched_class. They are not exported so make them
static.

Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <201001132021.53253.hartleys@visionengravers.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

17 Dec, 2009 1 commit

sched: Add pre and post wakeup hooks · efbbd05a

Committed by Peter Zijlstra on Dec 16, 2009

As will be apparent in the next patch, we need a pre wakeup hook
for sched_fair task migration, hence rename the post wakeup hook
and add a pre wakeup one.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20091216170518.114746117@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

15 Dec, 2009 2 commits

sched: Convert rt_runtime_lock to raw_spinlock · 0986b11b

Committed by Thomas Gleixner on Nov 17, 2009

Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>

sched: Convert rq->lock to raw_spinlock · 05fa785c

Committed by Thomas Gleixner on Nov 17, 2009

Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Ingo Molnar <mingo@elte.hu>

09 Dec, 2009 1 commit

sched: Protect sched_rr_get_param() access to task->sched_class · dba091b9

Committed by Thomas Gleixner on Dec 09, 2009

sched_rr_get_param calls
task->sched_class->get_rr_interval(task) without protection
against a concurrent sched_setscheduler() call which modifies
task->sched_class.

Serialize the access with task_rq_lock(task) and hand the rq
pointer into get_rr_interval() as it's needed at least in the
sched_fair implementation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <alpine.LFD.2.00.0912090930120.3089@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

04 Nov, 2009 1 commit

cpumask: Simplify sched_rt.c · e2c88063

Committed by Rusty Russell on Nov 03, 2009

find_lowest_rq() wants to call pick_optimal_cpu() on the
intersection of sched_domain_span(sd) and lowest_mask.  Rather
than doing a cpus_and into a temporary, we can open-code it.

This actually makes the code slightly clearer, IMHO.
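
The idea of open-coding the intersection can be sketched in user space
as follows. The toy_cpumask type, its size and the helper names are
assumptions made for this example; the kernel's cpumask API and
pick_optimal_cpu() are not reproduced here.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS	256
#define TOY_WORD_BITS	64

/* Toy multi-word CPU mask; large masks are what make temporaries costly. */
struct toy_cpumask {
	uint64_t bits[TOY_NR_CPUS / TOY_WORD_BITS];
};

static void toy_set_cpu(struct toy_cpumask *m, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	m->bits[cpu / TOY_WORD_BITS] |= UINT64_C(1) << (cpu % TOY_WORD_BITS);
}

static int toy_test_cpu(const struct toy_cpumask *m, int cpu)
{
	return (int)((m->bits[cpu / TOY_WORD_BITS] >> (cpu % TOY_WORD_BITS)) & 1);
}

/*
 * Return the first CPU present in both masks, or -1 if there is none.
 * The intersection is tested bit by bit while walking, so no temporary
 * mask has to be filled in first.
 */
static int first_cpu_in_both(const struct toy_cpumask *span,
			     const struct toy_cpumask *lowest_mask)
{
	int cpu;

	for (cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		if (toy_test_cpu(span, cpu) && toy_test_cpu(lowest_mask, cpu))
			return cpu;
	}
	return -1;
}
```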

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Gregory Haskins <ghaskins@novell.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <200911031453.15350.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

21 Sep, 2009 1 commit

sched: Simplify sys_sched_rr_get_interval() system call · 0d721cea

Committed by Peter Williams on Sep 21, 2009

By removing the need for it to know details of scheduling classes.

This allows PlugSched to define orthogonal scheduling classes.

Signed-off-by: Peter Williams <pwil3058@bigpond.net.au>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

15 Sep, 2009 3 commits

sched: Rename sync arguments · 7d478721

Committed by Peter Zijlstra on Sep 14, 2009

In order to extend the functions to have more than 1 flag (sync),
rename the argument to flags, and explicitly define a WF_ space for
individual flags.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Rename select_task_rq() argument · 0763a660

Committed by Peter Zijlstra on Sep 14, 2009

In order to be able to rename the sync argument, we need to rename
the current flag argument.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Hook sched_balance_self() into sched_class::select_task_rq() · 5f3edc1b

Committed by Peter Zijlstra on Sep 10, 2009

Rather ugly patch to fully place the sched_balance_self() code
inside the fair class.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

04 Sep, 2009 1 commit

sched: Scale down cpu_power due to RT tasks · e9e9250b

Committed by Peter Zijlstra on Sep 01, 2009

Keep an average of the amount of time spent on RT tasks and use
that fraction to scale down the cpu_power for regular tasks.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Acked-by: Gautham R Shenoy <ego@in.ibm.com>
Cc: Balbir Singh <balbir@in.ibm.com>
LKML-Reference: <20090901083826.287778431@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

02 Aug, 2009 4 commits

sched: Fix cpupri build on !CONFIG_SMP · bcf08df3

Committed by Ingo Molnar on Apr 19, 2008

This build bug:

 In file included from kernel/sched.c:1765:
 kernel/sched_rt.c: In function ‘has_pushable_tasks’:
 kernel/sched_rt.c:1069: error: ‘struct rt_rq’ has no member named ‘pushable_tasks’
 kernel/sched_rt.c: In function ‘pick_next_task_rt’:
 kernel/sched_rt.c:1084: error: ‘struct rq’ has no member named ‘post_schedule’

Triggers because both pushable_tasks and post_schedule are
SMP-only fields.

Move pushable_tasks() to the SMP section and #ifdef the post_schedule use.

Cc: Gregory Haskins <ghaskins@novell.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Add debug check to task_of() · 8f48894f

Committed by Peter Zijlstra on Jul 24, 2009

A frequent mistake appears to be to call task_of() on a
scheduler entity that is not actually a task, which can result
in a wild pointer.

Add a check to catch these mistakes.

Suggested-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Fully integrate cpus_active_map and root-domain code · 00aec93d

Committed by Gregory Haskins on Jul 30, 2009

Reflect "active" cpus in the rq->rd->online field, instead of
the online_map.

The motivation is that things that use the root-domain code
(such as cpupri) only care about cpus classified as "active"
anyway. By synchronizing the root-domain state with the active
map, we allow several optimizations.

For instance, we can remove an extra cpumask_and from the
scheduler hotpath by utilizing rq->rd->online (since it is now
a cached version of cpu_active_map & rq->rd->span).

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090730145723.25226.24493.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

sched: Enhance the pre/post scheduling logic · 3f029d3c

Committed by Gregory Haskins on Jul 29, 2009

We currently have an explicit "needs_post" vtable method which
returns a stack variable for whether we should later run
post-schedule.  This leads to an awkward exchange of the
variable as it bubbles back up out of the context switch. Peter
Zijlstra observed that this information could be stored in the
run-queue itself instead of handled on the stack.

Therefore, we revert to the method of having context_switch
return void, and update an internal rq->post_schedule variable
when we require further processing.

In addition, we fix a race condition where we try to access
current->sched_class without holding the rq->lock.  This is
technically racy, as the sched-class could change out from
under us.  Instead, we reference the per-rq post_schedule
variable with the runqueue unlocked, but with preemption
disabled to see if we need to reacquire the rq->lock.

Finally, we clean the code up slightly by removing the #ifdef
CONFIG_SMP conditionals from the schedule() call, and implement
some inline helper functions instead.

This patch passes checkpatch, and rt-migrate.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

10 Jul, 2009 1 commit

sched_rt: Fix overload bug on rt group scheduling · a1ba4d8b

Committed by Peter Zijlstra on Apr 01, 2009

Fixes an easily triggerable BUG() when setting process affinities.

Make sure to count the number of migratable tasks in the same place:
the root rt_rq. Otherwise the number doesn't make sense and we'll hit
the BUG in set_cpus_allowed_rt().

Also, make sure we only count tasks, not groups (this is probably
already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
for groups, but be more explicit)

Tested-by: Thomas Gleixner <tglx@linutronix.de>
CC: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <1247067476.9777.57.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

09 Jun, 2009 1 commit

cpumask: alloc zeroed cpumask for static cpumask_var_ts · eaa95840

Committed by Yinghai Lu on Jun 06, 2009

These are defined as static cpumask_var_t so if MAXSMP is not used,
they are cleared already.  Avoid surprises when MAXSMP is enabled.

Signed-off-by: Yinghai Lu <yinghai.lu@kernel.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

01 Apr, 2009 1 commit

sched_rt: don't allocate cpumask in fastpath · 13b8bd0a

Committed by Rusty Russell on Mar 25, 2009

Impact: cleanup

As pointed out by Steven Rostedt.  Since the arg in question is
unused, we simply change cpupri_find() to accept NULL.

Reported-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <200903251501.22664.rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

01 Feb, 2009 1 commit

sched_rt: don't use first_cpu on cpumask created with cpumask_and · 3d398703

Committed by Rusty Russell on Jan 31, 2009

cpumask_and() only initializes nr_cpu_ids bits, so the (deprecated)
first_cpu() might find one of those uninitialized bits if nr_cpu_ids
is less than NR_CPUS (as it can be for CONFIG_CPUMASK_OFFSTACK).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

16 Jan, 2009 1 commit

sched: make plist a library facility · ceacc2c1

Committed by Peter Zijlstra on Jan 16, 2009

Ingo Molnar wrote:

> here's a new build failure with tip/sched/rt:
>
>   LD      .tmp_vmlinux1
> kernel/built-in.o: In function `set_curr_task_rt':
> sched.c:(.text+0x3675): undefined reference to `plist_del'
> kernel/built-in.o: In function `pick_next_task_rt':
> sched.c:(.text+0x37ce): undefined reference to `plist_del'
> kernel/built-in.o: In function `enqueue_pushable_task':
> sched.c:(.text+0x381c): undefined reference to `plist_del'

Eliminate the plist library kconfig and make it available
unconditionally.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

14 Jan, 2009 2 commits

sched: fix build error in kernel/sched_rt.c when RT_GROUP_SCHED && !SMP · 398a153b

Committed by Gregory Haskins on Jan 14, 2009

Ingo found a build error in the scheduler when RT_GROUP_SCHED was
enabled, but SMP was not.  This patch rearranges the code such
that it is a little more streamlined and compiles under all permutations
of SMP, UP and RT_GROUP_SCHED.  It was boot tested on my 4-way x86_64
and it still passes preempt-test.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>

sched: de CPP-ify the scheduler code · b07430ac

Committed by Gregory Haskins on Jan 14, 2009

Signed-off-by: Gregory Haskins <ghaskins@novell.com>

12 Jan, 2009 1 commit

cpumask: reduce stack usage in find_lowest_rq · d38b223c

Committed by Mike Travis on Jan 10, 2009

Impact: reduce stack usage, cleanup

Use a cpumask_var_t in find_lowest_rq() and clean up other old
cpumask_t calls.

Signed-off-by: Mike Travis <travis@sgi.com>

04 Jan, 2009 1 commit

sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc

Committed by Mike Travis on Dec 31, 2008

Impact: prevents panic from stack overflow on numa-capable machines.

Some of the "removal of stack hogs" changes in kernel/sched.c by using
node_to_cpumask_ptr were undone by the early cpumask API updates, which
causes a panic due to stack overflow.  This patch restores those changes
by using cpumask_of_node() which returns a 'const struct cpumask *'.

In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
reducing stack usage.  (Both of these updates removed 9 FIXME's!)

Also:
   Pick up some remaining changes from the old 'cpumask_t' functions to
   the new 'struct cpumask *' functions.

   Optimize memory traffic by allocating each percpu local_cpu_mask on the
   same node as the referring cpu.

Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

29 Dec, 2008 5 commits

RT: fix push_rt_task() to handle dequeue_pushable properly · 1563513d

Committed by Gregory Haskins on Dec 29, 2008

A panic was discovered by Chirag Jog where a BUG_ON sanity check
in the new "pushable_task" logic would trigger a panic under
certain circumstances:

http://lkml.org/lkml/2008/9/25/189

Gilles Carry discovered that the root cause was attributed to the
pushable_tasks list getting corrupted in the push_rt_task logic.
This was the result of a dropped rq lock in double_lock_balance
allowing a task in the process of being pushed to potentially migrate
away, and thus corrupt the pushable_tasks() list.

I traced back the problem as introduced by the pushable_tasks patch
that went in recently.   There is a "retry" path in push_rt_task()
that actually had a compound conditional to decide whether to
retry or exit.  I missed the meaning behind the rationale for the
virtual "if(!task) goto out;" portion of the compound statement and
thus did not handle it properly.  The new pushable_tasks logic
actually creates three distinct conditions:

1) an untouched and unpushable task should be dequeued
2) a migrated task where more pushable tasks remain should be retried
3) a migrated task where no more pushable tasks exist should exit

The original logic mushed (1) and (3) together, resulting in the
system dequeuing a migrated task (against an unlocked foreign run-queue
nonetheless).

To fix this, we get rid of the notion of "paranoid" and we support the
three unique conditions properly.  The paranoid feature is no longer
relevant with the new pushable logic (since pushable naturally limits
the loop) anyway, so let's just remove it.

Reported-By: Chirag Jog <chirag@linux.vnet.ibm.com>
Found-by: Gilles Carry <gilles.carry@bull.net>
Signed-off-by: Gregory Haskins <ghaskins@novell.com>

sched: create "pushable_tasks" list to limit pushing to one attempt · 917b627d

Committed by Gregory Haskins on Dec 29, 2008

The RT scheduler employs a "push/pull" design to actively balance tasks
within the system (on a per disjoint cpuset basis).  When a task is
awoken, it is immediately determined if there are any lower priority
cpus which should be preempted.  This is opposed to the way normal
SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
operation to occur before spreading out load.

When a particular RQ has more than 1 active RT task, it is said to
be in an "overloaded" state.  Once this occurs, the system enters
the active balancing mode, where it will try to push the task away,
or persuade a different cpu to pull it over.  The system will stay
in this state until the system falls back to <= 1 queued RT
task per RQ.

However, the current implementation suffers from a limitation in the
push logic.  Once overloaded, all tasks (other than current) on the
RQ are analyzed on every push operation, even if they were previously
unpushable (due to affinity, etc).  What's more, the operation stops
at the first task that is unpushable and will not look at items
lower in the queue.  This causes two problems:

1) We can have the same tasks analyzed over and over again during each
   push, which extends out the fast path in the scheduler for no
   gain.  Consider a RQ that has dozens of tasks that are bound to a
   core.  Each one of those tasks will be encountered and skipped
   for each push operation while they are queued.

2) There may be lower-priority tasks under the unpushable task that
   could have been successfully pushed, but will never be considered
   until either the unpushable task is cleared, or a pull operation
   succeeds.  The net result is a potential latency source for mid
   priority tasks.

This patch aims to rectify these two conditions by introducing a new
priority sorted list: "pushable_tasks".  A task is added to the list
each time a task is activated or preempted.  It is removed from the
list any time it is deactivated, made current, or fails to push.

This works because a push only needs to be attempted once per task.
After an initial failure to push, the other cpus will eventually try to
pull the task when the conditions are proper.  This also solves the
problem that we don't completely analyze all tasks due to encountering
an unpushable task.  Now every task will have a push attempted (when
appropriate).

This reduces latency both by shortening the critical section of the
rq->lock for certain workloads, and by making sure the algorithm
considers all eligible tasks in the system.
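
A simplified user-space sketch of a priority sorted pushable list is
given below. It is not the kernel implementation: toy_task, toy_rq and
all three functions are invented for this example, a plain linked list
stands in for the kernel's list type, and "can be pushed" is reduced to
"may run on more than one CPU". As in the kernel, a lower prio value
means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

struct toy_task {
	int pid;
	int prio;		/* lower value = higher priority */
	int nr_cpus_allowed;	/* 1 means the task is bound to one CPU */
	struct toy_task *next;
};

struct toy_rq {
	struct toy_task *pushable;	/* sorted, highest priority first */
};

/* Insert p so that the list stays sorted by priority. */
static void enqueue_pushable(struct toy_rq *rq, struct toy_task *p)
{
	struct toy_task **pos = &rq->pushable;

	while (*pos && (*pos)->prio <= p->prio)
		pos = &(*pos)->next;
	p->next = *pos;
	*pos = p;
}

/* Remove p from the list if it is queued there. */
static void dequeue_pushable(struct toy_rq *rq, struct toy_task *p)
{
	struct toy_task **pos = &rq->pushable;

	while (*pos && *pos != p)
		pos = &(*pos)->next;
	if (*pos) {
		*pos = p->next;
		p->next = NULL;
	}
}

/*
 * Attempt one push per queued task. Every task leaves the list after
 * its attempt, whether the push worked or not, so an unpushable task
 * neither blocks the tasks behind it nor gets examined again.
 * Returns the number of tasks that could be pushed.
 */
static int push_all_once(struct toy_rq *rq)
{
	struct toy_task *p;
	int pushed = 0;

	while ((p = rq->pushable) != NULL) {
		dequeue_pushable(rq, p);
		if (p->nr_cpus_allowed > 1)
			pushed++;
	}
	return pushed;
}
```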

[ rostedt: added a couple more BUG_ONs ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Steven Rostedt <srostedt@redhat.com>

sched: add sched_class->needs_post_schedule() member · 967fc046

Committed by Gregory Haskins on Dec 29, 2008

We currently run class->post_schedule() outside of the rq->lock, which
means that we need to test for the need to post_schedule outside of
the lock to avoid a forced reacquisition.  This is currently not a problem
as we only look at rq->rt.overloaded.  However, we want to enhance this
going forward to look at more state to reduce the need to post_schedule to
a bare minimum set.  Therefore, we introduce a new member-func called
needs_post_schedule() which tests for the post_schedule condition without
actually performing the work.  It is safe to call this
function before the rq->lock is released, because we are guaranteed not
to drop the lock at an intermediate point (such as what post_schedule()
may do).

We will use this later in the series.

[ rostedt: removed paranoid BUG_ON ]
Signed-off-by: Gregory Haskins <ghaskins@novell.com>

sched: only try to push a task on wakeup if it is migratable · 777c2f38

Committed by Gregory Haskins on Dec 29, 2008

There is no sense in wasting time trying to push a task away that
cannot move anywhere else.  We gain no benefit from trying to push
other tasks at this point, so if the task being woken up is non
migratable, just skip the whole operation.  This reduces overhead
in the wakeup path for certain tasks.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>

sched: use highest_prio.next to optimize pull operations · 74ab8e4f

Committed by Gregory Haskins on Dec 29, 2008

We currently take the rq->lock for every cpu in an overload state during
pull_rt_tasks().  However, we now have enough information via the
highest_prio.[curr|next] fields to determine if there are any tasks of
interest to warrant the overhead of the rq->lock, before we actually take
it.  So we use this information to reduce lock contention during the
pull for the case where the source-rq doesn't have tasks that preempt
the current task.
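
The check can be sketched in user space as follows. The toy_rt_rq type,
the lock_taken counter and both functions are assumptions made for this
example; the counter merely stands in for taking the source rq->lock.
As in the kernel, a lower value means a higher priority.

```c
#include <assert.h>

struct toy_rt_rq {
	int curr;	/* priority of the highest queued task */
	int next;	/* priority of the second highest queued task */
	int lock_taken;	/* how often the simulated rq->lock was taken */
};

/*
 * The highest task on the source rq is the one running there, so the
 * best pull candidate is its next-highest task. Taking the lock only
 * pays off if that task would preempt what this rq is running.
 */
static int worth_locking(const struct toy_rt_rq *this_rq,
			 const struct toy_rt_rq *src)
{
	return src->next < this_rq->curr;
}

/* Visit every source rq, locking only those with a useful candidate. */
static int count_pull_candidates(const struct toy_rt_rq *this_rq,
				 struct toy_rt_rq *srcs, int nr)
{
	int i, candidates = 0;

	assert(nr >= 0);
	for (i = 0; i < nr; i++) {
		if (!worth_locking(this_rq, &srcs[i]))
			continue;	/* skipped without touching the lock */
		srcs[i].lock_taken++;
		candidates++;
	}
	return candidates;
}
```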

Signed-off-by: Gregory Haskins <ghaskins@novell.com>