1. 22 October 2008 (1 commit)
  2. 04 October 2008 (1 commit)
    • sched_rt.c: resch needed in rt_rq_enqueue() for the root rt_rq · f6121f4f
      Committed by Dario Faggioli
      While working on the new version of the code for SCHED_SPORADIC I
      noticed something strange in the present throttling mechanism. More
      specifically in the throttling timer handler in sched_rt.c
      (do_sched_rt_period_timer()) and in rt_rq_enqueue().
      
      The problem is that, when unthrottling a runqueue, rt_rq_enqueue() only
      asks for rescheduling if the runqueue has a sched_entity associated to
      it (i.e., rt_rq->rt_se != NULL).
      Now, if the runqueue is the root rq (which has a rt_se = NULL)
      rescheduling does not take place, and it is delayed to some undefined
      instant in the future.
      
      This implies unpredictable bandwidth usage by the RT tasks under throttling.
      For instance, setting rt_runtime_us/rt_period_us = 950ms/1000ms, an RT
      task will get less than 95%. In our tests we saw utilisation varying
      between 70% and 95%.
      Using smaller time values, e.g. 95ms/100ms, things are even worse, and
      I can see values going down to 20-25%!!
      
      The tests we performed are simply running 'yes' as a SCHED_FIFO task,
      and checking the CPU usage with top, but we can investigate thoroughly
      if you think it is needed.
      
      Things go much better, for us, with the attached patch... Don't know if
      it is the best approach, but it solved the issue for us.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
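      A minimal user-space sketch of the idea above (simplified stand-in types
      and stub helpers; not the in-tree patch): on unthrottle, the resched
      request must not depend on rt_rq->rt_se being non-NULL, so the root
      rt_rq is covered too.

         #include <stdio.h>

         /* Simplified stand-ins for the kernel structures involved. */
         struct sched_rt_entity { int on_rq; };
         struct rt_rq {
             struct sched_rt_entity *rt_se;   /* NULL for the root rt_rq */
             int rt_nr_running;
         };

         static void resched_curr_stub(void) { printf("resched requested\n"); }
         static void enqueue_rt_entity_stub(struct sched_rt_entity *se) { se->on_rq = 1; }

         /* On unthrottle, request a reschedule unconditionally instead of
          * only when rt_se != NULL, so the root rt_rq is handled as well. */
         static void sched_rt_rq_enqueue(struct rt_rq *rt_rq)
         {
             struct sched_rt_entity *rt_se = rt_rq->rt_se;

             if (rt_rq->rt_nr_running) {
                 if (rt_se && !rt_se->on_rq)
                     enqueue_rt_entity_stub(rt_se);
                 resched_curr_stub();   /* previously skipped when rt_se == NULL */
             }
         }

         int main(void)
         {
             struct rt_rq root = { NULL, 1 };   /* root rq: rt_se == NULL */
             sched_rt_rq_enqueue(&root);        /* a reschedule is now requested */
             return 0;
         }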
  3. 23 September 2008 (1 commit)
  4. 22 September 2008 (1 commit)
  5. 11 September 2008 (1 commit)
    • sched: fix 2.6.27-rc5 couldn't boot on tulsa machine randomly · baf25731
      Committed by Zhang, Yanmin
      On my tulsa x86-64 machine, kernel 2.6.27-rc5 randomly failed to boot.
      
      Basically, __enable_runtime() forgets to reset rt_rq->rt_throttled
      to 0. While the cpus are being brought up, each per-cpu migration_thread
      is created and runs very fast, sometimes marking the corresponding
      rt_rq->rt_throttled as 1 very quickly. After all cpus are up, the calling
      chain below runs:

         sched_init_smp => arch_init_sched_domains => build_sched_domains => ...
      => cpu_attach_domain => rq_attach_root => set_rq_online => ...
      => __enable_runtime

      __enable_runtime() is called against every rt_rq again, so rt_rq->rt_time is
      reset to 0, but rt_rq->rt_throttled might still be 1. Later on,
      do_sched_rt_period_timer() cannot reset it, and no RT task can be scheduled
      to run on that cpu. The RT task in question here is migration_thread, which
      is woken up when a task is migrated to another cpu.
      
      Below patch fixes it against 2.6.27-rc5.
      Signed-off-by: Zhang Yanmin <yanmin_zhang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
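      A hedged sketch of the fix described above, with simplified stand-in types
      (not the in-tree code): when runtime is re-enabled for an rt_rq, clear
      rt_throttled along with rt_time, so a stale throttled flag cannot survive
      the reset.

         #include <stdio.h>

         /* Simplified model of the per-rq RT bandwidth state. */
         struct rt_rq {
             unsigned long long rt_time;     /* runtime consumed this period */
             unsigned long long rt_runtime;  /* allowed runtime per period */
             int rt_throttled;
         };

         /* Sketch of __enable_runtime(): restore the runtime budget, clear the
          * consumed time, and also clear the throttled flag that the original
          * code forgot about. */
         static void enable_runtime(struct rt_rq *rt_rq, unsigned long long def_runtime)
         {
             rt_rq->rt_runtime   = def_runtime;
             rt_rq->rt_time      = 0;
             rt_rq->rt_throttled = 0;   /* the missing reset from the bug report */
         }

         int main(void)
         {
             struct rt_rq rq = { 123, 0, 1 };   /* throttled by migration_thread */
             enable_runtime(&rq, 950000);
             printf("throttled=%d\n", rq.rt_throttled);   /* prints 0 */
             return 0;
         }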
  6. 28 August 2008 (2 commits)
    • sched: rt-bandwidth accounting fix · cc2991cf
      Committed by Peter Zijlstra
      This fixes an accounting bug where we would continue accumulating runtime
      even though bandwidth control was disabled, which could lead to very long
      throttle periods once bandwidth control gets turned on again.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
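      One way to read the fix (a simplified sketch with assumed names, not the
      in-tree code): the per-tick RT accounting bails out while bandwidth control
      is disabled, instead of letting rt_time grow without bound.

         #include <stdio.h>

         #define RUNTIME_INF (~0ULL)   /* "no bandwidth control" */

         struct rt_rq {
             unsigned long long rt_time;
             unsigned long long rt_runtime;
             int rt_throttled;
         };

         /* Skip accumulation entirely while bandwidth control is disabled, so
          * no huge rt_time balance is left behind when it is re-enabled. */
         static void account_rt_time(struct rt_rq *rt_rq, unsigned long long delta_exec)
         {
             if (rt_rq->rt_runtime == RUNTIME_INF)
                 return;

             rt_rq->rt_time += delta_exec;
             if (rt_rq->rt_time > rt_rq->rt_runtime)
                 rt_rq->rt_throttled = 1;
         }

         int main(void)
         {
             struct rt_rq rq = { 0, RUNTIME_INF, 0 };
             account_rt_time(&rq, 1000000);
             printf("rt_time=%llu throttled=%d\n", rq.rt_time, rq.rt_throttled);
             return 0;
         }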
    • sched: fix sched_rt_rq_enqueue() resched idle · f3ade837
      Committed by John Blackwood
      When sysctl_sched_rt_runtime is set to something other than -1 and the
      CONFIG_RT_GROUP_SCHED kernel option is NOT enabled, we get into a state
      where one or more CPUs idle forever even though there are real-time
      tasks in their rt runqueue that are able to run (no longer throttled).
      
      The sequence is:
      
      - A real-time task is running when the timer sets the rt runqueue
          to throttled, and the rt task is resched_task()ed and switched
          out, and idle is switched in since there are no non-rt tasks to
          run on that cpu.
      
      - Eventually the do_sched_rt_period_timer() runs and un-throttles
          the rt runqueue, but we just exit the timer interrupt and go back
          to executing the idle task in the idle loop forever.
      
      If we change the sched_rt_rq_enqueue() routine to use some of the code
      from the CONFIG_RT_GROUP_SCHED-enabled version of this same routine, and
      resched_task() the currently executing task (idle in our case) if it is
      a lower priority task than the highest rt task in the now un-throttled
      runqueue, the problem is no longer observed.
      Signed-off-by: John Blackwood <john.blackwood@ccur.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
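      A small sketch of the decision described above (simplified stand-in types;
      priority values follow the kernel convention that lower numbers are more
      important): on un-throttle, preempt the running task if a higher-priority
      RT task is now runnable -- which covers the idle task.

         #include <stdio.h>

         #define MAX_PRIO    140   /* idle sits at the very bottom */
         #define MAX_RT_PRIO 100   /* RT priorities occupy 0..99 */

         struct task { const char *name; int prio; };
         struct rq {
             struct task *curr;
             int highest_rt_prio;   /* best queued RT prio, MAX_RT_PRIO if none */
         };

         static void resched_task_stub(struct task *t) { printf("resched %s\n", t->name); }

         /* Un-throttle path for the !CONFIG_RT_GROUP_SCHED case: if the running
          * task (idle here) is lower priority than the best queued RT task,
          * force a reschedule instead of returning to the idle loop. */
         static void rt_rq_unthrottle(struct rq *rq)
         {
             if (rq->highest_rt_prio < rq->curr->prio)
                 resched_task_stub(rq->curr);
         }

         int main(void)
         {
             struct task idle = { "idle", MAX_PRIO };
             struct rq rq = { &idle, 50 };   /* an RT task at prio 50 is runnable */
             rt_rq_unthrottle(&rq);          /* idle gets preempted */
             return 0;
         }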
  7. 19 August 2008 (2 commits)
  8. 14 August 2008 (1 commit)
  9. 11 August 2008 (1 commit)
    • lockdep: re-annotate scheduler runqueues · 1b12bbc7
      Committed by Peter Zijlstra
      Instead of using a per-rq lock class, use the regular nesting operations.
      
      However, take extra care with double_lock_balance() as it can release the
      already held rq->lock (and therefore change its nesting class).
      
      So what can happen is:
      
       spin_lock(rq->lock);	// this rq subclass 0
      
       double_lock_balance(rq, other_rq);
         // release rq
         // acquire other_rq->lock subclass 0
         // acquire rq->lock subclass 1
      
       spin_unlock(other_rq->lock);
      
      leaving you with rq->lock in subclass 1
      
      So a subsequent double_lock_balance() call can try to nest a subclass 1
      lock while already holding a subclass 1 lock.
      
      Fix this by introducing double_unlock_balance(), which releases the other
      rq's lock but also resets the subclass of this rq's lock to 0.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
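      A user-space model of the locking pattern above (pthread mutexes standing
      in for rq->lock; the lockdep subclass handling itself is kernel-only and is
      only noted in a comment):

         #include <pthread.h>
         #include <stdio.h>

         struct rq { pthread_mutex_t lock; };

         /* Take both runqueue locks in a fixed (address) order. If we hold the
          * "wrong" one, drop it and re-take both, which is exactly the point
          * where the lock's nesting level can change. */
         static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
         {
             if (this_rq < busiest) {
                 pthread_mutex_lock(&busiest->lock);
             } else {
                 pthread_mutex_unlock(&this_rq->lock);
                 pthread_mutex_lock(&busiest->lock);
                 pthread_mutex_lock(&this_rq->lock);   /* now nested inside busiest */
             }
         }

         /* Pairing helper: drop the other rq's lock; in the kernel this is also
          * where lock_set_subclass() re-annotates this_rq->lock as subclass 0. */
         static void double_unlock_balance(struct rq *this_rq, struct rq *busiest)
         {
             (void)this_rq;
             pthread_mutex_unlock(&busiest->lock);
         }

         int main(void)
         {
             struct rq a = { PTHREAD_MUTEX_INITIALIZER };
             struct rq b = { PTHREAD_MUTEX_INITIALIZER };

             pthread_mutex_lock(&b.lock);    /* this_rq = b */
             double_lock_balance(&b, &a);
             double_unlock_balance(&b, &a);
             pthread_mutex_unlock(&b.lock);
             printf("balanced\n");
             return 0;
         }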
  10. 24 July 2008 (1 commit)
  11. 18 July 2008 (3 commits)
  12. 27 June 2008 (3 commits)
  13. 20 June 2008 (5 commits)
  14. 19 June 2008 (2 commits)
  15. 18 June 2008 (1 commit)
    • sched: rework of "prioritize non-migratable tasks over migratable ones" · 20b6331b
      Committed by Dmitry Adamushko
      regarding this commit: 45c01e82
      
      I think we can do it simpler. Please take a look at the patch below.
      
      Instead of having 2 separate arrays (which adds ~800 bytes on x86_32 and
      twice that on x86_64), let's add "exclusive" tasks (the ones that are bound
      to this CPU) to the head of the queue and "shared" ones to the end.
      
      If a few "exclusive" tasks are newly woken up, they are 'stacked'
      (not queued as before), meaning that task {i+1} is placed in front
      of the previously woken up task {i}. But I don't think that this
      behavior can cause any realistic problems.
      
      There are a couple of changes on top of this one.
      
      (1) in check_preempt_curr_rt()
      
      I don't think there is a need for the "pick_next_rt_entity(rq, &rq->rt)
      != &rq->curr->rt" check.
      
      enqueue_task_rt(p) and check_preempt_curr_rt() are always called one
      after another with rq->lock held, so the following check
      "p->rt.nr_cpus_allowed == 1 && rq->curr->rt.nr_cpus_allowed != 1" should
      be enough (well, just its left part) to guarantee that 'p' has been
      queued in front of 'curr'.
      
      (2) in set_cpus_allowed_rt()
      
      I don't think there is a need for requeue_task_rt() here.
      
      Perhaps the only case when 'requeue' (+ reschedule) might be useful is
      as follows:

      i) weight == 1 && cpu_isset(task_cpu(p), *new_mask)

      (i.e. a task is being bound to this CPU);

      ii) 'p' != rq->curr

      But here 'p' has already been on this CPU for a while and was not
      migrated, i.e. it's possible that 'rq->curr' would not have a high chance
      of being migrated right at this particular moment (although it might in
      the somewhat longer term), should we allow it to be preempted.
      
      Anyway, I think we should not make this more complex by trying to
      address some rare corner cases; that is also why a single-queue
      approach is preferable. Unless I'm missing something obvious, this
      approach gives us similar functionality at lower cost.
      
      Verified only compilation-wise.
      
      (Almost)-Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
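      A compact sketch of the single-queue policy described above (a minimal
      hand-rolled list stands in for the kernel's list_head; not the in-tree
      code): CPU-bound tasks go to the head of the per-priority list, migratable
      ones to the tail.

         #include <stdio.h>
         #include <stddef.h>

         /* Minimal doubly linked list, standing in for the kernel's list_head. */
         struct list_head { struct list_head *prev, *next; };

         static void list_init(struct list_head *h) { h->prev = h->next = h; }
         static void list_add(struct list_head *n, struct list_head *h)        /* at head */
         { n->next = h->next; n->prev = h; h->next->prev = n; h->next = n; }
         static void list_add_tail(struct list_head *n, struct list_head *h)   /* at tail */
         { n->prev = h->prev; n->next = h; h->prev->next = n; h->prev = n; }

         struct task { const char *name; int nr_cpus_allowed; struct list_head run_list; };

         /* "Exclusive" (CPU-bound) tasks are stacked at the head, migratable
          * ones are queued at the tail of the same per-priority list. */
         static void enqueue_rt(struct list_head *queue, struct task *t)
         {
             if (t->nr_cpus_allowed == 1)
                 list_add(&t->run_list, queue);        /* head: runs before migratable tasks */
             else
                 list_add_tail(&t->run_list, queue);   /* tail: normal FIFO order */
         }

         int main(void)
         {
             struct list_head queue;
             struct task a = { "migratable", 4 }, b = { "bound", 1 };

             list_init(&queue);
             enqueue_rt(&queue, &a);
             enqueue_rt(&queue, &b);
             /* the bound task is now first in line */
             struct task *first = (struct task *)((char *)queue.next - offsetof(struct task, run_list));
             printf("next to run: %s\n", first->name);
             return 0;
         }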
  16. 10 June 2008 (1 commit)
    • sched: fix hotplug cpus on ia64 · 7def2be1
      Committed by Peter Zijlstra
      Cliff Wickman wrote:
      
      > I built an ia64 kernel from Andrew's tree (2.6.26-rc2-mm1)
      > and get a very predictable hotplug cpu problem.
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      > disabled cpu 17
      > enabled cpu 17
      > billberry1:/tmp/cpw # ./dis
      >
      > The script that disables the cpu always hangs (unkillable)
      > on the 3rd attempt.
      >
      > And a bit further:
      > The kstopmachine thread always sits on the run queue (real time) for about
      > 30 minutes before running.
      
      This fix solves some (but not all) issues between CPU hotplug and
      RT bandwidth throttling.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 06 June 2008 (4 commits)
    • sched: fix cpuprio build bug · 1100ac91
      Committed by Ingo Molnar
      The previous patch did not build on !SMP:
      
       kernel/sched_rt.c: In function 'inc_rt_tasks':
       kernel/sched_rt.c:404: error: 'struct rq' has no member named 'online'
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix cpupri hotplug support · 1f11eb6a
      Committed by Gregory Haskins
      The RT folks over at RedHat found an issue w.r.t. hotplug support which
      was traced to problems with the cpupri infrastructure in the scheduler:
      
      https://bugzilla.redhat.com/show_bug.cgi?id=449676
      
      This bug affects 23-rt12+, 24-rtX, 25-rtX, and sched-devel.  This patch
      applies to 25.4-rt4, though it should trivially apply to most cpupri enabled
      kernels mentioned above.
      
      It turned out that the issue was that offline cpus could get inadvertently
      registered with cpupri so that they were erroneously selected during
      migration decisions.  The end result would be an OOPS as the offline cpu
      had tasks routed to it.
      
      This patch generalizes the old join/leave domain interface into an
      online/offline interface, and adjusts the root-domain/hotplug code to
      utilize it.
      
      I was able to easily reproduce the issue prior to this patch, and am no
      longer able to reproduce it after this patch.  I can offline cpus
      indefinitely and everything seems to be in working order.
      
      Thanks to Arnaldo (acme), Thomas, and Peter for doing the legwork to point
      me in the right direction.  Also thank you to Peter for reviewing the
      early iterations of this patch.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
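      A rough sketch of the generalized interface (assumed, simplified shapes for
      sched_class and rq; the real hooks live inside the scheduler classes): each
      class gets rq_online()/rq_offline() callbacks, and the hotplug/root-domain
      paths invoke them, which lets cpupri drop an offlined CPU before it can be
      picked as a migration target.

         #include <stdio.h>

         struct rq;
         struct sched_class {
             const char *name;
             void (*rq_online)(struct rq *rq);
             void (*rq_offline)(struct rq *rq);
             const struct sched_class *next;
         };

         struct rq { int cpu; int online; const struct sched_class *classes; };

         static void set_rq_online(struct rq *rq)
         {
             if (!rq->online) {
                 for (const struct sched_class *c = rq->classes; c; c = c->next)
                     if (c->rq_online)
                         c->rq_online(rq);
                 rq->online = 1;
             }
         }

         static void set_rq_offline(struct rq *rq)
         {
             if (rq->online) {
                 for (const struct sched_class *c = rq->classes; c; c = c->next)
                     if (c->rq_offline)
                         c->rq_offline(rq);
                 rq->online = 0;
             }
         }

         static void rt_rq_online(struct rq *rq)  { printf("cpupri: add cpu %d\n", rq->cpu); }
         static void rt_rq_offline(struct rq *rq) { printf("cpupri: drop cpu %d\n", rq->cpu); }

         static const struct sched_class rt_class = { "rt", rt_rq_online, rt_rq_offline, NULL };

         int main(void)
         {
             struct rq rq = { 17, 0, &rt_class };
             set_rq_online(&rq);
             set_rq_offline(&rq);   /* the offline CPU leaves cpupri's view */
             return 0;
         }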
    • sched: use a 2-d bitmap for searching lowest-pri CPU · 6e0534f2
      Committed by Gregory Haskins
      The current code uses a linear algorithm, which causes scaling issues
      on larger SMP machines.  This patch replaces that algorithm with a
      2-dimensional bitmap to reduce latencies in the wake-up path.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
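      A simplified model of the 2-d structure (plain uint32_t masks instead of
      cpumask_t, and without the lockless count/bitmap details of the real
      cpupri): one CPU mask per priority level, so finding a CPU for a waking RT
      task scans priority levels rather than every CPU.

         #include <stdio.h>
         #include <stdint.h>

         #define NR_PRIO_LEVELS 102   /* idle, normal, and RT0..RT99 */
         #define NR_CPUS 8

         struct cpupri {
             uint32_t cpu_mask[NR_PRIO_LEVELS];  /* which CPUs sit at each level */
             int cpu_level[NR_CPUS];             /* current level of each CPU */
         };

         /* Move a CPU from its old priority level to a new one. */
         static void cpupri_set(struct cpupri *cp, int cpu, int newlevel)
         {
             int old = cp->cpu_level[cpu];
             cp->cpu_mask[old]      &= ~(1u << cpu);
             cp->cpu_mask[newlevel] |=  (1u << cpu);
             cp->cpu_level[cpu] = newlevel;
         }

         /* Return a mask of CPUs running below task_level, or 0 if none. */
         static uint32_t cpupri_find(const struct cpupri *cp, int task_level)
         {
             for (int level = 0; level < task_level; level++)
                 if (cp->cpu_mask[level])
                     return cp->cpu_mask[level];
             return 0;
         }

         int main(void)
         {
             struct cpupri cp = { { 0 }, { 0 } };
             for (int cpu = 0; cpu < NR_CPUS; cpu++)
                 cpupri_set(&cp, cpu, 0);        /* everyone idle (level 0) */
             cpupri_set(&cp, 3, 60);             /* cpu 3 now runs an RT task */
             /* candidates are the idle CPUs, i.e. everyone except cpu 3 */
             printf("candidates: 0x%x\n", (unsigned)cpupri_find(&cp, 60));
             return 0;
         }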
    • sched: prioritize non-migratable tasks over migratable ones · 45c01e82
      Committed by Gregory Haskins
      Dmitry Adamushko pointed out a known flaw in the rt-balancing algorithm
      that could allow suboptimal balancing if a non-migratable task gets
      queued behind a running migratable one.  It is discussed in this thread:
      
      http://lkml.org/lkml/2008/4/22/296
      
      This issue has been further exacerbated by a recent checkin to
      sched-devel (git-id 5eee63a5ebc19a870ac40055c0be49457f3a89a3).
      
      From a pure priority standpoint, the run-queue is doing the "right"
      thing. Using Dmitry's nomenclature, if T0 is on cpu1 first, and T1
      wakes up at equal or lower priority (affined only to cpu1) later, it
      *should* wait for T0 to finish.  However, in reality that is likely
      suboptimal from a system perspective if there are other cores that
      could allow T0 and T1 to run concurrently.  Since T1 cannot migrate,
      the only choice for higher concurrency is to try to move T0.  This is
      not something we addressed in the recent rt-balancing re-work.
      
      This patch tries to enhance the balancing algorithm by accommodating this
      scenario.  It accomplishes this by incorporating the migratability of a
      task into its priority calculation.  Within a numerical tsk->prio, a
      non-migratable task is logically higher than a migratable one.  We
      maintain this by introducing a new per-priority queue (xqueue, or
      exclusive-queue) for holding non-migratable tasks.  The scheduler will
      draw from the xqueue over the standard shared-queue (squeue) when
      available.
      
      There are several details for utilizing this properly.
      
      1) During task-wake-up, we not only need to check if the priority
         preempts the current task, but we also need to check for this
         non-migratable condition.  Therefore, if a non-migratable task wakes
         up and sees an equal priority migratable task already running, it
         will attempt to preempt it *if* there is a likelihood that the
         current task will find an immediate home.
      
      2) Tasks only get this non-migratable "priority boost" on wake-up.  Any
         requeuing will result in the non-migratable task being queued to the
         end of the shared queue.  This is an attempt to prevent the system
         from being completely unfair to migratable tasks during things like
         SCHED_RR timeslicing.
      
      I am sure this patch introduces potentially "odd" behavior if you
      concoct a scenario where a bunch of non-migratable threads could starve
      migratable ones given the right pattern.  I am not yet convinced that
      this is a problem, since we are talking about tasks of equal RT priority
      anyway, and there is never much in the way of guarantees against
      starvation under that scenario anyway (e.g. you could come up with a
      similar scenario with a specific timing environment versus an affinity
      environment).  I can be convinced otherwise, but for now I think this is
      "ok".
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      CC: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      CC: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
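      A tiny sketch of the pick order this implies (simplified stand-ins for the
      two queues; not the in-tree code): at a given priority, the scheduler draws
      from the exclusive queue before the shared one.

         #include <stdio.h>

         struct task { const char *name; struct task *next; };

         /* Per priority level: an exclusive queue (xqueue) for CPU-bound tasks
          * and a shared queue (squeue) for migratable ones. */
         struct prio_array { struct task *xqueue; struct task *squeue; };

         static struct task *pick_next(struct prio_array *pa)
         {
             return pa->xqueue ? pa->xqueue : pa->squeue;  /* non-migratable wins ties */
         }

         int main(void)
         {
             struct task shared = { "migratable", NULL };
             struct task excl   = { "bound-to-cpu", NULL };
             struct prio_array pa = { &excl, &shared };
             printf("next: %s\n", pick_next(&pa)->name);
             return 0;
         }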
  18. 29 May 2008 (1 commit)
  19. 24 May 2008 (1 commit)
  20. 06 May 2008 (2 commits)
  21. 20 April 2008 (5 commits)
    • sched: rt-group: optimize dequeue_rt_stack · 58d6c2d7
      Committed by Peter Zijlstra
      Now that the group hierarchy can have an arbitrary depth, the O(n^2) nature
      of RT task dequeues will really hurt. Optimize this by providing space to
      store the tree path, so we can walk it the other way.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
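      A sketch of the optimization with simplified stand-in types (not the
      in-tree code): record the path while walking up the hierarchy, then walk
      it back down once, instead of re-walking from the top for every level.

         #include <stdio.h>

         struct rt_entity {
             const char *name;
             int on_rq;
             struct rt_entity *parent;   /* up the group hierarchy */
             struct rt_entity *back;     /* filled in on the way up */
         };

         static void dequeue_rt_stack(struct rt_entity *rt_se)
         {
             struct rt_entity *back = NULL;

             /* walk up, remembering the path */
             for (; rt_se; rt_se = rt_se->parent) {
                 rt_se->back = back;
                 back = rt_se;
             }
             /* walk back down, dequeueing top-down in a single pass */
             for (rt_se = back; rt_se; rt_se = rt_se->back) {
                 if (rt_se->on_rq) {
                     rt_se->on_rq = 0;
                     printf("dequeued %s\n", rt_se->name);
                 }
             }
         }

         int main(void)
         {
             struct rt_entity root  = { "root",  1, NULL,   NULL };
             struct rt_entity group = { "group", 1, &root,  NULL };
             struct rt_entity task  = { "task",  1, &group, NULL };
             dequeue_rt_stack(&task);   /* prints root, group, task in that order */
             return 0;
         }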
    • sched: fair-group: SMP-nice for group scheduling · 18d95a28
      Committed by Peter Zijlstra
      Implement SMP nice support for the full group hierarchy.
      
      On each load-balance action, compile a sched_domain wide view of the full
      task_group tree. We compute the domain wide view when walking down the
      hierarchy, and readjust the weights when walking back up.
      
      After collecting and readjusting the domain wide view, we try to balance the
      tasks within the task_groups. The current approach is to naively balance each
      task group until we've moved the targeted amount of load.
      
      Inspired by Srivatsa Vaddagiri's previous code and Abhishek Chandra's H-SMP
      paper.
      
      XXX: there will be some numerical issues due to the limited nature of
           SCHED_LOAD_SCALE w.r.t. representing a task_group's influence on the
           total weight. When the tree is deep enough, or the task weight small
           enough, we'll run out of bits.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Abhishek Chandra <chandra@cs.umn.edu>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: mix tasks and groups · 354d60c2
      Committed by Dhaval Giani
      This patch allows tasks and groups to exist in the same cfs_rq. With this
      change, CFS group scheduling moves from a 1/(1+N) fairness model to a
      1/(M+N) model, where M tasks and N groups exist at the same cfs_rq level.
      
      [a.p.zijlstra@chello.nl: rt bits and assorted fixes]
      Signed-off-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
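      A rough worked example of the model change (this reading of the formula is
      an assumption, not taken from the patch itself): consider a cfs_rq with
      M = 3 tasks and N = 1 group, all at default weight. Under the old 1/(1+N)
      model the three tasks effectively competed as a single entity against the
      group, so the group received 1/2 of the cfs_rq's share and each task
      (1/2)/3 = 1/6. Under the new 1/(M+N) model each of the four schedulable
      entities receives 1/(3+1) = 1/4.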
    • sched: add new set_cpus_allowed_ptr function · cd8ba7cd
      Committed by Mike Travis
      Add a new function that accepts a pointer to the "newly allowed cpus"
      cpumask argument.
      
      int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
      
      The current set_cpus_allowed() function is modified to use the above,
      but this does not result in an ABI change.  And with some compiler
      optimization help, it may not introduce any additional overhead.

      Additionally, to enforce the read-only nature of the new_mask arg, the
      "const" property is migrated to the sub-functions called by set_cpus_allowed().
      This silences compiler warnings.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt-group: smp balancing · ac086bc2
      Committed by Peter Zijlstra
      Currently the rt group scheduling does a per-cpu runtime limit; however,
      the rt load balancer makes no guarantees about an equal spread of real-
      time tasks, just that at any one time the highest priority tasks run.

      Solve this by making the runtime limit a global property: once the local
      limit runs out, borrow excess runtime from the other cpus.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
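      A simplified, self-contained sketch of the borrowing idea (an array of
      per-cpu budgets instead of the real per-rt_rq locking and iterators; the
      split heuristic is an assumption): when the local budget is exhausted,
      pull spare budget from other cpus, never exceeding the period.

         #include <stdio.h>

         #define NR_CPUS 4

         /* Per-CPU RT runtime accounts for one group. */
         struct rt_rq {
             long long rt_runtime;   /* budget for this period */
             long long rt_time;      /* budget already consumed */
         };

         /* When this cpu's budget is used up, borrow unused budget from the
          * other cpus, so the limit behaves like a global property rather
          * than a strict per-cpu one. */
         static void balance_runtime(struct rt_rq rqs[], int this_cpu, long long period)
         {
             struct rt_rq *self = &rqs[this_cpu];

             for (int cpu = 0; cpu < NR_CPUS && self->rt_time >= self->rt_runtime; cpu++) {
                 if (cpu == this_cpu)
                     continue;
                 long long spare = rqs[cpu].rt_runtime - rqs[cpu].rt_time;
                 if (spare <= 0)
                     continue;
                 /* take at most half of the spare budget, and never let the
                  * local budget exceed the period */
                 long long want = period - self->rt_runtime;
                 long long diff = spare / 2;
                 if (diff > want)
                     diff = want;
                 rqs[cpu].rt_runtime -= diff;
                 self->rt_runtime    += diff;
             }
         }

         int main(void)
         {
             struct rt_rq rqs[NR_CPUS] = {
                 { 950, 950 }, { 950, 100 }, { 950, 0 }, { 950, 0 },
             };
             balance_runtime(rqs, 0, 1000);
             printf("cpu0 runtime is now %lld\n", rqs[0].rt_runtime);
             return 0;
         }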