1. 20 Apr 2008, 11 commits
  2. 14 Apr 2008, 1 commit
  3. 21 Mar 2008, 1 commit
  4. 19 Mar 2008, 5 commits
    • sched: retune wake granularity · 74e3cd7f
      Ingo Molnar committed
      reduce wake-up granularity for better interactivity.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: improve affine wakeups · 4ae7d5ce
      Ingo Molnar committed
      improve affine wakeups. Maintain the 'overlap' metric based on CFS's
      sum_exec_runtime - which means the amount of time a task executes
      after it wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      it means there's strong workload coupling between this task and the
      woken up task. If the 'overlap' is large then the workload is decoupled
      and the scheduler will move them to separate CPUs more easily.
      
      ( Also slightly move the preempt_check within try_to_wake_up() - this has
        no effect on functionality but allows 'early wakeups' (for still-on-rq
        tasks) to be correctly accounted as well.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
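      A minimal sketch of the 'overlap' idea described above (illustrative
      only, not the kernel code; the struct, threshold and helper names are
      assumptions): prefer an affine wakeup only while the waker's typical
      post-wakeup runtime stays short, i.e. the two tasks are tightly coupled.

        #include <stdint.h>

        struct overlap_stats {
                uint64_t last_wakeup;   /* waker's sum_exec_runtime at its last wakeup */
                uint64_t avg_overlap;   /* smoothed post-wakeup runtime, in ns */
        };

        #define SYNC_OVERLAP_NS 1000000ULL      /* assumed "tightly coupled" bound: 1 ms */

        /* called when the waker wakes another task: keep the wakee on this CPU? */
        static int wake_affine_hint(struct overlap_stats *waker,
                                    uint64_t sum_exec_runtime)
        {
                /* overlap = runtime consumed since the waker last woke someone */
                uint64_t overlap = sum_exec_runtime - waker->last_wakeup;

                /* crude exponential average, purely illustrative */
                waker->avg_overlap = (3 * waker->avg_overlap + overlap) / 4;
                waker->last_wakeup = sum_exec_runtime;

                /* short overlap => strong coupling => keep the pair on one CPU */
                return waker->avg_overlap < SYNC_OVERLAP_NS;
        }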
    • sched: clean up wakeup balancing, code flow · f4827386
      Ingo Molnar committed
      Clean up the code flow. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, rename variables · ac192d39
      Ingo Molnar committed
      rename 'cpu' to 'prev_cpu'. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, move wake_affine() · 098fb9db
      Ingo Molnar committed
      split out the affine-wakeup bits.
      
      No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         9d76738f1272aa82f0b7affd2f51df6b  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      
      (the md5's changed because stack slots changed and some registers
      get scheduled by gcc in a different order - but otherwise the before
      and after assembly is instruction for instruction equivalent.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 15 Mar 2008, 4 commits
    • sched: simplify sched_slice() · 6a6029b8
      Ingo Molnar committed
      Use the existing calc_delta_mine() calculation for sched_slice(). This
      saves a divide and simplifies the code because we share it with the
      other /cfs_rq->load users.
      
      It also improves code size:
      
            text    data     bss     dec     hex filename
           42659    2740     144   45543    b1e7 sched.o.before
           42093    2740     144   44977    afb1 sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
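      Illustrative arithmetic for the shared calculation (a sketch under
      assumed names, not the kernel implementation): a task's slice is its
      weight's share of the scheduling period, the same
      delta * weight / total_weight shape that calc_delta_mine() computes.

        #include <stdint.h>

        /* slice = period * task_weight / total_runqueue_weight */
        static uint64_t sched_slice_sketch(uint64_t period_ns,
                                           unsigned long task_weight,
                                           unsigned long rq_weight)
        {
                return period_ns * task_weight / rq_weight;
        }

        /* e.g. two nice-0 tasks (weight 1024 each) over a 20 ms period:
         * sched_slice_sketch(20000000, 1024, 2048) == 10000000 ns each */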
    • sched: fix fair sleepers · e22ecef1
      Ingo Molnar committed
      Fair sleepers need to scale their latency target down by runqueue
      weight; otherwise busy systems will gain an ever larger sleep bonus.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
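      A hedged sketch of that scaling (illustrative; the function name is made
      up, 1024 is the CFS nice-0 weight): the sleeper credit shrinks as the
      runqueue weight grows, so a busy CPU hands out a proportionally smaller
      bonus.

        #include <stdint.h>

        /* sleeper credit scaled by nice-0 weight over total runqueue weight */
        static uint64_t sleeper_credit_sketch(uint64_t latency_ns,
                                              unsigned long rq_weight)
        {
                const unsigned long NICE_0_WEIGHT = 1024;
                return latency_ns * NICE_0_WEIGHT / rq_weight;
        }

        /* a near-idle rq (weight 1024) grants the full latency credit;
         * eight nice-0 tasks (weight 8192) grant one eighth of it */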
    • sched: fix overload performance: buddy wakeups · aa2ac252
      Peter Zijlstra committed
      Currently we schedule to the leftmost task in the runqueue. When the
      runtimes are very short because of some server/client ping-pong,
      especially in over-saturated workloads, this will cycle through all
      tasks, thrashing the cache.
      
      Reduce cache thrashing by keeping dependent tasks together, running
      newly woken tasks first. However, by not running the leftmost task first
      we could starve tasks, because the wakee can gain unlimited runtime.
      
      Therefore we only run the wakee if it is within a small
      (wakeup_granularity) window of the leftmost task. This preserves
      fairness, but does alternate server/client task groups.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
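      An illustrative sketch of the pick rule (names assumed, not the kernel
      code): run the freshly woken 'buddy' instead of the leftmost task only
      while their virtual-runtime gap stays inside the wakeup granularity.

        struct ent { long long vruntime; };

        /* prefer the just-woken buddy only while it has not fallen more than
         * 'gran' behind the leftmost task in virtual time */
        static struct ent *pick_next_sketch(struct ent *leftmost,
                                            struct ent *buddy, long long gran)
        {
                if (buddy && buddy->vruntime - leftmost->vruntime < gran)
                        return buddy;   /* keep the cache-hot pair together */
                return leftmost;        /* fairness: fall back to the leftmost */
        }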
    • sched: min_vruntime fix · 3fe69747
      Peter Zijlstra committed
      Current min_vruntime tracking is incorrect and will cause serious
      problems when we don't run the leftmost task for some reason.
      
      min_vruntime does two things: 1) it is used to determine a forward
      direction when the u64 vruntime wraps, 2) it is used to track the
      leftmost vruntime, from which newly enqueued tasks are positioned.
      
      The current logic advances min_vruntime whenever the current task's
      vruntime advances. Because the current task may pass the leftmost task
      still waiting, we fail the second goal. This causes new tasks to be
      placed too far ahead and thus penalizes their runtime.
      
      Fix this by making min_vruntime the min_vruntime of the waiting tasks,
      tracking it in enqueue/dequeue, and comparing against current's vruntime
      to obtain the absolute minimum when placing new tasks.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
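      A minimal sketch of the corrected tracking (illustrative; parameter
      names assumed): take the minimum of the running task and the leftmost
      waiter, and only ever let min_vruntime move forward.

        /* advance min_vruntime monotonically towards
         * min(current task, leftmost waiting task); NULL means "not present" */
        static void update_min_vruntime_sketch(unsigned long long *min_vruntime,
                                               const unsigned long long *curr,
                                               const unsigned long long *leftmost)
        {
                unsigned long long vruntime;

                if (curr && leftmost)
                        vruntime = *curr < *leftmost ? *curr : *leftmost;
                else if (curr)
                        vruntime = *curr;
                else if (leftmost)
                        vruntime = *leftmost;
                else
                        return;

                /* signed delta so the check survives u64 wrap;
                 * never move min_vruntime backwards */
                if ((long long)(vruntime - *min_vruntime) > 0)
                        *min_vruntime = vruntime;
        }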
  6. 07 Mar 2008, 1 commit
  7. 05 Mar 2008, 1 commit
    • sched: revert load_balance_monitor() changes · 62fb1851
      Peter Zijlstra committed
      The following commits cause a number of regressions:
      
        commit 58e2d4ca
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduling, change how cpu load is calculated
      
        commit 6b2d7700
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
      
      Namely:
       - very frequent wakeups on SMP, reported by PowerTop users.
       - cacheline trashing on (large) SMP
       - some latencies larger than 500ms
      
      While there is a mergeable patch to fix the latter, the former issues
      are not fixable in a manner suitable for .25 (we're at -rc3 now).
      
      Hence we revert them and try again in v2.6.26.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 25 Feb 2008, 2 commits
  9. 01 Feb 2008, 2 commits
    • sched: let +nice tasks have smaller impact · ef9884e6
      Peter Zijlstra committed
      Michel Dänzer has bisected an interactivity problem with
      plus-reniced tasks back to this commit:
      
       810e95cc is first bad commit
       commit 810e95cc
       Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
       Date:   Mon Oct 15 17:00:14 2007 +0200
      
       sched: another wakeup_granularity fix
      
            unit mis-match: wakeup_gran was used against a vruntime
      
      Fix this by asymmetrically scaling the vtime of positively reniced
      tasks.
      Bisected-by: Michel Dänzer <michel@tungstengraphics.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
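      One possible reading of the asymmetric scaling, sketched for
      illustration (the direction of the check and the helper are my
      assumptions, not taken from the patch): convert the wakeup granularity
      into virtual-time units only for tasks lighter than nice 0, so +nice
      wakees need a larger vruntime lead before they preempt.

        /* scale the granularity up for +nice (lighter than the 1024 nice-0
         * weight) entities; leave the other side alone, hence "asymmetric" */
        static unsigned long long wakeup_gran_sketch(unsigned long long gran_ns,
                                                     unsigned long se_weight)
        {
                const unsigned long NICE_0_WEIGHT = 1024;

                if (se_weight < NICE_0_WEIGHT)
                        gran_ns = gran_ns * NICE_0_WEIGHT / se_weight;
                return gran_ns;
        }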
    • sched: fix high wake up latencies with FAIR_USER_SCHED · 296825cb
      Srivatsa Vaddagiri committed
      We get better wakeup latencies for !FAIR_USER_SCHED because of this
      snippet of code in place_entity():
      
      	if (!initial) {
      		/* sleeps upto a single latency don't count. */
      		if (sched_feat(NEW_FAIR_SLEEPERS) && entity_is_task(se))
      						     ^^^^^^^^^^^^^^^^^^
      			vruntime -= sysctl_sched_latency;
      
      		/* ensure we never gain time by being placed backwards. */
      		vruntime = max_vruntime(se->vruntime, vruntime);
      	}
      
      The NEW_FAIR_SLEEPERS feature gives credit for sleeping only to tasks,
      not to group-level entities. With the patch attached, I could see that
      wakeup latencies with FAIR_USER_SCHED are restored to the same level as
      !FAIR_USER_SCHED.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
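      A hedged sketch of the fix implied above (illustrative, mirroring the
      quoted snippet rather than quoting the actual diff): drop the
      entity_is_task() restriction so group-level entities receive the same
      sleeper credit as plain tasks.

      	if (!initial) {
      		/* sleeps upto a single latency don't count. */
      		if (sched_feat(NEW_FAIR_SLEEPERS))	/* task-only check dropped */
      			vruntime -= sysctl_sched_latency;

      		/* ensure we never gain time by being placed backwards. */
      		vruntime = max_vruntime(se->vruntime, vruntime);
      	}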
  10. 26 Jan 2008, 11 commits
    • sched: keep total / count stats in addition to the max for · 6d082592
      Arjan van de Ven committed
      Right now, the linux kernel (with scheduler statistics enabled) keeps track
      of the maximum time a process is waiting to be scheduled. While the maximum
      is a very useful metric, tracking average and total is equally useful
      (at least for latencytop) to figure out the accumulated effect of scheduler
      delays. The accumulated effect is important to judge the performance impact
      of scheduler tuning/behavior.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
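      An illustrative shape for "total / count in addition to the max" as
      per-task wait statistics (field and function names are assumptions):

        #include <stdint.h>

        /* the max alone hides the accumulated effect of scheduler delays,
         * so also keep the sum and the sample count (average = sum / count) */
        struct wait_stats {
                uint64_t wait_max;      /* worst single delay, in ns */
                uint64_t wait_sum;      /* total time spent waiting, in ns */
                uint64_t wait_count;    /* number of samples, for the average */
        };

        static void account_wait(struct wait_stats *ws, uint64_t delta_ns)
        {
                if (delta_ns > ws->wait_max)
                        ws->wait_max = delta_ns;
                ws->wait_sum += delta_ns;
                ws->wait_count++;
        }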
    • sched: fix: don't take a mutex from interrupt context · 5973e5b9
      Peter Zijlstra committed
      print_cfs_stats is callable from interrupt context (sysrq), hence it should
      not take mutexes. Change it to use RCU since the task group data is RCU
      freed anyway.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
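      A hedged sketch of the pattern (illustrative; the stand-in struct, list
      and callee names are assumptions): an RCU read-side section never
      sleeps, so walking an RCU-protected task-group list is safe even from
      sysrq/interrupt context, where taking a mutex is not.

        #include <linux/rculist.h>
        #include <linux/rcupdate.h>
        #include <linux/seq_file.h>

        struct task_group_sketch {              /* stand-in for the real task_group */
                struct list_head list;          /* linked into an RCU-managed list */
        };

        extern struct list_head task_groups_sketch;         /* assumed list head */
        extern void print_one_group(struct seq_file *m, int cpu,
                                    struct task_group_sketch *tg);  /* illustrative */

        static void print_stats_sketch(struct seq_file *m, int cpu)
        {
                struct task_group_sketch *tg;

                rcu_read_lock();                /* never sleeps */
                list_for_each_entry_rcu(tg, &task_groups_sketch, list)
                        print_one_group(m, cpu, tg);
                rcu_read_unlock();
        }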
    • sched: latencytop support · 9745512c
      Arjan van de Ven committed
      LatencyTOP kernel infrastructure: it measures latencies in the
      scheduler and tracks them system-wide and per process.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: high-res preemption tick · 8f4d37ec
      Peter Zijlstra committed
      Use HR-timers (when available) to deliver an accurate preemption tick.
      
      The regular scheduler tick that runs at 1/HZ can be too coarse when nice
      levels are used. The fairness system will still keep cpu utilisation 'fair'
      by later delaying the task that got an excessive amount of CPU time, but it
      tries to minimize this by delivering preemption points spot-on.
      
      The average frequency of this extra interrupt is sched_latency / nr_latency,
      which need not be higher than 1/HZ; it's just that the distribution within
      the sched_latency period is important.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
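      A minimal sketch of arming such a preemption point with the kernel's
      hrtimer API (the timer object, callback and remaining-slice value are
      assumptions for illustration):

        #include <linux/hrtimer.h>
        #include <linux/ktime.h>

        /* fire exactly when the current task's slice runs out, instead of
         * waiting for the next coarse 1/HZ tick */
        static void start_hrtick_sketch(struct hrtimer *timer, u64 slice_left_ns)
        {
                hrtimer_start(timer, ns_to_ktime(slice_left_ns), HRTIMER_MODE_REL);
        }

        /* assumed one-time setup, e.g. per runqueue:
         *   hrtimer_init(&rq_hrtick, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
         *   rq_hrtick.function = hrtick_callback;  // sets TIF_NEED_RESCHED if due
         */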
    • sched: RT-balance, add new methods to sched_class · cb469845
      Steven Rostedt committed
      Dmitry Adamushko found that the current implementation of the RT
      balancing code left out changes to the sched_setscheduler and
      rt_mutex_setprio.
      
      This patch addresses this issue by adding methods to the scheduling
      classes to handle being switched out of (switched_from) and being
      switched into (switched_to) a sched_class. A method for priority changes
      (prio_changed) is also added.
      
      This patch also removes some duplicate logic between rt_mutex_setprio and
      sched_setscheduler.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
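      An illustrative sketch of how such hooks can sit in a scheduling-class
      interface (abridged, with assumed parameter layout rather than the
      exact kernel signatures):

        struct rq;
        struct task_struct;

        struct sched_class_hooks {
                /* task is leaving this class (e.g. RT -> fair) */
                void (*switched_from)(struct rq *rq, struct task_struct *p,
                                      int running);
                /* task has just entered this class */
                void (*switched_to)(struct rq *rq, struct task_struct *p,
                                    int running);
                /* task's priority changed while staying in this class */
                void (*prio_changed)(struct rq *rq, struct task_struct *p,
                                     int oldprio, int running);
        };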
    • sched: remove do_div() from __sched_slice() · 4bf0b771
      Peter Zijlstra committed
      Yanmin Zhang noticed a nice optimization:
      
        p = l * nr / nl, nl = l/g -> p = g * nr
      
      which eliminates a do_div() from __sched_period().
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
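      A worked reading of the identity (my annotation; the commit leaves the
      symbols implicit): with p the period, l the targeted latency, nr the
      number of runnable tasks, g the minimum granularity and nl = l/g, the
      division cancels out, so no do_div() is needed.

        /* p = l * nr / nl  and  nl = l / g  =>  p = g * nr */
        static unsigned long long sched_period_sketch(unsigned long nr_running,
                                                      unsigned long long gran_ns)
        {
                return gran_ns * nr_running;
        }

        /* e.g. l = 20 ms, g = 4 ms, nr = 10:
         * 20ms * 10 / (20ms / 4ms) = 40 ms, which equals 4ms * 10 */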
    • sched: no need for 'affine wakeup' balancing · 9ec3b77e
      Dmitry Adamushko committed
      No need to do a check for 'affine wakeup and passive balancing possibilities'
      in select_task_rq_fair() when task_cpu(p) == this_cpu.
      
      I guess this part got missed upon the introduction of per-sched_class
      select_task_rq() in try_to_wake_up().
      Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: de-SCHED_OTHER-ize the RT path · e7693a36
      Gregory Haskins committed
      The current wake-up code path tries to determine if it can optimize the
      wake-up to "this_cpu" by computing load calculations.  The problem is that
      these calculations are only relevant to SCHED_OTHER tasks where load is king.
      For RT tasks, priority is king.  So the load calculation is completely wasted
      bandwidth.
      
      Therefore, we create a new sched_class interface to help with
      pre-wakeup routing decisions and move the load calculation into the
      CFS class.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
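      An illustrative sketch of per-class wakeup routing (the hook and helper
      names are assumptions, the bodies deliberately simplified): the fair
      class keeps its load-based choice, while the RT class routes purely by
      priority.

        struct task_struct;

        extern int pick_least_loaded_cpu(struct task_struct *p);   /* assumed */
        extern int pick_lowest_prio_cpu(struct task_struct *p);    /* assumed */

        /* pre-wakeup routing hook: each policy picks a target CPU with its
         * own notion of "best" (load for CFS, priority for RT) */
        struct wakeup_routing {
                int (*select_task_rq)(struct task_struct *p, int sync);
        };

        static int select_rq_fair_sketch(struct task_struct *p, int sync)
        {
                return pick_least_loaded_cpu(p);        /* load is king */
        }

        static int select_rq_rt_sketch(struct task_struct *p, int sync)
        {
                return pick_lowest_prio_cpu(p);         /* priority is king */
        }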
    • sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups · 6b2d7700
      Srivatsa Vaddagiri committed
      The current load balancing scheme isn't good enough for precise
      group fairness.
      
      For example: on a 8-cpu system, I created 3 groups as under:
      
      	a = 8 tasks (cpu.shares = 1024)
      	b = 4 tasks (cpu.shares = 1024)
      	c = 3 tasks (cpu.shares = 1024)
      
      a, b and c are task groups that have equal weight. We would expect each
      of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.
      
      This is what I get with the latest scheduler git tree:
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
      b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
      c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
      --------------------------------------------------------------------------------
      
      Explanation of o/p:
      
      Col1 -> Group name
      Col2 -> Cumulative execution time (in seconds) received by all tasks of that
      	group in a 60sec window across 8 cpus
      Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
              percentage. Col3 data is derived as:
      		Col3 = 100 * Col2 / (NR_CPUS * 60)
      Col4 -> CPU bandwidth received by each individual task of the group.
      		Col4 = 100 * cpu_time_recd_by_task / 60
      
      [I can share the test case that produces a similar o/p if reqd]
      
      The deviation from desired group fairness is as below:
      
      	a = +24.47%
      	b = -9.13%
      	c = -15.33%
      
      which is quite high.
      
      After the patch below is applied, here are the results:
      
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
      b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
      c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
      --------------------------------------------------------------------------------
      
      Deviation from desired group fairness is as below:
      
      	a = +0.67%
      	b = -0.83%
      	c = +0.17%
      
      which is far better IMO. Most of the other runs have yielded a deviation
      within +-2% at most, which is good.
      
      Why do we see bad (group) fairness with the current scheduler?
      =========================================================
      
      Currently a cpu's weight is just the summation of individual task weights.
      This can yield incorrect results. For example, consider three groups as
      below on a 2-cpu system:
      
      	CPU0	CPU1
      ---------------------------
      	A (10)  B(5)
      		C(5)
      ---------------------------
      
      Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each, all
      of which are on CPU1. Each task has the same weight (NICE_0_LOAD =
      1024).
      
      The current scheme would yield a cpu weight of 10240 (10*1024) for each
      cpu, so the load balancer would think both CPUs are perfectly balanced
      and would not move any tasks around. This, however, would yield this
      bandwidth:
      
      	A = 50%
      	B = 25%
      	C = 25%
      
      which is not the desired result.
      
      What's changing in the patch?
      =============================
      
      	- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
      	  defined (see below)
      	- API Change
      		- Two tunables introduced in sysfs (under SCHED_DEBUG) to
      		  control the frequency at which the load balance monitor
      		  thread runs.
      
      The basic change made in this patch is how cpu weight (rq->load.weight) is
      calculated. It is now calculated as the summation of group weights on a
      cpu, rather than the summation of task weights. The weight exerted by a
      group on a cpu depends on the shares allocated to it and also on the
      number of tasks the group has on that cpu, compared to the total number
      of (runnable) tasks the group has in the system.
      
      Let,
      	W(K,i)  = Weight of group K on cpu i
      	T(K,i)  = Task load present in group K's cfs_rq on cpu i
      	T(K)    = Total task load of group K across various cpus
      	S(K) 	= Shares allocated to group K
      	NRCPUS	= Number of online cpus in the scheduler domain to
      	 	  which group K is assigned.
      
      Then,
      	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
      
      A load balance monitor thread is created at bootup, which periodically
      runs and adjusts each group's weight on each cpu. To avoid its overhead, two
      min/max tunables are introduced (under SCHED_DEBUG) to control the rate
      at which it runs.
      
      Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>
      
      - don't start the load_balance_monitor when there is only a single cpu.
      - rename the kthread because it's currently longer than TASK_COMM_LEN
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
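      An illustrative translation of the weight formula into code, reusing
      the commit's own 2-cpu example (the function name and the lack of
      fixed-point handling are simplifications; per-task weight 1024 as in
      CFS):

        /* W(K,i) = S(K) * NRCPUS * T(K,i) / T(K) */
        static unsigned long group_cpu_weight(unsigned long shares,      /* S(K)   */
                                              unsigned long nr_cpus,     /* NRCPUS */
                                              unsigned long load_here,   /* T(K,i) */
                                              unsigned long load_total)  /* T(K)   */
        {
                return shares * nr_cpus * load_here / load_total;
        }

        /* A: 10 tasks on CPU0 -> W(A,0) = 1024*2*10240/10240 = 2048, W(A,1) = 0
         * B:  5 tasks on CPU1 -> W(B,1) = 1024*2*5120/5120   = 2048, W(B,0) = 0
         * C:  5 tasks on CPU1 -> W(C,1) = 2048 likewise
         * CPU0 weight = 2048, CPU1 weight = 4096: the balancer now sees the
         * imbalance that per-task weights (10240 vs 10240) completely hid. */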
    • sched: group scheduling, change how cpu load is calculated · 58e2d4ca
      Srivatsa Vaddagiri committed
      This patch changes how the cpu load exerted by fair_sched_class tasks
      is calculated. Load exerted by fair_sched_class tasks on a cpu is now
      a summation of the group weights, rather than summation of task weights.
      Weight exerted by a group on a cpu is dependent on the shares allocated
      to it.
      
      This version of the patch has a minor impact on code size, but should
      have no runtime/functional impact for !CONFIG_FAIR_GROUP_SCHED.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: group scheduling, minor fixes · ec2c507f
      Srivatsa Vaddagiri committed
      Minor bug fixes for the group scheduler:
      
      - Use a mutex to serialize add/remove of task groups and also when
        changing shares of a task group. Use the same mutex when printing
        cfs_rq debugging stats for various task groups.
      
      - Use list_for_each_entry_rcu in for_each_leaf_cfs_rq macro (when
        walking task group list)
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 18 Dec 2007, 1 commit