1. 14 Aug 2011 (5 commits)
  2. 22 Jul 2011 (6 commits)
  3. 21 Jul 2011 (1 commit)
  4. 01 Jul 2011 (1 commit)
  5. 28 May 2011 (1 commit)
  6. 20 May 2011 (1 commit)
  7. 04 May 2011 (1 commit)
  8. 19 Apr 2011 (2 commits)
    • sched: Next buddy hint on sleep and preempt path · 2f36825b
      Venkatesh Pallipadi authored
      When a task in a taskgroup sleeps, pick_next_task starts all the way back at
      the root and picks the task/taskgroup with the min vruntime across all
      runnable tasks.
      
      But when there are many frequently sleeping tasks across different taskgroups,
      it makes more sense to stay with the same taskgroup for its slice period (or
      until all tasks in the taskgroup sleep) instead of switching across taskgroups
      on each sleep after a short runtime.
      
      This helps specifically where each taskgroup corresponds to a process with
      multiple threads. The change reduces the number of CR3 switches in this case.
      
      Example:
      
      Two taskgroups with 2 threads each which are running for 2ms and
      sleeping for 1ms. Looking at sched:sched_switch shows:
      
      BEFORE: taskgroup_1 threads [5004, 5005], taskgroup_2 threads [5016, 5017]
            cpu-soaker-5004  [003]  3683.391089
            cpu-soaker-5016  [003]  3683.393106
            cpu-soaker-5005  [003]  3683.395119
            cpu-soaker-5017  [003]  3683.397130
            cpu-soaker-5004  [003]  3683.399143
            cpu-soaker-5016  [003]  3683.401155
            cpu-soaker-5005  [003]  3683.403168
            cpu-soaker-5017  [003]  3683.405170
      
      AFTER: taskgroup_1 threads [21890, 21891], taskgroup_2 threads [21934, 21935]
            cpu-soaker-21890 [003]   865.895494
            cpu-soaker-21935 [003]   865.897506
            cpu-soaker-21934 [003]   865.899520
            cpu-soaker-21935 [003]   865.901532
            cpu-soaker-21934 [003]   865.903543
            cpu-soaker-21935 [003]   865.905546
            cpu-soaker-21891 [003]   865.907548
            cpu-soaker-21890 [003]   865.909560
            cpu-soaker-21891 [003]   865.911571
            cpu-soaker-21890 [003]   865.913582
            cpu-soaker-21891 [003]   865.915594
            cpu-soaker-21934 [003]   865.917606
      
      A similar problem exists when there are multiple taskgroups and, say, a task A
      preempts the currently running task B of taskgroup_1. On the next schedule,
      pick_next_task can pick an unrelated task from taskgroup_2. Here it would be
      better to give some preference to task B in pick_next_task.
      
      A simple (maybe extreme) benchmark I tried was tbench with 2 tbench client
      processes of 2 threads each, all running on a single CPU. Average throughput
      across five 50-second runs was:
      
       BEFORE: 105.84 MB/sec
       AFTER:  112.42 MB/sec
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302802253-25760-1-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2f36825b
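      A minimal sketch of the sleep-path half of this idea, assuming the
      sched_fair.c helpers of that era (set_next_buddy(), parent_entity(),
      for_each_sched_entity()); illustrative, not the verbatim patch:
      
          /* On a sleep-dequeue, hint pick_next_task to stay within the group. */
          static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
          {
                  struct sched_entity *se = &p->se;
                  int task_sleep = flags & DEQUEUE_SLEEP;
      
                  for_each_sched_entity(se) {
                          struct cfs_rq *cfs_rq = cfs_rq_of(se);
      
                          dequeue_entity(cfs_rq, se, flags);
                          /* Group still has runnable entities: stop here and
                           * bias the next pick toward this group's entity. */
                          if (cfs_rq->load.weight) {
                                  if (task_sleep && parent_entity(se))
                                          set_next_buddy(parent_entity(se));
                                  break;
                          }
                          flags |= DEQUEUE_SLEEP;
                  }
          }
      
      The preempt path places a similar next-buddy hint in the wakeup-preemption
      check (not shown here).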
    • sched: Make set_*_buddy() work on non-task entities · 69c80f3e
      Venkatesh Pallipadi authored
      Make set_*_buddy() work on non-task sched_entities, to facilitate using
      next_buddy to cache a group entity in cases where one of the tasks within
      that entity sleeps or gets preempted.
      
      set_skip_buddy() was incorrectly requiring that the yielding task's policy
      not be SCHED_IDLE. Yielding should happen even when the yielding task is
      SCHED_IDLE, so this change removes the policy check on the yielding task.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1302744070-30079-2-git-send-email-venki@google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      69c80f3e
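      A hedged sketch of the resulting buddy setters (illustrative; it assumes the
      entity_is_task()/task_of() helpers of that era): consult the policy only
      when the entity really is a task, and drop the policy check from the skip
      buddy entirely:
      
          static void set_next_buddy(struct sched_entity *se)
          {
                  /* Group entities have no policy; only a task can be SCHED_IDLE. */
                  if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
                          return;
      
                  for_each_sched_entity(se)
                          cfs_rq_of(se)->next = se;
          }
      
          static void set_skip_buddy(struct sched_entity *se)
          {
                  /* No policy check: a SCHED_IDLE task may yield as well. */
                  for_each_sched_entity(se)
                          cfs_rq_of(se)->skip = se;
          }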
  9. 14 Apr 2011 (3 commits)
  10. 11 Apr 2011 (5 commits)
    • sched: Avoid using sd->level · a6c75f2f
      Peter Zijlstra authored
      Don't use sd->level for identifying properties of the domain.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.350174079@chello.nl
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a6c75f2f
    • sched: Dynamically allocate sched_domain/sched_group data-structures · dce840a0
      Peter Zijlstra authored
      Instead of relying on static allocations for the sched_domain and
      sched_group trees, dynamically allocate and RCU free them.
      
      Allocating this dynamically also allows for some build_sched_groups()
      simplification since we can now (like with other simplifications) rely
      on the sched_domain tree instead of hard-coded knowledge.
      
      One tricky thing to note is that detach_destroy_domains() needs to hold
      rcu_read_lock() over the entire tear-down; doing it per-cpu is not
      sufficient, since that can lead to partial sched_group existence (this
      could possibly be solved by doing the tear-down backwards, but holding the
      lock is much more robust).
      
      A consequence of the above is that we can no longer print the sched_domain
      debug stuff from cpu_attach_domain(), since that might now run with
      preemption disabled (due to classic RCU etc.) and sched_domain_debug()
      does some GFP_KERNEL allocations.
      
      Another thing to note is that we now fully rely on normal RCU and not
      RCU-sched. This is because, with the new and exciting RCU flavours we have
      grown over the years, BH context doesn't necessarily hold off RCU-sched
      grace periods (-rt is known to break this). This would in fact already
      cause us grief, since we do sched_domain/sched_group iterations from
      softirq context.
      
      This patch is somewhat larger than I would like it to be, but I didn't
      find any means of shrinking/splitting this.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20110407122942.245307941@chello.nl
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dce840a0
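      A minimal sketch of the RCU lifetime rule described above, assuming the
      sched_domain structure carries an rcu_head for this purpose (illustrative,
      not the actual patch):
      
          /* Defer freeing until readers are done; walkers use plain RCU. */
          static void free_sched_domain(struct rcu_head *rcu)
          {
                  struct sched_domain *sd = container_of(rcu, struct sched_domain, rcu);
                  kfree(sd);
          }
      
          static void destroy_sched_domain(struct sched_domain *sd)
          {
                  call_rcu(&sd->rcu, free_sched_domain);
          }
      
          /* Iteration side: hold rcu_read_lock() (not rcu_read_lock_sched())
           * across the whole walk, including from softirq context. */
          rcu_read_lock();
          for_each_domain(cpu, sd) {
                  /* sd and its sched_groups are guaranteed to stay around here */
          }
          rcu_read_unlock();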
    • sched: Eliminate dead code from wakeup_gran() · f4ad9bd2
      Shaohua Li authored
      calc_delta_fair() already checks for NICE_0_LOAD, so delete the duplicate check.
      
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/1302238389.3981.92.camel@sli10-conroe
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f4ad9bd2
    • sched: Fix erroneous all_pinned logic · b30aef17
      Ken Chen authored
      The scheduler load balancer has specific code to deal with cases of an
      unbalanced system due to lots of unmovable tasks (for example because of
      hard CPU affinity). In those situations, it excludes the busiest CPU that
      has pinned tasks from load-balance consideration, so that it can perform a
      second load-balance pass on the rest of the system.
      
      This all works as designed if there is only one cgroup in the system.
      
      However, when we have multiple cgroups, this logic has false positives and
      triggers multiple load-balance passes even though there are actually no
      pinned tasks at all.
      
      The reason it has false positives is that the all_pinned logic sits deep in
      the lowest-level function, can_migrate_task(), and is too low level:
      
      load_balance_fair() iterates over each task group and calls balance_tasks()
      to migrate the target load. Along the way, balance_tasks() also sets an
      all_pinned variable. Given that task groups are iterated, this all_pinned
      variable essentially reflects only the status of the last group in the
      scanning process. A task group can fail to migrate any load for a number of
      reasons, none of them due to CPU affinity. Nonetheless, this status bit is
      propagated back up to the higher-level load_balance(), which incorrectly
      concludes that no tasks could be moved. That kicks off the all_pinned logic
      and starts multiple passes attempting to move load onto the puller CPU.
      
      To fix this, move the all_pinned aggregation up to the iterator level. This
      ensures that the status is aggregated over all task groups, not just the
      last one in the list.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Cc: stable@kernel.org
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b30aef17
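      An illustrative sketch of the aggregation idea (simplified; the helper name
      balance_tasks_in_group() is hypothetical, standing in for the per-group
      balance_tasks() call): start from "everything pinned" once, and let any
      group that finds a movable task clear the flag, instead of letting the
      last group overwrite it:
      
          static unsigned long
          load_balance_fair(struct rq *this_rq, int this_cpu, struct rq *busiest,
                            unsigned long max_load_move, struct sched_domain *sd,
                            enum cpu_idle_type idle, int *all_pinned)
          {
                  unsigned long moved = 0;
                  struct task_group *tg;
      
                  *all_pinned = 1;        /* aggregate across all task groups */
      
                  list_for_each_entry_rcu(tg, &task_groups, list) {
                          int group_pinned = 1;
      
                          moved += balance_tasks_in_group(this_rq, this_cpu, busiest,
                                                          max_load_move - moved, sd,
                                                          idle, &group_pinned, tg);
                          /* Any group with a movable task clears the aggregate. */
                          if (!group_pinned)
                                  *all_pinned = 0;
                  }
                  return moved;
          }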
    • sched: Fix sched-domain avg_load calculation · b0432d8f
      Ken Chen authored
      In find_busiest_group(), the sched-domain avg_load isn't calculated at all
      if there is a group imbalance within the domain. This causes an erroneous
      imbalance calculation.
      
      The reason is that calculate_imbalance() sees sds->avg_load == 0 and dumps
      the entire sds->max_load into the imbalance variable, which is later used
      to migrate the entire load from the busiest CPU to the puller CPU.
      
      This has two really bad effects:
      
      1. a stampede of task migrations, with no way to break out of the bad
         state because of the positive feedback loop: large load delta ->
         heavier load migration -> larger imbalance, and the cycle goes on.
      
      2. severe imbalance in CPU queue depth.  This causes really long
         scheduling-latency blips, which badly affect applications with tight
         latency requirements.
      
      The fix is to have the kernel calculate the domain avg_load in both cases.
      This ensures that the imbalance calculation is always sensible and that the
      target is usually halfway between the busiest and the puller CPU.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b0432d8f
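      A minimal sketch of the fix in find_busiest_group() terms (illustrative;
      SCHED_LOAD_SCALE is the load-scale constant of that kernel era): compute
      the domain-wide average before any early "force balance" exit, so
      calculate_imbalance() never sees a zero average:
      
          /* Always valid from here on, group imbalance or not. */
          sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;
      
          if (sds.group_imb)
                  goto force_balance;     /* avg_load is already sensible here */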
  11. 05 Apr 2011 (1 commit)
  12. 31 Mar 2011 (2 commits)
  13. 04 Mar 2011 (2 commits)
  14. 23 Feb 2011 (3 commits)
    • sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra authored
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However, it didn't quite work reliably; most of the time one CPU of the
      second socket (6-11) would be idle and one CPU of the first socket (0-5)
      would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (as in our case, where 0-2 are idle but 3-5
      have 4 tasks between them).
      
      The detection phase needed a bit of a tweak, as it was too weak: it
      required more than two average-weight tasks of difference between the idle
      and busy CPUs in the group, which doesn't trigger for our test case. So
      change it to one or more average task weights of difference between CPUs.
      
      Once the detection phase worked, it was then defeated by the f_b_g() tests
      trying to avoid ping-pongs. In particular, this_load >= max_load triggered
      because the pulling CPU (the (first) idle CPU on the second socket, say 6)
      would find this_load to be 5 and max_load to be 4 (there'd be 5 tasks
      running on our socket and only 4 on the other socket).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      866ab43e
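      A hedged sketch of the strengthened detection check, in the style of the
      per-group statistics pass of that era's sched_fair.c (illustrative, not
      the verbatim patch):
      
          if (sgs->sum_nr_running)
                  avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
      
          /* Previously this required a spread of more than twice the average
           * task weight; one average task weight is enough to flag imbalance. */
          if ((max_cpu_load - min_cpu_load) >= avg_load_per_task)
                  sgs->group_imb = 1;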
    • sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra authored
      The existing comment tends to grow state (as it already has); split it up
      and place the pieces near the actual tests.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cc57aa8f
    • sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra authored
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c186fafe
  15. 16 Feb 2011 (1 commit)
  16. 03 Feb 2011 (4 commits)
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith authored
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group /
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target to be selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d95f4122
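      A minimal sketch of the fair-class hook described above (illustrative;
      set_next_buddy() and yield_task_fair() are the existing fair-class
      routines this builds on):
      
          static bool yield_to_task_fair(struct rq *rq, struct task_struct *p,
                                         bool preempt)
          {
                  struct sched_entity *se = &p->se;
      
                  /* Can't accelerate a task that isn't runnable here. */
                  if (!se->on_rq)
                          return false;
      
                  /* Hint: we'd really like p to run next... */
                  set_next_buddy(se);
      
                  /* ...and give up our own slot to make room for it. */
                  yield_task_fair(rq);
      
                  return true;
          }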
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel authored
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" tasks to be different processes in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ac53db59
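      A condensed, hedged sketch of the buddy ordering in pick_next_entity()
      (illustrative; wakeup_preempt_entity() is the existing fairness guard,
      and __pick_first_entity()/__pick_next_entity() walk the rbtree). Because
      next is checked last, it can override a yield hint from a different
      sub-tree:
      
          static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
          {
                  struct sched_entity *se = __pick_first_entity(cfs_rq);
                  struct sched_entity *left = se;
      
                  /* Avoid the skip (yield) buddy if the runner-up isn't too unfair. */
                  if (cfs_rq->skip == se) {
                          struct sched_entity *second = __pick_next_entity(se);
                          if (second && wakeup_preempt_entity(second, left) < 1)
                                  se = second;
                  }
      
                  /* Prefer the last buddy: return the CPU to a preempted task. */
                  if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
                          se = cfs_rq->last;
      
                  /* Someone really wants this to run; if it's not unfair, run it. */
                  if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
                          se = cfs_rq->next;
      
                  clear_buddies(cfs_rq, se);
                  return se;
          }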
    • sched: Limit the scope of clear_buddies · 2c13c919
      Rik van Riel authored
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task T(n) is woken up and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This is especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself is
      indicated to be.  As an added bonus, we stop walking up the tree when the
      buddy has already been cleared or points elsewhere.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2c13c919
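      A hedged sketch of the per-type clearing, one walker per buddy kind
      (illustrative, not the verbatim patch); the early break stops the walk as
      soon as a level's buddy already points at something else:
      
          static void __clear_buddies_last(struct sched_entity *se)
          {
                  for_each_sched_entity(se) {
                          struct cfs_rq *cfs_rq = cfs_rq_of(se);
                          if (cfs_rq->last == se)
                                  cfs_rq->last = NULL;
                          else
                                  break;  /* buddy points elsewhere: stop walking up */
                  }
          }
      
          /* __clear_buddies_next() is identical, operating on cfs_rq->next. */
      
          static void clear_buddies(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  /* Only clear the buddy type(s) that actually point at se. */
                  if (cfs_rq->last == se)
                          __clear_buddies_last(se);
                  if (cfs_rq->next == se)
                          __clear_buddies_next(se);
          }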
    • sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Rik van Riel authored
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefore, we want to check the per-CPU nr_running, not the cgroup-local
      one.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      725e7580
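      A minimal sketch of the check, in yield_task_fair() terms (illustrative):
      bail out based on the runqueue-wide count rather than the current task's
      own cfs_rq:
      
          /* Nothing else runnable anywhere on this CPU: nothing to yield to. */
          if (unlikely(rq->nr_running == 1))
                  return;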
  17. 26 Jan 2011 (1 commit)
    • sched: Fix switch_from_fair() · da7a735e
      Peter Zijlstra authored
      When a task is taken out of the fair class we must ensure the vruntime is
      properly normalized, because when we put it back in, it will be assumed to
      be normalized.
      
      The case that goes wrong is when changing away from the fair class
      while sleeping. Sleeping tasks have non-normalized vruntime in order
      to make sleeper-fairness work. So treat the switch away from fair as a
      wakeup and preserve the relative vruntime.
      
      Also update sysrq-n to call the ->switch_{to,from} methods.
      Reported-by: Onkalo Samu <samu.p.onkalo@nokia.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      da7a735e
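      A hedged sketch of the normalization on the switch-away path
      (illustrative; place_entity() is the usual sleeper-placement helper, and
      subtracting min_vruntime is what "normalized" means here):
      
          static void switched_from_fair(struct rq *rq, struct task_struct *p)
          {
                  struct sched_entity *se = &p->se;
                  struct cfs_rq *cfs_rq = cfs_rq_of(se);
      
                  /* A sleeping (!on_rq) task still carries a non-normalized
                   * vruntime; treat the class switch like a wakeup and keep
                   * only the relative part. */
                  if (!se->on_rq && p->state != TASK_RUNNING) {
                          place_entity(cfs_rq, se, 0);
                          se->vruntime -= cfs_rq->min_vruntime;
                  }
          }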