1. 04 Mar 2011 (1 commit)
    • sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy · c02aa73b
      Authored by Darren Hart
      The current scheduler implementation returns -EPERM when trying to
      change from SCHED_IDLE to SCHED_OTHER or SCHED_BATCH. Since SCHED_IDLE
      is considered to be a nice 20 on steroids, changing to another policy
      should be allowed provided the RLIMIT_NICE is accounted for.
      
      This patch allows the following test-case to pass with RLIMIT_NICE=40,
      but still fail with RLIMIT_NICE=10 when the calling process is run
      from a typical shell (nice 0, or 20 in rlimit terms).  A setrlimit()
      sketch follows this entry.
      
      #define _GNU_SOURCE	/* glibc exposes SCHED_IDLE in <sched.h> under _GNU_SOURCE */
      #include <stdio.h>
      #include <sched.h>

      int main(void)
      {
      	int ret;
      	struct sched_param sp;
      	sp.sched_priority = 0;

      	/* switch to SCHED_IDLE */
      	ret = sched_setscheduler(0, SCHED_IDLE, &sp);
      	printf("setscheduler IDLE: %d\n", ret);
      	if (ret)
      		return ret;

      	/* switch back to SCHED_OTHER */
      	ret = sched_setscheduler(0, SCHED_OTHER, &sp);
      	printf("setscheduler OTHER: %d\n", ret);

      	return ret;
      }
      
       $ ulimit -e
       40
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: 0
      
       $ ulimit -e 10
       $ ulimit -e
       10
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: -1
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
      LKML-Reference: <4D657BEE.4040608@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c02aa73b
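      A minimal sketch (not part of the patch) of raising RLIMIT_NICE
      programmatically with the standard getrlimit(2)/setrlimit(2) API
      instead of the shell's ulimit -e; the value 40 mirrors the passing
      case above:

      #include <stdio.h>
      #include <sys/resource.h>

      int main(void)
      {
      	/* RLIMIT_NICE is expressed as 20 - nice, i.e. the range 1..40. */
      	struct rlimit rl = { .rlim_cur = 40, .rlim_max = 40 };

      	if (setrlimit(RLIMIT_NICE, &rl)) {
      		/* Raising the hard limit needs CAP_SYS_RESOURCE. */
      		perror("setrlimit(RLIMIT_NICE)");
      		return 1;
      	}

      	if (getrlimit(RLIMIT_NICE, &rl) == 0)
      		printf("RLIMIT_NICE soft=%ld hard=%ld\n",
      		       (long)rl.rlim_cur, (long)rl.rlim_max);
      	return 0;
      }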
  2. 26 Feb 2011 (1 commit)
    • sched: Clean up the IRQ_TIME_ACCOUNTING code · 544b4a1f
      Authored by Venkatesh Pallipadi
      Fix this warning:
      
        lkml.org/lkml/2011/1/30/124
      
       kernel/sched.c:3719: warning: 'irqtime_account_idle_ticks' defined but not used
       kernel/sched.c:3720: warning: 'irqtime_account_process_tick' defined but not used
      
      In a cleaner way than:
      
       7e949870: sched: Add #ifdef around irq time accounting functions
      
      This patch will not have any functional impact.  (A generic
      stub-pattern sketch follows this entry.)
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Cc: heiko.carstens@de.ibm.com
      Cc: a.p.zijlstra@chello.nl
      LKML-Reference: <1298675596-10992-1-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      544b4a1f
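      A hand-written sketch (not the patch itself) of the stub pattern that
      avoids such "defined but not used" warnings: when the config option
      is off, the functions become empty static inlines, so every call
      site compiles unconditionally and no warning is emitted:

      #ifdef CONFIG_IRQ_TIME_ACCOUNTING

      static void irqtime_account_idle_ticks(int ticks)
      {
      	/* real irq-time accounting work goes here */
      }

      #else /* !CONFIG_IRQ_TIME_ACCOUNTING */

      /* Empty inline stubs: no code, and no unused-function warnings. */
      static inline void irqtime_account_idle_ticks(int ticks) { }

      #endif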
  3. 25 Feb 2011 (1 commit)
  4. 03 Feb 2011 (2 commits)
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Authored by Mike Galbraith
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing
      the caller of yield_to() to accelerate another thread in its thread
      group or task group.

      Implemented via a scheduler hint, using cfs_rq->next to encourage
      the target being selected.  We can rely on pick_next_entity to keep
      things fair, so no one can accelerate a thread that has already used
      its fair share of CPU time.

      This also means callers should only call yield_to() when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.  (A caller sketch follows this entry.)
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d95f4122
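      A hedged kernel-style caller sketch, loosely modeled on lock-holder
      boosting in a hypervisor; vcpu_task() is hypothetical, and treating
      yield_to()'s return as "hint taken or not" is assumed from the
      description above:

      #include <linux/sched.h>

      /* Hypothetical lookup of the task backing a vCPU. */
      extern struct task_struct *vcpu_task(int idx);

      static void boost_lock_holder(int holder)
      {
      	struct task_struct *p = vcpu_task(holder);

      	/* Hint the scheduler to run the lock holder next,
      	 * preempting us, rather than spinning uselessly. */
      	if (p && yield_to(p, true))
      		return;		/* hint was taken */

      	yield();		/* fall back to a plain yield */
      }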
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Authored by Rik van Riel
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" tasks to be different processes in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.  (A simplified ordering
      sketch follows this entry.)
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ac53db59
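      A simplified sketch (not the kernel's exact code) of the ordering
      described above; leftmost_entity(), second_leftmost_entity() and
      buddy_eligible() stand in for the real tree walk and fairness check
      and are hypothetical:

      static struct sched_entity *pick_next_entity_sketch(struct cfs_rq *cfs_rq)
      {
      	struct sched_entity *se = leftmost_entity(cfs_rq);	/* hypothetical */

      	/* 1) yield: skip the entity that just yielded */
      	if (cfs_rq->skip == se)
      		se = second_leftmost_entity(cfs_rq);		/* hypothetical */

      	/* 2) last: prefer the cache-hot previous entity if that stays fair */
      	if (cfs_rq->last && buddy_eligible(cfs_rq, cfs_rq->last))
      		se = cfs_rq->last;

      	/* 3) next: checked last, so it can override the yield hint */
      	if (cfs_rq->next && buddy_eligible(cfs_rq, cfs_rq->next))
      		se = cfs_rq->next;

      	return se;
      }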
  5. 27 Jan 2011 (1 commit)
  6. 26 Jan 2011 (7 commits)
  7. 19 Jan 2011 (1 commit)
  8. 18 Jan 2011 (2 commits)
  9. 07 Jan 2011 (3 commits)
  10. 05 Jan 2011 (2 commits)
    • sched: Change wait_for_completion_*_timeout() to return a signed long · 6bf41237
      Authored by NeilBrown
      wait_for_completion_*_timeout() can return:
      
         0: if the wait timed out
       -ve: if the wait was interrupted
       +ve: if the completion was completed.
      
      As they currently return an 'unsigned long', the last two cases
      are not easily distinguished, which can easily result in buggy
      code, as is the case for the recently added
      wait_for_completion_interruptible_timeout() call in
      net/sunrpc/cache.c.

      So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT
      is LONG_MAX, a large +ve return value should never overflow.
      (A sketch of the resulting three-way check follows this entry.)
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: J. Bruce Fields <bfields@fieldses.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6bf41237
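      A sketch of the three-way check the signed return type enables;
      my_done and the one-second (HZ) timeout are illustrative:

      #include <linux/completion.h>
      #include <linux/errno.h>
      #include <linux/jiffies.h>

      static int wait_for_my_event(struct completion *my_done)
      {
      	long ret;	/* signed: all three outcomes are distinguishable */

      	ret = wait_for_completion_interruptible_timeout(my_done, HZ);
      	if (ret < 0)
      		return ret;		/* interrupted, e.g. -ERESTARTSYS */
      	if (ret == 0)
      		return -ETIMEDOUT;	/* wait timed out */
      	return 0;			/* completed; ret is jiffies left */
      }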
    • [S390] mutex: Introduce arch_mutex_cpu_relax() · 34b133f8
      Authored by Gerald Schaefer
      The spinning mutex implementation uses cpu_relax() in busy loops as a
      compiler barrier. Depending on the architecture, cpu_relax() may do more
      than needed in these specific mutex spin loops. On System z we also give
      up the time slice of the virtual cpu in cpu_relax(), which prevents
      effective spinning on the mutex.

      This patch replaces cpu_relax() in the spinning mutex code with
      arch_mutex_cpu_relax(), which can be defined by each architecture that
      selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
      this patch should not affect architectures other than System z for now.
      (The guard pattern is sketched after this entry.)
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290437256.7455.4.camel@thinkpad>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      34b133f8
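      The default can be provided with a guard along these lines (a sketch
      of the pattern the commit describes; exact header placement is
      simplified):

      #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
      /* Default: plain cpu_relax(), as before this patch. */
      #define arch_mutex_cpu_relax()	cpu_relax()
      #else
      /* Architectures selecting HAVE_ARCH_MUTEX_CPU_RELAX supply their own. */
      extern void arch_mutex_cpu_relax(void);
      #endif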
  11. 04 Jan 2011 (1 commit)
  12. 20 Dec 2010 (1 commit)
  13. 16 Dec 2010 (3 commits)
  14. 09 Dec 2010 (3 commits)
  15. 30 Nov 2010 (3 commits)
    • sched: Add 'autogroup' scheduling feature: automated per session task groups · 5091faa4
      Authored by Mike Galbraith
      A recurring complaint from CFS users is that parallel kbuild has
      a negative impact on desktop interactivity.  This patch
      implements an idea from Linus, to automatically create task
      groups.  Currently, only per session autogroups are implemented,
      but the patch leaves the way open for enhancement.
      
      Implementation: each task's signal struct contains an inherited
      pointer to a refcounted autogroup struct containing a task group
      pointer, the default for all tasks pointing to the
      init_task_group.  When a task calls setsid(), a new task group
      is created, the process is moved into the new task group, and a
      reference to the previous task group is dropped.  Child
      processes inherit this task group thereafter, and increase its
      refcount.  When the last thread of a process exits, the
      process's reference is dropped, such that when the last process
      referencing an autogroup exits, the autogroup is destroyed.
      
      At runqueue selection time, IFF a task has no cgroup assignment,
      its current autogroup is used.
      
      Autogroup bandwidth is controllable by setting its nice level
      through the proc filesystem:
      
        cat /proc/<pid>/autogroup
      
      Displays the task's group and the group's nice level.
      
        echo <nice level> > /proc/<pid>/autogroup
      
      Sets the task group's shares to the weight of a nice <level>
      task.  Setting the nice level is rate limited for !admin users
      due to the abuse risk of task group locking.
      
      The feature is enabled at boot by default if
      CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via
      the boot option noautogroup, and can also be turned on/off on
      the fly via:

        echo [01] > /proc/sys/kernel/sched_autogroup_enabled

      ... which will automatically move tasks to/from the root task
      group.  (A read-out sketch follows this entry.)
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Markus Trippelsdorf <markus@trippelsdorf.de>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      [ Removed the task_group_path() debug code, and fixed !EVENTFD build failure. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1290281700.28711.9.camel@maggy.simson.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      5091faa4
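      A small userspace sketch (illustrative, not from the patch) that
      reads the caller's autogroup via the proc interface described above;
      the output format shown in the comment is an assumption:

      #include <stdio.h>

      int main(void)
      {
      	char buf[128];
      	FILE *f = fopen("/proc/self/autogroup", "r");

      	if (!f) {
      		/* likely a kernel without CONFIG_SCHED_AUTOGROUP */
      		perror("fopen(/proc/self/autogroup)");
      		return 1;
      	}
      	if (fgets(buf, sizeof(buf), f))
      		fputs(buf, stdout);	/* e.g. "/autogroup-123 nice 0" */
      	fclose(f);
      	return 0;
      }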
    • sched: Fix unregister_fair_sched_group() · 822bc180
      Authored by Paul Turner
      In the flipping and flopping between calling
      unregister_fair_sched_group() on a per-cpu versus per-group basis
      we ended up in a bad state.
      
      Remove from the list for the passed cpu as opposed to some
      arbitrary index.
      
      ( This fixes explosions w/ autogroup as well as a group
        creation/destruction stress test. )
      Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <20101130005740.080828123@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      822bc180
    • rcu,cleanup: move synchronize_sched_expedited() out of sched.c · 7b27d547
      Authored by Lai Jiangshan
      The first version of synchronize_sched_expedited() used the migration
      code in the scheduler, and was therefore implemented in kernel/sched.c.
      However, the more recent version of this code no longer uses the
      migration code, so this commit moves it to the main RCU source files.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      7b27d547
  16. 26 Nov 2010 (2 commits)
  17. 23 Nov 2010 (1 commit)
  18. 18 Nov 2010 (5 commits)