1. 25 Mar 2009, 1 commit
  2. 17 Mar 2009, 2 commits
  3. 11 Mar 2009, 1 commit
    • M
      sched: add avg_overlap decay · df1c99d4
      Authored by Mike Galbraith
      Impact: more precise avg_overlap metric - better load-balancing
      
      avg_overlap is used to measure the runtime overlap of the waker and
      wakee.
      
      However, when a process changes behaviour, e.g. a pipe becomes
      un-congested and we don't need to go to sleep after a wakeup
      for a while, the avg_overlap value grows stale.
      
      While a task is running we now use its average runtime between
      preemptions as the avg_overlap measure, since the amount of runtime
      correlates with cache footprint.
      
      The longer we run, the less likely we are to want to be migrated
      to another CPU (a toy model of such a decaying average follows this
      entry).
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1236709131.25234.576.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      df1c99d4
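      A hedged, user-space sketch of the idea in the entry above: a running
      average that is pulled toward each new runtime sample, so stale history
      decays. The helper name update_avg, the 1/8 decay factor and the sample
      values are assumptions for illustration only, not the kernel's code.

      #include <stdint.h>
      #include <stdio.h>

      /* Toy model of a decaying running average in the spirit of avg_overlap:
       * each new sample pulls the average 1/8th of the way toward itself. */
      static void update_avg(uint64_t *avg, uint64_t sample)
      {
          int64_t diff = (int64_t)(sample - *avg);

          *avg += diff / 8;
      }

      int main(void)
      {
          uint64_t avg_overlap = 50000;   /* ns, learned while the pipe was congested */

          /* The pipe un-congests: the task now runs much longer between
           * preemptions, and the fresh samples drag the stale average along. */
          for (int i = 0; i < 20; i++)
              update_avg(&avg_overlap, 2000000);

          printf("avg_overlap after 20 long runs: %llu ns\n",
                 (unsigned long long)avg_overlap);
          return 0;
      }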
  4. 10 Mar 2009, 1 commit
  5. 06 Mar 2009, 1 commit
  6. 05 Mar 2009, 1 commit
    • F
      sched: don't rebalance if attached on NULL domain · 8a0be9ef
      Authored by Frederic Weisbecker
      Impact: fix function graph trace hang / drop pointless softirq on UP
      
      While debugging a function graph trace hang on an old PII, I saw
      that it consumed most of its time in the timer interrupt, and
      the domain rebalancing softirq was the biggest contributor.
      
      The timer interrupt calls trigger_load_balance(), which decides
      whether it is worth scheduling a rebalancing softirq.
      
      With a UP-built kernel, no problem arises because there are no
      sched domains to begin with.
      
      With an SMP-built kernel running on an SMP box, there is still no
      problem: the softirq will be raised each time we reach the
      next_balance time.
      
      But with an SMP-built kernel running on a UP box (most distros
      provide SMP kernels by default, whatever box you have), the CPU is
      attached to the NULL sched domain, and an unexpected behaviour
      occurs:
      
      trigger_load_balance() raises the rebalancing softirq; later, in
      softirq context, run_rebalance_domains() -> rebalance_domains(), where
      the for_each_domain(cpu, sd) loop is never entered because of the
      NULL domain we are attached to.  This means rq->next_balance is never
      updated, so on every subsequent timer tick we enter
      trigger_load_balance(), which will always re-raise the rebalancing
      softirq:
      
      if (time_after_eq(jiffies, rq->next_balance))
      	raise_softirq(SCHED_SOFTIRQ);
      
      So for each tick, we process this pointless softirq.
      
      This patch fixes it by checking whether we are attached to the NULL
      domain before raising the softirq.  Another possible fix would be to
      set rq->next_balance to the maximal possible jiffies value when we
      are attached to the NULL domain (a toy model of the check follows
      this entry).
      
      v2: build fix on UP
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <49af242d.1c07d00a.32d5.ffffc019@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8a0be9ef
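      A minimal user-space model of the fix described above: bail out of the
      tick-time check when the CPU has no scheduling domain, so the softirq
      is never raised on a NULL-domain (effectively UP) setup. The struct
      layout, the has_sched_domain flag and the counter are illustrative
      assumptions, not the kernel's data structures.

      #include <stdbool.h>
      #include <stdio.h>

      struct rq {
          unsigned long next_balance;
          bool has_sched_domain;   /* false models "attached to the NULL domain" */
      };

      static int softirqs_raised;

      static void raise_rebalance_softirq(void)
      {
          softirqs_raised++;
      }

      static void trigger_load_balance(struct rq *rq, unsigned long jiffies)
      {
          /* The fix: nothing to rebalance across if there is no domain. */
          if (!rq->has_sched_domain)
              return;

          if (jiffies >= rq->next_balance)
              raise_rebalance_softirq();
      }

      int main(void)
      {
          struct rq rq = { .next_balance = 0, .has_sched_domain = false };

          for (unsigned long tick = 0; tick < 1000; tick++)
              trigger_load_balance(&rq, tick);

          printf("softirqs raised on NULL domain: %d\n", softirqs_raised);
          return 0;
      }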
  7. 02 Mar 2009, 1 commit
  8. 27 Feb 2009, 1 commit
  9. 26 Feb 2009, 2 commits
  10. 16 Feb 2009, 2 commits
  11. 12 Feb 2009, 1 commit
  12. 11 Feb 2009, 1 commit
  13. 06 Feb 2009, 1 commit
    • J
      wait: prevent exclusive waiter starvation · 777c6c5f
      Authored by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by something other than
      a wake-up through the queue (a toy model of the race and the fix follows
      this entry).
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
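      A hedged, single-process C model of the race and the fix: an aborting
      exclusive waiter either removes itself from the queue or, if a
      concurrent wake-up already removed it, passes the wake-up on to the
      next waiter. The names dequeue_if_queued and wake_one_locked are made
      up for the illustration; only abort_exclusive_wait mirrors a name from
      the entry above, and the real kernel code is considerably richer.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct waiter {
          struct waiter *next;
          bool woken;
      };

      struct waitqueue {
          pthread_mutex_t lock;
          struct waiter *head;          /* exclusive waiters, FIFO */
      };

      /* Remove w from q if it is still queued; return true if it was queued. */
      static bool dequeue_if_queued(struct waitqueue *q, struct waiter *w)
      {
          struct waiter **pp;

          for (pp = &q->head; *pp; pp = &(*pp)->next) {
              if (*pp == w) {
                  *pp = w->next;
                  return true;
              }
          }
          return false;
      }

      /* Wake exactly one exclusive waiter, queue lock held. */
      static void wake_one_locked(struct waitqueue *q)
      {
          if (q->head) {
              q->head->woken = true;    /* stand-in for waking the task */
              q->head = q->head->next;
          }
      }

      /* Model of abort_exclusive_wait(): called when the waiter bails out
       * because of a signal rather than a wake-up through the queue. */
      static void abort_exclusive_wait(struct waitqueue *q, struct waiter *w)
      {
          pthread_mutex_lock(&q->lock);
          if (!dequeue_if_queued(q, w)) {
              /* A concurrent wake-up already consumed our slot: pass it on,
               * otherwise the next waiter would sleep forever. */
              wake_one_locked(q);
          }
          pthread_mutex_unlock(&q->lock);
      }

      int main(void)
      {
          struct waitqueue q = { PTHREAD_MUTEX_INITIALIZER, NULL };
          struct waiter a = { NULL, false }, b = { NULL, false };

          a.next = &b;
          q.head = &a;

          pthread_mutex_lock(&q.lock);
          wake_one_locked(&q);              /* the lock holder wakes 'a' ... */
          pthread_mutex_unlock(&q.lock);

          abort_exclusive_wait(&q, &a);     /* ... while 'a' aborts on a signal */

          printf("b woken: %d\n", b.woken); /* 1: wake-up handed to next waiter */
          return 0;
      }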
  14. 05 Feb 2009, 1 commit
  15. 01 Feb 2009, 2 commits
  16. 15 Jan 2009, 3 commits
  17. 14 Jan 2009, 4 commits
  18. 12 Jan 2009, 1 commit
  19. 11 Jan 2009, 2 commits
  20. 07 Jan 2009, 1 commit
    • P
      sched: fix possible recursive rq->lock · da8d5089
      Authored by Peter Zijlstra
      Vaidyanathan Srinivasan reported:
      
       > =============================================
       > [ INFO: possible recursive locking detected ]
       > 2.6.28-autotest-tip-sv #1
       > ---------------------------------------------
       > klogd/5062 is trying to acquire lock:
       >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
       >
       > but task is already holding lock:
       >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
      
      With sched_mc set to 2 (it is off by default).
      
      Strictly speaking we'll not deadlock, because ttwu will not be able to
      place the migration task on our rq, but since the code can deal with
      both rqs getting unlocked, this seems the easiest way out.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      da8d5089
  21. 06 Jan 2009, 2 commits
  22. 05 Jan 2009, 2 commits
  23. 04 Jan 2009, 1 commit
    • M
      sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      Authored by Mike Travis
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c by using
      node_to_cpumask_ptr were undone by the early cpumask API updates, and
      causes a panic due to stack overflow.  This patch undoes those changes
      by using cpumask_of_node() which returns a 'const struct cpumask *'.
      
      In addition, cpu_coregoup_map is replaced with cpu_coregroup_mask further
      reducing stack usage.  (Both of these updates removed 9 FIXME's!)
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ca09dfc
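      A user-space illustration of why the 'const struct cpumask *' interface
      saves stack: returning a pointer into a per-node table instead of a mask
      copied by value. NR_CPUS, MAX_NODES and the by-value helper name are
      assumptions made up for the example, not the kernel's definitions.

      #include <stdio.h>

      #define NR_CPUS   4096
      #define MAX_NODES 64

      struct cpumask {
          unsigned long bits[NR_CPUS / (8 * sizeof(unsigned long))];
      };

      static struct cpumask node_masks[MAX_NODES];   /* filled at boot in reality */

      /* Pointer-returning variant: no cpumask lands on the caller's stack. */
      static const struct cpumask *cpumask_of_node(int node)
      {
          return &node_masks[node];
      }

      /* By-value variant: the whole mask is copied onto the stack. */
      static struct cpumask node_to_cpumask_by_value(int node)
      {
          return node_masks[node];
      }

      int main(void)
      {
          const struct cpumask *p = cpumask_of_node(0);
          struct cpumask copy = node_to_cpumask_by_value(0);

          printf("pointer costs %zu bytes of stack, copy costs %zu bytes\n",
                 sizeof(p), sizeof(copy));
          return 0;
      }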
  24. 31 Dec 2008, 2 commits
    • M
      [PATCH] idle cputime accounting · 79741dd3
      Authored by Martin Schwidefsky
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time. This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better and distinguish between
      the time spent doing nothing and the time spent by idle doing work. The
      first is accounted with account_idle_time and the second with
      account_system_time.
      
      The architectures that use the account_xxx_time interface directly, and
      not the account_xxx_ticks interface, now need to do the check for the
      idle process in their arch code. In particular, to improve the system
      vs. true idle time accounting, the arch code needs to measure the true
      idle time instead of just testing for the idle process.
      
      To improve the tick-based accounting as well, we would need an
      architecture primitive that can tell us whether the pt_regs of the
      interrupted context points to the magic instruction that halts the cpu.
      
      In addition, idle time is no longer added to the stime of the idle process.
      This field now contains the system time of the idle process as it should
      be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
      every tick that occurs while idle is running will be accounted as idle
      time.
      
      This patch contains the necessary common code changes to be able to
      distinguish idle system time from true idle time. Architectures with
      support for VIRT_CPU_ACCOUNTING need some changes to exploit this
      (a toy model of the tick-time split follows this entry).
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      79741dd3
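      A toy model of the accounting split described above: ticks landing on a
      truly idle CPU go to idle time, while ticks landing on the idle task
      when it is doing real work go to system time. The function names echo
      the entry; the one-argument signatures, the flags and the counters are
      simplifications assumed for the example, not the kernel interfaces.

      #include <stdbool.h>
      #include <stdio.h>

      struct cpustat {
          unsigned long long idle;
          unsigned long long system;
      };

      static struct cpustat stat;

      static void account_idle_time(unsigned long long ticks)
      {
          stat.idle += ticks;
      }

      static void account_system_time(unsigned long long ticks)
      {
          stat.system += ticks;
      }

      /* Called once per timer tick by the model's "arch code". */
      static void account_process_tick(bool current_is_idle, bool idle_doing_work)
      {
          if (current_is_idle && !idle_doing_work)
              account_idle_time(1);
          else
              account_system_time(1);
      }

      int main(void)
      {
          account_process_tick(true, false);   /* truly idle            */
          account_process_tick(true, true);    /* idle task doing work  */
          account_process_tick(false, false);  /* a normal task running */

          printf("idle=%llu system=%llu\n", stat.idle, stat.system);
          return 0;
      }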
    • M
      [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      Authored by Martin Schwidefsky
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures. On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added. In addition, system time that is accounted as guest time
      to a process's user time is currently added to the scaled system time
      instead of the scaled user time.
      
      To fix the bugs and to prevent future forgetfulness, this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time (see the sketch after
      this entry).
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      457533a7
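      A toy model of the merge: each accounting function takes both the plain
      and the scaled cputime and updates both fields, so an architecture can
      no longer update one and forget the other. The struct and signatures are
      simplified stand-ins assumed for the example, not the kernel's.

      #include <stdio.h>

      struct task_times {
          unsigned long long utime, utimescaled;
          unsigned long long stime, stimescaled;
      };

      static void account_user_time(struct task_times *t,
                                    unsigned long long cputime,
                                    unsigned long long cputime_scaled)
      {
          t->utime       += cputime;
          t->utimescaled += cputime_scaled;   /* previously a separate, optional call */
      }

      static void account_system_time(struct task_times *t,
                                      unsigned long long cputime,
                                      unsigned long long cputime_scaled)
      {
          t->stime       += cputime;
          t->stimescaled += cputime_scaled;
      }

      int main(void)
      {
          struct task_times t = { 0, 0, 0, 0 };

          account_user_time(&t, 100, 120);
          account_system_time(&t, 40, 48);

          printf("utime=%llu/%llu stime=%llu/%llu\n",
                 t.utime, t.utimescaled, t.stime, t.stimescaled);
          return 0;
      }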
  25. 29 Dec 2008, 3 commits
    • G
      sched: create "pushable_tasks" list to limit pushing to one attempt · 917b627d
      Authored by Gregory Haskins
      The RT scheduler employs a "push/pull" design to actively balance tasks
      within the system (on a per disjoint cpuset basis).  When a task is
      awoken, it is immediately determined if there are any lower priority
      cpus which should be preempted.  This is opposed to the way normal
      SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
      operation to occur before spreading out load.
      
      When a particular RQ has more than one active RT task, it is said to
      be in an "overloaded" state.  Once this occurs, the system enters
      active balancing mode, where it will try to push the task away, or
      persuade a different cpu to pull it over.  The system stays in this
      state until it falls back to at most one queued RT task per RQ.
      
      However, the current implementation suffers from a limitation in the
      push logic.  Once overloaded, all tasks (other than current) on the
      RQ are analyzed on every push operation, even if they were previously
      unpushable (due to affinity, etc.).  What's more, the operation stops
      at the first task that is unpushable and will not look at items
      lower in the queue.  This causes two problems:
      
      1) We can have the same tasks analyzed over and over again during each
         push, which extends out the fast path in the scheduler for no
         gain.  Consider a RQ that has dozens of tasks that are bound to a
         core.  Each one of those tasks will be encountered and skipped
         for each push operation while they are queued.
      
      2) There may be lower-priority tasks under the unpushable task that
         could have been successfully pushed, but will never be considered
         until either the unpushable task is cleared, or a pull operation
         succeeds.  The net result is a potential latency source for mid
         priority tasks.
      
      This patch aims to rectify these two conditions by introducing a new
      priority-sorted list: "pushable_tasks".  A task is added to the list
      each time it is activated or preempted.  It is removed from the
      list any time it is deactivated, made current, or fails to push.
      
      This works because a task only needs to be attempted to push once.
      After an initial failure to push, the other cpus will eventually try to
      pull the task when the conditions are proper.  This also solves the
      problem that we don't completely analyze all tasks due to encountering
      an unpushable task.  Now every task will have a push attempted (when
      appropriate).
      
      This reduces latency both by shortening the critical section of the
      rq->lock for certain workloads, and by making sure the algorithm
      considers all eligible tasks in the system (a toy model of the list
      follows this entry).
      
      [ rostedt: added a couple more BUG_ONs ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      917b627d
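      A hedged user-space sketch of the "pushable_tasks" bookkeeping: a
      priority-ordered list that tasks join when they become runnable but are
      not running, and leave when they run, stop being runnable, or fail a
      push, so each push attempt only ever looks at candidates still worth
      pushing. The kernel uses a plist inside the runqueue; the linked list,
      names and priority values here are illustrative assumptions.

      #include <stdio.h>

      struct task {
          int prio;                 /* lower value = higher priority */
          const char *name;
          struct task *next;
      };

      static struct task *pushable_head;

      /* Insert in priority order so a push can always take the head. */
      static void enqueue_pushable_task(struct task *p)
      {
          struct task **pp = &pushable_head;

          while (*pp && (*pp)->prio <= p->prio)
              pp = &(*pp)->next;
          p->next = *pp;
          *pp = p;
      }

      /* Remove a task that ran, was deactivated, or failed a push. */
      static void dequeue_pushable_task(struct task *p)
      {
          struct task **pp;

          for (pp = &pushable_head; *pp; pp = &(*pp)->next) {
              if (*pp == p) {
                  *pp = p->next;
                  return;
              }
          }
      }

      int main(void)
      {
          struct task a = { 10, "a", NULL }, b = { 5, "b", NULL }, c = { 20, "c", NULL };

          enqueue_pushable_task(&a);
          enqueue_pushable_task(&b);
          enqueue_pushable_task(&c);

          dequeue_pushable_task(&b);   /* "b" was made current or failed a push */

          for (struct task *t = pushable_head; t; t = t->next)
              printf("pushable: %s (prio %d)\n", t->name, t->prio);
          return 0;
      }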
    • G
      sched: add sched_class->needs_post_schedule() member · 967fc046
      Authored by Gregory Haskins
      We currently run class->post_schedule() outside of the rq->lock, which
      means that we need to test for the need to post_schedule outside of
      the lock to avoid a forced reacquisition.  This is currently not a
      problem as we only look at rq->rt.overloaded.  However, we want to
      enhance this going forward to look at more state in order to reduce
      the need for post_schedule to a bare minimum set.  Therefore, we
      introduce a new member function called needs_post_schedule(), which
      tests for the post_schedule condition without actually performing the
      work.  It is therefore safe to call this function before the rq->lock
      is released, because we are guaranteed not to drop the lock at an
      intermediate point (such as what post_schedule() may do).
      
      We will use this later in the series (a small model of the idea
      follows this entry).
      
      [ rostedt: removed paranoid BUG_ON ]
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      967fc046
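      A small model of the shape of the change: sample a cheap predicate
      while the lock is still held, remember the answer, and only then run
      the heavier post-schedule work. The rt_overloaded flag and the
      finish_schedule wrapper are assumptions for the illustration, not the
      scheduler's actual code paths.

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct rq {
          pthread_mutex_t lock;
          bool rt_overloaded;     /* the state post_schedule currently depends on */
      };

      /* Cheap test, rq->lock held, no work done here. */
      static bool needs_post_schedule(struct rq *rq)
      {
          return rq->rt_overloaded;
      }

      /* The real work; in the kernel this may drop rq->lock internally. */
      static void post_schedule(struct rq *rq)
      {
          (void)rq;
          printf("pushing RT tasks away\n");
      }

      static void finish_schedule(struct rq *rq)
      {
          pthread_mutex_lock(&rq->lock);
          bool need = needs_post_schedule(rq);    /* decided under the lock */
          pthread_mutex_unlock(&rq->lock);

          if (need)
              post_schedule(rq);
      }

      int main(void)
      {
          struct rq rq = { PTHREAD_MUTEX_INITIALIZER, true };

          finish_schedule(&rq);
          return 0;
      }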
    • G
      sched: make double-lock-balance fair · 8f45e2b5
      Authored by Gregory Haskins
      double_lock_balance() currently favors logically lower cpus since they
      often do not have to release their own lock to acquire a second lock.
      The result is that logically higher cpus can get starved when there is
      a lot of pressure on the RQs.  This can result in higher latencies on
      higher cpu-ids.
      
      This patch makes the algorithm fairer by forcing all paths to release
      both locks before acquiring them again.  Since callsites to
      double_lock_balance already consider it a potential preemption/reschedule
      point, they have the proper logic to recheck for atomicity violations
      (see the sketch after this entry).
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      8f45e2b5
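      A user-space sketch of the fairness idea: the caller always drops its
      own lock and then takes both locks in one fixed global order, so
      neither runqueue can keep starving the other. pthread mutexes stand in
      for rq->lock, and the ordering-by-id rule is an illustrative stand-in
      for however the kernel orders the two locks; this is not the kernel's
      double_lock_balance() implementation.

      #include <pthread.h>
      #include <stdio.h>

      struct rq {
          int id;
          pthread_mutex_t lock;
      };

      /* Take both runqueue locks in a fixed order to avoid deadlock. */
      static void double_rq_lock(struct rq *a, struct rq *b)
      {
          if (a->id < b->id) {
              pthread_mutex_lock(&a->lock);
              pthread_mutex_lock(&b->lock);
          } else {
              pthread_mutex_lock(&b->lock);
              pthread_mutex_lock(&a->lock);
          }
      }

      /* Caller holds this_rq->lock; returns with both locks held.  Because
       * the lock is always dropped here, callers must recheck their state
       * afterwards, as the entry above notes. */
      static void double_lock_balance(struct rq *this_rq, struct rq *busiest)
      {
          pthread_mutex_unlock(&this_rq->lock);
          double_rq_lock(this_rq, busiest);
      }

      int main(void)
      {
          struct rq rq0 = { 0, PTHREAD_MUTEX_INITIALIZER };
          struct rq rq1 = { 1, PTHREAD_MUTEX_INITIALIZER };

          pthread_mutex_lock(&rq1.lock);      /* we are CPU 1, holding our lock */
          double_lock_balance(&rq1, &rq0);    /* drop, then take both in order  */

          printf("both runqueue locks held\n");
          pthread_mutex_unlock(&rq0.lock);
          pthread_mutex_unlock(&rq1.lock);
          return 0;
      }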