Commits · 34d76c41554a05425613d16efebb3069c4c545f0 · OpenHarmony / kernel_linux

28 Aug, 2009 1 commit

sched: Fix division by zero - really · 34d76c41

Authored by Peter Zijlstra on Aug 27, 2009

When re-computing the shares for each task group's cpu
representation we need the ratio of weight on each cpu vs the
total weight of the sched domain.

Since load-balancing is loosely (read not) synchronized, the
weight of individual cpus can change between doing the sum and
calculating the ratio.

The previous patch dealt with only one of the race scenarios;
this patch sidesteps them all by saving a snapshot of all the
individual cpu weights, thereby always working on a consistent
set.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: torvalds@linux-foundation.org
Cc: jes@sgi.com
Cc: jens.axboe@oracle.com
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Arjan van de Ven <arjan@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
LKML-Reference: <1251371336.18584.77.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

34d76c41
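
The snapshot idea from 34d76c41 can be sketched in plain C. This is a simplified userspace model, not the kernel's tg_shares_up(): `NR_CPUS_DEMO`, `live_weight` and `compute_shares()` are hypothetical names, and falling back to a weight of 1 per cpu when the whole domain is idle is an assumption made only for this demo.

```c
#include <stddef.h>

#define NR_CPUS_DEMO 4

/* Stand-in for per-cpu weights that other cpus may change at any time. */
static volatile unsigned long live_weight[NR_CPUS_DEMO];

/*
 * Distribute total_shares across cpus in proportion to their weight.
 * Each weight is read exactly once into snap[], so the sum and the
 * ratios are always computed from the same set of values.
 */
static void compute_shares(unsigned long total_shares,
			   unsigned long shares[NR_CPUS_DEMO])
{
	unsigned long snap[NR_CPUS_DEMO];
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < NR_CPUS_DEMO; i++) {
		snap[i] = live_weight[i];	/* single read per cpu */
		sum += snap[i];
	}

	/* Demo assumption: an all-idle domain counts as weight 1 per cpu. */
	if (sum == 0) {
		for (i = 0; i < NR_CPUS_DEMO; i++)
			snap[i] = 1;
		sum = NR_CPUS_DEMO;
	}

	for (i = 0; i < NR_CPUS_DEMO; i++)
		shares[i] = total_shares * snap[i] / sum;
}
```

Because the divisor is derived from the same `snap[]` values used as numerators, a concurrent change to `live_weight` cannot make the sum zero after it has been checked.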

21 Aug, 2009 1 commit

sched: Avoid division by zero · a8af7246

Authored by Peter Zijlstra on Aug 21, 2009

Patch a5004278 (sched: Fix cgroup smp fairness) introduced the
possibility of a divide-by-zero because load-balancing is not
synchronized between sched_domains.

This can cause the state of cpus to change between the first
and second loop over the sched domain in tg_shares_up().
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Jes Sorensen <jes@sgi.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <1250855934.7538.30.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

a8af7246

20 Aug, 2009 1 commit

sched: Use for_each_class macro in move_one_task() · cde7e5ca

Authored by Hiroshi Shimamoto on Aug 18, 2009

Replace the for loop with the for_each_class macro as a cleanup.
Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
LKML-Reference: <4A8A277D.4090304@ct.jp.nec.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

cde7e5ca

02 Aug, 2009 6 commits

sched: Ensure the migration task doesn't go away during use · 693525e3

Authored by Peter Zijlstra on Jul 21, 2009

Like sched_migrate_task(), set_cpus_allowed_ptr() should hold
onto the migration thread too.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

693525e3

sched: Fully integrate cpus_active_map and root-domain code · 00aec93d

Authored by Gregory Haskins on Jul 30, 2009

Reflect "active" cpus in the rq->rd->online field, instead of
the online_map.

The motivation is that things that use the root-domain code
(such as cpupri) only care about cpus classified as "active"
anyway. By synchronizing the root-domain state with the active
map, we allow several optimizations.

For instance, we can remove an extra cpumask_and from the
scheduler hotpath by utilizing rq->rd->online (since it is now
a cached version of cpu_active_map & rq->rd->span).
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Max Krasnyansky <maxk@qualcomm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090730145723.25226.24493.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

00aec93d

sched: Enhance the pre/post scheduling logic · 3f029d3c

Authored by Gregory Haskins on Jul 29, 2009

We currently have an explicit "needs_post" vtable method which
returns a stack variable for whether we should later run
post-schedule.  This leads to an awkward exchange of the
variable as it bubbles back up out of the context switch. Peter
Zijlstra observed that this information could be stored in the
run-queue itself instead of handled on the stack.

Therefore, we revert to the method of having context_switch
return void, and update an internal rq->post_schedule variable
when we require further processing.

In addition, we fix a race condition where we try to access
current->sched_class without holding the rq->lock.  This is
technically racy, as the sched-class could change out from
under us.  Instead, we reference the per-rq post_schedule
variable with the runqueue unlocked, but with preemption
disabled to see if we need to reacquire the rq->lock.

Finally, we clean the code up slightly by removing the #ifdef
CONFIG_SMP conditionals from the schedule() call, and implement
some inline helper functions instead.

This patch passes checkpatch, and rt-migrate.
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

3f029d3c

sched: Check for pushing rt tasks after all scheduling · da19ab51

Authored by Steven Rostedt on Jul 29, 2009

The current method for pushing RT tasks after scheduling only
happens after a context switch. But we found cases where a task
is set up on a run queue to be pushed but the push never
happens because the schedule chooses the same task.

This bug was found with the help of Gregory Haskins and the use
of ftrace (trace_printk). It took several days for both of us
analyzing the code and the trace output to find this.
Signed-off-by: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090729042526.205923666@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

da19ab51

sched: Optimize unused cgroup configuration · e7097159

Authored by Peter Zijlstra on Jun 03, 2009

When cgroup group scheduling is built in, skip some code paths
if we don't have any (but the root) cgroups configured.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e7097159

sched: Fix cgroup smp fairness · a5004278

Authored by Peter Zijlstra on Jul 27, 2009

Commit ec4e0e2f ("fix inconsistency when redistribute per-cpu
tg->cfs_rq shares") broke cgroup smp fairness.

In order to avoid starvation of newly placed tasks, we never
quite set the share of an empty cpu group-task to 0, but
instead we set it as if there's a single NICE-0 task present.

If however we actually set this in cfs_rq[cpu]->shares, that
means the total shares for that group will be slightly inflated
every time we balance, causing the observed unfairness.

Fix this by setting cfs_rq[cpu]->shares to 0 but actually
setting the effective weight of the related se to the inflated
number.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1248696557.6987.1615.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

a5004278

24 Jul, 2009 1 commit

sched: Fix return value of migration_init() · a004cd42

Authored by Thomas Gleixner on Jul 21, 2009

migration_init() returns the return value of the hotplug notifier. In
the success case this is NOTIFY_OK which is 1. initcall_debug
evaluates that as an error code because init calls are expected to
return 0 on success.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

a004cd42
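
The mismatch described in a004cd42 can be shown with a small userspace sketch. The `NOTIFY_*` values mirror the kernel's include/linux/notifier.h; the `demo_*` functions are hypothetical stand-ins, and returning -1 on failure is a placeholder for this demo (the real patch may handle the failure case differently).

```c
/* Values as defined in the kernel's include/linux/notifier.h. */
#define NOTIFY_DONE		0x0000
#define NOTIFY_OK		0x0001
#define NOTIFY_STOP_MASK	0x8000
#define NOTIFY_BAD		(NOTIFY_STOP_MASK | 0x0002)

/* Hypothetical stand-in for a hotplug notifier callback. */
static int demo_notifier(int fail)
{
	return fail ? NOTIFY_BAD : NOTIFY_OK;
}

/* Buggy shape: hands the notifier value straight to the initcall layer. */
static int demo_init_buggy(void)
{
	return demo_notifier(0);	/* NOTIFY_OK == 1, read as an error */
}

/* Fixed shape: translate notifier status into "0 means success". */
static int demo_init_fixed(int fail)
{
	int err = demo_notifier(fail);

	if (err == NOTIFY_BAD)
		return -1;	/* placeholder error code for the demo */
	return 0;
}
```

An initcall checker that treats any non-zero return as a failure would flag `demo_init_buggy()` even though the notifier succeeded, while `demo_init_fixed(0)` returns 0.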

18 Jul, 2009 6 commits

sched: Pull up the might_sleep() check into cond_resched() · 613afbf8

Authored by Frederic Weisbecker on Jul 16, 2009

might_sleep() is called late-ish in cond_resched(), after the
need_resched()/preempt enabled/system running tests are
checked.

It's better to run the sleeping-while-atomic check earlier, so
that it does not depend on environment state that reduces the
chances of detecting a problem.

Also define cond_resched_*() helpers as macros, so that the
FILE/LINE reported in the sleeping while atomic warning
displays the real origin and not sched.h.

Changes in v2:

 - Call __might_sleep() directly instead of might_sleep() which
   may call cond_resched()

 - Turn cond_resched() into a macro so that the file:line
   couple reported refers to the caller of cond_resched() and
   not __cond_resched() itself.

Changes in v3:

 - Also propagate this __might_sleep() pull up to
   cond_resched_lock() and cond_resched_softirq()
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-6-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

613afbf8
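
Why 613afbf8 turns the helpers into macros can be illustrated with a short sketch. All `demo_*` names are hypothetical; the real helpers also check the preempt count, which is left out here.

```c
/* Records where the last check was requested from (demo bookkeeping). */
static const char *last_file;
static int last_line;

/* Hypothetical checker: a real one would also test the atomic state. */
static void demo_might_sleep(const char *file, int line)
{
	last_file = file;
	last_line = line;
}

/* Stand-in for the function that actually reschedules. */
static int demo_resched_fn(void)
{
	return 0;	/* pretend no reschedule was needed */
}

/*
 * Because this is a macro, __FILE__ and __LINE__ expand at the call
 * site, so the report names the caller rather than this definition.
 */
#define demo_cond_resched() \
	(demo_might_sleep(__FILE__, __LINE__), demo_resched_fn())
```

Had `demo_cond_resched()` been an inline function instead, `__FILE__` and `__LINE__` would always point at the header that defines it.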

sched: Add a preempt count base offset to __might_sleep() · e4aafea2

Authored by Frederic Weisbecker on Jul 16, 2009

Add a preempt count base offset to compare against the current
preempt level count. It prepares to pull up the might_sleep
check from cond_resched() to cond_resched_lock() and
cond_resched_bh().

For these two helpers, we need to ensure that once we unlock
the given spinlock or re-enable local softirqs, respectively,
we reach a sleepable state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Move and rename preempt_count_equals() ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-4-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e4aafea2

sched: Cover the CONFIG_DEBUG_SPINLOCK_SLEEP off-case for __might_sleep() · e09758fa

Authored by Frederic Weisbecker on Jul 16, 2009

Cover the off case for __might_sleep(), so that we avoid
#ifdefs in files that make use of it. Especially, this prepares
for the __might_sleep() pull up on cond_resched().
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-3-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e09758fa

sched: Remove obsolete comment in __cond_resched() · 4b215567

Authored by Frederic Weisbecker on Jul 16, 2009

Remove the outdated comment from __cond_resched() related to
the now removed Big Kernel Semaphore.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-2-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

4b215567

sched: Drop the need_resched() loop from cond_resched() · e7aaaa69

Authored by Frederic Weisbecker on Jul 16, 2009

The schedule() function is a loop that reschedules the current
task while the TIF_NEED_RESCHED flag is set:

void schedule(void)
{
need_resched:
	/* schedule code */
	if (need_resched())
		goto need_resched;
}

And cond_resched() repeats this loop:

do {
	add_preempt_count(PREEMPT_ACTIVE);
	schedule();
	sub_preempt_count(PREEMPT_ACTIVE);
} while(need_resched());

This loop is needless because schedule() already did the check
and nothing can set TIF_NEED_RESCHED between schedule() exit
and the loop check in need_resched().

So remove this needless loop.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1247725694-6082-1-git-send-email-fweisbec@gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e7aaaa69

sched: fix load average accounting vs. cpu hotplug · a468d389

Authored by Thomas Gleixner on Jul 17, 2009

The new load average code clears rq->calc_load_active on
CPU_ONLINE. That's wrong, as the newly onlined CPU might already have
received a scheduler tick and accounted the delta against the stale
value left over from when the CPU was offlined.

Clear the value when we clean up the dead CPU instead.

Also move the update of the calc_load_update time for the newly onlined
CPU to CPU_UP_PREPARE, so that the CPU does not play catch-up with the
stale update time value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

a468d389

11 Jul, 2009 1 commit

sched: optimize cond_resched() · d86ee480

Authored by Peter Zijlstra on Jul 10, 2009

Optimize cond_resched() by removing one conditional.

Currently cond_resched() checks system_state ==
SYSTEM_RUNNING in order to avoid scheduling before the
scheduler is running.

We can, however, as suggested by Matt, use
PREEMPT_ACTIVE to accomplish the very same thing.
Suggested-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

d86ee480

10 Jul, 2009 3 commits

sched: Fix rt_rq->pushable_tasks initialization in init_rt_rq() · c20b08e3

Authored by Fabio Checconi on Jun 15, 2009

init_rt_rq() initializes only rq->rt.pushable_tasks, and not the
pushable_tasks field of the passed rt_rq.  The plist is not used
uninitialized since the only pushable_tasks plists used are the
ones of root rt_rqs; anyway reinitializing the list on every group
creation corrupts the root plist, losing its previous contents.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090615185638.GK21741@gandalf.sssup.it>
CC: Gregory Haskins <ghaskins@novell.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

c20b08e3

sched: Reset sched stats on fork() · 7793527b

Authored by Lucas De Marchi on Jul 09, 2009

The sched_stat fields are currently not reset upon fork.
Ingo's recent commit 6c594c21 did reset nr_migrations, but it
didn't reset any of the others.

This patch resets all sched_stat fields on fork.
Signed-off-by: Lucas De Marchi <lucas.de.marchi@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <193b0f820907090457s7a3662f4gcdecdc22fcae857b@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

7793527b

sched_rt: Fix overload bug on rt group scheduling · a1ba4d8b

Authored by Peter Zijlstra on Apr 01, 2009

Fixes an easily triggerable BUG() when setting process affinities.

Make sure to count the number of migratable tasks in the same place:
the root rt_rq. Otherwise the number doesn't make sense and we'll hit
the BUG in set_cpus_allowed_rt().

Also, make sure we only count tasks, not groups (this is probably
already taken care of by the fact that rt_se->nr_cpus_allowed will be 0
for groups, but be more explicit)
Tested-by: Thomas Gleixner <tglx@linutronix.de>
CC: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <1247067476.9777.57.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

a1ba4d8b

29 Jun, 2009 1 commit

sched: Hide runqueues from direct reference at source code level for __raw_get_cpu_var() · 54d35f29

Authored by Hitoshi Mitake on Jun 29, 2009

Hide __raw_get_cpu_var() as well - thus all the direct
references to runqueues will be abstracted out.
Signed-off-by: Hitoshi Mitake <mitake@dcl.info.waseda.ac.jp>
LKML-Reference: <20090629.144457.886429910353660979.mitake@dcl.info.waseda.ac.jp>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

54d35f29

19 Jun, 2009 2 commits

perf_counter: Simplify and fix task migration counting · e5289d4a

Authored by Peter Zijlstra on Jun 19, 2009

The task migrations counter was causing rare and hard to decipher
memory corruptions under load. After a day of debugging and bisection
we found that the problem was introduced with:

  3f731ca6: perf_counter: Fix cpu migration counter

Turning them off fixes the crashes. Incidentally, the whole
perf_counter_task_migration() logic can be done simpler as well,
by injecting a proper sw-counter event.

This cleanup also fixed the crashes. The precise failure mode is
not completely clear yet, but we are clearly not unhappy about
having a fix ;-)
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e5289d4a

kthreads: simplify migration_thread() exit path · 371cbb38

Authored by Oleg Nesterov on Jun 17, 2009

Now that kthread_stop() can be used even if the task has already exited,
we can kill the "wait_to_die:" loop in migration_thread().  But we must
pin rq->migration_thread after creation.

Actually, I don't think CPU_UP_CANCELED or CPU_DEAD should wait for
->migration_thread exit.  Perhaps we can simplify this code a bit more.
migration_call() can set ->should_stop and forget about this thread.  But
we need a new helper in kthread.c for that.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Vitaliy Gusev <vgusev@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

371cbb38

18 Jun, 2009 2 commits

sched: Add SCHED_RESET_ON_FORK functionality for nice < 0 tasks · 6c697bdf

Authored by Mike Galbraith on Jun 17, 2009

Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Lennart Poettering <mzxreary@0pointer.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1245228482.27326.1.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

6c697bdf

sched: Clean up SCHED_RESET_ON_FORK · b9dc29e7

Authored by Mike Galbraith on Jun 17, 2009

Make SCHED_RESET_ON_FORK sched_fork() bits a self-contained unlikely code path.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Acked-by: Lennart Poettering <mzxreary@0pointer.de>
Cc: Peter Zijlstra <peterz@infradead.org>
LKML-Reference: <1245228361.18329.6.camel@marge.simson.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

b9dc29e7

17 Jun, 2009 1 commit

sched: Remove unneeded __ref tag · fd5e1b5d

Authored by Li Zefan on Jun 15, 2009

Those two functions no longer call alloc_bootmem_cpumask_var(),
so there is no need to tag them with __init_refok.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
LKML-Reference: <4A35DD5B.9050106@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

fd5e1b5d

15 Jun, 2009 1 commit

sched: Introduce SCHED_RESET_ON_FORK scheduling policy flag · ca94c442

Authored by Lennart Poettering on Jun 15, 2009

This patch introduces a new flag SCHED_RESET_ON_FORK which can be passed
to the kernel via sched_setscheduler(), ORed in the policy parameter. If
set this will make sure that when the process forks a) the scheduling
priority is reset to DEFAULT_PRIO if it was higher and b) the scheduling
policy is reset to SCHED_NORMAL if it was either SCHED_FIFO or SCHED_RR.

Why have this?

Currently, if a process is real-time scheduled this will 'leak' to all
its child processes. For security reasons it is often (always?) a good
idea to make sure that if a process acquires RT scheduling this is
confined to this process and only this process. More specifically this
makes the per-process resource limit RLIMIT_RTTIME useful for security
purposes, because it makes it impossible to use a fork bomb to
circumvent the per-process RLIMIT_RTTIME accounting.

This feature is also useful for tools like 'renice' which can then
change the nice level of a process without having this spill to all its
child processes.

Why expose this via sched_setscheduler() and not other syscalls such as
prctl() or sched_setparam()?

prctl() does not take a pid parameter. Due to that it would be
impossible to modify this flag for other processes than the current one.

The struct passed to sched_setparam() can unfortunately not be extended
without breaking compatibility, since sched_setparam() lacks a size
parameter.

How to use this from userspace? In your RT program simply replace this:

  sched_setscheduler(pid, SCHED_FIFO, &param);

by this:

  sched_setscheduler(pid, SCHED_FIFO|SCHED_RESET_ON_FORK, &param);
Signed-off-by: Lennart Poettering <lennart@poettering.net>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090615152714.GA29092@tango.0pointer.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

ca94c442
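
The userspace usage described in ca94c442 can be wrapped in a small helper. This is a minimal sketch: `set_fifo_reset_on_fork()` is a hypothetical name, and the fallback definition of `SCHED_RESET_ON_FORK` (0x40000000, the kernel UAPI value) is only there for older headers that do not provide it.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <sys/types.h>

/* Fallback for older headers; 0x40000000 is the kernel UAPI value. */
#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000
#endif

/*
 * Request SCHED_FIFO for pid (0 means the calling process) without
 * letting children inherit it. Returns 0 on success, or -1 with errno
 * set (typically EPERM when the caller lacks the needed privilege).
 */
static int set_fifo_reset_on_fork(pid_t pid, int prio)
{
	struct sched_param param = { .sched_priority = prio };

	return sched_setscheduler(pid, SCHED_FIFO | SCHED_RESET_ON_FORK,
				  &param);
}
```

A caller would typically invoke `set_fifo_reset_on_fork(0, 1)` early in an RT program and fall back to normal scheduling if it returns -1.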

12 Jun, 2009 4 commits

sched: export kick_process · b43e3521

Authored by Rusty Russell on Jun 12, 2009

lguest needs kick_process: wake_up_process() does nothing if a process
is running, which isn't sufficient (we need it in the kernel).

And lguest support is usually modular.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Ingo Molnar <mingo@elte.hu>

b43e3521

sched: use slab in cpupri_init() · 0fb53029

Authored by Pekka Enberg on Jun 11, 2009

Let's not use the bootmem allocator in cpupri_init() as slab is already up when
it is run.

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

0fb53029

sched: use alloc_cpumask_var() instead of alloc_bootmem_cpumask_var() · 4bdddf8f

Authored by Pekka Enberg on Jun 11, 2009

Slab is initialized when sched_init() runs now so let's use alloc_cpumask_var().

Cc: Ingo Molnar <mingo@elte.hu>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

4bdddf8f

sched: use kzalloc() instead of the bootmem allocator · 36b7b6d4

Authored by Pekka Enberg on Jun 10, 2009

Now that kmem_cache_init() happens before sched_init(), we should use kzalloc()
and not the bootmem allocator.
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>

36b7b6d4

02 Jun, 2009 2 commits

perf_counter: Fix cpu migration counter · 3f731ca6

Authored by Paul Mackerras on Jun 01, 2009

This fixes the cpu migration software counter to count
correctly even when contexts get swapped from one task to
another.  Previously the cpu migration counts reported by perf
stat were bogus, ranging from negative to several thousand for
a single "lat_ctx 2 8 32" run.  With this patch the cpu
migration count reported for "lat_ctx 2 8 32" is almost always
between 35 and 44.

This fixes the problem by adding a call into the perf_counter
code from set_task_cpu when tasks are migrated.  This enables
us to use the generic swcounter code (with some modifications)
for the cpu migration counter.

This modifies the swcounter code to allow a NULL regs pointer
to be passed in to perf_swcounter_ctx_event() etc.  The cpu
migration counter does this because there isn't necessarily a
pt_regs struct for the task available.  In this case, the
counter will not have interrupt capability - but the migration
counter didn't have interrupt capability before, so this is no
loss.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <18979.35006.819769.416327@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

3f731ca6

perf_counter: Initialize per-cpu context earlier on cpu up · f38b0820

Authored by Paul Mackerras on Jun 02, 2009

This arranges for perf_counter's notifier for cpu hotplug
operations to be called earlier than the migration notifier in
sched.c by increasing its priority to 20, compared to the 10
for the migration notifier.  The reason for doing this is that
a subsequent commit to convert the cpu migration counter to use
the generic swcounter infrastructure will add a call into the
perf_counter subsystem when tasks get migrated.  Therefore the
perf_counter subsystem needs a chance to initialize its per-cpu
data for the new cpu before it can get called from the
migration code.

This also adds a comment to the migration notifier noting that
its priority needs to be lower than that of the perf_counter
notifier.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <18981.1900.792795.836858@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

f38b0820

24 May, 2009 1 commit

perf_counter: Fix dynamic irq_period logging · e220d2dc

Authored by Peter Zijlstra on May 23, 2009

We call perf_adjust_freq() from perf_counter_task_tick(), which
is called under the rq->lock, causing lock recursion.
However, it's no longer required to be called under the
rq->lock, so remove it from under it.

Also, fix up some related comments.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: John Kacur <jkacur@redhat.com>
LKML-Reference: <20090523163012.476197912@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

e220d2dc

22 May, 2009 1 commit

perf_counter: Optimize context switch between identical inherited contexts · 564c2b21

Authored by Paul Mackerras on May 22, 2009

When monitoring a process and its descendants with a set of inherited
counters, we can often get the situation in a context switch where
both the old (outgoing) and new (incoming) process have the same set
of counters, and their values are ultimately going to be added together.
In that situation it doesn't matter which set of counters are used to
count the activity for the new process, so there is really no need to
go through the process of reading the hardware counters and updating
the old task's counters and then setting up the PMU for the new task.

This optimizes the context switch in this situation.  Instead of
scheduling out the perf_counter_context for the old task and
scheduling in the new context, we simply transfer the old context
to the new task and keep using it without interruption.  The new
context gets transferred to the old task.  This means that both
tasks still have a valid perf_counter_context, so no special case
is introduced when the old task gets scheduled in again, either on
this CPU or another CPU.

The equivalence of contexts is detected by keeping a pointer in
each cloned context pointing to the context it was cloned from.
To cope with the situation where a context is changed by adding
or removing counters after it has been cloned, we also keep a
generation number on each context which is incremented every time
a context is changed.  When a context is cloned we take a copy
of the parent's generation number, and two cloned contexts are
equivalent only if they have the same parent and the same
generation number.  In order that the parent context pointer
remains valid (and is not reused), we increment the parent
context's reference count for each context cloned from it.

Since we don't have individual fds for the counters in a cloned
context, the only thing that can make two clones of a given parent
different after they have been cloned is enabling or disabling all
counters with prctl.  To account for this, we keep a count of the
number of enabled counters in each context.  Two contexts must have
the same number of enabled counters to be considered equivalent.
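
The equivalence rule just described (same parent, same generation at clone time, same number of enabled counters) can be sketched as follows. The struct and the `demo_*` names are hypothetical and heavily trimmed; in particular, the reference counting on the parent is left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down context; field names are illustrative. */
struct demo_ctx {
	struct demo_ctx *parent;	/* context this one was cloned from */
	unsigned long parent_gen;	/* parent's generation at clone time */
	unsigned long generation;	/* bumped on every add/remove */
	int nr_enabled;			/* currently enabled counters */
};

/* Clone: remember the parent and a copy of its generation number. */
static void demo_clone(struct demo_ctx *child, struct demo_ctx *parent)
{
	child->parent = parent;
	child->parent_gen = parent->generation;
	child->generation = 0;
	child->nr_enabled = parent->nr_enabled;
}

/* Two clones may be swapped only if all three criteria match. */
static bool demo_ctx_equiv(const struct demo_ctx *a, const struct demo_ctx *b)
{
	return a->parent != NULL &&
	       a->parent == b->parent &&
	       a->parent_gen == b->parent_gen &&
	       a->nr_enabled == b->nr_enabled;
}
```

A context that was never cloned has no parent and is therefore never considered equivalent to anything in this sketch.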

Here are some measurements of the context switch time as measured with
the lat_ctx benchmark from lmbench, comparing the times obtained with
and without this patch series:

                -----Unmodified-----    With this patch series
Counters:       none    2 HW    4H+4S   none    2 HW    4H+4S

2 processes:
Average         3.44    6.45    11.24   3.12    3.39    3.60
St dev          0.04    0.04    0.13    0.05    0.17    0.19

8 processes:
Average         6.45    8.79    14.00   5.57    6.23    7.57
St dev          1.27    1.04    0.88    1.42    1.46    1.42

32 processes:
Average         5.56    8.43    13.78   5.28    5.55    7.15
St dev          0.41    0.47    0.53    0.54    0.57    0.81

The numbers are the mean and standard deviation of 20 runs of
lat_ctx.  The "none" columns are lat_ctx run directly without any
counters.  The "2 HW" columns are with lat_ctx run under perfstat,
counting cycles and instructions.  The "4H+4S" columns are lat_ctx run
under perfstat with 4 hardware counters and 4 software counters
(cycles, instructions, cache references, cache misses, task
clock, context switch, cpu migrations, and page faults).

[ Impact: performance optimization of counter context-switches ]
Signed-off-by: Paul Mackerras <paulus@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
LKML-Reference: <18966.10666.517218.332164@cargo.ozlabs.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

564c2b21

19 May, 2009 1 commit

sched: properly define the sched_group::cpumask and sched_domain::span fields · 4200efd9

Authored by Ingo Molnar on May 19, 2009

Properly document the variable-size structure tricks we are doing
wrt. struct sched_group and sched_domain, and use the field[0] GCC
extension instead of defining a VLA.

Don't use unions for this, as pointed out by Linus.

[ Impact: cleanup, un-confuse Sparse and LLVM ]
Reported-by: Jeff Garzik <jeff@garzik.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
LKML-Reference: <alpine.LFD.2.01.0905180850110.3301@localhost.localdomain>
Signed-off-by: Ingo Molnar <mingo@elte.hu>

4200efd9
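
The field[0] idiom mentioned in 4200efd9 looks roughly like this in isolation. `struct demo_group` and `demo_group_alloc()` are hypothetical and are not the kernel's sched_group; the zero-length array is a GCC extension (also accepted by Clang), and standard C99 would spell the same thing as a flexible array member, `cpumask[]`.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical variable-size structure: the trailing array has no
 * storage of its own and is sized at allocation time.
 */
struct demo_group {
	unsigned int nr_words;
	unsigned long cpumask[0];	/* GCC zero-length array extension */
};

/* Allocate the fixed header plus nr_words trailing mask words. */
static struct demo_group *demo_group_alloc(unsigned int nr_words)
{
	size_t bytes = sizeof(struct demo_group) +
		       nr_words * sizeof(unsigned long);
	struct demo_group *g = malloc(bytes);

	if (!g)
		return NULL;
	memset(g, 0, bytes);
	g->nr_words = nr_words;
	return g;
}
```

The caller owns the returned memory and releases it with `free()`; indexing `cpumask` beyond `nr_words` is out of bounds.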

15 May, 2009 2 commits

sched, timers: cleanup avenrun users · 2d02494f

Authored by Thomas Gleixner on May 02, 2009

avenrun is a rough estimate so we don't have to worry about
consistency of the three avenrun values. Remove the xtime lock
dependency and provide a function to scale the values. Clean up the
users.

[ Impact: cleanup ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>

2d02494f

sched, timers: move calc_load() to scheduler · dce48a84

Authored by Thomas Gleixner on Apr 11, 2009

Dimitri Sivanich noticed that xtime_lock is held write locked across
calc_load() which iterates over all online CPUs. That can cause long
latencies for xtime_lock readers on large SMP systems.

The load average calculation is a rough estimate anyway so there is
no real need to protect the readers vs. the update. It's not a problem
when the avenrun array is updated while a reader copies the values.

Instead of iterating over all online CPUs let the scheduler_tick code
update the number of active tasks shortly before the avenrun update
happens. The avenrun update itself is handled by the CPU which calls
do_timer().

[ Impact: reduce xtime_lock write locked section ]
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>

dce48a84
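
The scheme outlined in dce48a84 (each cpu's tick publishes its change in active tasks, and one cpu later folds the total into the averages) can be modelled in userspace. The fixed-point constants match the kernel's load average code; the `demo_*` names are hypothetical, and the plain global counter stands in for what would need to be an atomic in a real kernel.

```c
/* Fixed-point constants as used by the kernel's load average code. */
#define FSHIFT	11			/* bits of fractional precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884			/* 1/exp(5sec/1min) in fixed point */

/* Hypothetical global task count, folded from per-cpu deltas. */
static long demo_load_tasks;

struct demo_rq {
	long nr_active;		/* tasks counted as active right now */
	long folded;		/* what this cpu last contributed */
};

/* Called from each cpu's tick: publish only the change since last time. */
static void demo_account_active(struct demo_rq *rq)
{
	long delta = rq->nr_active - rq->folded;

	if (delta) {
		demo_load_tasks += delta;	/* atomic add in a real kernel */
		rq->folded = rq->nr_active;
	}
}

/* One exponential-decay step: load = load * exp + active * (1 - exp). */
static unsigned long demo_calc_load(unsigned long load, unsigned long exp,
				    unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}
```

The cpu that handles the timer would then call something like `demo_calc_load(avg, EXP_1, demo_load_tasks * FIXED_1)` without having to walk every online cpu.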

13 May, 2009 2 commits

timers: Logic to move non pinned timers · eea08f32

Authored by Arun R Bharadwaj on Apr 16, 2009

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch migrates all non pinned timers and hrtimers to the current
idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
are not migrated.

While migrating hrtimers, care should be taken to check whether
migrating a hrtimer would add latency. So we compare the expiry of the
hrtimer with the next timer interrupt on the target cpu and migrate the
hrtimer only if it expires *after* the next interrupt on the target cpu.
To support this, add a clockevents_get_next_event() helper function
that returns the next_event on the target cpu's clock_event_device.

[ tglx: cleanups and simplifications ]
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

eea08f32
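
The migration rule from eea08f32 boils down to a small predicate. This is a sketch under assumptions: `demo_can_migrate()` and its parameters are hypothetical, and times are taken to be nanoseconds on a clock shared by both cpus.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether a timer may move to the target cpu (the idle load
 * balancer). Parameter names are illustrative, not kernel fields.
 */
static bool demo_can_migrate(bool pinned, bool this_cpu_idle,
			     int64_t timer_expires, int64_t target_next_event)
{
	/* Pinned timers and timers on busy cpus stay where they are. */
	if (pinned || !this_cpu_idle)
		return false;

	/* Move only if the target's next interrupt comes before expiry. */
	return timer_expires > target_next_event;
}
```

A timer that would expire before the target's next programmed interrupt is kept local in this sketch, since moving it could delay its expiry.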

timers: /proc/sys sysctl hook to enable timer migration · cd1bb94b

Authored by Arun R Bharadwaj on Apr 16, 2009

* Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:

This patch creates the /proc/sys sysctl interface at
/proc/sys/kernel/timer_migration

Timer migration is enabled by default.

To disable timer migration, when CONFIG_SCHED_DEBUG = y,

echo 0 > /proc/sys/kernel/timer_migration
Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

cd1bb94b

OpenHarmony / kernel_linux · last synced about 4 years ago