  1. 01 April 2009 (4 commits)
    • sched: Print sched_group::__cpu_power in sched_domain_debug · 46e0bb9c
      Committed by Gautham R Shenoy
      Impact: extend debug info /proc/sched_debug
      
      If the user changes the value of the sched_mc/smt_power_savings sysfs
      tunable, it'll trigger a rebuilding of the whole sched_domain tree,
      with the SD_POWERSAVINGS_BALANCE flag set at certain levels.
      
      As a result, there would be a change in the __cpu_power of sched_groups
      in the sched_domain hierarchy.
      
      Print the __cpu_power values for each sched_group in sched_domain_debug
      to help verify this change and correlate it with the change in the
      load-balancing behavior.
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090330045520.2869.24777.stgit@sofia.in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      46e0bb9c
    • cpuacct: add per-cgroup utime/stime statistics · ef12fefa
      Committed by Bharata B Rao
      Add per-cgroup cpuacct controller statistics like the system and user
      time consumed by the group of tasks.
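      
      A sketch of how such a cpuacct.stat file can be filled, assuming one
      percpu_counter per statistic as mentioned in the v5 note below
      (illustrative; helper names follow the cpuacct code of that era but are
      not guaranteed to match):
      
        static int cpuacct_stats_show(struct cgroup *cgrp, struct cftype *cft,
                                      struct cgroup_map_cb *cb)
        {
                struct cpuacct *ca = cgroup_ca(cgrp);
                s64 user = percpu_counter_read(&ca->cpustat[CPUACCT_STAT_USER]);
                s64 sys  = percpu_counter_read(&ca->cpustat[CPUACCT_STAT_SYSTEM]);
        
                /* exported in clock_t units, per the v2/v5 changelog items */
                cb->fill(cb, "user", cputime64_to_clock_t(user));
                cb->fill(cb, "system", cputime64_to_clock_t(sys));
                return 0;
        }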
      
      Changelog:
      
      v7
      - Changed the name of the statistic from utime to user and from stime to
        system, so that other statistics like irq, softirq and steal time can
        be added easily in the future.
      
      v6
      - Fixed a bug in the error path of cpuacct_create() (pointed out by Li Zefan).
      
      v5
      - In cpuacct_stats_show(), use cputime64_to_clock_t() since we are
        operating on a 64bit variable here.
      
      v4
      - Remove comments in cpuacct_update_stats() which explained why rcu_read_lock()
        was needed (as per Peter Zijlstra's review comments).
      - Don't say that percpu_counter_read() is broken in Documentation/cpuacct.txt
        as per KAMEZAWA Hiroyuki's review comments.
      
      v3
      - Fix a small race in the cpuacct hierarchy walk.
      
      v2
      - stime and utime now exported in clock_t units instead of msecs.
      - Addressed the code review comments from Balbir and Li Zefan.
      - Moved to -tip tree.
      
      v1
      - Moved the stime/utime accounting to cpuacct controller.
      
      Earlier versions
      - http://lkml.org/lkml/2009/2/25/129
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: Balaji Rao <balajirrao@gmail.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      LKML-Reference: <20090331043222.GA4093@in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ef12fefa
    • posixtimers, sched: Fix posix clock monotonicity · c5f8d995
      Committed by Hidetoshi Seto
      Impact: regression fix (clock_gettime() could go backwards)
      
      This patch re-introduces two functions, task_sched_runtime and
      thread_group_sched_runtime, which were removed around 2.6.28-rc1.
      
      These functions protect the sampling of the thread/process clock with
      the rq lock, which is required so that rq->clock is not updated while
      the sample is taken.
      
      Without it, clock_gettime() may return
       ((accounted runtime before update) + (delta after update)),
      which is less than what it should be.
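      
      A sketch of the re-introduced helper, using the do_task_delta_exec() name
      from the v3 note below (close to, but not necessarily identical to, the
      actual patch):
      
        unsigned long long task_sched_runtime(struct task_struct *p)
        {
                unsigned long flags;
                struct rq *rq;
                u64 ns;
        
                rq = task_rq_lock(p, &flags);   /* rq->clock cannot move under us */
                ns = p->se.sum_exec_runtime + do_task_delta_exec(p, rq);
                task_rq_unlock(rq, &flags);
        
                return ns;
        }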
      
      v2 -> v3:
      	- Rename the static helper function __task_delta_exec()
      	  to do_task_delta_exec(), since the -tip tree already has
      	  a __task_delta_exec() of a different version.
      
      v1 -> v2:
      	- Revise function comments and the patch description.
      	- Add a note about the accuracy of the thread group's runtime.
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org	[2.6.28.x][2.6.29.x]
      LKML-Reference: <49D1CC93.4080401@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c5f8d995
    • cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2 · a18b83b7
      Committed by Bharata B Rao
      Impact: fix cgroups race under rcu-preempt
      
      cpuacct_charge() obtains the task's ca and walks the hierarchy upwards.
      This can race with the task's movement between cgroups, and the race can
      lead to an access to a freed ca pointer in cpuacct_charge() or to the
      task's invalid cgroups pointer. This cannot happen with classic or tree
      RCU, since cpuacct_charge() is called with preemption disabled; with
      rcupreempt, however, the race is seen. Thanks to Li Zefan for explaining this.
      
      Fix this race by explicitly protecting ca and the hierarchy walk with
      rcu_read_lock().
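      
      A sketch of the protected walk (it follows the shape of cpuacct_charge()
      but is illustrative, not quoted from the patch):
      
        static void cpuacct_charge(struct task_struct *tsk, u64 cputime)
        {
                struct cpuacct *ca;
                int cpu = task_cpu(tsk);
        
                rcu_read_lock();        /* pins tsk->cgroups and every ca we visit */
                for (ca = task_ca(tsk); ca; ca = ca->parent) {
                        u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
                        *cpuusage += cputime;
                }
                rcu_read_unlock();
        }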
      
      Changes for v2:
      
       - Update patch description (as per Li Zefan's review comments).
      
       - Remove comments in cpuacct_charge() which explained why rcu_read_lock()
         was needed (as per Peter Zijlstra's review comments).
      Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Tested-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a18b83b7
  2. 06 February 2009 (1 commit)
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Committed by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
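      
      A sketch of the helper's logic using the usual waitqueue primitives
      (illustrative, not a verbatim copy of the patch):
      
        void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
                                  unsigned int mode, void *key)
        {
                unsigned long flags;
        
                __set_current_state(TASK_RUNNING);
                spin_lock_irqsave(&q->lock, flags);
                if (!list_empty(&wait->task_list))
                        list_del_init(&wait->task_list);      /* still queued: just leave */
                else if (waitqueue_active(q))
                        __wake_up_common(q, mode, 1, 0, key); /* already woken: pass it on */
                spin_unlock_irqrestore(&q->lock, flags);
        }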
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they are interrupted by means other than a wake-up
      through the queue.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
  3. 01 February 2009 (2 commits)
  4. 15 January 2009 (4 commits)
    • sched: SCHED_IDLE weight change · cce7ade8
      Committed by Peter Zijlstra
      Increase the SCHED_IDLE weight from 2 to 3; this gives much more stable
      vruntime numbers.
      
      time advanced in 100ms:
      
       weight=2
      
       64765.988352
       67012.881408
       88501.412352
      
       weight=3
      
       35496.181411
       34130.971298
       35497.411573
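      
      The change itself is a constant tweak in kernel/sched.c, roughly as below
      (the WMULT companion value, ~2^32/weight, is stated here as an assumption):
      
        #define WEIGHT_IDLEPRIO        3            /* was 2 */
        #define WMULT_IDLEPRIO         1431655765   /* ~2^32 / 3 */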
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cce7ade8
    • sched: fix bandwidth validation for UID grouping · 98a4826b
      Committed by Peter Zijlstra
      Impact: make rt-limit tunables work again
      
      Mark Glines reported:
      
      > I've got an issue on x86-64 where I can't configure the system to allow
      > RT tasks for a non-root user.
      >
      > In 2.6.26.5, I was able to do the following to set things up nicely:
      > echo 450000 >/sys/kernel/uids/0/cpu_rt_runtime
      > echo 450000 >/sys/kernel/uids/1000/cpu_rt_runtime
      >
      > Seems like every value I try to echo into the /sys files returns EINVAL.
      
      For UID grouping we initialize the root group with infinite bandwidth,
      which by default is actually more than the global limit; therefore the
      bandwidth check always fails.
      
      Because the root group is a phantom group (for UID grouping) we cannot
      adjust it at runtime, so we let it reflect the global bandwidth
      settings.
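      
      The failing check can be pictured as a ratio comparison; a standalone
      sketch of the idea (not the kernel code, helper name made up):
      
        /* a group's rt runtime/period ratio must fit within its parent's */
        static int rt_ratio_fits(u64 runtime, u64 period,
                                 u64 parent_runtime, u64 parent_period)
        {
                /* cross-multiply to avoid division; a root group initialized
                 * to "infinite" runtime can never fit a finite global limit,
                 * which is the failure described above */
                return runtime * parent_period <= parent_runtime * period;
        }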
      Reported-by: Mark Glines <mark@glines.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      98a4826b
    • mutex: implement adaptive spinning · 0d66bf6d
      Committed by Peter Zijlstra
      Change mutex contention behaviour such that it will sometimes busy wait on
      acquisition - moving its behaviour closer to that of spinlocks.
      
      This concept got ported to mainline from the -rt tree, where it was originally
      implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.
      
      Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
      gave a 345% boost for VFS scalability on my testbox:
      
       # ./test-mutex-shm V 16 10 | grep "^avg ops"
       avg ops/sec:               296604
      
       # ./test-mutex-shm V 16 10 | grep "^avg ops"
       avg ops/sec:               85870
      
      The key criterion for the busy wait is that the lock owner has to be running on
      a (different) cpu. The idea is that as long as the owner is running, there is a
      fair chance it'll release the lock soon, and thus we'll be better off spinning
      instead of blocking/scheduling.
      
      Since regular mutexes (as opposed to rtmutexes) do not atomically track the
      owner, we add the owner in a non-atomic fashion and deal with the races in
      the slowpath.
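      
      The busy-wait criterion can be sketched as follows (pseudo-C; the helper
      names are illustrative, this is not the actual mutex slowpath):
      
        for (;;) {
                struct thread_info *owner = ACCESS_ONCE(lock->owner);
        
                if (owner && !owner_running_on_other_cpu(owner))
                        break;                  /* owner blocked/preempted: sleep */
                if (try_to_take_mutex(lock))
                        return 0;               /* got the lock while spinning */
                if (need_resched())
                        break;                  /* we should give up the cpu */
                cpu_relax();                    /* be nice to SMT siblings */
        }
        /* fall through to the normal sleeping slowpath */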
      
      Furthermore, to ease testing of the performance impact of this new code,
      there is a means to disable this behaviour at runtime (without having to
      reboot the system) when scheduler debugging is enabled (CONFIG_SCHED_DEBUG=y),
      by issuing the following command:
      
       # echo NO_OWNER_SPIN > /debug/sched_features
      
      This command re-enables spinning again (this is also the default):
      
       # echo OWNER_SPIN > /debug/sched_features
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0d66bf6d
    • mutex: preemption fixes · 41719b03
      Committed by Peter Zijlstra
      The problem is that dropping the spinlock right before schedule is a voluntary
      preemption point and can cause a schedule, right after which we schedule again.
      
      Fix this inefficiency by keeping preemption disabled until we schedule: explicitly
      disable preemption and provide a schedule() variant that assumes preemption is
      already disabled.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      41719b03
  5. 14 January 2009 (3 commits)
  6. 12 January 2009 (1 commit)
  7. 11 January 2009 (2 commits)
  8. 07 January 2009 (1 commit)
    • sched: fix possible recursive rq->lock · da8d5089
      Committed by Peter Zijlstra
      Vaidyanathan Srinivasan reported:
      
       > =============================================
       > [ INFO: possible recursive locking detected ]
       > 2.6.28-autotest-tip-sv #1
       > ---------------------------------------------
       > klogd/5062 is trying to acquire lock:
       >  (&rq->lock){++..}, at: [<ffffffff8022aca2>] task_rq_lock+0x45/0x7e
       >
       > but task is already holding lock:
       >  (&rq->lock){++..}, at: [<ffffffff805f7354>] schedule+0x158/0xa31
      
      This happens with sched_mc set to 2 (it is off by default).
      
      Strictly speaking we'll not deadlock, because ttwu will not be able to
      place the migration task on our rq, but since the code can deal with
      both rqs getting unlocked, this seems the easiest way out.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      da8d5089
  9. 06 January 2009 (2 commits)
  10. 05 January 2009 (2 commits)
  11. 04 January 2009 (1 commit)
    • sched: put back some stack hog changes that were undone in kernel/sched.c · 6ca09dfc
      Committed by Mike Travis
      Impact: prevents panic from stack overflow on numa-capable machines.
      
      Some of the "removal of stack hogs" changes in kernel/sched.c that used
      node_to_cpumask_ptr were undone by the early cpumask API updates, causing
      a panic due to stack overflow.  This patch redoes those changes using
      cpumask_of_node(), which returns a 'const struct cpumask *'.
      
      In addition, cpu_coregroup_map is replaced with cpu_coregroup_mask, further
      reducing stack usage.  (Both of these updates removed 9 FIXMEs!)
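      
      The pattern change is essentially the following (illustrative before/after,
      not a quoted hunk):
      
        /* before: copies a whole cpumask_t onto the stack (large with big NR_CPUS) */
        cpumask_t mask = node_to_cpumask(cpu_to_node(cpu));
        
        /* after: borrows a read-only pointer, no on-stack copy */
        const struct cpumask *mask = cpumask_of_node(cpu_to_node(cpu));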
      
      Also:
         Pick up some remaining changes from the old 'cpumask_t' functions to
         the new 'struct cpumask *' functions.
      
         Optimize memory traffic by allocating each percpu local_cpu_mask on the
         same node as the referring cpu.
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6ca09dfc
  12. 31 December 2008 (2 commits)
    • [PATCH] idle cputime accounting · 79741dd3
      Committed by Martin Schwidefsky
      The cpu time spent by the idle process actually doing something is
      currently accounted as idle time. This is plain wrong; the architectures
      that support VIRT_CPU_ACCOUNTING=y can do better and distinguish between the
      time spent doing nothing and the time spent by idle doing work. The first
      is accounted with account_idle_time and the second with account_system_time.
      The architectures that use the account_xxx_time interface directly, and not
      the account_xxx_ticks interface, now need to check for the idle
      process in their arch code. In particular, to improve the system vs. true
      idle time accounting, the arch code needs to measure the true idle time
      instead of just testing for the idle process.
      To improve the tick-based accounting as well, we would need an architecture
      primitive that can tell us whether the pt_regs of the interrupted context
      point to the magic instruction that halts the cpu.
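      
      For the tick-based path the decision described above boils down to
      something like this (simplified; argument lists reflect the post-patch
      interfaces as best understood):
      
        cputime_t one_jiffy = jiffies_to_cputime(1);
        cputime_t one_jiffy_scaled = cputime_to_scaled(one_jiffy);
        
        if (user_tick)
                account_user_time(p, one_jiffy, one_jiffy_scaled);
        else if (p != rq->idle)
                account_system_time(p, HARDIRQ_OFFSET, one_jiffy, one_jiffy_scaled);
        else
                account_idle_time(one_jiffy);   /* idle doing nothing */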
      
      In addition, idle time is no longer added to the stime of the idle process.
      This field now contains the system time of the idle process, as it should.
      On systems without VIRT_CPU_ACCOUNTING this will always be zero, as
      every tick that occurs while idle is running will be accounted as idle
      time.
      
      This patch contains the necessary common code changes to be able to
      distinguish idle system time and true idle time. The architectures with
      support for VIRT_CPU_ACCOUNTING need some changes to exploit this.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      79741dd3
    • [PATCH] fix scaled & unscaled cputime accounting · 457533a7
      Committed by Martin Schwidefsky
      The utimescaled / stimescaled fields in the task structure and the
      global cpustat should be set on all architectures. On s390 the calls
      to account_user_time_scaled and account_system_time_scaled have never
      been added. In addition, system time that is accounted as guest time
      to the user time of a process is accounted to the scaled system time
      instead of the scaled user time.
      To fix the bugs and to prevent future forgetfulness this patch merges
      account_system_time_scaled into account_system_time and
      account_user_time_scaled into account_user_time.
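      
      The merged interfaces then look roughly like this (assumed prototypes,
      following the description above):
      
        void account_user_time(struct task_struct *p, cputime_t cputime,
                               cputime_t cputime_scaled);
        void account_system_time(struct task_struct *p, int hardirq_offset,
                                 cputime_t cputime, cputime_t cputime_scaled);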
      
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Michael Neuling <mikey@neuling.org>
      Acked-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      457533a7
  13. 26 December 2008 (1 commit)
  14. 25 December 2008 (1 commit)
  15. 24 December 2008 (1 commit)
  16. 19 December 2008 (6 commits)
    • sched: fix warning in kernel/sched.c · 9924da43
      Committed by Ingo Molnar
      Impact: fix cpumask conversion bug
      
      this warning:
      
        kernel/sched.c: In function ‘find_busiest_group’:
        kernel/sched.c:3429: warning: passing argument 1 of ‘__first_cpu’ from incompatible pointer type
      
      shows that we forgot to convert a new patch to the new cpumask APIs.
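      
      The conversion is of this general shape (illustrative, not the exact line
      from find_busiest_group()):
      
        /* old-style: hands a cpumask value to an API that now wants a pointer */
        first = first_cpu(group->cpumask);
        
        /* new-style: operate on a struct cpumask pointer */
        first = cpumask_first(sched_group_cpus(group));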
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9924da43
    • sched: activate active load balancing in new idle cpus · ad273b32
      Committed by Vaidyanathan Srinivasan
      Impact: tweak task balancing to save power more aggressively
      
      Active load balancing is a process by which the migration thread
      is woken up on the target CPU in order to pull a currently
      running task from another package into this newly idle
      package.
      
      This method is already in use with normal load_balance(); this patch
      extends it to newly idle cpus when sched_mc is set to
      POWERSAVINGS_BALANCE_WAKEUP.
      
      This logic provides effective consolidation of short-running
      daemon jobs in an almost idle system.
      
      A side effect of this patch may be ping-ponging of tasks
      if the system is moderately utilised; the iterations before
      triggering may need to be adjusted.
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ad273b32
    • sched: nominate preferred wakeup cpu · 7a09b1a2
      Committed by Vaidyanathan Srinivasan
      Impact: extend load-balancing code (no change in behavior yet)
      
      When system utilisation is low and more cpus are idle,
      a process waking up from sleep should prefer to
      wake up an idle cpu in a semi-idle cpu package (multi-core
      package) rather than one in a completely idle cpu package,
      which would waste power.
      
      Use the sched_mc balance logic in find_busiest_group() to
      nominate a preferred wakeup cpu.
      
      This info could be stored in the appropriate sched_domain, but
      updating it in all copies of the sched_domain is not
      practical.  Hence it is stored in the root_domain
      struct, of which there is one copy per partitioned sched domain,
      and which can be accessed from each cpu's runqueue.
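      
      A sketch of the bookkeeping (field and accessor names are illustrative,
      not necessarily those used by the patch):
      
        struct root_domain {
                /* existing members elided */
                unsigned int preferred_wakeup_cpu;  /* nominated in find_busiest_group() */
        };
        
        static inline unsigned int get_preferred_wakeup_cpu(struct rq *rq)
        {
                return rq->rd->preferred_wakeup_cpu;  /* one copy per partitioned domain */
        }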
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7a09b1a2
    • sched: favour lower logical cpu number for sched_mc balance · d5679bd1
      Committed by Vaidyanathan Srinivasan
      Impact: change load-balancing direction to match that of irqbalanced
      
      In case two groups have identical load, prefer to move load to the lower
      logical cpu number rather than, as the present logic does, to the higher
      logical number.
      
      find_busiest_group() looks for a group_leader that has spare capacity
      to take more tasks and free up an appropriate least-loaded group.  When
      there is a tie and the load is equal, the group with the higher logical number
      is favoured.  This conflicts with the user-space irqbalance daemon, which moves
      interrupts to the lower logical number if system utilisation is very low.
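      
      The tie-break amounts to comparing the groups' first cpu ids, roughly
      (illustrative fragment, not the exact code):
      
        if (sum_nr_running == min_nr_running &&
            cpumask_first(sched_group_cpus(group)) <
            cpumask_first(sched_group_cpus(group_min)))
                group_min = group;      /* equal load: keep the lower cpu numbers */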
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d5679bd1
    • sched: framework for sched_mc/smt_power_savings=N · afb8a9b7
      Committed by Gautham R Shenoy
      Impact: extend range of /sys/devices/system/cpu/sched_mc_power_savings
      
      Currently the sched_mc/smt_power_savings variable is a boolean,
      which either enables or disables topology-based power savings.
      This patch extends the variable from boolean to
      multivalued, so that its value decides how
      aggressively we want to perform power-savings balancing at the
      appropriate sched domain based on topology.
      
      A multi-level power-savings tunable lets the end user
      match the required power savings vs. performance
      trade-off to the system configuration and workloads.
      
      This version makes the sched_mc_power_savings global variable
      take more values (0, 1, 2).  Later versions can have a single
      tunable called sched_power_savings instead of
      sched_{mc,smt}_power_savings.
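      
      The levels map onto an enum along these lines (names follow the
      POWERSAVINGS_BALANCE_* convention referenced elsewhere in this log; treat
      the exact spelling as an assumption):
      
        enum powersavings_balance_level {
                POWERSAVINGS_BALANCE_NONE = 0,  /* 0: no power-aware balancing */
                POWERSAVINGS_BALANCE_BASIC,     /* 1: consolidate while balancing */
                POWERSAVINGS_BALANCE_WAKEUP,    /* 2: also bias newly-idle/wakeup paths */
                MAX_POWERSAVINGS_BALANCE_LEVELS,
        };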
      Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
      Signed-off-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
      Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      afb8a9b7
    • tracing: fix warnings in kernel/trace/trace_sched_switch.c · c71dd42d
      Committed by Ingo Molnar
      these warnings:
      
        kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_register’:
        kernel/trace/trace_sched_switch.c:96: warning: passing argument 1 of ‘register_trace_sched_wakeup_new’ from incompatible pointer type
        kernel/trace/trace_sched_switch.c:112: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type
        kernel/trace/trace_sched_switch.c: In function ‘tracing_sched_unregister’:
        kernel/trace/trace_sched_switch.c:121: warning: passing argument 1 of ‘unregister_trace_sched_wakeup_new’ from incompatible pointer type
      
      They trigger because the sched_wakeup_new tracepoint needs the same trace
      signature as sched_wakeup - which was changed recently.
      
      Fix it.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c71dd42d
  17. 18 December 2008 (1 commit)
  18. 16 December 2008 (3 commits)
  19. 13 December 2008 (1 commit)
    • cpumask: change cpumask_scnprintf, cpumask_parse_user, cpulist_parse, and cpulist_scnprintf to take pointers · 29c0177e
      Committed by Rusty Russell
      
      Impact: change calling convention of existing cpumask APIs
      
      Most cpumask functions started with cpus_: these have been replaced by
      cpumask_ ones which take struct cpumask pointers as expected.
      
      These four functions don't have good replacement names; fortunately
      they're rarely used, so we just change them over.
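      
      For callers the change is simply passing a pointer (illustrative):
      
        /* old calling convention */
        len = cpumask_scnprintf(buf, PAGE_SIZE, mask);
        
        /* new calling convention: a struct cpumask pointer */
        len = cpumask_scnprintf(buf, PAGE_SIZE, &mask);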
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Mike Travis <travis@sgi.com>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Cc: paulus@samba.org
      Cc: mingo@redhat.com
      Cc: tony.luck@intel.com
      Cc: ralf@linux-mips.org
      Cc: Greg Kroah-Hartman <gregkh@suse.de>
      Cc: cl@linux-foundation.org
      Cc: srostedt@redhat.com
      29c0177e
  20. 12 December 2008 (1 commit)
    • sched: add missing arch_update_cpu_topology() call · d65bd5ec
      Committed by Heiko Carstens
      arch_reinit_sched_domains() used to call arch_update_cpu_topology()
      via arch_init_sched_domains(). This call got lost with commit
      e761b772 ("cpu hotplug, sched: Introduce
      cpu_active_map and redo sched domain managment (take 2)").
      
      So we might end up with outdated and missing cpus in the cpu core
      maps (architectures used to call arch_reinit_sched_domains if the cpu
      topology changed).
      
      This adds a call to arch_update_cpu_topology in partition_sched_domains,
      which gets called whenever the scheduling domains are updated - which is
      what is supposed to happen when the cpu topology changes.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d65bd5ec