1. 11 April 2008, 1 commit
    • asmlinkage_protect replaces prevent_tail_call · 54a01510
      Committed by Roland McGrath
      The prevent_tail_call() macro works around the problem of the compiler
      clobbering argument words on the stack, which for asmlinkage functions
      is the caller's (user's) struct pt_regs.  The tail/sibling-call
      optimization is not the only way that the compiler can decide to use
      stack argument words as scratch space, which we have to prevent.
      Other optimizations can do it too.
      
      Until we have new compiler support to make "asmlinkage" binding on the
      compiler's own use of the stack argument frame, we have to work around
      all the manifestations of this issue that crop up.
      
      More cases seem to be prevented by also keeping the incoming argument
      variables live at the end of the function.  This makes their original
      stack slots attractive places to leave those variables, so the compiler
      tends not to clobber them for something else.  It's still no guarantee,
      but it handles some observed cases that prevent_tail_call() did not
      (a sketch of the idea follows this entry).
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
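      To illustrate the technique: an empty inline asm that names the return
      value and the incoming arguments as operands forces GCC to keep those
      argument words live, so it will not reuse their stack slots.  This is a
      minimal sketch of the idea, not the exact macro from the commit; the
      two-argument variant and sys_example() are illustrative.

          /*
           * Sketch: keep the incoming argument words live at the end of an
           * asmlinkage function, so GCC cannot reuse their stack slots (the
           * caller's struct pt_regs) as scratch space.
           */
          #define asmlinkage_protect2(ret, arg1, arg2)                    \
                  __asm__ __volatile__ (""                                \
                          : "=r" (ret)                                    \
                          : "0" (ret), "g" (arg1), "g" (arg2))

          asmlinkage long sys_example(long a, long b)
          {
                  long ret = a + b;

                  asmlinkage_protect2(ret, a, b); /* a and b stay live here */
                  return ret;
          }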
  2. 05 April 2008, 1 commit
    • cgroups: add cgroup support for enabling controllers at boot time · 8bab8dde
      Committed by Paul Menage
      The effects of cgroup_disable=foo are:
      
      - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
      - foo isn't visible as an individually mountable subsystem
      
      As a result there will only ever be one call to foo->create(), at init time;
      all processes will stay in this group, and the group will never be mounted on
      a visible hierarchy.  Any additional effects (e.g.  not allocating metadata)
      are up to the foo subsystem.
      
      This doesn't handle early_init subsystems (their "disabled" bit isn't
      set), but it could easily be extended to do so if any of the early_init
      subsystems wanted it - I think it would just involve some nastier
      parameter processing, since it would occur before the command-line
      argument parser has been run.  (A sketch of the regular, non-early
      option parsing follows this entry.)
      
      Hugh said:
      
        Ballpark figures, I'm trying to get this question out rather than
        processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
        to the affected paths, booting with cgroup_disable=memory cuts that back to
        1% overhead (due to slightly bigger struct page).
      
        I'm no expert on distros, they may have no interest whatever in
        CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
        without it, or apply the cgroup_disable=memory patches.
      
      UnixBench's execl test results on x86_64 were:

      == just after boot, without mounting any cgroup fs ==
      mem_cgroup=off : Execl Throughput       43.0     3150.1      732.6
      mem_cgroup=on  : Execl Throughput       43.0     2932.6      682.0
      ==
      
      [lizf@cn.fujitsu.com: fix boot option parsing]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
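      A minimal sketch of how such a boot option can be parsed, following
      the shape of the kernel's __setup() mechanism; the subsys[] iteration
      and exact lookup details are simplifying assumptions:

          /*
           * Sketch: parse "cgroup_disable=mem,cpu,..." from the kernel
           * command line and set each named subsystem's disabled bit.
           */
          static int __init cgroup_disable(char *str)
          {
                  struct cgroup_subsys *ss;
                  char *token;
                  int i;

                  while ((token = strsep(&str, ",")) != NULL) {
                          if (!*token)
                                  continue;
                          for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                                  ss = subsys[i];
                                  if (!strcmp(token, ss->name)) {
                                          ss->disabled = 1;
                                          printk(KERN_INFO
                                                 "Disabling %s control group"
                                                 " subsystem\n", ss->name);
                                          break;
                                  }
                          }
                  }
                  return 1;
          }
          __setup("cgroup_disable=", cgroup_disable);

      Booting with cgroup_disable=memory then leaves the memory controller
      compiled in but inert, which is what produces the small residual
      overhead Hugh measured.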
  3. 03 April 2008, 1 commit
    • markers: use synchronize_sched() · 6496968e
      Committed by Mathieu Desnoyers
      Markers do not mix well with CONFIG_PREEMPT_RCU, because the marker
      code uses preempt_disable/enable() rather than rcu_read_lock/unlock()
      for minimal intrusiveness.  We would need call_sched and sched_barrier
      primitives.
      
      Currently, modifying (connecting and disconnecting) probes from
      markers requires RCU-style changes to the data structure: a new data
      structure is created, the pointer is switched atomically, a quiescent
      state is reached, and then the old data structure is freed.
      
      The quiescent state is reached once all the currently running
      preempt_disable regions are done running.  We use the call_rcu
      mechanism to execute kfree() after such a quiescent state has been
      reached.  However, the new CONFIG_PREEMPT_RCU version of call_rcu and
      rcu_barrier does not guarantee that all preempt_disable code regions
      have finished, hence the race.
      
      The "proper" way to do this is to use rcu_read_lock/unlock, but we don't
      want to use it to minimize intrusiveness on the traced system.  (we do
      not want the marker code to call into much of the OS code, because it
      would quickly restrict what can and cannot be instrumented, such as the
      scheduler).
      
      The temporary fix, until we get call_rcu_sched and rcu_barrier_sched
      in mainline, is to call synchronize_sched() before each call_rcu(), so
      we wait for the quiescent state in the system-call code path.  It will
      slow down batch marker enable/disable, but will make sure the race is
      gone (see the sketch after this entry).
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
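      A sketch of the update path described above; marker_entry,
      probe_closure and free_old_closure are hypothetical names, the point
      being only the synchronize_sched() inserted before call_rcu():

          /*
           * Sketch: publish the new probe data, then wait until every
           * currently running preempt_disable() region has finished
           * before queueing the old data for freeing.
           */
          static void marker_update_probe(struct marker_entry *entry,
                                          struct probe_closure *new)
          {
                  struct probe_closure *old = entry->closure;

                  rcu_assign_pointer(entry->closure, new);
                  /*
                   * Under CONFIG_PREEMPT_RCU, call_rcu() alone does not
                   * wait for preempt_disable() regions; synchronize_sched()
                   * does, which closes the race.
                   */
                  synchronize_sched();
                  call_rcu(&old->rcu, free_old_closure);
          }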
  4. 31 March 2008, 2 commits
  5. 29 March 2008, 2 commits
  6. 27 March 2008, 1 commit
  7. 26 March 2008, 4 commits
  8. 25 March 2008, 6 commits
  9. 21 March 2008, 5 commits
  10. 20 March 2008, 1 commit
  11. 19 March 2008, 6 commits
    • sched: retune wake granularity · 74e3cd7f
      Committed by Ingo Molnar
      Reduce the wake-up granularity for better interactivity.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: wakeup-buddy tasks are cache-hot · f540a608
      Committed by Ingo Molnar
      Wakeup-buddy tasks are cache-hot - this makes it a bit harder for the
      load-balancer to tear them apart (though still possible if the load is
      sufficiently asymmetric).  A sketch of the idea follows this entry.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
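      The change amounts to one extra test in the scheduler's cache-hotness
      check; a sketch assuming a task_hot()-style helper and the cfs_rq
      "next" buddy pointer (details simplified from the real function):

          /*
           * Sketch: a task that is its runqueue's wakeup buddy ("next")
           * is reported as cache-hot, making the load-balancer reluctant
           * to migrate it away from its partner.
           */
          static int task_hot(struct task_struct *p, u64 now)
          {
                  s64 delta;

                  /* Buddy candidates are cache-hot: */
                  if (&p->se == cfs_rq_of(&p->se)->next)
                          return 1;

                  delta = now - p->se.exec_start;
                  return delta < (s64)sysctl_sched_migration_cost;
          }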
    • sched: improve affine wakeups · 4ae7d5ce
      Committed by Ingo Molnar
      Improve affine wakeups.  Maintain the 'overlap' metric based on CFS's
      sum_exec_runtime - that is, the amount of time a task executes after
      it wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      there is strong workload coupling between this task and the woken-up
      task.  If the 'overlap' is large, the workloads are decoupled and the
      scheduler will move them to separate CPUs more easily.  (A sketch of
      the metric follows this entry.)

      (Also slightly move the preempt check within try_to_wake_up() - this
      has no effect on functionality, but allows 'early wakeups' (for
      still-on-rq tasks) to be correctly accounted as well.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
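      A sketch of how such an 'overlap' metric can be kept and consulted;
      the helper names and the threshold are assumptions:

          /*
           * Sketch: remember where in our runtime we last woke somebody,
           * and measure how long we kept executing afterwards.
           */
          static inline void note_wakeup(struct sched_entity *se)
          {
                  se->last_wakeup = se->sum_exec_runtime;
          }

          static inline u64 wakeup_overlap(struct sched_entity *se)
          {
                  return se->sum_exec_runtime - se->last_wakeup;
          }

          /*
           * Affine-wakeup decision: a short overlap means waker and wakee
           * are tightly coupled, so prefer waking on the same CPU.
           */
          static int want_affine_wakeup(struct sched_entity *waker)
          {
                  return wakeup_overlap(waker) < sysctl_sched_migration_cost;
          }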
    • sched: clean up wakeup balancing, code flow · f4827386
      Committed by Ingo Molnar
      Clean up the code flow. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, rename variables · ac192d39
      Committed by Ingo Molnar
      Rename 'cpu' to 'prev_cpu'.  No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, move wake_affine() · 098fb9db
      Committed by Ingo Molnar
      Split out the affine-wakeup bits.
      
      No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         9d76738f1272aa82f0b7affd2f51df6b  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      
      (The md5s changed because stack slots changed and some registers get
      scheduled by gcc in a different order - but otherwise the before and
      after assembly is instruction-for-instruction equivalent.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 17 March 2008, 1 commit
  13. 15 March 2008, 7 commits
    • sched: simplify sched_slice() · 6a6029b8
      Committed by Ingo Molnar
      Use the existing calc_delta_mine() calculation for sched_slice().
      This saves a divide and simplifies the code, because we share it with
      the other /cfs_rq->load users (see the sketch after this entry).
      
      It also improves code size:
      
            text    data     bss     dec     hex filename
           42659    2740     144   45543    b1e7 sched.o.before
           42093    2740     144   44977    afb1 sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
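      The resulting function is short enough to sketch in full (assuming
      the usual CFS helpers __sched_period() and calc_delta_mine()):

          /*
           * Sketch: slice = period * se->load.weight / cfs_rq->load.weight,
           * expressed via calc_delta_mine() so the /cfs_rq->load divide is
           * shared with that helper's other users.
           */
          static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  return calc_delta_mine(__sched_period(cfs_rq->nr_running),
                                         se->load.weight, &cfs_rq->load);
          }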
    • sched: fix fair sleepers · e22ecef1
      Committed by Ingo Molnar
      Fair sleepers need to scale their latency target down by runqueue
      weight.  Otherwise busy systems will gain an ever larger sleep bonus
      (see the sketch after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
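      A sketch of the scaling when placing a sleeping entity; the exact
      constants, helper name and surrounding logic are assumptions about
      the fix's form:

          /*
           * Sketch: scale the sleeper credit by runqueue load instead of
           * granting a fixed sysctl_sched_latency, so busy runqueues hand
           * out proportionally smaller sleep bonuses.
           */
          static void place_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  u64 vruntime = cfs_rq->min_vruntime;

                  /* before: vruntime -= sysctl_sched_latency; */
                  vruntime -= calc_delta_mine(sysctl_sched_latency,
                                              NICE_0_LOAD, &cfs_rq->load);

                  se->vruntime = max_vruntime(se->vruntime, vruntime);
          }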
    • sched: fix overload performance: buddy wakeups · aa2ac252
      Committed by Peter Zijlstra
      Currently we schedule to the leftmost task in the runqueue.  When the
      runtimes are very short because of some server/client ping-pong,
      especially in over-saturated workloads, this will cycle through all
      tasks, thrashing the cache.

      Reduce cache thrashing by keeping dependent tasks together, running
      newly woken tasks first.  However, by not running the leftmost task
      first we could starve tasks, because the wakee can gain unlimited
      runtime.

      Therefore we only run the wakee if it's within a small
      (wakeup_granularity) window of the leftmost task.  This preserves
      fairness, but does let server/client task groups alternate (see the
      sketch after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
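      A sketch of the pick logic under these rules; cfs_rq->next as the
      wakeup-buddy pointer follows the description, while the granularity
      check shown is simplified:

          /*
           * Sketch: prefer the wakeup buddy (cfs_rq->next) over the
           * leftmost task, but only while the buddy's vruntime lies
           * within the wakeup granularity of the leftmost task.
           */
          static struct sched_entity *
          pick_next(struct cfs_rq *cfs_rq, struct sched_entity *leftmost)
          {
                  struct sched_entity *next = cfs_rq->next;
                  s64 lead;

                  if (!next)
                          return leftmost;

                  lead = (s64)(next->vruntime - leftmost->vruntime);
                  if (lead < (s64)sysctl_sched_wakeup_granularity)
                          return next;

                  return leftmost;
          }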
    • sched: fix calc_delta_mine() · 27d11726
      Committed by Ingo Molnar
      lw->weight can be 0 for a short time during bootup (see the sketch
      after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
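      A hedged sketch of what guarding against a zero weight can look like;
      the "+ 1" rounding trick and the simplified body are assumptions
      about the fix's exact form:

          /*
           * Sketch: delta * weight / lw->weight via a cached inverse.
           * Dividing by (lw->weight + 1) keeps the inverse computation
           * safe while lw->weight is still 0 during bootup.
           */
          static unsigned long
          calc_delta_mine(unsigned long delta_exec, unsigned long weight,
                          struct load_weight *lw)
          {
                  u64 tmp;

                  if (unlikely(!lw->inv_weight))
                          lw->inv_weight = (WMULT_CONST - lw->weight / 2)
                                                  / (lw->weight + 1);

                  tmp = (u64)delta_exec * weight;
                  return (unsigned long)((tmp * lw->inv_weight) >> WMULT_SHIFT);
          }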
    • sched: fix update_load_add()/sub() · e89996ae
      Committed by Ingo Molnar
      Clear the cached inverse value when updating the load.  This is needed
      for calc_delta_mine() to work correctly when using the rq load (see
      the sketch after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
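      The fix is small enough to sketch in full; clearing inv_weight forces
      calc_delta_mine() to lazily recompute it from the new weight:

          /* Sketch: any weight change invalidates the cached inverse. */
          static inline void update_load_add(struct load_weight *lw,
                                             unsigned long inc)
          {
                  lw->weight += inc;
                  lw->inv_weight = 0;
          }

          static inline void update_load_sub(struct load_weight *lw,
                                             unsigned long dec)
          {
                  lw->weight -= dec;
                  lw->inv_weight = 0;
          }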
    • sched: min_vruntime fix · 3fe69747
      Committed by Peter Zijlstra
      The current min_vruntime tracking is incorrect and will cause serious
      problems when we don't run the leftmost task for some reason.

      min_vruntime does two things: 1) it is used to determine a forward
      direction when the u64 vruntime wraps; 2) it is used to track the
      leftmost vruntime, from which newly enqueued tasks are positioned.

      The current logic advances min_vruntime whenever the current task's
      vruntime advances.  Because the current task may pass the leftmost
      task that is still waiting, we fail the second goal.  This causes new
      tasks to be placed too far ahead, which penalizes their runtime.

      Fix this by making min_vruntime the min_vruntime of the waiting tasks,
      tracking it in enqueue/dequeue, and comparing against the current
      task's vruntime to obtain the absolute minimum when placing new tasks
      (see the sketch after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
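      A sketch of the tracking rule as described - take the minimum over
      the current task and the leftmost waiting task, and never move
      min_vruntime backwards (the helper shape and the
      __pick_first_entity() name are assumptions):

          /*
           * Sketch: min_vruntime covers both the running task and the
           * leftmost waiting task, and only ever advances monotonically
           * (monotonicity is what makes the u64 wrap detection work).
           */
          static void update_min_vruntime(struct cfs_rq *cfs_rq)
          {
                  struct sched_entity *leftmost = __pick_first_entity(cfs_rq);
                  u64 vruntime = cfs_rq->min_vruntime;

                  if (cfs_rq->curr)
                          vruntime = cfs_rq->curr->vruntime;

                  if (leftmost) {
                          if (cfs_rq->curr)
                                  vruntime = min_vruntime(vruntime,
                                                          leftmost->vruntime);
                          else
                                  vruntime = leftmost->vruntime;
                  }

                  cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
                                                      vruntime);
          }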
    • sched: fix race in schedule() · 0e1f3483
      Committed by Hiroshi Shimamoto
      Fix a hard-to-trigger crash seen in the -rt kernel that also affects
      the vanilla scheduler.
      
      There is a race condition between schedule() and some dequeue/enqueue
      functions: rt_mutex_setprio(), __setscheduler() and sched_move_task().

      When scheduling to idle, idle_balance() is called to pull tasks from
      other busy processors, and it might drop the rq lock.  This means
      those three functions can observe on_rq=0 while running=1; the current
      task must still be put (put_prev_task) while it is running.
      
      Here is a possible scenario:
      
         CPU0                               CPU1
          |                              schedule()
          |                              ->deactivate_task()
          |                              ->idle_balance()
          |                              -->load_balance_newidle()
      rt_mutex_setprio()                     |
          |                              --->double_lock_balance()
          *get lock                          *rel lock
          * on_rq=0, running=1               |
          * sched_class is changed           |
          *rel lock                          *get lock
          :                                  |
                                             :
                                         ->put_prev_task_rt()
                                         ->pick_next_task_fair()
                                             => panic
      
      The current process on CPU1 (P1) is scheduling.  P1 is deactivated,
      and the scheduler looks for another process on other CPUs' runqueues,
      because CPU1 will be idle.  idle_balance(), load_balance_newidle() and
      double_lock_balance() are called, and double_lock_balance() could drop
      the rq lock.  Meanwhile, CPU0 is trying to boost the priority of P1.
      As a result of the boost, only P1's prio and sched_class are changed
      to RT; the sched entities of P1 and P1's group are never put.  This
      leaves the cfs_rq invalid, because it has a curr but no leaf; when
      pick_next_task_fair() is then called, the kernel panics.  (A sketch of
      the fix pattern follows this entry.)
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
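      A sketch of the pattern such setters need, assuming the fix is to
      treat "running" (being rq->curr) independently of on_rq; the wrapper
      name change_task_class() is hypothetical:

          /*
           * Sketch: while schedule() is mid-flight a task can be running
           * (rq->curr) with on_rq=0, so dequeue and put_prev_task must be
           * decided separately - and undone symmetrically afterwards.
           */
          static void change_task_class(struct rq *rq, struct task_struct *p,
                                        const struct sched_class *class,
                                        int prio)
          {
                  int on_rq = p->se.on_rq;
                  int running = task_current(rq, p);

                  if (on_rq)
                          dequeue_task(rq, p, 0);
                  if (running)
                          p->sched_class->put_prev_task(rq, p);

                  p->sched_class = class;
                  p->prio = prio;

                  if (running)
                          p->sched_class->set_curr_task(rq);
                  if (on_rq)
                          enqueue_task(rq, p, 0);
          }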
  14. 13 March 2008, 1 commit
  15. 12 March 2008, 1 commit
    • Hibernation: Fix mark_nosave_pages() · a82f7119
      Committed by Rafael J. Wysocki
      There is a problem in the hibernation code that triggers on some NUMA
      systems on which pfn_valid() returns 'true' for some PFNs that don't
      belong to any zone.  Namely, there is a BUG_ON() in
      memory_bm_find_bit() that triggers for PFNs not belonging to any zone
      yet passing the pfn_valid() test.  On the affected systems it triggers
      when we mark PFNs reported by the platform as not saveable, because
      the PFNs in question belong to a region mapped directly using
      ioremap() (i.e. the ACPI data area) and they pass the pfn_valid()
      test.
      
      Modify memory_bm_find_bit() so that it returns an error if the given
      PFN doesn't belong to any zone, instead of crashing the kernel, and
      ignore its result in mark_nosave_pages() while marking the "nosave"
      memory regions (see the sketch after this entry).
      
      This doesn't affect the hibernation functionality, as we won't touch
      the PFNs in question anyway.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=9966
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Len Brown <len.brown@intel.com>
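      A sketch of the "set if possible, otherwise ignore" wrapper this
      describes; the helper name mem_bm_set_bit_check() and the signature
      of memory_bm_find_bit() are assumptions based on the description:

          /*
           * Sketch: look the PFN up in the bitmap; if it falls outside
           * every zone, memory_bm_find_bit() now reports an error instead
           * of hitting a BUG_ON(), and the page is simply skipped -
           * hibernation never touches such PFNs anyway.
           */
          static void mem_bm_set_bit_check(struct memory_bitmap *bm,
                                           unsigned long pfn)
          {
                  void *addr;
                  unsigned int bit;

                  if (memory_bm_find_bit(bm, pfn, &addr, &bit) == 0)
                          set_bit(bit, addr);
          }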