提交 · 3b514d24e200fcdcde0a57c354a51d3677a86743 · openanolis / cloud-kernel

17 4月, 2014 1 次提交

sched: Check for stop task appearance when balancing happens · a1d9a323

由 Kirill Tkhai 提交于 4月 10, 2014

We need to do it like we do for the other higher priority classes..
Signed-off-by: NKirill Tkhai <tkhai@yandex.ru>
Cc: Michael wang <wangyun@linux.vnet.ibm.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/336561397137116@web27h.yandex.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>

a1d9a323

11 4月, 2014 1 次提交

sched/numa: Fix task_numa_free() lockdep splat · 60e69eed

由 Mike Galbraith 提交于 4月 07, 2014

Sasha reported that lockdep claims that the following commit:
made numa_group.lock interrupt unsafe:

  156654f4 ("sched/numa: Move task_numa_free() to __put_task_struct()")

While I don't see how that could be, given the commit in question moved
task_numa_free() from one irq enabled region to another, the below does
make both gripes and lockups upon gripe with numa=fake=4 go away.
Reported-by: NSasha Levin <sasha.levin@oracle.com>
Fixes: 156654f4 ("sched/numa: Move task_numa_free() to __put_task_struct()")
Signed-off-by: NMike Galbraith <bitbucket@online.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: torvalds@linux-foundation.org
Cc: mgorman@suse.com
Cc: akpm@linux-foundation.org
Cc: Dave Jones <davej@redhat.com>
Link: http://lkml.kernel.org/r/1396860915.5170.5.camel@marge.simpson.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

60e69eed

08 4月, 2014 2 次提交

kernel: use macros from compiler.h instead of __attribute__((...)) · 52f5684c

由 Gideon Israel Dsouza 提交于 4月 07, 2014

To increase compiler portability there is <linux/compiler.h> which
provides convenience macros for various gcc constructs.  Eg: __weak for
__attribute__((weak)).  I've replaced all instances of gcc attributes
with the right macro in the kernel subsystem.
Signed-off-by: NGideon Israel Dsouza <gidisrael@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

52f5684c

sched: remove sleep_on() and friends · b8780c36

由 Arnd Bergmann 提交于 4月 07, 2014

This is the final piece in the puzzle, as all patches to remove the
last users of \(interruptible_\|\)sleep_on\(_timeout\|\) have made it
into the 3.15 merge window. The work was long overdue, and this
interface in particular should not have survived the BKL removal
that was done a couple of years ago.

Citing Jon Corbet from http://lwn.net/2001/0201/kernel.php3":

 "[...] it was suggested that the janitors look for and fix all code
  that calls sleep_on() [...] since (1) almost all such code is
  incorrect, and (2) Linus has agreed that those functions should
  be removed in the 2.5 development series".

We haven't quite made it for 2.5, but maybe we can merge this for 3.15.
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b8780c36

04 4月, 2014 1 次提交

kernel: audit/fix non-modular users of module_init in core code · c96d6660

由 Paul Gortmaker 提交于 4月 03, 2014

Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular.  So using module_init as an
alias for __initcall can be somewhat misleading.

Fix these up now, so that we can relocate module_init from init.h into
module.h in the future.  If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.

The audit targets the following module_init users for change:
 kernel/user.c                  obj-y
 kernel/kexec.c                 bool KEXEC (one instance per arch)
 kernel/profile.c               bool PROFILING
 kernel/hung_task.c             bool DETECT_HUNG_TASK
 kernel/sched/stats.c           bool SCHEDSTATS
 kernel/user_namespace.c        bool USER_NS

Note that direct use of __initcall is discouraged, vs.  one of the
priority categorized subgroups.  As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e.  slightly earlier).  However no observable impact of that
difference has been observed during testing.

Also, two instances of missing ";" at EOL are fixed in kexec.
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c96d6660

20 3月, 2014 1 次提交

timer: Remove code redundancy while calling get_nohz_timer_target() · 6201b4d6

由 Viresh Kumar 提交于 3月 18, 2014

There are only two users of get_nohz_timer_target(): timer and hrtimer. Both
call it under same circumstances, i.e.

	#ifdef CONFIG_NO_HZ_COMMON
	       if (!pinned && get_sysctl_timer_migration() && idle_cpu(this_cpu))
	               return get_nohz_timer_target();
	#endif

So, it makes more sense to get all this as part of get_nohz_timer_target()
instead of duplicating code at two places. For this another parameter is
required to be passed to this routine, pinned.
Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
Cc: linaro-kernel@lists.linaro.org
Cc: fweisbec@gmail.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1e1b53537217d58d48c2d7a222a9c3ac47d5b64c.1395140107.git.viresh.kumar@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>

6201b4d6

13 3月, 2014 2 次提交

sched: Remove needless round trip nsecs <-> tick conversion of steal time · 300a9d88

由 Frederic Weisbecker 提交于 3月 05, 2014

When update_rq_clock_task() accounts the pending steal time for a task,
it converts the steal delta from nsecs to tick then from tick to nsecs.

There is no apparent good reason for doing that though because both
the task clock and the prev steal delta are u64 and store values
in nsecs.

So lets remove the needless conversion.

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>

300a9d88

cputime: Fix jiffies based cputime assumption on steal accounting · dee08a72

由 Frederic Weisbecker 提交于 3月 05, 2014

The steal guest time accounting code assumes that cputime_t is based on
jiffies. So when CONFIG_NO_HZ_FULL=y, which implies that cputime_t
is based on nsecs, steal_account_process_tick() passes the delta in
jiffies to account_steal_time() which then accounts it as if it's a
value in nsecs.

As a result, accounting 1 second of steal time (with HZ=100 that would
be 100 jiffies) is spuriously accounted as 100 nsecs.

As such /proc/stat may report 0 values of steal time even when two
guests have run concurrently for a few seconds on the same host and
same CPU.

In order to fix this, lets convert the nsecs based steal delta to
cputime instead of jiffies by using the right conversion API.

Given that the steal time is stored in cputime_t and this type can have
a smaller granularity than nsecs, we only account the rounded converted
value and leave the remaining nsecs for the next deltas.
Reported-by: NHuiqingding <huding@redhat.com>
Reported-by: NMarcelo Tosatti <mtosatti@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>

dee08a72

12 3月, 2014 3 次提交

sched: Clean up the task_hot() function · 6037dd1a

由 Alex Shi 提交于 3月 12, 2014

task_hot() doesn't need the 'sched_domain' parameter, so remove it.
Signed-off-by: NAlex Shi <alex.shi@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394607111-1904-1-git-send-email-alex.shi@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

6037dd1a

sched: Remove double calculation in fix_small_imbalance() · a2cd4260

由 Vincent Guittot 提交于 3月 11, 2014

The tmp value has been already calculated in:

  scaled_busy_load_per_task =
		(busiest->load_per_task * SCHED_POWER_SCALE) /
		busiest->group_power;
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394555166-22894-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

a2cd4260

sched: Fix broken setscheduler() · 383afd09

由 Steven Rostedt 提交于 3月 11, 2014

I decided to run my tests on linux-next, and my wakeup_rt tracer was
broken. After running a bisect, I found that the problem commit was:

   linux-next commit c365c292
   "sched: Consider pi boosting in setscheduler()"

And the reason the wake_rt tracer test was failing, was because it had
no RT task to trace. I first noticed this when running with
sched_switch event and saw that my RT task still had normal SCHED_OTHER
priority. Looking at the problem commit, I found:

 -       p->normal_prio = normal_prio(p);
 -       p->prio = rt_mutex_getprio(p);

With no

 +       p->normal_prio = normal_prio(p);
 +       p->prio = rt_mutex_getprio(p);

Reading what the commit is suppose to do, I realize that the p->prio
can't be set if the task is boosted with a higher prio, but the
p->normal_prio still needs to be set regardless, otherwise, when the
task is deboosted, it wont get the new priority.

The p->prio has to be set before "check_class_changed()" is called,
otherwise the class wont be changed.

Also added fix to newprio to include a check for deadline policy that
was missing. This change was suggested by Juri Lelli.
Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: SebastianAndrzej Siewior <bigeasy@linutronix.de>
Cc: Juri Lelli <juri.lelli@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140306120438.638bfe94@gandalf.local.homeSigned-off-by: NIngo Molnar <mingo@kernel.org>

383afd09

11 3月, 2014 11 次提交

sched/numa: Move task_numa_free() to __put_task_struct() · 156654f4

由 Mike Galbraith 提交于 2月 28, 2014

Bad idea on -rt:

[  908.026136]  [<ffffffff8150ad6a>] rt_spin_lock_slowlock+0xaa/0x2c0
[  908.026145]  [<ffffffff8108f701>] task_numa_free+0x31/0x130
[  908.026151]  [<ffffffff8108121e>] finish_task_switch+0xce/0x100
[  908.026156]  [<ffffffff81509c0a>] thread_return+0x48/0x4ae
[  908.026160]  [<ffffffff8150a095>] schedule+0x25/0xa0
[  908.026163]  [<ffffffff8150ad95>] rt_spin_lock_slowlock+0xd5/0x2c0
[  908.026170]  [<ffffffff810658cf>] get_signal_to_deliver+0xaf/0x680
[  908.026175]  [<ffffffff8100242d>] do_signal+0x3d/0x5b0
[  908.026179]  [<ffffffff81002a30>] do_notify_resume+0x90/0xe0
[  908.026186]  [<ffffffff81513176>] int_signal+0x12/0x17
[  908.026193]  [<00007ff2a388b1d0>] 0x7ff2a388b1cf

and since upstream does not mind where we do this, be a bit nicer ...
Signed-off-by: NMike Galbraith <bitbucket@online.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1393568591.6018.27.camel@marge.simpson.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

156654f4

sched/fair: Fix endless loop in idle_balance() · 35805ff8

由 Kirill Tkhai 提交于 3月 06, 2014

Check for fair tasks number to decide, that we've pulled a task.
rq's nr_running may contain throttled RT tasks.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394118975.19290.104.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

35805ff8

sched/core: Fix endless loop in pick_next_task() · 4c6c4e38

由 Kirill Tkhai 提交于 3月 06, 2014

1) Single cpu machine case.

When rq has only RT tasks, but no one of them can be picked
because of throttling, we enter in endless loop.

pick_next_task_{dl,rt} return NULL.

In pick_next_task_fair() we permanently go to retry

	if (rq->nr_running != rq->cfs.h_nr_running)
		return RETRY_TASK;

(rq->nr_running is not being decremented when rt_rq becomes
throttled).

No chances to unthrottle any rt_rq or to wake fair here,
because of rq is locked permanently and interrupts are
disabled.

2) In case of SMP this can cause a hang too. Although we unlock
   rq in idle_balance(), interrupts are still disabled.

The solution is to check for available tasks in DL and RT
classes instead of checking for sum.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1394098321.19290.11.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

4c6c4e38

sched/fair: Push down check for high priority class task into idle_balance() · e4aa358b

由 Kirill Tkhai 提交于 3月 06, 2014

We close idle_exit_fair() bracket in case of we've pulled something or we've received
task of high priority class.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/1394098315.19290.10.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

e4aa358b

sched/rt: Fix picking RT and DL tasks from empty queue · 734ff2a7

由 Kirill Tkhai 提交于 3月 04, 2014

The problems:

1) We check for rt_nr_running before call of put_prev_task().
   If previous task is RT, its rt_rq may become throttled
   and dequeued after this call.

In case of p is from rt->rq this just causes picking a task
from throttled queue, but in case of its rt_rq is child
we are guaranteed catch BUG_ON.

2) The same with deadline class. The only difference we operate
   on only dl_rq.

This patch fixes all the above problems and it adds a small skip in the
DL update like we've already done for RT class:

	if (unlikely((s64)delta_exec <= 0))
		return;

This will optimize sequential update_curr_dl() calls a little.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393946746.3643.3.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

734ff2a7

sched/idle: Add more comments to the code · a1d028bd

由 Daniel Lezcano 提交于 3月 03, 2014

The idle main function is a complex and a critical function. Added more
comments to the code.
Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: NNicolas Pitre <nico@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-5-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

a1d028bd

sched/idle: Move idle conditions in cpuidle_idle main function · 8ca3c642

由 Daniel Lezcano 提交于 3月 03, 2014

This patch moves the condition before entering idle into the cpuidle main
function located in idle.c. That simplify the idle mainloop functions and
increase the readibility of the conditions to enter truly idle.

This patch is code reorganization and does not change the behavior of the
function.
Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-4-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

8ca3c642

sched/idle: Reorganize the idle loop · c8cc7d4d

由 Daniel Lezcano 提交于 3月 03, 2014

Now that we have the main cpuidle function in idle.c, move some code from
the idle mainloop to this function for the sake of clarity.

That removes if then else indentation difficult to follow when looking at the
code. This patch does not change the current behavior.
Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: NNicolas Pitre <nico@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: tglx@linutronix.de
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-3-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

c8cc7d4d

cpuidle/idle: Move the cpuidle_idle_call function to idle.c · 30cdd69e

由 Daniel Lezcano 提交于 3月 03, 2014

The cpuidle_idle_call does nothing more than calling the three individuals
function and is no longer used by any arch specific code but only in the
cpuidle framework code.

We can move this function into the idle task code to ensure better
proximity to the scheduler code.
Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: NNicolas Pitre <nicolas.pitre@linaro.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: preeti@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

30cdd69e

sched/clock: Prevent tracing recursion in sched_clock_cpu() · 96b3d28b

由 Fernando Luis Vazquez Cao 提交于 3月 06, 2014

Prevent tracing of preempt_disable/enable() in sched_clock_cpu().
When CONFIG_DEBUG_PREEMPT is enabled, preempt_disable/enable() are
traced and this causes trace_clock() users (and probably others) to
go into an infinite recursion. Systems with a stable sched_clock()
are not affected.

This problem is similar to that fixed by upstream commit 95ef1e52
("KVM guest: prevent tracing recursion with kvmclock").
Signed-off-by: NFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Acked-by: NSteven Rostedt <rostedt@goodmis.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1394083528.4524.3.camel@nexusSigned-off-by: NIngo Molnar <mingo@kernel.org>

96b3d28b

sched/deadline: Deny unprivileged users to set/change SCHED_DEADLINE policy · d44753b8

由 Juri Lelli 提交于 3月 03, 2014

Deny the use of SCHED_DEADLINE policy to unprivileged users.
Even if root users can set the policy for normal users, we
don't want the latter to be able to change their parameters
(safest behavior).
Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1393844961-18097-1-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

d44753b8

27 2月, 2014 7 次提交

sched: Guarantee task priority in pick_next_task() · 37e117c0

由 Peter Zijlstra 提交于 2月 14, 2014

Michael spotted that the idle_balance() push down created a task
priority problem.

Previously, when we called idle_balance() before pick_next_task() it
wasn't a problem when -- because of the rq->lock droppage -- an rt/dl
task slipped in.

Similarly for pre_schedule(), rt pre-schedule could have a dl task
slip in.

But by pulling it into the pick_next_task() loop, we'll not try a
higher task priority again.

Cure this by creating a re-start condition in pick_next_task(); and
triggering this from pick_next_task_{rt,fair}().

It also fixes a live-lock where we get stuck in pick_next_task_fair()
due to idle_balance() seeing !0 nr_running but there not actually
being any fair tasks about.
Reported-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
Fixes: 38033c37 ("sched: Push down pre_schedule() and idle_balance()")
Tested-by: NSasha Levin <sasha.levin@oracle.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140224121218.GR15586@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

37e117c0

sched/idle: Remove stale old file · 06d50c65

由 Peter Zijlstra 提交于 2月 24, 2014

Commit cf37b6b4 ("sched/idle: Move cpu/idle.c to sched/idle.c")
said to simply move a file; somehow it got mangled and created an old
version of the file and forgot to remove the old file.

Fix this fail; add the lost change and remove the now identical old
file.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: rjw@rjwysocki.net
Cc: nicolas.pitre@linaro.org
Cc: preeti@linux.vnet.ibm.com
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/20140224172207.GC9987@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

06d50c65

sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED · f5f9739d

由 Dietmar Eggemann 提交于 2月 26, 2014

The struct sched_avg of struct rq is only used in case group
scheduling is enabled inside __update_tg_runnable_avg() to update
per-cpu representation of a task group. I.e. that there is no need to
maintain the runnable avg of a rq in the !CONFIG_FAIR_GROUP_SCHED case.

This patch guards struct sched_avg of struct rq and
update_rq_runnable_avg() with CONFIG_FAIR_GROUP_SCHED.

There is an extra empty definition for update_rq_runnable_avg()
necessary for the !CONFIG_FAIR_GROUP_SCHED && CONFIG_SMP case.

The function print_cfs_group_stats() which prints out struct sched_avg
of struct rq is already guarded with CONFIG_FAIR_GROUP_SCHED.
Reviewed-by: NBen Segall <bsegall@google.com>
Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/530DCDC5.1060406@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

f5f9739d

sched/deadline: Prevent rt_time growth to infinity · faa59937

由 Juri Lelli 提交于 2月 21, 2014

Kirill Tkhai noted:

  Since deadline tasks share rt bandwidth, we must care about
  bandwidth timer set. Otherwise rt_time may grow up to infinity
  in update_curr_dl(), if there are no other available RT tasks
  on top level bandwidth.

RT task were in fact throttled right after they got enqueued,
and never executed again (rt_time never again went below rt_runtime).

Peter then proposed to accrue DL execution on rt_time only when
rt timer is active, and proposed a patch (this patch is a slight
modification of that) to implement that behavior. While this
solves Kirill problem, it has a drawback.

Indeed, Kirill noted again:

  It looks we may get into a situation, when all CPU time is shared
  between RT and DL tasks:

  rt_runtime = n
  rt_period  = 2n

  | RT working, DL sleeping  | DL working, RT sleeping      |
  -----------------------------------------------------------
  | (1)     duration = n     | (2)     duration = n         | (repeat)
  |--------------------------|------------------------------|
  | (rt_bw timer is running) | (rt_bw timer is not running) |

  No time for fair tasks at all.

While this can happen during the first period, if rq is always backlogged,
RT tasks won't have the opportunity to execute anymore: rt_time reached
rt_runtime during (1), suppose after (2) RT is enqueued back, it gets
throttled since rt timer didn't fire, replenishment is from now on eaten up
by DL tasks that accrue their execution on rt_time (while rt timer is
active - we have an RT task waiting for replenishment). FAIR tasks are
not touched after this first period. Ok, this is not ideal, and the situation
is even worse!

What above (the nice case), practically never happens in reality, where
your rt timer is not aligned to tasks periods, tasks are in general not
periodic, etc.. Long story short, you always risk to overload your system.

This patch is based on Peter's idea, but exploits an additional fact:
if you don't have RT tasks enqueued, it makes little sense to continue
incrementing rt_time once you reached the upper limit (DL tasks have their
own mechanism for throttling).

This cures both problems:

 - no matter how many DL instances in the past, you'll have an rt_time
   slightly above rt_runtime when an RT task is enqueued, and from that
   point on (after the first replenishment), the task will normally execute;

 - you can still eat up all bandwidth during the first period, but not
   anymore after that, remember that DL execution will increment rt_time
   till the upper limit is reached.

The situation is still not perfect! But, we have a simple solution for now,
that limits how much you can jeopardize your system, as we keep working
towards the right answer: RT groups scheduled using deadline servers.
Reported-by: NKirill Tkhai <tkhai@yandex.ru>
Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/20140225151515.617714e2f2cd6c558531ba61@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

faa59937

sched/deadline: Switch CPU's presence test order · eec751ed

由 Juri Lelli 提交于 2月 24, 2014

Commit 82b95800 ("sched/deadline: Test for CPU's presence explicitly")
changed how we check if a CPU returned by cpudeadline machinery is
valid. But, we don't want to call cpu_present() if best_cpu is
equal to -1. So, switch the order of tests inside WARN_ON().
Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: boris.ostrovsky@oracle.com
Cc: konrad.wilk@oracle.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1393238832-9100-1-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

eec751ed

sched/deadline: Cleanup RT leftovers from {inc/dec}_dl_migration · 3908ac13

由 Kirill Tkhai 提交于 2月 25, 2014

In deadline class we do not have group scheduling.

So, let's remove unnecessary

	X = X;

equations.
Signed-off-by: NKirill Tkhai <ktkhai@parallels.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@gmail.com>
Link: http://lkml.kernel.org/r/1393343543.4089.5.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>

3908ac13

sched: Fix double normalization of vruntime · 791c9e02

由 George McCollister 提交于 2月 18, 2014

dequeue_entity() is called when p->on_rq and sets se->on_rq = 0
which appears to guarentee that the !se->on_rq condition is met.
If the task has done set_current_state(TASK_INTERRUPTIBLE) without
schedule() the second condition will be met and vruntime will be
incorrectly adjusted twice.

In certain cases this can result in the task's vruntime never increasing
past the vruntime of other tasks on the CFS' run queue, starving them of
CPU time.

This patch changes switched_from_fair() to use !p->on_rq instead of
!se->on_rq.

I'm able to cause a task with a priority of 120 to starve all other
tasks with the same priority on an ARM platform running 3.2.51-rt72
PREEMPT RT by writing one character at time to a serial tty (16550 UART)
in a tight loop. I'm also able to verify making this change corrects the
problem on that platform and kernel version.
Signed-off-by: NGeorge McCollister <george.mccollister@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1392767811-28916-1-git-send-email-george.mccollister@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

791c9e02

25 2月, 2014 2 次提交

smp: Rename __smp_call_function_single() to smp_call_function_single_async() · c46fff2a

由 Frederic Weisbecker 提交于 2月 24, 2014

The name __smp_call_function_single() doesn't tell much about the
properties of this function, especially when compared to
smp_call_function_single().

The comments above the implementation are also misleading. The main
point of this function is actually not to be able to embed the csd
in an object. This is actually a requirement that result from the
purpose of this function which is to raise an IPI asynchronously.

As such it can be called with interrupts disabled. And this feature
comes at the cost of the caller who then needs to serialize the
IPIs on this csd.

Lets rename the function and enhance the comments so that they reflect
these properties.
Suggested-by: NChristoph Hellwig <hch@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

c46fff2a

smp: Remove wait argument from __smp_call_function_single() · fce8ad15

由 Frederic Weisbecker 提交于 2月 24, 2014

The main point of calling __smp_call_function_single() is to send
an IPI in a pure asynchronous way. By embedding a csd in an object,
a caller can send the IPI without waiting for a previous one to complete
as is required by smp_call_function_single() for example. As such,
sending this kind of IPI can be safe even when irqs are disabled.

This flexibility comes at the expense of the caller who then needs to
synchronize the csd lifecycle by himself and make sure that IPIs on a
single csd are serialized.

This is how __smp_call_function_single() works when wait = 0 and this
usecase is relevant.

Now there don't seem to be any usecase with wait = 1 that can't be
covered by smp_call_function_single() instead, which is safer. Lets look
at the two possible scenario:

1) The user calls __smp_call_function_single(wait = 1) on a csd embedded
   in an object. It looks like a nice and convenient pattern at the first
   sight because we can then retrieve the object from the IPI handler easily.

   But actually it is a waste of memory space in the object since the csd
   can be allocated from the stack by smp_call_function_single(wait = 1)
   and the object can be passed an the IPI argument.

   Besides that, embedding the csd in an object is more error prone
   because the caller must take care of the serialization of the IPIs
   for this csd.

2) The user calls __smp_call_function_single(wait = 1) on a csd that
   is allocated on the stack. It's ok but smp_call_function_single()
   can do it as well and it already takes care of the allocation on the
   stack. Again it's more simple and less error prone.

Therefore, using the underscore prepend API version with wait = 1
is a bad pattern and a sign that the caller can do safer and more
simple.

There was a single user of that which has just been converted.
So lets remove this option to discourage further users.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

fce8ad15

23 2月, 2014 9 次提交

sched, nohz: Exclude isolated cores from load balancing · d987fc7f

由 Mike Galbraith 提交于 12月 05, 2011

The user explicitly disabled load balancing, else this core would not be
disconnected. Don't add these to nohz.idle_cpus_mask.
Signed-off-by: NMike Galbraith <mgalbraith@suse.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Lei Wen <leiwen@marvell.com>
Link: http://lkml.kernel.org/n/tip-vmme4f49psirp966pklm5l9j@git.kernel.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

d987fc7f

sched: Fix select_task_rq_fair() description comments · de91b9cb

由 Morten Rasmussen 提交于 2月 18, 2014

Brings select_task_rq_fair() description comments up-to-date.
Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1392732864-10927-1-git-send-email-morten.rasmussen@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

de91b9cb

sched: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE · 75e45d51

由 Dongsheng Yang 提交于 2月 11, 2014

Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/bd80780f19b4f9b4a765acc353c8dbc130274dd6.1392103744.git.yangds.fnst@cn.fujitsu.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

75e45d51

sched/rt: Make init_sched_rt_calss() __init · 11c785b7

由 Li Zefan 提交于 2月 08, 2014

It's a bootstrap function.
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52F5CC09.1080502@huawei.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

11c785b7

sched/rt: Remove 'leaf_rt_rq_list' from 'struct rq' · d82fd253

由 Li Zefan 提交于 2月 08, 2014

This is a leftover from commit e23ee747
("sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list").
Signed-off-by: NLi Zefan <lizefan@huawei.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/52F5CBF6.4060901@huawei.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NIngo Molnar <mingo@kernel.org>

d82fd253

sched: Consider pi boosting in setscheduler() · c365c292

由 Thomas Gleixner 提交于 2月 07, 2014

If a PI boosted task policy/priority is modified by a setscheduler()
call we unconditionally dequeue and requeue the task if it is on the
runqueue even if the new priority is lower than the current effective
boosted priority. This can result in undesired reordering of the
priority bucket list.

If the new priority is less or equal than the current effective we
just store the new parameters in the task struct and leave the
scheduler class and the runqueue untouched. This is handled when the
task deboosts itself. Only if the new priority is higher than the
effective boosted priority we apply the change immediately.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Dario Faggioli <raistlin@linux.it>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

c365c292

sched: Queue RT tasks to head when prio drops · 81a44c54

由 Thomas Gleixner 提交于 2月 07, 2014

The following scenario does not work correctly:

Runqueue of CPUx contains two runnable and pinned tasks:

 T1: SCHED_FIFO, prio 80
 T2: SCHED_FIFO, prio 80

T1 is on the cpu and executes the following syscalls (classic priority
ceiling scenario):

 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 90);
 ...
 sys_sched_setscheduler(pid(T1), SCHED_FIFO, .prio = 80);
 ...

Now T1 gets preempted by T3 (SCHED_FIFO, prio 95). After T3 goes back
to sleep the scheduler picks T2. Surprise!

The same happens w/o actual preemption when T1 is forced into the
scheduler due to a sporadic NEED_RESCHED event. The scheduler invokes
pick_next_task() which returns T2. So T1 gets preempted and scheduled
out.

This happens because sched_setscheduler() dequeues T1 from the prio 90
list and then enqueues it on the tail of the prio 80 list behind T2.
This violates the POSIX spec and surprises user space which relies on
the guarantee that SCHED_FIFO tasks are not scheduled out unless they
give the CPU up voluntarily or are preempted by a higher priority
task. In the latter case the preempted task must get back on the CPU
after the preempting task schedules out again.

We fixed a similar issue already in commit 60db48ca (sched: Queue a
deboosted task to the head of the RT prio queue). The same treatment
is necessary for sched_setscheduler(). So enqueue to head of the prio
bucket list if the priority of the task is lowered.

It might be possible that existing user space relies on the current
behaviour, but it can be considered highly unlikely due to the corner
case nature of the application scenario.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-6-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

81a44c54

sched: Adjust p->sched_reset_on_fork when nothing else changes · d6b1e911

由 Thomas Gleixner 提交于 2月 07, 2014

If the policy and priority remain unchanged a possible modification of
p->sched_reset_on_fork gets lost in the early exit path.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
[ Rebase ontop of v3.14-rc1. ]
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-5-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

d6b1e911

sched: Add better debug output for might_sleep() · 8f47b187

由 Thomas Gleixner 提交于 2月 07, 2014

might_sleep() can tell us where interrupts have been disabled, but we
have no idea what disabled preemption. Add some debug infrastructure.
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1391803122-4425-4-git-send-email-bigeasy@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

8f47b187

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功