提交 · d50dde5a10f305253cbc3855307f608f8a3c5f73 · openanolis / cloud-kernel

13 1月, 2014 1 次提交

sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a

由 Dario Faggioli 提交于 11月 07, 2013

Add the syscalls needed for supporting scheduling algorithms
with extended scheduling parameters (e.g., SCHED_DEADLINE).

In general, it makes possible to specify a periodic/sporadic task,
that executes for a given amount of runtime at each instance, and is
scheduled according to the urgency of their own timing constraints,
i.e.:

- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.

Thus, both the data structure that holds the scheduling parameters of
the tasks and the system calls dealing with it must be extended.
Unfortunately, modifying the existing struct sched_param would break
the ABI and result in potentially serious compatibility issues with
legacy binaries.

For these reasons, this patch:

- defines the new struct sched_attr, containing all the fields
that are necessary for specifying a task in the computational
model described above;

- defines and implements the new scheduling related syscalls that
manipulate it, i.e., sched_setattr() and sched_getattr().

Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
proof of concept and for developing and testing purposes. Making them
available on other architectures is straightforward.

Since no "user" for these new parameters is introduced in this patch,
the implementation of the new system calls is just identical to their
already existing counterpart. Future patches that implement scheduling
policies able to exploit the new data structure must also take care of
modifying the sched_*attr() calls accordingly with their own purposes.
Signed-off-by: NDario Faggioli <raistlin@linux.it>
[ Rewrote to use sched_attr. ]
Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
[ Removed sched_setscheduler2() for now. ]
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

d50dde5a

12 1月, 2014 1 次提交

sched: Calculate effective load even if local weight is 0 · 9722c2da

由 Rik van Riel 提交于 1月 06, 2014

Thomas Hellstrom bisected a regression where erratic 3D performance is
experienced on virtual machines as measured by glxgears. It identified
commit 58d081b5 ("sched/numa: Avoid overloading CPUs on a preferred NUMA
node") as the problem which had modified the behaviour of effective_load.

Effective load calculates the difference to the system-wide load if a
scheduling entity was moved to another CPU. The task group is not heavier
as a result of the move but overall system load can increase/decrease as a
result of the change. Commit 58d081b5 ("sched/numa: Avoid overloading CPUs
on a preferred NUMA node") changed effective_load to make it suitable for
calculating if a particular NUMA node was compute overloaded. To reduce
the cost of the function, it assumed that a current sched entity weight
of 0 was uninteresting but that is not the case.

wake_affine() uses a weight of 0 for sync wakeups on the grounds that it
is assuming the waking task will sleep and not contribute to load in the
near future. In this case, we still want to calculate the effective load
of the sched entity hierarchy. As effective_load is no longer used by
task_numa_compare since commit fb13c7ee (sched/numa: Use a system-wide
search to find swap/migration candidates), this patch simply restores the
historical behaviour.
Reported-and-tested-by: NThomas Hellstrom <thellstrom@vmware.com>
Signed-off-by: NRik van Riel <riel@redhat.com>
[ Wrote changelog]
Signed-off-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20140106113912.GC6178@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

9722c2da

19 12月, 2013 1 次提交

sched: numa: skip inaccessible VMAs · 3c67f474

由 Mel Gorman 提交于 12月 18, 2013

Inaccessible VMA should not be trapping NUMA hint faults. Skip them.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Cc: Alex Thorlton <athorlton@sgi.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3c67f474

17 12月, 2013 6 次提交

sched/numa: Fix period_slot recalculation · e777b63b

由 Wanpeng Li 提交于 12月 12, 2013

The original code is as intended and was meant to scale the difference
between the NUMA_PERIOD_THRESHOLD and local/remote ratio when adjusting
the scan period. The period_slot recalculation can be dropped.
Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1386833006-6600-4-git-send-email-liwanp@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e777b63b

sched/numa: Use wrapper function task_faults_idx to calculate index in group_faults · 82897b4f

由 Wanpeng Li 提交于 12月 12, 2013

Use wrapper function task_faults_idx to calculate index in group_faults.
Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Link: http://lkml.kernel.org/r/1386833006-6600-3-git-send-email-liwanp@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

82897b4f

sched/numa: Use wrapper function task_node to get node which task is on · de1b301a

由 Wanpeng Li 提交于 12月 12, 2013

Use wrapper function task_node to get node which task is on.
Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386833006-6600-2-git-send-email-liwanp@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

de1b301a

sched/numa: Drop sysctl_numa_balancing_settle_count sysctl · 1bd53a7e

由 Wanpeng Li 提交于 12月 12, 2013

commit 887c290e (sched/numa: Decide whether to favour task or group weights
based on swap candidate relationships) drop the check against
sysctl_numa_balancing_settle_count, this patch remove the sysctl.
Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: http://lkml.kernel.org/r/1386833006-6600-1-git-send-email-liwanp@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1bd53a7e

sched/rt: Fix rq's cpupri leak while enqueue/dequeue child RT entities · 757dfcaa

由 Kirill Tkhai 提交于 11月 27, 2013

This patch touches the RT group scheduling case.

Functions inc_rt_prio_smp() and dec_rt_prio_smp() change (global) rq's
priority, while rt_rq passed to them may be not the top-level rt_rq.
This is wrong, because changing of priority on a child level does not
guarantee that the priority is the highest all over the rq. So, this
leak makes RT balancing unusable.

The short example: the task having the highest priority among all rq's
RT tasks (no one other task has the same priority) are waking on a
throttle rt_rq.  The rq's cpupri is set to the task's priority
equivalent, but real rq->rt.highest_prio.curr is less.

The patch below fixes the problem.
Signed-off-by: NKirill Tkhai <tkhai@yandex.ru>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/49231385567953@web4m.yandex.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>

757dfcaa

sched: Assign correct scheduling domain to 'sd_llc' · 5d4cf996

由 Mel Gorman 提交于 12月 17, 2013

Commit 42eb088e (sched: Avoid NULL dereference on sd_busy) corrected a NULL
dereference on sd_busy but the fix also altered what scheduling domain it
used for the 'sd_llc' percpu variable.

One impact of this is that a task selecting a runqueue may consider
idle CPUs that are not cache siblings as candidates for running.
Tasks are then running on CPUs that are not cache hot.

This was found through bisection where ebizzy threads were not seeing equal
performance and it looked like a scheduling fairness issue. This patch
mitigates but does not completely fix the problem on all machines tested
implying there may be an additional bug or a common root cause. Here are
the average range of performance seen by individual ebizzy threads. It
was tested on top of candidate patches related to x86 TLB range flushing.

	4-core machine
			    3.13.0-rc3            3.13.0-rc3
			       vanilla            fixsd-v3r3
	Mean   1        0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2        0.34 (  0.00%)        0.10 ( 70.59%)
	Mean   3        1.29 (  0.00%)        0.93 ( 27.91%)
	Mean   4        7.08 (  0.00%)        0.77 ( 89.12%)
	Mean   5      193.54 (  0.00%)        2.14 ( 98.89%)
	Mean   6      151.12 (  0.00%)        2.06 ( 98.64%)
	Mean   7      115.38 (  0.00%)        2.04 ( 98.23%)
	Mean   8      108.65 (  0.00%)        1.92 ( 98.23%)

	8-core machine
	Mean   1         0.00 (  0.00%)        0.00 (  0.00%)
	Mean   2         0.40 (  0.00%)        0.21 ( 47.50%)
	Mean   3        23.73 (  0.00%)        0.89 ( 96.25%)
	Mean   4        12.79 (  0.00%)        1.04 ( 91.87%)
	Mean   5        13.08 (  0.00%)        2.42 ( 81.50%)
	Mean   6        23.21 (  0.00%)       69.46 (-199.27%)
	Mean   7        15.85 (  0.00%)      101.72 (-541.77%)
	Mean   8       109.37 (  0.00%)       19.13 ( 82.51%)
	Mean   12      124.84 (  0.00%)       28.62 ( 77.07%)
	Mean   16      113.50 (  0.00%)       24.16 ( 78.71%)

It's eliminated for one machine and reduced for another.
Signed-off-by: NMel Gorman <mgorman@suse.de>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Alex Shi <alex.shi@linaro.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: H Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131217092124.GV11295@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

5d4cf996

11 12月, 2013 2 次提交

sched/fair: Rework sched_fair time accounting · 9dbdb155

由 Peter Zijlstra 提交于 11月 18, 2013

Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
results in him occasionally seeing time going backwards - which
crashes the scheduler ...

Most of our time accounting can actually handle that except the most
common one; the tick time update of sched_fair.

There is a further problem with that code; previously we assumed that
because we get a tick every TICK_NSEC our time delta could never
exceed 32bits and math was simpler.

However, ever since Frederic managed to get NO_HZ_FULL merged; this is
no longer the case since now a task can run for a long time indeed
without getting a tick. It only takes about ~4.2 seconds to overflow
our u32 in nanoseconds.

This means we not only need to better deal with time going backwards;
but also means we need to be able to deal with large deltas.

This patch reworks the entire code and uses mul_u64_u32_shr() as
proposed by Andy a long while ago.

We express our virtual time scale factor in a u32 multiplier and shift
right and the 32bit mul_u64_u32_shr() implementation reduces to a
single 32x32->64 multiply if the time delta is still short (common
case).

For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
Reported-and-Tested-by: NChristian Engelmayer <cengelma@gmx.at>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: fweisbec@gmail.com
Cc: Paul Turner <pjt@google.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

9dbdb155

sched: Initialize power_orig for overlapping groups · 8e8339a3

由 Peter Zijlstra 提交于 12月 11, 2013

Yinghai reported that he saw a /0 in sg_capacity on his EX parts.
Make sure to always initialize power_orig now that we actually use it.

Ideally build_sched_domains() -> init_sched_groups_power() would also
initialize this; but for some yet unexplained reason some setups seem
to miss updates there.
Reported-by: NYinghai Lu <yinghai@kernel.org>
Tested-by: NYinghai Lu <yinghai@kernel.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/tip-l8ng2m9uml6fhibln8wqpom7@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

8e8339a3

05 12月, 2013 1 次提交

sched/numa: Drop idx field of task_numa_env struct · 40ea2b42

由 Wanpeng Li 提交于 12月 05, 2013

Drop unused idx field of task_numa_env struct.
Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com>
Reviewed-by: NRik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1386241817-5051-2-git-send-email-liwanp@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

40ea2b42

27 11月, 2013 6 次提交

sched: Add sched_class->task_dead() method · e6c390f2

由 Dario Faggioli 提交于 11月 07, 2013

Add a new function to the scheduling class interface. It is called
at the end of a context switch, if the prev task is in TASK_DEAD state.

It will be useful for the scheduling classes that want to be notified
when one of their tasks dies, e.g. to perform some cleanup actions,
such as SCHED_DEADLINE.
Signed-off-by: NDario Faggioli <raistlin@linux.it>
Reviewed-by: NPaul Turner <pjt@google.com>
Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
Cc: bruce.ashfield@windriver.com
Cc: claudio@evidence.eu.com
Cc: darren@dvhart.com
Cc: dhaval.giani@gmail.com
Cc: fchecconi@gmail.com
Cc: fweisbec@gmail.com
Cc: harald.gustafsson@ericsson.com
Cc: hgu1972@gmail.com
Cc: insop.song@gmail.com
Cc: jkacur@redhat.com
Cc: johan.eker@ericsson.com
Cc: liming.wang@windriver.com
Cc: luca.abeni@unitn.it
Cc: michael@amarulasolutions.com
Cc: nicola.manica@disi.unitn.it
Cc: oleg@redhat.com
Cc: paulmck@linux.vnet.ibm.com
Cc: p.faure@akatech.ch
Cc: rostedt@goodmis.org
Cc: tommaso.cucinotta@sssup.it
Cc: vincent.guittot@linaro.org
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1383831828-15501-2-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e6c390f2

sched/fair: Clean up update_sg_lb_stats() a bit · 380c9077

由 Kamalesh Babulal 提交于 11月 15, 2013

Add rq->nr_running to sgs->sum_nr_running directly instead of
assigning it through an intermediate variable nr_running.
Signed-off-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1384508212-25032-1-git-send-email-kamalesh@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

380c9077

sched/fair: Move load idx selection in find_idlest_group · c44f2a02

由 Vincent Guittot 提交于 10月 18, 2013

load_idx is used in find_idlest_group but initialized in select_task_rq_fair
even when not used. The load_idx initialisation is moved in find_idlest_group
and the sd_flag replaces it in the function's args.
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Cc: len.brown@intel.com
Cc: amit.kucheria@linaro.org
Cc: pjt@google.com
Cc: l.majewski@samsung.com
Cc: Morten.Rasmussen@arm.com
Cc: cmetcalf@tilera.com
Cc: tony.luck@intel.com
Cc: alex.shi@intel.com
Cc: preeti@linux.vnet.ibm.com
Cc: linaro-kernel@lists.linaro.org
Cc: rjw@sisk.pl
Cc: paulmck@linux.vnet.ibm.com
Cc: corbet@lwn.net
Cc: arjan@linux.intel.com
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1382097147-30088-8-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

c44f2a02

sched: Check TASK_DEAD rather than EXIT_DEAD in schedule_debug() · 192301e7

由 Oleg Nesterov 提交于 11月 13, 2013

schedule_debug() ignores in_atomic() if prev->exit_state != 0.
This is not what we want, ->exit_state is set by exit_notify()
but we should complain until the task does the last schedule()
in TASK_DEAD.

See also 7407251a "PF_DEAD cleanup", I think this ancient
commit explains why schedule() had to rely on ->exit_state,
until that commit exit_notify() disabled preemption and set
PF_DEAD which was used to detect the exiting task.
Signed-off-by: NOleg Nesterov <oleg@redhat.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: David Laight <David.Laight@ACULAB.COM>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131113154538.GB15810@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

192301e7

sched: Expose preempt_schedule_irq() · 32e475d7

由 Thomas Gleixner 提交于 11月 21, 2013

Tony reported that aa0d5326 ("ia64: Use preempt_schedule_irq")
broke PREEMPT=n builds on ia64.

Ok, wrapped my brain around it. I tripped over the magic asm foo which
has a single need_resched check and schedule point for both sys call
return and interrupt return.

So you need the schedule_preempt_irq() for kernel preemption from
interrupt return while on a normal syscall preemption a schedule would
be sufficient. But using schedule_preempt_irq() is not harmful here in
any way. It just sets the preempt_active bit also in cases where it
would not be required.

Even on preempt=n kernels adding the preempt_active bit is completely
harmless. So instead of having an extra function, moving the existing
one out of the ifdef PREEMPT looks like the sanest thing to do.

It would also allow getting rid of various other sti/schedule/cli asm
magic in other archs.
Reported-and-Tested-by: NTony Luck <tony.luck@gmail.com>
Fixes: aa0d5326 ("ia64: Use preempt_schedule_irq")
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
[slightly edited Changelog]
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1311211230030.30673@ionos.tec.linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>

32e475d7

sched: Fix a trivial syntax misuse · 39e24d8f

由 Shigeru Yoshida 提交于 11月 23, 2013

Use if statement instead of while loop.
Signed-off-by: NShigeru Yoshida <shigeru.yoshida@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <trivial@kernel.org>
Link: http://lkml.kernel.org/r/20131123.183801.769652906919404319.shigeru.yoshida@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

39e24d8f

20 11月, 2013 3 次提交

sched: Fix a trivial typo in comments · 0515973f

由 Shigeru Yoshida 提交于 11月 17, 2013

Fix a trivial typo in rq_attach_root().
Signed-off-by: NShigeru Yoshida <shigeru.yoshida@gmail.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131117.121236.1990617639803941055.shigeru.yoshida@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

0515973f

sched: Avoid NULL dereference on sd_busy · 42eb088e

由 Peter Zijlstra 提交于 11月 19, 2013

Commit 37dc6b50 ("sched: Remove unnecessary iteration over sched
domains to update nr_busy_cpus") forgot to clear 'sd_busy' under some
conditions leading to a possible NULL deref in set_cpu_sd_state_idle().
Reported-by: NAnton Blanchard <anton@samba.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131118113701.GF3866@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

42eb088e

sched: Check sched_domain before computing group power · 9abf24d4

由 Srikar Dronamraju 提交于 11月 12, 2013

After commit 863bffc8 ("sched/fair: Fix group power_orig
computation"), we can dereference rq->sd before it is set.

Fix this by falling back to power_of() in this case and add a comment
explaining things.
Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
[ Added comment and tweaked patch. ]
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: mikey@neuling.org
Link: http://lkml.kernel.org/r/20131113151718.GN21461@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

9abf24d4

13 11月, 2013 4 次提交

sched/fair: Avoid integer overflow · 85b088e9

由 Michal Nazarewicz 提交于 11月 10, 2013

sa->runnable_avg_sum is of type u32 but after shifting it by NICE_0_SHIFT
bits it is promoted to u64. This of course makes no sense, since the
result will never be more then 32-bit long. Casting sa->runnable_avg_sum
to u64 before it is shifted, fixes this problem.
Reviewed-by: NBen Segall <bsegall@google.com>
Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1384112521-25177-1-git-send-email-mpn@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

85b088e9

sched: Optimize task_sched_runtime() · 911b2898

由 Peter Zijlstra 提交于 11月 11, 2013

Large multi-threaded apps like to hit this using do_sys_times() and
then queue up on the rq->lock.

Avoid when possible.

Larry reported ~20% performance increase his test case.
Reported-by: NLarry Woodman <lwoodman@redhat.com>
Suggested-by: NPaul Turner <pjt@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20131111172925.GG26898@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

911b2898

sched/numa: Cure update_numa_stats() vs. hotplug · 5eca82a9

由 Peter Zijlstra 提交于 11月 06, 2013

Because we're completely unserialized against hotplug its well
possible to try and generate numa stats for an offlined node.

Bail out early (and avoid a /0) in this case. The resulting stats are
all 0 which should result in an undesirable balance target -- not to
mention that actually trying to migrate to an offline CPU will fail.
Reported-by: NPrarit Bhargava <prarit@redhat.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/n/tip-orja0qylcvyhxfsuebcyL5sI@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

5eca82a9

sched/numa: Fix NULL pointer dereference in task_numa_migrate() · 46a73e8a

由 Rik van Riel 提交于 11月 11, 2013

The cpusets code can split up the scheduler's domain tree into
smaller domains.  Some of those smaller domains may not cross
NUMA nodes at all, leading to a NULL pointer dereference on the
per-cpu sd_numa pointer.

Tasks cannot be migrated out of their domain, so the patch
also sets p->numa_preferred_nid to whereever they are, to
prevent the migration from being retried over and over again.
Reported-by: NPrarit Bhargava <prarit@redhat.com>
Signed-off-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Link: http://lkml.kernel.org/n/tip-oosqomw0Jput0Jkvoowhrqtu@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

46a73e8a

06 11月, 2013 5 次提交

sched: Remove unnecessary iteration over sched domains to update nr_busy_cpus · 37dc6b50

由 Preeti U Murthy 提交于 10月 30, 2013

nr_busy_cpus parameter is used by nohz_kick_needed() to find out the
number of busy cpus in a sched domain which has SD_SHARE_PKG_RESOURCES
flag set.  Therefore instead of updating nr_busy_cpus at every level
of sched domain, since it is irrelevant, we can update this parameter
only at the parent domain of the sd which has this flag set. Introduce
a per-cpu parameter sd_busy which represents this parent domain.

In nohz_kick_needed() we directly query the nr_busy_cpus parameter
associated with the groups of sd_busy.

By associating sd_busy with the highest domain which has
SD_SHARE_PKG_RESOURCES flag set, we cover all lower level domains
which could have this flag set and trigger nohz_idle_balancing if any
of the levels have more than one busy cpu.

sd_busy is irrelevant for asymmetric load balancing. However sd_asym
has been introduced to represent the highest sched domain which has
SD_ASYM_PACKING flag set so that it can be queried directly when
required.

While we are at it, we might as well change the nohz_idle parameter to
be updated at the sd_busy domain level alone and not the base domain
level of a CPU.  This will unify the concept of busy cpus at just one
level of sched domain where it is currently used.

Signed-off-by: Preeti U Murthy<preeti@linux.vnet.ibm.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: svaidy@linux.vnet.ibm.com
Cc: vincent.guittot@linaro.org
Cc: bitbucket@online.de
Cc: benh@kernel.crashing.org
Cc: anton@samba.org
Cc: Morten.Rasmussen@arm.com
Cc: pjt@google.com
Cc: peterz@infradead.org
Cc: mikey@neuling.org
Link: http://lkml.kernel.org/r/20131030031252.23426.4417.stgit@preeti.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

37dc6b50

sched: Fix asymmetric scheduling for POWER7 · 2042abe7

由 Vaidyanathan Srinivasan 提交于 10月 30, 2013

Asymmetric scheduling within a core is a scheduler loadbalancing
feature that is triggered when SD_ASYM_PACKING flag is set.  The goal
for the load balancer is to move tasks to lower order idle SMT threads
within a core on a POWER7 system.

In nohz_kick_needed(), we intend to check if our sched domain (core)
is completely busy or we have idle cpu.

The following check for SD_ASYM_PACKING:

    (cpumask_first_and(nohz.idle_cpus_mask, sched_domain_span(sd)) < cpu)

already covers the case of checking if the domain has an idle cpu,
because cpumask_first_and() will not yield any set bits if this domain
has no idle cpu.

Hence, nr_busy check against group weight can be removed.
Reported-by: NMichael Neuling <michael.neuling@au1.ibm.com>
Signed-off-by: NVaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
Tested-by: NMichael Neuling <mikey@neuling.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: vincent.guittot@linaro.org
Cc: bitbucket@online.de
Cc: benh@kernel.crashing.org
Cc: anton@samba.org
Cc: Morten.Rasmussen@arm.com
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131030031242.23426.13019.stgit@preeti.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

2042abe7

sched: Move completion code from core.c to completion.c · b8a21626

由 Peter Zijlstra 提交于 10月 04, 2013

Completions already have their own header file: linux/completion.h
Move the implementation out of kernel/sched/core.c and into its own
file: kernel/sched/completion.c.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-x2y49rmxu5dljt66ai2lcfuw@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

b8a21626

sched: Move wait code from core.c to wait.c · b4145872

由 Peter Zijlstra 提交于 10月 04, 2013

For some reason only the wait part of the wait api lives in
kernel/sched/wait.c and the wake part still lives in kernel/sched/core.c;
ammend this.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-ftycee88naznulqk7ei5mbci@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

b4145872

sched: Move wait.c into kernel/sched/ · 7a6354e2

由 Peter Zijlstra 提交于 10月 31, 2013

Suggested-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-5q5yqvdaen0rmapwloeaotx3@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>

7a6354e2

29 10月, 2013 5 次提交

sched: Avoid throttle_cfs_rq() racing with period_timer stopping · f9f9ffc2

由 Ben Segall 提交于 10月 16, 2013

throttle_cfs_rq() doesn't check to make sure that period_timer is running,
and while update_curr/assign_cfs_runtime does, a concurrently running
period_timer on another cpu could cancel itself between this cpu's
update_curr and throttle_cfs_rq(). If there are no other cfs_rqs running
in the tg to restart the timer, this causes the cfs_rq to be stranded
forever.

Fix this by calling __start_cfs_bandwidth() in throttle if the timer is
inactive.

(Also add some sched_debug lines for cfs_bandwidth.)

Tested: make a run/sleep task in a cgroup, loop switching the cgroup
between 1ms/100ms quota and unlimited, checking for timer_active=0 and
throttled=1 as a failure. With the throttle_cfs_rq() change commented out
this fails, with the full patch it passes.
Signed-off-by: NBen Segall <bsegall@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181632.22647.84174.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

f9f9ffc2

sched: Guarantee new group-entities always have weight · 0ac9b1c2

由 Paul Turner 提交于 10月 16, 2013

Currently, group entity load-weights are initialized to zero. This
admits some races with respect to the first time they are re-weighted in
earlty use. ( Let g[x] denote the se for "g" on cpu "x". )

Suppose that we have root->a and that a enters a throttled state,
immediately followed by a[0]->t1 (the only task running on cpu[0])
blocking:

  put_prev_task(group_cfs_rq(a[0]), t1)
  put_prev_entity(..., t1)
  check_cfs_rq_runtime(group_cfs_rq(a[0]))
  throttle_cfs_rq(group_cfs_rq(a[0]))

Then, before unthrottling occurs, let a[0]->b[0]->t2 wake for the first
time:

  enqueue_task_fair(rq[0], t2)
  enqueue_entity(group_cfs_rq(b[0]), t2)
  enqueue_entity_load_avg(group_cfs_rq(b[0]), t2)
  account_entity_enqueue(group_cfs_ra(b[0]), t2)
  update_cfs_shares(group_cfs_rq(b[0]))
  < skipped because b is part of a throttled hierarchy >
  enqueue_entity(group_cfs_rq(a[0]), b[0])
  ...

We now have b[0] enqueued, yet group_cfs_rq(a[0])->load.weight == 0
which violates invariants in several code-paths. Eliminate the
possibility of this by initializing group entity weight.
Signed-off-by: NPaul Turner <pjt@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131016181627.22647.47543.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

0ac9b1c2

sched: Fix hrtimer_cancel()/rq->lock deadlock · 927b54fc

由 Ben Segall 提交于 10月 16, 2013

__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
waiting for the hrtimer to finish. However, if sched_cfs_period_timer
runs for another loop iteration, the hrtimer can attempt to take
rq->lock, resulting in deadlock.

Fix this by ensuring that cfs_b->timer_active is cleared only if the
_latest_ call to do_sched_cfs_period_timer is returning as idle. Then
__start_cfs_bandwidth can just call hrtimer_try_to_cancel and wait for
that to succeed or timer_active == 1.
Signed-off-by: NBen Segall <bsegall@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181622.22647.16643.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

927b54fc

sched: Fix cfs_bandwidth misuse of hrtimer_expires_remaining · db06e78c

由 Ben Segall 提交于 10月 16, 2013

hrtimer_expires_remaining does not take internal hrtimer locks and thus
must be guarded against concurrent __hrtimer_start_range_ns (but
returning HRTIMER_RESTART is safe). Use cfs_b->lock to make it safe.
Signed-off-by: NBen Segall <bsegall@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181617.22647.73829.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

db06e78c

sched: Fix race on toggling cfs_bandwidth_used · 1ee14e6c

由 Ben Segall 提交于 10月 16, 2013

When we transition cfs_bandwidth_used to false, any currently
throttled groups will incorrectly return false from cfs_rq_throttled.
While tg_set_cfs_bandwidth will unthrottle them eventually, currently
running code (including at least dequeue_task_fair and
distribute_cfs_runtime) will cause errors.

Fix this by turning off cfs_bandwidth_used only after unthrottling all
cfs_rqs.

Tested: toggle bandwidth back and forth on a loaded cgroup. Caused
crashes in minutes without the patch, hasn't crashed with it.
Signed-off-by: NBen Segall <bsegall@google.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: pjt@google.com
Link: http://lkml.kernel.org/r/20131016181611.22647.80365.stgit@sword-of-the-dawn.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

1ee14e6c

28 10月, 2013 1 次提交

sched: Remove extra put_online_cpus() inside sched_setaffinity() · ac9ff799

由 Michael wang 提交于 10月 28, 2013

Commit 6acce3ef:

	sched: Remove get_online_cpus() usage

has left one extra put_online_cpus() inside sched_setaffinity(),
remove it to fix the WARN:

   ------------[ cut here ]------------
   WARNING: CPU: 0 PID: 3166 at kernel/cpu.c:84 put_online_cpus+0x43/0x70()
   ...
   [<ffffffff810c3fef>] put_online_cpus+0x43/0x70 [
   [<ffffffff810efd59>] sched_setaffinity+0x7d/0x1f9 [
   ...
Reported-by: NFengguang Wu <fengguang.wu@intel.com>
Tested-by: NFengguang Wu <fengguang.wu@intel.com>
Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/526DD0EE.1090309@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

ac9ff799

26 10月, 2013 1 次提交

sched/rt: Fix task_tick_rt() comment · e9aa39bb

由 Li Bin 提交于 10月 21, 2013

This issue was introduced by 454c7999 ("sched/rt: Fix SCHED_RR
across cgroups") that missed the word 'not'. Fix it.
Signed-off-by: NLi Bin <huawei.libin@huawei.com>
Cc: <guohanjun@huawei.com>
Cc: <xiexiuqi@huawei.com>
Cc: <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1382357743-54136-1-git-send-email-huawei.libin@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

e9aa39bb

16 10月, 2013 3 次提交

sched: Remove get_online_cpus() usage · 6acce3ef

由 Peter Zijlstra 提交于 10月 11, 2013

Remove get_online_cpus() usage from the scheduler; there's 4 sites that
use it:

 - sched_init_smp(); where its completely superfluous since we're in
   'early' boot and there simply cannot be any hotplugging.

 - sched_getaffinity(); we already take a raw spinlock to protect the
   task cpus_allowed mask, this disables preemption and therefore
   also stabilizes cpu_online_mask as that's modified using
   stop_machine. However switch to active mask for symmetry with
   sched_setaffinity()/set_cpus_allowed_ptr(). We guarantee active
   mask stability by inserting sync_rcu/sched() into _cpu_down.

 - sched_setaffinity(); we don't appear to need get_online_cpus()
   either, there's two sites where hotplug appears relevant:
    * cpuset_cpus_allowed(); for the !cpuset case we use possible_mask,
      for the cpuset case we hold task_lock, which is a spinlock and
      thus for mainline disables preemption (might cause pain on RT).
    * set_cpus_allowed_ptr(); Holds all scheduler locks and thus has
      preemption properly disabled; also it already deals with hotplug
      races explicitly where it releases them.

 - migrate_swap(); we can make stop_two_cpus() do the heavy lifting for
   us with a little trickery. By adding a sync_sched/rcu() after the
   CPU_DOWN_PREPARE notifier we can provide preempt/rcu guarantees for
   cpu_active_mask. Use these to validate that both our cpus are active
   when queueing the stop work before we queue the stop_machine works
   for take_cpu_down().
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Link: http://lkml.kernel.org/r/20131011123820.GV3081@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

6acce3ef

sched: Fix race in migrate_swap_stop() · 74602315

由 Peter Zijlstra 提交于 10月 10, 2013

There is a subtle race in migrate_swap, when task P, on CPU A, decides to swap
places with task T, on CPU B.

Task P:
  - call migrate_swap
Task T:
  - go to sleep, removing itself from the runqueue
Task P:
  - double lock the runqueues on CPU A & B
Task T:
  - get woken up, place itself on the runqueue of CPU C
Task P:
  - see that task T is on a runqueue, and pretend to remove it
    from the runqueue on CPU B

Now CPUs B & C both have corrupted scheduler data structures.

This patch fixes it, by holding the pi_lock for both of the tasks
involved in the migrate swap. This prevents task T from waking up,
and placing itself onto another runqueue, until after migrate_swap
has released all locks.

This means that, when migrate_swap checks, task T will be either
on the runqueue where it was originally seen, or not on any
runqueue at all. Migrate_swap deals correctly with of those cases.
Tested-by: NJoe Mario <jmario@redhat.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Reviewed-by: NRik van Riel <riel@redhat.com>
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: hannes@cmpxchg.org
Cc: aarcange@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: tglx@linutronix.de
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/20131010181722.GO13848@laptop.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

74602315

sched/rt: Add missing rmb() · 7c3f2ab7

由 Peter Zijlstra 提交于 10月 15, 2013

While discussing the proposed SCHED_DEADLINE patches which in parts
mimic the existing FIFO code it was noticed that the wmb in
rt_set_overloaded() didn't have a matching barrier.

The only site using rt_overloaded() to test the rto_count is
pull_rt_task() and we should issue a matching rmb before then assuming
there's an rto_mask bit set.

Without that smp_rmb() in there we could actually miss seeing the
rto_mask bit.

Also, change to using smp_[wr]mb(), even though this is SMP only code;
memory barriers without smp_ always make me think they're against
hardware of some sort.
Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
Cc: vincent.guittot@linaro.org
Cc: luca.abeni@unitn.it
Cc: bruce.ashfield@windriver.com
Cc: dhaval.giani@gmail.com
Cc: rostedt@goodmis.org
Cc: hgu1972@gmail.com
Cc: oleg@redhat.com
Cc: fweisbec@gmail.com
Cc: darren@dvhart.com
Cc: johan.eker@ericsson.com
Cc: p.faure@akatech.ch
Cc: paulmck@linux.vnet.ibm.com
Cc: raistlin@linux.it
Cc: claudio@evidence.eu.com
Cc: insop.song@gmail.com
Cc: michael@amarulasolutions.com
Cc: liming.wang@windriver.com
Cc: fchecconi@gmail.com
Cc: jkacur@redhat.com
Cc: tommaso.cucinotta@sssup.it
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: harald.gustafsson@ericsson.com
Cc: nicola.manica@disi.unitn.it
Cc: tglx@linutronix.de
Link: http://lkml.kernel.org/r/20131015103507.GF10651@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>

7c3f2ab7

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功