- 09 3月, 2015 1 次提交
-
-
由 Frederic Weisbecker 提交于
Current context tracking symbols are designed to express living state. As such they are prefixed with "IN_": IN_USER, IN_KERNEL. Now we are going to use these symbols to also express state transitions such as context_tracking_enter(IN_USER) or context_tracking_exit(IN_USER). But while the "IN_" prefix works well to express entering a context, it's confusing to depict a context exit: context_tracking_exit(IN_USER) could mean two things: 1) We are exiting the current context to enter user context. 2) We are exiting the user context We want 2) but the reviewer may be confused and understand 1) So lets disambiguate these symbols and rename them to CONTEXT_USER and CONTEXT_KERNEL. Acked-by: NRik van Riel <riel@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Will deacon <will.deacon@arm.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
-
- 18 2月, 2015 9 次提交
-
-
由 Peter Zijlstra 提交于
Setting the root group's cpu.rt_runtime_us to 0 is a bad thing; it would disallow the kernel creating RT tasks. One can of course still set it to 1, which will (likely) still wreck your kernel, but at least make it clear that setting it to 0 is not good. Collect both sanity checks into the one place while we're there. Suggested-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150209112715.GO24151@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Because task_group() uses a cache of autogroup_task_group(), whose output depends on sched_class, switching classes can generate problems. In particular, when started as fair, the cache points to the autogroup, so when switching to RT the tg_rt_schedulable() test fails for every cpu.rt_{runtime,period}_us change because now the autogroup has tasks and no runtime. Furthermore, going back to the previous semantics of varying task_group() with sched_class has the down-side that the sched_debug output varies as well, even though the task really is in the autogroup. Therefore add an autogroup exception to tg_has_rt_tasks() -- such that both (all) task_group() usages in sched/core now have one. And remove all the remnants of the variable task_group() output. Reported-by: NZefan Li <lizefan@huawei.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Stefan Bader <stefan.bader@canonical.com> Fixes: 8323f26c ("sched: Fix race in task_group()") Link: http://lkml.kernel.org/r/20150209112237.GR5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
update_curr_dl() needs actual rq clock. Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1423040972.18770.10.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 NeilBrown 提交于
io_schedule() calls blk_flush_plug() which, depending on the contents of current->plug, can initiate arbitrary blk-io requests. Note that this contrasts with blk_schedule_flush_plug() which requires all non-trivial work to be handed off to a separate thread. This makes it possible for io_schedule() to recurse, and initiating block requests could possibly call mempool_alloc() which, in times of memory pressure, uses io_schedule(). Apart from any stack usage issues, io_schedule() will not behave correctly when called recursively as delayacct_blkio_start() does not allow for repeated calls. So: - use ->in_iowait to detect recursion. Set it earlier, and restore it to the old value. - move the call to "raw_rq" after the call to blk_flush_plug(). As this is some sort of per-cpu thing, we want some chance that we are on the right CPU - When io_schedule() is called recurively, use blk_schedule_flush_plug() which cannot further recurse. - as this makes io_schedule() a lot more complex and as io_schedule() must match io_schedule_timeout(), but all the changes in io_schedule_timeout() and make io_schedule a simple wrapper for that. Signed-off-by: NNeilBrown <neilb@suse.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> [ Moved the now rudimentary io_schedule() into sched.h. ] Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tony Battersby <tonyb@cybernetics.com> Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brownSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Oleg Nesterov 提交于
Commit de30ec47 "Remove unnecessary ->wait.lock serialization when reading completion state" was not correct, without lock/unlock the code like stop_machine_from_inactive_cpu() while (!completion_done()) cpu_relax(); can return before complete() finishes its spin_unlock() which writes to this memory. And spin_unlock_wait(). While at it, change try_wait_for_completion() to use READ_ONCE(). Reported-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Reported-by: NDavidlohr Bueso <dave@stgolabs.net> Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> [ Added a comment with the barrier. ] Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: raghavendra.kt@linux.vnet.ibm.com Cc: waiman.long@hp.com Fixes: de30ec47 ("sched/completion: Remove unnecessary ->wait.lock serialization when reading completion state") Link: http://lkml.kernel.org/r/20150212195913.GA30430@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Frederic Weisbecker 提交于
Since the function graph tracer needs to disable preemption, it might call preempt_schedule() after reenabling it if something triggered the need for rescheduling in between. Therefore we can't trace preempt_schedule() itself because we would face a function tracing recursion otherwise as the tracer is always called before PREEMPT_ACTIVE gets set to prevent that recursion. This is why preempt_schedule() is tagged as "notrace". But the same issue applies to every function called by preempt_schedule() before PREEMPT_ACTIVE is actually set. And preempt_schedule_common() is one such example. Unfortunately we forgot to tag it as notrace as well and as a result we are encountering tracing recursion since it got introduced by: a18b5d01 ("sched: Fix missing preemption opportunity") Let's fix that by applying the appropriate function tag to preempt_schedule_common(). Reported-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NSteven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1424110807-15057-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
A deadline task may be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; Later the timer fires, but the task is still dequeued: dl_task_timer() enqueue_task_dl() /* queues on dl_rq; on_rq remains 0 */ Someone wakes it up: try_to_wake_up() enqueue_dl_entity() BUG_ON(on_dl_rq()) Patch fixes this problem, it prevents queueing !on_rq tasks on dl_rq. Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> [ Wrote comment. ] Cc: Juri Lelli <juri.lelli@arm.com> Fixes: 1019a359 ("sched/deadline: Fix stale yield state") Link: http://lkml.kernel.org/r/1374601424090314@web4j.yandex.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e80 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
There was a wee bit of confusion around the exact ordering here; clarify things. Reported-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20150217121258.GM5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 2月, 2015 2 次提交
-
-
由 Tejun Heo 提交于
printk and friends can now format bitmaps using '%*pb[l]'. cpumask and nodemask also provide cpumask_pr_args() and nodemask_pr_args() respectively which can be used to generate the two printf arguments necessary to format the specified cpu/nodemask. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rafael J. Wysocki 提交于
In preparation for adding support for quiescing timers in the final stage of suspend-to-idle transitions, rework the freeze_enter() function making the system wait on a wakeup event, the freeze_wake() function terminating the suspend-to-idle loop and the mechanism by which deep idle states are entered during suspend-to-idle. First of all, introduce a simple state machine for suspend-to-idle and make the code in question use it. Second, prevent freeze_enter() from losing wakeup events due to race conditions and ensure that the number of online CPUs won't change while it is being executed. In addition to that, make it force all of the CPUs re-enter the idle loop in case they are in idle states already (so they can enter deeper idle states if possible). Next, drop cpuidle_use_deepest_state() and replace use_deepest_state checks in cpuidle_select() and cpuidle_reflect() with a single suspend-to-idle state check in cpuidle_idle_call(). Finally, introduce cpuidle_enter_freeze() that will simply find the deepest idle state available to the given CPU and enter it using cpuidle_enter(). Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
-
- 13 2月, 2015 1 次提交
-
-
由 Cyril Bur 提交于
When the hypervisor pauses a virtualised kernel the kernel will observe a jump in timebase, this can cause spurious messages from the softlockup detector. Whilst these messages are harmless, they are accompanied with a stack trace which causes undue concern and more problematically the stack trace in the guest has nothing to do with the observed problem and can only be misleading. Futhermore, on POWER8 this is completely avoidable with the introduction of the Virtual Time Base (VTB) register. This patch (of 2): This permits the use of arch specific clocks for which virtualised kernels can use their notion of 'running' time, not the elpased wall time which will include host execution time. Signed-off-by: NCyril Bur <cyrilbur@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Andrew Jones <drjones@redhat.com> Acked-by: NDon Zickus <dzickus@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: chai wen <chaiw.fnst@cn.fujitsu.com> Cc: Fabian Frederick <fabf@skynet.be> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Ben Zhang <benzh@chromium.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 2月, 2015 9 次提交
-
-
由 Nicholas Mc Guire 提交于
The "thread would block" case can be checked without grabbing ->wait.lock. [ If the check does not return early then grab the lock and recheck. A memory barrier is not needed as complete() and complete_all() imply a barrier. The ACCESS_ONCE() is needed for calls in a loop that, if inlined, could optimize out the re-fetching of x->done. ] Signed-off-by: NNicholas Mc Guire <der.herr@hofr.at> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1422013307-13200-1-git-send-email-der.herr@hofr.atSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Nicholas Mc Guire 提交于
Signed-off-by: NNicholas Mc Guire <der.herr@hofr.at> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1421467534-22834-1-git-send-email-der.herr@hofr.atSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Frederic Weisbecker 提交于
__schedule() disables preemption during its job and re-enables it afterward without doing a preemption check to avoid recursion. But if an event happens after the context switch which requires rescheduling, we need to check again if a task of a higher priority needs the CPU. A preempt irq can raise such a situation. To handle that, __schedule() loops on need_resched(). But preempt_schedule_*() functions, which call __schedule(), also loop on need_resched() to handle missed preempt irqs. Hence we end up with the same loop happening twice. Lets simplify that by attributing the need_resched() loop responsibility to all __schedule() callers. There is a risk that the outer loop now handles reschedules that used to be handled by the inner loop with the added overhead of caller details (inc/dec of PREEMPT_ACTIVE, irq save/restore) but assuming those inner rescheduling loop weren't too frequent, this shouldn't matter. Especially since the whole preemption path is now losing one loop in any case. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1422404652-29067-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Xunlei Pang 提交于
cpu_active_mask is rarely changed (only on hotplug), so remove this operation to gain a little performance. If there is a change in cpu_active_mask, rq_online_dl() and rq_offline_dl() should take care of it normally, so cpudl::free_cpus carries enough information for us. For the rare case when a task is put onto a dying cpu (which rq_offline_dl() can't handle in a timely fashion), it will be handled through _cpu_down()->...->multi_cpu_stop()->migration_call() ->migrate_tasks(), preventing the task from hanging on the dead cpu. Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: NXunlei Pang <pang.xunlei@linaro.org> [peterz: changelog] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1421642980-10045-2-git-send-email-pang.xunlei@linaro.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
The commit 177ef2a6 ("sched/deadline: Fix a precision problem in the microseconds range") forgot to change the UP version of hrtick_start(), do so now. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Fixes: 177ef2a6 ("sched/deadline: Fix a precision problem in the microseconds range") [ Fixed the changelog. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1416962647-76792-7-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
There is no need to dequeue/enqueue and push/pull if there are no scheduling parameters changed for the DL class. Both fair and RT classes already check if parameters changed for them to avoid unnecessary overhead. This patch add the parameters changed test for the DL class in order to reduce overhead. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> [ Fixed up the changelog. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1416962647-76792-5-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
When we fail to start the deadline timer in update_curr_dl(), we forget to clear ->dl_yielded, resulting in wrecked time keeping. Since the natural place to clear both ->dl_yielded and ->dl_throttled is in replenish_dl_entity(); both are after all waiting for that event; make it so. Luckily since 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()") the task_on_rq_queued() condition in dl_task_timer() must be true, and can therefore call enqueue_task_dl() unconditionally. Reported-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1416962647-76792-4-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
After update_curr_dl() the current task might not be the leftmost task anymore. In that case do not start a new hrtick for it. In this case NEED_RESCHED will be set and the next schedule will start the hrtick for the new task if and when appropriate. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Acked-by: NJuri Lelli <juri.lelli@arm.com> [ Rewrote the changelog and comment. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1416962647-76792-2-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Commit 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()") removed the hrtimer_try_cancel() function call out from init_dl_task_timer(), which gets called from __setparam_dl(). The result is that we can now re-init the timer while its active -- this is bad and corrupts timer state. Furthermore; changing the parameters of an active deadline task is tricky in that you want to maintain guarantees, while immediately effective change would allow one to circumvent the CBS guarantees -- this too is bad, as one (bad) task should not be able to affect the others. Rework things to avoid both problems. We only need to initialize the timer once, so move that to __sched_fork() for new tasks. Then make sure __setparam_dl() doesn't affect the current running state but only updates the parameters used to calculate the next scheduling period -- this guarantees the CBS functions as expected (albeit slightly pessimistic). This however means we need to make sure __dl_clear_params() needs to reset the active state otherwise new (and tasks flipping between classes) will not properly (re)compute their first instance. Todo: close class flipping CBS hole. Todo: implement delayed BW release. Reported-by: NLuca Abeni <luca.abeni@unitn.it> Acked-by: NJuri Lelli <juri.lelli@arm.com> Tested-by: NLuca Abeni <luca.abeni@unitn.it> Fixes: 67dfa1b7 ("sched/deadline: Implement cancel_dl_timer() to use in switched_from_dl()") Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150128140803.GF23038@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 2月, 2015 1 次提交
-
-
由 Linus Torvalds 提交于
Commit 8eb23b9f ("sched: Debug nested sleeps") added code to report on nested sleep conditions, which we generally want to avoid because the inner sleeping operation can re-set the thread state to TASK_RUNNING, but that will then cause the outer sleep loop not actually sleep when it calls schedule. However, that's actually valid traditional behavior, with the inner sleep being some fairly rare case (like taking a sleeping lock that normally doesn't actually need to sleep). And the debug code would actually change the state of the task to TASK_RUNNING internally, which makes that kind of traditional and working code not work at all, because now the nested sleep doesn't just sometimes cause the outer one to not block, but will cause it to happen every time. In particular, it will cause the cardbus kernel daemon (pccardd) to basically busy-loop doing scheduling, converting a laptop into a heater, as reported by Bruno Prémont. But there may be other legacy uses of that nested sleep model in other drivers that are also likely to never get converted to the new model. This fixes both cases: - don't set TASK_RUNNING when the nested condition happens (note: even if WARN_ONCE() only _warns_ once, the return value isn't whether the warning happened, but whether the condition for the warning was true. So despite the warning only happening once, the "if (WARN_ON(..))" would trigger for every nested sleep. - in the cases where we knowingly disable the warning by using "sched_annotate_sleep()", don't change the task state (that is used for all core scheduling decisions), instead use '->task_state_change' that is used for the debugging decision itself. (Credit for the second part of the fix goes to Oleg Nesterov: "Can't we avoid this subtle change in behaviour DEBUG_ATOMIC_SLEEP adds?" with the suggested change to use 'task_state_change' as part of the test) Reported-and-bisected-by: NBruno Prémont <bonbons@linux-vserver.org> Tested-by: NRafael J Wysocki <rjw@rjwysocki.net> Acked-by: NOleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>, Cc: Ilya Dryomov <ilya.dryomov@inktank.com>, Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Hurley <peter@hurleysoftware.com>, Cc: Davidlohr Bueso <dave@stgolabs.net>, Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 1月, 2015 4 次提交
-
-
由 Xunlei Pang 提交于
Currently, cpudl::free_cpus contains all CPUs during init, see cpudl_init(). When calling cpudl_find(), we have to add rd->span to avoid selecting the cpu outside the current root domain, because cpus_allowed cannot be depended on when performing clustered scheduling using the cpuset, see find_later_rq(). This patch adds cpudl_set_freecpu() and cpudl_clear_freecpu() for changing cpudl::free_cpus when doing rq_online_dl()/rq_offline_dl(), so we can avoid the rd->span operation when calling cpudl_find() in find_later_rq(). Signed-off-by: NXunlei Pang <pang.xunlei@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1421642980-10045-1-git-send-email-pang.xunlei@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Preeti U Murthy 提交于
cpu_idle_poll() is entered into when either the cpu_idle_force_poll is set or tick_check_broadcast_expired() returns true. The exit condition from cpu_idle_poll() is tif_need_resched(). However this does not take into account scenarios where cpu_idle_force_poll changes or tick_check_broadcast_expired() returns false, without setting the resched flag. So a cpu will be caught in cpu_idle_poll() needlessly, thereby wasting power. Add an explicit check on cpu_idle_force_poll and tick_check_broadcast_expired() to the exit condition of cpu_idle_poll() to avoid this. Signed-off-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20150121105655.15279.59626.stgit@preeti.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Frederic Weisbecker 提交于
If an interrupt fires in cond_resched(), between the call to __schedule() and the PREEMPT_ACTIVE count decrementation, and that interrupt sets TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored due to the PREEMPT_ACTIVE count. This kind of scenario, with irq preemption being delayed because it's interrupting a preempt-disabled area, is usually fixed up after preemption is re-enabled back with an explicit call to preempt_schedule(). This is what preempt_enable() does but a raw preempt count decrement as performed by __preempt_count_sub(PREEMPT_ACTIVE) doesn't handle delayed preemption check. Therefore when such a race happens, the rescheduling is going to be delayed until the next scheduler or preemption entrypoint. This can be a problem for scheduler latency sensitive workloads. Lets fix that by consolidating cond_resched() with preempt_schedule() internals. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Reported-by: NIngo Molnar <mingo@kernel.org> Original-patch-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1421946484-9298-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tim Chen 提交于
This patch adds checks that prevens futile attempts to move rt tasks to a CPU with active tasks of equal or higher priority. This reduces run queue lock contention and improves the performance of a well known OLTP benchmark by 0.7%. Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Shawn Bohrer <sbohrer@rgmadvisors.com> Cc: Suruchi Kadu <suruchi.a.kadu@intel.com> Cc: Doug Nelson<doug.nelson@intel.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1421430374.2399.27.camel@schen9-desk2.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 29 1月, 2015 1 次提交
-
-
由 Heiko Carstens 提交于
If the kernel is compiled with function tracer support the -pg compile option is passed to gcc to generate extra code into the prologue of each function. This patch replaces the "open-coded" -pg compile flag with a CC_FLAGS_FTRACE makefile variable which architectures can override if a different option should be used for code generation. Acked-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
-
- 28 1月, 2015 2 次提交
-
-
由 Mike Galbraith 提交于
While creating an exclusive cpuset, we passed cpuset_cpumask_can_shrink() an empty cpumask (cur), and dl_bw_of(cpumask_any(cur)) made boom with it: CPU: 0 PID: 6942 Comm: shield.sh Not tainted 3.19.0-master #19 Hardware name: MEDIONPC MS-7502/MS-7502, BIOS 6.00 PG 12/26/2007 task: ffff880224552450 ti: ffff8800caab8000 task.ti: ffff8800caab8000 RIP: 0010:[<ffffffff81073846>] [<ffffffff81073846>] cpuset_cpumask_can_shrink+0x56/0xb0 [...] Call Trace: [<ffffffff810cb82a>] validate_change+0x18a/0x200 [<ffffffff810cc877>] cpuset_write_resmask+0x3b7/0x720 [<ffffffff810c4d58>] cgroup_file_write+0x38/0x100 [<ffffffff811d953a>] kernfs_fop_write+0x12a/0x180 [<ffffffff8116e1a3>] vfs_write+0xb3/0x1d0 [<ffffffff8116ed06>] SyS_write+0x46/0xb0 [<ffffffff8159ced6>] system_call_fastpath+0x16/0x1b Signed-off-by: NMike Galbraith <umgwanakikbuti@gmail.com> Acked-by: NZefan Li <lizefan@huawei.com> Fixes: f82f8042 ("sched/deadline: Ensure that updates to exclusive cpusets don't break AC") Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1422417235.5716.5.camel@marge.simpson.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Jan Beulich 提交于
At least some gcc versions - validly afaict - warn about potentially using max_group uninitialized: There's no way the compiler can prove that the body of the conditional where it and max_faults get set/ updated gets executed; in fact, without knowing all the details of other scheduler code, I can't prove this either. Generally the necessary change would appear to be to clear max_group prior to entering the inner loop, and break out of the outer loop when it ends up being all clear after the inner one. This, however, seems inefficient, and afaict the same effect can be achieved by exiting the outer loop when max_faults is still zero after the inner loop. [ mingo: changed the solution to zero initialization: uninitialized_var() needs to die, as it's an actively dangerous construct: if in the future a known-proven-good piece of code is changed to have a true, buggy uninitialized variable, the compiler warning is then supressed... The better long term solution is to clean up the code flow, so that even simple minded compilers (and humans!) are able to read it without getting a headache. ] Signed-off-by: NJan Beulich <jbeulich@suse.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 1月, 2015 9 次提交
-
-
由 Peter Zijlstra (Intel) 提交于
Both Linus (most recent) and Steve (a while ago) reported that perf related callbacks have massive stack bloat. The problem is that software events need a pt_regs in order to properly report the event location and unwind stack. And because we could not assume one was present we allocated one on stack and filled it with minimal bits required for operation. Now, pt_regs is quite large, so this is undesirable. Furthermore it turns out that most sites actually have a pt_regs pointer available, making this even more onerous, as the stack space is pointless waste. This patch addresses the problem by observing that software events have well defined nesting semantics, therefore we can use static per-cpu storage instead of on-stack. Linus made the further observation that all but the scheduler callers of perf_sw_event() have a pt_regs available, so we change the regular perf_sw_event() to require a valid pt_regs (where it used to be optional) and add perf_sw_event_sched() for the scheduler. We have a scheduler specific call instead of a more generic _noregs() like construct because we can assume non-recursion from the scheduler and thereby simplify the code further (_noregs would have to put the recursion context call inline in order to assertain which __perf_regs element to use). One last note on the implementation of perf_trace_buf_prepare(); we allow .regs = NULL for those cases where we already have a pt_regs pointer available and do not need another. Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Reported-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Javi Merino <javi.merino@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Petr Mladek <pmladek@suse.cz> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tom Zanussi <tom.zanussi@linux.intel.com> Cc: Vaibhav Nagarnaik <vnagarnaik@google.com> Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We seem to have forgotten adding it to the debug output like forever... do so now. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150105103554.495253233@infradead.org Cc: umgwanakikbuti@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
The original purpose of rq::skip_clock_update was to avoid 'costly' clock updates for back to back wakeup-preempt pairs. The big problem with it has always been that the rq variable is unaware of the context and causes indiscrimiate clock skips. Rework the entire thing and create a sense of context by only allowing schedule() to skip clock updates. (XXX can we measure the cost of the added store?) By ensuring only schedule can ever skip an update, we guarantee we're never more than 1 tick behind on the update. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
rq->clock{,_task} are serialized by rq->lock, verify this. One immediate fail is the usage in scale_rt_capability, so 'annotate' that for now, there's more 'funny' there. Maybe change rq->lock into a raw_seqlock_t? (Only 32-bit is affected) Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yao Dongdong 提交于
Search all usage of p->sched_class in sched/core.c, no one check it before use, so it seems that every task must belong to one sched_class. Signed-off-by: NYao Dongdong <yaodongdong@huawei.com> [ Moved the early class assignment to make it boot. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1419835303-28958-1-git-send-email-yaodongdong@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
Child has the same decay_count as parent. If it's not zero, we add it to parent's cfs_rq->removed_load: wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair(). Child's load is a just garbade after copying of parent, it hasn't been on cfs_rq yet, and it must not be added to cfs_rq::removed_load in migrate_task_rq_fair(). The patch moves sched_entity::avg::decay_count intialization in sched_fork(). So, migrate_task_rq_fair() does not change removed_load. Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tetsuo Handa 提交于
"struct task_struct"->state is "volatile long" and __ffs() warns that "Undefined if no bit exists, so code should check against 0 first." Therefore, at expression state = p->state ? __ffs(p->state) + 1 : 0; in sched_show_task(), CPU might see "p->state" before "?" as "non-zero" but "p->state" after "?" as "zero", which could result in "state >= sizeof(stat_nam)" being true and bogus '?' is printed. This patch changes "state" from "unsigned int" to "unsigned long" and save "p->state" before calling __ffs(), in order to avoid potential call to __ffs(0). Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412052131.GCE35924.FVHFOtLOJOMQFS@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Eric Sandeen 提交于
Sometimes a "BUG: sleeping function called from invalid context" message is not indicative of locking problems, but is the result of a stack overflow corrupting the thread info. Witness http://oss.sgi.com/archives/xfs/2014-02/msg00325.html for example, which took a few go-rounds to sort out. If we're printing the warning, things are wonky already, and it'd be informative to check for the stack end corruption at this point, too. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/5490B158.4060005@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Xunlei Pang 提交于
In __synchronize_entity_decay(), if "decays" happens to be zero, se->avg.decay_count will not be zeroed, holding the positive value assigned when dequeued last time. This is problematic in the following case: If this runnable task is CFS-balanced to other CPUs soon afterwards, migrate_task_rq_fair() will treat it as a blocked task due to its non-zero decay_count, thereby adding its load to cfs_rq->removed_load wrongly. Thus, we must zero se->avg.decay_count in this case as well. Signed-off-by: NXunlei Pang <pang.xunlei@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418745509-2609-1-git-send-email-pang.xunlei@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 1月, 2015 1 次提交
-
-
由 Tetsuo Handa 提交于
When alloc_fair_sched_group() in sched_create_group() fails, free_sched_group() is called, and free_fair_sched_group() is called by free_sched_group(). Since destroy_cfs_bandwidth() is called by free_fair_sched_group() without calling init_cfs_bandwidth(), RCU stall occurs at hrtimer_cancel(): INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0) Task dump for CPU 1: (fprintd) R running task 0 6249 1 0x00000088 ... Call Trace: <IRQ> [<ffffffff81094988>] sched_show_task+0xa8/0x110 [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50 [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0 [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700 [<ffffffff810cbf2b>] update_process_times+0x4b/0x80 [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50 [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70 [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0 [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50 [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230 [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70 [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30 [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0 [<ffffffff8108df46>] free_sched_group+0x16/0x40 [<ffffffff810971bb>] sched_create_group+0x4b/0x80 [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0 [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110 [<ffffffff81647729>] system_call_fastpath+0x12/0x17 Check whether init_cfs_bandwidth() was called before calling destroy_cfs_bandwidth(). Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [ Move the check into destroy_cfs_bandwidth() to aid compilability. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-