- 04 12月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
Explain how the control dependency and smp_rmb() end up providing ACQUIRE semantics and pair with smp_store_release() in finish_lock_switch(). Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 11月, 2015 1 次提交
-
-
由 Dietmar Eggemann 提交于
Commit cd126afe ("sched/fair: Remove rq's runnable avg") got rid of rq->avg and so there is no need to update it any more when entering or exiting idle. Remove the now empty functions idle_{enter|exit}_fair(). Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1445342681-17171-1-git-send-email-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 06 10月, 2015 3 次提交
-
-
由 xiaofeng.yan 提交于
The parameter "int next_cpu" in the following function is unused: migrate_task_rq(struct task_struct *p, int next_cpu) Remove it. Signed-off-by: Nxiaofeng.yan <yanxiaofeng@inspur.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1442991360-31945-1-git-send-email-yanxiaofeng@inspur.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Mike Meyer reported the following bug: > During evaluation of some performance data, it was discovered thread > and run queue run_delay accounting data was inconsistent with the other > accounting data that was collected. Further investigation found under > certain circumstances execution time was leaking into the task and > run queue accounting of run_delay. > > Consider the following sequence: > > a. thread is running. > b. thread moves beween cgroups, changes scheduling class or priority. > c. thread sleeps OR > d. thread involuntarily gives up cpu. > > a. implies: > > thread->sched_info.last_queued = 0 > > a. and b. results in the following: > > 1. dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > delta = 0 > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > 2. enqueue_task(rq, thread) > > sched_info_queued(rq, thread) > > /* thread is still on cpu at this point. */ > thread->sched_info.last_queued = task_rq(thread)->clock; > > c. results in: > > dequeue_task(rq, thread) > > sched_info_dequeued(rq, thread) > > /* delta is execution time not run_delay. */ > delta = task_rq(thread)->clock - thread->sched_info.last_queued > > sched_info_reset_dequeued(thread) > thread->sched_info.last_queued = 0 > > thread->sched_info.run_delay += delta > > Since thread was running between enqueue_task(rq, thread) and > dequeue_task(rq, thread), the delta above is really execution > time and not run_delay. > > d. results in: > > __sched_info_switch(thread, next_thread) > > sched_info_depart(rq, thread) > > sched_info_queued(rq, thread) > > /* last_queued not updated due to being non-zero */ > return > > Since thread was running between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread), the execution time > between enqueue_task(rq, thread) and > __sched_info_switch(thread, next_thread) now will become > associated with run_delay due to when last_queued was last updated. > This alternative patch solves the problem by not calling sched_info_{de,}queued() in {de,en}queue_task(). Therefore the sched_info state is preserved and things work as expected. By inlining the {de,en}queue_task() functions the new condition becomes (mostly) a compile-time constant and we'll not emit any new branch instructions. It even shrinks the code (due to inlining {en,de}queue_task()): $ size defconfig-build/kernel/sched/core.o defconfig-build/kernel/sched/core.o.orig text data bss dec hex filename 64019 23378 2344 89741 15e8d defconfig-build/kernel/sched/core.o 64149 23378 2344 89871 15f0f defconfig-build/kernel/sched/core.o.orig Reported-by: NMike Meyer <Mike.Meyer@Teradata.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20150930154413.GO3604@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
So the problem this patch is trying to address is as follows: CPU0 CPU1 context_switch(A, B) ttwu(A) LOCK A->pi_lock A->on_cpu == 0 finish_task_switch(A) prev_state = A->state <-. WMB | A->on_cpu = 0; | UNLOCK rq0->lock | | context_switch(C, A) `-- A->state = TASK_DEAD prev_state == TASK_DEAD put_task_struct(A) context_switch(A, C) finish_task_switch(A) A->state == TASK_DEAD put_task_struct(A) The argument being that the WMB will allow the load of A->state on CPU0 to cross over and observe CPU1's store of A->state, which will then result in a double-drop and use-after-free. Now the comment states (and this was true once upon a long time ago) that we need to observe A->state while holding rq->lock because that will order us against the wakeup; however the wakeup will not in fact acquire (that) rq->lock; it takes A->pi_lock these days. We can obviously fix this by upgrading the WMB to an MB, but that is expensive, so we'd rather avoid that. The alternative this patch takes is: smp_store_release(&A->on_cpu, 0), which avoids the MB on some archs, but not important ones like ARM. Reported-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> # v3.1+ Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Cc: manfred@colorfullife.com Cc: will.deacon@arm.com Fixes: e4a52bcb ("sched: Remove rq->lock from the first half of ttwu()") Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 9月, 2015 1 次提交
-
-
由 Juri Lelli 提交于
Move dl_time_before() static definition in include/linux/sched/deadline.h so that it can be used by different parties without being re-defined. Reported-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NJuri Lelli <juri.lelli@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1441188096-23021-3-git-send-email-juri.lelli@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 9月, 2015 1 次提交
-
-
由 Henrik Austad 提交于
Most of the policy-tests are done via the <class>_policy() helpers with the notable exception of idle. A new wrapper for valid_policy() has also been added to improve readability in set_load_weight(). This commit does not change the logical behavior of the scheduler core. Signed-off-by: NHenrik Austad <henrik@austad.us> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/1441810841-4756-1-git-send-email-henrik@austad.usSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 13 9月, 2015 6 次提交
-
-
由 Dietmar Eggemann 提交于
Besides the existing frequency scale-invariance correction factor, apply CPU scale-invariance correction factor to utilization tracking to compensate for any differences in compute capacity. This could be due to micro-architectural differences (i.e. instructions per seconds) between cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current maximum frequency supported by individual cpus in SMP systems. In the existing implementation utilization isn't comparable between cpus as it is relative to the capacity of each individual CPU. Each segment of the sched_avg.util_sum geometric series is now scaled by the CPU performance factor too so the sched_avg.util_avg of each sched entity will be invariant from the particular CPU of the HMP/SMP system on which the sched entity is scheduled. With this patch, the utilization of a CPU stays relative to the max CPU performance of the fastest CPU in the system. In contrast to utilization (sched_avg.util_sum), load (sched_avg.load_sum) should not be scaled by compute capacity. The utilization metric is based on running time which only makes sense when cpus are _not_ fully utilized (utilization cannot go beyond 100% even if more tasks are added), where load is runnable time which isn't limited by the capacity of the CPU and therefore is a better metric for overloaded scenarios. If we run two nice-0 busy loops on two cpus with different compute capacity their load should be similar since their compute demands are the same. We have to assume that the compute demand of any task running on a fully utilized CPU (no spare cycles = 100% utilization) is high and the same no matter of the compute capacity of its current CPU, hence we shouldn't scale load by CPU capacity. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Morten Rasmussen 提交于
Bring arch_scale_cpu_capacity() in line with the recent change of its arch_scale_freq_capacity() sibling in commit dfbca41f ("sched: Optimize freq invariant accounting") from weak function to #define to allow inlining of the function. While at it, remove the ARCH_CAPACITY sched_feature as well. With the change to #define there isn't a straightforward way to allow runtime switch between an arch implementation and the default implementation of arch_scale_cpu_capacity() using sched_feature. The default was to use the arch-specific implementation, but only the arm architecture provides one and that is essentially equivalent to the default implementation. Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com> Cc: Juri Lelli <Juri.Lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: daniel.lezcano@linaro.org Cc: mturquette@baylibre.com Cc: pang.xunlei@zte.com.cn Cc: rjw@rjwysocki.net Cc: sgurrappadi@nvidia.com Cc: vincent.guittot@linaro.org Cc: yuyang.du@intel.com Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
Variable sched_numa_balancing toggles numa_balancing feature. Hence moving from a simple read mostly variable to a more apt static_branch. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439310261-16124-1-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
Commit 2a1ed24c ("sched/numa: Prefer NUMA hotness over cache hotness") sets sched feature NUMA to true. However this can enable NUMA hinting faults on a UMA system. This commit ensures that NUMA hinting faults occur only on a NUMA system by setting/resetting sched_numa_balancing. This commit: - Makes sched_numa_balancing common to CONFIG_SCHED_DEBUG and !CONFIG_SCHED_DEBUG. Earlier it was only in !CONFIG_SCHED_DEBUG. - Checks for sched_numa_balancing instead of sched_feat(NUMA). Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439290813-6683-3-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
Simple rename of the 'numabalancing_enabled' variable to 'sched_numa_balancing'. No functional changes. Suggested-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1439290813-6683-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
The previous patches made the second argument go unused, remove it. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 12 8月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
Give every class a set_cpus_allowed() method, this enables some small optimization in the RT,DL implementation by avoiding a double cpumask_weight() call. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dedekind1@gmail.com Cc: juri.lelli@arm.com Cc: mgorman@suse.de Cc: riel@redhat.com Cc: rostedt@goodmis.org Link: http://lkml.kernel.org/r/20150515154833.614517487@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 8月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
One less arch hook.. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 8月, 2015 4 次提交
-
-
由 Yuyang Du 提交于
The cfs_rq's load_avg is composed of runnable_load_avg and blocked_load_avg. Before this series, sometimes the runnable_load_avg is used, and sometimes the load_avg is used. Completely replacing all uses of runnable_load_avg with load_avg may be too big a leap, i.e., the blocked_load_avg is concerned to result in overrated load. Therefore, we get runnable_load_avg back. The new cfs_rq's runnable_load_avg is improved to be updated with all of the runnable sched_eneities at the same time, so the one sched_entity updated and the others stale problem is solved. Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-7-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yuyang Du 提交于
The runnable load and utilization averages of cfs_rq's sched_entity were not initiated. Like done to a task, give new cfs_rq' sched_entity start values to heavy its load in infant time. Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-5-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yuyang Du 提交于
The idea of runnable load average (let runnable time contribute to weight) was proposed by Paul Turner and Ben Segall, and it is still followed by this rewrite. This rewrite aims to solve the following issues: 1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is updated at the granularity of an entity at a time, which results in the cfs_rq's load average is stale or partially updated: at any time, only one entity is up to date, all other entities are effectively lagging behind. This is undesirable. To illustrate, if we have n runnable entities in the cfs_rq, as time elapses, they certainly become outdated: t0: cfs_rq { e1_old, e2_old, ..., en_old } and when we update: t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old } t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old } ... We solve this by combining all runnable entities' load averages together in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based on the fact that if we regard the update as a function, then: w * update(e) = update(w * e) and update(e1) + update(e2) = update(e1 + e2), then w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2) therefore, by this rewrite, we have an entirely updated cfs_rq at the time we update it: t1: update cfs_rq { e1_new, e2_new, ..., en_new } t2: update cfs_rq { e1_new, e2_new, ..., en_new } ... 2. cfs_rq's load average is different between top rq->cfs_rq and other task_group's per CPU cfs_rqs in whether or not blocked_load_average contributes to the load. The basic idea behind runnable load average (the same for utilization) is that the blocked state is taken into account as opposed to only accounting for the currently runnable state. Therefore, the average should include both the runnable/running and blocked load averages. This rewrite does that. In addition, we also combine runnable/running and blocked averages of all entities into the cfs_rq's average, and update it together at once. This is based on the fact that: update(runnable) + update(blocked) = update(runnable + blocked) This significantly reduces the code as we don't need to separately maintain/update runnable/running load and blocked load. 3. How task_group entities' share is calculated is complex and imprecise. We reduce the complexity in this rewrite to allow a very simple rule: the task_group's load_avg is aggregated from its per CPU cfs_rqs's load_avgs. Then group entity's weight is simply proportional to its own cfs_rq's load_avg / task_group's load_avg. To illustrate, if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then, task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share To sum up, this rewrite in principle is equivalent to the current one, but fixes the issues described above. Turns out, it significantly reduces the code complexity and hence increases clarity and efficiency. In addition, the new averages are more smooth/continuous (no spurious spikes and valleys) and updated more consistently and quickly to reflect the load dynamics. As a result, we have less load tracking overhead, better performance, and especially better power efficiency due to more balanced load. Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: dietmar.eggemann@arm.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yuyang Du 提交于
The current rq->avg is not used at all since its merge into the kernel, and the code is in the scheduler's hot path, so remove it. Tested-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arjan@linux.intel.com Cc: bsegall@google.com Cc: fengguang.wu@intel.com Cc: len.brown@intel.com Cc: morten.rasmussen@arm.com Cc: pjt@google.com Cc: rafael.j.wysocki@intel.com Cc: umgwanakikbuti@gmail.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1436918682-4971-2-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 7月, 2015 2 次提交
-
-
由 Srikar Dronamraju 提交于
Commit 44dba3d5 ("sched: Refactor task_struct to use numa_faults instead of numa_* pointers") modified the way tsk->numa_faults stats are accounted. However that commit never touched show_numa_stats() that is displayed in /proc/pid/sched and thus the numbers displayed in /proc/pid/sched don't match the actual numbers. Fix it by making sure that /proc/pid/sched reflects the task fault numbers. Also add group fault stats too. Also couple of more modifications are added here: 1. Format changes: - Previously we would list two entries per node, one for private and one for shared. Also the home node info was listed in each entry. - Now preferred node, total_faults and current node are displayed separately. - Now there is one entry per node, that lists private,shared task and group faults. 2. Unit changes: - p->numa_pages_migrated was getting reset after every read of /proc/pid/sched. It's more useful to have absolute numbers since differential migrations between two accesses can be more easily calculated. Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435252903-1081-4-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
Currently print_cfs_rq() is declared in include/linux/sched.h. However it's not used outside kernel/sched. Hence move the declaration to kernel/sched/sched.h Also some functions are only available for CONFIG_SCHED_DEBUG=y. Hence move the declarations to within the #ifdef. Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Iulia Manda <iulia.manda21@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1435252903-1081-2-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 19 6月, 2015 3 次提交
-
-
由 Wanpeng Li 提交于
There are two init_sched_dl_class() declarations, this patch drops the duplicate. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1431496867-4194-5-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Employ the new lockdep lock pinning annotation to ensure no 'accidental' lock-breaks happen with rq->lock. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124744.003233193@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Peter Zijlstra 提交于
Generalize the post_schedule() stuff into a balance callback list. This allows us to more easily use it outside of schedule() and cross sched_class. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: ktkhai@parallels.com Cc: rostedt@goodmis.org Cc: juri.lelli@gmail.com Cc: pang.xunlei@linaro.org Cc: oleg@redhat.com Cc: wanpeng.li@linux.intel.com Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150611124742.424032725@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 18 5月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
In the below two commits (see Fixes) we have periodic timers that can stop themselves when they're no longer required, but need to be (re)-started when their idle condition changes. Further complications is that we want the timer handler to always do the forward such that it will always correctly deal with the overruns, and we do not want to race such that the handler has already decided to stop, but the (external) restart sees the timer still active and we end up with a 'lost' timer. The problem with the current code is that the re-start can come before the callback does the forward, at which point the forward from the callback will WARN about forwarding an enqueued timer. Now, conceptually its easy to detect if you're before or after the fwd by comparing the expiration time against the current time. Of course, that's expensive (and racy) because we don't have the current time. Alternatively one could cache this state inside the timer, but then everybody pays the overhead of maintaining this extra state, and that is undesired. The only other option that I could see is the external timer_active variable, which I tried to kill before. I would love a nicer interface for this seemingly simple 'problem' but alas. Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage") Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers") Cc: pjt@google.com Cc: tglx@linutronix.de Cc: klamm@yandex-team.ru Cc: mingo@kernel.org Cc: bsegall@google.com Cc: hpa@zytor.com Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
-
- 08 5月, 2015 2 次提交
-
-
由 Jason Low 提交于
ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes the rest of the existing usages of ACCESS_ONCE() in the scheduler, and use the new READ_ONCE() and WRITE_ONCE() APIs as appropriate. Signed-off-by: NJason Low <jason.low2@hp.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NWaiman Long <Waiman.Long@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Aswin Chandramouleeswaran <aswin@hp.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com> Cc: Scott J Norton <scott.norton@hp.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
I could not find the loadavg code.. turns out it was hidden in a file called proc.c. It further got mingled up with the cruft per rq load indexes (which we really want to get rid of). Move the per rq load indexes into the fair.c load-balance code (that's the only thing that uses them) and rename proc.c to loadavg.c so we can find it again. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de> [ Did minor cleanups to the code. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 4月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
Roman reported a 3 cpu lockup scenario involving __start_cfs_bandwidth(). The more I look at that code the more I'm convinced its crack, that entire __start_cfs_bandwidth() thing is brain melting, we don't need to cancel a timer before starting it, *hrtimer_start*() will happily remove the timer for you if its still enqueued. Removing that, removes a big part of the problem, no more ugly cancel loop to get stuck in. So now, if I understand things right, the entire reason you have this cfs_b->lock guarded ->timer_active nonsense is to make sure we don't accidentally lose the timer. It appears to me that it should be possible to guarantee that same by unconditionally (re)starting the timer when !queued. Because regardless what hrtimer::function will return, if we beat it to (re)enqueue the timer, it doesn't matter. Now, because hrtimers don't come with any serialization guarantees we must ensure both handler and (re)start loop serialize their access to the hrtimer to avoid both trying to forward the timer at the same time. Update the rt bandwidth timer to match. This effectively reverts: 09dc4ab0 ("sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock"). Reported-by: NRoman Gushchin <klamm@yandex-team.ru> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Paul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/20150415095011.804589208@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 02 4月, 2015 1 次提交
-
-
由 Abel Vesa 提交于
Obviously, 'rq' is not used in these two functions, therefore, there is no reason for it to be passed as an argument. Signed-off-by: NAbel Vesa <abelvesa@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1425383427-26244-1-git-send-email-abelvesa@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 3月, 2015 5 次提交
-
-
由 Peter Zijlstra 提交于
Currently the freq invariant accounting (in __update_entity_runnable_avg() and sched_rt_avg_update()) get the scale factor from a weak function call, this means that even for archs that default on their implementation the compiler cannot see into this function and optimize the extra scaling math away. This is sad, esp. since its a 64-bit multiplication which can be quite costly on some platforms. So replace the weak function with #ifdef and __always_inline goo. This is not quite as nice from an arch support PoV but should at least result in compile time errors if done wrong. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Morten.Rasmussen@arm.com Cc: Paul Turner <pjt@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/20150323131905.GF23123@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vincent Guittot 提交于
The 'struct sched_group_capacity::capacity_orig' field is no longer used in the scheduler so we can remove it. Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425378903-5349-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vincent Guittot 提交于
This new field 'cpu_capacity_orig' reflects the original capacity of a CPU before being altered by rt tasks and/or IRQ The cpu_capacity_orig will be used: - to detect when the capacity of a CPU has been noticeably reduced so we can trig load balance to look for a CPU with better capacity. As an example, we can detect when a CPU handles a significant amount of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but this CPU is seen as an idle CPU by scheduler whereas CPUs, which are really idle, are available. - evaluate the available capacity for CFS tasks Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NKamalesh Babulal <kamalesh@linux.vnet.ibm.com> Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-7-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vincent Guittot 提交于
The average running time of RT tasks is used to estimate the remaining compute capacity for CFS tasks. This remaining capacity is the original capacity scaled down by a factor (aka scale_rt_capacity). This estimation of available capacity must also be invariant with frequency scaling. A frequency scaling factor is applied on the running time of the RT tasks for computing scale_rt_capacity. In sched_rt_avg_update(), we now scale the RT execution time like below: rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT Then, scale_rt_capacity can be summarized by: scale_rt_capacity = SCHED_CAPACITY_SCALE * available / total with available = total - rq->rt_avg This has been been optimized in current code by: scale_rt_capacity = available / (total >> SCHED_CAPACITY_SHIFT) But we can also developed the equation like below: scale_rt_capacity = SCHED_CAPACITY_SCALE - ((rq->rt_avg << SCHED_CAPACITY_SHIFT) / total) and we can optimize the equation by removing SCHED_CAPACITY_SHIFT shift in the computation of rq->rt_avg and scale_rt_capacity(). so rq->rt_avg += rt_delta * arch_scale_freq_capacity() and scale_rt_capacity = SCHED_CAPACITY_SCALE - (rq->rt_avg / total) arch_scale_frequency_capacity() will be called in the hot path of the scheduler which implies to have a short and efficient function. As an example, arch_scale_frequency_capacity() should return a cached value that is updated periodically outside of the hot path. Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com> Cc: Morten.Rasmussen@arm.com Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-6-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vincent Guittot 提交于
Add new statistics which reflect the average time a task is running on the CPU and the sum of these running time of the tasks on a runqueue. The latter is named utilization_load_avg. This patch is based on the usage metric that was proposed in the 1st versions of the per-entity load tracking patchset by Paul Turner <pjt@google.com> but that has be removed afterwards. This version differs from the original one in the sense that it's not linked to task_group. The rq's utilization_load_avg will be used to check if a rq is overloaded or not instead of trying to compute how many tasks a group of CPUs can handle. Rename runnable_avg_period into avg_period as it is now used with both runnable_avg_sum and running_avg_sum. Add some descriptions of the variables to explain their differences. Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NMorten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Morten.Rasmussen@arm.com Cc: Paul Turner <pjt@google.com> Cc: dietmar.eggemann@arm.com Cc: efault@gmx.de Cc: kamalesh@linux.vnet.ibm.com Cc: linaro-kernel@lists.linaro.org Cc: nicolas.pitre@linaro.org Cc: preeti@linux.vnet.ibm.com Cc: riel@redhat.com Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 3月, 2015 1 次提交
-
-
由 Steven Rostedt 提交于
When debugging the latencies on a 40 core box, where we hit 300 to 500 microsecond latencies, I found there was a huge contention on the runqueue locks. Investigating it further, running ftrace, I found that it was due to the pulling of RT tasks. The test that was run was the following: cyclictest --numa -p95 -m -d0 -i100 This created a thread on each CPU, that would set its wakeup in iterations of 100 microseconds. The -d0 means that all the threads had the same interval (100us). Each thread sleeps for 100us and wakes up and measures its latencies. cyclictest is maintained at: git://git.kernel.org/pub/scm/linux/kernel/git/clrkwllms/rt-tests.git What happened was another RT task would be scheduled on one of the CPUs that was running our test, when the other CPU tests went to sleep and scheduled idle. This caused the "pull" operation to execute on all these CPUs. Each one of these saw the RT task that was overloaded on the CPU of the test that was still running, and each one tried to grab that task in a thundering herd way. To grab the task, each thread would do a double rq lock grab, grabbing its own lock as well as the rq of the overloaded CPU. As the sched domains on this box was rather flat for its size, I saw up to 12 CPUs block on this lock at once. This caused a ripple affect with the rq locks especially since the taking was done via a double rq lock, which means that several of the CPUs had their own rq locks held while trying to take this rq lock. As these locks were blocked, any wakeups or load balanceing on these CPUs would also block on these locks, and the wait time escalated. I've tried various methods to lessen the load, but things like an atomic counter to only let one CPU grab the task wont work, because the task may have a limited affinity, and we may pick the wrong CPU to take that lock and do the pull, to only find out that the CPU we picked isn't in the task's affinity. Instead of doing the PULL, I now have the CPUs that want the pull to send over an IPI to the overloaded CPU, and let that CPU pick what CPU to push the task to. No more need to grab the rq lock, and the push/pull algorithm still works fine. With this patch, the latency dropped to just 150us over a 20 hour run. Without the patch, the huge latencies would trigger in seconds. I've created a new sched feature called RT_PUSH_IPI, which is enabled by default. When RT_PUSH_IPI is not enabled, the old method of grabbing the rq locks and having the pulling CPU do the work is implemented. When RT_PUSH_IPI is enabled, the IPI is sent to the overloaded CPU to do a push. To enabled or disable this at run time: # mount -t debugfs nodev /sys/kernel/debug # echo RT_PUSH_IPI > /sys/kernel/debug/sched_features or # echo NO_RT_PUSH_IPI > /sys/kernel/debug/sched_features Update: This original patch would send an IPI to all CPUs in the RT overload list. But that could theoretically cause the reverse issue. That is, there could be lots of overloaded RT queues and one CPU lowers its priority. It would then send an IPI to all the overloaded RT queues and they could then all try to grab the rq lock of the CPU lowering its priority, and then we have the same problem. The latest design sends out only one IPI to the first overloaded CPU. It tries to push any tasks that it can, and then looks for the next overloaded CPU that can push to the source CPU. The IPIs stop when all overloaded CPUs that have pushable tasks that have priorities greater than the source CPU are covered. In case the source CPU lowers its priority again, a flag is set to tell the IPI traversal to restart with the first RT overloaded CPU after the source CPU. Parts-suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Joern Engel <joern@purestorage.com> Cc: Clark Williams <williams@redhat.com> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150318144946.2f3cc982@gandalf.local.homeSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 2月, 2015 1 次提交
-
-
由 Peter Zijlstra 提交于
Kirill reported that a dl task can be throttled and dequeued at the same time. This happens, when it becomes throttled in schedule(), which is called to go to sleep: current->state = TASK_INTERRUPTIBLE; schedule() deactivate_task() dequeue_task_dl() update_curr_dl() start_dl_timer() __dequeue_task_dl() prev->on_rq = 0; This invalidates the assumption from commit 0f397f2c ("sched/dl: Fix race in dl_task_timer()"): "The only reason we don't strictly need ->pi_lock now is because we're guaranteed to have p->state == TASK_RUNNING here and are thus free of ttwu races". And therefore we have to use the full task_rq_lock() here. This further amends the fact that we forgot to update the rq lock loop for TASK_ON_RQ_MIGRATE, from commit cca26e80 ("sched: Teach scheduler to understand TASK_ON_RQ_MIGRATING state"). Reported-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Link: http://lkml.kernel.org/r/20150217123139.GN5029@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 1月, 2015 2 次提交
-
-
由 Peter Zijlstra 提交于
The original purpose of rq::skip_clock_update was to avoid 'costly' clock updates for back to back wakeup-preempt pairs. The big problem with it has always been that the rq variable is unaware of the context and causes indiscrimiate clock skips. Rework the entire thing and create a sense of context by only allowing schedule() to skip clock updates. (XXX can we measure the cost of the added store?) By ensuring only schedule can ever skip an update, we guarantee we're never more than 1 tick behind on the update. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
rq->clock{,_task} are serialized by rq->lock, verify this. One immediate fail is the usage in scale_rt_capability, so 'annotate' that for now, there's more 'funny' there. Maybe change rq->lock into a raw_seqlock_t? (Only 32-bit is affected) Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 16 11月, 2014 1 次提交
-
-
由 Stanislaw Gruszka 提交于
Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc test case in cost of breaking another one. After that commit, calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result of Y time being smaller than X time. Reproducer/tester can be found further below, it can be compiled and ran by: gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread while ./tst-cpuclock2 ; do : ; done This reproducer, when running on a buggy kernel, will complain about "clock_gettime difference too small". Issue happens because on start in thread_group_cputimer() we initialize sum_exec_runtime of cputimer with threads runtime not yet accounted and then add the threads runtime to running cputimer again on scheduler tick, making it's sum_exec_runtime bigger than actual threads runtime. KOSAKI Motohiro posted a fix for this problem, but that patch was never applied: https://lkml.org/lkml/2013/5/26/191 . This patch takes different approach to cure the problem. It calls update_curr() when cputimer starts, that assure we will have updated stats of running threads and on the next schedule tick we will account only the runtime that elapsed from cputimer start. That also assure we have consistent state between cpu times of individual threads and cpu time of the process consisted by those threads. Full reproducer (tst-cpuclock2.c): #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <stdio.h> #include <time.h> #include <pthread.h> #include <stdint.h> #include <inttypes.h> /* Parameters for the Linux kernel ABI for CPU clocks. */ #define CPUCLOCK_SCHED 2 #define MAKE_PROCESS_CPUCLOCK(pid, clock) \ ((~(clockid_t) (pid) << 3) | (clockid_t) (clock)) static pthread_barrier_t barrier; /* Help advance the clock. */ static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) ; return NULL; } /* Don't use the glibc wrapper. */ static int do_nanosleep(int flags, const struct timespec *req) { clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED); return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL); } static int64_t tsdiff(const struct timespec *before, const struct timespec *after) { int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec; int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec; return after_i - before_i; } int main(void) { int result = 0; pthread_t th; pthread_barrier_init(&barrier, NULL, 2); if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) { perror("pthread_create"); return 1; } pthread_barrier_wait(&barrier); /* The test. */ struct timespec before, after, sleeptimeabs; int64_t sleepdiff, diffabs; const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 }; /* The relative nanosleep. Not sure why this is needed, but its presence seems to make it easier to reproduce the problem. */ if (do_nanosleep(0, &sleeptime) != 0) { perror("clock_nanosleep"); return 1; } /* Get the current time. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) { perror("clock_gettime[2]"); return 1; } /* Compute the absolute sleep time based on the current time. */ uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec; sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000; sleeptimeabs.tv_nsec = nsec % 1000000000; /* Sleep for the computed time. */ if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) { perror("absolute clock_nanosleep"); return 1; } /* Get the time after the sleep. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) { perror("clock_gettime[3]"); return 1; } /* The time after sleep should always be equal to or after the absolute sleep time passed to clock_nanosleep. */ sleepdiff = tsdiff(&sleeptimeabs, &after); if (sleepdiff < 0) { printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff); result = 1; printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec); printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec); printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec); } /* The difference between the timestamps taken before and after the clock_nanosleep call should be equal to or more than the duration of the sleep. */ diffabs = tsdiff(&before, &after); if (diffabs < sleeptime.tv_nsec) { printf("clock_gettime difference too small: %" PRId64 "\n", diffabs); result = 1; } pthread_cancel(th); return result; } Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 11月, 2014 1 次提交
-
-
由 Iulia Manda 提交于
This patch simplifies task_struct by removing the four numa_* pointers in the same array and replacing them with the array pointer. By doing this, on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on x86_64). A new parameter is added to the task_faults_idx function so that it can return an index to the correct offset, corresponding with the old precalculated pointers. All of the code in sched/ that depended on task_faults_idx and numa_* was changed in order to match the new logic. Signed-off-by: NIulia Manda <iulia.manda21@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: dave@stgolabs.net Cc: riel@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfellSigned-off-by: NIngo Molnar <mingo@kernel.org>
-