- 14 1月, 2015 7 次提交
-
-
由 Peter Zijlstra 提交于
The original purpose of rq::skip_clock_update was to avoid 'costly' clock updates for back to back wakeup-preempt pairs. The big problem with it has always been that the rq variable is unaware of the context and causes indiscrimiate clock skips. Rework the entire thing and create a sense of context by only allowing schedule() to skip clock updates. (XXX can we measure the cost of the added store?) By ensuring only schedule can ever skip an update, we guarantee we're never more than 1 tick behind on the update. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
rq->clock{,_task} are serialized by rq->lock, verify this. One immediate fail is the usage in scale_rt_capability, so 'annotate' that for now, there's more 'funny' there. Maybe change rq->lock into a raw_seqlock_t? (Only 32-bit is affected) Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: umgwanakikbuti@gmail.com Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yao Dongdong 提交于
Search all usage of p->sched_class in sched/core.c, no one check it before use, so it seems that every task must belong to one sched_class. Signed-off-by: NYao Dongdong <yaodongdong@huawei.com> [ Moved the early class assignment to make it boot. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1419835303-28958-1-git-send-email-yaodongdong@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
Child has the same decay_count as parent. If it's not zero, we add it to parent's cfs_rq->removed_load: wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair(). Child's load is a just garbade after copying of parent, it hasn't been on cfs_rq yet, and it must not be added to cfs_rq::removed_load in migrate_task_rq_fair(). The patch moves sched_entity::avg::decay_count intialization in sched_fork(). So, migrate_task_rq_fair() does not change removed_load. Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tetsuo Handa 提交于
"struct task_struct"->state is "volatile long" and __ffs() warns that "Undefined if no bit exists, so code should check against 0 first." Therefore, at expression state = p->state ? __ffs(p->state) + 1 : 0; in sched_show_task(), CPU might see "p->state" before "?" as "non-zero" but "p->state" after "?" as "zero", which could result in "state >= sizeof(stat_nam)" being true and bogus '?' is printed. This patch changes "state" from "unsigned int" to "unsigned long" and save "p->state" before calling __ffs(), in order to avoid potential call to __ffs(0). Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412052131.GCE35924.FVHFOtLOJOMQFS@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Eric Sandeen 提交于
Sometimes a "BUG: sleeping function called from invalid context" message is not indicative of locking problems, but is the result of a stack overflow corrupting the thread info. Witness http://oss.sgi.com/archives/xfs/2014-02/msg00325.html for example, which took a few go-rounds to sort out. If we're printing the warning, things are wonky already, and it'd be informative to check for the stack end corruption at this point, too. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/5490B158.4060005@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Xunlei Pang 提交于
In __synchronize_entity_decay(), if "decays" happens to be zero, se->avg.decay_count will not be zeroed, holding the positive value assigned when dequeued last time. This is problematic in the following case: If this runnable task is CFS-balanced to other CPUs soon afterwards, migrate_task_rq_fair() will treat it as a blocked task due to its non-zero decay_count, thereby adding its load to cfs_rq->removed_load wrongly. Thus, we must zero se->avg.decay_count in this case as well. Signed-off-by: NXunlei Pang <pang.xunlei@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418745509-2609-1-git-send-email-pang.xunlei@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 1月, 2015 4 次提交
-
-
由 Tetsuo Handa 提交于
When alloc_fair_sched_group() in sched_create_group() fails, free_sched_group() is called, and free_fair_sched_group() is called by free_sched_group(). Since destroy_cfs_bandwidth() is called by free_fair_sched_group() without calling init_cfs_bandwidth(), RCU stall occurs at hrtimer_cancel(): INFO: rcu_sched self-detected stall on CPU { 1} (t=60000 jiffies g=13074 c=13073 q=0) Task dump for CPU 1: (fprintd) R running task 0 6249 1 0x00000088 ... Call Trace: <IRQ> [<ffffffff81094988>] sched_show_task+0xa8/0x110 [<ffffffff81097acd>] dump_cpu_task+0x3d/0x50 [<ffffffff810c3a80>] rcu_dump_cpu_stacks+0x90/0xd0 [<ffffffff810c7751>] rcu_check_callbacks+0x491/0x700 [<ffffffff810cbf2b>] update_process_times+0x4b/0x80 [<ffffffff810db046>] tick_sched_handle.isra.20+0x36/0x50 [<ffffffff810db0a2>] tick_sched_timer+0x42/0x70 [<ffffffff810ccb19>] __run_hrtimer+0x69/0x1a0 [<ffffffff810db060>] ? tick_sched_handle.isra.20+0x50/0x50 [<ffffffff810ccedf>] hrtimer_interrupt+0xef/0x230 [<ffffffff810452cb>] local_apic_timer_interrupt+0x3b/0x70 [<ffffffff8164a465>] smp_apic_timer_interrupt+0x45/0x60 [<ffffffff816485bd>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff810cc588>] ? lock_hrtimer_base.isra.23+0x18/0x50 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810cc9d2>] hrtimer_try_to_cancel+0x22/0xd0 [<ffffffff81193cf1>] ? __kmalloc+0x211/0x230 [<ffffffff810ccaa2>] hrtimer_cancel+0x22/0x30 [<ffffffff810a3cb5>] free_fair_sched_group+0x25/0xd0 [<ffffffff8108df46>] free_sched_group+0x16/0x40 [<ffffffff810971bb>] sched_create_group+0x4b/0x80 [<ffffffff810aa383>] sched_autogroup_create_attach+0x43/0x1c0 [<ffffffff8107dc9c>] sys_setsid+0x7c/0x110 [<ffffffff81647729>] system_call_fastpath+0x12/0x17 Check whether init_cfs_bandwidth() was called before calling destroy_cfs_bandwidth(). Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> [ Move the check into destroy_cfs_bandwidth() to aid compilability. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Turner <pjt@google.com> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
The dl_runtime_exceeded() function is supposed to ckeck if a SCHED_DEADLINE task must be throttled, by checking if its current runtime is <= 0. However, it also checks if the scheduling deadline has been missed (the current time is larger than the current scheduling deadline), further decreasing the runtime if this happens. This "double accounting" is wrong: - In case of partitioned scheduling (or single CPU), this happens if task_tick_dl() has been called later than expected (due to small HZ values). In this case, the current runtime is also negative, and replenish_dl_entity() can take care of the deadline miss by recharging the current runtime to a value smaller than dl_runtime - In case of global scheduling on multiple CPUs, scheduling deadlines can be missed even if the task did not consume more runtime than expected, hence penalizing the task is wrong This patch fix this problem by throttling a SCHED_DEADLINE task only when its runtime becomes negative, and not modifying the runtime Signed-off-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-3-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
According to global EDF, tasks should be migrated between runqueues without checking if their scheduling deadlines and runtimes are valid. However, SCHED_DEADLINE currently performs such a check: a migration happens doing: deactivate_task(rq, next_task, 0); set_task_cpu(next_task, later_rq->cpu); activate_task(later_rq, next_task, 0); which ends up calling dequeue_task_dl(), setting the new CPU, and then calling enqueue_task_dl(). enqueue_task_dl() then calls enqueue_dl_entity(), which calls update_dl_entity(), which can modify scheduling deadline and runtime, breaking global EDF scheduling. As a result, some of the properties of global EDF are not respected: for example, a taskset {(30, 80), (40, 80), (120, 170)} scheduled on two cores can have unbounded response times for the third task even if 30/80+40/80+120/170 = 1.5809 < 2 This can be fixed by invoking update_dl_entity() only in case of wakeup, or if this is a new SCHED_DEADLINE task. Signed-off-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@gmail.com> Cc: <stable@vger.kernel.org> Cc: Dario Faggioli <raistlin@linux.it> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418813432-20797-2-git-send-email-luca.abeni@unitn.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yuyang Du 提交于
In effective_load, we have (long w * unsigned long tg->shares) / long W, when w is negative, it is cast to unsigned long and hence the product is insanely large. Fix this by casting tg->shares to long. Reported-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NYuyang Du <yuyang.du@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Jones <davej@redhat.com> Cc: Andrey Ryabinin <a.ryabinin@samsung.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 12月, 2014 1 次提交
-
-
由 Alex Thorlton 提交于
When allocating space for load_balance_mask, in sched_init, when CPUMASK_OFFSTACK is set, we've managed to spill over KMALLOC_MAX_SIZE on our 6144 core machine. The patch below breaks up the allocations so that they don't overflow the max alloc size. It also allocates the masks on the the node from which they'll most commonly be accessed, to minimize remote accesses on NUMA machines. Suggested-by: NGeorge Beshers <gbeshers@sgi.com> Signed-off-by: NAlex Thorlton <athorlton@sgi.com> Cc: George Beshers <gbeshers@sgi.com> Cc: Russ Anderson <rja@sgi.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1418928270-148543-1-git-send-email-athorlton@sgi.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 11 12月, 2014 1 次提交
-
-
由 Oleg Nesterov 提交于
rcu_read_lock() can not protect p->real_parent if release_task(p) was already called, change sched_show_task() to check pis_alive() like other users do. Note: we need some helpers to cleanup the code like this. And it seems that that the usage of cpu_curr(cpu) in dump_cpu_task() is not safe too. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Aaron Tomlin <atomlin@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com>, Cc: Sterling Alexander <stalexan@redhat.com> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Roland McGrath <roland@hack.frob.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 08 12月, 2014 1 次提交
-
-
由 Andy Lutomirski 提交于
Locklessly doing is_idle_task(rq->curr) is only okay because of RCU protection. The older variant of the broken code checked rq->curr == rq->idle instead and therefore didn't need RCU. Fixes: f6be8af1 ("sched: Add new API wake_up_if_idle() to wake up the idle cpu") Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NChuansheng Liu <chuansheng.liu@intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/729365dddca178506dfd0a9451006344cd6808bc.1417277372.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 12月, 2014 1 次提交
-
-
由 Andy Lutomirski 提交于
It appears that some SCHEDULE_USER (asm for schedule_user) callers in arch/x86/kernel/entry_64.S are called from RCU kernel context, and schedule_user will return in RCU user context. This causes RCU warnings and possible failures. This is intended to be a minimal fix suitable for 3.18. Reported-and-tested-by: NDave Jones <davej@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frédéric Weisbecker <fweisbec@gmail.com> Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 11月, 2014 1 次提交
-
-
由 Thomas Gleixner 提交于
Chris bisected a NULL pointer deference in task_sched_runtime() to commit 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'. Chris observed crashes in atop or other /proc walking programs when he started fork bombs on his machine. He assumed that this is a new exit race, but that does not make any sense when looking at that commit. What's interesting is that, the commit provides update_curr callbacks for all scheduling classes except stop_task and idle_task. While nothing can ever hit that via the clock_nanosleep() and clock_gettime() interfaces, which have been the target of the commit in question, the author obviously forgot that there are other code paths which invoke task_sched_runtime() do_task_stat(() thread_group_cputime_adjusted() thread_group_cputime() task_cputime() task_sched_runtime() if (task_current(rq, p) && task_on_rq_queued(p)) { update_rq_clock(rq); up->sched_class->update_curr(rq); } If the stats are read for a stomp machine task, aka 'migration/N' and that task is current on its cpu, this will happily call the NULL pointer of stop_task->update_curr. Ooops. Chris observation that this happens faster when he runs the fork bomb makes sense as the fork bomb will kick migration threads more often so the probability to hit the issue will increase. Add the missing update_curr callbacks to the scheduler classes stop_task and idle_task. While idle tasks cannot be monitored via /proc we have other means to hit the idle case. Fixes: 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency' Reported-by: NChris Mason <clm@fb.com> Reported-and-tested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@kernel.org> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 11月, 2014 11 次提交
-
-
由 Wanpeng Li 提交于
Introduce start_hrtick_dl for !CONFIG_SCHED_HRTICK to align with the fair class. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415670747-58726-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 pang.xunlei 提交于
Actually, cpudl_set() and cpudl_init() can never be used without CONFIG_SMP. Signed-off-by: Npang.xunlei <pang.xunlei@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415260327-30465-4-git-send-email-pang.xunlei@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 pang.xunlei 提交于
Actually, cpupri_set() and cpupri_init() can never be used without CONFIG_SMP. Signed-off-by: Npang.xunlei <pang.xunlei@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: "pang.xunlei" <pang.xunlei@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415260327-30465-1-git-send-email-pang.xunlei@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
Do not call dequeue_pushable_dl_task() when failing to push an eligible task, as it remains pushable, merely not at this particular moment. Actually the patch is the same behavior as commit 311e800e ("sched, rt: Fix rq->rt.pushable_tasks bug in push_rt_task()" in -rt side. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415258564-8573-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
Commit caeb178c ("sched/fair: Make update_sd_pick_busiest() return 'true' on a busier sd") changes groups to be ranked in the order of overloaded > imbalance > other, and busiest group is picked according to this order. sgs->group_capacity_factor is used to check if the group is overloaded. When the child domain prefers tasks to go to siblings first, the sgs->group_capacity_factor will be set lower than one in order to move all the excess tasks away. However, group overloaded status is not updated when sgs->group_capacity_factor is set to lower than one, which leads to us missing to find the busiest group. This patch fixes it by updating group overloaded status when sg capacity factor is set to one, in order to find the busiest group accurately. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415144690-25196-1-git-send-email-wanpeng.li@linux.intel.com [ Fixed the changelog. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
Move the p->nr_cpus_allowed check into kernel/sched/core.c: select_task_rq(). This change will make fair.c, rt.c, and deadline.c all start with the same logic. Suggested-and-Acked-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: "pang.xunlei" <pang.xunlei@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415150077-59053-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wolfram Sang 提交于
As discussed in [1], accounting IO is meant for blkio only. Document that so driver authors won't use them for device io. [1] http://thread.gmane.org/gmane.linux.drivers.i2c/20470Signed-off-by: NWolfram Sang <wsa@the-dreams.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415098901-2768-1-git-send-email-wsa@the-dreams.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
Nobody iterates over numa_group::task_list, this just confuses the readers. Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1415358456.28592.17.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Stanislaw Gruszka 提交于
Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc test case in cost of breaking another one. After that commit, calling clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result of Y time being smaller than X time. Reproducer/tester can be found further below, it can be compiled and ran by: gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread while ./tst-cpuclock2 ; do : ; done This reproducer, when running on a buggy kernel, will complain about "clock_gettime difference too small". Issue happens because on start in thread_group_cputimer() we initialize sum_exec_runtime of cputimer with threads runtime not yet accounted and then add the threads runtime to running cputimer again on scheduler tick, making it's sum_exec_runtime bigger than actual threads runtime. KOSAKI Motohiro posted a fix for this problem, but that patch was never applied: https://lkml.org/lkml/2013/5/26/191 . This patch takes different approach to cure the problem. It calls update_curr() when cputimer starts, that assure we will have updated stats of running threads and on the next schedule tick we will account only the runtime that elapsed from cputimer start. That also assure we have consistent state between cpu times of individual threads and cpu time of the process consisted by those threads. Full reproducer (tst-cpuclock2.c): #define _GNU_SOURCE #include <unistd.h> #include <sys/syscall.h> #include <stdio.h> #include <time.h> #include <pthread.h> #include <stdint.h> #include <inttypes.h> /* Parameters for the Linux kernel ABI for CPU clocks. */ #define CPUCLOCK_SCHED 2 #define MAKE_PROCESS_CPUCLOCK(pid, clock) \ ((~(clockid_t) (pid) << 3) | (clockid_t) (clock)) static pthread_barrier_t barrier; /* Help advance the clock. */ static void *chew_cpu(void *arg) { pthread_barrier_wait(&barrier); while (1) ; return NULL; } /* Don't use the glibc wrapper. */ static int do_nanosleep(int flags, const struct timespec *req) { clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED); return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL); } static int64_t tsdiff(const struct timespec *before, const struct timespec *after) { int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec; int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec; return after_i - before_i; } int main(void) { int result = 0; pthread_t th; pthread_barrier_init(&barrier, NULL, 2); if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) { perror("pthread_create"); return 1; } pthread_barrier_wait(&barrier); /* The test. */ struct timespec before, after, sleeptimeabs; int64_t sleepdiff, diffabs; const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 }; /* The relative nanosleep. Not sure why this is needed, but its presence seems to make it easier to reproduce the problem. */ if (do_nanosleep(0, &sleeptime) != 0) { perror("clock_nanosleep"); return 1; } /* Get the current time. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) { perror("clock_gettime[2]"); return 1; } /* Compute the absolute sleep time based on the current time. */ uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec; sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000; sleeptimeabs.tv_nsec = nsec % 1000000000; /* Sleep for the computed time. */ if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) { perror("absolute clock_nanosleep"); return 1; } /* Get the time after the sleep. */ if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) { perror("clock_gettime[3]"); return 1; } /* The time after sleep should always be equal to or after the absolute sleep time passed to clock_nanosleep. */ sleepdiff = tsdiff(&sleeptimeabs, &after); if (sleepdiff < 0) { printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff); result = 1; printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec); printf("After %llu.%09llu\n", after.tv_sec, after.tv_nsec); printf("Sleep %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec); } /* The difference between the timestamps taken before and after the clock_nanosleep call should be equal to or more than the duration of the sleep. */ diffabs = tsdiff(&before, &after); if (diffabs < sleeptime.tv_nsec) { printf("clock_gettime difference too small: %" PRId64 "\n", diffabs); result = 1; } pthread_cancel(th); return result; } Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
While looking over the cpu-timer code I found that we appear to add the delta for the calling task twice, through: cpu_timer_sample_group() thread_group_cputimer() thread_group_cputime() times->sum_exec_runtime += task_sched_runtime(); *sample = cputime.sum_exec_runtime + task_delta_exec(); Which would make the sample run ahead, making the sleep short. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Stanislaw Gruszka <sgruszka@redhat.com> Cc: Christoph Lameter <cl@linux.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Because the whole numa task selection stuff runs with preemption enabled (its long and expensive) we can end up migrating and selecting oneself as a swap target. This doesn't really work out well -- we end up trying to acquire the same lock twice for the swap migrate -- so avoid this. Reported-and-Tested-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 11月, 2014 1 次提交
-
-
由 Andrey Ryabinin 提交于
On latest mm + KASan patchset I've got this: ================================================================== BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c ============================================================================= BUG kmalloc-8 (Not tainted): kasan error ----------------------------------------------------------------------------- Disabling lock debugging due to kernel taint INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0 __slab_alloc+0x4b4/0x4f0 __kmalloc_track_caller+0x15f/0x1e0 kstrdup+0x44/0x90 alloc_vfsmnt+0xb0/0x2c0 vfs_kern_mount+0x35/0x190 kern_mount_data+0x25/0x50 pid_ns_prepare_proc+0x19/0x50 alloc_pid+0x5e2/0x630 copy_process.part.41+0xdf5/0x2aa0 do_fork+0xf5/0x460 kernel_thread+0x21/0x30 rest_init+0x1e/0x90 start_kernel+0x522/0x531 x86_64_start_reservations+0x2a/0x2c x86_64_start_kernel+0x15b/0x16a INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080 INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70 Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5 proc.kk. Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc ........ Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a ZZZZZZZZ CPU: 0 PID: 1 Comm: swapper/0 Tainted: G B 3.18.0-rc3-mm1+ #108 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014 ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18 ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48 ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20 Call Trace: dump_stack (lib/dump_stack.c:52) print_trailer (mm/slub.c:645) object_err (mm/slub.c:652) ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178) ? kasan_poison_shadow (mm/kasan/kasan.c:48) ? kasan_unpoison_shadow (mm/kasan/kasan.c:54) ? kasan_poison_shadow (mm/kasan/kasan.c:48) ? kasan_kmalloc (mm/kasan/kasan.c:311) __asan_load4 (mm/kasan/kasan.c:371) ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063) kernel_init_freeable (init/main.c:869 init/main.c:997) ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248) ? rest_init (init/main.c:924) kernel_init (init/main.c:929) ? rest_init (init/main.c:924) ret_from_fork (arch/x86/kernel/entry_64.S:348) ? rest_init (init/main.c:924) Read of size 4 by task swapper/0: Memory state around the buggy address: ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc ^ ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ================================================================== Zero 'level' (e.g. on non-NUMA system) causing out of bounds access in this line: sched_max_numa_distance = sched_domains_numa_distance[level - 1]; Fix this by exiting from sched_init_numa() earlier. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Reviewed-by: NRik van Riel <riel@redhat.com> Fixes: 9942f79b ("sched/numa: Export info needed for NUMA balancing on complex topologies") Cc: peterz@infradead.org Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 11月, 2014 12 次提交
-
-
由 Iulia Manda 提交于
This patch simplifies task_struct by removing the four numa_* pointers in the same array and replacing them with the array pointer. By doing this, on x86_64, the size of task_struct is reduced by 3 ulong pointers (24 bytes on x86_64). A new parameter is added to the task_faults_idx function so that it can return an index to the correct offset, corresponding with the old precalculated pointers. All of the code in sched/ that depended on task_faults_idx and numa_* was changed in order to match the new logic. Signed-off-by: NIulia Manda <iulia.manda21@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: mgorman@suse.de Cc: dave@stgolabs.net Cc: riel@redhat.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141031001331.GA30662@winterfellSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
There are both UP and SMP version of pull_dl_task(), so don't need to check CONFIG_SMP in switched_from_dl(); Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-6-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
In switched_from_dl() we have to issue a resched if we successfully pulled some task from other cpus. This patch also aligns the behavior with -rt. Suggested-by: NJuri Lelli <juri.lelli@arm.com> Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-5-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
This patch pushes task away if the dealine of the task is equal to current during wake up. The same behavior as rt class. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-4-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
This patch add deadline rq status print. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-3-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
The yield semantic of deadline class is to reduce remaining runtime to zero, and then update_curr_dl() will stop it. However, comsumed bandwidth is reduced from the budget of yield task again even if it has already been set to zero which leads to artificial overrun. This patch fix it by make sure we don't steal some more time from the task that yielded in update_curr_dl(). Suggested-by: NJuri Lelli <juri.lelli@arm.com> Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-2-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
This patch checks if current can be pushed/pulled somewhere else in advance to make logic clear, the same behavior as dl class. - If current can't be migrated, useless to reschedule, let's hope task can move out. - If task is migratable, so let's not schedule it and see if it can be pushed or pulled somewhere else. Signed-off-by: NWanpeng Li <wanpeng.li@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414708776-124078-1-git-send-email-wanpeng.li@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Juri Lelli 提交于
As per commit f10e00f4 ("sched/dl: Use dl_bw_of() under rcu_read_lock_sched()"), dl_bw_of() has to be protected by rcu_read_lock_sched(). Signed-off-by: NJuri Lelli <juri.lelli@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414497286-28824-1-git-send-email-juri.lelli@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yao Dongdong 提交于
Idle cpu is idler than non-idle cpu, so we needn't search for least_loaded_cpu after we have found an idle cpu. Signed-off-by: NYao Dongdong <yaodongdong@huawei.com> Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414469286-6023-1-git-send-email-yaodongdong@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Kirill Tkhai 提交于
Currently used hrtimer_try_to_cancel() is racy: raw_spin_lock(&rq->lock) ... dl_task_timer raw_spin_lock(&rq->lock) ... raw_spin_lock(&rq->lock) ... switched_from_dl() ... ... hrtimer_try_to_cancel() ... ... switched_to_fair() ... ... ... ... ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... ... do_exit() ... ... schedule() ... ... raw_spin_lock(&rq->lock) ... raw_spin_unlock(&rq->lock) ... ... ... raw_spin_unlock(&rq->lock) ... raw_spin_lock(&rq->lock) ... ... (asquired) put_task_struct() ... ... free_task_struct() ... ... ... ... raw_spin_unlock(&rq->lock) ... (asquired) ... ... ... ... ... (use after free) ... So, let's implement 100% guaranteed way to cancel the timer and let's be sure we are safe even in very unlikely situations. rq unlocking does not limit the area of switched_from_dl() use, because this has already been possible in pull_dl_task() below. Let's consider the safety of of this unlocking. New code in the patch is working when hrtimer_try_to_cancel() fails. This means the callback is running. In this case hrtimer_cancel() is just waiting till the callback is finished. Two 1) Since we are in switched_from_dl(), new class is not dl_sched_class and new prio is not less MAX_DL_PRIO. So, the callback returns early; it's right after !dl_task() check. After that hrtimer_cancel() returns back too. The above is: raw_spin_lock(rq->lock); ... ... dl_task_timer() ... raw_spin_lock(rq->lock); switched_from_dl() ... hrtimer_try_to_cancel() ... raw_spin_unlock(rq->lock); ... hrtimer_cancel() ... ... raw_spin_unlock(rq->lock); ... return HRTIMER_NORESTART; ... ... raw_spin_lock(rq->lock); ... 2) But the below is also possible: dl_task_timer() raw_spin_lock(rq->lock); ... raw_spin_unlock(rq->lock); raw_spin_lock(rq->lock); ... switched_from_dl() ... hrtimer_try_to_cancel() ... ... return HRTIMER_NORESTART; raw_spin_unlock(rq->lock); ... hrtimer_cancel(); ... raw_spin_lock(rq->lock); ... In this case hrtimer_cancel() returns immediately. Very unlikely case, just to mention. Nobody can manipulate the task, because check_class_changed() is always called with pi_lock locked. Nobody can force the task to participate in (concurrent) priority inheritance schemes (the same reason). All concurrent task operations require pi_lock, which is held by us. No deadlocks with dl_task_timer() are possible, because it returns right after !dl_task() check (it does nothing). If we receive a new dl_task during the time of unlocked rq, we just don't have to do pull_dl_task() in switched_from_dl() further. Signed-off-by: NKirill Tkhai <ktkhai@parallels.com> [ Added comments] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1414420852.19914.186.camel@tkhaiSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
In some cases this can trigger a true flood of output. Requested-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
There is a race between kthread_stop() and the new wait_woken() that can result in a lack of progress. CPU 0 | CPU 1 | rfcomm_run() | kthread_stop() ... | if (!test_bit(KTHREAD_SHOULD_STOP)) | | set_bit(KTHREAD_SHOULD_STOP) | wake_up_process() wait_woken() | wait_for_completion() set_current_state(INTERRUPTIBLE) | if (!WQ_FLAG_WOKEN) | schedule_timeout() | | After which both tasks will wait.. forever. Fix this by having wait_woken() check for kthread_should_stop() but only for kthreads (obviously). Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Peter Hurley <peter@hurleysoftware.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-