- 06 Jun, 2012: 2 commits

Committed by Peter Zijlstra
Weird topologies can lead to asymmetric domain setups. This needs further consideration since these setups are typically non-minimal too.

For now, make it work by adding an extra mask selecting which CPUs are allowed to iterate up.

The topology that triggered it is the one from David Rientjes:

  10 20 20 30
  20 10 20 20
  20 20 10 20
  30 20 20 10

resulting in boxes that wouldn't even boot.

Reported-by: David Rientjes <rientjes@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3p86l9cuaqnxz7uxsojmz5rm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
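
To see why this topology produces asymmetric domains, here is a small user-space C sketch (not kernel code) that computes, for each node, the set of nodes within a given distance, using the distance table from the report. The function name node_span() and the bitmask representation are assumptions made for illustration.

```c
/* Distance table from David Rientjes' report (10 = local node). */
#define NR_NODES 4

static const int rientjes_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 20, 30 },
	{ 20, 10, 20, 20 },
	{ 20, 20, 10, 20 },
	{ 30, 20, 20, 10 },
};

/* Bitmask of the nodes within 'dist' of 'node' (bit j set => node j is in the span). */
static unsigned int node_span(int node, int dist)
{
	unsigned int mask = 0;

	for (int j = 0; j < NR_NODES; j++)
		if (rientjes_distance[node][j] <= dist)
			mask |= 1u << j;
	return mask;
}
```

At distance 20, node_span(1, 20) is 0xF (all four nodes), while node_span(0, 20) is 0x7 ({0,1,2}) and node_span(3, 20) is 0xE ({1,2,3}). The spans at the same distance differ per node, which is the asymmetry the commit has to cope with.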

Committed by Alex Shi
Commit cb83b629 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support") removed the NODE sched domain and started checking whether the node distance in the SLIT table is farther than REMOTE_DISTANCE; if so, the domain loses the load-balance chance at the exec/fork/wake_affine points.

But even when the node distance is farther than REMOTE_DISTANCE, modern CPUs also have QPI-like connections, which ensure that memory access between nodes is not too slow. So the above change in behavior on NUMA machines causes a performance regression on various benchmarks: hackbench, tbench, netperf, oltp, etc.

This patch restores the old scheduler behavior on all my Intel platforms: NHM EP/EX, WSM EP, SNB EP/EP4S, and thus fixes the performance regressions. (All of them have just 2 kinds of distance: 10 and 21.)

Signed-off-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338965571-9812-1-git-send-email-alex.shi@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
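
As a rough illustration of the check described above, the sketch below models the balancing flags of a single NUMA level. REMOTE_DISTANCE is 20 in include/linux/topology.h; the flag values and the function numa_level_flags() are hypothetical. The sketch only shows why a distance of 21 trips the check, not what the patch changed.

```c
/* From include/linux/topology.h: LOCAL_DISTANCE is 10, REMOTE_DISTANCE is 20. */
#define REMOTE_DISTANCE 20

/* Hypothetical flag values for this sketch; the kernel's SD_* bits differ. */
#define SD_BALANCE_EXEC 0x1u
#define SD_BALANCE_FORK 0x2u
#define SD_WAKE_AFFINE  0x4u

/* Flags a NUMA domain level keeps under the check described above:
 * anything farther than REMOTE_DISTANCE loses exec/fork/wake_affine balancing. */
static unsigned int numa_level_flags(int level_distance)
{
	unsigned int flags = SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE;

	if (level_distance > REMOTE_DISTANCE)
		flags &= ~(SD_BALANCE_EXEC | SD_BALANCE_FORK | SD_WAKE_AFFINE);
	return flags;
}
```

On a machine whose only remote distance is 21, numa_level_flags(21) returns 0, so every cross-node level loses these balancing points even though the interconnect is fast.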

- 30 May, 2012: 7 commits

Committed by Kamalesh Babulal
Remove the explicit NULL assignment of the static pointer dattr_cur from init_sched_domains().

Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120523091411.GG5005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
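
The NULL assignment is redundant because C zero-initializes objects with static storage duration. A minimal sketch, assuming a declaration shaped like the scheduler's (the helper dattr_cur_is_null() is hypothetical):

```c
#include <stddef.h>

struct sched_domain_attr;	/* opaque here; only the pointer matters */

/* Objects with static storage duration are zero-initialized by the C
 * standard, so this pointer already starts out as NULL. */
static struct sched_domain_attr *dattr_cur;

/* Returns nonzero if the pointer still holds its initial (NULL) value. */
static int dattr_cur_is_null(void)
{
	return dattr_cur == NULL;
}
```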

Committed by Hiroshi Shimamoto
No need to have the last NULL entry.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29E7.5020805@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
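
A table whose length is known at compile time can be walked by element count instead of stopping at a NULL sentinel. A minimal sketch; the feature names, the local ARRAY_SIZE() definition and the feat_index() helper are illustrative, not copied from the scheduler:

```c
#include <stddef.h>
#include <string.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Sample names; no trailing NULL entry is needed because the element
 * count is known at compile time. */
static const char * const feat_names[] = {
	"GENTLE_FAIR_SLEEPERS",
	"START_DEBIT",
	"NEXT_BUDDY",
};

/* Look up a feature index by name; returns -1 if not found. */
static int feat_index(const char *name)
{
	for (size_t i = 0; i < ARRAY_SIZE(feat_names); i++)
		if (strcmp(feat_names[i], name) == 0)
			return (int)i;
	return -1;
}
```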

Committed by Hiroshi Shimamoto
The strings in sched_feat_names are never changed.

Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FBF29B2.9030904@ct.jp.nec.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
Since nr_cpus_allowed is used outside of sched/rt.c and wants to be used outside of it even more, move it to a more natural site.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-kr61f02y9brwzkh6x53pdptm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
SD_OVERLAP exists to allow overlapping groups; overlapping groups appear in NUMA topologies that aren't fully connected.

The typical result of not fully connected NUMA is that each cpu (or rather node) will have different spans for a particular distance. However, due to how sched domains are traversed -- only the first cpu in the mask goes one level up -- the next level only cares about the spans of the cpus that went up.

Due to this, two things were observed to be broken:

 - build_overlap_sched_groups() -- since it's possible the cpu we're building the groups for exists in multiple (or all) groups, the selection criteria of the first group didn't ensure there was a cpu for which it was true that cpumask_first(span) == cpu. Thus load-balancing would terminate.

 - update_group_power() -- assumed that the cpu span of the first group of the domain was covered by all groups of the child domain. The above explains why this isn't true, so deal with it.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/1337788843.9783.14.camel@laptop
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
Allocators don't appreciate it when you try and allocate memory from offline nodes.

Reported-and-tested-by: Tony Luck <tony.luck@intel.com>
Reported-and-tested-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-epfc1io9whb7o22bcujf31vn@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
Follow up on commit 556061b0 ("sched/nohz: Fix rq->cpu_load[] calculations"), since while that fixed the busy case it regressed the mostly idle case.

Add a callback from the nohz exit to also age the rq->cpu_load[] array. This closes the hole where either there was no nohz load balance pass during the nohz period, or there was a 'significant' amount of idle time between the last nohz balance and the nohz exit.

So we'll update unconditionally from the tick to not insert any accidental 0-load periods while busy, and we try to catch up from nohz idle balance and nohz exit. Both of these are still prone to missing a jiffy, but that has always been the case.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-kt0trz0apodbf84ucjfdbr1a@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 23 May, 2012: 1 commit

Committed by Jiri Olsa
This reverts commit cb04ff9a ("sched, perf: Use a single callback into the scheduler").

Before this change was introduced, the process switch worked like this (wrt. perf event schedule):

  schedule (prev, next)
    - schedule out all perf events for prev
    - switch to next
    - schedule in all perf events for current (next)

After the commit, the process switch looks like:

  schedule (prev, next)
    - schedule out all perf events for prev
    - schedule in all perf events for (next)
    - switch to next

The problem is that after we schedule perf events in, the pmu is enabled and we can receive events even before we make the switch to next, so "current" is still the prev process (event SAMPLE data are filled based on the value of the "current" process).

That's exactly what we see for the test__PERF_RECORD test. We receive SAMPLEs with the PID of the process that our tracee is scheduled from.

Discussed with Peter Zijlstra:

 > Bah!, yeah I guess reverting is the right thing for now. Sad
 > though.
 >
 > So by having the two hooks we have a black-spot between them
 > where we receive no events at all, this black-spot covers the
 > hand-over of current and we thus don't receive the 'wrong'
 > events.
 >
 > I rather liked we could do away with both that black-spot and
 > clean up the code a little, but apparently people rely on it.

Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: acme@redhat.com
Cc: paulus@samba.org
Cc: cjashfor@linux.vnet.ibm.com
Cc: fweisbec@gmail.com
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/20120523111302.GC1638@m.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 18 May, 2012: 1 commit

Committed by Konstantin Khlebnikov
Usually sleep-in-atomic bugs are followed by dozens of other warnings. This patch should help to figure out the original source of the problem.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120510122004.4873.12726.stgit@zurg
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 17 May, 2012: 1 commit

Committed by Peter Zijlstra
It's been broken forever (i.e. it's not scheduling in a power-aware fashion), as reported by Suresh and others sending patches, and nobody cares enough to fix it properly ... so remove it to free up space for something better.

There are various problems with the code as it stands today, first and foremost the user interface, which is bound to topology levels and has multiple values per level. This results in a state explosion which the administrator or distro needs to master, and almost nobody does.

Furthermore, large configuration state spaces aren't good: it means the thing doesn't just work right, because it's either under so many impossible-to-meet constraints, or, even if there's an achievable state, workloads have to be aware of it precisely and can never meet it for dynamic workloads.

So pushing this kind of decision to user-space was a bad idea even with a single knob; it's exponentially worse with knobs on every node of the topology.

There is a proposal to replace the user interface with a single 3-state knob:

  sched_balance_policy := { performance, power, auto }

where 'auto' would be the preferred default, which looks at things like Battery/AC mode and possible cpufreq state or whatever the hw exposes to show us power use expectations, but there's been no progress on it in the past many months.

Aside from that, the actual implementation of the various knobs is known to be broken. There have been sporadic attempts at fixing things but these always stop short of reaching a mergeable state.

Therefore this wholesale removal, with the hope of spurring people who care to come forward once again and work on a coherent replacement.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 14 May, 2012: 4 commits

Committed by Peter Zijlstra
While investigating why the load-balancer did funny things, I found that the rq->cpu_load[] tables were completely screwy. A bit more digging revealed that the updates that got through were missing ticks followed by a catch-up of 2 ticks.

The catch-up assumes the cpu was idle during that time (since only nohz can cause missed ticks and the machine is idle, etc.). This means that especially the higher indices were significantly lower than they ought to be.

The reason for this is that it's not correct to compare against jiffies on every jiffy on any cpu other than the cpu that updates jiffies.

This patch kludges around it by only doing the catch-up stuff from nohz_idle_balance() and doing the regular stuff unconditionally from the tick.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Cc: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/n/tip-tp4kj18xdd5aj4vvj0qg55s2@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
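
The effect of treating a busy missed tick as idle can be sketched with a simplified cpu_load[] update, in which index i moves 1/2^i of the way toward the current load on each tick. This omits the kernel's rounding and its table-driven decay of missed ticks; cpu_load_tick() is a hypothetical name and CPU_LOAD_IDX_MAX = 5 is assumed.

```c
#define CPU_LOAD_IDX_MAX 5

/* Simplified per-tick update of a cpu_load[]-style array: index 0 takes
 * the current load, index i moves 1/2^i of the way toward it. */
static void cpu_load_tick(unsigned long cpu_load[CPU_LOAD_IDX_MAX], unsigned long load)
{
	cpu_load[0] = load;
	for (int i = 1; i < CPU_LOAD_IDX_MAX; i++) {
		unsigned long scale = 1UL << i;

		cpu_load[i] = (cpu_load[i] * (scale - 1) + load) >> i;
	}
}
```

Start a busy CPU in steady state with every index at 1024. Replaying two ticks as busy leaves every index at 1024. Replaying the first tick as idle (load 0) and the second as busy, as the mistaken catch-up did, leaves cpu_load[1] at 768 and cpu_load[4] at 964. Because each tick closes only 1/2^i of the gap, the higher indices take longest to recover from such an error.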

Committed by Peter Zijlstra
It's far too easy to get a ridiculously large imbalance pct when you scale it like that. Use a fixed 125% for now.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-zsriaft1dv7hhboyrpvqjy6s@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
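
A fixed 125% means the busiest group must carry more than 1.25 times the local load before an imbalance is reported. A minimal sketch of that comparison in integer arithmetic, loosely modeled on how the load balancer compares loads against sd->imbalance_pct; is_imbalanced() is a hypothetical name:

```c
#include <stdbool.h>

/* Fixed threshold from the commit: 125 means "more than 25% busier". */
#define IMBALANCE_PCT 125

/* Report an imbalance only if busiest_load exceeds local_load by more
 * than (IMBALANCE_PCT - 100) percent. Scaling both sides by 100 keeps
 * the test in integer arithmetic. */
static bool is_imbalanced(unsigned long busiest_load, unsigned long local_load)
{
	return 100 * busiest_load > IMBALANCE_PCT * local_load;
}
```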

Committed by Peter Zijlstra
Patches c22402a2 ("sched/fair: Let minimally loaded cpu balance the group") and 0ce90475 ("sched/fair: Add some serialization to the sched_domain load-balance walk") are horribly broken, so revert them.

The problem is that while it sounds good to have the minimally loaded cpu do the pulling of more load, the way we walk the domains there is absolutely no guarantee this cpu will actually get to the domain. In fact it's very likely it won't. Therefore the higher up the tree we get, the less likely it is we'll balance at all.

The first-of-mask always walks up; while sucky, in that it accumulates load on the first cpu and needs extra passes to spread it out, it at least guarantees a cpu gets up that far and load-balancing happens at all.

Since it's now always the first, and idle cpus should always be able to balance so they get a task as fast as possible, we can also do away with the added serialization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-rpuhs5s56aiv1aw7khv9zkw6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
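
The "first of mask walks up" rule can be sketched as choosing one balancing CPU per group: the first idle CPU in the group if there is one, otherwise the first CPU. This is a simplified user-space model (masks as unsigned long bitmasks, hypothetical function names) based on the description above rather than on the exact kernel code:

```c
#include <limits.h>

/* Lowest CPU set in 'mask' (bit n = CPU n), or -1 if the mask is empty. */
static int first_cpu(unsigned long mask)
{
	for (int cpu = 0; cpu < (int)(CHAR_BIT * sizeof(mask)); cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return -1;
}

/* Pick the one CPU allowed to balance on behalf of the group: prefer the
 * first idle CPU inside the group, else fall back to the group's first CPU.
 * Idle CPUs outside the group are ignored. */
static int pick_balance_cpu(unsigned long group_mask, unsigned long idle_mask)
{
	unsigned long idle_in_group = group_mask & idle_mask;

	if (idle_in_group)
		return first_cpu(idle_in_group);
	return first_cpu(group_mask);
}
```

Because the choice is deterministic, some CPU of every group always proceeds to the next level, which is the guarantee the reverted "minimally loaded cpu" scheme lost.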

Committed by Peter Zijlstra
There's no need to convert a node number to a node number by pretending it's a cpu number.

Reported-by: Yinghai Lu <yinghai@kernel.org>
Reported-and-Tested-by: Greg Pearson <greg.pearson@hp.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-0sqhrht34phowgclj12dgk8h@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 09 May, 2012: 4 commits

Committed by Peter Zijlstra
We can easily use a single callback for both sched-in and sched-out. This reduces the code footprint in the scheduler path as well as removes the PMU black spot otherwise present between the out and in callback.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-o56ajxp1edwqg6x9d31wb805@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Peter Zijlstra
The current code groups up to 16 nodes in a level and then puts an ALLNODES domain spanning the entire tree on top of that. This doesn't reflect the NUMA topology, and especially for the smaller not-fully-connected machines out there today this might make a difference.

Therefore, build a proper NUMA topology based on node_distance().

Since there are no fixed NUMA layers anymore, the static SD_NODE_INIT and SD_ALLNODES_INIT aren't usable anymore; the new code tries to construct something similar and scales some values based on the number of cpus in the domain and/or the node_distance() ratio.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Greg Pearson <greg.pearson@hp.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: bob.picco@oracle.com
Cc: chris.mason@oracle.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-r74n3n8hhuc2ynbrnp3vt954@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
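
With no fixed layers, the levels have to be derived from node_distance() itself. A minimal sketch of one way to do that: collect the distinct distances in ascending order, one level per distance. The ring topology, numa_levels() and the insertion-based deduplication are assumptions for illustration, not the kernel's implementation:

```c
#include <stddef.h>
#include <string.h>

#define NR_NODES 4

/* Hypothetical 4-node ring: neighbours are 20 apart, opposite nodes 30. */
static const int ring_distance[NR_NODES][NR_NODES] = {
	{ 10, 20, 30, 20 },
	{ 20, 10, 20, 30 },
	{ 30, 20, 10, 20 },
	{ 20, 30, 20, 10 },
};

/* Collect the distinct distances in ascending order; each one would become
 * a NUMA domain level. 'levels' must hold NR_NODES * NR_NODES entries.
 * Returns the number of distinct levels found. */
static int numa_levels(const int dist[NR_NODES][NR_NODES], int *levels)
{
	int n = 0;

	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			int d = dist[i][j];
			int k = 0;

			while (k < n && levels[k] < d)	/* find the insert position */
				k++;
			if (k < n && levels[k] == d)	/* already recorded */
				continue;
			memmove(&levels[k + 1], &levels[k], (size_t)(n - k) * sizeof(*levels));
			levels[k] = d;
			n++;
		}
	}
	return n;
}
```

For the ring this yields three levels (10, 20, 30); a domain at each level would then span the nodes within that distance of a given node.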

Committed by Peter Zijlstra
Since the sched_domain walk is completely unserialized (!SD_SERIALIZE), it is possible that multiple cpus in the group get elected to do the next level. Avoid this by adding some serialization.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vqh9ai6s0ewmeakjz80w4qz6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Igor Mammedov
If we have one cpu that failed to boot and the boot cpu gave up on waiting for it, and then another cpu is being booted, the kernel might crash with the following OOPS:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
  IP: [<ffffffff812c3630>] __bitmap_weight+0x30/0x80
  Call Trace:
   [<ffffffff8108b9b6>] build_sched_domains+0x7b6/0xa50

The crash happens in init_sched_groups_power(), which expects sched_groups to be a circular linked list. However, that is not always true, since the sched_groups preallocated in __sdt_alloc() are initialized in build_sched_groups(), and it may exit early:

  if (cpu != cpumask_first(sched_domain_span(sd)))
      return 0;

without initializing the sd->groups->next field.

Fix the bug by initializing the next field right after the sched_group is allocated.

Also-Reported-by: Jiang Liu <liuj97@gmail.com>
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Cc: a.p.zijlstra@chello.nl
Cc: pjt@google.com
Cc: seto.hidetoshi@jp.fujitsu.com
Link: http://lkml.kernel.org/r/1336559908-32533-1-git-send-email-imammedo@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
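
The fix amounts to making every freshly allocated group a valid one-element ring, so a do/while walk over ->next terminates even if setup bails out early. A simplified sketch; struct group, group_alloc() and group_ring_len() are hypothetical stand-ins for struct sched_group and the code that walks it:

```c
#include <stdlib.h>

/* Simplified stand-in for struct sched_group: only the ring link matters. */
struct group {
	struct group *next;
	int id;
};

/* Allocate a group and immediately make it a valid one-element ring,
 * so a walker never sees next == NULL even if setup stops early. */
static struct group *group_alloc(int id)
{
	struct group *g = calloc(1, sizeof(*g));

	if (!g)
		return NULL;
	g->id = id;
	g->next = g;	/* self-link right after allocation */
	return g;
}

/* Walk the ring once, starting and ending at 'first', counting members. */
static int group_ring_len(const struct group *first)
{
	const struct group *g = first;
	int n = 0;

	do {
		n++;
		g = g->next;
	} while (g != first);
	return n;
}
```

Without the self-link, calloc() would leave next as NULL and the walk would dereference it on its second step, which matches the NULL-pointer OOPS above.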

- 03 May, 2012: 2 commits

Committed by Eric W. Biederman
- Compare kuids with uid_eq()
- kuids are unique across all user namespaces, so there is no longer a need for a user_namespace comparison.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
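
A minimal sketch of the kuid comparison, assuming kuid_t wraps a uid_t in a struct so that it cannot be compared with a plain ==. struct cred_sketch and same_owner_sketch() are hypothetical, shaped after an euid/uid ownership check; note that no user_namespace appears anywhere in the comparison:

```c
#include <stdbool.h>
#include <sys/types.h>

/* Wrapping the value in a struct forces comparisons through uid_eq(). */
typedef struct { uid_t val; } kuid_t;

static inline bool uid_eq(kuid_t left, kuid_t right)
{
	return left.val == right.val;
}

/* Hypothetical credential with only the fields the check needs. */
struct cred_sketch {
	kuid_t uid;
	kuid_t euid;
};

/* The caller "owns" the target if its euid matches the target's euid or uid. */
static bool same_owner_sketch(const struct cred_sketch *cur, const struct cred_sketch *target)
{
	return uid_eq(cur->euid, target->euid) || uid_eq(cur->euid, target->uid);
}
```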

Committed by Paul E. McKenney
Currently, PREEMPT_RCU readers are enqueued upon entry to the scheduler. This is inefficient because enqueuing is required only if there is a context switch, and entry to the scheduler does not guarantee a context switch. The commit therefore moves the enqueuing to immediately precede the call to switch_to() from the scheduler.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>

- 26 Apr, 2012: 2 commits

Committed by he, bo
Under extreme memory-exhaustion situations, percpu allocation might fail. We hit it when the system goes to suspend-to-ram, causing a kworker panic:

  EIP: [<c124411a>] build_sched_domains+0x23a/0xad0
  Kernel panic - not syncing: Fatal exception
  Pid: 3026, comm: kworker/u:3 3.0.8-137473-gf42fbef #1
  Call Trace:
   [<c18cc4f2>] panic+0x66/0x16c
   [...]
   [<c1244c37>] partition_sched_domains+0x287/0x4b0
   [<c12a77be>] cpuset_update_active_cpus+0x1fe/0x210
   [<c123712d>] cpuset_cpu_inactive+0x1d/0x30
   [...]

With this fix applied, build_sched_domains() will return -ENOMEM and the suspend attempt fails.

Signed-off-by: he, bo <bo.he@intel.com>
Reviewed-by: Zhang, Yanmin <yanmin.zhang@intel.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/1335355161.5892.17.camel@hebo
[ So, we fail to deallocate a CPU because we cannot allocate RAM :-/
  I don't like that kind of sad behavior but nevertheless it should not crash under high memory load. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by Thomas Gleixner
All SMP architectures have magic to fork the idle task and to store it for reuse when cpu hotplug is enabled. Provide a generic infrastructure for it.

Create/reinit the idle thread for the cpu which is brought up in the generic code and hand the thread pointer to the architecture code via __cpu_up().

Note that fork_idle() is called via a workqueue, because this guarantees that the idle thread does not get a reference to a user space VM. This can happen when the boot process did not bring up all possible cpus and a later cpu_up() is initiated via the sysfs interface. In that case fork_idle() would be called in the context of the user space task and take a reference on the user space VM.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Jesper Nilsson <jesper.nilsson@axis.com>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Hirokazu Takata <takata@linux-m32r.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: x86@kernel.org
Acked-by: Venkatesh Pallipadi <venki@google.com>
Link: http://lkml.kernel.org/r/20120420124557.102478630@linutronix.de

- 08 Apr, 2012: 1 commit

Committed by Eric W. Biederman
Optimize performance and prepare for the removal of the user_ns reference from user_struct. Remove the slow long walk through cred->user->user_ns and instead go straight to cred->user_ns.

Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

- 02 Apr, 2012: 1 commit

Committed by Tejun Heo
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. A termination entry is added to the cftype arrays, and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations.

This is a functionally identical transformation. There shouldn't be any visible behavior change.

memcg is rather special and will be converted separately.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <paul@paulmenage.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Vivek Goyal <vgoyal@redhat.com>
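
A cftype array with a termination entry can be walked without a separate length. A simplified sketch: struct cftype_sketch keeps only a name field, the file names are illustrative, and count_cftypes() is hypothetical. The kernel's arrays end with an empty { } entry; this sketch spells the terminator as an explicit empty name to stay within standard C before C23:

```c
#define MAX_CFTYPE_NAME 64

/* Simplified stand-in for struct cftype: just the file name. */
struct cftype_sketch {
	char name[MAX_CFTYPE_NAME];
};

/* Illustrative controller files; the empty-name entry terminates the array. */
static const struct cftype_sketch base_files[] = {
	{ .name = "shares" },
	{ .name = "rt_runtime_us" },
	{ .name = "" },	/* terminate */
};

/* Walk the array until the terminating entry and count the real files. */
static int count_cftypes(const struct cftype_sketch *cfts)
{
	int n = 0;

	for (const struct cftype_sketch *cft = cfts; cft->name[0] != '\0'; cft++)
		n++;
	return n;
}
```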

- 31 Mar, 2012: 1 commit

Committed by Srivatsa S. Bhat
The function for_each_cpu_mask() expects a *pointer* to struct cpumask as its second argument, whereas select_fallback_rq() passes the value itself. Moreover, for_each_cpu_mask() has been marked as obsolete in include/linux/cpumask.h. So move to the more appropriate for_each_cpu() variant.

Reported-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Dave Jones <davej@redhat.com>
Cc: Liu Chuansheng <chuansheng.liu@intel.com>
Cc: vapier@gentoo.org
Cc: rusty@rustcorp.com.au
Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

- 29 Mar, 2012: 2 commits

Committed by Stephen Boyd
If schedule() is called from an interrupt handler, __schedule_bug() will call show_regs() with the registers saved during the interrupt handling done in do_IRQ(). This means we'll see the registers and the backtrace for the process that was interrupted, and not the full backtrace explaining who called schedule().

This is due to 838225b4 ("sched: use show_regs() to improve __schedule_bug() output", 2007-10-24), which improperly assumed that get_irq_regs() would return the registers for the current stack because it is being called from within an interrupt handler. Simply remove the show_regs() code so that we dump a backtrace for the interrupt handler that called schedule().

[ I ran across this when I was presented with a scheduling-while-atomic log with a stacktrace pointing at spin_unlock_irqrestore(). It made no sense and I had to guess what interrupt handler could be called and poke around for someone calling schedule() in an interrupt handler. A simple test of putting an msleep() in an interrupt handler works better with this patch because you can actually see the msleep() call in the backtrace. ]

Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
Cc: Satyam Sharma <satyam@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Committed by David Howells
asm/system.h is a cause of circular dependency problems because it contains commonly used primitive stuff like barrier definitions and uncommonly used stuff like switch_to() that might require MMU definitions.

asm/system.h has been disintegrated by this point on all arches into the following common segments:

 (1) asm/barrier.h

     Moved memory barrier definitions here.

 (2) asm/cmpxchg.h

     Moved xchg() and cmpxchg() here. #included in asm/atomic.h.

 (3) asm/bug.h

     Moved die() and similar here.

 (4) asm/exec.h

     Moved arch_align_stack() here.

 (5) asm/elf.h

     Moved AT_VECTOR_SIZE_ARCH here.

 (6) asm/switch_to.h

     Moved switch_to() here.

Signed-off-by: David Howells <dhowells@redhat.com>

- 27 Mar, 2012: 1 commit

Committed by Peter Zijlstra
Commit 5fbd036b ("sched: Cleanup cpu_active madness"), which was supposed to finally sort out the cpu_active mess, instead uncovered more.

Since CPU_STARTING is run before setting the cpu online, there's a (small) window where the cpu is active,!online.

If during this time there's a wakeup of a task that used to reside on that cpu, select_task_rq() will use select_fallback_rq() to compute an alternative cpu to run on, since we find !online.

select_fallback_rq(), however, will compute the new cpu against cpu_active; this means that it can return the same cpu it started out with, the !online one, since that cpu is in fact marked active.

This results in us trying to schedule a task on an offline cpu and triggering a WARN in the IPI code.

The solution proposed by Chuansheng Liu of setting cpu_active in set_cpu_online() is buggy: firstly, not all archs actually use set_cpu_online(); secondly, not all archs call set_cpu_online() with IRQs disabled. This means we would introduce either the same race or the race from fd8a7de1 ("x86: cpu-hotplug: Prevent softirq wakeup on wrong CPU") -- albeit much narrower.

[ By setting online first and active later we have a window of online,!active; fresh and bound kthreads have task_cpu() of 0, and since cpu0 isn't in tsk_cpus_allowed() we end up in select_fallback_rq(), which excludes !active, resulting in a reset of ->cpus_allowed and the thread running all over the place. ]

The solution is to re-work select_fallback_rq() to require active _and_ online. This makes the active,!online case work as expected; OTOH, archs running CPU_STARTING after setting online are now vulnerable to the issue from fd8a7de1 -- these are alpha and blackfin.

Reported-by: Chuansheng Liu <chuansheng.liu@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: linux-alpha@vger.kernel.org
Link: http://lkml.kernel.org/n/tip-hubqk1i10o4dpvlm06gq7v6j@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
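
The core of the change is which mask the fallback CPU is taken from. A simplified sketch with CPU masks as bitmasks (bit n = CPU n); the function names are hypothetical, and the real select_fallback_rq() has further fallback steps that are not modeled here:

```c
#include <limits.h>

/* Lowest CPU set in 'mask', or -1 if the mask is empty. */
static int first_cpu(unsigned long mask)
{
	for (int cpu = 0; cpu < (int)(CHAR_BIT * sizeof(mask)); cpu++)
		if (mask & (1UL << cpu))
			return cpu;
	return -1;
}

/* Old rule: any allowed CPU that is marked active. */
static int fallback_cpu_active_only(unsigned long allowed, unsigned long active)
{
	return first_cpu(allowed & active);
}

/* New rule: the CPU must be both active and online. */
static int fallback_cpu(unsigned long allowed, unsigned long active, unsigned long online)
{
	return first_cpu(allowed & active & online);
}
```

With CPUs 1 and 2 allowed and active but only CPU 2 online (CPU 1 sitting in the active,!online window), the old rule picks CPU 1 and the new rule picks CPU 2.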

- 13 Mar, 2012: 5 commits

Committed by Catalin Marinas
This callback is called by the scheduler after rq->lock has been released and interrupts enabled. It will be used in subsequent patches on the ARM architecture.

Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Frank Rowand <frank.rowand@am.sony.com>
Tested-by: Will Deacon <will.deacon@arm.com>
Tested-by: Marc Zyngier <Marc.Zyngier@arm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/n/20120313110840.7b444deb6b1bb902c15f3cdf@canb.auug.org.au
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Peter Zijlstra
Various people reported nohz load tracking still being wrecked, but Doug spotted the actual problem. We fold the nohz remainder in too soon, causing us to lose samples and under-account.

So instead of playing catch-up up-front, always do a single load-fold with whatever state we encounter and only then fold the nohz remainder and play catch-up.

Reported-by: Doug Smythies <dsmythies@telus.net>
Reported-by: Lesław Kopeć <leslaw.kopec@nasza-klasa.pl>
Reported-by: Aman Gupta <aman@tmm1.net>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-4v31etnhgg9kwd6ocgx3rxl8@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Peter Zijlstra
Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1331056466.11248.327.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Peter Zijlstra
There are a few awkward printk()s inside of scheduler guts that people prefer to keep but really are rather deadlock prone. Fudge around it by storing the text in a per-cpu buffer and polling it using the existing printk_tick() handler.

This will drop output when it's more frequent than once a tick; however, only the affinity thing could possibly go that fast, and for that just one should suffice to notify the admin he's done something silly.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/n/tip-wua3lmkt3dg8nfts66o6brne@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Peter Zijlstra
Stepan found:

  CPU0                                  CPUn

  _cpu_up()
    __cpu_up()
                                        bootstrap()
                                          notify_cpu_starting()
                                          set_cpu_online()
                                          while (!cpu_active())
                                            cpu_relax()

  <PREEMPT-out>
  smp_call_function(.wait=1)
    /* we find cpu_online() is true */
    arch_send_call_function_ipi_mask()

    /* wait-forever-more */

  <PREEMPT-in>
                                          local_irq_enable()

    cpu_notify(CPU_ONLINE)
      sched_cpu_active()
        set_cpu_active()

Now the purpose of cpu_active is mostly with bringing down a cpu, where we mark it !active to keep the load-balancer from moving tasks to it while we tear down the cpu. This is required because we only update the sched_domain tree after we brought the cpu down. And this is needed so that some tasks can still run while we bring it down; we just don't want new tasks to appear.

On cpu-up, however, the sched_domain tree doesn't yet include the new cpu, so it's invisible to the load-balancer regardless of the active state. So instead of setting the active state after we boot the new cpu (and consequently having to wait for it before enabling interrupts), set the cpu active before we set it online and avoid the whole mess.

Reported-by: Stepan Moskovchenko <stepanm@codeaurora.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twins
Signed-off-by: Ingo Molnar <mingo@elte.hu>

- 08 Mar, 2012: 1 commit

Committed by Linus Torvalds
This reverts commit 8f2f748b.

It causes some odd regression that we have not figured out, and it's too late in the -rc series to try to figure it out now.

As reported by Konstantin Khlebnikov, it causes consistent hangs on his laptop (Thinkpad x220: 2x cores + HT). They can be avoided by adding calls to "rebuild_sched_domains();" in cpuset_cpu_[in]active() for the CPU_{ONLINE/DOWN_FAILED/DOWN_PREPARE}_FROZEN cases, but it's not at all clear why, and it makes no sense.

Konstantin's config doesn't even have CONFIG_CPUSETS enabled, just to make things even more interesting. So it's not the cpusets, it's just the scheduling domains.

So until this is understood, revert.

Bisected-reported-and-tested-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

- 01 Mar, 2012: 4 commits

Committed by Peter Zijlstra
Per-cgroup load-balance has numerous problems, chief amongst them that there is no real sane order in them. So stop pretending it makes sense and enqueue all tasks on a single list.

This also allows us to more easily fix the forward-progress issue uncovered by the lock-break stuff. Rotate the list on failure to migrate and limit the total iterations to nr_running (which, with releasing the lock, isn't strictly accurate but close enough).

Also add a filter that skips very light tasks on the first attempt around the list; this attempts to avoid shooting whole cgroups around without affecting over-balance.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: pjt@google.com
Link: http://lkml.kernel.org/n/tip-tx8yqydc7eimgq7i4rkc3a4g@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>
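
The list walk described above can be sketched in user space: tasks that cannot move are rotated to the tail, the number of iterations is capped at the initial list length (standing in for nr_running), and light tasks are skipped on the first pass. struct tasks_list, move_load() and the LIGHT_LOAD threshold are hypothetical; this models the idea, not the kernel's code:

```c
#include <stdbool.h>

#define MAX_TASKS 16
#define LIGHT_LOAD 16	/* hypothetical "very light task" threshold */

struct task_sketch {
	unsigned long load;
	bool can_migrate;	/* stands in for affinity/cache-hot checks */
};

/* One list holding all tasks, head at index 0. */
struct tasks_list {
	struct task_sketch t[MAX_TASKS];
	int n;
};

/* Move the head task to the tail. */
static void rotate(struct tasks_list *l)
{
	struct task_sketch head = l->t[0];

	for (int i = 1; i < l->n; i++)
		l->t[i - 1] = l->t[i];
	l->t[l->n - 1] = head;
}

/* Drop the head task (it has been migrated). */
static void remove_head(struct tasks_list *l)
{
	for (int i = 1; i < l->n; i++)
		l->t[i - 1] = l->t[i];
	l->n--;
}

/* Try to move up to 'want' load off the list; returns the load moved. */
static unsigned long move_load(struct tasks_list *l, unsigned long want, bool first_pass)
{
	unsigned long moved = 0;
	int loops = l->n;	/* cap iterations, like nr_running */

	while (loops-- > 0 && l->n > 0 && moved < want) {
		const struct task_sketch *p = &l->t[0];

		if (!p->can_migrate || (first_pass && p->load < LIGHT_LOAD)) {
			rotate(l);	/* failed: keep making progress */
			continue;
		}
		moved += p->load;
		remove_head(l);
	}
	return moved;
}
```

For a list with loads {8, 100 (cannot migrate), 50}, a first pass moves only 50 and then stops at the iteration cap; a second pass without the light-task filter moves the remaining 8.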

Committed by Thomas Gleixner
When we are PI-blocked then we want to get things done ASAP.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vw8et3445km5b8mpihf4trae@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Thomas Gleixner
Idle task boosting is a no-no in general. There is one exception, when PREEMPT_RT and NOHZ are active:

  The idle task calls get_next_timer_interrupt() and holds the timer wheel base->lock on the CPU, and another CPU wants to access the timer (probably to cancel it). We can safely ignore the boosting request, as the idle CPU runs this code with interrupts disabled and will complete the lock-protected section without being interrupted. So there is no real need to boost.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-755rvsosz7sdzot12a3gbha6@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>

Committed by Thomas Gleixner
For code which protects the waitqueue itself with another lock, it makes no sense to acquire the waitqueue lock for wakeup all. Provide __wake_up_all_locked().

This is an optimization on the vanilla kernel (to be used by the PCI code) and an important semantic distinction on -rt.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-ux6m4b8jonb9inx8xafh77ds@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@elte.hu>