- 07 12月, 2011 2 次提交
-
-
由 Suresh Siddha 提交于
cpu bit in the nohz.idle_cpu_mask are reset in the first busy tick after exiting idle. So during nohz_idle_balance(), intention is to double check if the cpu that is part of the idle_cpu_mask is indeed idle before going ahead in performing idle balance for that cpu. Fix the cpu typo in the idle_cpu() check during nohz_idle_balance(). Reported-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1323199177.1984.12.camel@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Mike Galbraith 提交于
hrtick_start_fair() shows up in profiles even when disabled. v3.0.6 taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key * 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: NMike Galbraith <efault@gmx.de> [ fixed !CONFIG_SCHED_HRTICK borkage ] Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971607.6855.17.camel@marge.simson.netSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 06 12月, 2011 8 次提交
-
-
由 Suresh Siddha 提交于
nr_busy_cpus in the sched_group_power indicates whether the group is semi idle or not. This helps remove the is_semi_idle_group() and simplify the find_new_ilb() in the context of finding an optimal cpu that can do idle load balancing. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.656983582@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
When there are many logical cpu's that enter and exit idle often, members of the global nohz data structure are getting modified very frequently causing lot of cache-line contention. Make the nohz idle load balancing more scalabale by using the sched domain topology and 'nr_busy_cpu's in the struct sched_group_power. Idle load balance is kicked on one of the idle cpu's when there is atleast one idle cpu and: - a busy rq having more than one task or - a busy rq's scheduler group that share package resources (like HT/MC siblings) and has more than one member in that group busy or - for the SD_ASYM_PACKING domain, if the lower numbered cpu's in that domain are idle compared to the busy ones. This will help in kicking the idle load balancing request only when there is a potential imbalance. And once it is mostly balanced, these kicks will be minimized. These changes helped improve the workload that is context switch intensive between number of task pairs by 2x on a 8 socket NHM-EX based system. Reported-by: NTim Chen <tim.c.chen@intel.com> Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.602203411@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
Introduce nr_busy_cpus in the struct sched_group_power [Not in sched_group because sched groups are duplicated for the SD_OVERLAP scheduler domain] and for each cpu that enters and exits idle, this parameter will be updated in each scheduler group of the scheduler domain that this cpu belongs to. To avoid the frequent update of this state as the cpu enters and exits idle, the update of the stat during idle exit is delayed to the first timer tick that happens after the cpu becomes busy. This is done using NOHZ_IDLE flag in the struct rq's nohz_flags. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.555984323@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
Introduce nohz_flags in the struct rq, which will track these two flags for now. NOHZ_TICK_STOPPED keeps track of the tick stopped status that gets set when the tick is stopped. It will be used to update the nohz idle load balancer data structures during the first busy tick after the tick is restarted. At this first busy tick after tickless idle, NOHZ_TICK_STOPPED flag will be reset. This will minimize the nohz idle load balancer status updates that currently happen for every tickless exit, making it more scalable when there are many logical cpu's that enter and exit idle often. NOHZ_BALANCE_KICK will track the need for nohz idle load balance on this rq. This will replace the nohz_balance_kick in the rq, which was not being updated atomically. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20111202010832.499438999@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Mike Galbraith 提交于
This is another case where we are on our way to schedule(), so can save a useless clock update and resulting microscopic vruntime update. Signed-off-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971686.6855.18.camel@marge.simson.netSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Mike Galbraith 提交于
rt.nr_cpus_allowed is always available, use it to bail from select_task_rq() when only one cpu can be used, and saves some cycles for pinned tasks. See the line marked with '*' below: # taskset -c 3 pipe-test PerfTop: 997 irqs/sec kernel:89.5% exact: 0.0% [1000Hz cycles], (all, CPU: 3) ------------------------------------------------------------------------------------------------ Virgin Patched samples pcnt function samples pcnt function _______ _____ ___________________________ _______ _____ ___________________________ 2880.00 10.2% __schedule 3136.00 11.3% __schedule 1634.00 5.8% pipe_read 1615.00 5.8% pipe_read 1458.00 5.2% system_call 1534.00 5.5% system_call 1382.00 4.9% _raw_spin_lock_irqsave 1412.00 5.1% _raw_spin_lock_irqsave 1202.00 4.3% pipe_write 1255.00 4.5% copy_user_generic_string 1164.00 4.1% copy_user_generic_string 1241.00 4.5% __switch_to 1097.00 3.9% __switch_to 929.00 3.3% mutex_lock 872.00 3.1% mutex_lock 846.00 3.0% mutex_unlock 687.00 2.4% mutex_unlock 804.00 2.9% pipe_write 682.00 2.4% native_sched_clock 713.00 2.6% native_sched_clock 643.00 2.3% system_call_after_swapgs 653.00 2.3% _raw_spin_unlock_irqrestore 617.00 2.2% sched_clock_local 633.00 2.3% fsnotify 612.00 2.2% fsnotify 605.00 2.2% sched_clock_local 596.00 2.1% _raw_spin_unlock_irqrestore 593.00 2.1% system_call_after_swapgs 542.00 1.9% sysret_check 559.00 2.0% sysret_check 467.00 1.7% fget_light 472.00 1.7% fget_light 462.00 1.6% finish_task_switch 461.00 1.7% finish_task_switch 437.00 1.5% vfs_write 442.00 1.6% vfs_write 431.00 1.5% do_sync_write 428.00 1.5% do_sync_write * 413.00 1.5% select_task_rq_fair 404.00 1.5% _raw_spin_lock_irq 386.00 1.4% update_curr 402.00 1.4% update_curr 385.00 1.4% rw_verify_area 389.00 1.4% do_sync_read 377.00 1.3% _raw_spin_lock_irq 378.00 1.4% vfs_read 369.00 1.3% do_sync_read 340.00 1.2% pipe_iov_copy_from_user 360.00 1.3% vfs_read 316.00 1.1% __wake_up_sync_key 342.00 1.2% hrtick_start_fair 313.00 1.1% __wake_up_common Signed-off-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321971504.6855.15.camel@marge.simson.netSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
Instead of going through the scheduler domain hierarchy multiple times (for giving priority to an idle core over an idle SMT sibling in a busy core), start with the highest scheduler domain with the SD_SHARE_PKG_RESOURCES flag and traverse the domain hierarchy down till we find an idle group. This cleanup also addresses an issue reported by Mike where the recent changes returned the busy thread even in the presence of an idle SMT sibling in single socket platforms. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Tested-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1321556904.15339.25.camel@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Andrew Vagin 提交于
This tracepoint shows how long a task is sleeping in uninterruptible state. E.g. it may show how long and where a mutex is waited for. Signed-off-by: NAndrew Vagin <avagin@openvz.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1322471015-107825-8-git-send-email-avagin@openvz.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 17 11月, 2011 2 次提交
-
-
由 Peter Zijlstra 提交于
There's too many sched*.[ch] files in kernel/, give them their own directory. (No code changed, other than Makefile glue added.) Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Since once needs to do something at conferences and fixing compile warnings doesn't actually require much if any attention I decided to break up the sched.c #include "*.c" fest. This further modularizes the scheduler code. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-x0fcd3mnp8f9c99grcpewmhi@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 16 11月, 2011 3 次提交
-
-
由 Paul Turner 提交于
Now that the linkage of jump-labels has been fixed they show a measurable improvement in overhead for the enabled-but-unused case. Workload is: 'taskset -c 0 perf stat --repeat 50 -e instructions,cycles,branches bash -c "for ((i=0;i<5;i++)); do $(dirname $0)/pipe-test 20000; done"' There's a speedup for all situations: instructions cycles branches ------------------------------------------------------------------------- Intel Westmere base 806611770 745895590 146765378 +jumplabel 803090165 (-0.44%) 713381840 (-4.36%) 144561130 AMD Barcelona base 824657415 740055589 148855354 +jumplabel 821056910 (-0.44%) 737558389 (-0.34%) 146635229 Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111108042736.560831357@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
In return_cfs_rq_runtime() we want to return bandwidth when there are no remaining tasks, not "return" when this is the case. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111108042736.623812423@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Avoid select_idle_sibling() from picking a sibling thread if there's an idle core that shares cache. This fixes SMT balancing in the increasingly common case where there's a shared cache core available to balance to. Tested-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1321350377.1421.55.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 14 11月, 2011 2 次提交
-
-
由 Hui Kang 提交于
Signed-off-by: NHui Kang <hkang.sunysb@gmail.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1318388459-4427-1-git-send-email-hkang.sunysb@gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Every time I have to stare at this function I need to completely reverse engineer its workings, about time I write a comment explaining the thing. Collected bits and pieces from previous changelogs, mostly: 4be9daaa 83378269Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1318518057.27731.2.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 06 10月, 2011 3 次提交
-
-
由 Peter Zijlstra 提交于
This task is preparatory for the migrate_disable() implementation, but stands on its own and provides a cleanup. It currently only converts those sites required for task-placement. Kosaki-san once mentioned replacing cpus_allowed with a proper cpumask_t instead of the NR_CPUS sized array it currently is, that would also require something like this. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Link: http://lkml.kernel.org/n/tip-e42skvaddos99psip0vce41o@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
rq's idle_at_tick is set to idle/busy during the timer tick depending on the cpu was idle or not. This will be used later in the load balance that will be done in the softirq context (which is a process context in -RT kernels). For nohz kernels, for the cpu doing nohz idle load balance on behalf of all the idle cpu's, its rq->idle_at_tick might have a stale value (which is recorded when it got the timer tick presumably when it is busy). As the nohz idle load balancing is also being done at the same place as the regular load balancing, nohz idle load balancing was bailing out when it sees rq's idle_at_tick not set. Thus leading to poor system utilization. Rename rq's idle_at_tick to idle_balance and set it when someone requests for nohz idle balance on an idle cpu. Reported-by: NSrivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111003220934.892350549@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
Current use of smp call function to kick the nohz idle balance can deadlock in this scenario. 1. cpu-A did a generic_exec_single() to cpu-B and after queuing its call single data (csd) to the call single queue, cpu-A took a timer interrupt. Actual IPI to cpu-B to process the call single queue is not yet sent. 2. As part of the timer interrupt handler, cpu-A decided to kick cpu-B for the idle load balancing (sets cpu-B's rq->nohz_balance_kick to 1) and __smp_call_function_single() with nowait will queue the csd to the cpu-B's queue. But the generic_exec_single() won't send an IPI to cpu-B as the call single queue was not empty. 3. cpu-A is busy with lot of interrupts 4. Meanwhile cpu-B is entering and exiting idle and noticed that it has it's rq->nohz_balance_kick set to '1'. So it will go ahead and do the idle load balancer and clear its rq->nohz_balance_kick. 5. At this point, csd queued as part of the step-2 above is still locked and waiting to be serviced on cpu-B. 6. cpu-A is still busy with interrupt load and now it got another timer interrupt and as part of it decided to kick cpu-B for another idle load balancing (as it finds cpu-B's rq->nohz_balance_kick cleared in step-4 above) and does __smp_call_function_single() with the same csd that is still locked. 7. And we get a deadlock waiting for the csd_lock() in the __smp_call_function_single(). Main issue here is that cpu-B can service the idle load balancer kick request from cpu-A even with out receiving the IPI and this lead to doing multiple __smp_call_function_single() on the same csd leading to deadlock. To kick a cpu, scheduler already has the reschedule vector reserved. Use that mechanism (kick_process()) instead of using the generic smp call function mechanism to kick off the nohz idle load balancing and avoid the deadlock. [ This issue is present from 2.6.35+ kernels, but marking it -stable only from v3.0+ as the proposed fix depends on the scheduler_ipi() that is introduced recently. ] Reported-by: NPrarit Bhargava <prarit@redhat.com> Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Cc: stable@kernel.org # v3.0+ Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20111003220934.834943260@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 26 9月, 2011 1 次提交
-
-
由 Wang Xingchao 提交于
The caller already checks for nr_running > 1, therefore we don't have to do so again. Signed-off-by: NWang Xingchao <xingchao.wang@intel.com> Reviewed-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1316194552-12019-1-git-send-email-xingchao.wang@intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 14 8月, 2011 14 次提交
-
-
由 Paul Turner 提交于
When a local cfs_rq blocks we return the majority of its remaining quota to the global bandwidth pool for use by other runqueues. We do this only when the quota is current and there is more than min_cfs_rq_quota [1ms by default] of runtime remaining on the rq. In the case where there are throttled runqueues and we have sufficient bandwidth to meter out a slice, a second timer is kicked off to handle this delivery, unthrottling where appropriate. Using a 'worst case' antagonist which executes on each cpu for 1ms before moving onto the next on a fairly large machine: no quota generations: 197.47 ms /cgroup/a/cpuacct.usage 199.46 ms /cgroup/a/cpuacct.usage 205.46 ms /cgroup/a/cpuacct.usage 198.46 ms /cgroup/a/cpuacct.usage 208.39 ms /cgroup/a/cpuacct.usage Since we are allowed to use "stale" quota our usage is effectively bounded by the rate of input into the global pool and performance is relatively stable. with quota generations [1s increments]: 119.58 ms /cgroup/a/cpuacct.usage 119.65 ms /cgroup/a/cpuacct.usage 119.64 ms /cgroup/a/cpuacct.usage 119.63 ms /cgroup/a/cpuacct.usage 119.60 ms /cgroup/a/cpuacct.usage The large deficit here is due to quota generations (/intentionally/) preventing us from now using previously stranded slack quota. The cost is that this quota becomes unavailable. with quota generations and quota return: 200.09 ms /cgroup/a/cpuacct.usage 200.09 ms /cgroup/a/cpuacct.usage 198.09 ms /cgroup/a/cpuacct.usage 200.09 ms /cgroup/a/cpuacct.usage 200.06 ms /cgroup/a/cpuacct.usage By returning unused quota we're able to both stably consume our desired quota and prevent unintentional overages due to the abuse of slack quota from previous quota periods (especially on a large machine). Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.306848658@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Nikhil Rao 提交于
This change introduces statistics exports for the cpu sub-system, these are added through the use of a stat file similar to that exported by other subsystems. The following exports are included: nr_periods: number of periods in which execution occurred nr_throttled: the number of periods above in which execution was throttle throttled_time: cumulative wall-time that any cpus have been throttled for this group Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NNikhil Rao <ncrao@google.com> Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.198901931@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
With the machinery in place to throttle and unthrottle entities, as well as handle their participation (or lack there of) we can now enable throttling. There are 2 points that we must check whether it's time to set throttled state: put_prev_entity() and enqueue_entity(). - put_prev_entity() is the typical throttle path, we reach it by exceeding our allocated run-time within update_curr()->account_cfs_rq_runtime() and going through a reschedule. - enqueue_entity() covers the case of a wake-up into an already throttled group. In this case we know the group cannot be on_rq and can throttle immediately. Checks are added at time of put_prev_entity() and enqueue_entity() Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184758.091415417@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
Buddies allow us to select "on-rq" entities without actually selecting them from a cfs_rq's rb_tree. As a result we must ensure that throttled entities are not falsely nominated as buddies. The fact that entities are dequeued within throttle_entity is not sufficient for clearing buddy status as the nomination may occur after throttling. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.886850167@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
From the perspective of load-balance and shares distribution, throttled entities should be invisible. However, both of these operations work on 'active' lists and are not inherently aware of what group hierarchies may be present. In some cases this may be side-stepped (e.g. we could sideload via tg_load_down in load balance) while in others (e.g. update_shares()) it is more difficult to compute without incurring some O(n^2) costs. Instead, track hierarchicaal throttled state at time of transition. This allows us to easily identify whether an entity belongs to a throttled hierarchy and avoid incorrect interactions with it. Also, when an entity leaves a throttled hierarchy we need to advance its time averaging for shares averaging so that the elapsed throttled time is not considered as part of the cfs_rq's operation. We also use this information to prevent buddy interactions in the wakeup and yield_to() paths. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.777916795@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
At the start of each period we refresh the global bandwidth pool. At this time we must also unthrottle any cfs_rq entities who are now within bandwidth once more (as quota permits). Unthrottled entities have their corresponding cfs_rq->throttled flag cleared and their entities re-enqueued. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.574628950@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
Now that consumption is tracked (via update_curr()) we add support to throttle group entities (and their corresponding cfs_rqs) in the case where this is no run-time remaining. Throttled entities are dequeued to prevent scheduling, additionally we mark them as throttled (using cfs_rq->throttled) to prevent them from becoming re-enqueued until they are unthrottled. A list of a task_group's throttled entities are maintained on the cfs_bandwidth structure. Note: While the machinery for throttling is added in this patch the act of throttling an entity exceeding its bandwidth is deferred until later within the series. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.480608533@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
Since quota is managed using a global state but consumed on a per-cpu basis we need to ensure that our per-cpu state is appropriately synchronized. Most importantly, runtime that is state (from a previous period) should not be locally consumable. We take advantage of existing sched_clock synchronization about the jiffy to efficiently detect whether we have (globally) crossed a quota boundary above. One catch is that the direction of spread on sched_clock is undefined, specifically, we don't know whether our local clock is behind or ahead of the one responsible for the current expiration time. Fortunately we can differentiate these by considering whether the global deadline has advanced. If it has not, then we assume our clock to be "fast" and advance our local expiration; otherwise, we know the deadline has truly passed and we expire our local runtime. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.379275352@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
This patch adds a per-task_group timer which handles the refresh of the global CFS bandwidth pool. Since the RT pool is using a similar timer there's some small refactoring to share this support. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.277271273@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
Account bandwidth usage on the cfs_rq level versus the task_groups to which they belong. Whether we are tracking bandwidth on a given cfs_rq is maintained under cfs_rq->runtime_enabled. cfs_rq's which belong to a bandwidth constrained task_group have their runtime accounted via the update_curr() path, which withdraws bandwidth from the global pool as desired. Updates involving the global pool are currently protected under cfs_bandwidth->lock, local runtime is protected by rq->lock. This patch only assigns and tracks quota, no action is taken in the case that cfs_rq->runtime_used exceeds cfs_rq->runtime_assigned. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NNikhil Rao <ncrao@google.com> Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184757.179386821@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
In this patch we introduce the notion of CFS bandwidth, partitioned into globally unassigned bandwidth, and locally claimed bandwidth. - The global bandwidth is per task_group, it represents a pool of unclaimed bandwidth that cfs_rqs can allocate from. - The local bandwidth is tracked per-cfs_rq, this represents allotments from the global pool bandwidth assigned to a specific cpu. Bandwidth is managed via cgroupfs, adding two new interfaces to the cpu subsystem: - cpu.cfs_period_us : the bandwidth period in usecs - cpu.cfs_quota_us : the cpu bandwidth (in usecs) that this tg will be allowed to consume over period above. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NNikhil Rao <ncrao@google.com> Signed-off-by: NBharata B Rao <bharata@linux.vnet.ibm.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.972636699@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
Introduce hierarchical task accounting for the group scheduling case in CFS, as well as promoting the responsibility for maintaining rq->nr_running to the scheduling classes. The primary motivation for this is that with scheduling classes supporting bandwidth throttling it is possible for entities participating in throttled sub-trees to not have root visible changes in rq->nr_running across activate and de-activate operations. This in turn leads to incorrect idle and weight-per-task load balance decisions. This also allows us to make a small fixlet to the fastpath in pick_next_task() under group scheduling. Note: this issue also exists with the existing sched_rt throttling mechanism. This patch does not address that. Signed-off-by: NPaul Turner <pjt@google.com> Reviewed-by: NHidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110721184756.878333391@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Hillf Danton 提交于
Checking for the validity of sd is removed, since it is already checked by the for_each_domain macro. Signed-off-by: NHillf Danton <dhillf@gmail.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/BANLkTimT+Tut-3TshCDm-NiLLXrOznibNA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yong Zhang 提交于
Remove the WAKEUP_PREEMPT feature, disabling it doesn't make any sense and its outlived its use by a long long while. Signed-off-by: NYong Zhang <yong.zhang0@gmail.com> Acked-by: NMike Galbraith <efault@gmx.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110729082033.GB12106@zhySigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 22 7月, 2011 5 次提交
-
-
由 Lin Ming 提交于
No need to define a new "cfs_rq" variable in the "for" block. Just use the one at the top of the function. Signed-off-by: NLin Ming <ming.m.lin@intel.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311297271.3938.1352.camel@minggr.sh.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephan Baerwolf 提交于
"entity_key()" is only used in "__enqueue_entity()" and its only function is to subtract a tasks vruntime by its groups minvruntime. Before this patch a rbtree enqueue-decision is done by comparing two tasks in the style: "if (entity_key(cfs_rq, se) < entity_key(cfs_rq, entry))" which would be "if (se->vruntime-cfs_rq->min_vruntime < entry->vruntime-cfs_rq->min_vruntime)" or (if reducing cfs_rq->min_vruntime out) "if (se->vruntime < entry->vruntime)" which is "if (entity_before(se, entry))" So we do not need "entity_key()". If "entity_before()" is inline we will also save one subtraction (only one, because "entity_key(cfs_rq, se)" was cached in "key") Signed-off-by: NStephan Baerwolf <stephan.baerwolf@tu-ilmenau.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-ns12mnd2h5w8rb9agd8hnsfk@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Jan Schoenherr 提交于
The last reference to cpu_cfs_rq() was removed with commit 88ec22d3 ("sched: Remove the cfs_rq dependency from set_task_cpu()"). Thus, remove this function, too. Signed-off-by: NJan Schoenherr <schnhrr@cs.tu-berlin.de> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1310580816-10861-3-git-send-email-schnhrr@cs.tu-berlin.deSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Use for_each_leaf_cfs_rq() instead of list_for_each_entry_rcu(), this achieves that load_balance_fair() only iterates those task_groups that actually have tasks on busiest, and that we iterate bottom-up, trying to move light groups before the heavier ones. No idea if it will actually work out to be beneficial in practice, does anybody have a cgroup workload that might show a difference one way or the other? [ Also move update_h_load to sched_fair.c, loosing #ifdef-ery ] Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: NPaul Turner <pjt@google.com> Link: http://lkml.kernel.org/r/1310557009.2586.28.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Paul Turner 提交于
In dequeue_task_fair() we bail on dequeue when we encounter a parenting entity with additional weight. However, we perform a double shares update on this entity as we continue the shares update traversal from this point, despite dequeue_entity() having already updated its queuing cfs_rq. Avoid this by starting from the parent when we resume. Signed-off-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110707053059.797714697@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-