- 09 10月, 2013 6 次提交
-
-
由 Mel Gorman 提交于
The NUMA PTE scan rate is controlled with a combination of the numa_balancing_scan_period_min, numa_balancing_scan_period_max and numa_balancing_scan_size. This scan rate is independent of the size of the task and as an aside it is further complicated by the fact that numa_balancing_scan_size controls how many pages are marked pte_numa and not how much virtual memory is scanned. In combination, it is almost impossible to meaningfully tune the min and max scan periods and reasoning about performance is complex when the time to complete a full scan is is partially a function of the tasks memory size. This patch alters the semantic of the min and max tunables to be about tuning the length time it takes to complete a scan of a tasks occupied virtual address space. Conceptually this is a lot easier to understand. There is a "sanity" check to ensure the scan rate is never extremely fast based on the amount of virtual memory that should be scanned in a second. The default of 2.5G seems arbitrary but it is to have the maximum scan rate after the patch roughly match the maximum scan rate before the patch was applied. On a similar note, numa_scan_period is in milliseconds and not jiffies. Properly placed pages slow the scanning rate but adding 10 jiffies to numa_scan_period means that the rate scanning slows depends on HZ which is confusing. Get rid of the jiffies_to_msec conversion and treat it as ms. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
Scan delay logic and resets are currently initialised to start scanning immediately instead of delaying properly. Initialise them properly at fork time and catch when a new mm has been allocated. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
PTE scanning and NUMA hinting fault handling is expensive so commit 5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node") deferred the PTE scan until a task had been scheduled on another node. The problem is that in the purely shared memory case that this may never happen and no NUMA hinting fault information will be captured. We are not ruling out the possibility that something better can be done here but for now, this patch needs to be reverted and depend entirely on the scan_delay to avoid punishing short-lived processes. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate limited sees like a bad idea. Even if this node can't migrate anymore other nodes might and we want up-to-date information to do balance decisions. We already rate limit the actual migrations, this should leave enough bandwidth to allow the non-migrating scanning. I think its important we keep up-to-date information if we're going to do placement based on it. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
With a trace_printk("working\n"); right after the cmpxchg in task_numa_work() we can see that of a 4 thread process, its always the same task winning the race and doing the protection change. This is a problem since the task doing the protection change has a penalty for taking faults -- it is busy when marking the PTEs. If its always the same task the ->numa_faults[] get severely skewed. Avoid this by delaying the task doing the protection change such that it is unlikely to win the privilege again. Before: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3232 [022] .... 212.787402: task_numa_work: working thread 0/0-3232 [022] .... 212.888473: task_numa_work: working thread 0/0-3232 [022] .... 212.989538: task_numa_work: working thread 0/0-3232 [022] .... 213.090602: task_numa_work: working thread 0/0-3232 [022] .... 213.191667: task_numa_work: working thread 0/0-3232 [022] .... 213.292734: task_numa_work: working thread 0/0-3232 [022] .... 213.393804: task_numa_work: working thread 0/0-3232 [022] .... 213.494869: task_numa_work: working thread 0/0-3232 [022] .... 213.596937: task_numa_work: working thread 0/0-3232 [022] .... 213.699000: task_numa_work: working thread 0/0-3232 [022] .... 213.801067: task_numa_work: working thread 0/0-3232 [022] .... 213.903155: task_numa_work: working thread 0/0-3232 [022] .... 214.005201: task_numa_work: working thread 0/0-3232 [022] .... 214.107266: task_numa_work: working thread 0/0-3232 [022] .... 214.209342: task_numa_work: working After: root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15 thread 0/0-3253 [005] .... 136.865051: task_numa_work: working thread 0/2-3255 [026] .... 136.965134: task_numa_work: working thread 0/3-3256 [024] .... 137.065217: task_numa_work: working thread 0/3-3256 [024] .... 137.165302: task_numa_work: working thread 0/3-3256 [024] .... 137.265382: task_numa_work: working thread 0/0-3253 [004] .... 137.366465: task_numa_work: working thread 0/2-3255 [026] .... 137.466549: task_numa_work: working thread 0/0-3253 [004] .... 137.566629: task_numa_work: working thread 0/0-3253 [004] .... 137.666711: task_numa_work: working thread 0/1-3254 [028] .... 137.766799: task_numa_work: working thread 0/0-3253 [004] .... 137.866876: task_numa_work: working thread 0/2-3255 [026] .... 137.966960: task_numa_work: working thread 0/1-3254 [028] .... 138.067041: task_numa_work: working thread 0/2-3255 [026] .... 138.167123: task_numa_work: working thread 0/3-3256 [024] .... 138.267207: task_numa_work: working Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Fix a 80 column violation and a PTE vs PMD reference. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 06 10月, 2013 1 次提交
-
-
由 Shawn Bohrer 提交于
In 76854c7e ("sched: Use rt.nr_cpus_allowed to recover select_task_rq() cycles") an optimization was added to select_task_rq_rt() that immediately returns when p->nr_cpus_allowed == 1 at the beginning of the function. This makes the latter p->nr_cpus_allowed > 1 check redundant, which can now be removed. Signed-off-by: NShawn Bohrer <sbohrer@rgmadvisors.com> Reviewed-by: NSteven Rostedt <rostedt@goodmis.org> Cc: Mike Galbraith <mgalbraith@suse.de> Cc: tomk@rgmadvisors.com Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1380914693-24634-1-git-send-email-shawn.bohrer@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 25 9月, 2013 7 次提交
-
-
由 Peter Zijlstra 提交于
When using per-cpu preempt_count variables we need to save/restore the preempt_count on context switch (into per task storage; for instance the old thread_info::preempt_count variable) because of PREEMPT_ACTIVE. However, this means that on fork() the preempt_count value of the last context switch gets copied and if we had a PREEMPT_ACTIVE switch right before cloning a child task the child task will now too have PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE count. Therefore we need to make init_task_preempt_count() unconditional; this resets whatever preempt_count we inherited from our parent process. Doing so for !per-cpu implementations is harmless. For !PREEMPT_COUNT kernels we need to be careful not to start life with an increased preempt_count. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Rewrite the preempt_count macros in order to extract the 3 basic preempt_count value modifiers: __preempt_count_add() __preempt_count_sub() and the new: __preempt_count_dec_and_test() And since we're at it anyway, replace the unconventional $op_preempt_count names with the more conventional preempt_count_$op. Since these basic operators are equivalent to the previous _notrace() variants, do away with the _notrace() versions. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We need a few special preempt_count accessors: - task_preempt_count() for when we're interested in the preemption count of another (non-running) task. - init_task_preempt_count() for properly initializing the preemption count. - init_idle_preempt_count() a special case of the above for the idle threads. With these no generic code ever touches thread_info::preempt_count anymore and architectures could choose to remove it. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-jf5swrio8l78j37d06fzmo4r@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
In order to combine the preemption and need_resched test we need to fold the need_resched information into the preempt_count value. Since the NEED_RESCHED flag is set across CPUs this needs to be an atomic operation, however we very much want to avoid making preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED infrastructure in place but at 3 sites test it and fold its value into preempt_count; namely: - resched_task() when setting TIF_NEED_RESCHED on the current task - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a remote task it follows it up with a reschedule IPI and we can modify the cpu local preempt_count from there. - cpu_idle_loop() for when resched_task() found tsk_is_polling(). We use an inverted bitmask to indicate need_resched so that a 0 means both need_resched and !atomic. Also remove the barrier() in preempt_enable() between preempt_enable_no_resched() and preempt_check_resched() to avoid having to reload the preemption value and allow the compiler to use the flags of the previuos decrement. I couldn't come up with any sane reason for this barrier() to be there as preempt_enable_no_resched() already has a barrier() before doing the decrement. Suggested-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Replace the single preempt_count() 'function' that's an lvalue with two proper functions: preempt_count() - returns the preempt_count value as rvalue preempt_count_set() - Allows setting the preempt-count value Also provide preempt_count_ptr() as a convenience wrapper to implement all modifying operations. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-orxrbycjozopqfhb4dxdkdvb@git.kernel.org [ Fixed build failure. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We're going to deprecate and remove set_need_resched() for it will do the wrong thing. Make an exception for RCU and allow it to use resched_cpu() which will do the right thing. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/n/tip-2eywnacjl1nllctl1nszqa5w@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Michael S. Tsirkin 提交于
We always know the rq used, let's just pass it around. This seems to cut the size of scheduler core down a tiny bit: Before: [linux]$ size kernel/sched/core.o.orig text data bss dec hex filename 62760 16130 3876 82766 1434e kernel/sched/core.o.orig After: [linux]$ size kernel/sched/core.o.patched text data bss dec hex filename 62566 16130 3876 82572 1428c kernel/sched/core.o.patched Probably speeds it up as well. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130922142054.GA11499@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 20 9月, 2013 7 次提交
-
-
由 Jason Low 提交于
This patch builds on patch 2 and periodically decays that max value to do idle balancing per sched domain by approximately 1% per second. Also decay the rq's max_idle_balance_cost value. Signed-off-by: NJason Low <jason.low2@hp.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-4-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Jason Low 提交于
In this patch, we keep track of the max cost we spend doing idle load balancing for each sched domain. If the avg time the CPU remains idle is less then the time we have already spent on idle balancing + the max cost of idle balancing in the sched domain, then we don't continue to attempt the balance. We also keep a per rq variable, max_idle_balance_cost, which keeps track of the max time spent on newidle load balances throughout all its domains so that we can determine the avg_idle's max value. By using the max, we avoid overrunning the average. This further reduces the chance we attempt balancing when the CPU is not idle for longer than the cost to balance. Signed-off-by: NJason Low <jason.low2@hp.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-3-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Jason Low 提交于
When updating avg_idle, if the delta exceeds some max value, then avg_idle gets set to the max, regardless of what the previous avg was. This can cause avg_idle to often be overestimated. This patch modifies the way we update avg_idle by always updating it with the function call to update_avg() first. Then, if avg_idle exceeds the max, we set it to the max. Signed-off-by: NJason Low <jason.low2@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379096813-3032-2-git-send-email-jason.low2@hp.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vladimir Davydov 提交于
Currently new_dst_cpu is prevented from being reselected actually, not dst_cpu. This can result in attempting to pull tasks to this_cpu twice. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/281f59b6e596c718dd565ad267fc38f5b8e5c995.1379265590.git.vdavydov@parallels.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vladimir Davydov 提交于
Patch a003a2 (sched: Consider runnable load average in move_tasks()) sets all top-level cfs_rqs' h_load to rq->avg.load_avg_contrib, which is always 0. This mistype leads to all tasks having weight 0 when load balancing in a cpu-cgroup enabled setup. There obviously should be sum of weights of all runnable tasks there instead. Fix it. Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Reviewed-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1379173186-11944-1-git-send-email-vdavydov@parallels.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vladimir Davydov 提交于
In busiest->group_imb case we can come to fix_small_imbalance() with local->avg_load > busiest->avg_load. This can result in wrong imbalance fix-up, because there is the following check there where all the members are unsigned: if (busiest->avg_load - local->avg_load + scaled_busy_load_per_task >= (scaled_busy_load_per_task * imbn)) { env->imbalance = busiest->load_per_task; return; } As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix it by substituting the subtraction with an equivalent addition in the check. [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. ] Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/ef167822e5c5b2d96cf5b0e3e4f4bdff3f0414a2.1379252740.git.vdavydov@parallels.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vladimir Davydov 提交于
In busiest->group_imb case we can come to calculate_imbalance() with local->avg_load >= busiest->avg_load >= sds->avg_load. This can result in imbalance overflow, because it is calculated as follows env->imbalance = min( max_pull * busiest->group_power, (sds->avg_load - local->avg_load) * local->group_power) / SCHED_POWER_SCALE; As a result we can end up constantly bouncing tasks from one cpu to another if there are pinned tasks. Fix this by skipping the assignment and assuming imbalance=0 in case local->avg_load > sds->avg_load. [ The bug can be caught by running 2*N cpuhogs pinned to two logical cpus belonging to different cores on an HT-enabled machine with N logical cpus: just look at se.nr_migrations growth. ] Signed-off-by: NVladimir Davydov <vdavydov@parallels.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/8f596cc6bc0e5e655119dc892c9bfcad26e971f4.1379252740.git.vdavydov@parallels.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 16 9月, 2013 1 次提交
-
-
由 Michael S. Tsirkin 提交于
sched_info_depart seems to be only called from sched_info_switch(), so only on involuntary task switch. Fix the comment to match. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Link: http://lkml.kernel.org/r/20130916083036.GA1113@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 13 9月, 2013 7 次提交
-
-
由 Peter Zijlstra 提交于
Do away with 'phantom' cores due to N*frac(smt_power) >= 1 by limiting the capacity to the actual number of cores. The assumption of 1 < smt_power < 2 is an actual requirement because of what SMT is so this should work regardless of the SMT implementation. It can still be defeated by creative use of cpu hotplug, but if you're one of those freaks, you get to live with it. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NVincent Guittot <vincent.guitto@linaro.org> Link: http://lkml.kernel.org/n/tip-dczmbi8tfgixacg1ji2av1un@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Pull out the group_capacity computation so that we can more clearly comment its issues. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-az1hl1ya55k361nkeh9bj0yw@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
When looking at the code I noticed we don't actually compute sgp->power_orig correctly for groups, fix that. Currently the only consumer of that value is fix_small_capacity() which is only used on POWER7+ and that code excludes this case by being limited to SD_SHARE_CPUPOWER which is only ever set on the SMT domain which must be the lowest domain and this has singleton groups. So nothing should be affected by this change. Cc: Michael Neuling <mikey@neuling.org> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-db2pe0vxwunv37plc7onnugj@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Try and reduce the local_group logic by pulling most of it into update_sd_lb_stats. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-mgezl354xgyhiyrte78fdkpd@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Change the group_imb detection from the old 'load-spike' detector to an actual imbalance detector. We set it from the lower domain balance pass when it fails to create a balance in the presence of task affinities. The advantage is that this should no longer generate the false positive group_imb conditions generated by transient load spikes from the normal balancing/bulk-wakeup etc. behaviour. While I haven't actually observed those they could happen. I'm not entirely happy with this patch; it somehow feels a little fragile. Nor does it solve the biggest issue I have with the group_imb code; it it still a fragile construct in that once we 'fixed' the imbalance we'll not detect the group_imb again and could end up re-creating it. That said, this patch does seem to preserve behaviour for the described degenerate case. In particular on my 2*6*2 wsm-ep: taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done' ends up with 9 spinners, each on their own CPU; whereas if you disable the group_imb code that typically doesn't happen (you'll get one pair sharing a CPU most of the time). Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-36fpbgl39dv4u51b6yz2ypz5@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Emmanuel reported that /proc/sched_debug didn't report the right PIDs when using namespaces, cure this. Reported-by: NEmmanuel Deloget <emmanuel.deloget@efixo.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20130909110141.GM31370@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Daisuke Nishimura 提交于
There is a small race between copy_process() and cgroup_attach_task() where child->se.parent,cfs_rq points to invalid (old) ones. parent doing fork() | someone moving the parent to another cgroup -------------------------------+--------------------------------------------- copy_process() + dup_task_struct() -> parent->se is copied to child->se. se.parent,cfs_rq of them point to old ones. cgroup_attach_task() + cgroup_task_migrate() -> parent->cgroup is updated. + cpu_cgroup_attach() + sched_move_task() + task_move_group_fair() +- set_task_rq() -> se.parent,cfs_rq of parent are updated. + cgroup_fork() -> parent->cgroup is copied to child->cgroup. (*1) + sched_fork() + task_fork_fair() -> se.parent,cfs_rq of child are accessed while they point to old ones. (*2) In the worst case, this bug can lead to "use-after-free" and cause a panic, because it's new cgroup's refcount that is incremented at (*1), so the old cgroup(and related data) can be freed before (*2). In fact, a panic caused by this bug was originally caught in RHEL6.4. BUG: unable to handle kernel NULL pointer dereference at (null) IP: [<ffffffff81051e3e>] sched_slice+0x6e/0xa0 [...] Call Trace: [<ffffffff81051f25>] place_entity+0x75/0xa0 [<ffffffff81056a3a>] task_fork_fair+0xaa/0x160 [<ffffffff81063c0b>] sched_fork+0x6b/0x140 [<ffffffff8106c3c2>] copy_process+0x5b2/0x1450 [<ffffffff81063b49>] ? wake_up_new_task+0xd9/0x130 [<ffffffff8106d2f4>] do_fork+0x94/0x460 [<ffffffff81072a9e>] ? sys_wait4+0xae/0x100 [<ffffffff81009598>] sys_clone+0x28/0x30 [<ffffffff8100b393>] stub_clone+0x13/0x20 [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/039601ceae06$733d3130$59b79390$@mxp.nes.nec.co.jpSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 9月, 2013 1 次提交
-
-
由 Joonsoo Kim 提交于
Commit 23f0d209 ("sched: Factor out code to should_we_balance()") introduces the should_we_balance() function. This function should return 1 if this cpu is appropriate for balancing. But the newly introduced code doesn't do so, it returns 0 instead of 1. This introduces performance regression, reported by Dave Chinner: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s You can find more detailed information at: https://lkml.org/lkml/2013/9/10/1 This patch corrects the return value of should_we_balance() function as orignally intended. With this patch, Dave Chinner reports that the regression is gone: v4 filesystem v5 filesystem 3.11+xfsdev: 220k files/s 225k files/s 3.12-git 180k files/s 185k files/s 3.12-git-revert 245k files/s 247k files/s 3.12-git-fix 249k files/s 248k files/s Reported-by: NDave Chinner <dchinner@redhat.com> Tested-by: NDave Chinner <dchinner@redhat.com> Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Chinner <david@fromorbit.com> Link: http://lkml.kernel.org/r/20130910065448.GA20368@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 9月, 2013 1 次提交
-
-
由 Stanislaw Gruszka 提交于
scale_stime() silently assumes that stime < rtime, otherwise when stime == rtime and both values are big enough (operations on them do not fit in 32 bits), the resulting scaling stime can be bigger than rtime. In consequence utime = rtime - stime results in negative value. User space visible symptoms of the bug are overflowed TIME values on ps/top, for example: $ ps aux | grep rcu root 8 0.0 0.0 0 0 ? S 12:42 0:00 [rcuc/0] root 9 0.0 0.0 0 0 ? S 12:42 0:00 [rcub/0] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] root 11 0.1 0.0 0 0 ? S 12:42 0:02 [rcuop/0] root 12 62422329 0.0 0 0 ? S 12:42 21114581:35 [rcuop/1] root 10 62422329 0.0 0 0 ? R 12:42 21114581:37 [rcu_preempt] or overflowed utime values read directly from /proc/$PID/stat Reference: https://lkml.org/lkml/2013/8/20/259Reported-and-tested-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com> Cc: stable@vger.kernel.org Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 9月, 2013 9 次提交
-
-
由 Peter Zijlstra 提交于
I found that on my WSM box I had a redundant domain: [ 0.949769] CPU0 attaching sched-domain: [ 0.953765] domain 0: span 0,12 level SIBLING [ 0.958335] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.964548] domain 1: span 0-5,12-17 level MC [ 0.969206] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.984993] domain 2: span 0-5,12-17 level CPU [ 0.989822] groups: 0-5,12-17 (cpu_power = 7055) [ 0.995049] domain 3: span 0-23 level NUMA [ 0.999620] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Note how domain 2 has only a single group and spans the same CPUs as domain 1. We should not keep such domains and do in fact have code to prune these. It turns out that the 'new' SD_PREFER_SIBLING flag causes this, it makes sd_parent_degenerate() fail on the CPU domain. We can easily fix this by 'ignoring' the SD_PREFER_SIBLING bit and transfering it to whatever domain ends up covering the span. With this patch the domains now look like this: [ 0.950419] CPU0 attaching sched-domain: [ 0.954454] domain 0: span 0,12 level SIBLING [ 0.959039] groups: 0 (cpu_power = 587) 12 (cpu_power = 588) [ 0.965271] domain 1: span 0-5,12-17 level MC [ 0.969936] groups: 0,12 (cpu_power = 1175) 1,13 (cpu_power = 1176) 2,14 (cpu_power = 1176) 3,15 (cpu_power = 1176) 4,16 (cpu_power = 1176) 5,17 (cpu_power = 1176) [ 0.985737] domain 2: span 0-23 level NUMA [ 0.990231] groups: 0-5,12-17 (cpu_power = 7055) 6-11,18-23 (cpu_power = 7056) Reviewed-by: NPaul Turner <pjt@google.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ys201g4jwukj0h8xcamakxq1@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Rik reported some weirdness due to the group_imb code. As a start to looking at it, clean it up a little and add a few explanatory comments. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-caeeqttnla4wrrmhp5uf89gp@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Use for_each_cpu_and() and thereby avoid computing the capacity for CPUs we know we're not interested in. Reviewed-by: NPaul Turner <pjt@google.com> Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-lppceyv6kb3a19g8spmrn20b@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
For easier access, less dereferences and more consistent value, store the group power in update_sg_lb_stats() and use it thereafter. The actual value in sched_group::sched_group_power::power can change throughout the load-balance pass if we're unlucky. Reviewed-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-739xxqkyvftrhnh9ncudutc7@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Since we already compute (but don't store) the sgs load_per_task value in update_sg_lb_stats() we might as well store it and not re-compute it later on. Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-ym1vmljiwbzgdnnrwp9azftq@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We can shrink sg_lb_stats because rq::nr_running is an unsigned int and cpu numbers are 'int' Before: sgs: /* size: 72, cachelines: 2, members: 10 */ sds: /* size: 184, cachelines: 3, members: 7 */ After: sgs: /* size: 56, cachelines: 1, members: 10 */ sds: /* size: 152, cachelines: 3, members: 7 */ Further we can avoid clearing all of sds since we do a total clear/assignment of sg_stats in update_sg_lb_stats() with exception of busiest_stat.avg_load which is referenced in update_sd_pick_busiest(). Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-0klzmz9okll8wc0nsudguc9p@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Joonsoo Kim 提交于
There is no reason to maintain separate variables for this_group and busiest_group in sd_lb_stat, except saving some space. But this structure is always allocated in stack, so this saving isn't really benificial [peterz: reducing stack space is good; in this case readability increases enough that I think its still beneficial] This patch unify these variables, so IMO, readability may be improved. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> [ Rename this to local -- avoids confusion between this_cpu and the C++ this pointer. ] Reviewed-by: NPaul Turner <pjt@google.com> [ Lots of style edits, a few fixes and a rename. ] Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-4-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Joonsoo Kim 提交于
Now checking whether this cpu is appropriate to balance or not is embedded into update_sg_lb_stats() and this checking has no direct relationship to this function. There is not enough reason to place this checking at update_sg_lb_stats(), except saving one iteration for sched_group_cpus. In this patch, I factor out this checking to should_we_balance() function. And before doing actual work for load_balancing, check whether this cpu is appropriate to balance via should_we_balance(). If this cpu is not a candidate for balancing, it quit the work immediately. With this change, we can save two memset cost and can expect better compiler optimization. Below is result of this patch. * Vanilla * text data bss dec hex filename 34499 1136 116 35751 8ba7 kernel/sched/fair.o * Patched * text data bss dec hex filename 34243 1136 116 35495 8aa7 kernel/sched/fair.o In addition, rename @balance to @continue_balancing in order to represent its purpose more clearly. Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> [ s/should_balance/continue_balancing/g ] Reviewed-by: NPaul Turner <pjt@google.com> [ Made style changes and a fix in should_we_balance(). ] Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-3-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Joonsoo Kim 提交于
Remove one division operation in find_busiest_queue() by using crosswise multiplication: wl_i / power_i > wl_j / power_j := wl_i * power_j > wl_j * power_i Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> [ Expanded the changelog. ] Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1375778203-31343-2-git-send-email-iamjoonsoo.kim@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-