- 25 May 2018, 1 commit
-
-
Submitted by Patrick Bellasi
When a task is enqueued the estimated utilization of a CPU is updated to better support the selection of the required frequency. However, schedutil is (implicitly) updated by update_load_avg() which always happens before util_est_{en,de}queue(), thus potentially introducing a latency between estimated utilization updates and frequency selections. Let's update util_est at the beginning of enqueue_task_fair(), which will ensure that all schedutil updates will see the most updated estimated utilization value for a CPU.
Reported-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Fixes: 7f65ea42 ("sched/fair: Add util_est on top of PELT")
Link: http://lkml.kernel.org/r/20180524141023.13765-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 14 May 2018, 2 commits
-
-
Submitted by Rohit Jain
sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu()
In the following commit: 247f2f6f ("sched/core: Don't schedule threads on pre-empted vCPUs") ... we distinguish between idle_cpu() when the vCPU is not running for scheduling threads. However, the idle_cpu() function is used in other places for actually checking whether the state of the CPU is idle or not. Hence split the use of that function based on the desired return value, by introducing the available_idle_cpu() function. This fixes a (slight) regression in that initial vCPU commit, because some code paths (like the load-balancer) don't care and shouldn't care if the vCPU is preempted or not, they just want to know if there are any tasks on the CPU.
Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dhaval.giani@oracle.com
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: steven.sistare@oracle.com
Cc: subhra.mazumdar@oracle.com
Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
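A minimal user-space sketch of the resulting split, with stubbed helpers standing in for the kernel's real idle_cpu() and vcpu_is_preempted(); the bodies here are illustrative assumptions, not the kernel implementation:

```c
#include <stdbool.h>

/* Stubs for illustration only. */
static bool cpu_has_runnable_tasks(int cpu) { (void)cpu; return false; }
static bool vcpu_is_preempted(int cpu)      { (void)cpu; return false; }

/* "Is the CPU idle?" -- what load-balancing style callers want to know. */
bool idle_cpu(int cpu)
{
	return !cpu_has_runnable_tasks(cpu);
}

/* "Is the CPU idle and immediately usable?" -- what wakeup placement wants. */
bool available_idle_cpu(int cpu)
{
	return idle_cpu(cpu) && !vcpu_is_preempted(cpu);
}
```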
-
Submitted by Mel Gorman
Threads share an address space and each can change the protections of the same address space to trap NUMA faults. This is redundant and potentially counter-productive as any thread doing the update will suffice. Potentially only one thread is required but that thread may be idle or it may not have any locality concerns and pick an unsuitable scan rate. This patch uses independent scan periods, but they are staggered based on the number of address space users when the thread is created. The intent is that threads will avoid scanning at the same time and have a chance to adapt their scan rate later if necessary. This reduces the total scan activity early in the lifetime of the threads. The difference in headline performance across a range of machines and workloads is marginal, but the system CPU usage is reduced, as well as overall scan activity. The following is the time reported by NAS Parallel Benchmark using unbound openmp threads and a D size class:
4.17.0-rc1 4.17.0-rc1
vanilla stagger-v1r1
Time bt.D 442.77 ( 0.00%) 419.70 ( 5.21%)
Time cg.D 171.90 ( 0.00%) 180.85 ( -5.21%)
Time ep.D 33.10 ( 0.00%) 32.90 ( 0.60%)
Time is.D 9.59 ( 0.00%) 9.42 ( 1.77%)
Time lu.D 306.75 ( 0.00%) 304.65 ( 0.68%)
Time mg.D 54.56 ( 0.00%) 52.38 ( 4.00%)
Time sp.D 1020.03 ( 0.00%) 903.77 ( 11.40%)
Time ua.D 400.58 ( 0.00%) 386.49 ( 3.52%)
Note that it's not a universal win, as we have no prior knowledge of which thread matters, and the number of threads created often exceeds the size of the node when the threads are not bound. However, there is a reduction in overall system CPU usage:
4.17.0-rc1 4.17.0-rc1
vanilla stagger-v1r1
sys-time-bt.D 48.78 ( 0.00%) 48.22 ( 1.15%)
sys-time-cg.D 25.31 ( 0.00%) 26.63 ( -5.22%)
sys-time-ep.D 1.65 ( 0.00%) 0.62 ( 62.42%)
sys-time-is.D 40.05 ( 0.00%) 24.45 ( 38.95%)
sys-time-lu.D 37.55 ( 0.00%) 29.02 ( 22.72%)
sys-time-mg.D 47.52 ( 0.00%) 34.92 ( 26.52%)
sys-time-sp.D 119.01 ( 0.00%) 109.05 ( 8.37%)
sys-time-ua.D 51.52 ( 0.00%) 45.13 ( 12.40%)
NUMA scan activity is also reduced:
NUMA alloc local 1042828 1342670
NUMA base PTE updates 140481138 93577468
NUMA huge PMD updates 272171 180766
NUMA page range updates 279832690 186129660
NUMA hint faults 1395972 1193897
NUMA hint local faults 877925 855053
NUMA hint local percent 62 71
NUMA pages migrated 12057909 9158023
Similar observations are made for other thread-intensive workloads. System CPU usage is lower even though the headline gains in performance tend to be small. For example, specjbb 2005 shows almost no difference in performance but scan activity is reduced by a third on a 4-socket box. I didn't find a workload (thread intensive or otherwise) that suffered badly.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
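A hedged sketch of the staggering idea, not the kernel code: the first scan of a new thread is pushed out in proportion to how many tasks already share the mm, so siblings do not all trap NUMA faults at once. The helper name and the exact delay formula are assumptions for illustration only.

```c
#define NUMA_SCAN_PERIOD_MIN_MS	1000U	/* assumed minimum scan period, in ms */

/*
 * Delay the first NUMA scan of a newly created thread: the more users the
 * address space already has, the later this thread starts scanning.
 */
unsigned int first_numa_scan_delay_ms(unsigned int mm_users)
{
	return NUMA_SCAN_PERIOD_MIN_MS * (mm_users + 1);
}
```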
-
- 12 May 2018, 1 commit
-
-
Submitted by Mel Gorman
This reverts commit 7347fc87. Srikar Dronamraju pointed out that while the commit in question did show a performance improvement on ppc64, it did so at the cost of disabling active CPU migration by automatic NUMA balancing, which was not the intent. The issue was that a serious flaw in the logic failed to ever active balance if SD_WAKE_AFFINE was disabled on scheduler domains. Even when it's enabled, the logic is still bizarre and against the original intent. Investigation showed that fixing the patch in either the way he suggested, using the correct comparison for jiffies values, or introducing a new numa_migrate_deferred variable in task_struct all perform similarly to a revert, with a mix of gains and losses depending on the workload, machine and socket count. The original intent of the commit was to handle a problem whereby wake_affine, idle balancing and automatic NUMA balancing disagree on the appropriate placement for a task. This was particularly true for cases where a single task was a massive waker of tasks but where wake_wide logic did not apply. This was particularly noticeable when a futex (a barrier) woke all worker threads and tried pulling the wakees to the waker nodes. In that specific case, it could be handled by tuning MPI or openMP appropriately, but the behavior is not illogical and was worth attempting to fix. However, the approach was wrong. Given that we're at rc4 and a fix is not obvious, it's better to play safe, revert this commit and retry later.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: ggherdovich@suse.cz
Cc: hpa@zytor.com
Cc: matt@codeblueprint.co.uk
Cc: mpe@ellerman.id.au
Link: http://lkml.kernel.org/r/20180509163115.6fnnyeg4vdm2ct4v@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 04 May 2018, 2 commits
-
-
Submitted by Viresh Kumar
Call sync_entity_load_avg() directly from find_idlest_cpu() instead of select_task_rq_fair(), as that's where we need to use task's utilization value. And call sync_entity_load_avg() only after making sure sched domain spans over one of the allowed CPUs for the task.
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/cd019d1753824c81130eae7b43e2bbcec47cc1ad.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Viresh Kumar
Rearrange select_task_rq_fair() a bit to avoid executing some conditional statements in a few specific code-paths. That gets rid of the goto as well. This shouldn't result in any functional changes.
Tested-by: Rohit Jain <rohit.k.jain@oracle.com>
Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20831b8d237bf3a20e4e328286f678b425ff04c9.1524738578.git.viresh.kumar@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 03 May 2018, 1 commit
-
-
Submitted by Vincent Guittot
With commit: 31e77c93 ("sched/fair: Update blocked load when newly idle") ... we release the rq->lock when updating blocked load of idle CPUs. This opens a time window during which another CPU can add a task to this CPU's cfs_rq. The check for a newly added task in idle_balance() is not in the common path. Move the out label to include this check.
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 31e77c93 ("sched/fair: Update blocked load when newly idle")
Link: http://lkml.kernel.org/r/20180426103133.GA6953@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 05 April 2018, 1 commit
-
-
Submitted by Davidlohr Bueso
By renaming the functions we can get rid of the skip parameter and have better code readability. It makes zero sense to have things such as:
	rq_clock_skip_update(rq, false)
when the skip request is in fact not going to happen. Ever. Rename things such that we end up with:
	rq_clock_skip_update(rq)
	rq_clock_cancel_skipupdate(rq)
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: matt@codeblueprint.co.uk
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/20180404161539.nhadkff2aats74jh@linux-n805
Signed-off-by: Ingo Molnar <mingo@kernel.org>
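A simplified sketch of the renamed pair, with a stand-in rq struct and flag rather than the kernel's types; it only illustrates how one helper requests a skip and the other cancels a pending request:

```c
struct rq {
	unsigned int clock_update_flags;
};

#define RQCF_REQ_SKIP	0x01	/* "skip the next rq clock update" request */

/* Request that the next clock update be skipped (we know it is redundant). */
static inline void rq_clock_skip_update(struct rq *rq)
{
	rq->clock_update_flags |= RQCF_REQ_SKIP;
}

/* Cancel a previously requested skip: the next update must happen. */
static inline void rq_clock_cancel_skipupdate(struct rq *rq)
{
	rq->clock_update_flags &= ~RQCF_REQ_SKIP;
}
```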
-
- 20 March 2018, 3 commits
-
-
Submitted by Patrick Bellasi
The estimated utilization of a task is currently updated every time the task is dequeued. However, to keep overheads under control, PELT signals are effectively updated at maximum once every 1ms. Thus, for really short running tasks, it can happen that their util_avg value has not been updated since their last enqueue. If such tasks are also frequently running tasks (e.g. the kind of workload generated by hackbench) it can also happen that their util_avg is updated only every few activations. This means that updating util_est at every dequeue potentially introduces unnecessary overheads, and it's also conceptually wrong if the util_avg signal has never been updated during a task activation. Let's introduce a throttling mechanism on a task's util_est updates to sync them with util_avg updates. To make the solution memory efficient, both in terms of space and load/store operations, we encode a synchronization flag into the LSB of util_est.enqueued. This makes util_est an even-values-only metric, which is still considered good enough for its purpose. The synchronization bit is (re)set by __update_load_avg_se() once the PELT signal of a task has been updated during its last activation. Such a throttling mechanism allows util_est overheads in the wakeup hot path to be kept under control, thus making it a suitable mechanism which can also be enabled on high-intensity workload systems. Thus, this patch now switches on the utilization estimation scheduler feature by default.
Suggested-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-5-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
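A hedged sketch of the LSB encoding described above; the flag name follows the changelog's intent, but the helpers are illustrative, not the kernel code:

```c
#define UTIL_AVG_UNCHANGED	0x1U	/* LSB: util_avg not refreshed since last write */

/* Writing the estimate: keep only the even part and tag it with the flag. */
unsigned int util_est_store(unsigned int util)
{
	return util | UTIL_AVG_UNCHANGED;
}

/* PELT update path: clear the flag to record that util_avg has changed. */
unsigned int util_est_pelt_updated(unsigned int enqueued)
{
	return enqueued & ~UTIL_AVG_UNCHANGED;
}

/* Dequeue path: only update util_est if util_avg changed since the last write. */
int util_est_update_needed(unsigned int enqueued)
{
	return !(enqueued & UTIL_AVG_UNCHANGED);
}

/* Consumers always read the estimate with the flag masked out (even values only). */
unsigned int util_est_read(unsigned int enqueued)
{
	return enqueued & ~UTIL_AVG_UNCHANGED;
}
```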
-
Submitted by Patrick Bellasi
When the scheduler looks at the CPU utilization, the current PELT value for a CPU is returned straight away. In certain scenarios this can have undesired side effects on task placement. For example, since the task utilization is decayed at wakeup time, when a long sleeping big task is enqueued it does not immediately add a significant contribution to the target CPU. As a result we generate a race condition where other tasks can be placed on the same CPU while it is still considered relatively empty. In order to reduce this kind of race condition, this patch introduces the required support to integrate the usage of the CPU's estimated utilization in the wakeup path, via cpu_util_wake(), as well as in the load-balance path, via cpu_util(), which is used by update_sg_lb_stats(). The estimated utilization of a CPU is defined to be the maximum between its PELT's utilization and the sum of the estimated utilization (at previous dequeue time) of all the tasks currently RUNNABLE on that CPU. This allows us to properly represent the spare capacity of a CPU which, for example, has just got a big task running after a long sleep period.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-3-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Patrick Bellasi
The util_avg signal computed by PELT is too variable for some use-cases. For example, a big task waking up after a long sleep period will have its utilization almost completely decayed. This introduces some latency before schedutil will be able to pick the best frequency to run a task. The same issue can affect task placement. Indeed, since the task utilization is already decayed at wakeup, when the task is enqueued on a CPU, this can result in a CPU running a big task being temporarily represented as almost empty. This leads to a race condition where other tasks can potentially be allocated on a CPU which just started to run a big task which slept for a relatively long period. Moreover, the PELT utilization of a task can be updated every [ms], thus making it a continuously changing value for certain longer running tasks. This means that the instantaneous PELT utilization of a RUNNING task is not really meaningful to properly support scheduler decisions. For all these reasons, a more stable signal can do a better job of representing the expected/estimated utilization of a task/cfs_rq. Such a signal can be easily created on top of PELT by still using it as an estimator which produces values to be aggregated on meaningful events. This patch adds a simple implementation of util_est, a new signal built on top of PELT's util_avg where:
	util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))
This allows us to remember how big a task has been reported by PELT in its previous activations via f(task::util_avg@dequeue), which is the new _task_util_est(struct task_struct*) function added by this patch. If a task should change its behavior and it runs longer in a new activation, after a certain time its util_est will just track the original PELT signal (i.e. task::util_avg). The estimated utilization of cfs_rq is defined only for root ones. That's because the only sensible consumers of this signal are the scheduler and schedutil when looking for the overall CPU utilization due to FAIR tasks. For this reason, the estimated utilization of a root cfs_rq is simply defined as:
	util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)
where:
	cfs_rq::util_est::enqueued = sum(_task_util_est(task)) for each RUNNABLE task on that root cfs_rq
It's worth noting that the estimated utilization is tracked only for objects of interest, specifically:
 - Tasks: to better support task placement decisions
 - root cfs_rqs: to better support both task placement decisions as well as frequency selection
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@android.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
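A compact sketch of the two formulas quoted above, using plain parameters instead of the kernel's sched_avg/util_est structures; it is meant to show the max() relationships only:

```c
static inline unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

/* util_est(task) = max(task::util_avg, f(task::util_avg@dequeue)) */
unsigned long task_util_est(unsigned long util_avg, unsigned long util_at_dequeue)
{
	return max_ul(util_avg, util_at_dequeue);
}

/* util_est(cfs_rq) = max(cfs_rq::util_avg, sum of _task_util_est() of RUNNABLE tasks) */
unsigned long cfs_rq_util_est(unsigned long cfs_util_avg, unsigned long est_enqueued_sum)
{
	return max_ul(cfs_util_avg, est_enqueued_sum);
}
```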
-
- 09 March 2018, 15 commits
-
-
Submitted by Vincent Guittot
When NEWLY_IDLE load balance is not triggered, we might need to update the blocked load anyway. We can kick an ilb so an idle CPU will take care of updating blocked load or we can try to update them locally before entering idle. In the latter case, we reuse part of the nohz_idle_balance.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518622006-16089-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
We're going to want to call nohz_idle_balance() or parts thereof from idle_balance(). Since we already have a forward declaration of idle_balance() move it down such that it's below nohz_idle_balance() avoiding the need for a forward declaration for that.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
Now that we have two back-to-back NO_HZ_COMMON blocks, merge them.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
This pure code movement results in two #ifdef CONFIG_NO_HZ_COMMON sections landing next to each other.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
Avoid calling update_blocked_averages() when it does not in fact have any by re-using/extending update_nohz_stats().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Vincent Guittot
Instead of using cfs_rq_is_decayed(), which monitors all *_avg and *_sum, we create a cfs_rq_has_blocked() which only takes care of util_avg and load_avg. We are only interested in these two values, which decay faster than the *_sum, so we can stop the periodic update earlier.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
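A hedged stand-in for the narrower check, with a simplified struct instead of the kernel's sched_avg:

```c
#include <stdbool.h>

struct avg_snapshot {
	unsigned long load_avg;
	unsigned long util_avg;
};

/* Keep the periodic update running while either signal has not fully decayed. */
bool cfs_rq_has_blocked(const struct avg_snapshot *avg)
{
	return avg->load_avg || avg->util_avg;
}
```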
-
Submitted by Vincent Guittot
Stop the periodic update of blocked load when all idle CPUs have fully decayed. We introduce a new nohz.has_blocked flag that reflects whether some idle CPUs have blocked load that has to be periodically updated. nohz.has_blocked is set every time an idle CPU can have blocked load, and it is then cleared when no more blocked load has been detected during an update. We don't need atomic operations, but only to make sure of the right ordering when updating nohz.idle_cpus_mask and nohz.has_blocked.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: brendan.jackman@arm.com
Cc: dietmar.eggemann@arm.com
Cc: morten.rasmussen@foss.arm.com
Cc: valentin.schneider@arm.com
Link: http://lkml.kernel.org/r/1518517879-2280-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
It was suggested that a migration hint might be useful for the cpufreq governors.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
The primary observation is that nohz enter/exit is always from the current CPU, therefore NOHZ_TICK_STOPPED does not in fact need to be an atomic. Secondary is that we appear to have 2 nearly identical hooks in the nohz enter code, set_cpu_sd_state_idle() and nohz_balance_enter_idle(). Fold the whole set_cpu_sd_state thing into nohz_balance_{enter,exit}_idle. Removes an atomic op from both enter and exit paths.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
Since we already iterate CPUs looking for work on NEWIDLE, use this iteration to age the blocked load. If the domain for which this is done completely spans the idle set, we can push the ILB-based aging forward.
Suggested-by: Brendan Jackman <brendan.jackman@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
Teach the idle balancer about the need to update statistics which have a different periodicity from regular balancing.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
The current:
	if (nohz_kick_needed())
		nohz_balancer_kick()
is pointless complexity, fold them into a single call and avoid the various conditions at the call site. When we introduce multiple different needs to kick the ilb, the above construct also becomes a problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
Split the NOHZ idle balancer into doing two separate actions:
 - update blocked load statistic
 - actually load-balance
Since the latter requires the former, ensure this happens. For now always tag both bits at the same time. Prepares for a future where we can toggle only the STATS bit.
Suggested-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
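A hedged sketch of the split kick word; the bit names follow the description above ("the STATS bit"), but the exact kernel identifiers and layout are assumptions:

```c
#define NOHZ_STATS_KICK		(1U << 0)	/* just refresh blocked load */
#define NOHZ_BALANCE_KICK	(1U << 1)	/* actually load-balance */
#define NOHZ_KICK_MASK		(NOHZ_STATS_KICK | NOHZ_BALANCE_KICK)

/* For now both bits are always raised together, as noted above. */
unsigned int nohz_kick_all(unsigned int flags)
{
	return flags | NOHZ_KICK_MASK;
}

/* Balancing implies the statistics pass, never the other way around. */
int nohz_needs_balance(unsigned int flags)
{
	return flags & NOHZ_BALANCE_KICK;
}
```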
-
Submitted by Peter Zijlstra
Using atomic_t allows us to use the more flexible bitops provided there. Also, it's smaller.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Norbert Manthey
Due to using GCC defines for configuration, some labels might be unused in certain configurations. While adding a __maybe_unused to the label is fine in general, the line has to be terminated with ';'. This is also reflected in the GCC documentation, but GCC parsed the previous variant without an error message. This has been spotted while compiling with goto-cc, the compiler for the CPROVER tool suite.
Signed-off-by: Norbert Manthey <nmanthey@amazon.de>
Signed-off-by: Michael Tautschnig <tautschn@amazon.co.uk>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1519717660-16157-1-git-send-email-nmanthey@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
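A small illustration of the fix, using the raw GCC attribute so it is self-contained (the kernel spells it __maybe_unused); the config macro is hypothetical:

```c
int maybe_skip(int x)
{
#ifdef CONFIG_SKIP_NEGATIVE		/* hypothetical config switch */
	if (x < 0)
		goto out;
#endif
	x *= 2;
out: __attribute__((__unused__));	/* the trailing ';' terminates the attributed label */
	return x;
}
```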
-
- 04 March 2018, 1 commit
-
-
Submitted by Ingo Molnar
Do the following cleanups and simplifications:
 - sched/sched.h already includes <asm/paravirt.h>, so no need to include it in sched/core.c again.
 - order the <linux/sched/*.h> headers alphabetically
 - add all <linux/sched/*.h> headers to kernel/sched/sched.h
 - remove all unnecessary includes from the .c files that are already included in kernel/sched/sched.h.
Finally, make all scheduler .c files use a single common header:
	#include "sched.h"
... which now contains a union of the relied upon headers. This makes the various .c files easier to read and easier to handle.
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 03 March 2018, 1 commit
-
-
Submitted by Ingo Molnar
A good number of small style inconsistencies have accumulated in the scheduler core, so do a pass over them to harmonize all these details:
 - fix spelling in comments,
 - use curly braces for multi-line statements,
 - remove unnecessary parentheses from integer literals,
 - capitalize consistently,
 - remove stray newlines,
 - add comments where necessary,
 - remove invalid/unnecessary comments,
 - align structure definitions and other data types vertically,
 - add missing newlines for increased readability,
 - fix vertical tabulation where it's misaligned,
 - harmonize preprocessor conditional block labeling and vertical alignment,
 - remove line-breaks where they uglify the code,
 - add newline after local variable definitions,
No change in functionality:
md5: 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
md5: 1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 21 February 2018, 7 commits
-
-
Submitted by Frederic Weisbecker
When a CPU runs in full dynticks mode, a 1Hz tick remains in order to keep the scheduler stats alive. However this residual tick is a burden for bare metal tasks that can't stand any interruption at all, or want to minimize them. The usual boot parameters "nohz_full=" or "isolcpus=nohz" will now outsource these scheduler ticks to the global workqueue so that a housekeeping CPU handles those remotely. The sched_class::task_tick() implementations have been audited and look safe to be called remotely as the target runqueue and its current task are passed in parameter and don't seem to be accessed locally. Note that in the case of using isolcpus, it's still up to the user to affine the global workqueues to the housekeeping CPUs through /sys/devices/virtual/workqueue/cpumask or domains isolation "isolcpus=nohz,domain".
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wanpeng Li <kernellwp@gmail.com>
Link: http://lkml.kernel.org/r/1519186649-3242-6-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Mel Gorman
If wake_affine() pulls a task to another node for any reason and the node is no longer preferred then temporarily stop automatic NUMA balancing pulling the task back. Otherwise, tasks with a strong waker/wakee relationship may constantly fight automatic NUMA balancing over where a task should be placed. Once again netperf is interesting here. The performance barely changes but automatic NUMA balancing is interesting: Hmean send-64 354.67 ( 0.00%) 352.15 ( -0.71%) Hmean send-128 702.91 ( 0.00%) 693.84 ( -1.29%) Hmean send-256 1350.07 ( 0.00%) 1344.19 ( -0.44%) Hmean send-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%) Hmean send-2048 9687.44 ( 0.00%) 9624.45 ( -0.65%) Hmean send-3312 14577.64 ( 0.00%) 14514.35 ( -0.43%) Hmean send-4096 16393.62 ( 0.00%) 16488.30 ( 0.58%) Hmean send-8192 26877.26 ( 0.00%) 26431.63 ( -1.66%) Hmean send-16384 38683.43 ( 0.00%) 38264.91 ( -1.08%) Hmean recv-64 354.67 ( 0.00%) 352.15 ( -0.71%) Hmean recv-128 702.91 ( 0.00%) 693.84 ( -1.29%) Hmean recv-256 1350.07 ( 0.00%) 1344.19 ( -0.44%) Hmean recv-1024 5124.38 ( 0.00%) 4941.24 ( -3.57%) Hmean recv-2048 9687.43 ( 0.00%) 9624.45 ( -0.65%) Hmean recv-3312 14577.59 ( 0.00%) 14514.35 ( -0.43%) Hmean recv-4096 16393.55 ( 0.00%) 16488.20 ( 0.58%) Hmean recv-8192 26876.96 ( 0.00%) 26431.29 ( -1.66%) Hmean recv-16384 38682.41 ( 0.00%) 38263.94 ( -1.08%) NUMA alloc hit 1465986 1423090 NUMA alloc miss 0 0 NUMA interleave hit 0 0 NUMA alloc local 1465897 1423003 NUMA base PTE updates 1473 1420 NUMA huge PMD updates 0 0 NUMA page range updates 1473 1420 NUMA hint faults 1383 1312 NUMA hint local faults 451 124 NUMA hint local percent 32 9 There is a slight degrading in performance but there are slightly fewer NUMA faults. There is a large drop in the percentage of local faults but the bulk of migrations for netperf are in small shared libraries so it's reflecting the fact that automatic NUMA balancing has backed off. This is a case where despite wake_affine() and automatic NUMA balancing fighting for placement that there is a marginal benefit to rescheduling to local data quickly. However, it should be noted that wake_affine() and automatic NUMA balancing fighting each other constantly is undesirable. However, the benefit in other cases is large. This is the result for NAS with the D class sizing on a 4-socket machine: nas-mpi 4.15.0 4.15.0 sdnuma-v1r23 delayretry-v1r23 Time cg.D 557.00 ( 0.00%) 431.82 ( 22.47%) Time ep.D 77.83 ( 0.00%) 79.01 ( -1.52%) Time is.D 26.46 ( 0.00%) 26.64 ( -0.68%) Time lu.D 727.14 ( 0.00%) 597.94 ( 17.77%) Time mg.D 191.35 ( 0.00%) 146.85 ( 23.26%) 4.15.0 4.15.0 sdnuma-v1r23delayretry-v1r23 User 75665.20 70413.30 System 20321.59 8861.67 Elapsed 766.13 634.92 Minor Faults 16528502 7127941 Major Faults 4553 5068 NUMA alloc local 6963197 6749135 NUMA base PTE updates 366409093 107491434 NUMA huge PMD updates 687556 198880 NUMA page range updates 718437765 209317994 NUMA hint faults 13643410 4601187 NUMA hint local faults 9212593 3063996 NUMA hint local percent 67 66 Note the massive reduction in system CPU usage even though the percentage of local faults is barely affected. There is a massive reduction in the number of PTE updates showing that automatic NUMA balancing has backed off. A critical observation is also that there is a massive reduction in minor faults which is due to far fewer NUMA hinting faults being trapped. There were questions on NAS OMP and how it behaved related to threads being bound to CPUs. 
First, there are more gains than losses with this patch applied and a reduction in system CPU usage: nas-omp 4.16.0-rc1 4.16.0-rc1 sdnuma-v2r1 delayretry-v2r1 Time bt.D 436.71 ( 0.00%) 430.05 ( 1.53%) Time cg.D 201.02 ( 0.00%) 180.87 ( 10.02%) Time ep.D 32.84 ( 0.00%) 32.68 ( 0.49%) Time is.D 9.63 ( 0.00%) 9.64 ( -0.10%) Time lu.D 331.20 ( 0.00%) 304.80 ( 7.97%) Time mg.D 54.87 ( 0.00%) 52.72 ( 3.92%) Time sp.D 1108.78 ( 0.00%) 917.10 ( 17.29%) Time ua.D 378.81 ( 0.00%) 398.83 ( -5.28%) 4.16.0-rc1 4.16.0-rc1 sdnuma-v2r1delayretry-v2r1 User 305633.08 296751.91 System 451.75 357.80 Elapsed 2595.73 2368.13 However, it does not close the gap between binding and being unbound. There is negligible difference between the performance of the baseline and a patched kernel when threads are bound so it is not presented here: 4.16.0-rc1 4.16.0-rc1 delayretry-bind delayretry-unbound Time bt.D 385.02 ( 0.00%) 430.05 ( -11.70%) Time cg.D 144.02 ( 0.00%) 180.87 ( -25.59%) Time ep.D 32.85 ( 0.00%) 32.68 ( 0.52%) Time is.D 10.52 ( 0.00%) 9.64 ( 8.37%) Time lu.D 285.31 ( 0.00%) 304.80 ( -6.83%) Time mg.D 43.21 ( 0.00%) 52.72 ( -22.01%) Time sp.D 820.24 ( 0.00%) 917.10 ( -11.81%) Time ua.D 337.09 ( 0.00%) 398.83 ( -18.32%) 4.16.0-rc1 4.16.0-rc1 delayretry-binddelayretry-unbound User 277731.25 296751.91 System 261.29 357.80 Elapsed 2100.55 2368.13 Unfortunately, while performance is improved by the patch, there is still quite a long way to go before it's equivalent to hard binding. Other workloads like hackbench, tbench, dbench and schbench are barely affected. dbench shows a mix of gains and losses depending on the machine although in general, the results are more stable. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-7-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Submitted by Mel Gorman
find_idlest_group() compares a local group with each other group to select the one that is most idle. When comparing groups in different NUMA domains, a very slight imbalance is enough to select a remote NUMA node even if the runnable load on both groups is 0 or close to 0. This ignores the cost of remote accesses entirely and is a problem when selecting the CPU for a newly forked task to run on. This is problematic when a forking server is almost guaranteed to run on a remote node incurring numerous remote accesses and potentially causing automatic NUMA balancing to try migrate the task back or migrate the data to another node. Similar weirdness is observed if a basic shell command pipes output to another as each process in the pipeline is likely to start on different nodes and then get adjusted later by wake_affine(). This patch adds imbalance to remote domains when considering whether to select CPUs from remote domains. If the local domain is selected, imbalance will still be used to try select a CPU from a lower scheduler domain's group instead of stacking tasks on the same CPU. A variety of workloads and machines were tested and as expected, there is no difference on UMA. The difference on NUMA can be dramatic. This is a comparison of elapsed times running the git regression test suite. It's fork-intensive with short-lived processes: 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Elapsed min 1706.06 ( 0.00%) 1435.94 ( 15.83%) Elapsed mean 1709.53 ( 0.00%) 1436.98 ( 15.94%) Elapsed stddev 2.16 ( 0.00%) 1.01 ( 53.38%) Elapsed coeffvar 0.13 ( 0.00%) 0.07 ( 44.54%) Elapsed max 1711.59 ( 0.00%) 1438.01 ( 15.98%) 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 User 5434.12 5188.41 System 4878.77 3467.09 Elapsed 10259.06 8624.21 That shows a considerable reduction in elapsed times. It's important to note that automatic NUMA balancing does not affect this load as processes are too short-lived. There is also a noticable impact on hackbench such as this example using processes and pipes: hackbench-process-pipes 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Amean 1 1.0973 ( 0.00%) 0.9393 ( 14.40%) Amean 4 1.3427 ( 0.00%) 1.3730 ( -2.26%) Amean 7 1.4233 ( 0.00%) 1.6670 ( -17.12%) Amean 12 3.0250 ( 0.00%) 3.3013 ( -9.13%) Amean 21 9.0860 ( 0.00%) 9.5343 ( -4.93%) Amean 30 14.6547 ( 0.00%) 13.2433 ( 9.63%) Amean 48 22.5447 ( 0.00%) 20.4303 ( 9.38%) Amean 79 29.2010 ( 0.00%) 26.7853 ( 8.27%) Amean 110 36.7443 ( 0.00%) 35.8453 ( 2.45%) Amean 141 45.8533 ( 0.00%) 42.6223 ( 7.05%) Amean 172 55.1317 ( 0.00%) 50.6473 ( 8.13%) Amean 203 64.4420 ( 0.00%) 58.3957 ( 9.38%) Amean 234 73.2293 ( 0.00%) 67.1047 ( 8.36%) Amean 265 80.5220 ( 0.00%) 75.7330 ( 5.95%) Amean 296 88.7567 ( 0.00%) 82.1533 ( 7.44%) It's not a universal win as there are occasions when spreading wide and quickly is a benefit but it's more of a win than it is a loss. For other workloads, there is little difference but netperf is interesting. Without the patch, the server and client starts on different nodes but quickly get migrated due to wake_affine. 
Hence, the difference is overall performance is marginal but detectable: 4.15.0 4.15.0 noexit-v1r23 sdnuma-v1r23 Hmean send-64 349.09 ( 0.00%) 354.67 ( 1.60%) Hmean send-128 699.16 ( 0.00%) 702.91 ( 0.54%) Hmean send-256 1316.34 ( 0.00%) 1350.07 ( 2.56%) Hmean send-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%) Hmean send-2048 9705.19 ( 0.00%) 9687.44 ( -0.18%) Hmean send-3312 14359.48 ( 0.00%) 14577.64 ( 1.52%) Hmean send-4096 16324.20 ( 0.00%) 16393.62 ( 0.43%) Hmean send-8192 26112.61 ( 0.00%) 26877.26 ( 2.93%) Hmean send-16384 37208.44 ( 0.00%) 38683.43 ( 3.96%) Hmean recv-64 349.09 ( 0.00%) 354.67 ( 1.60%) Hmean recv-128 699.16 ( 0.00%) 702.91 ( 0.54%) Hmean recv-256 1316.34 ( 0.00%) 1350.07 ( 2.56%) Hmean recv-1024 5063.99 ( 0.00%) 5124.38 ( 1.19%) Hmean recv-2048 9705.16 ( 0.00%) 9687.43 ( -0.18%) Hmean recv-3312 14359.42 ( 0.00%) 14577.59 ( 1.52%) Hmean recv-4096 16323.98 ( 0.00%) 16393.55 ( 0.43%) Hmean recv-8192 26111.85 ( 0.00%) 26876.96 ( 2.93%) Hmean recv-16384 37206.99 ( 0.00%) 38682.41 ( 3.97%) However, what is very interesting is how automatic NUMA balancing behaves. Each netperf instance runs long enough for balancing to activate: NUMA base PTE updates 4620 1473 NUMA huge PMD updates 0 0 NUMA page range updates 4620 1473 NUMA hint faults 4301 1383 NUMA hint local faults 1309 451 NUMA hint local percent 30 32 NUMA pages migrated 1335 491 AutoNUMA cost 21% 6% There is an unfortunate number of remote faults although tracing indicated that the vast majority are in shared libraries. However, the tendency to start tasks on the same node if there is capacity means that there were far fewer PTE updates and faults incurred overall. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-6-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Submitted by Peter Zijlstra
When a task exits, it notifies the parent that it has exited. This is a sync wakeup and the exiting task may pull the parent towards the wakers CPU. For simple workloads like using a shell, it was observed that the shell is pulled across nodes by exiting processes. This is daft as the parent may be long-lived and properly placed. This patch special cases a sync wakeup on exit to avoid pulling tasks across nodes. Testing on a range of workloads and machines showed very little differences in performance although there was a small 3% boost on some machines running a shellscript intensive workload (git regression test suite).
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
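A hedged sketch of the special case: a wakeup only counts as sync if the waker itself is not exiting, so a dying child does not drag its long-lived parent to the child's CPU. The flag names mirror the kernel's WF_SYNC/PF_EXITING, but the values and surrounding code are simplified assumptions:

```c
#include <stdbool.h>

#define WF_SYNC		0x01	/* waker goes to sleep after this wakeup */
#define PF_EXITING	0x04	/* waker is in the middle of exiting */

bool wakeup_is_sync(unsigned int wake_flags, unsigned int waker_flags)
{
	/* A dying waker should not pull the wakee towards its CPU/node. */
	return (wake_flags & WF_SYNC) && !(waker_flags & PF_EXITING);
}
```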
-
Submitted by Mel Gorman
wake_affine_weight() will consider migrating a task to, or near, the current CPU if there is a load imbalance. If the CPUs share LLC then either CPU is valid as a search-for-idle-sibling target and equally appropriate for stacking two tasks on one CPU if an idle sibling is unavailable. If they do not share cache then a cross-node migration potentially impacts locality so while they are equal from a CPU capacity point of view, they are not equal in terms of memory locality. In either case, it's more appropriate to migrate only if there is a difference in their effective load. This patch modifies wake_affine_weight() to only consider migrating a task if there is a load imbalance for normal wakeups but will allow potential stacking if the loads are equal and it's a sync wakeup. For the most part, the different in performance is marginal. For example, on a 4-socket server running netperf UDP_STREAM on localhost the differences are as follows: 4.15.0 4.15.0 16rc0 noequal-v1r23 Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%) Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%) Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%) Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%) Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%) Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%) Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%) Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%) Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%) Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%) Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%) Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%) Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%) Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%) Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%) Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%) Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%) Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%) Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%) Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%) The difference in overall performance is marginal but note that most measurements are less variable. There were similar observations for other netperf comparisons. hackbench with sockets or threads with processes or threads showed minor difference with some reduction of migration. tbench showed only marginal differences that were within the noise. dbench, regardless of filesystem, showed minor differences all of which are within noise. Multiple machines, both UMA and NUMA were tested without any regressions showing up. The biggest risk with a patch like this is affecting wakeup latencies. 
However, the schbench load from Facebook which is very sensitive to wakeup latency showed a mixed result with mostly improvements in wakeup latency: 4.15.0 4.15.0 16rc0 noequal-v1r23 Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%) Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%) Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%) Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%) Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%) Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%) Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%) Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%) Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%) Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%) Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%) Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%) Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%) Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%) Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%) Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%) Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%) Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%) Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%) Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%) Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%) Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%) Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%) Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%) Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%) Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%) Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%) Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%) Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%) Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%) Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%) Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%) Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%) Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%) Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%) Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%) Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%) Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%) Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%) Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%) Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%) Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
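A hedged sketch of the decision described in the commit above: pick the waking CPU only on a strict effective-load imbalance, except that a sync wakeup may still stack on it when the loads are equal. This illustrates the policy, not the wake_affine_weight() code itself:

```c
int wake_affine_pick(unsigned long this_eff_load, unsigned long prev_eff_load,
		     int this_cpu, int prev_cpu, int sync)
{
	if (this_eff_load < prev_eff_load)
		return this_cpu;
	/* Equal load: only a sync wakeup justifies stacking on the waker's CPU. */
	if (sync && this_eff_load == prev_eff_load)
		return this_cpu;
	return prev_cpu;
}
```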
-
Submitted by Mel Gorman
On sync wakeups, the previous CPU effective load may not be used so delay the calculation until it's needed.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Mel Gorman
The only caller of wake_affine() knows the CPU ID. Pass it in instead of rechecking it.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Giovanni Gherdovich <ggherdovich@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 13 February 2018, 1 commit
-
-
Submitted by Vincent Guittot
Remove a useless space in # ifdef and align it with others.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 06 February 2018, 4 commits
-
-
Submitted by Mel Gorman
The select_idle_sibling() (SIS) rewrite in commit: 10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()") ... replaced a domain iteration with a search that broadly speaking does a wrapped walk of the scheduler domain sharing a last-level-cache. While this had a number of improvements, one consequence is that two tasks that share a waker/wakee relationship push each other around a socket. Even though two tasks may be active, all cores are evenly used. This is great from a search perspective and spreads a load across individual cores, but it has adverse consequences for cpufreq. As each CPU has relatively low utilisation, cpufreq may decide the utilisation is too low to use a higher P-state and overall computation throughput suffers. While individual cpufreq and cpuidle drivers may compensate by artificially boosting P-state (at C0) or avoiding lower C-states (during idle), it does not help if hardware-based cpufreq (e.g. HWP) is used. This patch tracks a recently used CPU: the CPU a task was running on when it last was a waker, or a CPU it was recently using when the task is a wakee. During SIS, the recently used CPU is used as a target if it's still allowed by the task and is idle. The benefit may be non-obvious, so consider an example of two tasks communicating back and forth. Task A may be an application doing IO where task B is a kworker or kthread like journald. Task A may issue IO, wake B and B wakes up A on completion. With the existing scheme this may look like the following (potentially different IDs if SMT is in use but a similar principle applies):
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 2)
A (cpu 2) wake B (wakes on cpu 3)
etc.
A careful reader may wonder why CPU 0 was not idle when B wakes A the first time, and it's simply due to the fact that A can be rescheduled to another CPU and the pattern is that prev == target when B tries to wake up A, and the information about CPU 0 has been lost. With this patch, the pattern is more likely to be:
A (cpu 0) wake B (wakes on cpu 1)
B (cpu 1) wake A (wakes on cpu 0)
A (cpu 0) wake B (wakes on cpu 1)
etc.
i.e. two communicating tasks are more likely to use just two cores instead of all available cores sharing a LLC. The most dramatic speedup was noticed on dbench using the XFS filesystem on UMA as clients interact heavily with workqueues in that configuration. Note that a similar speedup is not observed on ext4 as the wakeup pattern is different:
4.15.0-rc9 4.15.0-rc9
waprev-v1 biasancestor-v1
Hmean 1 287.54 ( 0.00%) 817.01 ( 184.14%)
Hmean 2 1268.12 ( 0.00%) 1781.24 ( 40.46%)
Hmean 4 1739.68 ( 0.00%) 1594.47 ( -8.35%)
Hmean 8 2464.12 ( 0.00%) 2479.56 ( 0.63%)
Hmean 64 1455.57 ( 0.00%) 1434.68 ( -1.44%)
The results can be less dramatic on NUMA where automatic balancing interferes with the test. It's also known that network benchmarks running on localhost also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP and TCP depending on the machine). Hackbench also sees small improvements (6-11% depending on machine and thread count). The facebook schbench was also tested but in most cases showed little or no difference in wakeup latencies.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
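A hedged sketch of the recently-used-CPU shortcut tried during SIS; the helpers are trivial stubs for the kernel's cpus_share_cache()/idle_cpu()/cpumask tests, and the surrounding select_idle_sibling() logic is omitted:

```c
#include <stdbool.h>

struct task_sketch {
	int recent_used_cpu;
};

/* Stubs for illustration only. */
static bool cpus_share_cache(int a, int b)                   { (void)a; (void)b; return true; }
static bool cpu_is_idle(int cpu)                             { (void)cpu; return true; }
static bool cpu_allowed(const struct task_sketch *p, int c)  { (void)p; (void)c; return true; }

int try_recent_used_cpu(struct task_sketch *p, int prev, int target)
{
	int recent = p->recent_used_cpu;

	if (recent != prev && recent != target &&
	    cpus_share_cache(recent, target) &&
	    cpu_is_idle(recent) &&
	    cpu_allowed(p, recent)) {
		/* Remember the CPU being left so the next wakeup can try it. */
		p->recent_used_cpu = prev;
		return recent;
	}
	return -1;	/* fall back to the normal idle-sibling search */
}
```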
-
Submitted by Mel Gorman
wake_affine_idle() prefers to move a task to the current CPU if the wakeup is due to an interrupt. The expectation is that the interrupt data is cache hot and relevant to the waking task as well as avoiding a search. However, there is no way to determine if there was cache hot data on the previous CPU that may exceed the interrupt data. Furthermore, round-robin delivery of interrupts can migrate tasks around a socket where each CPU is under-utilised. This can interact badly with cpufreq which makes decisions based on per-cpu data. It has been observed on machines with HWP that p-states are not boosted to their maximum levels even though the workload is latency and throughput sensitive. This patch uses the previous CPU for the task if it's idle and cache-affine with the current CPU even if the current CPU is idle due to the wakeup being related to the interrupt. This reduces migrations at the cost of the interrupt data not being cache hot when the task wakes. A variety of workloads were tested on various machines and no adverse impact was noticed that was outside noise. dbench on ext4 on UMA showed roughly a 10% reduction in the number of CPU migrations and it is a case where interrupts are frequent for IO completions. In most cases, the difference in performance is quite small but variability is often reduced. For example, this is the result for pgbench running on a UMA machine with different numbers of clients:
4.15.0-rc9 4.15.0-rc9
baseline waprev-v1
Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%)
Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%)
Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%)
Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%)
Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%)
Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%)
Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%)
Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%)
Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%)
Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%)
CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%)
Note that the difference in performance is marginal but for low utilisation, there is less variability.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
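A hedged sketch of the preference this adds: even when this_cpu is idle because the wakeup came from an interrupt, stay on prev_cpu if it is cache-affine and idle. Stub helpers stand in for the kernel's idle_cpu()/cpus_share_cache():

```c
/* Stubs for illustration only. */
static int cpu_idle_stub(int cpu)         { (void)cpu; return 1; }
static int share_cache_stub(int a, int b) { (void)a; (void)b; return 1; }

/* Returns the chosen CPU, or -1 to fall through to the weight-based check. */
int wake_affine_idle_pick(int this_cpu, int prev_cpu)
{
	if (cpu_idle_stub(this_cpu) && share_cache_stub(this_cpu, prev_cpu))
		return cpu_idle_stub(prev_cpu) ? prev_cpu : this_cpu;

	return -1;
}
```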
-
Submitted by Mel Gorman
This is a preparation patch that has wake_affine*() return a CPU ID instead of a boolean. The intent is to allow the wake_affine() helpers to be avoided if a decision is already made. This patch has no functional change.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Submitted by Mel Gorman
wake_affine_idle() takes parameters it never uses so clean it up.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-