- 24 6月, 2020 5 次提交
-
-
由 Dietmar Eggemann 提交于
to #28739709 commit 5e83eafbfd3b351537c0d74467fc43e8a88f4ae4 upstream With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx] any more. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit f2bedc4705659216bd60948029ad8dfedf923ad9 upstream The CFS class is the only one maintaining and using the CPU wide load (rq->load(.weight)). The last use case of the CPU wide load in CFS's set_next_entity() can be replaced by using the load of the CFS class (rq->cfs.load(.weight)) instead. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 This patch adds cpuacct.cgroup_wait_latency interface. It exports the histogram of the sched entity's schedule latency. Unlike wait_latency, the sched entity is a cgroup rather than task. This is useful when tasks are not directly clustered under one cgroup. For examples: cgroup1 --- cgroupA --- task1 --- cgroupB --- task2 cgroup2 --- cgroupC --- task3 --- cgroupD --- task4 This is a common cgroup hierarchy used by many applications. With cgroup_wait_latency, we can just read from cgroup1 to know aggregated wait latency information of task1 and task2. The interface output format is identical to cpuacct.wait_latency. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 This patch measures time that tasks in cpuacct cgroup blocks. There are two types: blocked due to IO, and others like locks. And they are exported in"cpuacct.ioblock_latency" and "cpuacct.block_latency" respectively. According to histogram, we know the detailed distribution of the duration. And according to total(ms), we know the percentage of time tasks spent off rq, waiting for resources: (△ioblock_latency.total(ms) + △block_latency.total(ms)) / △wall_time The interface output format is identical to cpuacct.wait_latency. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Export wait_latency in "cpuacct.wait_latency", which indicates the time that tasks in a cpuacct cgroup wait on a cfs_rq to be scheduled. This is like "perf sched", but it gives smaller overhead. So it can be used as monitor constantly. wait_latency is useful to debug application's high RT problem. It can tell if it's caused by scheduling or not. If it is, loadavg can tell if it's caused by bad scheduling bahaviour or system overloads. System admins can also use wait_latency to define SLA. To ensure SLA is guaranteed, there are various ways to decrease wait_latency. This feature is disabled by default for performance concerns. It can be switched on dynamically by "echo 0 > /proc/cpusli/sched_lat_enable" Example: $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency 0-1ms: 4139 1-4ms: 317 4-7ms: 568 7-10ms: 0 10-100ms: 42324 100-500ms: 9131 500-1000ms: 95 1000-5000ms: 134 5000-10000ms: 0 >=10000ms: 0 total(ms): 4256455 Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 23 6月, 2020 2 次提交
-
-
由 Yihao Wu 提交于
to #28739709 Assume workloads are composed of massive short tasks. Then periodical load tracking is unnecessary. Because load tracking should be already guaranteed by frequent sleep and wake-up. If these massive short tasks run in their individual cgroups, the load tracking becomes extremely heavy. This patch adds a switch to bypass scheduler_tick load tracking, in order to reduce scheduler overhead, without sacrificing much balance in this scenario. Performance Tests: 1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine sched overhead(each HT): 0.74% -> 0.48% (This test's baseline is from the previous patch) 2) sysbench-threads with 96 threads, running for 5min latency_ms 95th: 63.07 -> 54.01 Besides these, no regression is found on our test platform. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Unless the workloads are IO-bounded, update_blocked_averages doesn't help load balance. This patch adds a switch to bypass update_blocked_averages if prior knowledge about workloads indicates IO is negligible. Performance Tests: 1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine sched overhead(each HT): 3.78% -> 0.74% 2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake overhead: 21.06 -> 18.08 3) unixbench context1 with 96 threads running for 1min Score: 15409.40 -> 16821.77 Besides these, UnixBench has some performance ups and downs. But generally, the performance of UnixBench hasn't changed. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 22 6月, 2020 1 次提交
-
-
由 Yihao Wu 提交于
to #28143829 rq_clock_task is less than rq_clock when in VM, or when IRQ_TIME_ACCOUNTING is on. So they are not comparable when accounting elapsed time. This bug is not observed on host yet, because neither of these two conditions are met. Use rq_clock at both begin and end of exec_start_raw accumulation to fix this bug, because we expect steal% in cpuacct.proc_stat of VM's cgroups can reflect the cpu time the host steal from the guest. Fixes: c7552980 ("alinux: sched: Introduce per-cgroup steal accounting") Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 08 6月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
fix #28384496 commit 18f855e574d9799a0e7489f8ae6fd8447d0dd74a upstream Stefano reported a crash with using SQPOLL with io_uring: BUG: kernel NULL pointer dereference, address: 00000000000003b0 CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11 RIP: 0010:task_numa_work+0x4f/0x2c0 Call Trace: task_work_run+0x68/0xa0 io_sq_thread+0x252/0x3d0 kthread+0xf9/0x130 ret_from_fork+0x35/0x40 which is task_numa_work() oopsing on current->mm being NULL. The task work is queued by task_tick_numa(), which checks if current->mm is NULL at the time of the call. But this state isn't necessarily persistent, if the kthread is using use_mm() to temporarily adopt the mm of a task. Change the task_tick_numa() check to exclude kernel threads in general, as it doesn't make sense to attempt ot balance for kthreads anyway. Reported-by: NStefano Garzarella <sgarzare@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NIngo Molnar <mingo@kernel.org> Acked-by: NPeter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dkSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 26 5月, 2020 1 次提交
-
-
由 Yihao Wu 提交于
fix #27788368 per cgroup nr_uninterruptible tracking leads to huge performance regression of hackbench. This patch delete nr_uninterruptible related code for now, to address performance regression issue. Fixes: 9410d314 ("alinux: cpuacct: Export nr_running & nr_uninterruptible") Fixes: 36da4fe9 ("alinux: sched: Maintain "nr_uninterruptible" in runqueue") Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
- 07 5月, 2020 1 次提交
-
-
由 Yihao Wu 提交于
to #27372989 CONFIG_THREAD_INFO_IN_TASK is not set for aarch64, task_struct has no cpu member. We should use the helper functiton task_cpu instead. Fixes: 9e7b35d6 ("alinux: sched: Introduce per-cgroup iowait accounting") Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
- 24 4月, 2020 4 次提交
-
-
由 Yihao Wu 提交于
to #26424323 We account iowait when the cgroup's se is idle, and it has blocked task on the hierarchy of se->my_q. To achieve this, we also add cg_nr_running to track the hierarchical number of blocked tasks. We do it when a blocked task wakes up or a task is blocked. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #26424323 From the previous patch. We know there are 4 possible states. Since steal state's transition is complex. We choose to account its supplement. steal = elapse - idle - sum_exec_raw - ineffective Where elapse is the time since the cgroup is created. sum_exec_raw is the running time including IRQ time. ineffective is the total time that the cpuacct-binded cpuset doesn't allow this cpu for the cgroup. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #26424323 Since we concern idle, let's take idle as the center state. And omit transition between other stats. Below is the state transition graph: sleep->deque +-----------+ cpumask +-------+ exit->deque +-------+ |ineffective|-------- | idle | <-----------|running| +-----------+ +-------+ +-------+ ^ | unthrtl child -> deque | | wake -> deque | |thrtl chlid -> enque migrate -> deque | |migrate -> enque | v +-------+ | steal | +-------+ We conclude idle state condition as: !se->on_rq && !my_q->throttled && cpu allowed. From this graph and condition, we can hook (de|en)queue_task_fair update_cpumasks_hier, (un|)throttle_cfs_rq to account idle state. In the hooked functions, we also check the conditions, to avoid accounting unwanted cpu clocks. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Xunlei Pang 提交于
to #26424323 It's relatively easy to maintain nr_uninterruptible in scheduler compared to doing it in cpuacct, we assume that "cpu,cpuacct" are bound together, so that it can be used for per-cgroup load. This will be needed to calculate per-cgroup load average later. Reviewed-by: NMichael Wang <yun.wang@linux.alibaba.com> Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
-
- 22 4月, 2020 1 次提交
-
-
由 Huaixin Chang 提交于
fix #25892693 commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream Currently, there is a potential race between distribute_cfs_runtime() and assign_cfs_rq_runtime(). Race happens when cfs_b->runtime is read, distributes without holding lock and finds out there is not enough runtime to charge against after distribution. Because assign_cfs_rq_runtime() might be called during distribution, and use cfs_b->runtime at the same time. Fibtest is the tool to test this race. Assume all gcfs_rq is throttled and cfs period timer runs, slow threads might run and sleep, returning unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local pool. If all this happens sufficiently quickly, cfs_b->runtime will drop a lot. If runtime distributed is large too, over-use of runtime happens. A runtime over-using by about 70 percent of quota is seen when we test fibtest on a 96-core machine. We run fibtest with 1 fast thread and 95 slow threads in test group, configure 10ms quota for this group and see the CPU usage of fibtest is 17.0%, which is far from than the expected 10%. On a smaller machine with 32 cores, we also run fibtest with 96 threads. CPU usage is more than 12%, which is also more than expected 10%. This shows that on similar workloads, this race do affect CPU bandwidth control. Solve this by holding lock inside distribute_cfs_runtime(). Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Signed-off-by: NHuaixin Chang <changhuaixin@linux.alibaba.com> Reviewed-by: NBen Segall <bsegall@google.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
-
- 17 1月, 2020 1 次提交
-
-
由 Huaixin Chang 提交于
For a long time runnable cpu load has been used in selecting task rq when waking up tasks. Recent test has shown for test load with a large quantity of short running tasks and almost full cpu utility, static load is more helpful. In our e2e tests, runnable load avg of java threads ranges from less than 10 to as large as 362, while these java threads are no different from each other, and should be treated in the same way. After using static load, qps imporvement has been seen in multiple test cases. A new sched feature WA_STATIC_WEIGHT is introduced here to control. Echo WA_STATIC_WEIGHT to /sys/kernel/debug/sched_features to turn static load in wake_affine_weight on and NO_WA_STATIC_WEIGHT to turn it off. This feature is kept off by default. Test is done on the following hardware: 4 threads Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz In tests with 120 threads and sql loglevel configured to info: NO_WA_STATIC_WEIGHT WA_STATIC_WEIGHT 33170.63 34614.95 (+4.35%) In tests with 160 threads and sql loglevel configured to info: NO_WA_STATIC_WEIGHT WA_STATIC_WEIGHT 35888.71 38247.20 (+6.57%) In tests with 160 threads and sql loglevel configured to warn: NO_WA_STATIC_WEIGHT WA_STATIC_WEIGHT 39118.72 39698.72 (+1.48%) Signed-off-by: NHuaixin Chang <changhuaixin@linux.alibaba.com> Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
- 27 12月, 2019 1 次提交
-
-
由 bsegall@google.com 提交于
commit 66567fcbaecac455caa1b13643155d686b51ce63 upstream. When a cfs_rq sleeps and returns its quota, we delay for 5ms before waking any throttled cfs_rqs to coalesce with other cfs_rqs going to sleep, as this has to be done outside of the rq lock we hold. The current code waits for 5ms without any sleeps, instead of waiting for 5ms from the first sleep, which can delay the unthrottle more than we want. Switch this around so that we can't push this forward forever. This requires an extra flag rather than using hrtimer_active, since we need to start a new timer if the current one is in the process of finishing. Signed-off-by: NBen Segall <bsegall@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com> Acked-by: NPhil Auld <pauld@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 13 12月, 2019 1 次提交
-
-
由 Xuewei Zhang 提交于
commit 4929a4e6faa0f13289a67cae98139e727f0d4a97 upstream. The quota/period ratio is used to ensure a child task group won't get more bandwidth than the parent task group, and is calculated as: normalized_cfs_quota() = [(quota_us << 20) / period_us] If the quota/period ratio was changed during this scaling due to precision loss, it will cause inconsistency between parent and child task groups. See below example: A userspace container manager (kubelet) does three operations: 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us. 2) Create a few children cgroups. 3) Set quota to 1,000us and period to 10,000us on a child cgroup. These operations are expected to succeed. However, if the scaling of 147/128 happens before step 3, quota and period of the parent cgroup will be changed: new_quota: 1148437ns, 1148us new_period: 11484375ns, 11484us And when step 3 comes in, the ratio of the child cgroup will be 104857, which will be larger than the parent cgroup ratio (104821), and will fail. Scaling them by a factor of 2 will fix the problem. Tested-by: NPhil Auld <pauld@redhat.com> Signed-off-by: NXuewei Zhang <xueweiz@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPhil Auld <pauld@redhat.com> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup") Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 01 12月, 2019 1 次提交
-
-
由 Valentin Schneider 提交于
[ Upstream commit 3f130a37c442d5c4d66531b240ebe9abfef426b5 ] When load_balance() fails to move some load because of task affinity, we end up increasing sd->balance_interval to delay the next periodic balance in the hopes that next time we look, that annoying pinned task(s) will be gone. However, idle_balance() pays no attention to sd->balance_interval, yet it will still lead to an increase in balance_interval in case of pinned tasks. If we're going through several newidle balances (e.g. we have a periodic task), this can lead to a huge increase of the balance_interval in a very small amount of time. To prevent that, don't increase the balance interval when going through a newidle balance. This is a similar approach to what is done in commit 58b26c4c ("sched: Increment cache_nice_tries only on periodic lb"), where we disregard newidle balance and rely on periodic balance for more stable results. Signed-off-by: NValentin Schneider <valentin.schneider@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar.Eggemann@arm.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: patrick.bellasi@arm.com Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/1537974727-30788-2-git-send-email-valentin.schneider@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 13 11月, 2019 2 次提交
-
-
由 Qian Cai 提交于
commit 763a9ec06c409dcde2a761aac4bb83ff3938e0b3 upstream. Commit: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") introduced a few compilation warnings: kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime': kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable] kernel/sched/fair.c: In function 'start_cfs_bandwidth': kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable] Also, __refill_cfs_bandwidth_runtime() does no longer update the expiration time, so fix the comments accordingly. Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBen Segall <bsegall@google.com> Reviewed-by: NDave Chiluk <chiluk+linux@indeed.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: pauld@redhat.com Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices") Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pwSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Dave Chiluk 提交于
commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream. It has been observed, that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive non-cpu bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root caused to cpu-local run queue being allocated per cpu bandwidth slices, and then not fully using that slice within the period. At which point the slice and quota expires. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The non-expiration of per-cpu slices was recently fixed by 'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")'. Prior to that it appears that this had been broken since at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That added the following conditional which resulted in slices never being expired. if (cfs_rq->runtime_expires != cfs_b->runtime_expires) { /* extend local deadline, drift is bounded above by 2 ticks */ cfs_rq->runtime_expires += TICK_NSEC; Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueueu. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu bound applications, but that they are still accurate over longer timeframes. This greatly improves performance of high-thread-count, non-cpu bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on 80 CPU machine), this commit resulted in almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest. Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition") Signed-off-by: NDave Chiluk <chiluk+linux@indeed.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NPhil Auld <pauld@redhat.com> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: John Hammond <jhammond@indeed.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kyle Anderson <kwa@yelp.com> Cc: Gabriel Munos <gmunoz@netflix.com> Cc: Peter Oskolkov <posk@posk.io> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Brendan Gregg <bgregg@netflix.com> Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 05 10月, 2019 2 次提交
-
-
由 Phil Auld 提交于
[ Upstream commit a46d14eca7b75fffe35603aa8b81df654353d80f ] Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes warning to fire in update_rq_clock. This seems to be caused by onlining a new fair sched group not using the rq lock wrappers. [] rq->clock_update_flags & RQCF_UPDATED [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 update_rq_clock+0xec/0x150 [] Call Trace: [] online_fair_sched_group+0x53/0x100 [] cpu_cgroup_css_online+0x16/0x20 [] online_css+0x1c/0x60 [] cgroup_apply_control_enable+0x231/0x3b0 [] cgroup_mkdir+0x41b/0x530 [] kernfs_iop_mkdir+0x61/0xa0 [] vfs_mkdir+0x108/0x1a0 [] do_mkdirat+0x77/0xe0 [] do_syscall_64+0x55/0x1d0 [] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Using the wrappers in online_fair_sched_group instead of the raw locking removes this warning. [ tglx: Use rq_*lock_irq() ] Signed-off-by: NPhil Auld <pauld@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20190801133749.11033-1-pauld@redhat.comSigned-off-by: NSasha Levin <sashal@kernel.org>
-
由 Vincent Guittot 提交于
[ Upstream commit f6cad8df6b30a5d2bbbd2e698f74b4cafb9fb82b ] The load_balance() has a dedicated mecanism to detect when an imbalance is due to CPU affinity and must be handled at parent level. In this case, the imbalance field of the parent's sched_group is set. The description of sg_imbalanced() gives a typical example of two groups of 4 CPUs each and 4 tasks each with a cpumask covering 1 CPU of the first group and 3 CPUs of the second group. Something like: { 0 1 2 3 } { 4 5 6 7 } * * * * But the load_balance fails to fix this UC on my octo cores system made of 2 clusters of quad cores. Whereas the load_balance is able to detect that the imbalanced is due to CPU affinity, it fails to fix it because the imbalance field is cleared before letting parent level a chance to run. In fact, when the imbalance is detected, the load_balance reruns without the CPU with pinned tasks. But there is no other running tasks in the situation described above and everything looks balanced this time so the imbalance field is immediately cleared. The imbalance field should not be cleared if there is no other task to move when the imbalance is detected. Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/1561996022-28829-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 16 9月, 2019 1 次提交
-
-
由 Liangyan 提交于
commit 5e2d2cc2588bd3307ce3937acbc2ed03c830a861 upstream. do_sched_cfs_period_timer() will refill cfs_b runtime and call distribute_cfs_runtime to unthrottle cfs_rq, sometimes cfs_b->runtime will allocate all quota to one cfs_rq incorrectly, then other cfs_rqs attached to this cfs_b can't get runtime and will be throttled. We find that one throttled cfs_rq has non-negative cfs_rq->runtime_remaining and cause an unexpetced cast from s64 to u64 in snippet: distribute_cfs_runtime() { runtime = -cfs_rq->runtime_remaining + 1; } The runtime here will change to a large number and consume all cfs_b->runtime in this cfs_b period. According to Ben Segall, the throttled cfs_rq can have account_cfs_rq_runtime called on it because it is throttled before idle_balance, and the idle_balance calls update_rq_clock to add time that is accounted to the task. This commit prevents cfs_rq to be assgined new runtime if it has been throttled until that distribute_cfs_runtime is called. Signed-off-by: NLiangyan <liangyan.peng@linux.alibaba.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NValentin Schneider <valentin.schneider@arm.com> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: shanpeic@linux.alibaba.com Cc: stable@vger.kernel.org Cc: xlpang@linux.alibaba.com Fixes: d3d9dc33 ("sched: Throttle entities exceeding their allowed bandwidth") Link: https://lkml.kernel.org/r/20190826121633.6538-1-liangyan.peng@linux.alibaba.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 04 8月, 2019 2 次提交
-
-
由 Jann Horn 提交于
commit cb361d8cdef69990f6b4504dc1fd9a594d983c97 upstream. The old code used RCU annotations and accessors inconsistently for ->numa_group, which can lead to use-after-frees and NULL dereferences. Let all accesses to ->numa_group use proper RCU helpers to prevent such issues. Signed-off-by: NJann Horn <jannh@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 8c8a743c ("sched/numa: Use {cpu, pid} to create task groups for shared faults") Link: https://lkml.kernel.org/r/20190716152047.14424-3-jannh@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Jann Horn 提交于
commit 16d51a590a8ce3befb1308e0e7ab77f3b661af33 upstream. When going through execve(), zero out the NUMA fault statistics instead of freeing them. During execve, the task is reachable through procfs and the scheduler. A concurrent /proc/*/sched reader can read data from a freed ->numa_faults allocation (confirmed by KASAN) and write it back to userspace. I believe that it would also be possible for a use-after-free read to occur through a race between a NUMA fault and execve(): task_numa_fault() can lead to task_numa_compare(), which invokes task_weight() on the currently running task of a different CPU. Another way to fix this would be to make ->numa_faults RCU-managed or add extra locking, but it seems easier to wipe the NUMA fault statistics on execve. Signed-off-by: NJann Horn <jannh@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Fixes: 82727018 ("sched/numa: Call task_numa_free() from do_execve()") Link: https://lkml.kernel.org/r/20190716152047.14424-1-jannh@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 04 6月, 2019 1 次提交
-
-
由 Masahiro Yamada 提交于
commit e9666d10a5677a494260d60d1fa0b73cc7646eb3 upstream. Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label". The jump label is controlled by HAVE_JUMP_LABEL, which is defined like this: #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL) # define HAVE_JUMP_LABEL #endif We can improve this by testing 'asm goto' support in Kconfig, then make JUMP_LABEL depend on CC_HAS_ASM_GOTO. Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will match to the real kernel capability. Signed-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: NSedat Dilek <sedat.dilek@gmail.com> [nc: Fix trivial conflicts in 4.19 arch/xtensa/kernel/jump_label.c doesn't exist yet Ensured CC_HAVE_ASM_GOTO and HAVE_JUMP_LABEL were sufficiently eliminated] Signed-off-by: NNathan Chancellor <natechancellor@gmail.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 31 5月, 2019 1 次提交
-
-
由 Nicholas Piggin 提交于
[ Upstream commit 9b019acb72e4b5741d88e8936d6f200ed44b66b2 ] The NOHZ idle balancer runs on the lowest idle CPU. This can interfere with isolated CPUs, so confine it to HK_FLAG_MISC housekeeping CPUs. HK_FLAG_SCHED is not used for this because it is not set anywhere at the moment. This could be folded into HK_FLAG_SCHED once that option is fixed. The problem was observed with increased jitter on an application running on CPU0, caused by NOHZ idle load balancing being run on CPU1 (an SMT sibling). Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190412042613.28930-1-npiggin@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 02 5月, 2019 1 次提交
-
-
由 Xie XiuQi 提交于
commit a860fa7b96e1a1c974556327aa1aee852d434c21 upstream. sched_clock_cpu() may not be consistent between CPUs. If a task migrates to another CPU, then se.exec_start is set to that CPU's rq_clock_task() by update_stats_curr_start(). Specifically, the new value might be before the old value due to clock skew. So then if in numa_get_avg_runtime() the expression: 'now - p->last_task_numa_placement' ends up as -1, then the divider '*period + 1' in task_numa_placement() is 0 and things go bang. Similar to update_curr(), check if time goes backwards to avoid this. [ peterz: Wrote new changelog. ] [ mingo: Tweaked the code comment. ] Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: cj.chengjian@huawei.com Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 27 4月, 2019 1 次提交
-
-
由 Phil Auld 提交于
[ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ] With extremely short cfs_period_us setting on a parent task group with a large number of children the for loop in sched_cfs_period_timer() can run until the watchdog fires. There is no guarantee that the call to hrtimer_forward_now() will ever return 0. The large number of children can make do_sched_cfs_period_timer() take longer than the period. NMI watchdog: Watchdog detected hard LOCKUP on cpu 24 RIP: 0010:tg_nop+0x0/0x10 <IRQ> walk_tg_tree_from+0x29/0xb0 unthrottle_cfs_rq+0xe0/0x1a0 distribute_cfs_runtime+0xd3/0xf0 sched_cfs_period_timer+0xcb/0x160 ? sched_cfs_slack_timer+0xd0/0xd0 __hrtimer_run_queues+0xfb/0x270 hrtimer_interrupt+0x122/0x270 smp_apic_timer_interrupt+0x6a/0x140 apic_timer_interrupt+0xf/0x20 </IRQ> To prevent this we add protection to the loop that detects when the loop has run too many times and scales the period and quota up, proportionally, so that the timer can complete before then next period expires. This preserves the relative runtime quota while preventing the hard lockup. A warning is issued reporting this state and the new values. Signed-off-by: NPhil Auld <pauld@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <stable@vger.kernel.org> Cc: Anton Blanchard <anton@ozlabs.org> Cc: Ben Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 17 4月, 2019 1 次提交
-
-
由 Mel Gorman 提交于
commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream. A NULL pointer dereference bug was reported on a distribution kernel but the same issue should be present on mainline kernel. It occured on s390 but should not be arch-specific. A partial oops looks like: Unable to handle kernel pointer dereference in virtual kernel address space ... Call Trace: ... try_to_wake_up+0xfc/0x450 vhost_poll_wakeup+0x3a/0x50 [vhost] __wake_up_common+0xbc/0x178 __wake_up_common_lock+0x9e/0x160 __wake_up_sync_key+0x4e/0x60 sock_def_readable+0x5e/0x98 The bug hits any time between 1 hour to 3 days. The dereference occurs in update_cfs_rq_h_load when accumulating h_load. The problem is that cfq_rq->h_load_next is not protected by any locking and can be updated by parallel calls to task_h_load. Depending on the compiler, code may be generated that re-reads cfq_rq->h_load_next after the check for NULL and then oops when reading se->avg.load_avg. The dissassembly showed that it was possible to reread h_load_next after the check for NULL. While this does not appear to be an issue for later compilers, it's still an accident if the correct code is generated. Full locking in this path would have high overhead so this patch uses READ_ONCE to read h_load_next only once and check for NULL before dereferencing. It was confirmed that there were no further oops after 10 days of testing. As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any potential problems with store tearing. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NValentin Schneider <valentin.schneider@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Fixes: 68520796 ("sched: Move h_load calculation to task_h_load()") Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 2月, 2019 1 次提交
-
-
由 Josh Poimboeuf 提交于
commit b284909abad48b07d3071a9fc9b5692b3e64914b upstream. With the following commit: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS") ... the hotplug code attempted to detect when SMT was disabled by BIOS, in which case it reported SMT as permanently disabled. However, that code broke a virt hotplug scenario, where the guest is booted with only primary CPU threads, and a sibling is brought online later. The problem is that there doesn't seem to be a way to reliably distinguish between the HW "SMT disabled by BIOS" case and the virt "sibling not yet brought online" case. So the above-mentioned commit was a bit misguided, as it permanently disabled SMT for both cases, preventing future virt sibling hotplugs. Going back and reviewing the original problems which were attempted to be solved by that commit, when SMT was disabled in BIOS: 1) /sys/devices/system/cpu/smt/control showed "on" instead of "notsupported"; and 2) vmx_vm_init() was incorrectly showing the L1TF_MSG_SMT warning. I'd propose that we instead consider #1 above to not actually be a problem. Because, at least in the virt case, it's possible that SMT wasn't disabled by BIOS and a sibling thread could be brought online later. So it makes sense to just always default the smt control to "on" to allow for that possibility (assuming cpuid indicates that the CPU supports SMT). The real problem is #2, which has a simple fix: change vmx_vm_init() to query the actual current SMT state -- i.e., whether any siblings are currently online -- instead of looking at the SMT "control" sysfs value. So fix it by: a) reverting the original "fix" and its followup fix: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS") bc2d8d26 ("cpu/hotplug: Fix SMT supported evaluation") and b) changing vmx_vm_init() to query the actual current SMT state -- instead of the sysfs control value -- to determine whether the L1TF warning is needed. This also requires the 'sched_smt_present' variable to exported, instead of 'cpu_smt_control'. Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: NIgor Mammedov <imammedo@redhat.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Joe Mario <jmario@redhat.com> Cc: Jiri Kosina <jikos@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/e3a85d585da28cc333ecbc1e78ee9216e6da9396.1548794349.git.jpoimboe@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 1月, 2019 1 次提交
-
-
由 Linus Torvalds 提交于
commit c40f7d74c741a907cfaeb73a7697081881c497d0 upstream. Zhipeng Xie, Xie XiuQi and Sargun Dhillon reported lockups in the scheduler under high loads, starting at around the v4.18 time frame, and Zhipeng Xie tracked it down to bugs in the rq->leaf_cfs_rq_list manipulation. Do a (manual) revert of: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path") It turns out that the list_del_leaf_cfs_rq() introduced by this commit is a surprising property that was not considered in followup commits such as: 9c2791f9 ("sched/fair: Fix hierarchical order in rq->leaf_cfs_rq_list") As Vincent Guittot explains: "I think that there is a bigger problem with commit a9e7f654 and cfs_rq throttling: Let take the example of the following topology TG2 --> TG1 --> root: 1) The 1st time a task is enqueued, we will add TG2 cfs_rq then TG1 cfs_rq to leaf_cfs_rq_list and we are sure to do the whole branch in one path because it has never been used and can't be throttled so tmp_alone_branch will point to leaf_cfs_rq_list at the end. 2) Then TG1 is throttled 3) and we add TG3 as a new child of TG1. 4) The 1st enqueue of a task on TG3 will add TG3 cfs_rq just before TG1 cfs_rq and tmp_alone_branch will stay on rq->leaf_cfs_rq_list. With commit a9e7f654, we can del a cfs_rq from rq->leaf_cfs_rq_list. So if the load of TG1 cfs_rq becomes NULL before step 2) above, TG1 cfs_rq is removed from the list. Then at step 4), TG3 cfs_rq is added at the beginning of rq->leaf_cfs_rq_list but tmp_alone_branch still points to TG3 cfs_rq because its throttled parent can't be enqueued when the lock is released. tmp_alone_branch doesn't point to rq->leaf_cfs_rq_list whereas it should. So if TG3 cfs_rq is removed or destroyed before tmp_alone_branch points on another TG cfs_rq, the next TG cfs_rq that will be added, will be linked outside rq->leaf_cfs_rq_list - which is bad. In addition, we can break the ordering of the cfs_rq in rq->leaf_cfs_rq_list but this ordering is used to update and propagate the update from leaf down to root." Instead of trying to work through all these cases and trying to reproduce the very high loads that produced the lockup to begin with, simplify the code temporarily by reverting a9e7f654 - which change was clearly not thought through completely. This (hopefully) gives us a kernel that doesn't lock up so people can continue to enjoy their holidays without worrying about regressions. ;-) [ mingo: Wrote changelog, fixed weird spelling in code comment while at it. ] Analyzed-by: NXie XiuQi <xiexiuqi@huawei.com> Analyzed-by: NVincent Guittot <vincent.guittot@linaro.org> Reported-by: NZhipeng Xie <xiezhipeng1@huawei.com> Reported-by: NSargun Dhillon <sargun@sargun.me> Reported-by: NXie XiuQi <xiexiuqi@huawei.com> Tested-by: NZhipeng Xie <xiezhipeng1@huawei.com> Tested-by: NSargun Dhillon <sargun@sargun.me> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Acked-by: NVincent Guittot <vincent.guittot@linaro.org> Cc: <stable@vger.kernel.org> # v4.13+ Cc: Bin Li <huawei.libin@huawei.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path") Link: http://lkml.kernel.org/r/1545879866-27809-1-git-send-email-xiexiuqi@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 20 12月, 2018 1 次提交
-
-
由 Vincent Guittot 提交于
Commit 11d4afd4ff667f9b6178ee8c142c36cb78bd84db upstream. Create a config for enabling irq load tracking in the scheduler. irq load tracking is useful only when irq or paravirtual time is accounted but it's only possible with SMP for now. Also use __maybe_unused to remove the compilation warning in update_rq_clock_task() that has been introduced by: 2e62c474 ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Suggested-by: NIngo Molnar <mingo@redhat.com> Reported-by: NDou Liyang <douly.fnst@cn.fujitsu.com> Reported-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com> Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: bp@alien8.de Cc: dou_liyang@163.com Fixes: 2e62c474 ("sched/fair: Remove #ifdefs from scale_rt_capacity()") Link: http://lkml.kernel.org/r/1537867062-27285-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 01 12月, 2018 1 次提交
-
-
由 Patrick Bellasi 提交于
[ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ] A ~10% regression has been reported for UnixBench's execl throughput test by Aaron Lu and Ye Xiaolong: https://lkml.org/lkml/2018/10/30/765 That test is pretty simple, it does a "recursive" execve() syscall on the same binary. Starting from the syscall, this sequence is possible: do_execve() do_execveat_common() __do_execve_file() sched_exec() select_task_rq_fair() <==| Task already enqueued find_idlest_cpu() find_idlest_group() capacity_spare_wake() <==| Functions not called from cpu_util_wake() | the wakeup path which means we can end up calling cpu_util_wake() not only from the "wakeup path", as its name would suggest. Indeed, the task doing an execve() syscall is already enqueued on the CPU we want to get the cpu_util_wake() for. The estimated utilization for a CPU computed in cpu_util_wake() was written under the assumption that function can be called only from the wakeup path. If instead the task is already enqueued, we end up with a utilization which does not remove the current task's contribution from the estimated utilization of the CPU. This will wrongly assume a reduced spare capacity on the current CPU and increase the chances to migrate the task on execve. The regression is tracked down to: commit d519329f ("sched/fair: Update util_est only on util_avg updates") because in that patch we turn on by default the UTIL_EST sched feature. However, the real issue is introduced by: commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths") Let's fix this by ensuring to always discount the task estimated utilization from the CPU's estimated utilization when the task is also the current one. The same benchmark of the bug report, executed on a dual socket 40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine, reports these "Execl Throughput" figures (higher the better): mainline : 48136.5 lps mainline+fix : 55376.5 lps which correspond to a 15% speedup. Moreover, since {cpu_util,capacity_spare}_wake() are not really only used from the wakeup path, let's remove this ambiguity by using a better matching name: {cpu_util,capacity_spare}_without(). Since we are at that, let's also improve the existing documentation. Reported-by: NAaron Lu <aaron.lu@intel.com> Reported-by: NYe Xiaolong <xiaolong.ye@intel.com> Tested-by: NAaron Lu <aaron.lu@intel.com> Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Steve Muckle <smuckle@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@google.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths) Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
- 16 10月, 2018 1 次提交
-
-
由 Song Muchun 提交于
The comment and the code around the update_min_vruntime() call in dequeue_entity() are not in agreement. From commit: b60205c7 ("sched/fair: Fix min_vruntime tracking") I think that we want to update min_vruntime when a task is sleeping/migrating. So, the check is inverted there - fix it. Signed-off-by: NSong Muchun <smuchun@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b60205c7 ("sched/fair: Fix min_vruntime tracking") Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 11 10月, 2018 1 次提交
-
-
由 Phil Auld 提交于
With a very low cpu.cfs_quota_us setting, such as the minimum of 1000, distribute_cfs_runtime may not empty the throttled_list before it runs out of runtime to distribute. In that case, due to the change from c06f04c7 to put throttled entries at the head of the list, later entries on the list will starve. Essentially, the same X processes will get pulled off the list, given CPU time and then, when expired, get put back on the head of the list where distribute_cfs_runtime will give runtime to the same set of processes leaving the rest. Fix the issue by setting a bit in struct cfs_bandwidth when distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can decide to put the throttled entry on the tail or the head of the list. The bit is set/cleared by the callers of distribute_cfs_runtime while they hold cfs_bandwidth->lock. This is easy to reproduce with a handful of CPU consumers. I use 'crash' on the live system. In some cases you can simply look at the throttled list and see the later entries are not changing: crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -976050 2 ffff90b56cb2cc00 -484925 3 ffff90b56cb2bc00 -658814 4 ffff90b56cb2ba00 -275365 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3 1 ffff90b56cb2d200 -994147 2 ffff90b56cb2cc00 -306051 3 ffff90b56cb2bc00 -961321 4 ffff90b56cb2ba00 -24490 5 ffff90b166a45600 -135138 6 ffff90b56cb2da00 -282505 7 ffff90b56cb2e000 -148065 8 ffff90b56cb2fa00 -872591 9 ffff90b56cb2c000 -84687 10 ffff90b56cb2f000 -87237 11 ffff90b166a40a00 -164582 Sometimes it is easier to see by finding a process getting starved and looking at the sched_info: crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, crash> task ffff8eb765994500 sched_info PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest" sched_info = { pcount = 8, run_delay = 697094208, last_arrival = 240260125039, last_queued = 240260327513 }, Signed-off-by: NPhil Auld <pauld@redhat.com> Reviewed-by: NBen Segall <bsegall@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop") Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csbSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 10月, 2018 2 次提交
-
-
由 Mel Gorman 提交于
Automatic NUMA Balancing uses a multi-stage pass to decide whether a page should migrate to a local node. This filter avoids excessive ping-ponging if a page is shared or used by threads that migrate cross-node frequently. Threads inherit both page tables and the preferred node ID from the parent. This means that threads can trigger hinting faults earlier than a new task which delays scanning for a number of seconds. As it can be load balanced very early in its lifetime there can be an unnecessary delay before it starts migrating thread-local data. This patch migrates private pages faster early in the lifetime of a thread using the sequence counter as an identifier of new tasks. With this patch applied, STREAM performance is the same as 4.17 even though processes are not spread cross-node prematurely. Other workloads showed a mix of minor gains and losses. This is somewhat expected most workloads are not very sensitive to the starting conditions of a process. 4.19.0-rc5 4.19.0-rc5 4.17.0 numab-v1r1 fastmigrate-v1r1 vanilla MB/sec copy 43298.52 ( 0.00%) 47335.46 ( 9.32%) 47219.24 ( 9.06%) MB/sec scale 30115.06 ( 0.00%) 32568.12 ( 8.15%) 32527.56 ( 8.01%) MB/sec add 32825.12 ( 0.00%) 36078.94 ( 9.91%) 35928.02 ( 9.45%) MB/sec triad 32549.52 ( 0.00%) 35935.94 ( 10.40%) 35969.88 ( 10.51%) Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Reviewed-by: NRik van Riel <riel@surriel.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Linux-MM <linux-mm@kvack.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Srikar Dronamraju 提交于
If NUMA improvement from the task migration is going to be very minimal, then avoid task migration. Specjbb2005 results (8 warehouses) Higher bops are better 2 Socket - 2 Node Haswell - X86 JVMS Prev Current %Change 4 198512 205910 3.72673 1 313559 318491 1.57291 2 Socket - 4 Node Power8 - PowerNV JVMS Prev Current %Change 8 74761.9 74935.9 0.232739 1 214874 226796 5.54837 2 Socket - 2 Node Power9 - PowerNV JVMS Prev Current %Change 4 180536 189780 5.12031 1 210281 205695 -2.18089 4 Socket - 4 Node Power7 - PowerVM JVMS Prev Current %Change 8 56511.4 60370 6.828 1 104899 108100 3.05151 1/7 cases is regressing, if we look at events migrate_pages seem to vary the most especially in the regressing case. Also some amount of variance is expected between different runs of Specjbb2005. Some events stats before and after applying the patch. perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 13,818,546 13,801,554 migrations 1,149,960 1,151,541 faults 385,583 433,246 cache-misses 55,259,546,768 55,168,691,835 sched:sched_move_numa 2,257 2,551 sched:sched_stick_numa 9 24 sched:sched_swap_numa 512 904 migrate:mm_migrate_pages 2,225 1,571 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 72692 113682 numa_hint_faults_local 62270 102163 numa_hit 238762 240181 numa_huge_pte_updates 48 36 numa_interleave 75 64 numa_local 238676 240103 numa_other 86 78 numa_pages_migrated 2225 1564 numa_pte_updates 98557 134080 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After cs 3,173,490 3,079,150 migrations 36,966 31,455 faults 108,776 99,081 cache-misses 12,200,075,320 11,588,126,740 sched:sched_move_numa 1,264 1 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 0 migrate:mm_migrate_pages 899 36 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Haswell - X86 Event Before After numa_hint_faults 21109 430 numa_hint_faults_local 17120 77 numa_hit 72934 71277 numa_huge_pte_updates 42 0 numa_interleave 33 22 numa_local 72866 71218 numa_other 68 59 numa_pages_migrated 915 23 numa_pte_updates 42326 0 perf stats 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 8,312,022 8,707,565 migrations 231,705 171,342 faults 310,242 310,820 cache-misses 402,324,573 136,115,400 sched:sched_move_numa 193 215 sched:sched_stick_numa 0 6 sched:sched_swap_numa 3 24 migrate:mm_migrate_pages 93 162 vmstat 8th warehouse Multi JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 11838 8985 numa_hint_faults_local 11216 8154 numa_hit 90689 93819 numa_huge_pte_updates 0 0 numa_interleave 1579 882 numa_local 89634 93496 numa_other 1055 323 numa_pages_migrated 92 169 numa_pte_updates 12109 9217 perf stats 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After cs 2,170,481 2,152,072 migrations 10,126 10,704 faults 160,962 164,376 cache-misses 10,834,845 3,818,437 sched:sched_move_numa 10 16 sched:sched_stick_numa 0 0 sched:sched_swap_numa 0 7 migrate:mm_migrate_pages 2 199 vmstat 8th warehouse Single JVM 2 Socket - 2 Node Power9 - PowerNV Event Before After numa_hint_faults 403 2248 numa_hint_faults_local 358 1666 numa_hit 25898 25704 numa_huge_pte_updates 0 0 numa_interleave 207 200 numa_local 25860 25679 numa_other 38 25 numa_pages_migrated 2 197 numa_pte_updates 400 2234 perf stats 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 110,339,633 93,330,595 migrations 4,139,812 4,122,061 faults 863,622 865,979 cache-misses 231,838,045,660 225,395,083,479 sched:sched_move_numa 2,196 2,372 sched:sched_stick_numa 33 24 sched:sched_swap_numa 544 769 migrate:mm_migrate_pages 2,469 1,677 vmstat 8th warehouse Multi JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 85748 91638 numa_hint_faults_local 66831 78096 numa_hit 242213 242225 numa_huge_pte_updates 0 0 numa_interleave 0 2 numa_local 242211 242219 numa_other 2 6 numa_pages_migrated 2376 1515 numa_pte_updates 86233 92274 perf stats 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After cs 59,331,057 51,487,271 migrations 552,019 537,170 faults 266,586 256,921 cache-misses 73,796,312,990 70,073,831,187 sched:sched_move_numa 981 576 sched:sched_stick_numa 54 24 sched:sched_swap_numa 286 327 migrate:mm_migrate_pages 713 726 vmstat 8th warehouse Single JVM 4 Socket - 4 Node Power7 - PowerVM Event Before After numa_hint_faults 14807 12000 numa_hint_faults_local 5738 5024 numa_hit 36230 36470 numa_huge_pte_updates 0 0 numa_interleave 0 0 numa_local 36228 36465 numa_other 2 5 numa_pages_migrated 703 726 numa_pte_updates 14742 11930 Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Jirka Hladky <jhladky@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-