- 02 9月, 2020 15 次提交
-
-
由 Thomas Gleixner 提交于
task #29600094 commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228 upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support. In preparation for an interrupt injection interface which can be used safely by error injection mechanisms. e.g. PCI-E/ AER, add a return value to check_irq_resend() so errors can be propagated to the caller. Split out the software resend code so the ugly #ifdef in check_irq_resend() goes away and the whole thing becomes readable. Fix up the caller in debugfs. The caller in irq_startup() does not care about the return value as this is unconditionally invoked for all interrupts and the resend is best effort anyway. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NMarc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.775200917@linutronix.de (cherry picked from commit 1f85b1f5e1f5541272abedc19ba7b6c5b564c228) Signed-off-by: NEthan Zhao <haifeng.zhao@intel.com> Signed-off-by: NArtie Ding <artie.ding@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Thomas Gleixner 提交于
task #29600094 commit c16816acd08697b02a53f56f8936497a9f6f6e7a upstream. Backport summary: for 4.19 kernel ICX PCIe Gen4 support. In general calling generic_handle_irq() with interrupts disabled from non interrupt context is harmless. For some interrupt controllers like the x86 trainwrecks this is outright dangerous as it might corrupt state if an interrupt affinity change is pending. Add infrastructure which allows to mark interrupts as unsafe and catch such usage in generic_handle_irq(). Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NMarc Zyngier <maz@kernel.org> Link: https://lkml.kernel.org/r/20200306130623.590923677@linutronix.de (cherry picked from commit c16816acd08697b02a53f56f8936497a9f6f6e7a) Signed-off-by: NEthan Zhao <haifeng.zhao@intel.com> Conflicts: include/linux/irq.h Signed-off-by: NArtie Ding <artie.ding@linux.alibaba.com> Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
-
由 Masahiro Yamada 提交于
task #29499913 commit d198b34f3855eee2571dda03eea75a09c7c31480 upstream Add SPDX License Identifier to all .gitignore files. erwei: Because the locations of the .gitignore files from the upstream are different from our kernel, I add the context "# SPDX-License-Identifier: GPL-2.0-only" to our onw .gitignore files. Signed-off-by: NMasahiro Yamada <masahiroy@kernel.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NErwei Deng <erwei@linux.alibaba.com> Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
-
由 Xu Yu 提交于
fix #29564148 Global swapout can happen even with swappiness set to 0. In some scenario, we do want OOM instead of unexpected global swapout. This extends value range of vm_swappiness, and disables swapout completely by setting swappiness to -1. Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Shile Zhang 提交于
fix #29056122 commit 'fbb2f06e' ("pvpanic: add crash loaded event") introduce new pvpanic event 'PVPANIC_CRASH_LOADED', it make the qemu on host can get info that if the guest already handle the panic by kdump or not. But if the guest enabled the kdump, it will not post the panic event by default unless the parameter 'crash_kexec_post_notifiers' is given. So, its better to set the default value of this parameter to true, to avoid it missed in case of kdump enabled. If user want disable the event notification, the parameter 'crash_kexec_post_notifiers=N' should be given. Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com> Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 upstream A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts, and have them appropriately affinitized. Add support for defining a number of sets in the irq_affinity structure, of varying sizes, and get each set affinitized correctly across the machine. [ tglx: Minor changelog tweaks ] Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Cc: linux-block@vger.kernel.org Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 060746d9e394084b7401e7532f2de528ecbfb521 upstream No functional change. Prepares for support of allocating and affinitizing sets of interrupts, in which each set of interrupts needs a full two stage spreading. The first vector argument is necessary for this so the affinitizing starts from the first vector of each set. [ tglx: Minor changelog tweaks ] Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Link: https://lkml.kernel.org/r/20181102145951.31979-4-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 5c903e108d0b005cf59904ca3520934fca4b9439 upstream No functional change. Prepares for supporting allocating and affinitizing interrupt sets. [ tglx: Minor changelog tweaks ] Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Link: https://lkml.kernel.org/r/20181102145951.31979-3-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 /proc/loadavg can reflex the waiting tasks over a period of time to some extent. But to become a SLI requires better precision and quicker response. Furthermore, I/O block is not concerned here, and bandwidth control is excluded from cpu_stress. This patch adds a new interface /proc/cpu_stress. It's based on task runtime tracking so we don't need to deal with complex state transition. And because task runtime tracking is done in most scheduler events, the precision is quite enough. Like loadavg, cpu_stress has 3 average windows too (1,5,15 min) Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Mel Gorman 提交于
to #28825456 commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream. An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-
由 Julien Thierry 提交于
task #25552995 commit 17ce302f3117e9518395847a3120c8a108b587b8 upstream. In the presence of any form of instrumentation, nmi_enter() should be done before calling any traceable code and any instrumentation code. Currently, nmi_enter() is done in handle_domain_nmi(), which is much too late as instrumentation code might get called before. Move the nmi_enter/exit() calls to the arch IRQ vector handler. On arm64, it is not possible to know if the IRQ vector handler was called because of an NMI before acknowledging the interrupt. However, It is possible to know whether normal interrupts could be taken in the interrupted context (i.e. if taking an NMI in that context could introduce a potential race condition). When interrupting a context with IRQs disabled, call nmi_enter() as soon as possible. In contexts with IRQs enabled, defer this to the interrupt controller, which is in a better position to know if an interrupt taken is an NMI. Fixes: bc3c03ccb464 ("arm64: Enable the support of pseudo-NMIs") Cc: <stable@vger.kernel.org> # 5.1.x- Cc: Will Deacon <will.deacon@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jason Cooper <jason@lakedaemon.net> Cc: Mark Rutland <mark.rutland@arm.com> Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
由 Julien Thierry 提交于
task #25552995 commit 6e4933a006616343f66c4702dc4fc56bb25e7b02 upstream NMI handling code should be executed between calls to nmi_enter and nmi_exit. Add a separate domain handler to properly setup NMI context when handling an interrupt requested as NMI. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Acked-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
由 Julien Thierry 提交于
task #25552995 commit 2dcf1fbcad352baaa5f47b17e57c5743c8eedbad upstream Provide flow handlers that are NMI safe for interrupts and percpu_devid interrupts. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Acked-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Marc Zyngier <marc.zyngier@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
由 Julien Thierry 提交于
task #25552995 commit 4b078c3f1a26487c39363089ba0d5c6b09f2a89f upstream Add support for percpu_devid interrupts treated as NMIs. Percpu_devid NMIs need to be setup/torn down on each CPU they target. The same restrictions as for global NMIs still apply for percpu_devid NMIs. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com> [ Shile: fixed conflicts in include/linux/interrupt.h ] Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
-
由 Julien Thierry 提交于
task #25552995 commit b525903c254dab2491410f0f23707691b7c2c317 upstream Add functionality to allocate interrupt lines that will deliver IRQs as Non-Maskable Interrupts. These allocations are only successful if the irqchip provides the necessary support and allows NMI delivery for the interrupt line. Interrupt lines allocated for NMI delivery must be enabled/disabled through enable_nmi/disable_nmi_nosync to keep their state consistent. To treat a PERCPU IRQ as NMI, the interrupt must not be shared nor threaded, the irqchip directly managing the IRQ must be the root irqchip and the irqchip cannot be behind a slow bus. Signed-off-by: NJulien Thierry <julien.thierry@arm.com> Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NZou Cao <zoucao@linux.alibaba.com> Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
-
- 29 6月, 2020 4 次提交
-
-
由 Vincent Guittot 提交于
to #28739709 commit bef69dd87828ef5d8ecdab8d857cd3a33cf98675 upstream update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays, which might be inefficient when the cpufreq driver has rate limitation. When a task is attached on a CPU, we have this call path: update_load_avg() update_cfs_rq_load_avg() cfs_rq_util_change -- > trig frequency update attach_entity_load_avg() cfs_rq_util_change -- > trig frequency update The 1st frequency update will not take into account the utilization of the newly attached task and the 2nd one might be discarded because of rate limitation of the cpufreq driver. update_cfs_rq_load_avg() is only called by update_blocked_averages() and update_load_avg() so we can move the call to cfs_rq_util_change/cpufreq_update_util() into these two functions. It's also interesting to note that update_load_avg() already calls cfs_rq_util_change() directly for the !SMP case. This change will also ensure that cpufreq_update_util() is called even when there is no more CFS rq in the leaf_cfs_rq_list to update, but only IRQ, RT or DL PELT signals. [ mingo: Minor updates. ] Reported-by: NDoug Smythies <dsmythies@telus.net> Tested-by: NDoug Smythies <dsmythies@telus.net> Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: juri.lelli@redhat.com Cc: linux-pm@vger.kernel.org Cc: mgorman@suse.de Cc: rostedt@goodmis.org Cc: sargun@sargun.me Cc: srinivas.pandruvada@linux.intel.com Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Fixes: 039ae8b ("sched/fair: Fix O(nr_cgroups) in the load balancing path") Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Vincent Guittot 提交于
to #28739709 commit 039ae8bcf7a5f4476f4487e6bf816885fb3fb617 upstream This re-applies the commit reverted here: commit c40f7d74c741 ("sched/fair: Fix infinite loop in update_blocked_averages() by reverting a9e7f654") I.e. now that cfs_rq can be safely removed/added in the list, we can re-apply: commit a9e7f654 ("sched/fair: Fix O(nr_cgroups) in load balance path") Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-3-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Vincent Guittot 提交于
to #28739709 commit 31bc6aeaab1d1de8959b67edbed5c7a4b3cdbe7c upstream Removing a cfs_rq from rq->leaf_cfs_rq_list can break the parent/child ordering of the list when it will be added back. In order to remove an empty and fully decayed cfs_rq, we must remove its children too, so they will be added back in the right order next time. With a normal decay of PELT, a parent will be empty and fully decayed if all children are empty and fully decayed too. In such a case, we just have to ensure that the whole branch will be added when a new task is enqueued. This is default behavior since : commit f6783319737f ("sched/fair: Fix insertion in rq->leaf_cfs_rq_list") In case of throttling, the PELT of throttled cfs_rq will not be updated whereas the parent will. This breaks the assumption made above unless we remove the children of a cfs_rq that is throttled. Then, they will be added back when unthrottled and a sched_entity will be enqueued. As throttled cfs_rq are now removed from the list, we can remove the associated test in update_blocked_averages(). Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: sargun@sargun.me Cc: tj@kernel.org Cc: xiexiuqi@huawei.com Cc: xiezhipeng1@huawei.com Link: https://lkml.kernel.org/r/1549469662-13614-2-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #29028845 cpuacct_update_latency's declaration was changed since 6dbaddaa, but was not changed for the case when CONFIG_SCHED_SLI=n. This leads to a compilation error. Fixes: 6dbaddaa ("alinux: sched: Add cgroup's scheduling latency histograms") Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 24 6月, 2020 14 次提交
-
-
由 Dietmar Eggemann 提交于
to #28739709 commit af75d1a9a9f75bf030c2f35705f1ff6d226f96fe upstream Since sg_lb_stats::sum_weighted_load is now identical with sg_lb_stats::group_load remove it and replace its use case (calculating load per task) with the latter. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Acked-by: NVincent Guittot <vincent.guittot@linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Link: https://lkml.kernel.org/r/20190527062116.11512-7-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit 0e1fef63d92d61ed561e504c3a078a827a0f9bfe upstream The sched domain per rq load index files also disappear from the /proc/sys/kernel/sched_domain/cpuX/domainY directories. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit 55627e3cd22c315c4a02fe3bbbb7234ec439cb1d upstream The per rq load array values also disappear from the cpu#X sections in /proc/sched_debug. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit 3d8d53554405952993bb0279ef3ebebc51740074 upstream This reverts: commit 201c373e ("sched/debug: Limit sd->*_idx range on sysctl") Load indexes (sd->*_idx) are no longer needed without rq->cpu_load[]. The range check for load indexes can be removed as well. Get rid of it before the rq->cpu_load[] since it uses CPU_LOAD_IDX_MAX. At the same time, fix the following coding style issues detected by scripts/checkpatch.pl: ERROR: space prohibited before that ',' ERROR: space prohibited before that close parenthesis ')' Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-4-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit 1c1b8a7b03ef50f80f5d0c871ee261c04a6c967e upstream With LB_BIAS disabled, source_load() & target_load() return weighted_cpuload(). Replace both with calls to weighted_cpuload(). The function to obtain the load index (sd->*_idx) for an sd, get_sd_load_idx(), can be removed as well. Finally, get rid of the sched feature LB_BIAS. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-3-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit 5e83eafbfd3b351537c0d74467fc43e8a88f4ae4 upstream With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx] any more. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@surriel.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Quentin Perret <quentin.perret@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Valentin Schneider <valentin.schneider@arm.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit f2bedc4705659216bd60948029ad8dfedf923ad9 upstream The CFS class is the only one maintaining and using the CPU wide load (rq->load(.weight)). The last use case of the CPU wide load in CFS's set_next_entity() can be replaced by using the load of the CFS class (rq->cfs.load(.weight)) instead. Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Daniel Lezcano 提交于
to #28739709 commit a7fe5190c03f8137ef08db84a58dd4daf2c4785d upstream The function get_loadavg() returns almost always zero. To be more precise, statistically speaking for a total of 1023379 times passing in the function, the load is equal to zero 1020728 times, greater than 100, 610 times, the remaining is between 0 and 5. In 2011, the get_loadavg() was removed from the Android tree because of the above [1]. At this time, the load was: unsigned long this_cpu_load(void) { struct rq *this = this_rq(); return this->cpu_load[0]; } In 2014, the code was changed by commit 372ba8cb (cpuidle: menu: Lookup CPU runqueues less) and the load is: void get_iowait_load(unsigned long *nr_waiters, unsigned long *load) { struct rq *rq = this_rq(); *nr_waiters = atomic_read(&rq->nr_iowait); *load = rq->load.weight; } with the same result. Both measurements show using the load in this code path does no matter anymore. Removing it. [1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Dietmar Eggemann 提交于
to #28739709 commit fdf5f315d5cfaefb7bb8a62ec4bf37b9891837aa upstream LB_BIAS allows the adjustment on how conservative load should be balanced. The rq->cpu_load[idx] array is used for this functionality. It contains weighted CPU load decayed average values over different intervals (idx = 1..4). Idx = 0 is the weighted CPU load itself. The values are updated during scheduler_tick, before idle balance and at nohz exit. There are 5 different types of idx's per sched domain (sd). Each of them is used to index into the rq->cpu_load[idx] array in a specific scenario (busy, idle and newidle for load balancing, forkexec for wake-up slow-path load balancing and wake for affine wakeup based on weight). Only the sd idx's for busy and idle load balancing are set to 2,3 or 1,2 respectively. All the other sd idx's are set to 0. Conservative load balancing is achieved for sd idx's >= 1 by using the min/max (source_load()/target_load()) value between the current weighted CPU load and the rq->cpu_load[sd idx -1] for the busiest(idlest)/local CPU load in load balancing or vice versa in the wake-up slow-path load balancing. There is no conservative balancing for sd idx = 0 since only current weighted CPU load is used in this case. It is very likely that LB_BIAS' influence on load balancing can be neglected (see test results below). This is further supported by: (1) Weighted CPU load today is by itself a decayed average value (PELT) (cfs_rq->avg->runnable_load_avg) and not the instantaneous load (rq->load.weight) it was when LB_BIAS was introduced. (2) Sd imbalance_pct is used for CPU_NEWLY_IDLE and CPU_NOT_IDLE (relate to sd's newidle and busy idx) in find_busiest_group() when comparing busiest and local avg load to make load balancing even more conservative. (3) The sd forkexec and newidle idx are always set to 0 so there is no adjustment on how conservatively load balancing is done here. (4) Affine wakeup based on weight (wake_affine_weight()) will not be impacted since the sd wake idx is always set to 0. Let's disable LB_BIAS by default for a few kernel releases to make sure that no workload and no scheduler topology is affected. The benefit of being able to remove the LB_BIAS dependency from source_load() and target_load() is that the entire rq->cpu_load[idx] code could be removed in this case. It is really hard to say if there is no regression w/o testing this with a lot of different workloads on a lot of different platforms, especially NUMA machines. The following 104 LKP (Linux Kernel Performance) tests were run by the 0-Day guys mostly on multi-socket hosts with a larger number of logical cpus (88, 192). The base for the test was commit b3dae109 ("sched/swait: Rename to exclusive") (tip/sched/core v4.18-rc1). Only 2 out of the 104 tests had a significant change in one of the metrics (fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance +7% files_per_sec, unixbench/300s-100%-syscall-performance -11% score). Tests which showed a change in one of the metrics are marked with a '*' and this change is listed as well. (a) lkp-bdw-ep3: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 64G dd-write/10m-1HDD-cfq-btrfs-100dd-performance fsmark/1x-1t-1HDD-xfs-nfsv4-4M-60G-NoSync-performance * fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-NoSync-performance 7.50 7% 8.00 ± 6% fsmark.files_per_sec fsmark/1x-1t-1HDD-btrfs-nfsv4-4M-60G-fsyncBeforeClose-performance fsmark/1x-1t-1HDD-btrfs-4M-60G-NoSync-performance fsmark/1x-1t-1HDD-btrfs-4M-60G-fsyncBeforeClose-performance kbuild/300s-50%-vmlinux_prereq-performance kbuild/300s-200%-vmlinux_prereq-performance kbuild/300s-50%-vmlinux_prereq-performance-1HDD-ext4 kbuild/300s-200%-vmlinux_prereq-performance-1HDD-ext4 (b) lkp-skl-4sp1: 192 threads Intel(R) Xeon(R) Platinum 8160 768G dbench/100%-performance ebizzy/200%-100x-10s-performance hackbench/1600%-process-pipe-performance iperf/300s-cs-localhost-tcp-performance iperf/300s-cs-localhost-udp-performance perf-bench-numa-mem/2t-300M-performance perf-bench-sched-pipe/10000000ops-process-performance perf-bench-sched-pipe/10000000ops-threads-performance schbench/2-16-300-30000-30000-performance tbench/100%-cs-localhost-performance (c) lkp-bdw-ep6: 88 threads Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz 128G stress-ng/100%-60s-pipe-performance unixbench/300s-1-whetstone-double-performance unixbench/300s-1-shell1-performance unixbench/300s-1-shell8-performance unixbench/300s-1-pipe-performance * unixbench/300s-1-context1-performance 312 315 unixbench.score unixbench/300s-1-spawn-performance unixbench/300s-1-syscall-performance unixbench/300s-1-dhry2reg-performance unixbench/300s-1-fstime-performance unixbench/300s-1-fsbuffer-performance unixbench/300s-1-fsdisk-performance unixbench/300s-100%-whetstone-double-performance unixbench/300s-100%-shell1-performance unixbench/300s-100%-shell8-performance unixbench/300s-100%-pipe-performance unixbench/300s-100%-context1-performance unixbench/300s-100%-spawn-performance * unixbench/300s-100%-syscall-performance 3571 ± 3% -11% 3183 ± 4% unixbench.score unixbench/300s-100%-dhry2reg-performance unixbench/300s-100%-fstime-performance unixbench/300s-100%-fsbuffer-performance unixbench/300s-100%-fsdisk-performance unixbench/300s-1-execl-performance unixbench/300s-100%-execl-performance * will-it-scale/brk1-performance 365004 360387 will-it-scale.per_thread_ops * will-it-scale/dup1-performance 432401 437596 will-it-scale.per_thread_ops will-it-scale/eventfd1-performance will-it-scale/futex1-performance will-it-scale/futex2-performance will-it-scale/futex3-performance will-it-scale/futex4-performance will-it-scale/getppid1-performance will-it-scale/lock1-performance will-it-scale/lseek1-performance will-it-scale/lseek2-performance * will-it-scale/malloc1-performance 47025 45817 will-it-scale.per_thread_ops 77499 76529 will-it-scale.per_process_ops will-it-scale/malloc2-performance * will-it-scale/mmap1-performance 123399 120815 will-it-scale.per_thread_ops 152219 149833 will-it-scale.per_process_ops * will-it-scale/mmap2-performance 107327 104714 will-it-scale.per_thread_ops 136405 133765 will-it-scale.per_process_ops will-it-scale/open1-performance * will-it-scale/open2-performance 171570 168805 will-it-scale.per_thread_ops 532644 526202 will-it-scale.per_process_ops will-it-scale/page_fault1-performance will-it-scale/page_fault2-performance will-it-scale/page_fault3-performance will-it-scale/pipe1-performance will-it-scale/poll1-performance * will-it-scale/poll2-performance 176134 172848 will-it-scale.per_thread_ops 281361 275053 will-it-scale.per_process_ops will-it-scale/posix_semaphore1-performance will-it-scale/pread1-performance will-it-scale/pread2-performance will-it-scale/pread3-performance will-it-scale/pthread_mutex1-performance will-it-scale/pthread_mutex2-performance will-it-scale/pwrite1-performance will-it-scale/pwrite2-performance will-it-scale/pwrite3-performance * will-it-scale/read1-performance 1190563 1174833 will-it-scale.per_thread_ops * will-it-scale/read2-performance 1105369 1080427 will-it-scale.per_thread_ops will-it-scale/readseek1-performance * will-it-scale/readseek2-performance 261818 259040 will-it-scale.per_thread_ops will-it-scale/readseek3-performance * will-it-scale/sched_yield-performance 2408059 2382034 will-it-scale.per_thread_ops will-it-scale/signal1-performance will-it-scale/unix1-performance will-it-scale/unlink1-performance will-it-scale/unlink2-performance * will-it-scale/write1-performance 976701 961588 will-it-scale.per_thread_ops * will-it-scale/writeseek1-performance 831898 822448 will-it-scale.per_thread_ops * will-it-scale/writeseek2-performance 228248 225065 will-it-scale.per_thread_ops * will-it-scale/writeseek3-performance 226670 224058 will-it-scale.per_thread_ops will-it-scale/context_switch1-performance aim7/performance-fork_test-2000 * aim7/performance-brk_test-3000 74869 76676 aim7.jobs-per-min aim7/performance-disk_cp-3000 aim7/performance-disk_rd-3000 aim7/performance-sieve-3000 aim7/performance-page_test-3000 aim7/performance-creat-clo-3000 aim7/performance-mem_rtns_1-8000 aim7/performance-disk_wrt-8000 aim7/performance-pipe_cpy-8000 aim7/performance-ram_copy-8000 (d) lkp-avoton3: 8 threads Intel(R) Atom(TM) CPU C2750 @ 2.40GHz 16G netperf/ipv4-900s-200%-cs-localhost-TCP_STREAM-performance Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Li Zhijian <zhijianx.li@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180809135753.21077-1-dietmar.eggemann@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Many samples are between 10ms-50ms. To display more informative distribution of latency, divide 10ms-50ms into 5 parts uniformly. Example: $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency 0-1ms: 59726433 1-4ms: 167 4-7ms: 0 7-10ms: 0 10-20ms: 5 20-30ms: 0 30-40ms: 3 40-50ms: 0 50-100ms: 0 100-500ms: 0 500-1000ms: 0 1000-5000ms: 0 5000-10000ms: 0 >=10000ms: 0 total(ms): 45554 nr: 59726600 Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Sometimes histogram is not precise enough because each sample is roughly accounted into a histogram bar. And average latency is more pratical for some users. This patch adds a "nr" field in 4 latency histogram interfaces, so lat(avg) = total(ms) / nr And compared to histogram, average latency is better to be used as a SLI because of simplicity. Example $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency 0-1ms: 4139 1-4ms: 317 4-7ms: 568 7-10ms: 0 10-100ms: 42324 100-500ms: 9131 500-1000ms: 95 1000-5000ms: 134 5000-10000ms: 0 >=10000ms: 0 total(ms): 4256455 nr: 182128 Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 This patch adds cpuacct.cgroup_wait_latency interface. It exports the histogram of the sched entity's schedule latency. Unlike wait_latency, the sched entity is a cgroup rather than task. This is useful when tasks are not directly clustered under one cgroup. For examples: cgroup1 --- cgroupA --- task1 --- cgroupB --- task2 cgroup2 --- cgroupC --- task3 --- cgroupD --- task4 This is a common cgroup hierarchy used by many applications. With cgroup_wait_latency, we can just read from cgroup1 to know aggregated wait latency information of task1 and task2. The interface output format is identical to cpuacct.wait_latency. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 This patch measures time that tasks in cpuacct cgroup blocks. There are two types: blocked due to IO, and others like locks. And they are exported in"cpuacct.ioblock_latency" and "cpuacct.block_latency" respectively. According to histogram, we know the detailed distribution of the duration. And according to total(ms), we know the percentage of time tasks spent off rq, waiting for resources: (△ioblock_latency.total(ms) + △block_latency.total(ms)) / △wall_time The interface output format is identical to cpuacct.wait_latency. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Export wait_latency in "cpuacct.wait_latency", which indicates the time that tasks in a cpuacct cgroup wait on a cfs_rq to be scheduled. This is like "perf sched", but it gives smaller overhead. So it can be used as monitor constantly. wait_latency is useful to debug application's high RT problem. It can tell if it's caused by scheduling or not. If it is, loadavg can tell if it's caused by bad scheduling bahaviour or system overloads. System admins can also use wait_latency to define SLA. To ensure SLA is guaranteed, there are various ways to decrease wait_latency. This feature is disabled by default for performance concerns. It can be switched on dynamically by "echo 0 > /proc/cpusli/sched_lat_enable" Example: $ cat /sys/fs/cgroup/cpuacct/a/cpuacct.wait_latency 0-1ms: 4139 1-4ms: 317 4-7ms: 568 7-10ms: 0 10-100ms: 42324 100-500ms: 9131 500-1000ms: 95 1000-5000ms: 134 5000-10000ms: 0 >=10000ms: 0 total(ms): 4256455 Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NXunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: NShanpei Chen <shanpeic@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 23 6月, 2020 2 次提交
-
-
由 Yihao Wu 提交于
to #28739709 Assume workloads are composed of massive short tasks. Then periodical load tracking is unnecessary. Because load tracking should be already guaranteed by frequent sleep and wake-up. If these massive short tasks run in their individual cgroups, the load tracking becomes extremely heavy. This patch adds a switch to bypass scheduler_tick load tracking, in order to reduce scheduler overhead, without sacrificing much balance in this scenario. Performance Tests: 1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine sched overhead(each HT): 0.74% -> 0.48% (This test's baseline is from the previous patch) 2) sysbench-threads with 96 threads, running for 5min latency_ms 95th: 63.07 -> 54.01 Besides these, no regression is found on our test platform. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
由 Yihao Wu 提交于
to #28739709 Unless the workloads are IO-bounded, update_blocked_averages doesn't help load balance. This patch adds a switch to bypass update_blocked_averages if prior knowledge about workloads indicates IO is negligible. Performance Tests: 1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine sched overhead(each HT): 3.78% -> 0.74% 2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake overhead: 21.06 -> 18.08 3) unixbench context1 with 96 threads running for 1min Score: 15409.40 -> 16821.77 Besides these, UnixBench has some performance ups and downs. But generally, the performance of UnixBench hasn't changed. Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 22 6月, 2020 1 次提交
-
-
由 Yihao Wu 提交于
to #28143829 rq_clock_task is less than rq_clock when in VM, or when IRQ_TIME_ACCOUNTING is on. So they are not comparable when accounting elapsed time. This bug is not observed on host yet, because neither of these two conditions are met. Use rq_clock at both begin and end of exec_start_raw accumulation to fix this bug, because we expect steal% in cpuacct.proc_stat of VM's cgroups can reflect the cpu time the host steal from the guest. Fixes: c7552980 ("alinux: sched: Introduce per-cgroup steal accounting") Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com> Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>
-
- 09 6月, 2020 4 次提交
-
-
由 Yafang Shao 提交于
task #28327019 commit 1066d1b6974e095d5a6c472ad9180a957b496cd6 upstream The task->flags is a 32-bits flag, in which 31 bits have already been consumed. So it is hardly to introduce other new per process flag. Currently there're still enough spaces in the bit-field section of task_struct, so we can define the memstall state as a single bit in task_struct instead. This patch also removes an out-of-date comment pointed by Matthew. Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.comSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Johannes Weiner 提交于
task #28327019 commit 36b238d5717279163859fb6ba0f4360abcafab83 upstream When switching tasks running on a CPU, the psi state of a cgroup containing both of these tasks does not change. Right now, we don't exploit that, and can perform many unnecessary state changes in nested hierarchies, especially when most activity comes from one leaf cgroup. This patch implements an optimization where we only update cgroups whose state actually changes during a task switch. These are all cgroups that contain one task but not the other, up to the first shared ancestor. When both tasks are in the same group, we don't need to update anything at all. We can identify the first shared ancestor by walking the groups of the incoming task until we see TSK_ONCPU set on the local CPU; that's the first group that also contains the outgoing task. The new psi_task_switch() is similar to psi_task_change(). To allow code reuse, move the task flag maintenance code into a new function and the poll/avg worker wakeups into the shared psi_group_change(). Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.orgSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Johannes Weiner 提交于
task #28327019 commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 upstream Jingfeng reports rare div0 crashes in psi on systems with some uptime: [58914.066423] divide error: 0000 [#1] SMP [58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O) [58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1 [58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019 [58914.102728] Workqueue: events psi_update_work [58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000 [58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330 [58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246 [58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640 [58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e [58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000 [58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000 [58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78 [58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000 [58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0 [58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [58914.200360] PKRU: 55555554 [58914.203221] Stack: [58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8 [58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000 [58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000 [58914.228734] Call Trace: [58914.231337] [] psi_update_work+0x22/0x60 [58914.237067] [] process_one_work+0x189/0x420 [58914.243063] [] worker_thread+0x4e/0x4b0 [58914.248701] [] ? process_one_work+0x420/0x420 [58914.254869] [] kthread+0xe6/0x100 [58914.259994] [] ? kthread_park+0x60/0x60 [58914.265640] [] ret_from_fork+0x39/0x50 [58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc <49> f7 f1 48 8b 13 48 89 c7 48 c1 [58914.279691] RIP [] psi_update_stats+0x1c1/0x330 The crashing instruction is trying to divide the observed stall time by the sampling period. The period, stored in R8, is not 0, but we are dividing by the lower 32 bits only, which are all 0 in this instance. We could switch to a 64-bit division, but the period shouldn't be that big in the first place. It's the time between the last update and the next scheduled one, and so should always be around 2s and comfortably fit into 32 bits. The bug is in the initialization of new cgroups: we schedule the first sampling event in a cgroup as an offset of sched_clock(), but fail to initialize the last_update timestamp, and it defaults to 0. That results in a bogusly large sampling period the first time we run the sampling code, and consequently we underreport pressure for the first 2s of a cgroup's life. But worse, if sched_clock() is sufficiently advanced on the system, and the user gets unlucky, the period's lower 32 bits can all be 0 and the sampling division will crash. Fix this by initializing the last update timestamp to the creation time of the cgroup, thus correctly marking the start of the first pressure sampling period in a new cgroup. Reported-by: NJingfeng Xie <xiejingfeng@linux.alibaba.com> Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.orgSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
由 Johannes Weiner 提交于
task #28327019 commit b05e75d611380881e73edc58a20fd8c6bb71720b upstream For simplicity, cpu pressure is defined as having more than one runnable task on a given CPU. This works on the system-level, but it has limitations in a cgrouped reality: When cpu.max is in use, it doesn't capture the time in which a task is not executing on the CPU due to throttling. Likewise, it doesn't capture the time in which a competing cgroup is occupying the CPU - meaning it only reflects cgroup-internal competitive pressure, not outside pressure. Enable tracking of currently executing tasks, and then change the definition of cpu pressure in a cgroup from NR_RUNNING > 1 to NR_RUNNING > ON_CPU which will capture the effects of cpu.max as well as competition from outside the cgroup. After this patch, a cgroup running `stress -c 1` with a cpu.max setting of 5000 10000 shows ~50% continuous CPU pressure. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.orgSigned-off-by: Nzhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-