- 08 6月, 2017 10 次提交
-
-
由 Luca Abeni 提交于
This commit introduces a per-runqueue "extra utilization" that can be reclaimed by deadline tasks. In this way, the maximum fraction of CPU time that can reclaimed by deadline tasks is fixed (and configurable) and does not depend on the total deadline utilization. The GRUB accounting rule is modified to add this "extra utilization" to the inactive utilization of the runqueue, and to avoid reclaiming more than a maximum fraction of the CPU time. Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
Instead of decreasing the runtime as "dq = -Uact dt" (eventually divided by the maximum utilization available for deadline tasks), decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task utilization and Uinact is the "inactive utilization". In this way, the maximum fraction of CPU time that can be reclaimed is given by the total utilization of deadline tasks. This approach solves a fairness issue with "traditional" global GRUB reclaiming: using the traditional GRUB algorithm, if tasks are allocated to the various cores in a non-uniform way, the reclaiming mechanism allows some tasks to reclaim more time than others. This issue is visible starting 11 time-consuming tasks with runtime 10ms and period 30ms (total utilization 3.666) on a 4-cores system: some tasks will receive much more than the reserved runtime (thanks to the reclaiming mechanism), while other tasks will receive less than the reserved runtime. Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
The total rq utilization is defined as the sum of the utilisations of tasks that are "assigned" to a runqueue, independently from their state (TASK_RUNNING or blocked) Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NClaudio Scordino <claudio@evidence.eu.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
This patch introduces the SCHED_FLAG_RECLAIM flag to specify that a DL task is allowed to reclaim unused CPU time (using the GRUB algorithm). Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
Original GRUB tends to reclaim 100% of the CPU time... And this allows a CPU hog to starve non-deadline tasks. To address this issue, allow the scheduler to reclaim only a specified fraction of CPU time, stored in the new "bw_ratio" field of the dl runqueue structure. Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
According to the GRUB (Greedy Reclaimation of Unused Bandwidth) reclaiming algorithm, the runtime is not decreased as "dq = -dt", but as "dq = -Uact dt" (where Uact is the per-runqueue active utilization). Hence, this commit modifies the runtime accounting rule in update_curr_dl() to implement the GRUB rule. Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
Now that the inactive timer can be armed to fire at the 0-lag time, it is possible to use inactive_task_timer() to update the total -deadline utilization (dl_b->total_bw) at the correct time, fixing dl_overflow() and __setparam_dl(). Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
This patch implements a more theoretically sound algorithm for tracking active utilization: instead of decreasing it when a task blocks, use a timer (the "inactive timer", named after the "Inactive" task state of the GRUB algorithm) to decrease the active utilization at the so called "0-lag time". Tested-by: NClaudio Scordino <claudio@evidence.eu.com> Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Luca Abeni 提交于
Active utilization is defined as the total utilization of active (TASK_RUNNING) tasks queued on a runqueue. Hence, it is increased when a task wakes up and is decreased when a task blocks. When a task is migrated from CPUi to CPUj, immediately subtract the task's utilization from CPUi and add it to CPUj. This mechanism is implemented by modifying the pull and push functions. Note: this is not fully correct from the theoretical point of view (the utilization should be removed from CPUi only at the 0 lag time), a more theoretically sound solution is presented in the next patches. Tested-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NLuca Abeni <luca.abeni@unitn.it> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NJuri Lelli <juri.lelli@arm.com> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/1495138417-6203-2-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Hackbench recently suffered a bunch of pain, first by commit: 4c77b18c ("sched/fair: Make select_idle_cpu() more aggressive") and then by commit: c743f0a5 ("sched/fair, cpumask: Export for_each_cpu_wrap()") which fixed a bug in the initial for_each_cpu_wrap() implementation that made select_idle_cpu() even more expensive. The bug was that it would skip over CPUs when bits were consequtive in the bitmask. This however gave me an idea to fix select_idle_cpu(); where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. Tested-by: NMatt Fleming <matt@codeblueprint.co.uk> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Cc: kitsunyan <kitsunyan@inbox.ru> Cc: linux-kernel@vger.kernel.org Cc: lvenanci@redhat.com Cc: riel@redhat.com Cc: xiaolong.ye@intel.com Link: http://lkml.kernel.org/r/20170517105350.hk5m4h4jb6dfr65a@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 5月, 2017 1 次提交
-
-
由 Peter Zijlstra 提交于
The more strict early boot preemption warnings found that __set_sched_clock_stable() was incorrectly assuming we'd still be running on a single CPU: BUG: using smp_processor_id() in preemptible [00000000] code: swapper/0/1 caller is debug_smp_processor_id+0x1c/0x1e CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.12.0-rc2-00108-g1c3c5eab #1 Call Trace: dump_stack+0x110/0x192 check_preemption_disabled+0x10c/0x128 ? set_debug_rodata+0x25/0x25 debug_smp_processor_id+0x1c/0x1e sched_clock_init_late+0x27/0x87 [...] Fix it by disabling IRQs. Reported-by: Nkernel test robot <xiaolong.ye@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: lkp@01.org Cc: tipbuild@zytor.com Link: http://lkml.kernel.org/r/20170524065202.v25vyu7pvba5mhpd@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 5月, 2017 13 次提交
-
-
由 Thomas Gleixner 提交于
might_sleep() and smp_processor_id() checks are enabled after the boot process is done. That hides bugs in the SMP bringup and driver initialization code. Enable it right when the scheduler starts working, i.e. when init task and kthreadd have been created and right before the idle task enables preemption. Tested-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NMark Rutland <mark.rutland@arm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184736.272225698@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Thomas Gleixner 提交于
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in boot_delay_msec() to handle the extra states. Tested-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184736.027534895@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Thomas Gleixner 提交于
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in core_kernel_text() to handle the extra states, i.e. to cover init text up to the point where the system switches to state RUNNING. Tested-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20170516184735.949992741@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Thomas Gleixner 提交于
To enable smp_processor_id() and might_sleep() debug checks earlier, it's required to add system states between SYSTEM_BOOTING and SYSTEM_RUNNING. Adjust the system_state check in async_run_entry_fn() and async_synchronize_cookie_domain() to handle the extra states. Tested-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NArjan van de Ven <arjan@linux.intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/20170516184735.865155020@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vlastimil Babka 提交于
A customer has reported a soft-lockup when running an intensive memory stress test, where the trace on multiple CPU's looks like this: RIP: 0010:[<ffffffff810c53fe>] [<ffffffff810c53fe>] native_queued_spin_lock_slowpath+0x10e/0x190 ... Call Trace: [<ffffffff81182d07>] queued_spin_lock_slowpath+0x7/0xa [<ffffffff811bc331>] change_protection_range+0x3b1/0x930 [<ffffffff811d4be8>] change_prot_numa+0x18/0x30 [<ffffffff810adefe>] task_numa_work+0x1fe/0x310 [<ffffffff81098322>] task_work_run+0x72/0x90 Further investigation showed that the lock contention here is pmd_lock(). The task_numa_work() function makes sure that only one thread is let to perform the work in a single scan period (via cmpxchg), but if there's a thread with mmap_sem locked for writing for several periods, multiple threads in task_numa_work() can build up a convoy waiting for mmap_sem for read and then all get unblocked at once. This patch changes the down_read() to the trylock version, which prevents the build up. For a workload experiencing mmap_sem contention, it's probably better to postpone the NUMA balancing work anyway. This seems to have fixed the soft lockups involving pmd_lock(), which is in line with the convoy theory. Signed-off-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mgorman@techsingularity.net> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.czSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Dave Kleikamp 提交于
With CONFIG_RT_GROUP_SCHED=y, do_sched_rt_period_timer() sequentially takes each CPU's rq->lock. On a large, busy system, the cumulative time it takes to acquire each lock can be excessive, even triggering a watchdog timeout. If rt_rq->rt_time and rt_rq->rt_nr_running are both zero, this function does nothing while holding the lock, so don't bother taking it at all. Signed-off-by: NDave Kleikamp <dave.kleikamp@oracle.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a767637b-df85-912f-ba69-c90ee00a3fb6@oracle.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Steven Rostedt (VMware) 提交于
When priority inheritance was added back in 2.6.18 to sched_setscheduler(), it added a path to taking an rt-mutex wait_lock, which is not IRQ safe. As PI is not a common occurrence, lockdep will likely never trigger if sched_setscheduler was called from interrupt context. A BUG_ON() was added to trigger if __sched_setscheduler() was ever called from interrupt context because there was a possibility to take the wait_lock. Today the wait_lock is irq safe, but the path to taking it in sched_setscheduler() is the same as the path to taking it from normal context. The wait_lock is taken with raw_spin_lock_irq() and released with raw_spin_unlock_irq() which will indiscriminately enable interrupts, which would be bad in interrupt context. The problem is that normalize_rt_tasks, which is called by triggering the sysrq nice-all-RT-tasks was changed to call __sched_setscheduler(), and this is done from interrupt context! Now __sched_setscheduler() takes a "pi" parameter that is used to know if the priority inheritance should be called or not. As the BUG_ON() only cares about calling the PI code, it should only bug if called from interrupt context with the "pi" parameter set to true. Reported-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com> Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@osdl.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: dbc7f069 ("sched: Use replace normalize_task() with __sched_setscheduler()") Link: http://lkml.kernel.org/r/20170308124654.10e598f2@gandalf.local.homeSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Byungchul Park 提交于
pick_next_pushable_dl_task(rq) has BUG_ON(rq->cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: NByungchul Park <byungchul.park@lge.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: NJuri Lelli <juri.lelli@arm.com> Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551159-22367-1-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Byungchul Park 提交于
pick_next_pushable_task(rq) has BUG_ON(rq_cpu != task_cpu(task)) when it returns a task other than NULL, which means that task_cpu(task) must be rq->cpu. So if task == next_task, then task_cpu(next_task) must be rq->cpu as well. Remove the redundant condition and make the code simpler. This way one unnecessary branch and two LOAD operations can be avoided. Signed-off-by: NByungchul Park <byungchul.park@lge.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: NJuri Lelli <juri.lelli@arm.com> Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494551143-22219-1-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Byungchul Park 提交于
Now that we've added llist_for_each_entry_safe(), use it to simplify an open coded version of it in sched_ttwu_pending(). Signed-off-by: NByungchul Park <byungchul.park@lge.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <kernel-team@lge.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1494549584-11730-1-git-send-email-byungchul.park@lge.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
The cpumasks in smp_call_function_many() are private and not subject to concurrency, atomic bitops are pointless and expensive. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Aaron Lu 提交于
Inter-Processor-Interrupt(IPI) is needed when a page is unmapped and the process' mm_cpumask() shows the process has ever run on other CPUs. page migration, page reclaim all need IPIs. The number of IPI needed to send to different CPUs is especially large for multi-threaded workload since mm_cpumask() is per process. For smp_call_function_many(), whenever a CPU queues a CSD to a target CPU, it will send an IPI to let the target CPU to handle the work. This isn't necessary - we need only send IPI when queueing a CSD to an empty call_single_queue. The reason: flush_smp_call_function_queue() that is called upon a CPU receiving an IPI will empty the queue and then handle all of the CSDs there. So if the target CPU's call_single_queue is not empty, we know that: i. An IPI for the target CPU has already been sent by 'previous queuers'; ii. flush_smp_call_function_queue() hasn't emptied that CPU's queue yet. Thus, it's safe for us to just queue our CSD there without sending an addtional IPI. And for the 'previous queuers', we can limit it to the first queuer. To demonstrate the effect of this patch, a multi-thread workload that spawns 80 threads to equally consume 100G memory is used. This is tested on a 2 node broadwell-EP which has 44cores/88threads and 32G memory. So after 32G memory is used up, page reclaiming starts to happen a lot. With this patch, IPI number dropped 88% and throughput increased about 15% for the above workload. Signed-off-by: NAaron Lu <aaron.lu@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20170519075331.GE2084@aaronlu.sh.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 David S. Miller 提交于
The assignmnet: ip_align = strict ? 2 : NET_IP_ALIGN; in compare_pkt_ptr_alignment() trips up Coverity because we can only get to this code when strict is true, therefore ip_align will always be 2 regardless of NET_IP_ALIGN's value. So just assign directly to '2' and explain the situation in the comment above. Reported-by: N"Gustavo A. R. Silva" <garsilva@embeddedor.com> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
- 19 5月, 2017 2 次提交
-
-
由 Shaohua Li 提交于
sscanf is a very poor way to parse integer. For example, I input "discard" for act_mask, it gets 0xd and completely messes up. Using correct API to do integer parse. This patch also makes attributes accept any base of integer. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Steven Rostedt (VMware) 提交于
As stack tracing now requires "rcu watching", force RCU to be watching when recording a stack trace. Link: http://lkml.kernel.org/r/20170512172449.879684501@goodmis.orgAcked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 18 5月, 2017 7 次提交
-
-
由 Daniel Borkmann 提交于
Current limits with regards to processing program paths do not really reflect today's needs anymore due to programs becoming more complex and verifier smarter, keeping track of more data such as const ALU operations, alignment tracking, spilling of PTR_TO_MAP_VALUE_ADJ registers, and other features allowing for smarter matching of what LLVM generates. This also comes with the side-effect that we result in fewer opportunities to prune search states and thus often need to do more work to prove safety than in the past due to different register states and stack layout where we mismatch. Generally, it's quite hard to determine what caused a sudden increase in complexity, it could be caused by something as trivial as a single branch somewhere at the beginning of the program where LLVM assigned a stack slot that is marked differently throughout other branches and thus causing a mismatch, where verifier then needs to prove safety for the whole rest of the program. Subsequently, programs with even less than half the insn size limit can get rejected. We noticed that while some programs load fine under pre 4.11, they get rejected due to hitting limits on more recent kernels. We saw that in the vast majority of cases (90+%) pruning failed due to register mismatches. In case of stack mismatches, majority of cases failed due to different stack slot types (invalid, spill, misc) rather than differences in spilled registers. This patch makes pruning more aggressive by also adding markers that sit at conditional jumps as well. Currently, we only mark jump targets for pruning. For example in direct packet access, these are usually error paths where we bail out. We found that adding these markers, it can reduce number of processed insns by up to 30%. Another option is to ignore reg->id in probing PTR_TO_MAP_VALUE_OR_NULL registers, which can help pruning slightly as well by up to 7% observed complexity reduction as stand-alone. Meaning, if a previous path with register type PTR_TO_MAP_VALUE_OR_NULL for map X was found to be safe, then in the current state a PTR_TO_MAP_VALUE_OR_NULL register for the same map X must be safe as well. Last but not least the patch also adds a scheduling point and bumps the current limit for instructions to be processed to a more adequate value. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Acked-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NDavid S. Miller <davem@davemloft.net>
-
由 Steven Rostedt (VMware) 提交于
Thomas discovered a bug where the kprobe trace tests had a race condition where the kprobe_optimizer called from a delayed work queue that does the optimizing and "unoptimizing" of a kprobe, can try to modify the text after it has been freed by the init code. The kprobe trace selftest is a special case, and Thomas and myself investigated to see if there's a chance that this could also be a bug with module unloading, as the code is not obvious to how it handles this. After adding lots of printks, I figured it out. Thomas suggested that this should be commented so that others will not have to go through this exercise again. Link: http://lkml.kernel.org/r/20170516145835.3827d3aa@gandalf.local.homeAcked-by: NMasami Hiramatsu <mhiramat@kernel.org> Suggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
No need to add ugly #ifdefs in the code. Having a standard stub file is much prettier. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Naveen N. Rao 提交于
If instance directories are deleted while there are registered function triggers: # cd /sys/kernel/debug/tracing/instances # mkdir test # echo "schedule:enable_event:sched:sched_switch" > test/set_ftrace_filter # rmdir test Unable to handle kernel paging request for data at address 0x00000008 Unable to handle kernel paging request for data at address 0x00000008 Faulting instruction address: 0xc0000000021edde8 Oops: Kernel access of bad area, sig: 11 [#1] SMP NR_CPUS=2048 NUMA pSeries Modules linked in: iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp tun bridge stp llc kvm iptable_filter fuse binfmt_misc pseries_rng rng_core vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c multipath virtio_net virtio_blk virtio_pci crc32c_vpmsum virtio_ring virtio CPU: 8 PID: 8694 Comm: rmdir Not tainted 4.11.0-nnr+ #113 task: c0000000bab52800 task.stack: c0000000baba0000 NIP: c0000000021edde8 LR: c0000000021f0590 CTR: c000000002119620 REGS: c0000000baba3870 TRAP: 0300 Not tainted (4.11.0-nnr+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 22002422 XER: 20000000 CFAR: 00007fffabb725a8 DAR: 0000000000000008 DSISR: 40000000 SOFTE: 0 GPR00: c00000000220f750 c0000000baba3af0 c000000003157e00 0000000000000000 GPR04: 0000000000000040 00000000000000eb 0000000000000040 0000000000000000 GPR08: 0000000000000000 0000000000000113 0000000000000000 c00000000305db98 GPR12: c000000002119620 c00000000fd42c00 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 GPR20: 0000000000000000 0000000000000000 c0000000bab52e90 0000000000000000 GPR24: 0000000000000000 00000000000000eb 0000000000000040 c0000000baba3bb0 GPR28: c00000009cb06eb0 c0000000bab52800 c00000009cb06eb0 c0000000baba3bb0 NIP [c0000000021edde8] ring_buffer_lock_reserve+0x8/0x4e0 LR [c0000000021f0590] trace_event_buffer_lock_reserve+0xe0/0x1a0 Call Trace: [c0000000baba3af0] [c0000000021f96c8] trace_event_buffer_commit+0x1b8/0x280 (unreliable) [c0000000baba3b60] [c00000000220f750] trace_event_buffer_reserve+0x80/0xd0 [c0000000baba3b90] [c0000000021196b8] trace_event_raw_event_sched_switch+0x98/0x180 [c0000000baba3c10] [c0000000029d9980] __schedule+0x6e0/0xab0 [c0000000baba3ce0] [c000000002122230] do_task_dead+0x70/0xc0 [c0000000baba3d10] [c0000000020ea9c8] do_exit+0x828/0xd00 [c0000000baba3dd0] [c0000000020eaf70] do_group_exit+0x60/0x100 [c0000000baba3e10] [c0000000020eb034] SyS_exit_group+0x24/0x30 [c0000000baba3e30] [c00000000200bcec] system_call+0x38/0x54 Instruction dump: 60000000 60420000 7d244b78 7f63db78 4bffaa09 393efff8 793e0020 39200000 4bfffecc 60420000 3c4c00f7 3842a020 <81230008> 2f890000 409e02f0 a14d0008 ---[ end trace b917b8985d0e650b ]--- Unable to handle kernel paging request for data at address 0x00000008 Faulting instruction address: 0xc0000000021edde8 Unable to handle kernel paging request for data at address 0x00000008 Faulting instruction address: 0xc0000000021edde8 Faulting instruction address: 0xc0000000021edde8 To address this, let's clear all registered function probes before deleting the ftrace instance. Link: http://lkml.kernel.org/r/c5f1ca624043690bd94642bb6bffd3f2fc504035.1494956770.git.naveen.n.rao@linux.vnet.ibm.comReported-by: NMichael Ellerman <mpe@ellerman.id.au> Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Naveen N. Rao 提交于
Handle a NULL glob properly and simplify the check. Link: http://lkml.kernel.org/r/5df74d4ffb4721db6d5a22fa08ca031d62ead493.1494956770.git.naveen.n.rao@linux.vnet.ibm.comReviewed-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Thomas Gleixner 提交于
Enabling the tracer selftest triggers occasionally the warning in text_poke(), which warns when the to be modified page is not marked reserved. The reason is that the tracer selftest installs kprobes on functions marked __init for testing. These probes are removed after the tests, but that removal schedules the delayed kprobes_optimizer work, which will do the actual text poke. If the work is executed after the init text is freed, then the warning triggers. The bug can be reproduced reliably when the work delay is increased. Flush the optimizer work and wait for the optimizing/unoptimizing lists to become empty before returning from the kprobes tracer selftest. That ensures that all operations which were queued due to the probes removal have completed. Link: http://lkml.kernel.org/r/20170516094802.76a468bb@gandalf.local.homeSigned-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NMasami Hiramatsu <mhiramat@kernel.org> Cc: stable@vger.kernel.org Fixes: 6274de49 ("kprobes: Support delayed unoptimizing") Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt 提交于
I hit the following lockdep splat when booting with ftrace selftests enabled, as well as CONFIG_PREEMPT and LOCKDEP. Testing dynamic ftrace ops #1: (1 0 1 0 0) (1 1 2 0 0) (2 1 3 0 169) (2 2 4 0 50066) ------------[ cut here ]------------ WARNING: CPU: 0 PID: 13 at kernel/rcu/srcutree.c:202 check_init_srcu_struct+0x60/0x70 Modules linked in: CPU: 0 PID: 13 Comm: rcu_tasks_kthre Not tainted 4.12.0-rc1-test+ #587 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 task: ffff880119628040 task.stack: ffffc900006a4000 RIP: 0010:check_init_srcu_struct+0x60/0x70 RSP: 0000:ffffc900006a7d98 EFLAGS: 00010246 RAX: 0000000000000246 RBX: 0000000000000000 RCX: 0000000000000000 RDX: ffff880119628040 RSI: 00000000ffffffff RDI: ffffffff81e5fb40 RBP: ffffc900006a7e20 R08: 00000023b403c000 R09: 0000000000000001 R10: ffffc900006a7e40 R11: 0000000000000000 R12: ffffffff81e5fb40 R13: 0000000000000286 R14: ffff880119628040 R15: ffffc900006a7e98 FS: 0000000000000000(0000) GS:ffff88011ea00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88011edff000 CR3: 0000000001e0f000 CR4: 00000000001406f0 Call Trace: ? __synchronize_srcu+0x6e/0x140 ? lock_acquire+0xdc/0x1d0 ? ktime_get_mono_fast_ns+0x5d/0xb0 synchronize_srcu+0x6f/0x110 ? synchronize_srcu+0x6f/0x110 rcu_tasks_kthread+0x20a/0x540 kthread+0x114/0x150 ? __rcu_read_unlock+0x70/0x70 ? kthread_create_on_node+0x40/0x40 ret_from_fork+0x2e/0x40 Code: f6 83 70 06 00 00 03 49 89 c5 74 0d be 01 00 00 00 48 89 df e8 42 fa ff ff 4c 89 ee 4c 89 e7 e8 b7 42 75 00 5b 41 5c 41 5d 5d c3 <0f> ff eb aa 66 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 ---[ end trace 5c3f4206ce50f6ac ]--- What happens is that the selftests include a creating of a dynamically allocated ftrace_ops, which requires the use of synchronize_rcu_tasks() which uses srcu, and triggers the above warning. It appears that synchronize_rcu_tasks() is not set up at early_initcall(), but it is at core_initcall(). By moving the tests down to that location works out properly. Link: http://lkml.kernel.org/r/20170517111435.7388c033@gandalf.local.homeAcked-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 16 5月, 2017 1 次提交
-
-
由 Thomas Gleixner 提交于
irq_set_chained_handler_and_data() sets up the chained interrupt and then stores the handler data. That's racy against an immediate interrupt which gets handled before the store of the handler data happened. The handler will dereference a NULL pointer and crash. Cure it by storing handler data before installing the chained handler. Reported-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org
-
- 15 5月, 2017 6 次提交
-
-
由 Tejun Heo 提交于
Currently, rq->leaf_cfs_rq_list is a traversal ordered list of all live cfs_rqs which have ever been active on the CPU; unfortunately, this makes update_blocked_averages() O(# total cgroups) which isn't scalable at all. This shows up as a small CPU consumption and scheduling latency increase in the load balancing path in systems with CPU controller enabled across most cgroups. In an edge case where temporary cgroups were leaking, this caused the kernel to consume good several tens of percents of CPU cycles running update_blocked_averages(), each run taking multiple millisecs. This patch fixes the issue by taking empty and fully decayed cfs_rqs off the rq->leaf_cfs_rq_list. Signed-off-by: NTejun Heo <tj@kernel.org> [ Added cfs_rq_is_decayed() ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NVincent Guittot <vincent.guittot@linaro.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170426004350.GB3222@wtj.duckdns.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
In order to allow leaf_cfs_rq_list to remove entries switch the bandwidth hotplug code over to the task_groups list. Suggested-by: NTejun Heo <tj@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Chris Mason <clm@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170504133122.a6qjlj3hlblbjxux@hirez.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
There's a discrepancy in naming between the sched_domain and sched_group cpumask accessor. Since we're doing changes, fix it. $ git grep sched_group_cpus | wc -l 28 $ git grep sched_domain_span | wc -l 38 Suggests changing sched_group_cpus() into sched_group_span(): for i in `git grep -l sched_group_cpus` do sed -ie 's/sched_group_cpus/sched_group_span/g' $i done Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Since sched_group_mask() is now an independent cpumask (it no longer masks sched_group_cpus()), rename the thing. Suggested-by: NLauro Ramos Venancio <lvenanci@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
While writing the comments, it occurred to me that: sg_cpus & sg_mask == sg_mask at least conceptually; the !overlap case sets the all 1s mask. If we correct that we can simplify things and directly use sg_mask. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We want to attain: sg_cpus() & sg_mask() == sg_mask() for this to be so we must initialize sg_mask() to sg_cpus() for the !overlap case (its currently cpumask_setall()). Since the code makes my head hurt bad, rewrite it into a simpler form, inspired by the now fixed overlap code. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-