- 18 4月, 2017 1 次提交
-
-
由 Steven Rostedt (VMware) 提交于
The trace_event benchmark thread runs in kernel space in an infinite loop while also calling cond_resched() in case anything else wants to schedule in. Unfortunately, on a PREEMPT kernel, that makes it a nop, in which case, this will never voluntarily schedule. That will cause synchronize_rcu_tasks() to forever block on this thread, while it is running. This is exactly what cond_resched_rcu_qs() is for. Use that instead. Acked-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 17 4月, 2017 1 次提交
-
-
由 Steven Rostedt (VMware) 提交于
t_hash_start() does not increment *pos, where as t_next() must. But when t_next() does increment *pos, it must still pass in the original *pos to t_hash_start() otherwise it will skip the first instance: # cd /sys/kernel/debug/tracing # echo schedule:traceoff > set_ftrace_filter # echo do_IRQ:traceoff > set_ftrace_filter # echo call_rcu > set_ftrace_filter # cat set_ftrace_filter call_rcu schedule:traceoff:unlimited do_IRQ:traceoff:unlimited The above called t_hash_start() from t_start() as there was only one function (call_rcu), but if we add another function: # echo xfrm_policy_destroy_rcu >> set_ftrace_filter # cat set_ftrace_filter call_rcu xfrm_policy_destroy_rcu do_IRQ:traceoff:unlimited The "schedule:traceoff" disappears. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 16 4月, 2017 1 次提交
-
-
由 Steven Rostedt (VMware) 提交于
When two function probes are added to set_ftrace_filter, and then one of them is removed, the update to the function locations is not performed, and the record keeping of the function states are corrupted, and causes an ftrace_bug() to occur. This is easily reproducable by adding two probes, removing one, and then adding it back again. # cd /sys/kernel/debug/tracing # echo schedule:traceoff > set_ftrace_filter # echo do_IRQ:traceoff > set_ftrace_filter # echo \!do_IRQ:traceoff > /debug/tracing/set_ftrace_filter # echo do_IRQ:traceoff > set_ftrace_filter Causes: ------------[ cut here ]------------ WARNING: CPU: 2 PID: 1098 at kernel/trace/ftrace.c:2369 ftrace_get_addr_curr+0x143/0x220 Modules linked in: [...] CPU: 2 PID: 1098 Comm: bash Not tainted 4.10.0-test+ #405 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v02.05 05/07/2012 Call Trace: dump_stack+0x68/0x9f __warn+0x111/0x130 ? trace_irq_work_interrupt+0xa0/0xa0 warn_slowpath_null+0x1d/0x20 ftrace_get_addr_curr+0x143/0x220 ? __fentry__+0x10/0x10 ftrace_replace_code+0xe3/0x4f0 ? ftrace_int3_handler+0x90/0x90 ? printk+0x99/0xb5 ? 0xffffffff81000000 ftrace_modify_all_code+0x97/0x110 arch_ftrace_update_code+0x10/0x20 ftrace_run_update_code+0x1c/0x60 ftrace_run_modify_code.isra.48.constprop.62+0x8e/0xd0 register_ftrace_function_probe+0x4b6/0x590 ? ftrace_startup+0x310/0x310 ? debug_lockdep_rcu_enabled.part.4+0x1a/0x30 ? update_stack_state+0x88/0x110 ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320 ? preempt_count_sub+0x18/0xd0 ? mutex_lock_nested+0x104/0x800 ? ftrace_regex_write.isra.43.part.44+0x1d3/0x320 ? __unwind_start+0x1c0/0x1c0 ? _mutex_lock_nest_lock+0x800/0x800 ftrace_trace_probe_callback.isra.3+0xc0/0x130 ? func_set_flag+0xe0/0xe0 ? __lock_acquire+0x642/0x1790 ? __might_fault+0x1e/0x20 ? trace_get_user+0x398/0x470 ? strcmp+0x35/0x60 ftrace_trace_onoff_callback+0x48/0x70 ftrace_regex_write.isra.43.part.44+0x251/0x320 ? match_records+0x420/0x420 ftrace_filter_write+0x2b/0x30 __vfs_write+0xd7/0x330 ? do_loop_readv_writev+0x120/0x120 ? locks_remove_posix+0x90/0x2f0 ? do_lock_file_wait+0x160/0x160 ? __lock_is_held+0x93/0x100 ? rcu_read_lock_sched_held+0x5c/0xb0 ? preempt_count_sub+0x18/0xd0 ? __sb_start_write+0x10a/0x230 ? vfs_write+0x222/0x240 vfs_write+0xef/0x240 SyS_write+0xab/0x130 ? SyS_read+0x130/0x130 ? trace_hardirqs_on_caller+0x182/0x280 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x7fe61c157c30 RSP: 002b:00007ffe87890258 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: ffffffff8114a410 RCX: 00007fe61c157c30 RDX: 0000000000000010 RSI: 000055814798f5e0 RDI: 0000000000000001 RBP: ffff8800c9027f98 R08: 00007fe61c422740 R09: 00007fe61ca53700 R10: 0000000000000073 R11: 0000000000000246 R12: 0000558147a36400 R13: 00007ffe8788f160 R14: 0000000000000024 R15: 00007ffe8788f15c ? trace_hardirqs_off_caller+0xc0/0x110 ---[ end trace 99fa09b3d9869c2c ]--- Bad trampoline accounting at: ffffffff81cc3b00 (do_IRQ+0x0/0x150) Cc: stable@vger.kernel.org Fixes: 59df055f ("ftrace: trace different functions with a different tracer") Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 11 4月, 2017 5 次提交
-
-
由 Steven Rostedt (VMware) 提交于
Tracing uses rcu_irq_enter() as a way to make sure that RCU is watching when it needs to use rcu_read_lock() and friends. This is because tracing can happen as RCU is about to enter user space, or about to go idle, and RCU does not watch for RCU read side critical sections as it makes the transition. There is a small location within the RCU infrastructure that rcu_irq_enter() itself will not work. If tracing were to occur in that section it will break if it tries to use rcu_irq_enter(). Originally, this happens with the stack_tracer, because it will call save_stack_trace when it encounters stack usage that is greater than any stack usage it had encountered previously. There was a case where that happened in the RCU section where rcu_irq_enter() did not work, and lockdep complained loudly about it. To fix it, stack tracing added a call to be disabled and RCU would disable stack tracing during the critical section that rcu_irq_enter() was inoperable. This solution worked, but there are other cases that use rcu_irq_enter() and it would be a good idea to let RCU give a way to let others know that rcu_irq_enter() will not work. For example, in trace events. Another helpful aspect of this change is that it also moves the per cpu variable called in the RCU critical section into a cache locale along with other RCU per cpu variables used in that same location. I'm keeping the stack_trace_disable() code, as that still could be used in the future by places that really need to disable it. And since it's only a static inline, it wont take up any kernel text if it is not used. Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.homeAcked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Paul E. McKenney 提交于
The tracing subsystem started using rcu_irq_entry() and rcu_irq_exit() (with my blessing) to allow the current _rcuidle alternative tracepoint name to be dispensed with while still maintaining good performance. Unfortunately, this causes RCU's dyntick-idle entry code's tracing to appear to RCU like an interrupt that occurs where RCU is not designed to handle interrupts. This commit fixes this problem by moving the zeroing of ->dynticks_nesting after the offending trace_rcu_dyntick() statement, which narrows the window of vulnerability to a pair of adjacent statements that are now marked with comments to that effect. Link: http://lkml.kernel.org/r/20170405093207.404f8deb@gandalf.local.home Link: http://lkml.kernel.org/r/20170405193928.GM1600@linux.vnet.ibm.comReported-by: NSteven Rostedt <rostedt@goodmis.org> Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
In order to eliminate a function call, make "trace_active" into "disable_stack_tracer" and convert stack_tracer_disable() and friends into static inline functions. Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
There are certain parts of the kernel that cannot let stack tracing proceed (namely in RCU), because the stack tracer uses RCU, and parts of RCU internals cannot handle having RCU read side locks taken. Add stack_tracer_disable() and stack_tracer_enable() functions to let RCU stop stack tracing on the current CPU when it is in those critical sections. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The updates to the trace_active per cpu variable can be updated with the __this_cpu_*() functions as it only gets updated on the CPU that the variable is on. Thanks to Paul McKenney for suggesting __this_cpu_* instead of this_cpu_*. Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 07 4月, 2017 1 次提交
-
-
由 Steven Rostedt (VMware) 提交于
The function tracer needs to be more careful than other subsystems when it comes to freeing data. Especially if that data is actually executable code. When a single function is traced, a trampoline can be dynamically allocated which is called to jump to the function trace callback. When the callback is no longer needed, the dynamic allocated trampoline needs to be freed. This is where the issues arise. The dynamically allocated trampoline must not be used again. As function tracing can trace all subsystems, including subsystems that are used to serialize aspects of freeing (namely RCU), it must take extra care when doing the freeing. Before synchronize_rcu_tasks() was around, there was no way for the function tracer to know that nothing was using the dynamically allocated trampoline when CONFIG_PREEMPT was enabled. That's because a task could be indefinitely preempted while sitting on the trampoline. Now with synchronize_rcu_tasks(), it will wait till all tasks have either voluntarily scheduled (not on the trampoline) or goes into userspace (not on the trampoline). Then it is safe to free the trampoline even with CONFIG_PREEMPT set. Acked-by: N"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 04 4月, 2017 2 次提交
-
-
由 Alban Crequy 提交于
When a kretprobe is installed on a kernel function, there is a maximum limit of how many calls in parallel it can catch (aka "maxactive"). A kernel module could call register_kretprobe() and initialize maxactive (see example in samples/kprobes/kretprobe_example.c). But that is not exposed to userspace and it is currently not possible to choose maxactive when writing to /sys/kernel/debug/tracing/kprobe_events The default maxactive can be as low as 1 on single-core with a non-preemptive kernel. This is too low and we need to increase it not only for recursive functions, but for functions that sleep or resched. This patch updates the format of the command that can be written to kprobe_events so that maxactive can be optionally specified. I need this for a bpf program attached to the kretprobe of inet_csk_accept, which can sleep for a long time. This patch includes a basic selftest: > # ./ftracetest -v test.d/kprobe/ > === Ftrace unit tests === > [1] Kprobe dynamic event - adding and removing [PASS] > [2] Kprobe dynamic event - busy event check [PASS] > [3] Kprobe dynamic event with arguments [PASS] > [4] Kprobes event arguments with types [PASS] > [5] Kprobe dynamic event with function tracer [PASS] > [6] Kretprobe dynamic event with arguments [PASS] > [7] Kretprobe dynamic event with maxactive [PASS] > > # of passed: 7 > # of failed: 0 > # of unresolved: 0 > # of untested: 0 > # of unsupported: 0 > # of xfailed: 0 > # of undefined(test bug): 0 BugLink: https://github.com/iovisor/bcc/issues/1072 Link: http://lkml.kernel.org/r/1491215782-15490-1-git-send-email-alban@kinvolk.ioAcked-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NAlban Crequy <alban@kinvolk.io> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Relying on free_reserved_area() to call ftrace to free init memory proved to not be sufficient. The issue is that on x86, when debug_pagealloc is enabled, the init memory is not freed, but simply set as not present. Since ftrace was uninformed of this, starting function tracing still tries to update pages that are not present according to the page tables, causing ftrace to bug, as well as killing the kernel itself. Instead of relying on free_reserved_area(), have init/main.c call ftrace directly just before it frees the init memory. Then it needs to use __init_begin and __init_end to know where the init memory location is. Looking at all archs (and testing what I can), it appears that this should work for each of them. Reported-by: Nkernel test robot <xiaolong.ye@intel.com> Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 01 4月, 2017 5 次提交
-
-
由 Steven Rostedt (VMware) 提交于
I noticed that if I use dd to read the set_ftrace_filter file that the first hash command is repeated. # cd /sys/kernel/debug/tracing # echo schedule > set_ftrace_filter # echo do_IRQ >> set_ftrace_filter # echo schedule:traceoff >> set_ftrace_filter # echo do_IRQ:traceoff >> set_ftrace_filter # cat set_ftrace_filter schedule do_IRQ schedule:traceoff:unlimited do_IRQ:traceoff:unlimited # dd if=set_ftrace_filter bs=1 schedule do_IRQ schedule:traceoff:unlimited schedule:traceoff:unlimited do_IRQ:traceoff:unlimited 98+0 records in 98+0 records out 98 bytes copied, 0.00265011 s, 37.0 kB/s This is due to the way t_start() calls t_next() as well as the seq_file calls t_next() and the state is slightly different between the two. Namely, t_start() will call t_next() with a local "pos" variable. By separating out the function listing from t_next() into its own function, we can have better control of outputting the functions and the hash of triggers. This simplifies the code. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
If all functions are enabled, there's a comment displayed in the file to denote that: # cd /sys/kernel/debug/tracing # cat set_ftrace_filter #### all functions enabled #### If a function trigger is set, those are displayed as well: # echo schedule:traceoff >> /debug/tracing/set_ftrace_filter # cat set_ftrace_filter #### all functions enabled #### schedule:traceoff:unlimited But if you read that file with dd, the output can change: # dd if=/debug/tracing/set_ftrace_filter bs=1 #### all functions enabled #### 32+0 records in 32+0 records out 32 bytes copied, 7.0237e-05 s, 456 kB/s This is because the "pos" variable is updated for the comment, but func_pos is not. "func_pos" is used by the triggers (or hashes) to know how many functions were printed and it bases its index from the pos - func_pos. func_pos should be 1 to count for the comment printed. But since it is not, t_hash_start() thinks that one trigger was already printed. The cat gets to t_hash_start() via t_next() and not t_start() which updates both pos and func_pos. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The loop in t_start() of calling t_next() will call t_hash_start() if the pos is beyond the functions and enters the hash items. There's no reason to check if p is NULL and call t_hash_start(), as that would be redundant. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Instead of testing if the hash to use is the filter_hash or the notrace_hash at each iteration, do the test at open, and set the iter->hash to point to the corresponding filter or notrace hash. Then use that directly instead of testing which hash needs to be used each iteration. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
The return status check of __seq_open_private() is rather strange: iter = __seq_open_private(); if (iter) { /* do stuff */ } return iter ? 0 : -ENOMEM; It makes much more sense to do the return of failure right away: iter = __seq_open_private(); if (!iter) return -ENOMEM; /* do stuff */ return 0; This clean up will make updates to this code a bit nicer. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 25 3月, 2017 5 次提交
-
-
由 Steven Rostedt (VMware) 提交于
Currently trace_handle_return() looks like this: static inline enum print_line_t trace_handle_return(struct trace_seq *s) { return trace_seq_has_overflowed(s) ? TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED; } Where trace_seq_overflowed(s) is: static inline bool trace_seq_has_overflowed(struct trace_seq *s) { return s->full || seq_buf_has_overflowed(&s->seq); } And seq_buf_has_overflowed(&s->seq) is: static inline bool seq_buf_has_overflowed(struct seq_buf *s) { return s->len > s->size; } Making trace_handle_return() into: return (s->full || (s->seq->len > s->seq->size)) ? TRACE_TYPE_PARTIAL_LINE : TRACE_TYPE_HANDLED; One would think this is not an issue to keep as an inline. But because this is used in the TRACE_EVENT() macro, it is extended for every tracepoint in the system. Taking a look at a single tracepoint x86_irq_vector (was the first one I randomly chosen). As trace_handle_return is used in the TRACE_EVENT() macro of trace_raw_output_##call() we disassemble trace_raw_output_x86_irq_vector and do a diff: - is the original + is the out-of-line code I removed identical lines that were different just due to different addresses. --- /tmp/irq-vec-orig 2017-03-16 09:12:48.569384851 -0400 +++ /tmp/irq-vec-ool 2017-03-16 09:13:39.378153385 -0400 @@ -6,27 +6,23 @@ 53 push %rbx 48 89 fb mov %rdi,%rbx 4c 8b a7 c0 20 00 00 mov 0x20c0(%rdi),%r12 e8 f7 72 13 00 callq ffffffff81155c80 <trace_raw_output_prep> 83 f8 01 cmp $0x1,%eax 74 05 je ffffffff8101e993 <trace_raw_output_x86_irq_vector+0x23> 5b pop %rbx 41 5c pop %r12 5d pop %rbp c3 retq 41 8b 54 24 08 mov 0x8(%r12),%edx - 48 8d bb 98 10 00 00 lea 0x1098(%rbx),%rdi + 48 81 c3 98 10 00 00 add $0x1098,%rbx - 48 c7 c6 7b 8a a0 81 mov $0xffffffff81a08a7b,%rsi + 48 c7 c6 ab 8a a0 81 mov $0xffffffff81a08aab,%rsi - e8 c5 85 13 00 callq ffffffff81156f70 <trace_seq_printf> === here's the start of the main difference === + 48 89 df mov %rbx,%rdi + e8 62 7e 13 00 callq ffffffff81156810 <trace_seq_printf> - 8b 93 b8 20 00 00 mov 0x20b8(%rbx),%edx - 31 c0 xor %eax,%eax - 85 d2 test %edx,%edx - 75 11 jne ffffffff8101e9c8 <trace_raw_output_x86_irq_vector+0x58> - 48 8b 83 a8 20 00 00 mov 0x20a8(%rbx),%rax - 48 39 83 a0 20 00 00 cmp %rax,0x20a0(%rbx) - 0f 93 c0 setae %al + 48 89 df mov %rbx,%rdi + e8 4a c5 12 00 callq ffffffff8114af00 <trace_handle_return> 5b pop %rbx - 0f b6 c0 movzbl %al,%eax === end === 41 5c pop %r12 5d pop %rbp c3 retq If you notice, the original has 22 bytes of text more than the out of line version. As this is for every TRACE_EVENT() defined in the system, this can become quite large. text data bss dec hex filename 8690305 5450490 1298432 15439227 eb957b vmlinux-orig 8681725 5450490 1298432 15430647 eb73f7 vmlinux-handle This change has a total of 8580 bytes in savings. $ objdump -dr /tmp/vmlinux-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l 324 That's 324 tracepoints. But this does not include modules (which contain many more tracepoints). For an allyesconfig build: $ objdump -dr vmlinux-allyes-orig | grep '^[0-9a-f]* <trace_raw_output' | wc -l 1401 That's 1401 tracepoints giving us: text data bss dec hex filename 137920629 140221067 53264384 331406080 13c0db00 vmlinux-allyes-orig 137827709 140221067 53264384 331313160 13bf7008 vmlinux-allyes-handle 92920 bytes in savings!!! Link: http://lkml.kernel.org/r/20170315021431.13107-2-andi@firstfloor.orgReported-by: NAndi Kleen <andi@firstfloor.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Adding a hook into free_reserve_area() that informs ftrace that boot up init text is being free, lets ftrace safely remove those init functions from its records, which keeps ftrace from trying to modify text that no longer exists. Note, this still does not allow for tracing .init text of modules, as modules require different work for freeing its init code. Link: http://lkml.kernel.org/r/1488502497.7212.24.camel@linux.intel.com Cc: linux-mm@kvack.org Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Requested-by: NTodd Brandt <todd.e.brandt@linux.intel.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Register the function tracer right after the tracing buffers are initialized in early boot up. This will allow function tracing to begin early if it is enabled via the kernel command line. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
As tracing can now be enabled very early in boot up, even before some critical system services (like scheduling), do not run the tracer selftests until after early_initcall() is performed. If a tracer is registered before such time, it is saved off in a list and the test is run when the system is able to handle more diverse functions. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
由 Steven Rostedt (VMware) 提交于
Create an early_trace_init() function that will initialize the buffers and allow for ealier use of trace_printk(). This will also allow for future work to have function tracing start earlier at boot up. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
-
- 17 3月, 2017 1 次提交
-
-
由 Heiko Carstens 提交于
Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems") introduced new functions get/put_online_mems() and mem_hotplug_begin/end() in order to allow similar semantics for memory hotplug like for cpu hotplug. The corresponding functions for cpu hotplug are get/put_online_cpus() and cpu_hotplug_begin/done() for cpu hotplug. The commit however missed to introduce functions that would serialize memory hotplug operations like they are done for cpu hotplug with cpu_maps_update_begin/done(). This basically leaves mem_hotplug.active_writer unprotected and allows concurrent writers to modify it, which may lead to problems as outlined by commit f931ab47 ("mm: fix devm_memremap_pages crash, use mem_hotplug_{begin, done}"). That commit was extended again with commit b5d24fda ("mm, devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin, done}") which serializes memory hotplug operations for some call sites by using the device_hotplug lock. In addition with commit 3fc21924 ("mm: validate device_hotplug is held for memory hotplug") a sanity check was added to mem_hotplug_begin() to verify that the device_hotplug lock is held. This in turn triggers the following warning on s390: WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58 Call Trace: assert_held_device_hotplug+0x40/0x58) mem_hotplug_begin+0x34/0xc8 add_memory_resource+0x7e/0x1f8 add_memory+0xda/0x130 add_memory_merged+0x15c/0x178 sclp_detect_standby_memory+0x2ae/0x2f8 do_one_initcall+0xa2/0x150 kernel_init_freeable+0x228/0x2d8 kernel_init+0x2a/0x140 kernel_thread_starter+0x6/0xc One possible fix would be to add more lock_device_hotplug() and unlock_device_hotplug() calls around each call site of mem_hotplug_begin/end(). But that would give the device_hotplug lock additional semantics it better should not have (serialize memory hotplug operations). Instead add a new memory_add_remove_lock which has the similar semantics like cpu_add_remove_lock for cpu hotplug. To keep things hopefully a bit easier the lock will be locked and unlocked within the mem_hotplug_begin/end() functions. Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.comSigned-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Reported-by: NSebastian Ott <sebott@linux.vnet.ibm.com> Acked-by: NDan Williams <dan.j.williams@intel.com> Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 3月, 2017 11 次提交
-
-
由 Peter Zijlstra 提交于
While going through the event inheritance code Oleg got confused. Add some comments to better explain the silent dissapearance of orphaned events. So what happens is that at perf_event_release_kernel() time; when an event looses its connection to userspace (and ceases to exist from the user's perspective) we can still have an arbitrary amount of inherited copies of the event. We want to synchronously find and remove all these child events. Since that requires a bit of lock juggling, there is the possibility that concurrent clone()s will create new child events. Therefore we first mark the parent event as DEAD, which marks all the extant child events as orphaned. We then avoid copying orphaned events; in order to avoid getting more of them. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
We have ctx->event_list that contains all events; no need to repeatedly iterate the group lists to find them all. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
While hunting for clues to a use-after-free, Oleg spotted that perf_event_init_context() can loose an error value with the result that fork() can succeed even though we did not fully inherit the perf event context. Spotted-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: oleg@redhat.com Cc: stable@vger.kernel.org Fixes: 889ff015 ("perf/core: Split context's event group list into pinned and non-pinned lists") Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Dmitry reported syzcaller tripped a use-after-free in perf_release(). After much puzzlement Oleg spotted the below scenario: Task1 Task2 fork() perf_event_init_task() /* ... */ goto bad_fork_$foo; /* ... */ perf_event_free_task() mutex_lock(ctx->lock) perf_free_event(B) perf_event_release_kernel(A) mutex_lock(A->child_mutex) list_for_each_entry(child, ...) { /* child == B */ ctx = B->ctx; get_ctx(ctx); mutex_unlock(A->child_mutex); mutex_lock(A->child_mutex) list_del_init(B->child_list) mutex_unlock(A->child_mutex) /* ... */ mutex_unlock(ctx->lock); put_ctx() /* >0 */ free_task(); mutex_lock(ctx->lock); mutex_lock(A->child_mutex); /* ... */ mutex_unlock(A->child_mutex); mutex_unlock(ctx->lock) put_ctx() /* 0 */ ctx->task && !TOMBSTONE put_task_struct() /* UAF */ This patch closes the hole by making perf_event_free_task() destroy the task <-> ctx relation such that perf_event_release_kernel() will no longer observe the now dead task. Spotted-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: fweisbec@gmail.com Cc: oleg@redhat.com Cc: stable@vger.kernel.org Fixes: c6e5b732 ("perf: Synchronously clean up child events") Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Steven Rostedt (VMware) 提交于
I was testing Daniel's changes with his test case, and tweaked it a little. Instead of having the runtime equal to the deadline, I increased the deadline ten fold. Daniel's test case had: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ To make it more interesting, I changed it to: attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ The results were rather surprising. The behavior that Daniel's patch was fixing came back. The task started using much more than .1% of the CPU. More like 20%. Looking into this I found that it was due to the dl_entity_overflow() constantly returning true. That's because it uses the relative period against relative runtime vs the absolute deadline against absolute runtime. runtime / (deadline - t) > dl_runtime / dl_period There's even a comment mentioning this, and saying that when relative deadline equals relative period, that the equation is the same as using deadline instead of period. That comment is backwards! What we really want is: runtime / (deadline - t) > dl_runtime / dl_deadline We care about if the runtime can make its deadline, not its period. And then we can say "when the deadline equals the period, the equation is the same as using dl_period instead of dl_deadline". After correcting this, now when the task gets enqueued, it can throttle correctly, and Daniel's fix to the throttling of sleeping deadline tasks works even when the runtime and deadline are not the same. Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
During the activation, CBS checks if it can reuse the current task's runtime and period. If the deadline of the task is in the past, CBS cannot use the runtime, and so it replenishes the task. This rule works fine for implicit deadline tasks (deadline == period), and the CBS was designed for implicit deadline tasks. However, a task with constrained deadline (deadine < period) might be awakened after the deadline, but before the next period. In this case, replenishing the task would allow it to run for runtime / deadline. As in this case deadline < period, CBS enables a task to run for more than the runtime / period. In a very loaded system, this can cause a domino effect, making other tasks miss their deadlines. To avoid this problem, in the activation of a constrained deadline task after the deadline but before the next period, throttle the task and set the replenishing timer to the begin of the next period, unless it is boosted. Reproducer: --------------- %< --------------- int main (int argc, char **argv) { int ret; int flags = 0; unsigned long l = 0; struct timespec ts; struct sched_attr attr; memset(&attr, 0, sizeof(attr)); attr.size = sizeof(attr); attr.sched_policy = SCHED_DEADLINE; attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */ attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */ attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */ ts.tv_sec = 0; ts.tv_nsec = 2000 * 1000; /* 2 ms */ ret = sched_setattr(0, &attr, flags); if (ret < 0) { perror("sched_setattr"); exit(-1); } for(;;) { /* XXX: you may need to adjust the loop */ for (l = 0; l < 150000; l++); /* * The ideia is to go to sleep right before the deadline * and then wake up before the next period to receive * a new replenishment. */ nanosleep(&ts, NULL); } exit(0); } --------------- >% --------------- On my box, this reproducer uses almost 50% of the CPU time, which is obviously wrong for a task with 2/2000 reservation. Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Currently, the replenishment timer is set to fire at the deadline of a task. Although that works for implicit deadline tasks because the deadline is equals to the begin of the next period, that is not correct for constrained deadline tasks (deadline < period). For instance: f.c: --------------- %< --------------- int main (void) { for(;;); } --------------- >% --------------- # gcc -o f f.c # trace-cmd record -e sched:sched_switch \ -e syscalls:sys_exit_sched_setattr \ chrt -d --sched-runtime 490000000 \ --sched-deadline 500000000 \ --sched-period 1000000000 0 ./f # trace-cmd report | grep "{pid of ./f}" After setting parameters, the task is replenished and continue running until being throttled: f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0 The task is throttled after running 492318 ms, as expected: f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0] But then, the task is replenished 500719 ms after the first replenishment: <idle>-0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1] Running for 490277 ms: f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120] Hence, in the first period, the task runs 2 * runtime, and that is a bug. During the first replenishment, the next deadline is set one period away. So the runtime / period starts to be respected. However, as the second replenishment took place in the wrong instant, the next replenishment will also be held in a wrong instant of time. Rather than occurring in the nth period away from the first activation, it is taking place in the (nth period - relative deadline). Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NLuca Abeni <luca.abeni@santannapisa.it> Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Reviewed-by: NJuri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it> Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Niklas Cassel 提交于
We hang if SIGKILL has been sent, but the task is stuck in down_read() (after do_exit()), even though no task is doing down_write() on the rwsem in question: INFO: task libupnp:21868 blocked for more than 120 seconds. libupnp D 0 21868 1 0x08100008 ... Call Trace: __schedule() schedule() __down_read() do_exit() do_group_exit() __wake_up_parent() This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in the following commit: 04cafed7 ("locking/rwsem: Fix down_write_killable()") ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y. Signed-off-by: NNiklas Cassel <niklas.cassel@axis.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: <mhocko@suse.com> Cc: <stable@vger.kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Niklas Cassel <niklass@axis.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: d4799608 ("locking/rwsem: Introduce basis for down_write_killable()") Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Matt Fleming 提交于
'calc_load_update' is accessed without any kind of locking and there's a clear assumption in the code that only a single value is read or written. Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid unintentionally seeing multiple values, or having the load/stores split. Technically the loads in calc_global_*() don't require this since those are the only functions that update 'calc_load_update', but I've added the READ_ONCE() for consistency. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Matt Fleming 提交于
If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to the pending sample window time on exit, setting the next update not one window into the future, but two. This situation on exiting NO_HZ is described by: this_rq->calc_load_update < jiffies < calc_load_update In this scenario, what we should be doing is: this_rq->calc_load_update = calc_load_update [ next window ] But what we actually do is: this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ] This has the effect of delaying load average updates for potentially up to ~9seconds. This can result in huge spikes in the load average values due to per-cpu uninterruptible task counts being out of sync when accumulated across all CPUs. It's safe to update the per-cpu active count if we wake between sample windows because any load that we left in 'calc_load_idle' will have been zero'd when the idle load was folded in calc_global_load(). This issue is easy to reproduce before, commit 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking") just by forking short-lived process pipelines built from ps(1) and grep(1) in a loop. I'm unable to reproduce the spikes after that commit, but the bug still seems to be present from code review. Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Fixes: commit 5167e8d5 ("sched/nohz: Rewrite and fix load-avg computation -- again") Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Wanpeng Li 提交于
The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running on: ------------[ cut here ]------------ WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40 rq->clock_update_flags < RQCF_ACT_SKIP CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24 Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016 Call Trace: <IRQ> dump_stack+0x85/0xc4 __warn+0x172/0x1b0 warn_slowpath_fmt+0xb4/0xf0 ? __warn+0x1b0/0x1b0 ? debug_check_no_locks_freed+0x2c0/0x2c0 ? cpudl_set+0x3d/0x2b0 replenish_dl_entity+0x71e/0xc40 enqueue_task_dl+0x2ea/0x12e0 ? dl_task_timer+0x777/0x990 ? __hrtimer_run_queues+0x270/0xa50 dl_task_timer+0x316/0x990 ? enqueue_task_dl+0x12e0/0x12e0 ? enqueue_task_dl+0x12e0/0x12e0 __hrtimer_run_queues+0x270/0xa50 ? hrtimer_cancel+0x20/0x20 ? hrtimer_interrupt+0x119/0x600 hrtimer_interrupt+0x19c/0x600 ? trace_hardirqs_off+0xd/0x10 local_apic_timer_interrupt+0x74/0xe0 smp_apic_timer_interrupt+0x76/0xa0 apic_timer_interrupt+0x93/0xa0 The DL task will be migrated to a suitable later deadline rq once the DL timer fires and currnet rq is offline. The rq clock of the new rq should be updated. This patch fixes it by updating the rq clock after holding the new rq's rq lock. Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 3月, 2017 3 次提交
-
-
由 Peter Zijlstra 提交于
Thomas spotted that fixup_pi_state_owner() can return errors and we fail to unlock the rt_mutex in that case. Reported-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NDarren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Peter Zijlstra 提交于
While working on the futex code, I stumbled over this potential use-after-free scenario. Dmitry triggered it later with syzkaller. pi_mutex is a pointer into pi_state, which we drop the reference on in unqueue_me_pi(). So any access to that pointer after that is bad. Since other sites already do rt_mutex_unlock() with hb->lock held, see for example futex_lock_pi(), simply move the unlock before unqueue_me_pi(). Reported-by: NDmitry Vyukov <dvyukov@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NDarren Hart <dvhart@linux.intel.com> Cc: juri.lelli@arm.com Cc: bigeasy@linutronix.de Cc: xlpang@redhat.com Cc: rostedt@goodmis.org Cc: mathieu.desnoyers@efficios.com Cc: jdesfossez@efficios.com Cc: dvhart@infradead.org Cc: bristot@redhat.com Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
The setup/remove_state/instance() functions in the hotplug core code are serialized against concurrent CPU hotplug, but unfortunately not serialized against themself. As a consequence a concurrent invocation of these function results in corruption of the callback machinery because two instances try to invoke callbacks on remote cpus at the same time. This results in missing callback invocations and initiator threads waiting forever on the completion. The obvious solution to replace get_cpu_online() with cpu_hotplug_begin() is not possible because at least one callsite calls into these functions from a get_online_cpu() locked region. Extend the protection scope of the cpuhp_state_mutex from solely protecting the state arrays to cover the callback invocation machinery as well. Fixes: 5b7aa87e ("cpu/hotplug: Implement setup/removal interface") Reported-and-tested-by: NBart Van Assche <Bart.VanAssche@sandisk.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: hpa@zytor.com Cc: mingo@kernel.org Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 11 3月, 2017 1 次提交
-
-
由 Thomas Gleixner 提交于
The purgatory code defines global variables which are referenced via a symbol lookup in the kexec code (core and arch). A recent commit addressing sparse warnings made these static and thereby broke kexec_file. Why did this happen? Simply because the whole machinery is undocumented and lacks any form of forward declarations. The variable names are unspecific and lack a prefix, so adding forward declarations creates shadow variables in the core code. Aside of that the code relies on magic constants and duplicate struct definitions with no way to ensure that these things stay in sync. The section placement of the purgatory variables happened by chance and not by design. Unbreak kexec and cleanup the mess: - Add proper forward declarations and document the usage - Use common struct definition - Use the proper common defines instead of magic constants - Add a purgatory_ prefix to have a proper name space - Use ARRAY_SIZE() instead of a homebrewn reimplementation - Add proper sections to the purgatory variables [ From Mike ] Fixes: 72042a8c ("x86/purgatory: Make functions and variables static") Reported-by: NMike Galbraith <<efault@gmx.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Nicholas Mc Guire <der.herr@hofr.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: "Tobin C. Harding" <me@tobin.cc> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 10 3月, 2017 3 次提交
-
-
由 Andrea Arcangeli 提交于
Patch series "userfaultfd non-cooperative further update for 4.11 merge window". Unfortunately I noticed one relevant bug in userfaultfd_exit while doing more testing. I've been doing testing before and this was also tested by kbuild bot and exercised by the selftest, but this bug never reproduced before. I dropped userfaultfd_exit as result. I dropped it because of implementation difficulty in receiving signals in __mmput and because I think -ENOSPC as result from the background UFFDIO_COPY should be enough already. Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit wasn't exercised by the selftest and when I tried to exercise it, after moving it to a more correct place in __mmput where it would make more sense and where the vma list is stable, it resulted in the event_wait_completion in D state. So then I added the second patch to be sure even if we call userfaultfd_event_wait_completion too late during task exit(), we won't risk to generate tasks in D state. The same check exists in handle_userfault() for the same reason, except it makes a difference there, while here is just a robustness check and it's run under WARN_ON_ONCE. While looking at the userfaultfd_event_wait_completion() function I looked back at its callers too while at it and I think it's not ok to stop executing dup_fctx on the fcs list because we relay on userfaultfd_event_wait_completion to execute userfaultfd_ctx_put(fctx->orig) which is paired against userfaultfd_ctx_get(fctx->orig) in dup_userfault just before list_add(fcs). This change only takes care of fctx->orig but this area also needs further review looking for similar problems in fctx->new. The only patch that is urgent is the first because it's an use after free during a SMP race condition that affects all processes if CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably impossible without SLUB poisoning enabled. This patch (of 3): I once reproduced this oops with the userfaultfd selftest, it's not easily reproducible and it requires SLUB poisoning to reproduce. general protection fault: 0000 [#1] SMP Modules linked in: CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014 task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000 RIP: 0010:[<ffffffff81451299>] [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202 RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600 RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000 R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700 FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Call Trace: do_exit+0x297/0xd10 SyS_exit+0x17/0x20 tracesys+0xdd/0xe2 Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 <4c> 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80 RIP [<ffffffff81451299>] userfaultfd_exit+0x29/0xa0 RSP <ffff8801f833fe80> ---[ end trace 9fecd6dcb442846a ]--- In the debugger I located the "mm" pointer in the stack and walking mm->mmap->vm_next through the end shows the vma->vm_next list is fully consistent and it is null terminated list as expected. So this has to be an SMP race condition where userfaultfd_exit was running while the vma list was being modified by another CPU. When userfaultfd_exit() run one of the ->vm_next pointers pointed to SLAB_POISON (RBX is the vma pointer and is 0x6b6b..). The reason is that it's not running in __mmput but while there are still other threads running and it's not holding the mmap_sem (it can't as it has to wait the even to be received by the manager). So this is an use after free that was happening for all processes. One more implementation problem aside from the race condition: userfaultfd_exit has really to check a flag in mm->flags before walking the vma or it's going to slowdown the exit() path for regular tasks. One more implementation problem: at that point signals can't be delivered so it would also create a task in D state if the manager doesn't read the event. The major design issue: it overall looks superfluous as the manager can check for -ENOSPC in the background transfer: if (mmget_not_zero(ctx->mm)) { [..] } else { return -ENOSPC; } It's safer to roll it back and re-introduce it later if at all. [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT] Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Acked-by: NMike Rapoport <rppt@linux.vnet.ibm.com> Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Hillf Danton <hillf.zj@alibaba-inc.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: overide||override While we are here, fix the doubled "address" in the touched line Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt. Also, fix the comment block style in the touched hunks in drivers/media/dvb-frontends/drx39xyj/drx_driver.h. Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Masahiro Yamada 提交于
Fix typos and add the following to the scripts/spelling.txt: disble||disable disbled||disabled I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c untouched. The macro is not referenced at all, but this commit is touching only comment blocks just in case. Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-