- 01 Sep, 2016 2 commits
-
-
Committed by Namhyung Kim
TRACE_BPUTS was not handled, so messages recorded by trace_bputs() were shown with symbol info unnecessarily.

You can see it with the trace_printk sample code:

  # cd /sys/kernel/tracing/
  # echo sys_sync > set_graph_function
  # echo 1 > options/sym-offset
  # echo function_graph > current_tracer

Note that the sys_sync filter is there to prevent recording other functions, and the sym-offset option is needed because the first message is emitted from a module init function, so kallsyms doesn't have the symbol and it would otherwise be omitted from the output.

  # cd ~/build/kernel
  # insmod samples/trace_printk/trace-printk.ko
  # cd -
  # head trace

Before:

  # tracer: function_graph
  #
  # CPU DURATION FUNCTION CALLS
  # | | | | | | |
  1) | /* 0xffffffffa0002000: This is a static string that will use trace_bputs */
  1) | /* This is a dynamic string that will use trace_puts */
  1) | /* trace_printk_irq_work+0x5/0x7b [trace_printk]: (irq) This is a static string that will use trace_bputs */
  1) | /* (irq) This is a dynamic string that will use trace_puts */
  1) | /* (irq) This is a static string that will use trace_bprintk() */
  1) | /* (irq) This is a dynamic string that will use trace_printk */

After:

  # tracer: function_graph
  #
  # CPU DURATION FUNCTION CALLS
  # | | | | | | |
  1) | /* This is a static string that will use trace_bputs */
  1) | /* This is a dynamic string that will use trace_puts */
  1) | /* (irq) This is a static string that will use trace_bputs */
  1) | /* (irq) This is a dynamic string that will use trace_puts */
  1) | /* (irq) This is a static string that will use trace_bprintk() */
  1) | /* (irq) This is a dynamic string that will use trace_printk */

Link: http://lkml.kernel.org/r/20160901024354.13720-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
Committed by Dmitry Safonov
It's useless.

Before:

  [tracing]# echo 'p:test /a:0x0' >> uprobe_events
  [tracing]# echo 'p:test a:0x0' >> uprobe_events
  -bash: echo: write error: No such file or directory
  [tracing]# echo 'p:test 1:0x0' >> uprobe_events
  -bash: echo: write error: Invalid argument

After:

  [tracing]# echo 'p:test 1:0x0' >> uprobe_events
  -bash: echo: write error: No such file or directory

Link: http://lkml.kernel.org/r/20160825152110.25663-3-dsafonov@virtuozzo.com
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Dmitry Safonov <dsafonov@virtuozzo.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
- 27 Aug, 2016 2 commits
-
-
Committed by Nicolas Iooss
Commit bbeddf52 ("printk: move braille console support into separate braille.[ch] files") moved the parsing of braille-related options into _braille_console_setup(), changing the type of variable str from char* to char**. In this commit, memcmp(str, "brl,", 4) was correctly updated to memcmp(*str, "brl,", 4) but not memcmp(str, "brl=", 4).

Update the code to make the "brl=" option work again, and replace memcmp() with strncmp() so that the compiler can detect such an issue.

Fixes: bbeddf52 ("printk: move braille console support into separate braille.[ch] files")
Link: http://lkml.kernel.org/r/20160823165700.28952-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
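The point about strncmp() can be illustrated with a small userspace sketch. This is not the kernel's _braille_console_setup(); parse_brl_option() and its return values are hypothetical names chosen for illustration only. The sketch assumes, as the commit message states, that the option string is reached through a char **.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical stand-in for the option parser (not the kernel function).
 * `str` is a char **, as described in the commit message.
 * Returns 1 for a "brl," prefix, 2 for a "brl=" prefix, 0 otherwise.
 */
int parse_brl_option(char **str)
{
	/*
	 * memcmp() takes const void *, so passing `str` instead of `*str`
	 * compiles silently and compares the bytes of the pointer itself.
	 * strncmp() takes const char *, so the same slip triggers an
	 * incompatible-pointer-type diagnostic from the compiler.
	 */
	if (!strncmp(*str, "brl,", 4)) {
		*str += 4;	/* skip the prefix */
		return 1;
	}
	if (!strncmp(*str, "brl=", 4)) {
		*str += 4;	/* skip the prefix */
		return 2;
	}
	return 0;
}
```

Replacing `*str` with `str` in either strncmp() call above should make a typical compiler complain, which is the safety net the commit message refers to.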
-
We have scripts which write to certain fields on 3.18 kernels, but this seems to be failing on 4.4 kernels. An entry which we write to here is xfrm_aevent_rseqth, which is u32.

  echo 4294967295 > /proc/sys/net/core/xfrm_aevent_rseqth

Commit 230633d1 ("kernel/sysctl.c: detect overflows when converting to int") prevented writing to sysctl entries when integer overflow occurs. However, this does not apply to unsigned integers.

Heinrich suggested that we introduce a new option to handle 64 bit limits and set min as 0 and max as UINT_MAX. This might not work as it leads to issues similar to __do_proc_doulongvec_minmax. Alternatively, we would need to change the datatype of the entry to 64 bit.

  static int __do_proc_doulongvec_minmax(void *data, struct ctl_table
  {
      i = (unsigned long *) data; // This cast causes a read beyond the size of data (u32)
      vleft = table->maxlen / sizeof(unsigned long); // vleft is 0 because maxlen is sizeof(u32), which is smaller than sizeof(unsigned long) on x86_64.

Introduce a new proc handler proc_douintvec. Individual proc entries will need to be updated to use the new handler.

[akpm@linux-foundation.org: coding-style fixes]
Fixes: 230633d1 ("kernel/sysctl.c:detect overflows when converting to int")
Link: http://lkml.kernel.org/r/1471479806-5252-1-git-send-email-subashab@codeaurora.org
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
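The difference between a signed and an unsigned range check can be sketched in userspace C. This is not the kernel's proc_douintvec implementation; parse_uint() is a hypothetical helper, and the sketch assumes a 32-bit unsigned int so that 4294967295 is exactly UINT_MAX.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not a kernel API): parse a decimal string into an
 * unsigned int. Values up to UINT_MAX are accepted, which a check against
 * INT_MAX would wrongly reject. Returns 0 on success, -EINVAL on failure.
 */
int parse_uint(const char *s, unsigned int *out)
{
	char *end;
	unsigned long long v;

	/* Reject empty input, signs, and leading whitespace up front. */
	if (*s < '0' || *s > '9')
		return -EINVAL;

	errno = 0;
	v = strtoull(s, &end, 10);
	if (errno || *end != '\0' || v > UINT_MAX)
		return -EINVAL;

	*out = (unsigned int)v;
	return 0;
}
```

With this check, the value written in the commit message (4294967295) is accepted, while anything that does not fit in an unsigned int is still refused.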
-
- 24 Aug, 2016 3 commits
-
-
Committed by Will Deacon
When tearing down an AUX buf for an event via perf_mmap_close(), __perf_event_output_stop() is called on the event's CPU to ensure that trace generation is halted before the process of unmapping and freeing the buffer pages begins.

The callback is performed via cpu_function_call(), which ensures that it runs with interrupts disabled and is therefore not preemptible. Unfortunately, the current code grabs the per-cpu context pointer using get_cpu_ptr(), which unnecessarily disables preemption and doesn't pair the call with put_cpu_ptr(), leading to a preempt_count() imbalance and a BUG when freeing the AUX buffer later on:

  WARNING: CPU: 1 PID: 2249 at kernel/events/ring_buffer.c:539 __rb_free_aux+0x10c/0x120
  Modules linked in: [...]
  Call Trace:
   [<ffffffff813379dd>] dump_stack+0x4f/0x72
   [<ffffffff81059ff6>] __warn+0xc6/0xe0
   [<ffffffff8105a0c8>] warn_slowpath_null+0x18/0x20
   [<ffffffff8112761c>] __rb_free_aux+0x10c/0x120
   [<ffffffff81128163>] rb_free_aux+0x13/0x20
   [<ffffffff8112515e>] perf_mmap_close+0x29e/0x2f0
   [<ffffffff8111da30>] ? perf_iterate_ctx+0xe0/0xe0
   [<ffffffff8115f685>] remove_vma+0x25/0x60
   [<ffffffff81161796>] exit_mmap+0x106/0x140
   [<ffffffff8105725c>] mmput+0x1c/0xd0
   [<ffffffff8105cac3>] do_exit+0x253/0xbf0
   [<ffffffff8105e32e>] do_group_exit+0x3e/0xb0
   [<ffffffff81068d49>] get_signal+0x249/0x640
   [<ffffffff8101c273>] do_signal+0x23/0x640
   [<ffffffff81905f42>] ? _raw_write_unlock_irq+0x12/0x30
   [<ffffffff81905f69>] ? _raw_spin_unlock_irq+0x9/0x10
   [<ffffffff81901896>] ? __schedule+0x2c6/0x710
   [<ffffffff810022a4>] exit_to_usermode_loop+0x74/0x90
   [<ffffffff81002a56>] prepare_exit_to_usermode+0x26/0x30
   [<ffffffff81906d1b>] retint_user+0x8/0x10

This patch uses this_cpu_ptr() instead of get_cpu_ptr(), since preemption is already disabled by the caller.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 95ff4ca2 ("perf/core: Free AUX pages in unmap path")
Link: http://lkml.kernel.org/r/20160824091905.GA16944@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by John Stultz
It was reported that hibernation could fail on the 2nd attempt, where the system hangs at hibernate() -> syscore_resume() -> i8237A_resume() -> claim_dma_lock(), because the lock has already been taken. However, no other process actually tries to grab this lock on the problematic platform.

Further investigation showed that the problem is triggered by setting /sys/power/pm_trace to 1 before the 1st hibernation.

Once pm_trace is enabled, the RTC becomes meaningless after suspend, and some BIOSes adjust the 'invalid' RTC (e.g., earlier than 1970) to the release date of the motherboard during the POST stage. After resume, it may therefore seem that the system slept for a very long time, which is a completely meaningless value.

Then in timekeeping_resume -> tk_debug_account_sleep_time, if bit 31 of the sleep time happens to be set to 1, fls() returns 32 and we add 1 to sleep_time_bin[32], which causes an out-of-bounds array access and therefore memory being overwritten.

As depicted by System.map:

  0xffffffff81c9d080 b sleep_time_bin
  0xffffffff81c9d100 B dma_spin_lock

the dma_spin_lock.val is set to 1, which caused this problem.

This patch adds a sanity check in tk_debug_account_sleep_time() to ensure we don't index past the sleep_time_bin array.

[jstultz: Problem diagnosed and original patch by Chen Yu, I've solved the issue slightly differently, but borrowed his excellent explanation of the issue here.]

Fixes: 5c83545f "power: Add option to log time spent in suspend"
Reported-by: Janek Kozicki <cosurgi@gmail.com>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Xunlei Pang <xpang@redhat.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: stable <stable@vger.kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/1471993702-29148-3-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
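The out-of-bounds index described above can be sketched in userspace C. This is an illustration under stated assumptions, not the kernel patch itself: NUM_BINS, fls_sketch() and sleep_time_bin_index() are hypothetical names, and the 32-entry array size is inferred from the commit message saying that index 32 is out of bounds.

```c
#include <assert.h>

#define NUM_BINS 32	/* assumed size: the commit says index 32 is out of bounds */

/*
 * Portable stand-in for the kernel's fls(): 1-based position of the
 * highest set bit, or 0 when no bit is set.
 */
int fls_sketch(unsigned int x)
{
	int pos = 0;

	while (x) {
		pos++;
		x >>= 1;
	}
	return pos;
}

/*
 * Map a sleep time in seconds to a histogram bin. Without the clamp, a
 * value with bit 31 set yields bin 32, one past the end of the array.
 */
int sleep_time_bin_index(unsigned int secs)
{
	int bin = fls_sketch(secs);

	if (bin > NUM_BINS - 1)
		bin = NUM_BINS - 1;	/* sanity check: stay inside the array */
	return bin;
}
```

Clamping folds the bogus "very long sleep" into the last bin instead of writing to whatever symbol happens to follow the array in memory.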
-
Committed by John Stultz
When I added some extra sanity checking in timekeeping_get_ns() under CONFIG_DEBUG_TIMEKEEPING, I missed that the NMI-safe __ktime_get_fast_ns() method was using timekeeping_get_ns().

Thus the locking added to the debug checks broke the NMI-safety of __ktime_get_fast_ns().

This patch open-codes the timekeeping_get_ns() logic for __ktime_get_fast_ns(), so we can avoid any deadlocks in NMI.

Fixes: 4ca22c26 "timekeeping: Add warnings when overflows or underflows are observed"
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1471993702-29148-2-git-send-email-john.stultz@linaro.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
- 22 Aug, 2016 2 commits
-
-
Committed by Christoph Hellwig
Without locking out CPU mask operations we might end up with an inconsistent view of the cpumask in the function.

Fixes: 5e385a6e: "genirq: Add a helper to spread an affinity mask for MSI/MSI-X vectors"
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1470924405-25728-1-git-send-email-hch@lst.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
Committed by Shawn Lin
Obviously we should free action here if irq_chip_pm_get() failed.

Fixes: be45beb2: "genirq: Add runtime power management support for IRQ chips"
Signed-off-by: Shawn Lin <shawn.lin@rock-chips.com>
Cc: Jon Hunter <jonathanh@nvidia.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Link: http://lkml.kernel.org/r/1471854112-13006-1-git-send-email-shawn.lin@rock-chips.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
- 18 Aug, 2016 8 commits
-
-
Committed by Wanpeng Li
Commit:

  57430218 ("sched/cputime: Count actually elapsed irq & softirq time")

... fixed a bug but also triggered a regression:

On an i5 laptop with 4 pCPUs and 4 vCPUs for one full dynticks guest, there are four CPU hog processes (for loop) running in the guest. I hot-unplug the pCPUs on the host one by one until there is only one left, then observe CPU utilization via 'top' in the guest. It shows:

  100% st for cpu0 (housekeeping)
  75% st for other CPUs (nohz full mode)

However, w/o this commit it shows the correct 75% for all four CPUs.

When a guest is interrupted for a longer amount of time, missed clock ticks are not redelivered later. Because of that, we should not limit the amount of steal time accounted to the amount of time that the calling functions think has passed.

However, the interval returned by account_other_time() is NOT rounded down to the nearest jiffy, while the base interval in get_vtime_delta() it is subtracted from is, so the max cputime limit is required to avoid underflow.

This patch fixes the regression by limiting the account_other_time() from get_vtime_delta() to avoid underflow, and lets the other three call sites (in account_other_time() and steal_account_process_time()) account however much steal time the host told us elapsed.

Suggested-by: Rik van Riel <riel@redhat.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/1471399546-4069-1-git-send-email-wanpeng.li@hotmail.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
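Why the limit matters in get_vtime_delta() can be shown with unsigned arithmetic. The following is only a sketch of the underflow guard, not the scheduler code: vtime_delta_after_other() is a hypothetical name, and plain 64-bit counters stand in for the kernel's cputime types.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical illustration: subtract "other" time (steal, irq, ...) from
 * a base interval. Because the base interval is rounded down and the other
 * time is not, `other` may exceed `delta`; capping it first keeps the
 * unsigned subtraction from wrapping around.
 */
uint64_t vtime_delta_after_other(uint64_t delta, uint64_t other)
{
	uint64_t capped = other < delta ? other : delta;

	return delta - capped;
}
```

Call sites that only accumulate steal time, and never subtract it from a rounded interval, have no such wrap-around risk, which is why the commit message leaves them unlimited.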
-
Committed by Peter Zijlstra
Mike reports:

  Roughly 10% of the time, ltp testcase getrusage04 fails:
  getrusage04 0 TINFO : Expected timers granularity is 4000 us
  getrusage04 0 TINFO : Using 1 as multiply factor for max [us]time increment (1000+4000us)!
  getrusage04 0 TINFO : utime: 0us; stime: 179us
  getrusage04 0 TINFO : utime: 3751us; stime: 0us
  getrusage04 1 TFAIL : getrusage04.c:133: stime increased > 5000us:

And tracked it down to the case where the task simply doesn't get _any_ [us]time ticks.

Update the code to assume all rtime is utime when we lack information, thus ensuring a task that elides the tick gets time accounted.

Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fredrik Markstrom <fredrik.markstrom@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org # 4.3+
Fixes: 9d7fb042 ("sched/cputime: Guarantee stime + utime == rtime")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
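The rule "assume all rtime is utime when we lack information" can be sketched as a small splitting function. This is a simplified illustration, not the kernel's cputime_adjust(): split_rtime() is a hypothetical name, the proportional split uses plain integer arithmetic and ignores multiplication overflow, and the monotonicity handling of the real code is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: split the precise runtime `rtime` into system and
 * user parts using tick-based samples. The invariant stime + utime == rtime
 * always holds on return.
 */
void split_rtime(uint64_t rtime, uint64_t stime_ticks, uint64_t utime_ticks,
		 uint64_t *stime, uint64_t *utime)
{
	if (stime_ticks == 0) {
		/* No system ticks (or no ticks at all): treat it all as user time. */
		*stime = 0;
		*utime = rtime;
		return;
	}
	if (utime_ticks == 0) {
		/* Only system ticks were observed. */
		*stime = rtime;
		*utime = 0;
		return;
	}
	/* Both observed: scale by the sampled ratio (overflow ignored here). */
	*stime = rtime * stime_ticks / (stime_ticks + utime_ticks);
	*utime = rtime - *stime;
}
```

A task that never receives a tick then reports its whole runtime as user time, rather than having it attributed to system time.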
-
Committed by David Carrillo-Cisneros
The call to smp_call_function_single() in perf_event_read() may fail if an invalid or offline CPU index is passed. Warn the user if such a bug is present and return an error.

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1471467307-61171-2-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Mathieu Poirier
At this time the perf_addr_filter_needs_mmap() function will _not_ return true on a user space 'stop' filter. But stop filters need exactly the same kind of mapping that range and start filters get.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-4-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Mathieu Poirier
Function perf_event_mmap() is called by the MM subsystem each time part of a binary is loaded in memory. There can be several mappings for a binary, many of them unrelated to the code section.

Each time a section of a binary is mapped, address filters are updated, even when the map doesn't pertain to the code section. The end result is that filters are configured based on the last map event that was received rather than the last mapping of the code segment.

For example, suppose we have an executable 'main' that calls library 'libcstest.so.1.0', and we want to collect traces on code that is in that library. The perf command line for this scenario would be:

  perf record -e cs_etm// --filter 'filter 0x72c/0x40@/opt/lib/libcstest.so.1.0' --per-thread ./main

Resulting in binaries being mapped this way:

  root@linaro-nano:~# cat /proc/1950/maps
  00400000-00401000 r-xp 00000000 08:02 33169 /home/linaro/main
  00410000-00411000 r--p 00000000 08:02 33169 /home/linaro/main
  00411000-00412000 rw-p 00001000 08:02 33169 /home/linaro/main
  7fa2464000-7fa2474000 rw-p 00000000 00:00 0
  7fa2474000-7fa25a4000 r-xp 00000000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25a4000-7fa25b3000 ---p 00130000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b3000-7fa25b7000 r--p 0012f000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b7000-7fa25b9000 rw-p 00133000 08:02 543 /lib/aarch64-linux-gnu/libc-2.21.so
  7fa25b9000-7fa25bd000 rw-p 00000000 00:00 0
  7fa25bd000-7fa25be000 r-xp 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
  7fa25be000-7fa25cd000 ---p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
  7fa25cd000-7fa25ce000 r--p 00000000 08:02 38308 /opt/lib/libcstest.so.1.0
  7fa25ce000-7fa25cf000 rw-p 00001000 08:02 38308 /opt/lib/libcstest.so.1.0
  7fa25cf000-7fa25eb000 r-xp 00000000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
  7fa25ef000-7fa25f2000 rw-p 00000000 00:00 0
  7fa25f7000-7fa25f9000 rw-p 00000000 00:00 0
  7fa25f9000-7fa25fa000 r--p 00000000 00:00 0 [vvar]
  7fa25fa000-7fa25fb000 r-xp 00000000 00:00 0 [vdso]
  7fa25fb000-7fa25fc000 r--p 0001c000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
  7fa25fc000-7fa25fe000 rw-p 0001d000 08:02 574 /lib/aarch64-linux-gnu/ld-2.21.so
  7ff2ea8000-7ff2ec9000 rw-p 00000000 00:00 0 [stack]
  root@linaro-nano:~#

Before 'main()' can execute, 'libcstest.so.1.0' has to be loaded in memory. Once that has been done, perf_event_mmap() has been called 4 times, with the last map starting at address 0x7fa25ce000 and the address filter configured to start filtering when the IP has passed over address 0x7fa25ce72c (0x7fa25ce000 + 0x72c).

But that is wrong, since the code segment for library 'libcstest.so.1.0' has been mapped at 0x7fa25bd000, resulting in traces not being collected.

This patch corrects the situation by requesting that address filters be updated only if the mapped event is for a code segment.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-3-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
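The address arithmetic in this commit message can be replayed with a minimal userspace sketch. It is not the perf core code: struct mapping, struct addr_filter, MAP_EXEC_SKETCH and filter_update() are hypothetical names, and the sketch assumes that "code segment" simply means a mapping with the executable permission set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAP_EXEC_SKETCH 0x4u	/* hypothetical "executable mapping" flag */

struct mapping {
	uint64_t start;		/* start address of the mapping */
	unsigned int flags;	/* permission bits of the mapping */
};

struct addr_filter {
	uint64_t offset;	/* offset into the binary, e.g. 0x72c */
	uint64_t start;		/* resolved virtual address to filter on */
};

/*
 * Update the resolved filter address only for executable mappings, so a
 * later data mapping of the same file cannot overwrite it.
 * Returns true when the filter was updated.
 */
bool filter_update(struct addr_filter *f, const struct mapping *m)
{
	if (!(m->flags & MAP_EXEC_SKETCH))
		return false;

	f->start = m->start + f->offset;
	return true;
}
```

Feeding the two libcstest.so.1.0 mappings from the listing through filter_update() in order leaves the filter at 0x7fa25bd72c; without the permission check, the final rw-p mapping would move it to 0x7fa25ce72c.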
-
Committed by Mathieu Poirier
Binary file names have to be supplied for both range and start/stop filters, but the current code only processes the filename if an address range filter is specified. This code adds processing of the filename for start/stop filters.

Signed-off-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1468860187-318-2-git-send-email-mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Peter Zijlstra
Vincent reported triggering the WARN_ON_ONCE() in event_function_local().

While thinking through cases I noticed that by using event_function() directly, we miss the inactive case usually handled by event_function_call().

Therefore construct a blend of event_function_call() and event_function() that handles the cases relevant to event_function_local().

Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # 4.5+
Fixes: fae3fde6 ("perf: Collapse and fix event_function_call() users")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Oleg Nesterov
__replace_page() wrongly calls mem_cgroup_cancel_charge() in the "success" path; it should only do this if page_check_address() fails.

This means that every enable/disable leads to an unbalanced mem_cgroup_uncharge() from put_page(old_page), so it is trivial to underflow the page_counter->count and trigger OOM.

Reported-and-tested-by: Brenden Blanco <bblanco@plumgrid.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
Cc: stable@vger.kernel.org # 3.17+
Fixes: 00501b53 ("mm: memcontrol: rewrite charge API")
Link: http://lkml.kernel.org/r/20160817153629.GB29724@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 17 Aug, 2016 1 commit
-
-
Committed by Marc Zyngier
Commit 1e2a7d78 ("irqdomain: Don't set type when mapping an IRQ") moved the trigger configuration call from the irqdomain mapping to the interrupt being actually requested.

This patch failed to handle the case where we configure a chained interrupt, which doesn't get requested through the usual path.

In order to solve this, let's call __irq_set_trigger just before starting the cascade interrupt. Special care must be taken to make the flow handler stick, as the .irq_set_type method could have reset it (it doesn't know we're dealing with a chained interrupt).

Based on an initial patch by Jon Hunter.

Fixes: 1e2a7d78 ("irqdomain: Don't set type when mapping an IRQ")
Reported-by: John Stultz <john.stultz@linaro.org>
Reported-by: Linus Walleij <linus.walleij@linaro.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Acked-by: Jon Hunter <jonathanh@nvidia.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
-
- 16 Aug, 2016 2 commits
-
-
Committed by Adrian Hunter
Commit 288dab8a ("block: add a separate operation type for secure erase") split REQ_OP_SECURE_ERASE from REQ_OP_DISCARD without considering all the places REQ_OP_DISCARD was being used to mean either. Fix those.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Fixes: 288dab8a ("block: add a separate operation type for secure erase")
Signed-off-by: Jens Axboe <axboe@fb.com>
-
Committed by James Morse
rtree_next_node() walks the linked list of leaf nodes to find the next block of pages in the struct memory_bitmap. If it walks off the end of the list of nodes, it walks the list of memory zones to find the next region of memory. If it walks off the end of the list of zones, it returns false.

This leaves the struct bm_position's node and zone pointers pointing at their respective struct list_heads in struct mem_zone_bm_rtree.

memory_bm_find_bit() uses struct bm_position's node and zone pointers to avoid walking lists and trees if the next bit appears in the same node/zone. It handles these values being stale.

Swap rtree_next_node()'s 'step then test' to 'test-next then step'. This means that if we reach the end of memory we return false and leave the node and zone pointers as they were.

This fixes a panic on resume using AMD Seattle with 64K pages:

  [ 6.868732] Freezing user space processes ... (elapsed 0.000 seconds) done.
  [ 6.875753] Double checking all user space processes after OOM killer disable... (elapsed 0.000 seconds)
  [ 6.896453] PM: Using 3 thread(s) for decompression.
  [ 6.896453] PM: Loading and decompressing image data (5339 pages)...
  [ 7.318890] PM: Image loading progress: 0%
  [ 7.323395] Unable to handle kernel paging request at virtual address 00800040
  [ 7.330611] pgd = ffff000008df0000
  [ 7.334003] [00800040] *pgd=00000083fffe0003, *pud=00000083fffe0003, *pmd=00000083fffd0003, *pte=0000000000000000
  [ 7.344266] Internal error: Oops: 96000005 [#1] PREEMPT SMP
  [ 7.349825] Modules linked in:
  [ 7.352871] CPU: 2 PID: 1 Comm: swapper/0 Tainted: G W I 4.8.0-rc1 #4737
  [ 7.360512] Hardware name: AMD Overdrive/Supercharger/Default string, BIOS ROD1002C 04/08/2016
  [ 7.369109] task: ffff8003c0220000 task.stack: ffff8003c0280000
  [ 7.375020] PC is at set_bit+0x18/0x30
  [ 7.378758] LR is at memory_bm_set_bit+0x24/0x30
  [ 7.383362] pc : [<ffff00000835bbc8>] lr : [<ffff0000080faf18>] pstate: 60000045
  [ 7.390743] sp : ffff8003c0283b00
  [ 7.473551]
  [ 7.475031] Process swapper/0 (pid: 1, stack limit = 0xffff8003c0280020)
  [ 7.481718] Stack: (0xffff8003c0283b00 to 0xffff8003c0284000)
  [ 7.800075] Call trace:
  [ 7.887097] [<ffff00000835bbc8>] set_bit+0x18/0x30
  [ 7.891876] [<ffff0000080fb038>] duplicate_memory_bitmap.constprop.38+0x54/0x70
  [ 7.899172] [<ffff0000080fcc40>] snapshot_write_next+0x22c/0x47c
  [ 7.905166] [<ffff0000080fe1b4>] load_image_lzo+0x754/0xa88
  [ 7.910725] [<ffff0000080ff0a8>] swsusp_read+0x144/0x230
  [ 7.916025] [<ffff0000080fa338>] load_image_and_restore+0x58/0x90
  [ 7.922105] [<ffff0000080fa660>] software_resume+0x2f0/0x338
  [ 7.927752] [<ffff000008083350>] do_one_initcall+0x38/0x11c
  [ 7.933314] [<ffff000008b40cc0>] kernel_init_freeable+0x14c/0x1ec
  [ 7.939395] [<ffff0000087ce564>] kernel_init+0x10/0xfc
  [ 7.944520] [<ffff000008082e90>] ret_from_fork+0x10/0x40
  [ 7.949820] Code: d2800022 8b400c21 f9800031 9ac32043 (c85f7c22)
  [ 7.955909] ---[ end trace 0024a5986e6ff323 ]---
  [ 7.960529] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b

Here struct mem_zone_bm_rtree's start_pfn has been returned instead of struct rtree_node's addr, as the node/zone pointers are corrupt after we walked off the end of the lists during mark_unsafe_pages().

This behaviour was exposed by commit 6dbecfd3 ("PM / hibernate: Simplify mark_unsafe_pages()"), which caused mark_unsafe_pages() to call duplicate_memory_bitmap(), which uses memory_bm_find_bit() after walking off the end of the memory bitmap.

Fixes: 3a20cb17 (PM / Hibernate: Implement position keeping in radix tree)
Signed-off-by: James Morse <james.morse@arm.com>
[ rjw: Subject ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
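The 'test-next then step' idea can be sketched on a much simpler structure. This is not the hibernation code: it assumes a NULL-terminated singly linked list instead of the kernel's circular list_head, and struct node, struct cursor and cursor_next() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;	/* NULL marks the end of the list */
	int value;
};

struct cursor {
	struct node *cur;	/* cached position, always a real node */
};

/*
 * Advance the cursor only when a successor exists. On failure the cursor
 * still points at the last valid node, so later users of the cached
 * position never see a pointer that has walked off the end of the list.
 */
bool cursor_next(struct cursor *c)
{
	if (c->cur->next == NULL)	/* test-next ... */
		return false;

	c->cur = c->cur->next;		/* ... then step */
	return true;
}
```

A 'step then test' variant would assign the successor first and only then notice it is not a real node, leaving the cached position invalid, which is the situation the commit message describes.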
-
- 13 Aug, 2016 2 commits
-
-
Committed by Daniel Borkmann
While hashing out BPF's current_task_under_cgroup helper bits, it came to discussion that the skb_in_cgroup helper name was suboptimally chosen.

Tejun says:

  So, I think in_cgroup should mean that the object is in that particular cgroup while under_cgroup in the subhierarchy of that cgroup. Let's rename the other subhierarchy test to under too. I think that'd be a lot less confusing going forward.

  [...]

  It's more intuitive and gives us the room to implement the real "in" test if ever necessary in the future.

Since this touches uapi bits, we need to change this as long as v4.8 is not yet officially released. Thus, change the helper enum and rename related bits.

Fixes: 4a482f34 ("cgroup: bpf: Add bpf_skb_in_cgroup_proto")
Reference: http://patchwork.ozlabs.org/patch/658500/
Suggested-by: Sargun Dhillon <sargun@sargun.me>
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
-
Committed by Thomas Garnier
Restore the processor state before calling any other functions to ensure per-CPU variables can be used with KASLR memory randomization.

Tracing functions use per-CPU variables (GS based on x86) and one was called just before restoring the processor state fully. It resulted in a double fault when both the tracing & the exception handler functions tried to use a per-CPU variable.

Fixes: bb3632c6 (PM / sleep: trace events for suspend/resume)
Reported-and-tested-by: Borislav Petkov <bp@suse.de>
Reported-by: Jiri Kosina <jikos@kernel.org>
Tested-by: Rafael J. Wysocki <rafael@kernel.org>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Garnier <thgarnie@google.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 11 Aug, 2016 2 commits
-
-
Committed by Frederic Weisbecker
Commit:

  f9bcf1e0 ("sched/cputime: Fix steal time accounting")

... fixes a leak on steal time accounting but forgets to account the ticks passed in parameters, assuming there is only one to take into account.

Let's take that parameter into account again.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Acked-by: Wanpeng Li <kernellwp@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: linux-tip-commits@vger.kernel.org
Link: http://lkml.kernel.org/r/20160811125822.GB4214@lerouge
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Wanpeng Li
Commit:

  57430218 ("sched/cputime: Count actually elapsed irq & softirq time")

... didn't take steal time into consideration when the noirqtime kernel parameter is passed.

As Paolo pointed out before:

| Why not? If idle=poll, for example, any time the guest is suspended (and
| thus cannot poll) does count as stolen time.

This patch fixes it by subtracting steal time from idle time accounting when the noirqtime parameter is true. The average idle time drops from 56.8% to 54.75% for a nohz idle kvm guest (noirqtime, idle=poll, four vCPUs running on one pCPU).

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim <rkrcmar@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470893795-3527-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
- 10 Aug, 2016 9 commits
-
-
Committed by Pan Xinhui
It's obviously wrong to set stat to NULL. So let's remove it. Otherwise it is always zero when we check the latency of kick/wake.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Waiman Long <Waiman.Long@hpe.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Wanpeng Li
When the lock holder vCPU is racing with the queue head:

   CPU 0 (lock holder)                  CPU1 (queue head)
   ===================                  =================
   spin_lock();                         spin_lock();
    pv_kick_node():                      pv_wait_head_or_lock():
                                          if (!lp) {
                                           lp = pv_hash(lock, pn);
                                           xchg(&l->locked, _Q_SLOW_VAL);
                                          }
                                          WRITE_ONCE(pn->state, vcpu_halted);
     cmpxchg(&pn->state,
      vcpu_halted, vcpu_hashed);
     WRITE_ONCE(l->locked, _Q_SLOW_VAL);
     (void)pv_hash(lock, pn);

In this case, the lock holder inserts the pv_node of the queue head into the hash table and sets _Q_SLOW_VAL unnecessarily. This patch avoids that by restoring/setting the vcpu_hashed state after adaptive lock spinning fails.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hpe.com>
Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Wanpeng Li
The following warning can be triggered by hot-unplugging the CPU on which an active SCHED_DEADLINE task is running:

  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3531 lock_release+0x690/0x6a0
  releasing a pinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   ? dl_task_timer+0x1a1/0x2b0
   warn_slowpath_fmt+0x4f/0x60
   ? sched_clock+0x13/0x20
   lock_release+0x690/0x6a0
   ? enqueue_pushable_dl_task+0x9b/0xa0
   ? enqueue_task_dl+0x1ca/0x480
   _raw_spin_unlock+0x1f/0x40
   dl_task_timer+0x1a1/0x2b0
   ? push_dl_task.part.31+0x190/0x190

  WARNING: CPU: 0 PID: 0 at kernel/locking/lockdep.c:3649 lock_unpin_lock+0x181/0x1a0
  unpinning an unpinned lock
  Call Trace:
   dump_stack+0x99/0xd0
   __warn+0xd1/0xf0
   warn_slowpath_fmt+0x4f/0x60
   lock_unpin_lock+0x181/0x1a0
   dl_task_timer+0x127/0x2b0
   ? push_dl_task.part.31+0x190/0x190

As per the comment before this code, it's safe to drop the RQ lock here, and since we (potentially) change rq, unpin and repin to avoid the splat.

Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[ Rewrote changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luca Abeni <luca.abeni@unitn.it>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470274940-17976-1-git-send-email-wanpeng.li@hotmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Giovanni Gherdovich
Commit:

  6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency")

fixed a problem whereby clock_nanosleep() followed by clock_gettime() could allow a task to wake early. It addressed the problem by calling the scheduling class's update_curr() when the cputimer starts.

Said change induced a considerable performance regression on the syscalls times() and clock_gettime(CLOCK_PROCESS_CPUTIME_ID). There are some debuggers and applications that monitor their own performance that accidentally depend on the performance of these specific calls.

This patch mitigates the performance loss by prefetching data into the CPU cache, as stalls due to cache misses appear to be where most time is spent in our benchmarks.

Here are the performance gains of this patch over v4.7-rc7 on a Sandy Bridge box with 32 logical cores and 2 NUMA nodes. The test is repeated with a variable number of threads, from 2 to 4*num_cpus; the results are in seconds and correspond to the average of 10 runs; the percentage gain is computed with (before-after)/before so a positive value is an improvement (it's faster). The improvement varies between a few percent for 5-20 threads and more than 10% for 2 or >20 threads.

pound_clock_gettime:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.48        3.06 ( 11.83%)
      5           3.33        3.25 (  2.40%)
      8           3.37        3.26 (  3.30%)
     12           3.32        3.37 ( -1.60%)
     21           4.01        3.90 (  2.74%)
     30           3.63        3.36 (  7.41%)
     48           3.71        3.11 ( 16.27%)
     79           3.75        3.16 ( 15.74%)
    110           3.81        3.25 ( 14.80%)
    128           3.88        3.31 ( 14.76%)

pound_times:

    threads       4.7-rc7     patched 4.7-rc7
    [num]         [secs]      [secs (percent)]
      2           3.65        3.25 ( 11.03%)
      5           3.45        3.17 (  7.92%)
      8           3.52        3.22 (  8.69%)
     12           3.29        3.36 ( -2.04%)
     21           4.07        3.92 (  3.78%)
     30           3.87        3.40 ( 12.17%)
     48           3.79        3.16 ( 16.61%)
     79           3.88        3.28 ( 15.42%)
    110           3.90        3.38 ( 13.35%)
    128           4.00        3.38 ( 15.45%)

pound_times and pound_clock_gettime are two benchmarks included in the MMTests framework.
They launch a given number of threads which repeatedly call times() or clock_gettime(). The results above can be reproduced by cloning MMTests from github.com and running the "poundtime" workload:

  $ git clone https://github.com/gormanm/mmtests.git
  $ cd mmtests
  $ cp configs/config-global-dhp__workload_poundtime config
  $ ./run-mmtests.sh --run-monitor $(uname -r)

The above will run "poundtime", measuring the kernel currently running on the machine. Once a new kernel is installed and the machine rebooted, running again

  $ cd mmtests
  $ ./run-mmtests.sh --run-monitor $(uname -r)

will produce results to compare with. A comparison table will be output with:

  $ cd mmtests/work/log
  $ ../../compare-kernels.sh

The table will contain a lot of entries; grepping for "Amean" (as in "arithmetic mean") will give the tables presented above. The source code for the two benchmarks is reported at the end of this changelog for clarity.

The cache misses addressed by this patch were found using a combination of `perf top`, `perf record` and `perf annotate`. The incriminated lines were found to be

  struct sched_entity *curr = cfs_rq->curr;

and

  delta_exec = now - curr->exec_start;

in the function update_curr() from kernel/sched/fair.c. This patch prefetches the data from memory just before update_curr() is called in the execution path of interest.
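
The prefetching idea described above can be illustrated with a small user-space sketch. It is not the kernel patch itself: the names `struct entity`, `struct runqueue`, `update_curr_sketch()` and `read_runtime()` are hypothetical stand-ins, and `__builtin_prefetch()` is the GCC/Clang builtin rather than the kernel's own prefetch helper. The only point is that the cache line holding `exec_start` is requested early, so it is more likely to be resident by the time the update dereferences it.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the scheduler structures named in the changelog. */
struct entity {
	uint64_t exec_start;        /* time of the last runtime update */
	uint64_t sum_exec_runtime;  /* accumulated runtime */
};

struct runqueue {
	struct entity *curr;        /* currently running entity */
};

/* Mirrors the two lines the changelog identifies as cache-miss hot spots. */
static uint64_t update_curr_sketch(struct runqueue *rq, uint64_t now)
{
	struct entity *curr = rq->curr;
	uint64_t delta_exec = now - curr->exec_start;

	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;
	return delta_exec;
}

/* Issue the prefetch first, do unrelated work, then run the update. */
static uint64_t read_runtime(struct runqueue *rq, uint64_t now)
{
	/* Hint only: the program behaves the same if the hint is ignored. */
	__builtin_prefetch(&rq->curr->exec_start);

	/* ... lock acquisition, clock update, etc. would happen here ... */

	update_curr_sketch(rq, now);
	return rq->curr->sum_exec_runtime;
}
```

A prefetch only pays off when there is enough independent work between the hint and the first real access, which is why measurements such as the tables in this changelog are needed to justify it.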
A comparison of the total number of cycles before and after the patch follows; the data is obtained using `perf stat -r 10 -ddd <program>` running over the same sequence of number of threads used above (a positive gain is an improvement):

    threads   cycles before               cycles after                gain
      2        19,699,563,964 +-1.19%      17,358,917,517 +-1.85%     11.88%
      5        47,401,089,566 +-2.96%      45,103,730,829 +-0.97%      4.85%
      8        80,923,501,004 +-3.01%      71,419,385,977 +-0.77%     11.74%
     12       112,326,485,473 +-0.47%     110,371,524,403 +-0.47%      1.74%
     21       193,455,574,299 +-0.72%     180,120,667,904 +-0.36%      6.89%
     30       315,073,519,013 +-1.64%     271,222,225,950 +-1.29%     13.92%
     48       321,969,515,332 +-1.48%     273,353,977,321 +-1.16%     15.10%
     79       337,866,003,422 +-0.97%     289,462,481,538 +-1.05%     14.33%
    110       338,712,691,920 +-0.78%     290,574,233,170 +-0.77%     14.21%
    128       348,384,794,006 +-0.50%     292,691,648,206 +-0.66%     15.99%

A comparison of cache miss vs total cache loads ratios, before and after the patch (again from the `perf stat -r 10 -ddd <program>` tables):

    threads   L1 misses/total*100   L1 misses/total*100   gain
              before                after
      2        7.43 +-4.90%          7.36 +-4.70%          0.94%
      5       13.09 +-4.74%         13.52 +-3.73%         -3.28%
      8       13.79 +-5.61%         12.90 +-3.27%          6.45%
     12       11.57 +-2.44%          8.71 +-1.40%         24.72%
     21       12.39 +-3.92%          9.97 +-1.84%         19.53%
     30       13.91 +-2.53%         11.73 +-2.28%         15.67%
     48       13.71 +-1.59%         12.32 +-1.97%         10.14%
     79       14.44 +-0.66%         13.40 +-1.06%          7.20%
    110       15.86 +-0.50%         14.46 +-0.59%          8.83%
    128       16.51 +-0.32%         15.06 +-0.78%          8.78%

As a final note, the following shows the evolution of performance figures in the "poundtime" benchmark and pinpoints commit 6e998916 ("sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency") as a major source of degradation, mostly unaddressed to this day (figures expressed in seconds).
pound_clock_gettime:

    threads   parent of    6e998916            4.7-rc7
              6e998916     itself
      2       2.23         3.68 ( -64.56%)     3.48 (-55.48%)
      5       2.83         3.78 ( -33.42%)     3.33 (-17.43%)
      8       2.84         4.31 ( -52.12%)     3.37 (-18.76%)
     12       3.09         3.61 ( -16.74%)     3.32 ( -7.17%)
     21       3.14         4.63 ( -47.36%)     4.01 (-27.71%)
     30       3.28         5.75 ( -75.37%)     3.63 (-10.80%)
     48       3.02         6.05 (-100.56%)     3.71 (-22.99%)
     79       2.88         6.30 (-118.90%)     3.75 (-30.26%)
    110       2.95         6.46 (-119.00%)     3.81 (-29.24%)
    128       3.05         6.42 (-110.08%)     3.88 (-27.04%)

pound_times:

    threads   parent of    6e998916            4.7-rc7
              6e998916     itself
      2       2.27         3.73 ( -64.71%)     3.65 (-61.14%)
      5       2.78         3.77 ( -35.56%)     3.45 (-23.98%)
      8       2.79         4.41 ( -57.71%)     3.52 (-26.05%)
     12       3.02         3.56 ( -17.94%)     3.29 ( -9.08%)
     21       3.10         4.61 ( -48.74%)     4.07 (-31.34%)
     30       3.33         5.75 ( -72.53%)     3.87 (-16.01%)
     48       2.96         6.06 (-105.04%)     3.79 (-28.10%)
     79       2.88         6.24 (-116.83%)     3.88 (-34.81%)
    110       2.98         6.37 (-114.08%)     3.90 (-31.12%)
    128       3.10         6.35 (-104.61%)     4.00 (-28.87%)

The source code of the two benchmarks follows. To compile the two:

    NR_THREADS=42
    for FILE in pound_times pound_clock_gettime; do
        gcc -O2 -DNUM_THREADS=$NR_THREADS $FILE.c -o $FILE -lpthread -lrt
    done

==== BEGIN pound_times.c ====

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/times.h>

    /* Process times sampled once in main(), before the threads start. */
    struct tms start;

    void *pound (void *threadid)
    {
      struct tms end;
      int oldutime = 0;
      int utime;
      int i;
      for (i = 0; i < 5000000 / NUM_THREADS; i++) {
              times(&end);
              utime = ((int)end.tms_utime - (int)start.tms_utime);
              /* User time must never go backwards. */
              if (oldutime > utime) {
                printf("utime decreased, was %d, now %d!\n", oldutime, utime);
              }
              oldutime = utime;
      }
      pthread_exit(NULL);
    }

    int main()
    {
      pthread_t th[NUM_THREADS];
      long i;
      times(&start);
      for (i = 0; i < NUM_THREADS; i++) {
              pthread_create (&th[i], NULL, pound, (void *)i);
      }
      pthread_exit(NULL);
      return 0;
    }

==== END pound_times.c ====

==== BEGIN pound_clock_gettime.c ====

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <time.h>

    void *pound (void *threadid)
    {
      struct timespec ts;
      int rc, i;
      unsigned long prev = 0, this = 0;
      for (i = 0; i < 5000000 / NUM_THREADS; i++) {
              rc = clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
              if (rc < 0)
                      perror("clock_gettime");
              this = (ts.tv_sec * 1000000000) + ts.tv_nsec;
              /* The timewarp report is compiled out by the leading "0 &&". */
              if (0 && this < prev)
                      printf("%lu ns timewarp at iteration %d\n", prev - this, i);
              prev = this;
      }
      pthread_exit(NULL);
    }

    int main()
    {
      pthread_t th[NUM_THREADS];
      long rc, i;
      pid_t pgid;
      for (i = 0; i < NUM_THREADS; i++) {
              rc = pthread_create(&th[i], NULL, pound, (void *)i);
              if (rc < 0)
                      perror("pthread_create");
      }
      pthread_exit(NULL);
      return 0;
    }

==== END pound_clock_gettime.c ====

Suggested-by: Mike Galbraith <mgalbraith@suse.de>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1470385316-15027-2-git-send-email-ggherdovich@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Xunlei Pang
We should update cfs_rq->throttled_clock_task, not pcfs_rq->throttled_clock_task.

The effect of this bug was probably occasionally erratic group scheduling, particularly in cgroups-intense workloads.

Signed-off-by: Xunlei Pang <xlpang@redhat.com>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 55e16d30 ("sched/fair: Rework throttle_count sync")
Link: http://lkml.kernel.org/r/1468050862-18864-1-git-send-email-xlpang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Tommaso Cucinotta
The current code in cpudeadline.c has a bug in re-heapifying when adding a new element at the end of the heap, because a deadline value of 0 is temporarily set in the new elem, then cpudl_change_key() is called with the actual elem deadline as param.

However, the function compares the new deadline to set with the one previously in the elem, which is 0. So, if current absolute deadlines have grown so large that they are negative when interpreted as s64, the comparison in cpudl_change_key() makes the wrong decision. Instead, as dl_time_before() does, the kernel should correctly handle wrap-arounds of absolute deadlines.

This patch fixes the problem with a minimally invasive change that forces cpudl_change_key() to heapify up in this case.

Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Luca Abeni <luca.abeni@unitn.it>
Cc: Juri Lelli <juri.lelli@arm.com>
Cc: Juri Lelli <juri.lelli@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1468921493-10054-2-git-send-email-tommaso.cucinotta@sssup.it
Signed-off-by: Ingo Molnar <mingo@kernel.org>
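
The wrap-around issue in the cpudeadline changelog above can be sketched in user-space C. This is an illustration only: it uses `<stdint.h>` types in place of the kernel's u64/s64, and it assumes the usual two's-complement conversion from unsigned to signed. The comparison is the well-known signed-difference idiom that dl_time_before() is built on; the values in the note after the block are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wrap-around-aware "a is earlier than b": the unsigned difference is
 * reinterpreted as signed, so the result stays correct as long as the two
 * deadlines are less than 2^63 apart, even if the counter has wrapped.
 */
static inline bool dl_time_before(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) < 0;
}
```

With this helper, `dl_time_before(UINT64_MAX - 5, 10)` is true, because the second value is just past the wrap. The same property is what makes a 0 placeholder dangerous: `dl_time_before(0x8000000000000001, 0)` is also true, so a real deadline with the top bit set looks "earlier" than the temporary 0, which is the wrong heapify decision the changelog describes.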
-
Committed by David Carrillo-Cisneros
There's a perf stat bug that is easy to observe on a machine with only one cgroup:

  $ perf stat -e cycles -I 1000 -C 0 -G /
  #          time             counts unit events
      1.000161699      <not counted>      cycles                    /
      2.000355591      <not counted>      cycles                    /
      3.000565154      <not counted>      cycles                    /
      4.000951350      <not counted>      cycles                    /

We'd expect some output there.

The underlying problem is that there is an optimization in perf_cgroup_sched_{in,out}() that skips the switch of cgroup events if the old and new cgroups in a task switch are the same.

This optimization interacts with the current code in two ways that cause a CPU context's cgroup (cpuctx->cgrp) to be NULL even if a cgroup event matches the current task. These are:

  1. On creation of the first cgroup event in a CPU: In current code, cpuctx->cgrp is only set in perf_cgroup_sched_in, but due to the aforesaid optimization, perf_cgroup_sched_in will not run until the next cgroup switch on that CPU. This may happen late or never happen, depending on the system's number of cgroups, CPU load, etc.

  2. On deletion of the last cgroup event in a cpuctx: In list_del_event, cpuctx->cgrp is set to NULL. Any new cgroup event will not be sched in because cpuctx->cgrp == NULL until a cgroup switch occurs and perf_cgroup_sched_in is executed (updating cpuctx->cgrp).

This patch fixes both problems by setting cpuctx->cgrp in list_add_event, mirroring what list_del_event does when removing a cgroup event from CPU context, as introduced in:

  commit 68cacd29 ("perf_events: Fix stale ->cgrp pointer in update_cgrp_time_from_cpuctx()")

With this patch, cpuctx->cgrp is always set/cleared when installing/removing the first/last cgroup event in/from the CPU context. With cpuctx->cgrp correctly set, event_filter_match works as intended when events are sched in/out.

After the fix, the output is as expected:

  $ perf stat -e cycles -I 1000 -a -G /
  #         time             counts unit events
     1.004699159          627342882      cycles                    /
     2.007397156          615272690      cycles                    /
     3.010019057          616726074      cycles                    /

Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1470124092-113192-1-git-send-email-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Peter Zijlstra
Vegard Nossum reported that perf fuzzing generates a NULL pointer dereference crash:

> Digging a bit deeper into this, it seems the event itself is getting
> created by perf_event_open() and it gets added to the pmu_event_list
> through:
>
> perf_event_open()
>  - perf_event_alloc()
>    - account_event()
>      - account_pmu_sb_event()
>        - attach_sb_event()
>
> so at this point the event is being attached but its ->ctx is still
> NULL. It seems like ->ctx is set just a bit later in
> perf_event_open(), though.
>
> But before that, __schedule() comes along and creates a stack trace
> similar to the one above:
>
> __schedule()
>  - __perf_event_task_sched_out()
>    - perf_iterate_sb()
>      - perf_iterate_sb_cpu()
>        - event_filter_match()
>          - perf_cgroup_match()
>            - __get_cpu_context()
>              - (dereference ctx which is NULL)
>
> So I guess the question is... should the event be attached (= put on
> the list) before ->ctx gets set? Or should the cgroup code check for a
> NULL ->ctx?

The latter seems like the simplest solution. Moving the list-add later creates a bit of a mess.

Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vegard Nossum <vegard.nossum@gmail.com>
Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Carrillo-Cisneros <davidcc@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f2fb6bef ("perf/core: Optimize side-band event delivery")
Link: http://lkml.kernel.org/r/20160804123724.GN6862@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-
Committed by Linus Torvalds
This reverts commit 874f9c7d.

Geert Uytterhoeven reports:

 "This change seems to have an (unintendent?) side-effect.

  Before, pr_*() calls without a trailing newline characters would be printed with a newline character appended, both on the console and in the output of the dmesg command.

  After this commit, no new line character is appended, and the output of the next pr_*() call of the same type may be appended, like in:

    - Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000
    - Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)
    + Truncating RAM at 0x0000000040000000-0x00000000c0000000 to -0x0000000070000000Ignoring RAM at 0x0000000200000000-0x0000000240000000 (!CONFIG_HIGHMEM)"

Joe Perches says:

 "No, that is not intentional.

  The newline handling code inside vprintk_emit is a bit involved and for now I suggest a revert until this has all the same behavior as earlier"

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Requested-by: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 Aug, 2016 3 commits
-
-
Committed by Chris Metcalf
The tick_nohz_stop_sched_tick() routine is not properly canceling the sched timer when nothing is pending, because get_next_timer_interrupt() is no longer returning KTIME_MAX in that case. This causes periodic interrupts when none are needed.

When determining the next interrupt time, we first use __next_timer_interrupt() to get the first expiring timer in the timer wheel. If no timer is found, we return the base clock value plus NEXT_TIMER_MAX_DELTA to indicate there is no timer in the timer wheel.

Back in get_next_timer_interrupt(), we set the "expires" value by converting the timer wheel expiry (in ticks) to a nsec value. But we don't want to do this if the timer wheel expiry value indicates no timer; we want to return KTIME_MAX.

Prior to commit 500462a9 ("timers: Switch to a non-cascading wheel") we checked base->active_timers to see if any timers were active, and if not, we didn't touch the expiry value and so properly returned KTIME_MAX. Now we don't have active_timers.

To fix this, we now just check the timer wheel expiry value to see if it is "now + NEXT_TIMER_MAX_DELTA", and if it is, we don't try to compute a new value based on it, but instead simply let the KTIME_MAX value in expires remain.

Fixes: 500462a9 "timers: Switch to a non-cascading wheel"
Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: John Stultz <john.stultz@linaro.org>
Link: http://lkml.kernel.org/r/1470688147-22287-1-git-send-email-cmetcalf@mellanox.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
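
The sentinel check described in the timer changelog above can be sketched as follows. This is a simplified user-space model, not the kernel function: `next_expiry_ns()` and its parameter names are hypothetical, the value chosen for `NEXT_TIMER_MAX_DELTA` is an assumption made for the example, and `TICK_NSEC` assumes a 1000 Hz tick.

```c
#include <stdint.h>

#define KTIME_MAX            INT64_MAX                  /* "no timer pending" */
#define NEXT_TIMER_MAX_DELTA ((UINT64_C(1) << 30) - 1)  /* assumed sentinel distance, in ticks */
#define TICK_NSEC            UINT64_C(1000000)          /* assumes HZ=1000 */

/*
 * base_clk: wheel clock in ticks, basej/basem: "now" in ticks and in ns,
 * nextevt: next wheel expiry in ticks as reported by the wheel lookup.
 */
static int64_t next_expiry_ns(uint64_t base_clk, uint64_t basej,
			      uint64_t basem, uint64_t nextevt)
{
	int64_t expires = KTIME_MAX;

	/* Only convert to nanoseconds when the wheel reported a real timer. */
	if (nextevt != base_clk + NEXT_TIMER_MAX_DELTA)
		expires = (int64_t)(basem + (nextevt - basej) * TICK_NSEC);

	return expires;
}
```

Without the check, the sentinel would be converted like any other expiry and yield a large but finite time, so the caller would keep the tick armed instead of cancelling it.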
-
Committed by Marc Zyngier
Bharat Kumar Gogada reported issues with the generic MSI code, where the end-point ended up with garbage in its MSI configuration (both for the vector and the message).

It turns out that the two MSI paths in the kernel are doing slightly different things:

  generic MSI: disable MSI -> allocate MSI -> enable MSI -> setup EP
  PCI MSI:     disable MSI -> allocate MSI -> setup EP -> enable MSI

And it turns out that end-points are allowed to latch the content of the MSI configuration registers as soon as MSIs are enabled. In Bharat's case, the end-point ends up using whatever was there already, which is not what you want.

In order to make things converge, we introduce a new MSI domain flag (MSI_FLAG_ACTIVATE_EARLY) that is unconditionally set for PCI/MSI. When set, this flag forces the programming of the end-point as soon as the MSIs are allocated.

A consequence of this is that we have an extra activate in irq_startup, but that should be without much consequence.

tglx:

 - Several people reported a VMWare regression with PCI/MSI-X passthrough. It turns out that the patch also cures that issue.

 - We need to have a look at the MSI disable interrupt path, where we write the msg to all zeros without disabling MSI in the PCI device. Is that correct?

Fixes: 52f518a3 "x86/MSI: Use hierarchical irqdomains to manage MSI interrupts"
Reported-and-tested-by: Bharat Kumar Gogada <bharat.kumar.gogada@xilinx.com>
Reported-and-tested-by: Foster Snowhill <forst@forstwoof.ru>
Reported-by: Matthias Prager <linux@matthiasprager.de>
Reported-by: Jason Taylor <jason.taylor@simplivity.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1468426713-31431-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
Committed by Andreas Ziegler
In commit 874f9c7d ("printk: create pr_<level> functions"), new pr_level defines were added to printk.c.

These new defines are guarded by an #ifdef CONFIG_PRINTK - however, there is already a surrounding #ifdef CONFIG_PRINTK starting a lot earlier in line 249 which means the newly introduced #ifdef is unnecessary.

Let's remove it to avoid confusion.

Signed-off-by: Andreas Ziegler <andreas.ziegler@fau.de>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 08 Aug, 2016 1 commit
-
-
Committed by Jens Axboe
Since commit 63a4cc24, bio->bi_rw contains flags in the lower portion and the op code in the higher portions. This means that old code that relies on manually setting bi_rw is most likely going to be broken. Instead of letting that brokenness linger, rename the member, to force old and out-of-tree code to break at compile time instead of at runtime.

No intended functional changes in this commit.

Signed-off-by: Jens Axboe <axboe@fb.com>
-
- 07 Aug, 2016 1 commit
-
-
Committed by Alexei Starovoitov
The introduction of pre-allocated hash elements inadvertently broke the behavior of bpf hash maps where users expected to call bpf_map_update_elem() without considering that the map can be full.

Some programs do:

  old_value = bpf_map_lookup_elem(map, key);
  if (old_value) {
    ... prepare new_value on stack ...
    bpf_map_update_elem(map, key, new_value);
  }

Before pre-allocation, update() for an existing element would work even in the 'map full' condition. Restore this behavior.

The above program could have updated old_value in place instead of calling update(), which would be faster, and most programs use that approach, but sometimes the values are large and the programs use the update() helper to do atomic replacement of the element.

Note we cannot simply update the element's value in place like the percpu hash map does, and have to allocate extra num_possible_cpu elements and use this extra reserve when the map is full.

Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
-
- 04 Aug, 2016 2 commits
-
-
Committed by Jason Baron
The current jump_label.h includes bug.h for things such as WARN_ON(). This makes the header problematic for inclusion by kernel.h or any headers that kernel.h includes, since bug.h includes kernel.h (circular dependency). The inclusion of atomic.h is similarly problematic. Thus, this should make jump_label.h 'includable' from most places.

Link: http://lkml.kernel.org/r/7060ce35ddd0d20b33bf170685e6b0fab816bdf2.1467837322.git.jbaron@akamai.com
Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Masahiro Yamada
The use of config_enabled() against config options is ambiguous. In practical terms, config_enabled() is equivalent to IS_BUILTIN(), but the author might have used it for the meaning of IS_ENABLED(). Using IS_ENABLED(), IS_BUILTIN(), IS_MODULE() etc. makes the intention clearer.

This commit replaces config_enabled() with IS_ENABLED() where possible. This commit is only touching bool config options.

I noticed two cases where config_enabled() is used against a tristate option:

 - config_enabled(CONFIG_HWMON)
   [ drivers/net/wireless/ath/ath10k/thermal.c ]

 - config_enabled(CONFIG_BACKLIGHT_CLASS_DEVICE)
   [ drivers/gpu/drm/gma500/opregion.c ]

I did not touch them because they should be converted to IS_BUILTIN() in order to keep the logic, but I was not sure it was the authors' intention.

Link: http://lkml.kernel.org/r/1465215656-20569-1-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Stas Sergeev <stsp@list.ru>
Cc: Matt Redfearn <matt.redfearn@imgtec.com>
Cc: Joshua Kinard <kumba@gentoo.org>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: "Dmitry V. Levin" <ldv@altlinux.org>
Cc: yu-cheng yu <yu-cheng.yu@intel.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Will Drewry <wad@chromium.org>
Cc: Nikolay Martynov <mar.kolya@gmail.com>
Cc: Huacai Chen <chenhc@lemote.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Leonid Yegoshin <Leonid.Yegoshin@imgtec.com>
Cc: Rafal Milecki <zajec5@gmail.com>
Cc: James Cowgill <James.Cowgill@imgtec.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Alex Smith <alex.smith@imgtec.com>
Cc: Adam Buchbinder <adam.buchbinder@gmail.com>
Cc: Qais Yousef <qais.yousef@imgtec.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Mikko Rapeli <mikko.rapeli@iki.fi>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Norris <computersforpeace@gmail.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: "Luis R. Rodriguez" <mcgrof@do-not-panic.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Roland McGrath <roland@hack.frob.com>
Cc: Paul Burton <paul.burton@imgtec.com>
Cc: Kalle Valo <kvalo@qca.qualcomm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Tony Wu <tung7970@gmail.com>
Cc: Huaitong Han <huaitong.han@intel.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrea Gelmini <andrea.gelmini@gelma.net>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Rabin Vincent <rabin@rab.in>
Cc: "Maciej W. Rozycki" <macro@imgtec.com>
Cc: David Daney <david.daney@cavium.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
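
The difference between the built-in, module and enabled checks discussed in the config_enabled() changelog above can be demonstrated with a self-contained sketch. It re-creates the macros in user space with a preprocessor trick modelled on the one commonly attributed to the kernel's kconfig.h; the helper macro names (`ARG_PLACEHOLDER_1`, `TAKE_SECOND_ARG`, `IS_DEFINED*`) and the options `CONFIG_FOO`, `CONFIG_BAR` and `CONFIG_BAZ` are hypothetical and exist only for this example.

```c
/* Hypothetical configuration: FOO is built in (=y), BAR is a module (=m), BAZ is unset. */
#define CONFIG_FOO 1
#define CONFIG_BAR_MODULE 1

/*
 * If "x" expands to 1, the pasted token ARG_PLACEHOLDER_1 expands to "0,",
 * which shifts a literal 1 into the second argument slot. Otherwise the
 * pasted token is junk and the second argument is the literal 0.
 */
#define ARG_PLACEHOLDER_1 0,
#define TAKE_SECOND_ARG(ignored, val, ...) val
#define IS_DEFINED_3(arg1_or_junk) TAKE_SECOND_ARG(arg1_or_junk 1, 0, 0)
#define IS_DEFINED_2(val) IS_DEFINED_3(ARG_PLACEHOLDER_##val)
#define IS_DEFINED(x) IS_DEFINED_2(x)

#define IS_BUILTIN(option) IS_DEFINED(option)
#define IS_MODULE(option)  IS_DEFINED(option##_MODULE)
#define IS_ENABLED(option) (IS_BUILTIN(option) || IS_MODULE(option))

/* Both results are compile-time constants; the functions exist only for illustration. */
static int bar_is_builtin(void) { return IS_BUILTIN(CONFIG_BAR); }
static int bar_is_enabled(void) { return IS_ENABLED(CONFIG_BAR); }
```

For a tristate option built as a module, such as the hypothetical `CONFIG_BAR` here, the built-in check is false while the enabled check is true. That gap is why the changelog leaves the two tristate users of config_enabled() untouched: converting them to IS_ENABLED() would change behavior for module builds.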
-