- 27 5月, 2015 4 次提交
-
-
由 Peter Zijlstra 提交于
For some obscure reason intel_{start,stop}_scheduling() copy the HT state to an intermediate array. This would make sense if we ever were to make changes to it which we'd have to discard. Except we don't. By the time we call intel_commit_scheduling() we're; as the name implies; committed to them. We'll never back out. A further hint its pointless is that stop_scheduling() unconditionally publishes the state. So the intermediate array is pointless, modify the state in place and kill the extra array. And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0. Note; all is serialized by intel_excl_cntr::lock. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Don Zickus 提交于
I stumbled upon an AMD box that had the BIOS using a hardware performance counter. Instead of printing out a warning and continuing, it failed and blocked further perf counter usage. Looking through the history, I found this commit: a5ebe0ba ("perf/x86: Check all MSRs before passing hw check") which tweaked the rules for a Xen guest on an almost identical box and now changed the behaviour. Unfortunately the rules were tweaked incorrectly and will always lead to MSR failures even though the MSRs are completely fine. What happens now is in arch/x86/kernel/cpu/perf_event.c::check_hw_exists(): <snip> for (i = 0; i < x86_pmu.num_counters; i++) { reg = x86_pmu_config_addr(i); ret = rdmsrl_safe(reg, &val); if (ret) goto msr_fail; if (val & ARCH_PERFMON_EVENTSEL_ENABLE) { bios_fail = 1; val_fail = val; reg_fail = reg; } } <snip> /* * Read the current value, change it and read it back to see if it * matches, this is needed to detect certain hardware emulators * (qemu/kvm) that don't trap on the MSR access and always return 0s. */ reg = x86_pmu_event_addr(0); ^^^^ if the first perf counter is enabled, then this routine will always fail because the counter is running. :-( if (rdmsrl_safe(reg, &val)) goto msr_fail; val ^= 0xffffUL; ret = wrmsrl_safe(reg, val); ret |= rdmsrl_safe(reg, &val_new); if (ret || val != val_new) goto msr_fail; The above bios_fail used to be a 'goto' which is why it worked in the past. Further, most vendors have migrated to using fixed counters to hide their evilness hence this problem rarely shows up now days except on a few old boxes. I fixed my problem and kept the spirit of the original Xen fix, by recording a safe non-enable register to be used safely for the reading/writing check. Because it is not enabled, this passes on bare metal boxes (like metal), but should continue to throw an msr_fail on Xen guests because the register isn't emulated yet. Now I get a proper bios_fail error message and Xen should still see their msr_fail message (untested). Signed-off-by: NDon Zickus <dzickus@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: george.dunlap@eu.citrix.com Cc: konrad.wilk@oracle.com Link: http://lkml.kernel.org/r/1431976608-56970-1-git-send-email-dzickus@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
The (SNB/IVB/HSW) HT bug only affects events that can be programmed onto GP counters, therefore we should only limit the number of GP counters that can be used per cpu -- iow we should not constrain the FP counters. Furthermore, we should only enfore such a limit when there are in fact exclusive events being scheduled on either sibling. Reported-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> [ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Commit 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()") violated the rule that 'fake' scheduling; as used for event/group validation; should not change the event state. This went mostly un-noticed because repeated calls of x86_pmu::get_event_constraints() would give the same result. And x86_pmu::put_event_constraints() would mostly not do anything. Commit e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround") made the situation much worse by actually setting the event->hw.constraint value to NULL, so when validation and actual scheduling interact we get NULL ptr derefs. Fix it by removing the constraint pointer from the event and move it back to an array, this time in cpuc instead of on the stack. validate_group() x86_schedule_events() event->hw.constraint = c; # store <context switch> perf_task_event_sched_in() ... x86_schedule_events(); event->hw.constraint = c2; # store ... put_event_constraints(event); # assume failure to schedule intel_put_event_constraints() event->hw.constraint = NULL; <context switch end> c = event->hw.constraint; # read -> NULL if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref This in particular is possible when the event in question is a cpu-wide event and group-leader, where the validate_group() tries to add an event to the group. Reported-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Hunter <ahh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 43b45780 ("perf/x86: Reduce stack usage of x86_schedule_events()") Fixes: e979121b ("perf/x86/intel: Implement cross-HT corruption bug workaround") Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 02 4月, 2015 7 次提交
-
-
由 Andi Kleen 提交于
Technically PEBS_ENABLED is only guaranteed to exist when we detected PEBS. So add a check for this to the PMU dump function. I don't think it can happen on a real CPU, but could in a VM. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1425059312-18217-4-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andi Kleen 提交于
LBRs and LBR freezing are controlled through the DEBUGCTL MSR. So dump the state of DEBUGCTL too when dumping the PMU state. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Link: http://lkml.kernel.org/r/1425059312-18217-3-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Maria Dimakopoulou 提交于
This patch implements a software workaround for a HW erratum on Intel SandyBridge, IvyBridge and Haswell processors with Hyperthreading enabled. The errata are documented for each processor in their respective specification update documents: - SandyBridge: BJ122 - IvyBridge: BV98 - Haswell: HSD29 The bug causes silent counter corruption across hyperthreads only when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3). Counters measuring those events may leak counts to the sibling counter. For instance, counter 0, thread 0 measuring event 0xd0, may leak to counter 0, thread 1, regardless of the event measured there. The size of the leak is not predictible. It all depends on the workload and the state of each sibling hyper-thread. The corrupting events do undercount as a consequence of the leak. The leak is compensated automatically only when the sibling counter measures the exact same corrupting event AND the workload is on the two threads is the same. Given, there is no way to guarantee this, a work-around is necessary. Furthermore, there is a serious problem if the leaked count is added to a low-occurrence event. In that case the corruption on the low occurrence event can be very large, e.g., orders of magnitude. There is no HW or FW workaround for this problem. The bug is very easy to reproduce on a loaded system. Here is an example on a Haswell client, where CPU0, CPU4 are siblings. We load the CPUs with a simple triad app streaming large floating-point vector. We use 0x81d0 corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and 0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not using the LBR, the 0x20cc event should be zero. $ taskset -c 0 triad & $ taskset -c 4 triad & $ perf stat -a -C 0 -e r81d0 sleep 100 & $ perf stat -a -C 4 -r20cc sleep 10 Performance counter stats for 'system wide': 139 277 291 r20cc 10,000969126 seconds time elapsed In this example, 0x81d0 and r20cc ar eusing sinling counters on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it from 0 to 139 millions occurrences. This patch provides a software workaround to this problem by modifying the way events are scheduled onto counters by the kernel. The patch forces cross-thread mutual exclusion between counters in case a corrupting event is measured by one of the hyper-threads. If thread 0, counter 0 is measuring event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting event is measured on any hyper-thread, event scheduling proceeds as before. The same example run with the workaround enabled, yield the correct answer: $ taskset -c 0 triad & $ taskset -c 4 triad & $ perf stat -a -C 0 -e r81d0 sleep 100 & $ perf stat -a -C 4 -r20cc sleep 10 Performance counter stats for 'system wide': 0 r20cc 10,000969126 seconds time elapsed The patch does provide correctness for all non-corrupting events. It does not "repatriate" the leaked counts back to the leaking counter. This is planned for a second patch series. This patch series makes this repatriation more easy by guaranteeing the sibling counter is not measuring any useful event. The patch introduces dynamic constraints for events. That means that events which did not have constraints, i.e., could be measured on any counters, may now be constrained to a subset of the counters depending on what is going on the sibling thread. The algorithm is similar to a cache coherency protocol. We call it XSU in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU counter. As a consequence of the workaround, users may see an increased amount of event multiplexing, even in situtations where there are fewer events than counters measured on a CPU. Patch has been tested on all three impacted processors. Note that when HT is off, there is no corruption. However, the workaround is still enabled, yet not costing too much. Adding a dynamic detection of HT on turned out to be complex are requiring too much to code to be justified. This patch addresses the issue when PEBS is not used. A subsequent patch fixes the problem when PEBS is used. Signed-off-by: NMaria Dimakopoulou <maria.n.dimakopoulou@gmail.com> [spinlock_t -> raw_spinlock_t] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NStephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-7-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Stephane Eranian 提交于
This patch adds an index parameter to the get_event_constraint() x86_pmu callback. It is expected to represent the index of the event in the cpuc->event_list[] array. When the callback is used for fake_cpuc (evnet validation), then the index must be -1. The motivation for passing the index is to use it to index into another cpuc array. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Cc: maria.n.dimakopoulou@gmail.com Link: http://lkml.kernel.org/r/1416251225-17721-5-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Maria Dimakopoulou 提交于
This patch adds 3 new PMU model specific callbacks during the event scheduling done by x86_schedule_events(). ->start_scheduling(): invoked when entering the schedule routine. ->stop_scheduling(): invoked at the end of the schedule routine ->commit_scheduling(): invoked for each committed event To be used optionally by model-specific code. Signed-off-by: NMaria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NStephane Eranian <eranian@google.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-4-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Stephane Eranian 提交于
Make the cpuc->kfree_on_online a vector to accommodate more than one entry and add the second entry to be used by a later patch. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NMaria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: bp@alien8.de Cc: jolsa@redhat.com Cc: kan.liang@intel.com Link: http://lkml.kernel.org/r/1416251225-17721-3-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Alexander Shishkin 提交于
Intel PT cannot be used at the same time as LBR or BTS and will cause a general protection fault if they are used together. In order to avoid fixing up GPs in the fast path, instead we disallow creating LBR/BTS events when PT events are present and vice versa. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Kaixu Xia <kaixu.xia@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Robert Richter <rric@kernel.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@infradead.org Cc: adrian.hunter@intel.com Cc: kan.liang@intel.com Cc: markus.t.metzger@intel.com Cc: mathieu.poirier@linaro.org Link: http://lkml.kernel.org/r/1421237903-181015-12-git-send-email-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 31 3月, 2015 1 次提交
-
-
由 Ingo Molnar 提交于
user_mode_ignore_vm86() can be used instead of user_mode(), in places where we have already done a v8086_mode() security check of ptregs. But doing this check in the wrong place would be a bug that could result in security problems, and also the naming still isn't very clear. Furthermore, it only affects 32-bit kernels, while most development happens on 64-bit kernels. If we replace them with user_mode() checks then the cost is only a very minor increase in various slowpaths: text data bss dec hex filename 10573391 703562 1753042 13029995 c6d26b vmlinux.o.before 10573423 703562 1753042 13030027 c6d28b vmlinux.o.after So lets get rid of this distinction once and for all. Acked-by: NBorislav Petkov <bp@suse.de> Acked-by: NAndy Lutomirski <luto@kernel.org> Cc: Andrew Lutomirski <luto@kernel.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 3月, 2015 3 次提交
-
-
由 Peter Zijlstra 提交于
While thinking on the whole clock discussion it occurred to me we have two distinct uses of time: 1) the tracking of event/ctx/cgroup enabled/running/stopped times which includes the self-monitoring support in struct perf_event_mmap_page. 2) the actual timestamps visible in the data records. And we've been conflating them. The first is all about tracking time deltas, nobody should really care in what time base that happens, its all relative information, as long as its internally consistent it works. The second however is what people are worried about when having to merge their data with external sources. And here we have the discussion on MONOTONIC vs MONOTONIC_RAW etc.. Where MONOTONIC is good for correlating between machines (static offset), MONOTNIC_RAW is required for correlating against a fixed rate hardware clock. This means configurability; now 1) makes that hard because it needs to be internally consistent across groups of unrelated events; which is why we had to have a global perf_clock(). However, for 2) it doesn't really matter, perf itself doesn't care what it writes into the buffer. The below patch makes the distinction between these two cases by adding perf_event_clock() which is used for the second case. It further makes this configurable on a per-event basis, but adds a few sanity checks such that we cannot combine events with different clocks in confusing ways. And since we then have per-event configurability we might as well retain the 'legacy' behaviour as a default. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: David Ahern <dsahern@gmail.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 David Ahern 提交于
perf_pmu_disable() is called before pmu->add() and perf_pmu_enable() is called afterwards. No need to call these inside of x86_pmu_add() as well. Signed-off-by: NDavid Ahern <dsahern@gmail.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424281543-67335-1-git-send-email-dsahern@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andi Kleen 提交于
On Broadwell INST_RETIRED.ALL cannot be used with any period that doesn't have the lowest 6 bits cleared. And the period should not be smaller than 128. This is erratum BDM11 and BDM55: http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf BDM11: When using a period < 100; we may get incorrect PEBS/PMI interrupts and/or an invalid counter state. BDM55: When bit0-5 of the period are !0 we may get redundant PEBS records on overflow. Add a new callback to enforce this, and set it for Broadwell. How does this handle the case when an app requests a specific period with some of the bottom bits set? Short answer: Any useful instruction sampling period needs to be 4-6 orders of magnitude larger than 128, as an PMI every 128 instructions would instantly overwhelm the system and be throttled. So the +-64 error from this is really small compared to the period, much smaller than normal system jitter. Long answer (by Peterz): IFF we guarantee perf_event_attr::sample_period >= 128. Suppose we start out with sample_period=192; then we'll set period_left to 192, we'll end up with left = 128 (we truncate the lower bits). We get an interrupt, find that period_left = 64 (>0 so we return 0 and don't get an overflow handler), up that to 128. Then we trigger again, at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get an overflow). We increment with sample_period so we get left = 128. We fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an overflow). And on and on. So while the individual interrupts are 'wrong' we get then with interval=256,128 in exactly the right ratio to average out at 192. And this works for everything >=128. So the num_samples*fixed_period thing is still entirely correct +- 127, which is good enough I'd say, as you already have that error anyhow. So no need to 'fix' the tools, al we need to do is refuse to create INST_RETIRED:ALL events with sample_period < 128. Signed-off-by: NAndi Kleen <ak@linux.intel.com> [ Updated comments and changelog a bit. ] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 3月, 2015 2 次提交
-
-
由 Andy Lutomirski 提交于
There's no point in checking the VM bit on 64-bit, and, since we're explicitly checking it, we can use user_mode_ignore_vm86() after the check. While we're at it, rearrange the #ifdef slightly to make the code flow a bit clearer. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/dc1457a734feccd03a19bb3538a7648582f57cdd.1426728647.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
We want to check whether user code is in 32-bit mode, not whether the task is nominally 32-bit. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brad Spengler <spender@grsecurity.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/33e5107085ce347a8303560302b15c2cadd62c4c.1426728647.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 19 2月, 2015 4 次提交
-
-
由 Yan, Zheng 提交于
Make later patch more readable, no logic change. Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-13-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yan, Zheng 提交于
When the LBR call stack is enabled, it is necessary to save/restore the LBR stack on context switch. We can use pmu specific data to store LBR stack when task is scheduled out. This patch adds code that allocates the pmu specific data. Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NStephane Eranian <eranian@google.com> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-8-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yan, Zheng 提交于
Previous commit introduces context switch callback, its function overlaps with the flush branch stack callback. So we can use the context switch callback to flush LBR stack. This patch adds code that uses the flush branch callback to flush the LBR stack when task is being scheduled in. The callback is enabled only when there are events use the LBR hardware. This patch also removes all old flush branch stack code. Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-4-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yan, Zheng 提交于
The callback is invoked when process is scheduled in or out. It provides mechanism for later patches to save/store the LBR stack. For the schedule in case, the callback is invoked at the same place that flush branch stack callback is invoked. So it also can replace the flush branch stack callback. To avoid unnecessary overhead, the callback is enabled only when there are events use the LBR stack. Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: eranian@google.com Cc: jolsa@redhat.com Link: http://lkml.kernel.org/r/1415156173-10035-3-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 04 2月, 2015 4 次提交
-
-
由 Andy Lutomirski 提交于
While perfmon2 is a sufficiently evil library (it pokes MSRs directly) that breaking it is fair game, it's still useful, so we might as well try to support it. This allows users to write 2 to /sys/devices/cpu/rdpmc to disable all rdpmc protection so that hack like perfmon2 can continue to work. At some point, if perf_event becomes fast enough to replace perfmon2, then this can go. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/caac3c1c707dcca48ecbc35f4def21495856f479.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
We currently allow any process to use rdpmc. This significantly weakens the protection offered by PR_TSC_DISABLED, and it could be helpful to users attempting to exploit timing attacks. Since we can't enable access to individual counters, use a very coarse heuristic to limit access to rdpmc: allow access only when a perf_event is mmapped. This protects seccomp sandboxes. There is plenty of room to further tighen these restrictions. For example, this allows rdpmc for any x86_pmu event, but it's only useful for self-monitoring tasks. As a side effect, cap_user_rdpmc will now be false for AMD uncore events. This isn't a real regression, since .event_idx is disabled for these events anyway for the time being. Whenever that gets re-added, the cap_user_rdpmc code can be adjusted or refactored accordingly. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/a2bdb3cf3a1d70c26980d7c6dddfbaa69f3182bf.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/0fea9a7fac3c1eea86cb0a5954184e74f4213666.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
CR4 manipulation was split, seemingly at random, between direct (write_cr4) and using a helper (set/clear_in_cr4). Unfortunately, the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code, which only a small subset of users actually wanted. This patch replaces all cr4 access in functions that don't leave cr4 exactly the way they found it with new helpers cr4_set_bits, cr4_clear_bits, and cr4_set_bits_and_update_boot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 29 10月, 2014 1 次提交
-
-
由 Ingo Molnar 提交于
These patches: 86a349a2 ("perf/x86/intel: Add Broadwell core support") c46e665f ("perf/x86: Add INST_RETIRED.ALL workarounds") fdda3c4a ("perf/x86/intel: Use Broadwell cache event list for Haswell") introduced magic constants and unexplained changes: https://lkml.org/lkml/2014/10/28/1128 https://lkml.org/lkml/2014/10/27/325 https://lkml.org/lkml/2014/8/27/546 https://lkml.org/lkml/2014/10/28/546 Peter Zijlstra has attempted to help out, to clean up the mess: https://lkml.org/lkml/2014/10/28/543 But has not received helpful and constructive replies which makes me doubt wether it can all be finished in time until v3.18 is released. Despite various review feedback the author (Andi Kleen) has answered only few of the review questions and has generally been uncooperative, only giving replies when prompted repeatedly, and only giving minimal answers instead of constructively explaining and helping along the effort. That kind of behavior is not acceptable. There's also a boot crash on Intel E5-1630 v3 CPUs reported for another commit from Andi Kleen: e735b9db ("perf/x86/intel/uncore: Add Haswell-EP uncore support") https://lkml.org/lkml/2014/10/22/730 Which is not yet resolved. The uncore driver is independent in theory, but the crash makes me worry about how well all these patches were tested and makes me uneasy about the level of interminging that the Broadwell and Haswell code has received by the commits above. As a first step to resolve the mess revert the Broadwell client commits back to the v3.17 version, before we run out of time and problematic code hits a stable upstream kernel. ( If the Haswell-EP crash is not resolved via a simple fix then we'll have to revert the Haswell-EP uncore driver as well. ) The Broadwell client series has to be submitted in a clean fashion, with single, well documented changes per patch. If they are submitted in time and are accepted during review then they can possibly go into v3.19 but will need additional scrutiny due to the rocky history of this patch set. Cc: Andi Kleen <ak@linux.intel.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: eranian@google.com Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/1409683455-29168-3-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 28 10月, 2014 1 次提交
-
-
由 Peter Zijlstra (Intel) 提交于
Andy spotted the fail in what was intended as a conditional printk level. Reported-by: NAndy Lutomirski <luto@amacapital.net> Fixes: cc6cd47e ("perf/x86: Tone down kernel messages when the PMU check fails in a virtual environment") Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20141007124757.GH19379@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 03 10月, 2014 1 次提交
-
-
由 Wei Huang 提交于
PMU checking can fail due to various reasons. On native machine, this is mostly caused by faulty hardware and it is reasonable to use KERN_ERR in reporting. However, when kernel is running on virtualized environment, this checking can fail if virtual PMU is not supported (e.g. KVM on AMD host). It is annoying to see an error message on splash screen, even though we know such failure is benign on virtualized environment. This patch checks if the kernel is running in a virtualized environment. If so, it will use KERN_INFO in reporting, which reduces the syslog priority of them. This patch was tested successfully on KVM. Signed-off-by: NWei Huang <wei@redhat.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Link: http://lkml.kernel.org/r/1411617314-24659-1-git-send-email-wei@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 9月, 2014 1 次提交
-
-
由 Andi Kleen 提交于
On Broadwell INST_RETIRED.ALL cannot be used with any period that doesn't have the lowest 6 bits cleared. And the period should not be smaller than 128. Add a new callback to enforce this, and set it for Broadwell. This is erratum BDM57 and BDM11. How does this handle the case when an app requests a specific period with some of the bottom bits set The apps thinks it is sampling at X occurences per sample, when it is in fact at X - 63 (worst case). Short answer: Any useful instruction sampling period needs to be 4-6 orders of magnitude larger than 128, as an PMI every 128 instructions would instantly overwhelm the system and be throttled. So the +-64 error from this is really small compared to the period, much smaller than normal system jitter. Long answer: <write up by Peter:> IFF we guarantee perf_event_attr::sample_period >= 128. Suppose we start out with sample_period=192; then we'll set period_left to 192, we'll end up with left = 128 (we truncate the lower bits). We get an interrupt, find that period_left = 64 (>0 so we return 0 and don't get an overflow handler), up that to 128. Then we trigger again, at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get an overflow). We increment with sample_period so we get left = 128. We fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an overflow). And on and on. So while the individual interrupts are 'wrong' we get then with interval=256,128 in exactly the right ratio to average out at 192. And this works for everything >=128. So the num_samples*fixed_period thing is still entirely correct +- 127, which is good enough I'd say, as you already have that error anyhow. So no need to 'fix' the tools, al we need to do is refuse to create INST_RETIRED:ALL events with sample_period < 128. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kan Liang <kan.liang@intel.com> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Mark Davies <junk@eslaf.co.uk> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1409683455-29168-4-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 8月, 2014 1 次提交
-
-
由 Christoph Lameter 提交于
__get_cpu_var() is used for multiple purposes in the kernel source. One of them is address calculation via the form &__get_cpu_var(x). This calculates the address for the instance of the percpu variable of the current processor based on an offset. Other use cases are for storing and retrieving data from the current processors percpu area. __get_cpu_var() can be used as an lvalue when writing data or on the right side of an assignment. __get_cpu_var() is defined as : #define __get_cpu_var(var) (*this_cpu_ptr(&(var))) __get_cpu_var() always only does an address determination. However, store and retrieve operations could use a segment prefix (or global register on other platforms) to avoid the address calculation. this_cpu_write() and this_cpu_read() can directly take an offset into a percpu area and use optimized assembly code to read and write per cpu variables. This patch converts __get_cpu_var into either an explicit address calculation using this_cpu_ptr() or into a use of this_cpu operations that use the offset. Thereby address calculations are avoided and less registers are used when code is generated. Transformations done to __get_cpu_var() 1. Determine the address of the percpu instance of the current processor. DEFINE_PER_CPU(int, y); int *x = &__get_cpu_var(y); Converts to int *x = this_cpu_ptr(&y); 2. Same as #1 but this time an array structure is involved. DEFINE_PER_CPU(int, y[20]); int *x = __get_cpu_var(y); Converts to int *x = this_cpu_ptr(y); 3. Retrieve the content of the current processors instance of a per cpu variable. DEFINE_PER_CPU(int, y); int x = __get_cpu_var(y) Converts to int x = __this_cpu_read(y); 4. Retrieve the content of a percpu struct DEFINE_PER_CPU(struct mystruct, y); struct mystruct x = __get_cpu_var(y); Converts to memcpy(&x, this_cpu_ptr(&y), sizeof(x)); 5. Assignment to a per cpu variable DEFINE_PER_CPU(int, y) __get_cpu_var(y) = x; Converts to __this_cpu_write(y, x); 6. Increment/Decrement etc of a per cpu variable DEFINE_PER_CPU(int, y); __get_cpu_var(y)++ Converts to __this_cpu_inc(y) Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86@kernel.org Acked-by: NH. Peter Anvin <hpa@linux.intel.com> Acked-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 13 8月, 2014 1 次提交
-
-
由 Andi Kleen 提交于
This fixes a side effect of Kan's earlier patch to probe the LBRs at boot time. Normally when the LBRs are disabled cycles:pp is disabled too. So for example cycles:pp doesn't work. However this is not needed with PEBSv2 and later (Haswell) because it does not need LBRs to correct the IP-off-by-one. So add an extra check for PEBSv2 that also allows :pp Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: kan.liang@intel.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Link: http://lkml.kernel.org/r/1407456534-15747-1-git-send-email-andi@firstfloor.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 16 7月, 2014 1 次提交
-
-
由 Kan Liang 提交于
With -cpu host, KVM reports LBR and extra_regs support, if the host has support. When the guest perf driver tries to access LBR or extra_regs MSR, it #GPs all MSR accesses,since KVM doesn't handle LBR and extra_regs support. So check the related MSRs access right once at initialization time to avoid the error access at runtime. For reproducing the issue, please build the kernel with CONFIG_KVM_INTEL = y (for host kernel). And CONFIG_PARAVIRT = n and CONFIG_KVM_GUEST = n (for guest kernel). Start the guest with -cpu host. Run perf record with --branch-any or --branch-filter in guest to trigger LBR Run perf stat offcore events (E.g. LLC-loads/LLC-load-misses ...) in guest to trigger offcore_rsp #GP Signed-off-by: NKan Liang <kan.liang@intel.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com> Cc: Mark Davies <junk@eslaf.co.uk> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Yan, Zheng <zheng.z.yan@intel.com> Link: http://lkml.kernel.org/r/1405365957-20202-1-git-send-email-kan.liang@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 6月, 2014 1 次提交
-
-
由 Vince Weaver 提交于
Make the x86 perf code use the new common PMU interrupt disabled code. Typically most x86 machines have working PMU interrupts, although some older p6-class machines had this problem. Signed-off-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1405161715560.11099@vincent-weaver-1.umelst.maine.eduSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 4月, 2014 1 次提交
-
-
由 Masami Hiramatsu 提交于
Use NOKPROBE_SYMBOL macro for protecting functions from kprobes instead of __kprobes annotation under arch/x86. This applies nokprobe_inline annotation for some cases, because NOKPROBE_SYMBOL() will inhibit inlining by referring the symbol address. This just folds a bunch of previous NOKPROBE_SYMBOL() cleanup patches for x86 to one patch. Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Link: http://lkml.kernel.org/r/20140417081814.26341.51656.stgit@ltc230.yrl.intra.hitachi.co.jp Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Jiri Kosina <jkosina@suse.cz> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Lebon <jlebon@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Michel Lespinasse <walken@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Seiji Aguchi <seiji.aguchi@hds.com> Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Vineet Gupta <vgupta@synopsys.com> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 4月, 2014 1 次提交
-
-
由 Yan, Zheng 提交于
export perf_assign_events to allow building perf Intel uncore driver as module Signed-off-by: NYan, Zheng <zheng.z.yan@intel.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1395133004-23205-3-git-send-email-zheng.z.yan@intel.com Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: eranian@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 2月, 2014 2 次提交
-
-
由 Peter Zijlstra 提交于
Add a few comments on the ->add(), ->del() and ->*_txn() implementation. Requested-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-he3819318c245j7t5e1e22tr@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
Vince "Super Tester" Weaver reported a new round of syscall fuzzing (Trinity) failures, with perf WARN_ON()s triggering. He also provided traces of the failures. This is I think the relevant bit: > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_disable: x86_pmu_disable > pec_1076_warn-2804 [000] d... 147.926153: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926156: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926158: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926159: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926160: x86_pmu_state: n_events: 1, n_added: 0, n_txn: 1 > pec_1076_warn-2804 [000] d... 147.926161: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926162: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926163: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926166: collect_events: Adding event: 1 (ffff880119ec8800) So we add the insn:p event (fd[23]). At this point we should have: n_events = 2, n_added = 1, n_txn = 1 > pec_1076_warn-2804 [000] d... 147.926170: collect_events: Adding event: 0 (ffff8800c9e01800) > pec_1076_warn-2804 [000] d... 147.926172: collect_events: Adding event: 4 (ffff8800cbab2c00) We try and add the {BP,cycles,br_insn} group (fd[3], fd[4], fd[15]). These events are 0:cycles and 4:br_insn, the BP event isn't x86_pmu so that's not visible. group_sched_in() pmu->start_txn() /* nop - BP pmu */ event_sched_in() event->pmu->add() So here we should end up with: 0: n_events = 3, n_added = 2, n_txn = 2 4: n_events = 4, n_added = 3, n_txn = 3 But seeing the below state on x86_pmu_enable(), the must have failed, because the 0 and 4 events aren't there anymore. Looking at group_sched_in(), since the BP is the leader, its event_sched_in() must have succeeded, for otherwise we would not have seen the sibling adds. But since neither 0 or 4 are in the below state; their event_sched_in() must have failed; but I don't see why, the complete state: 0,0,1:p,4 fits perfectly fine on a core2. However, since we try and schedule 4 it means the 0 event must have succeeded! Therefore the 4 event must have failed, its failure will have put group_sched_in() into the fail path, which will call: event_sched_out() event->pmu->del() on 0 and the BP event. Now x86_pmu_del() will reduce n_events; but it will not reduce n_added; giving what we see below: n_event = 2, n_added = 2, n_txn = 2 > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_enable: x86_pmu_enable > pec_1076_warn-2804 [000] d... 147.926177: x86_pmu_state: Events: { > pec_1076_warn-2804 [000] d... 147.926179: x86_pmu_state: 0: state: .R config: ffffffffffffffff ( (null)) > pec_1076_warn-2804 [000] d... 147.926181: x86_pmu_state: 33: state: AR config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926182: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: n_events: 2, n_added: 2, n_txn: 2 > pec_1076_warn-2804 [000] d... 147.926184: x86_pmu_state: Assignment: { > pec_1076_warn-2804 [000] d... 147.926186: x86_pmu_state: 0->33 tag: 1 config: 0 (ffff88011ac99800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: 1->0 tag: 1 config: 1 (ffff880119ec8800) > pec_1076_warn-2804 [000] d... 147.926188: x86_pmu_state: } > pec_1076_warn-2804 [000] d... 147.926190: x86_pmu_enable: S0: hwc->idx: 33, hwc->last_cpu: 0, hwc->last_tag: 1 hwc->state: 0 So the problem is that x86_pmu_del(), when called from a group_sched_in() that fails (for whatever reason), and without x86_pmu TXN support (because the leader is !x86_pmu), will corrupt the n_added state. Reported-and-Tested-by: NVince Weaver <vincent.weaver@maine.edu> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Stephane Eranian <eranian@google.com> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20140221150312.GF3104@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 2月, 2014 2 次提交
-
-
由 Peter Zijlstra 提交于
The current code forgets to change the CR4 state on the current CPU. Use on_each_cpu() instead of smp_call_function(). Reported-by: NMark Davies <junk@eslaf.co.uk> Suggested-by: NMark Davies <junk@eslaf.co.uk> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: fweisbec@gmail.com Link: http://lkml.kernel.org/n/tip-69efsat90ibhnd577zy3z9gh@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Peter Zijlstra 提交于
PPro machines can die hard when PCE gets enabled due to a CPU erratum. The safe way it so disable it by default and keep it disabled. See erratum 26 in: http://download.intel.com/design/archives/processors/pro/docs/24268935.pdfReported-and-Tested-by: NMark Davies <junk@eslaf.co.uk> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vince@deater.net> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140206170815.GW2936@laptop.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 13 1月, 2014 1 次提交
-
-
由 Peter Zijlstra 提交于
In order to avoid the runtime condition and variable load turn sched_clock_stable into a static_key. Also provide a shorter implementation of local_clock() and cpu_clock(int) when sched_clock_stable==1. MAINLINE PRE POST sched_clock_stable: 1 1 1 (cold) sched_clock: 329841 221876 215295 (cold) local_clock: 301773 234692 220773 (warm) sched_clock: 38375 25602 25659 (warm) local_clock: 100371 33265 27242 (warm) rdtsc: 27340 24214 24208 sched_clock_stable: 0 0 0 (cold) sched_clock: 382634 235941 237019 (cold) local_clock: 396890 297017 294819 (warm) sched_clock: 38194 25233 25609 (warm) local_clock: 143452 71234 71232 (warm) rdtsc: 27345 24245 24243 Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/n/tip-eummbdechzz37mwmpags1gjr@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-