- 13 3月, 2012 4 次提交
-
-
由 Salman Qazi 提交于
When a machine boots up, the TSC generally gets reset. However, when kexec is used to boot into a kernel, the TSC value would be carried over from the previous kernel. The computation of cycns_offset in set_cyc2ns_scale is prone to an overflow, if the machine has been up more than 208 days prior to the kexec. The overflow happens when we multiply *scale, even though there is enough room to store the final answer. We fix this issue by decomposing tsc_now into the quotient and remainder of division by CYC2NS_SCALE_FACTOR and then performing the multiplication separately on the two components. Refactor code to share the calculation with the previous fix in __cycles_2_ns(). Signed-off-by: NSalman Qazi <sqazi@google.com> Acked-by: NJohn Stultz <john.stultz@linaro.org> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: john stultz <johnstul@us.ibm.com> Link: http://lkml.kernel.org/r/20120310004027.19291.88460.stgit@dungbeetle.mtv.corp.google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
I got somewhat tired of having to decode hex numbers.. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: Stephane Eranian <eranian@google.com> Cc: Robert Richter <robert.richter@amd.com> Link: http://lkml.kernel.org/n/tip-0vsy1sgywc4uar3mu1szm0rg@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Verified using the below proglet.. before: [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0 remote write Performance counter stats for './numa 0': 2,101,554 node-stores 2,096,931 node-store-misses 5.021546079 seconds time elapsed [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1 local write Performance counter stats for './numa 1': 501,137 node-stores 199 node-store-misses 5.124451068 seconds time elapsed After: [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 0 remote write Performance counter stats for './numa 0': 2,107,516 node-stores 2,097,187 node-store-misses 5.012755149 seconds time elapsed [root@westmere ~]# perf stat -e node-stores -e node-store-misses ./numa 1 local write Performance counter stats for './numa 1': 2,063,355 node-stores 165 node-store-misses 5.082091494 seconds time elapsed #define _GNU_SOURCE #include <sched.h> #include <stdio.h> #include <errno.h> #include <sys/mman.h> #include <sys/types.h> #include <dirent.h> #include <signal.h> #include <unistd.h> #include <numaif.h> #include <stdlib.h> #define SIZE (32*1024*1024) volatile int done; void sig_done(int sig) { done = 1; } int main(int argc, char **argv) { cpu_set_t *mask, *mask2; size_t size; int i, err, t; int nrcpus = 1024; char *mem; unsigned long nodemask = 0x01; /* node 0 */ DIR *node; struct dirent *de; int read = 0; int local = 0; if (argc < 2) { printf("usage: %s [0-3]\n", argv[0]); printf(" bit0 - local/remote\n"); printf(" bit1 - read/write\n"); exit(0); } switch (atoi(argv[1])) { case 0: printf("remote write\n"); break; case 1: printf("local write\n"); local = 1; break; case 2: printf("remote read\n"); read = 1; break; case 3: printf("local read\n"); local = 1; read = 1; break; } mask = CPU_ALLOC(nrcpus); size = CPU_ALLOC_SIZE(nrcpus); CPU_ZERO_S(size, mask); node = opendir("/sys/devices/system/node/node0/"); if (!node) perror("opendir"); while ((de = readdir(node))) { int cpu; if (sscanf(de->d_name, "cpu%d", &cpu) == 1) CPU_SET_S(cpu, size, mask); } closedir(node); mask2 = CPU_ALLOC(nrcpus); CPU_ZERO_S(size, mask2); for (i = 0; i < size; i++) CPU_SET_S(i, size, mask2); CPU_XOR_S(size, mask2, mask2, mask); // invert if (!local) mask = mask2; err = sched_setaffinity(0, size, mask); if (err) perror("sched_setaffinity"); mem = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); err = mbind(mem, SIZE, MPOL_BIND, &nodemask, 8*sizeof(nodemask), MPOL_MF_MOVE); if (err) perror("mbind"); signal(SIGALRM, sig_done); alarm(5); if (!read) { while (!done) { for (i = 0; i < SIZE; i++) mem[i] = 0x01; } } else { while (!done) { for (i = 0; i < SIZE; i++) t += *(volatile char *)(mem + i); } } return 0; } Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/n/tip-tq73sxus35xmqpojf7ootxgs@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Peter Zijlstra 提交于
Stepan found: CPU0 CPUn _cpu_up() __cpu_up() boostrap() notify_cpu_starting() set_cpu_online() while (!cpu_active()) cpu_relax() <PREEMPT-out> smp_call_function(.wait=1) /* we find cpu_online() is true */ arch_send_call_function_ipi_mask() /* wait-forever-more */ <PREEMPT-in> local_irq_enable() cpu_notify(CPU_ONLINE) sched_cpu_active() set_cpu_active() Now the purpose of cpu_active is mostly with bringing down a cpu, where we mark it !active to avoid the load-balancer from moving tasks to it while we tear down the cpu. This is required because we only update the sched_domain tree after we brought the cpu-down. And this is needed so that some tasks can still run while we bring it down, we just don't want new tasks to appear. On cpu-up however the sched_domain tree doesn't yet include the new cpu, so its invisible to the load-balancer, regardless of the active state. So instead of setting the active state after we boot the new cpu (and consequently having to wait for it before enabling interrupts) set the cpu active before we set it online and avoid the whole mess. Reported-by: NStepan Moskovchenko <stepanm@codeaurora.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: NThomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1323965362.18942.71.camel@twinsSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 10 3月, 2012 1 次提交
-
-
由 Thomas Gleixner 提交于
Commit f0fbf0ab ("x86: integrate delay functions") converted delay_tsc() into a random delay generator for 64 bit. The reason is that it merged the mostly identical versions of delay_32.c and delay_64.c. Though the subtle difference of the result was: static void delay_tsc(unsigned long loops) { - unsigned bclock, now; + unsigned long bclock, now; Now the function uses rdtscl() which returns the lower 32bit of the TSC. On 32bit that's not problematic as unsigned long is 32bit. On 64 bit this fails when the lower 32bit are close to wrap around when bclock is read, because the following check if ((now - bclock) >= loops) break; evaluated to true on 64bit for e.g. bclock = 0xffffffff and now = 0 because the unsigned long (now - bclock) of these values results in 0xffffffff00000001 which is definitely larger than the loops value. That explains Tvortkos observation: "Because I am seeing udelay(500) (_occasionally_) being short, and that by delaying for some duration between 0us (yep) and 491us." Make those variables explicitely u32 again, so this works for both 32 and 64 bit. Reported-by: NTvrtko Ursulin <tvrtko.ursulin@onelan.co.uk> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org # >= 2.6.27 Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 3月, 2012 1 次提交
-
-
由 Linus Torvalds 提交于
Ok, this is hacky, and only works on little-endian machines with goo unaligned handling. And even then only with CONFIG_DEBUG_PAGEALLOC disabled, since it can access up to 7 bytes after the pathname. But it runs like a bat out of hell. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 3月, 2012 2 次提交
-
-
由 Linus Torvalds 提交于
It turns out that test-compiling this file on x86-64 doesn't really help, because much of it is x86-32-specific. And so I hadn't noticed the slightly over-eager removal of the 'r' from 'addr' variable despite thinking I had tested it. Signed-off-by: NLinus "oopsie" Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
Several users of "find_vma_prev()" were not in fact interested in the previous vma if there was no primary vma to be found either. And in those cases, we're much better off just using the regular "find_vma()", and then "prev" can be looked up by just checking vma->vm_prev. The find_vma_prev() semantics are fairly subtle (see Mikulas' recent commit 83cd904d: "mm: fix find_vma_prev"), and the whole "return prev by reference" means that it generates worse code too. Thus this "let's avoid using this inconvenient and clearly too subtle interface when we don't really have to" patch. Cc: Mikulas Patocka <mpatocka@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 3月, 2012 4 次提交
-
-
由 Masami Hiramatsu 提交于
Split out optprobe related code to arch/x86/kernel/kprobes-opt.c for maintenanceability. Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Suggested-by: NIngo Molnar <mingo@elte.hu> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133222.5982.54794.stgit@localhost.localdomain [ Tidied up the code a tiny bit ] Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Masami Hiramatsu 提交于
Fix a bug in kprobes which can modify kernel code permanently at run-time. In the result, kernel can crash when it executes the modified code. This bug can happen when we put two probes enough near and the first probe is optimized. When the second probe is set up, it copies a byte which is already modified by the first probe, and executes it when the probe is hit. Even worse, the first probe and the second probe are removed respectively, the second probe writes back the copied (modified) instruction. To fix this bug, kprobes always recovers the original code and copies the first byte from recovered instruction. Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133215.5982.31991.stgit@localhost.localdomainSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Masami Hiramatsu 提交于
Current probed-instruction recovery expects that only breakpoint instruction modifies instruction. However, since kprobes jump optimization can replace original instructions with a jump, that expectation is not enough. And it may cause instruction decoding failure on the function where an optimized probe already exists. This bug can reproduce easily as below: 1) find a target function address (any kprobe-able function is OK) $ grep __secure_computing /proc/kallsyms ffffffff810c19d0 T __secure_computing 2) decode the function $ objdump -d vmlinux --start-address=0xffffffff810c19d0 --stop-address=0xffffffff810c19eb vmlinux: file format elf64-x86-64 Disassembly of section .text: ffffffff810c19d0 <__secure_computing>: ffffffff810c19d0: 55 push %rbp ffffffff810c19d1: 48 89 e5 mov %rsp,%rbp ffffffff810c19d4: e8 67 8f 72 00 callq ffffffff817ea940 <mcount> ffffffff810c19d9: 65 48 8b 04 25 40 b8 mov %gs:0xb840,%rax ffffffff810c19e0: 00 00 ffffffff810c19e2: 83 b8 88 05 00 00 01 cmpl $0x1,0x588(%rax) ffffffff810c19e9: 74 05 je ffffffff810c19f0 <__secure_computing+0x20> 3) put a kprobe-event at an optimize-able place, where no call/jump places within the 5 bytes. $ su - # cd /sys/kernel/debug/tracing # echo p __secure_computing+0x9 > kprobe_events 4) enable it and check it is optimized. # echo 1 > events/kprobes/p___secure_computing_9/enable # cat ../kprobes/list ffffffff810c19d9 k __secure_computing+0x9 [OPTIMIZED] 5) put another kprobe on an instruction after previous probe in the same function. # echo p __secure_computing+0x12 >> kprobe_events bash: echo: write error: Invalid argument # dmesg | tail -n 1 [ 1666.500016] Probing address(0xffffffff810c19e2) is not an instruction boundary. 6) however, if the kprobes optimization is disabled, it works. # echo 0 > /proc/sys/debug/kprobes-optimization # cat ../kprobes/list ffffffff810c19d9 k __secure_computing+0x9 # echo p __secure_computing+0x12 >> kprobe_events (no error) This is because kprobes doesn't recover the instruction which is overwritten with a relative jump by another kprobe when finding instruction boundary. It only recovers the breakpoint instruction. This patch fixes kprobes to recover such instructions. With this fix: # echo p __secure_computing+0x9 > kprobe_events # echo 1 > events/kprobes/p___secure_computing_9/enable # cat ../kprobes/list ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED] # echo p __secure_computing+0x12 >> kprobe_events # cat ../kprobes/list ffffffff810c1aa9 k __secure_computing+0x9 [OPTIMIZED] ffffffff810c1ab2 k __secure_computing+0x12 [DISABLED] Changes in v4: - Fix a bug to ensure optimized probe is really optimized by jump. - Remove kprobe_optready() dependency. - Cleanup code for preparing optprobe separation. Changes in v3: - Fix a build error when CONFIG_OPTPROBE=n. (Thanks, Ingo!) To fix the error, split optprobe instruction recovering path from kprobes path. - Cleanup comments/styles. Changes in v2: - Fix a bug to recover original instruction address in RIP-relative instruction fixup. - Moved on tip/master. Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: yrl.pp-manager.tt@hitachi.com Cc: systemtap@sourceware.org Cc: anderson@redhat.com Link: http://lkml.kernel.org/r/20120305133209.5982.36568.stgit@localhost.localdomainSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 3月, 2012 10 次提交
-
-
由 Stephane Eranian 提交于
With branch stack sampling, it is possible to filter by priv levels. In system-wide mode, that means it is possible to capture only user level branches. The builtin SW LBR filter needs to disassemble code based on LBR captured addresses. For that, it needs to know the task the addresses are associated with. Because of context switches, the content of the branch stack buffer may contain addresses from different tasks. We need a callback on context switch to either flush the branch stack or save it. This patch adds a new callback in struct pmu which is called during context switches. The callback is called only when necessary. That is when a system-wide context has, at least, one event which uses PERF_SAMPLE_BRANCH_STACK. The callback is never called for per-thread context. In this version, the Intel x86 code simply flushes (resets) the LBR on context switches (fills it with zeroes). Those zeroed branches are then filtered out by the SW filter. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-11-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
PERF_SAMPLE_BRANCH_* is disabled for: - SW events (sw counters, tracepoints) - HW breakpoints - ALL but Intel x86 architecture - AMD64 processors Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-10-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
This patch adds an internal sofware filter to complement the (optional) LBR hardware filter. The software filter is necessary: - as a substitute when there is no HW LBR filter (e.g., Atom, Core) - to complement HW LBR filter in case of errata (e.g., Nehalem/Westmere) - to provide finer grain filtering (e.g., all processors) Sometimes the LBR HW filter cannot distinguish between two types of branches. For instance, to capture syscall as CALLS, it is necessary to enable the LBR_FAR filter which will also capture JMP instructions. Thus, a second pass is necessary to filter those out, this is what the SW filter can do. The SW filter is built on top of the internal x86 disassembler. It is a best effort filter especially for user level code. It is subject to the availability of the text page of the program. The SW filter is enabled on all Intel processors. It is bypassed when the user is capturing all branches at all priv levels. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-9-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
This patch implements PERF_SAMPLE_BRANCH support for Intel x86processors. It connects PERF_SAMPLE_BRANCH to the actual LBR. The patch adds the hooks in the PMU irq handler to save the LBR on counter overflow for both regular and PEBS modes. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-8-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
The patch adds a restriction for Intel Atom LBR support. Only steppings 10 (PineView) and more recent are supported. Older models do not have a functional LBR. Their LBR does not freeze on PMU interrupt which makes LBR unusable in the context of perf_events. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-7-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
This patch adds the mappings from the generic PERF_SAMPLE_BRANCH_* filters to the actual Intel x86LBR filters, whenever they exist. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-6-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
If precise sampling is enabled on Intel x86 then perf_event uses PEBS. To correct for the off-by-one error of PEBS, perf_event uses LBR when precise_sample > 1. On Intel x86 PERF_SAMPLE_BRANCH_STACK is implemented using LBR, therefore both features must be coordinated as they may not configure LBR the same way. For PEBS, LBR needs to capture all branches at the priv level of the associated event. This patch checks that the branch type and priv level of BRANCH_STACK is compatible with that of the PEBS LBR requirement, thereby allowing: $ perf record -b any,u -e instructions:upp .... But: $ perf record -b any_call,u -e instructions:upp Is not possible. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-5-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
The Intel LBR on some recent processor is capable of filtering branches by type. The filter is configurable via the LBR_SELECT MSR register. There are limitation on how this register can be used. On Nehalem/Westmere, the LBR_SELECT is shared by the two HT threads when HT is on. It is private to each core when HT is off. On SandyBridge, the LBR_SELECT register is private to each thread when HT is on. It is private to each core when HT is off. The kernel must manage the sharing of LBR_SELECT. It allows multiple users on the same logical CPU to use LBR_SELECT as long as they program it with the same value. Across sibling CPUs (HT threads), the same restriction applies on NHM/WSM. This patch implements this sharing logic by leveraging the mechanism put in place for managing the offcore_response shared MSR. We modify __intel_shared_reg_get_constraints() to cause x86_get_event_constraint() to be called because LBR may be associated with events that may be counter constrained. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-4-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
This patch adds the LBR definitions for NHM/WSM/SNB and Core. It also adds the definitions for the architected LBR MSR: LBR_SELECT, LBRT_TOS. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-3-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Stephane Eranian 提交于
This patch adds the ability to sample taken branches to the perf_event interface. The ability to capture taken branches is very useful for all sorts of analysis. For instance, basic block profiling, call counts, statistical call graph. This new capability requires hardware assist and as such may not be available on all HW platforms. On Intel x86 it is implemented on top of the Last Branch Record (LBR) facility. To enable taken branches sampling, the PERF_SAMPLE_BRANCH_STACK bit must be set in attr->sample_type. Sampled taken branches may be filtered by type and/or priv levels. The patch adds a new field, called branch_sample_type, to the perf_event_attr structure. It contains a bitmask of filters to apply to the sampled taken branches. Filters may be implemented in HW. If the HW filter does not exist or is not good enough, some arch may also implement a SW filter. The following generic filters are currently defined: - PERF_SAMPLE_USER only branches whose targets are at the user level - PERF_SAMPLE_KERNEL only branches whose targets are at the kernel level - PERF_SAMPLE_HV only branches whose targets are at the hypervisor level - PERF_SAMPLE_ANY any type of branches (subject to priv levels filters) - PERF_SAMPLE_ANY_CALL any call branches (may incl. syscall on some arch) - PERF_SAMPLE_ANY_RET any return branches (may incl. syscall returns on some arch) - PERF_SAMPLE_IND_CALL indirect call branches Obviously filter may be combined. The priv level bits are optional. If not provided, the priv level of the associated event are used. It is possible to collect branches at a priv level different from the associated event. Use of kernel, hv priv levels is subject to permissions and availability (hv). The number of taken branch records present in each sample may vary based on HW, the type of sampled branches, the executed code. Therefore each sample contains the number of taken branches it contains. Signed-off-by: NStephane Eranian <eranian@google.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1328826068-11713-2-git-send-email-eranian@google.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 02 3月, 2012 2 次提交
-
-
由 Joerg Roedel 提交于
It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. Signed-off-by: NJoerg Roedel <joerg.roedel@amd.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Avi Kivity <avi@redhat.com> Cc: Stephane Eranian <eranian@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Gleb Natapov <gleb@redhat.com> Cc: Robert Richter <robert.richter@amd.com> Cc: stable@vger.kernel.org # v3.2 Link: http://lkml.kernel.org/r/1330523852-19566-1-git-send-email-joerg.roedel@amd.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Jonathan Nieder 提交于
Carlos was getting WARNING: at drivers/pci/pci.c:118 pci_ioremap_bar+0x24/0x52() when probing his sound card, and sound did not work. After adding pci=use_crs to the kernel command line, no more trouble. Ok, we can add a quirk. dmidecode output reveals that this is an MSI MS-7253, for which we already have a quirk, but the short-sighted author tied the quirk to a single BIOS version, making it not kick in on Carlos's machine with BIOS V1.2. If a later BIOS update makes it no longer necessary to look at the _CRS info it will still be harmless, so let's stop trying to guess which versions have and don't have accurate _CRS tables. Addresses https://bugtrack.alsa-project.org/alsa-bug/view.php?id=5533 Also see <https://bugzilla.kernel.org/show_bug.cgi?id=42619>. Reported-by: NCarlos Luna <caralu74@gmail.com> Reviewed-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NJonathan Nieder <jrnieder@gmail.com> Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
-
- 01 3月, 2012 1 次提交
-
-
由 Thomas Gleixner 提交于
Coccinelle based conversion. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-24swm5zut3h9c4a6s46x8rws@git.kernel.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 29 2月, 2012 1 次提交
-
-
由 Jonathan Nieder 提交于
In the spirit of commit 29cf7a30 ("x86/PCI: use host bridge _CRS info on ASUS M2V-MX SE"), this DMI quirk turns on "pci_use_crs" by default on a board that needs it. This fixes boot failures and oopses introduced in 3e3da00c ("x86/pci: AMD one chain system to use pci read out res"). The quirk is quite targetted (to a specific board and BIOS version) for two reasons: (1) to emphasize that this method of tackling the problem one quirk at a time is a little insane (2) to give BIOS vendors an opportunity to use simpler tables and allow us to return to generic behavior (whatever that happens to be) with a later BIOS update In other words, I am not at all happy with having quirks like this. But it is even worse for the kernel not to work out of the box on these machines, so... Reference: https://bugzilla.kernel.org/show_bug.cgi?id=42619Reported-by: NSvante Signell <svante.signell@telia.com> Signed-off-by: NJonathan Nieder <jrnieder@gmail.com> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
-
- 27 2月, 2012 1 次提交
-
-
由 Jan Beulich 提交于
As of v2.6.38 this counter is being maintained without ever being read. Signed-off-by: NJan Beulich <jbeulich@suse.com> Link: http://lkml.kernel.org/r/4F4787930200007800074A10@nat28.tlf.novell.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 24 2月, 2012 2 次提交
-
-
由 Ingo Molnar 提交于
static keys: Introduce 'struct static_key', static_key_true()/false() and static_key_slow_[inc|dec]() So here's a boot tested patch on top of Jason's series that does all the cleanups I talked about and turns jump labels into a more intuitive to use facility. It should also address the various misconceptions and confusions that surround jump labels. Typical usage scenarios: #include <linux/static_key.h> struct static_key key = STATIC_KEY_INIT_TRUE; if (static_key_false(&key)) do unlikely code else do likely code Or: if (static_key_true(&key)) do likely code else do unlikely code The static key is modified via: static_key_slow_inc(&key); ... static_key_slow_dec(&key); The 'slow' prefix makes it abundantly clear that this is an expensive operation. I've updated all in-kernel code to use this everywhere. Note that I (intentionally) have not pushed through the rename blindly through to the lowest levels: the actual jump-label patching arch facility should be named like that, so we want to decouple jump labels from the static-key facility a bit. On non-jump-label enabled architectures static keys default to likely()/unlikely() branches. Signed-off-by: NIngo Molnar <mingo@elte.hu> Acked-by: NJason Baron <jbaron@redhat.com> Acked-by: NSteven Rostedt <rostedt@goodmis.org> Cc: a.p.zijlstra@chello.nl Cc: mathieu.desnoyers@efficios.com Cc: davem@davemloft.net Cc: ddaney.cavm@gmail.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20120222085809.GA26397@elte.huSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Yinghai Lu 提交于
warning: unreferenced object 0xffff8801f6914200 (size 512): comm "swapper/0", pid 1, jiffies 4294893643 (age 2664.644s) hex dump (first 32 bytes): 00 00 c0 fe 00 00 00 00 ff ff ff ff 00 00 00 00 ................ 60 58 2f f6 03 88 ff ff 00 02 00 00 00 00 00 00 `X/............. backtrace: [<ffffffff81c2408c>] kmemleak_alloc+0x26/0x43 [<ffffffff8113764f>] __kmalloc+0x121/0x183 [<ffffffff81ca8d93>] get_current_resources+0x5a/0xc6 [<ffffffff81c5bedd>] pci_acpi_scan_root+0x13c/0x21c [<ffffffff81c2a745>] acpi_pci_root_add+0x1e1/0x421 [<ffffffff81408f50>] acpi_device_probe+0x50/0x190 [<ffffffff8149edc7>] really_probe+0x99/0x126 [<ffffffff8149ef83>] driver_probe_device+0x3b/0x56 [<ffffffff8149effd>] __driver_attach+0x5f/0x82 [<ffffffff8149d860>] bus_for_each_dev+0x5c/0x88 [<ffffffff8149eb87>] driver_attach+0x1e/0x20 [<ffffffff8149e7cc>] bus_add_driver+0xca/0x21d [<ffffffff8149f47b>] driver_register+0x91/0xfe [<ffffffff81409d09>] acpi_bus_register_driver+0x43/0x45 [<ffffffff8278bdc9>] acpi_pci_root_init+0x20/0x28 [<ffffffff810001e7>] do_one_initcall+0x57/0x134 The system has _CRS for root buses, but they are not used because the machine date is before the cutoff date for _CRS usage. Try to free those unused resource arrays and names. Reviewed-by: NBjorn Helgaas <bhelgaas@google.com> Signed-off-by: NYinghai Lu <yinghai@kernel.org> Signed-off-by: NJesse Barnes <jbarnes@virtuousgeek.org>
-
- 22 2月, 2012 2 次提交
-
-
由 Borislav Petkov 提交于
141168c3 ("x86: Simplify code by removing a !SMP #ifdefs from 'struct cpuinfo_x86'") removed a bunch of CONFIG_SMP ifdefs around code touching struct cpuinfo_x86 members but also caused the following build error with Randy's randconfigs: mce_amd.c:(.cpuinit.text+0x4723): undefined reference to `cpu_llc_shared_map' Restore the #ifdef in threshold_create_bank() which creates symlinks on the non-BSP CPUs. There's a better patch series being worked on by Kevin Winchester which will solve this in a cleaner fashion, but that series is too ambitious for v3.3 merging - so we first queue up this trivial fix and then do the rest for v3.4. Signed-off-by: NBorislav Petkov <bp@alien8.de> Acked-by: NKevin Winchester <kjwinchester@gmail.com> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Nick Bowler <nbowler@elliptictech.com> Link: http://lkml.kernel.org/r/20120203191801.GA2846@x1.osrc.amd.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Suresh Siddha 提交于
For each logical CPU that is coming online, we spend 20msec for checking the TSC synchronization. And as this is done sequentially for each logical CPU boot, this time gets added up depending on the number of logical CPU's supported by the platform. Minimize this by using the socket topology information. If the target CPU coming online doesn't have any of its core-siblings online, a timeout of 20msec will be used for the TSC-warp measurement loop. Otherwise a smaller timeout of 2msec will be used, as we have some information about this socket already (and this information grows as we have more and more logical-siblings in that socket). Ideally we should be able to skip the TSC sync check on the other core-siblings, if the first logical CPU in a socket passed the sync test. But as the TSC is per-logical CPU and can potentially be modified wrongly by the bios before the OS boot, TSC sync test for smaller duration should be able to catch such errors. Also this will catch the condition where all the cores in the socket doesn't get reset at the same time. For example, with this modification, time spent in TSC sync checks on a 4 socket 10-core with HT system gets reduced from 1580msec to 212msec. Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com> Acked-by: NArjan van de Ven <arjan@linux.intel.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Jack Steiner <steiner@sgi.com> Cc: venki@google.com Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Link: http://lkml.kernel.org/r/1328581940.29790.20.camel@sbsiddha-desk.sc.intel.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 21 2月, 2012 5 次提交
-
-
由 Linus Torvalds 提交于
(And define it properly for x86-32, which had its 'current_task' declaration in separate from x86-64) Bitten by my dislike for modules on the machines I use, and the fact that apparently nobody else actually wanted to test the patches I sent out. Snif. Nobody else cares. Anyway, we probably should uninline the 'kernel_fpu_begin()' function that is what modules actually use and that references this, but this is the minimal fix for now. Reported-by: NJosh Boyer <jwboyer@gmail.com> Reported-and-tested-by: NJongman Heo <jongman.heo@samsung.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Steven Rostedt 提交于
Linus noticed that the cmp used to check if the code segment is __KERNEL_CS or not did not specify a size. Perhaps it does not matter as H. Peter Anvin noted that user space can not set the bottom two bits of the %cs register. But it's best not to let the assembly choose and change things between different versions of gas, but instead just pick the size. Four bytes are used to compare the saved code segment against __KERNEL_CS. Perhaps this might mess up Xen, but we can fix that when the time comes. Also I noticed that there was another non-specified cmp that checks the special stack variable if it is 1 or 0. This too probably doesn't matter what cmp is used, but this patch uses cmpl just to make it non ambiguous. Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.comSuggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: H. Peter Anvin <hpa@zytor.com> Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
-
由 Linus Torvalds 提交于
This makes us recognize when we try to restore FPU state that matches what we already have in the FPU on this CPU, and avoids the restore entirely if so. To do this, we add two new data fields: - a percpu 'fpu_owner_task' variable that gets written any time we update the "has_fpu" field, and thus acts as a kind of back-pointer to the task that owns the CPU. The exception is when we save the FPU state as part of a context switch - if the save can keep the FPU state around, we leave the 'fpu_owner_task' variable pointing at the task whose FP state still remains on the CPU. - a per-thread 'last_cpu' field, that indicates which CPU that thread used its FPU on last. We update this on every context switch (writing an invalid CPU number if the last context switch didn't leave the FPU in a lazily usable state), so we know that *that* thread has done nothing else with the FPU since. These two fields together can be used when next switching back to the task to see if the CPU still matches: if 'fpu_owner_task' matches the task we are switching to, we know that no other task (or kernel FPU usage) touched the FPU on this CPU in the meantime, and if the current CPU number matches the 'last_cpu' field, we know that this thread did no other FP work on any other CPU, so the FPU state on the CPU must match what was saved on last context switch. In that case, we can avoid the 'f[x]rstor' entirely, and just clear the CR0.TS bit. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This inlines what is usually just a couple of instructions, but more importantly it also fixes the theoretical error case (can that FPU restore really ever fail? Maybe we should remove the checking). We can't start sending signals from within the scheduler, we're much too deep in the kernel and are holding the runqueue lock etc. So don't bother even trying. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This makes sure we clear the FPU usage counter for newly created tasks, just so that we start off in a known state (for example, don't try to preload the FPU state on the first task switch etc). It also fixes a thinko in when we increment the fpu_counter at task switch time, introduced by commit 34ddc81a ("i387: re-introduce FPU state preloading at context switch time"). We should increment the *new* task fpu_counter, not the old task, and only if we decide to use that state (whether lazily or preloaded). Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 2月, 2012 4 次提交
-
-
由 Konrad Rzeszutek Wilk 提交于
[Pls also look at https://lkml.org/lkml/2012/2/10/228] Using of PAT to change pages from WB to WC works quite nicely. Changing it back to WB - not so much. The crux of the matter is that the code that does this (__page_change_att_set_clr) has only limited information so when it tries to the change it gets the "raw" unfiltered information instead of the properly filtered one - and the "raw" one tell it that PSE bit is on (while infact it is not). As a result when the PTE is set to be WB from WC, we get tons of: :WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0() :Hardware name: HP xw4400 Workstation .. snip.. :Pid: 27, comm: kswapd0 Tainted: G W 3.2.2-1.fc16.x86_64 #1 :Call Trace: : [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0 : [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20 : [<ffffffff81005a17>] xen_make_pte+0x67/0xa0 : [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e : [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00 : [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0 : [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190 : [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0 : [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0 : [<ffffffff8100a9b2>] ? check_events+0x12/0x20 : [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm] : [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm] : [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm] : [<ffffffff8112ba04>] shrink_slab+0x154/0x310 : [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0 : [<ffffffff8112f4b8>] kswapd+0x178/0x3d0 : [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0 : [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50 : [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0 : [<ffffffff8108fb6c>] kthread+0x8c/0xa0 for every page. The proper fix for this is has been posted and is https://lkml.org/lkml/2012/2/10/228 "x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations." along with a detailed description of the problem and solution. But since that posting has gone nowhere I am proposing this band-aid solution so that at least users don't get the page corruption (the pages that are WC don't get changed to WB and end up being recycled for filesystem or other things causing mysterious crashes). The negative impact of this patch is that users of WC flag (which are InfiniBand, radeon, nouveau drivers) won't be able to set that flag - so they are going to see performance degradation. But stability is more important here. Fixes RH BZ# 742032, 787403, and 745574 Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Konrad Rzeszutek Wilk 提交于
commit 7347b408 "xen: Allow unprivileged Xen domains to create iomap pages" added a redundant line in the early bootup code to filter out the PTE. That filtering is already done a bit earlier so this extra processing is not required. Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
由 Linus Torvalds 提交于
If the irq happens in user mode, our kernel stack is empty (apart from the pt_regs themselves, of course), so there's no need or advantage to switch. And it really doesn't save any stack space, quite the reverse: it means that a nested interrupt cannot switch irq stacks. So instead of saving kernel stack space, it actually causes the potential for *more* stack usage. Also simplify the preemption count copy when we do switch stacks: just copy the whole preemption count, rather than just the softirq parts of it. There is no advantage to the partial copy: it is more effort to get a less correct result. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202191139260.10000@i5.linux-foundation.orgSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
由 Steven Rostedt 提交于
Currently, the NMI handler tests if it is nested by checking the special variable saved on the stack (set during NMI handling) and whether the saved stack is the NMI stack as well (to prevent the race when the variable is set to zero). But userspace may set their %rsp to any value as long as they do not derefence it, and it may make it point to the NMI stack, which will prevent NMIs from triggering while the userspace app is running. (I tested this, and it is indeed the case) Add another check to determine nested NMIs by looking at the saved %cs (code segment register) and making sure that it is the kernel code segment. Signed-off-by: NSteven Rostedt <rostedt@goodmis.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@kernel.org> Link: http://lkml.kernel.org/r/1329687817.1561.27.camel@acer.local.homeSigned-off-by: NIngo Molnar <mingo@elte.hu>
-