- 12 1月, 2014 1 次提交
-
-
由 Borislav Petkov 提交于
So mce_start_timer() has a 'cpu' argument which is supposed to mean to start a timer on that cpu. However, the code currently starts a timer on the *current* cpu the function runs on and causes the sanity-check in mce_timer_fn to fire: WARNING: CPU: 0 PID: 0 at arch/x86/kernel/cpu/mcheck/mce.c:1286 mce_timer_fn because it is running on the wrong cpu. This was triggered by Prarit Bhargava <prarit@redhat.com> by offlining all the cpus in succession. Then, we were fiddling with the CMCI storm settings when starting the timer whereas there's no need for that - if there's storm happening on this newly restarted cpu, we're going to be in normal CMCI mode initially and then when the CMCI interrupt starts firing, we're going to go to the polling mode with the timer real soon. Signed-off-by: NBorislav Petkov <bp@suse.de> Tested-by: NPrarit Bhargava <prarit@redhat.com> Cc: Tony Luck <tony.luck@intel.com> Reviewed-by: NChen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1387722156-5511-1-git-send-email-prarit@redhat.com
-
- 30 11月, 2013 1 次提交
-
-
由 Levente Kurusa 提交于
This patch adds a call to put_device() when the device_register() call has failed. This is required so that the last reference to the device is given up. Signed-off-by: NLevente Kurusa <levex@linux.com> Link: http://lkml.kernel.org/r/5298F900.9000208@linux.comSigned-off-by: NBorislav Petkov <bp@suse.de>
-
- 15 7月, 2013 1 次提交
-
-
由 Paul Gortmaker 提交于
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes. After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h. Note that some harmless section mismatch warnings may result, since notify_cpu_starting() and cpu_up() are arch independent (kernel/cpu.c) are flagged as __cpuinit -- so if we remove the __cpuinit from arch specific callers, we will also get section mismatch warnings. As an intermediate step, we intend to turn the linux/init.h cpuinit content into no-ops as early as possible, since that will get rid of these warnings. In any case, they are temporary and harmless. This removes all the arch/x86 uses of the __cpuinit macros from all C files. x86 only had the one __CPUINIT used in assembly files, and it wasn't paired off with a .previous or a __FINIT, so we can delete it directly w/o any corresponding additional change there. [1] https://lkml.org/lkml/2013/5/20/589 Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Acked-by: NIngo Molnar <mingo@kernel.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NH. Peter Anvin <hpa@linux.intel.com> Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
-
- 09 7月, 2013 1 次提交
-
-
由 Naveen N. Rao 提交于
The Corrected Machine Check structure (CMC) in HEST has a flag which can be set by the firmware to indicate to the OS that it prefers to process the corrected error events first. In this scenario, the OS is expected to not monitor for corrected errors (through CMCI/polling). Instead, the firmware notifies the OS on corrected error events through GHES. Linux already has support for GHES. This patch adds support for parsing CMC structure and to disable CMCI/polling if the firmware first flag is set. Further, the list of machine check bank structures at the end of CMC is used to determine which MCA banks function in FF mode, so that we continue to monitor error events on the other banks. Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 26 6月, 2013 1 次提交
-
-
由 Naveen N. Rao 提交于
There is some confusion about the 'mce_poll_banks' and 'mce_banks_owned' per-cpu bitmaps. Provide comments so that we all know exactly what these are used for, and why. Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 03 4月, 2013 1 次提交
-
-
由 Srivatsa S. Bhat 提交于
Dave Jones reports that offlining a CPU leads to this trace: numa_remove_cpu cpu 1 node 0: mask now 0,2-3 smpboot: CPU 1 is now offline BUG: using smp_processor_id() in preemptible [00000000] code: cpu-offline.sh/10591 caller is cmci_rediscover+0x6a/0xe0 Pid: 10591, comm: cpu-offline.sh Not tainted 3.9.0-rc3+ #2 Call Trace: [<ffffffff81333bbd>] debug_smp_processor_id+0xdd/0x100 [<ffffffff8101edba>] cmci_rediscover+0x6a/0xe0 [<ffffffff815f5b9f>] mce_cpu_callback+0x19d/0x1ae [<ffffffff8160ea66>] notifier_call_chain+0x66/0x150 [<ffffffff8107ad7e>] __raw_notifier_call_chain+0xe/0x10 [<ffffffff8104c2e3>] cpu_notify+0x23/0x50 [<ffffffff8104c31e>] cpu_notify_nofail+0xe/0x20 [<ffffffff815ef082>] _cpu_down+0x302/0x350 [<ffffffff815ef106>] cpu_down+0x36/0x50 [<ffffffff815f1c9d>] store_online+0x8d/0xd0 [<ffffffff813edc48>] dev_attr_store+0x18/0x30 [<ffffffff81226eeb>] sysfs_write_file+0xdb/0x150 [<ffffffff811adfb2>] vfs_write+0xa2/0x170 [<ffffffff811ae16c>] sys_write+0x4c/0xa0 [<ffffffff81613019>] system_call_fastpath+0x16/0x1b However, a look at cmci_rediscover shows that it can be simplified quite a bit, apart from solving the above issue. It invokes functions that take spin locks with interrupts disabled, and hence it can run in atomic context. Also, it is run in the CPU_POST_DEAD phase, so the dying CPU is already dead and out of the cpu_online_mask. So take these points into account and simplify the code, and thereby also fix the above issue. Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 21 1月, 2013 1 次提交
-
-
由 Rusty Russell 提交于
Fix up all callers as they were before, with make one change: an unsigned module taints the kernel, but doesn't turn off lockdep. Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
-
- 29 12月, 2012 1 次提交
-
-
由 Tejun Heo 提交于
There's no need to test whether a (delayed) work item in pending before queueing, flushing or cancelling it. Most uses are unnecessary and quite a few of them are buggy. Remove unnecessary pending tests from x86/mce. Only compile tested. v2: Local var work removed from mce_schedule_work() as suggested by Borislav. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NBorislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac@vger.kernel.org
-
- 26 10月, 2012 4 次提交
-
-
由 Borislav Petkov 提交于
mce_ser, mce_bios_cmci_threshold and mce_disabled are the last three bools which need conversion. Move them to the mca_config struct and adjust usage sites accordingly. Signed-off-by: NBorislav Petkov <bp@alien8.de> Acked-by: NTony Luck <tony.luck@intel.com>
-
由 Borislav Petkov 提交于
Move them into the mca_config struct and adjust code touching them accordingly. Signed-off-by: NBorislav Petkov <bp@alien8.de> Acked-by: NTony Luck <tony.luck@intel.com>
-
由 Borislav Petkov 提交于
Move above configuration variables into struct mca_config and adjust usage places accordingly. Signed-off-by: NBorislav Petkov <bp@alien8.de> Acked-by: NTony Luck <tony.luck@intel.com>
-
由 Borislav Petkov 提交于
Move those MCA configuration variables into struct mca_config and adjust the places they're used accordingly. Signed-off-by: NBorislav Petkov <bp@alien8.de> Acked-by: NTony Luck <tony.luck@intel.com>
-
- 19 10月, 2012 1 次提交
-
-
由 Borislav Petkov 提交于
450cc201 ("x86/mce: Provide boot argument to honour bios-set CMCI threshold") added the bios_cmci_threshold sysfs attribute which was supposed to communicate to userspace tools that BIOS CMCI threshold has been honoured. However, this info is not of any importance to userspace - it should rather get the actual error count it has been thresholded already from MCi_STATUS[38:52]. So drop this before it becomes a used interface (good thing we caught this early in 3.7-rc1, right after the merge window closed). Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: NTony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20121017105940.GA14590@x1.osrc.amd.comSigned-off-by: NBorislav Petkov <borislav.petkov@amd.com>
-
- 28 9月, 2012 1 次提交
-
-
由 Naveen N. Rao 提交于
The ACPI spec doesn't provide for a way for the bios to pass down recommended thresholds to the OS on a _per-bank_ basis. This patch adds a new boot option, which if passed, tells Linux to use CMCI thresholds set by the bios. As fail-safe, we initialize threshold to 1 if some banks have not been initialized by the bios and warn the user. Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 10 8月, 2012 1 次提交
-
-
由 Chen Gong 提交于
On Intel systems corrected machine check interrupts (CMCI) may be sent to multiple logical processors; possibly to all processors on the affected socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface"). This means that a persistent error (such as a stuck bit in ECC memory) may cause a storm of interrupts that greatly hinders or prevents forward progress (probably on many processors). To solve this we keep track of the rate at which each processor sees CMCI. If we exceed a threshold, we disable CMCI delivery and switch to polling the machine check banks. If the storm subsides (none of the affected processors see any more errors for a complete poll interval) we re-enable CMCI. [Tony: Added console messages when storm begins/ends and increased storm threshold from 5 to 15 so we have a few more logged entries before we disable interrupts and start dropping reports] Signed-off-by: NChen Gong <gong.chen@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NChen Gong <gong.chen@linux.intel.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 04 8月, 2012 2 次提交
-
-
由 Thomas Gleixner 提交于
No point in having double cases if we can simply mask the FROZEN bit out. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NChen Gong <gong.chen@linux.intel.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Thomas Gleixner 提交于
Split timer init function into the init and the start part, so the start part can replace the open coded version in CPU_DOWN_FAILED. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NChen Gong <gong.chen@linux.intel.com> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 26 7月, 2012 1 次提交
-
-
由 Tony Luck 提交于
Sandy Bridge processors follow the SDM (Vol 3B, Table 15-20) and set both the RIPV and EIPV bits in the MCG_STATUS register to zero for machine checks during instruction fetch. This is more than a little counter-intuitive and means that Linux cannot recover from these errors. Rather than insert special case code at several places in mce.c and mce-severity.c, we pretend the EIPV bit was set for just this case early in processing the machine check. Acked-by: NBorislav Petkov <bp@amd64.org> Signed-off-by: NTony Luck <tony.luck@intel.com> Cc: Chen Gong <gong.chen@linux.intel.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> Link: http://lkml.kernel.org/r/180a06f3f357cf9f78259ae443a082b14a29535b.1343078495.git.tony.luck@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 20 7月, 2012 1 次提交
-
-
由 Liu, Jinsong 提交于
When MCA error occurs, it would be handled by Xen hypervisor first, and then the error information would be sent to initial domain for logging. This patch gets error information from Xen hypervisor and convert Xen format error into Linux format mcelog. This logic is basically self-contained, not touching other kernel components. By using tools like mcelog tool users could read specific error information, like what they did under native Linux. To test follow directions outlined in Documentation/acpi/apei/einj.txt Acked-and-tested-by: NBorislav Petkov <borislav.petkov@amd.com> Signed-off-by: NKe, Liping <liping.ke@intel.com> Signed-off-by: NJiang, Yunhong <yunhong.jiang@intel.com> Signed-off-by: NJeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: NLiu, Jinsong <jinsong.liu@intel.com> Signed-off-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
-
- 12 7月, 2012 1 次提交
-
-
由 Tony Luck 提交于
In commit dad1743e ("x86/mce: Only restart instruction after machine check recovery if it is safe") we fixed mce_notify_process() to force a signal to the current process if it was not restartable (RIPV bit not set in MCG_STATUS). But doing it here means that the process doesn't get told the virtual address of the fault via siginfo_t->si_addr. This would prevent application level recovery from the fault. Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so that we will provide the right information with the signal. Signed-off-by: NTony Luck <tony.luck@intel.com> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Cc: stable@kernel.org # 3.4+
-
- 06 6月, 2012 4 次提交
-
-
由 Thomas Gleixner 提交于
commit 82f7af09 ("x86/mce: Cleanup timer mess) dropped the initialization of the per cpu timer interval. Duh :( Restore the previous behaviour. Reported-by: NChen Gong <gong.chen@linux.intel.com> Cc: bp@amd64.org Cc: tony.luck@intel.com Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Joe Perches 提交于
Use a more current logging style: - Bare printks should have a KERN_<LEVEL> for consistency's sake - Add pr_fmt where appropriate - Neaten some macro definitions - Convert some Ok output to OK - Use "%s: ", __func__ in pr_fmt for summit - Convert some printks to pr_<level> Message output is not identical in all cases. Signed-off-by: NJoe Perches <joe@perches.com> Cc: levinsasha928@gmail.com Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop [ merged two similar patches, tidied up the changelog ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Chen Gong 提交于
In commit 82f7af09 ("x86/mce: Cleanup timer mess), Thomas just forgot the "/ 2" there while cleaning up. Signed-off-by: NChen Gong <gong.chen@linux.intel.com> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: bp@amd64.org Cc: tony.luck@intel.com Link: http://lkml.kernel.org/r/1338863702-9245-1-git-send-email-gong.chen@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Chen Gong 提交于
In commit 82f7af09 (x86/mce: Cleanup timer mess), Thomas just forgot the "/ 2" there while cleaning up. Signed-off-by: NChen Gong <gong.chen@linux.intel.com> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 31 5月, 2012 1 次提交
-
-
由 Thomas Gleixner 提交于
Use unsigned long for dealing with jiffies not int. Rename the callback to something sensible. Use __this_cpu_read/write for accessing per cpu data. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 24 5月, 2012 1 次提交
-
-
由 Andi Kleen 提交于
When running on 32bit the mce handler could misinterpret vm86 mode as ring 0. This can affect whether it does recovery or not; it was possible to panic when recovery was actually possible. Fix this by always forcing vm86 to look like ring 3. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Cc: <stable@vger.kernel.org> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 23 5月, 2012 1 次提交
-
-
由 Borislav Petkov 提交于
Got bitten again by the BIT() macro: arch/x86/kernel/cpu/mcheck/mce.c: In function '__mcheck_cpu_apply_quirks': arch/x86/kernel/cpu/mcheck/mce.c:1453:6: warning: left shift count >= width of type arch/x86/kernel/cpu/mcheck/mce.c:1454:7: warning: left shift count >= width of type Fix it already. Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com> Cc: Frank Arnold <frank.arnold@amd.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1337684026-19740-2-git-send-email-bp@amd64.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 5月, 2012 2 次提交
-
-
由 Tony Luck 提交于
Section 15.3.1.2 of the software developer manual has this to say about the RIPV bit in the IA32_MCG_STATUS register: RIPV (restart IP valid) flag, bit 0 — Indicates (when set) that program execution can be restarted reliably at the instruction pointed to by the instruction pointer pushed on the stack when the machine-check exception is generated. When clear, the program cannot be reliably restarted at the pushed instruction pointer. We need to save the state of this bit in do_machine_check() and use it in mce_notify_process() to force a signal; even if memory_failure() says it made a complete recovery ... e.g. replaced a clean LRU page. Acked-by: NBorislav Petkov <bp@amd64.org> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Alex Shi 提交于
Since percpu_xxx() serial functions are duplicated with this_cpu_xxx(). Removing percpu_xxx() definition and replacing them by this_cpu_xxx() in code. There is no function change in this patch, just preparation for later percpu_xxx serial function removing. On x86 machine the this_cpu_xxx() serial functions are same as __this_cpu_xxx() without no unnecessary premmpt enable/disable. Thanks for Stephen Rothwell, he found and fixed a i386 build error in the patch. Also thanks for Andrew Morton, he kept updating the patchset in Linus' tree. Signed-off-by: NAlex Shi <alex.shi@intel.com> Acked-by: NChristoph Lameter <cl@gentwo.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: N"H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 30 4月, 2012 1 次提交
-
-
由 Borislav Petkov 提交于
Turn off MC4_MISC thresholding banks on models which have them but that particular processor implementation does not supply applicable error sources to be counted. Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
-
- 20 4月, 2012 1 次提交
-
-
由 Tony Luck 提交于
Reading machine check bank registers is slow. There is a trend of increasing the number of banks, and the number of cores. The main section of do_machine_check() is a serialized section where each cpu in turn checks every bank. Even on a little two socket SandyBridge-EP system that multiplies out as: 2 sockets * 8 cores * 2 hyperthreads * 20 banks = 640 MSRs We already scan the banks in parallel in mce_no_way_out() to see if there is a fatal error anywhere in the system. If we build a cache of VALID bits during this scan, we can avoid uselessly re-reading banks that have no data. Note that this cache is only a hint. If the valid bit is set in a shared bank, all cpus that share that bank will see it during the parallel scan, but the first to find it in the sequential scan will (usually) clear the bank. Acked-by: NBorislav Petkov <borislav.petkov@amd.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 07 3月, 2012 1 次提交
-
-
由 Srivatsa S. Bhat 提交于
While booting, the following message is seen: [ 21.665087] =============================== [ 21.669439] [ INFO: suspicious RCU usage. ] [ 21.673798] 3.2.0-0.0.0.28.36b5ec9-default #2 Not tainted [ 21.681353] ------------------------------- [ 21.685864] arch/x86/kernel/cpu/mcheck/mce.c:194 suspicious rcu_dereference_index_check() usage! [ 21.695013] [ 21.695014] other info that might help us debug this: [ 21.695016] [ 21.703488] [ 21.703489] rcu_scheduler_active = 1, debug_locks = 1 [ 21.710426] 3 locks held by modprobe/2139: [ 21.714754] #0: (&__lockdep_no_validate__){......}, at: [<ffffffff8133afd3>] __driver_attach+0x53/0xa0 [ 21.725020] #1: [ 21.725323] ioatdma: Intel(R) QuickData Technology Driver 4.00 [ 21.733206] (&__lockdep_no_validate__){......}, at: [<ffffffff8133afe1>] __driver_attach+0x61/0xa0 [ 21.743015] #2: (i7core_edac_lock){+.+.+.}, at: [<ffffffffa01cfa5f>] i7core_probe+0x1f/0x5c0 [i7core_edac] [ 21.753708] [ 21.753709] stack backtrace: [ 21.758429] Pid: 2139, comm: modprobe Not tainted 3.2.0-0.0.0.28.36b5ec9-default #2 [ 21.768253] Call Trace: [ 21.770838] [<ffffffff810977cd>] lockdep_rcu_suspicious+0xcd/0x100 [ 21.777366] [<ffffffff8101aa41>] drain_mcelog_buffer+0x191/0x1b0 [ 21.783715] [<ffffffff8101aa78>] mce_register_decode_chain+0x18/0x20 [ 21.790430] [<ffffffffa01cf8db>] i7core_register_mci+0x2fb/0x3e4 [i7core_edac] [ 21.798003] [<ffffffffa01cfb14>] i7core_probe+0xd4/0x5c0 [i7core_edac] [ 21.804809] [<ffffffff8129566b>] local_pci_probe+0x5b/0xe0 [ 21.810631] [<ffffffff812957c9>] __pci_device_probe+0xd9/0xe0 [ 21.816650] [<ffffffff813362e4>] ? get_device+0x14/0x20 [ 21.822178] [<ffffffff81296916>] pci_device_probe+0x36/0x60 [ 21.828061] [<ffffffff8133ac8a>] really_probe+0x7a/0x2b0 [ 21.833676] [<ffffffff8133af23>] driver_probe_device+0x63/0xc0 [ 21.839868] [<ffffffff8133b01b>] __driver_attach+0x9b/0xa0 [ 21.845718] [<ffffffff8133af80>] ? driver_probe_device+0xc0/0xc0 [ 21.852027] [<ffffffff81339168>] bus_for_each_dev+0x68/0x90 [ 21.857876] [<ffffffff8133aa3c>] driver_attach+0x1c/0x20 [ 21.863462] [<ffffffff8133a64d>] bus_add_driver+0x16d/0x2b0 [ 21.869377] [<ffffffff8133b6dc>] driver_register+0x7c/0x160 [ 21.875220] [<ffffffff81296bda>] __pci_register_driver+0x6a/0xf0 [ 21.881494] [<ffffffffa01fe000>] ? 0xffffffffa01fdfff [ 21.886846] [<ffffffffa01fe047>] i7core_init+0x47/0x1000 [i7core_edac] [ 21.893737] [<ffffffff810001ce>] do_one_initcall+0x3e/0x180 [ 21.899670] [<ffffffff810a9b95>] sys_init_module+0xc5/0x220 [ 21.905542] [<ffffffff8149bc39>] system_call_fastpath+0x16/0x1b Fix this by using ACCESS_ONCE() instead of rcu_dereference_check_mce() over mcelog.next. Since the access to each entry is controlled by the ->finished field, ACCESS_ONCE() should work just fine. An rcu_dereference is unnecessary here. Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Suggested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <borislav.petkov@amd.com>
-
- 23 2月, 2012 2 次提交
-
-
由 Naoya Horiguchi 提交于
Current kernel MCE code reads ERST at the first reading of /dev/mcelog (maybe in starting mcelogd,) even if the system does not support ERST, which results in a fake "no such device" message (as described in [1].) This problem is not critical, but can confuse system admins. This patch fixes it by filtering the return value from lower (ACPI) layer. [1] http://thread.gmane.org/gmane.linux.kernel/1060250 Reported by: Jon Masters <jonathan@jonmasters.org> Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Huang Ying <ying.huang@intel.com> Link: https://lkml.org/lkml/2012/1/23/299Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Greg Kroah-Hartman 提交于
When I previously fixed up the mce_device code, I used a static array of the pointers. It was (rightfully) pointed out to me that I should be using the per_cpu code instead. This patch converts the code over to that structure, moving the variable back into the per_cpu area, like it used to be for 3.2 and earlier. Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de> Reviewed-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Link: https://lkml.org/lkml/2012/1/27/165Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 17 1月, 2012 1 次提交
-
-
由 Greg Kroah-Hartman 提交于
When suspending, there was a large list of warnings going something like: Device 'machinecheck1' does not have a release() function, it is broken and must be fixed This patch turns the static mce_devices into dynamically allocated, and properly frees them when they are removed from the system. It solves the warning messages on my laptop here. Reported-by: N"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> Reported-by: NLinus Torvalds <torvalds@linux-foundation.org> Tested-by: NDjalal Harouni <tixxdz@opendz.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@amd64.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 1月, 2012 1 次提交
-
-
由 Srivatsa S. Bhat 提交于
Commit 8a25a2fd ("cpu: convert 'cpu' and 'machinecheck' sysdev_class to a regular subsystem") changed how things are dealt with in the MCE subsystem. Some of the things that got broken due to this are CPU hotplug and suspend/hibernate. MCE uses per_cpu allocations of struct device. So, when a CPU goes offline and comes back online, in order to ensure that we start from a clean slate with respect to the MCE subsystem, zero out the entire per_cpu device structure to 0 before using it. Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 1月, 2012 4 次提交
-
-
由 Tony Luck 提交于
All non-urgent actions (reporting low severity errors and handling "action-optional" errors) are now handled by a work queue. This means that TIF_MCE_NOTIFY can be used to block execution for a thread experiencing an "action-required" fault until we get all cpus out of the machine check handler (and the thread that hit the fault into mce_notify_process(). We use the new mce_{save,find,clear}_info() API to get information from do_machine_check() to mce_notify_process(), and then use the newly improved memory_failure(..., MF_ACTION_REQUIRED) to handle the error (possibly signalling the process). Update some comments to make the new code flows clearer. Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Tony Luck 提交于
Machine checks on Intel cpus interrupt execution on all cpus, regardless of interrupt masking. We have a need to save some data about the cause of the machine check (physical address) in the machine check handler that can be retrieved later to attempt recovery in a more flexible execution state. Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Tony Luck 提交于
The MCI_STATUS_MISCV and MCI_STATUS_ADDRV bits in the bank status registers define whether the MISC and ADDR registers respectively contain valid data - provide a helper function to check these bits and read the registers when needed. In addition, processors that support software error recovery (as indicated by the MCG_SER_P bit in the MCG_CAP register) may include some undefined bits in the ADDR register - mask these out. Acked-by: NBorislav Petkov <bp@amd64.org> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Tony Luck 提交于
There is only one caller of memory_failure(), all other users call __memory_failure() and pass in the flags argument explicitly. The lone user of memory_failure() will soon need to pass flags too. Add flags argument to the callsite in mce.c. Delete the old memory_failure() function, and then rename __memory_failure() without the leading "__". Provide clearer message when action optional memory errors are ignored. Acked-by: NBorislav Petkov <bp@amd64.org> Signed-off-by: NTony Luck <tony.luck@intel.com>
-