- 21 6月, 2017 1 次提交
-
-
由 Juergen Gross 提交于
When running under Xen as dom0, /dev/mcelog is being provided by Xen instead of the normal mcelog character device of the MCE core. Convert an error message being issued by the MCE core in this case to an informative message that Xen has registered the device. Signed-off-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170614084059.19294-1-jgross@suse.com
-
- 14 6月, 2017 8 次提交
-
-
由 Yazen Ghannam 提交于
The bootlog option is only disabled by default on AMD Fam10h and older systems. Update bootlog description to say this. Change the family value to hex to avoid confusion. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170613162835.30750-9-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yazen Ghannam 提交于
AMD systems have non-core, shared MCA banks within a die. These banks are controlled by a master CPU per die. If this CPU is offlined then all the shared banks are disabled in addition to the CPU's core banks. Also, Fam17h systems may have SMT enabled. The MCA_CTL register is shared between SMT thread siblings. If a CPU is offlined then all its sibling's MCA banks are also disabled. Extend the existing vendor check to AMD too. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> [ Fix up comment. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170613162835.30750-8-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Populate the MCE injection struct before doing initial injection so that values which don't change have sane defaults. Tested-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NYazen Ghannam <yazen.ghannam@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20170613162835.30750-7-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Not really needed. Tested-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NYazen Ghannam <yazen.ghannam@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20170613162835.30750-6-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Make the mcelog call a notifier which lands in the injector module and does the injection. This allows for mce-inject to be a normal kernel module now. Tested-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NYazen Ghannam <yazen.ghannam@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20170613162835.30750-5-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Reuse mce_amd_inj's debugfs interface so that mce-inject can benefit from it too. The old functionality is still preserved under CONFIG_X86_MCELOG_LEGACY. Tested-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NYazen Ghannam <yazen.ghannam@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/20170613162835.30750-4-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yazen Ghannam 提交于
In the amd_threshold_interrupt() handler, we loop through every possible block in each bank and rediscover the block's address and if it's valid, e.g. valid, counter present and not locked. However, we already have the address saved in the threshold blocks list for each CPU and bank. The list only contains blocks that have passed all the valid checks. Besides the redundancy, there's also a smp_call_function* in get_block_address() which causes a warning when servicing the interrupt: WARNING: CPU: 0 PID: 0 at kernel/smp.c:281 smp_call_function_single+0xdd/0xf0 ... Call Trace: <IRQ> rdmsr_safe_on_cpu() get_block_address.isra.2() amd_threshold_interrupt() smp_threshold_interrupt() threshold_interrupt() because we do get called in an interrupt handler *with* interrupts disabled, which can result in a deadlock. Drop the redundant valid checks and move the overflow check, logging and block reset into a separate function. Check the first block then iterate over the rest. This procedure is needed since the first block is used as the head of the list. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170613162835.30750-3-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yazen Ghannam 提交于
The value of MCA_STATUS is used as the MSR when clearing MCA_STATUS. This may cause the following warning: unchecked MSR access error: WRMSR to 0x11b (tried to write 0x0000000000000000) Call Trace: <IRQ> smp_threshold_interrupt() threshold_interrupt() Use msr_stat instead which has the MSR address. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Fixes: 37d43acf ("x86/mce/AMD: Redo error logging from APIC LVT interrupt handlers") Link: http://lkml.kernel.org/r/20170613162835.30750-2-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 5月, 2017 4 次提交
-
-
由 Yazen Ghannam 提交于
Scalable MCA systems have a new MCA_CONFIG register that we use to configure each bank. We currently use this when we set up thresholding. However, this is logically separate. Group all SMCA-related initialization into a single function. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1493147772-2721-2-git-send-email-Yazen.Ghannam@amd.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Yazen Ghannam 提交于
We have support for the new SMCA MCA_DE{STAT,ADDR} registers in Linux. So we've used these registers in place of MCA_{STATUS,ADDR} on SMCA systems. However, the guidance for current SMCA implementations of is to continue using MCA_{STATUS,ADDR} and to use MCA_DE{STAT,ADDR} only if a Deferred error was not found in the former registers. If we logged a Deferred error in MCA_STATUS then we should also clear MCA_DESTAT. This also means we shouldn't clear MCA_CONFIG[LogDeferredInMcaStat]. Rework __log_error() to only log an error and add helpers for the different error types being logged from the corresponding interrupt handlers. Boris: carve out common functionality into a _log_error_bank(). Cleanup comments, check MCi_STATUS bits before reading MSRs. Streamline flow. Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1493147772-2721-1-git-send-email-Yazen.Ghannam@amd.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Elena Reshetova 提交于
The refcount_t type and corresponding API should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Suggested-by: NKees Cook <keescook@chromium.org> Signed-off-by: NElena Reshetova <elena.reshetova@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com> Reviewed-by: NDavid Windsor <dwindsor@gmail.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1492695536-5947-1-git-send-email-elena.reshetova@intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Borislav Petkov 提交于
Export the function which checks whether an MCE is a memory error to other users so that we can reuse the logic. Drop the boot_cpu_data use, while at it, as mce.cpuvendor already has the CPU vendor in there. Integrate a piece from a patch from Vishal Verma <vishal.l.verma@intel.com> to export it for modules (nfit). The main reason we're exporting it is that the nfit handler nfit_handle_mce() needs to detect a memory error properly before doing its recovery actions. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20170519093915.15413-2-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 19 4月, 2017 2 次提交
-
-
由 Borislav Petkov 提交于
mce_usable_address() does a bunch of basic sanity checks to verify whether the address reported with the error is usable for further processing. However, we do check MCi_STATUS[MISCV] and that is not needed on AMD as that bit says that there's additional information about the logged error in the MCi_MISCj banks. But we don't need that to know whether the address is usable - we only need to know whether the physical address is valid - i.e., ADDRV. On Intel the MISCV bit is needed to perform additional checks to determine whether the reported address is a physical one, etc. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Yazen Ghannam <yazen.ghannam@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170418183924.6agjkebilwqj26or@pd.tnicSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Vishal Verma 提交于
The NFIT MCE handler callback (for handling media errors on NVDIMMs) takes a mutex to add the location of a memory error to a list. But since the notifier call chain for machine checks (x86_mce_decoder_chain) is atomic, we get a lockdep splat like: BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 in_atomic(): 1, irqs_disabled(): 0, pid: 4, name: kworker/0:0 [..] Call Trace: dump_stack ___might_sleep __might_sleep mutex_lock_nested ? __lock_acquire nfit_handle_mce notifier_call_chain atomic_notifier_call_chain ? atomic_notifier_call_chain mce_gen_pool_process Convert the notifier to a blocking one which gets to run only in process context. Boris: remove the notifier call in atomic context in print_mce(). For now, let's print the MCE on the atomic path so that we can make sure they go out and get logged at least. Fixes: 6839a6d9 ("nfit: do an ARS scrub on hitting a latent media error") Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NVishal Verma <vishal.l.verma@intel.com> Acked-by: NTony Luck <tony.luck@intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Cc: <stable@vger.kernel.org> Link: http://lkml.kernel.org/r/20170411224457.24777-1-vishal.l.verma@intel.comSigned-off-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 18 4月, 2017 1 次提交
-
-
由 Borislav Petkov 提交于
Update the check which enforces the registration of MCE decoder notifier callbacks with valid priority only, to include mcelog's priority. Reported-by: Nkernel test robot <xiaolong.ye@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: lkp@01.org Link: http://lkml.kernel.org/r/20170418073820.i6kl5tggcntwlisa@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 4月, 2017 1 次提交
-
-
由 Piotr Luc 提交于
Intel Xeon Phi processors (KNL and KNM) support PPIN as well, so add their CPUIDs to the whitelist of supported processors. Signed-off-by: NPiotr Luc <piotr.luc@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170408172004.8463-1-piotr.luc@intel.com Link: http://lkml.kernel.org/r/20170413201056.10525-1-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 31 3月, 2017 1 次提交
-
-
由 Yazen Ghannam 提交于
MCA bank 3 is reserved on systems pre-Fam17h, so it didn't have a name. However, MCA bank 3 is defined on Fam17h systems and can be accessed using legacy MSRs. Without a name we get a stack trace on Fam17h systems when trying to register sysfs files for bank 3 on kernels that don't recognize Scalable MCA. Call MCA bank 3 "decode_unit" since this is what it represents on Fam17h. This will allow kernels without SMCA support to see this bank on Fam17h+ and prevent the stack trace. This will not affect older systems since this bank is reserved on them, i.e. it'll be ignored. Tested on AMD Fam15h and Fam17h systems. WARNING: CPU: 26 PID: 1 at lib/kobject.c:210 kobject_add_internal kobject: (ffff88085bb256c0): attempted to be registered with empty name! ... Call Trace: kobject_add_internal kobject_add kobject_create_and_add threshold_create_device threshold_init_device Signed-off-by: NYazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/1490102285-3659-1-git-send-email-Yazen.Ghannam@amd.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 28 3月, 2017 6 次提交
-
-
由 Borislav Petkov 提交于
This is just a defensive precaution: do not register notifiers with a priority which would disrupt the error handling in the notifiers with prio higher than MCE_PRIO_EDAC. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-7-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Tony Luck 提交于
Move all code relating to /dev/mcelog to a separate source file. /dev/mcelog driver can now operate from the machine check notifier with lowest prio. Signed-off-by: NTony Luck <tony.luck@intel.com> [ Move the mce_helper and trigger functionality behind CONFIG_X86_MCELOG_LEGACY. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-6-bp@alien8.de [ Renamed CONFIG_X86_MCELOG to CONFIG_X86_MCELOG_LEGACY. ] Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Introduce a simple data structure for collecting correctable errors along with accessors. More detailed description in the code itself. The error decoding is done with the decoding chain now and mce_first_notifier() gets to see the error first and the CEC decides whether to log it and then the rest of the chain doesn't hear about it - basically the main reason for the CE collector - or to continue running the notifiers. When the CEC hits the action threshold, it will try to soft-offine the page containing the ECC and then the whole decoding chain gets to see the error. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-5-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
It is confusing when staring at "struct mce_log mcelog" and then there's also a function called mce_log(). So call the buffer what it is. No functionality change. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-4-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
We call it everywhere "struct mce *m". Adjust that here too to avoid confusion. No functionality change. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170327093304.10683-3-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andi Kleen 提交于
Since: cd9c57ca ("x86/MCE: Dump MCE to dmesg if no consumers") all MCEs are printed even when mcelog is running. Fix the regression to not print to dmesg when mcelog is running as it is a consumer too. Signed-off-by: NAndi Kleen <ak@linux.intel.com> [ Massage commit message. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org # 4.10.. Fixes: cd9c57ca ("x86/MCE: Dump MCE to dmesg if no consumers") Link: http://lkml.kernel.org/r/20170327093304.10683-2-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 3月, 2017 1 次提交
-
-
由 Yazen Ghannam 提交于
When the MCA banks in __mcheck_cpu_init_generic() are polled for leftover errors logged during boot or from the previous boot, its required to have CPU features detected sufficiently so that the reading out and handling of those early errors is done correctly. If those features are not available, the decoding may miss some information and get incomplete errors logged. For example, on SMCA systems the MCA_IPID and MCA_SYND registers are not logged and MCA_ADDR is not masked appropriately. To cure that, do a subset of the basic feature detection early while the rest happens in its usual place in __mcheck_cpu_init_vendor(). Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/1489599055-20756-1-git-send-email-Yazen.Ghannam@amd.com [ Massage commit message and simplify. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 14 3月, 2017 1 次提交
-
-
由 Xunlei Pang 提交于
When we are about to kexec a crash kernel and right then and there a broadcasted MCE fires while we're still in the first kernel and while the other CPUs remain in a holding pattern, the #MC handler of the first kernel will timeout and then panic due to never completing MCE synchronization. Handle this in a similar way as to when the CPUs are offlined when that broadcasted MCE happens. [ Boris: rewrote commit message and comments. ] Suggested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NXunlei Pang <xlpang@redhat.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NTony Luck <tony.luck@intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: kexec@lists.infradead.org Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1487857012-9059-1-git-send-email-xlpang@redhat.com Link: http://lkml.kernel.org/r/20170313095019.19351-1-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 01 2月, 2017 1 次提交
-
-
由 Thomas Gleixner 提交于
Erik reported that on a preproduction hardware a CMCI storm triggers the BUG_ON in add_timer_on(). The reason is that the per CPU MCE timer is started by the CMCI logic before the MCE CPU hotplug callback starts the timer with add_timer_on(). So the timer is already queued which triggers the BUG. Using add_timer_on() is pretty pointless in this code because the timer is strictlty per CPU, initialized as pinned and all operations which arm the timer happen on the CPU to which the timer belongs. Simplify the whole machinery by using mod_timer() instead of add_timer_on() which avoids the problem because mod_timer() can handle already queued timers. Use __start_timer() everywhere so the earliest armed expiry time is preserved. Reported-by: NErik Veijola <erik.veijola@intel.com> Tested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NBorislav Petkov <bp@alien8.de> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1701310936080.3457@nanosSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 24 1月, 2017 6 次提交
-
-
由 Borislav Petkov 提交于
Assign all notifiers on the MCE decode chain a priority so that they get called in the correct order. Suggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-10-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Make mce_gen_pool_process() the workqueue function directly and save us an indirection. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-9-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Add the TSC value to the MCE record only when the MCE being logged is precise, i.e., it is logged as an exception or an MCE-related interrupt. So it doesn't look particularly easy to do without touching/changing a bunch of places. That's why I'm trying tricks first. For example, the mce-apei.c case I'm addressing by setting ->tsc only for errors of panic severity. The idea there is, that, panic errors will have raised an #MC and not polled. And then instead of propagating a flag to mce_setup(), it seems easier/less code to set ->tsc depending on the call sites, i.e., are we polling or are we preparing an MCE record in an exception handler/thresholding interrupt. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-5-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Yazen Ghannam 提交于
Currently, we append the MCA_IPID[InstanceId] to the bank name to create the sysfs filename. The InstanceId field uniquely identifies a bank instance but it doesn't look very nice for most banks. Replace the InstanceId with a simpler, ascending (0, 1, ..) value. Only use this in the sysfs name when there is more than 1 instance. Otherwise, just use the bank's name as the sysfs name. Signed-off-by: NYazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/1484322741-41884-3-git-send-email-Yazen.Ghannam@amd.com Link: http://lkml.kernel.org/r/20170123183514.13356-4-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
We log a fake bank 128 MCE to note that we're handling a CPU thermal event. However, this confuses people into thinking that their hardware generates MCEs. Hijacking MCA for logging thermal events is a gross misuse anyway and it shouldn't have been done in the first place. And besides we have other means for dealing with thermal events which are much more suitable. So let's kill the MCE logging part. Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NAshok Raj <ashok.raj@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170105213846.GA12024@gmail.com Link: http://lkml.kernel.org/r/20170123183514.13356-3-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
... and get rid of the annoying: arch/x86/kernel/cpu/mcheck/mce-inject.c:97:13: warning: ‘mce_irq_ipi’ defined but not used [-Wunused-function] when doing randconfig builds. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170123183514.13356-2-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 1月, 2017 1 次提交
-
-
This patch adds the __irq_entry annotation to the default x86 platform IRQ handlers. ftrace's function_graph tracer uses the __irq_entry annotation to notify the entry and return of IRQ handlers. For example, before the patch: 354549.667252 | 3) d..1 | default_idle_call() { 354549.667252 | 3) d..1 | arch_cpu_idle() { 354549.667253 | 3) d..1 | default_idle() { 354549.696886 | 3) d..1 | smp_trace_reschedule_interrupt() { 354549.696886 | 3) d..1 | irq_enter() { 354549.696886 | 3) d..1 | rcu_irq_enter() { After the patch: 366416.254476 | 3) d..1 | arch_cpu_idle() { 366416.254476 | 3) d..1 | default_idle() { 366416.261566 | 3) d..1 ==========> | 366416.261566 | 3) d..1 | smp_trace_reschedule_interrupt() { 366416.261566 | 3) d..1 | irq_enter() { 366416.261566 | 3) d..1 | rcu_irq_enter() { KASAN also uses this annotation. The smp_apic_timer_interrupt() was already annotated. Signed-off-by: NDaniel Bristot de Oliveira <bristot@redhat.com> Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Claudio Fontana <claudio.fontana@huawei.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nicolai Stange <nicstange@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/059fdf437c2f0c09b13c18c8fe4e69999d3ffe69.1483528431.git.bristot@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 27 12月, 2016 1 次提交
-
-
由 Thomas Gleixner 提交于
If mce_device_init() fails then the mce device pointer is NULL and the AMD mce code happily dereferences it. Add a sanity check. Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Reported-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 12月, 2016 1 次提交
-
-
由 Linus Torvalds 提交于
This was entirely automated, using the script by Al: PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>' sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \ $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h) to do the replacement at the end of the merge window. Requested-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 12月, 2016 1 次提交
-
-
由 Thomas Gleixner 提交于
One include less is always a good thing(tm). Good riddance. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Jiri Olsa <jolsa@redhat.com> Link: http://lkml.kernel.org/r/20161209182912.2726-6-bp@alien8.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 23 11月, 2016 3 次提交
-
-
由 Tony Luck 提交于
Intel Xeons from Ivy Bridge onwards support a processor identification number set in the factory. To the user this is a handy unique number to identify a particular CPU. Intel can decode this to the fab/production run to track errors. On systems that have it, include it in the machine check record. I'm told that this would be helpful for users that run large data centers with multi-socket servers to keep track of which CPUs are seeing errors. Boris: * Add some clarifying comments and spacing. * Mask out [63:2] in the disabled-but-not-locked case * Call the MSR variable "val" for more readability. Signed-off-by: NTony Luck <tony.luck@intel.com> Cc: Ashok Raj <ashok.raj@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: x86-ml <x86@kernel.org> Link: http://lkml.kernel.org/r/20161123114855.njguoaygp3qnbkia@pd.tnicSigned-off-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Thomas Gleixner 提交于
No point to have the sysfs files around before the cpu is online and no point to have them around until the cpu is dead. Get rid of the explicit state. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Sebastian Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de>
-
Install the callbacks via the state machine and let the core invoke the callbacks on the already online CPUs. Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: rt@linuxtronix.de Cc: Borislav Petkov <bp@alien8.de> Cc: linux-edac@vger.kernel.org Link: http://lkml.kernel.org/r/20161117183541.8588-2-bigeasy@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-