- 14 9月, 2015 1 次提交
-
-
If the ACPI APEI firmware handles hardware error first (called "firmware first handling"), the firmware updates the GHES memory region with hardware error record (called "generic hardware error record"). Essentially the firmware writes hardware error records in the GHES memory region, triggers an NMI/interrupt, then the GHES driver goes off and grabs the error record from the GHES region. The kernel currently maps the GHES memory region as cacheable (PAGE_KERNEL) for all architectures. However, on some arm64 platforms, there is a mismatch between how the kernel maps the GHES region (PAGE_KERNEL) and how the firmware maps it (EFI_MEMORY_UC, ie. uncacheable), leading to the possibility of the kernel GHES driver reading stale data from the cache when it receives the interrupt. With stale data being read, the kernel is unaware there is new hardware error to be handled when there actually is; this may lead to further damage in various scenarios, such as error propagation caused data corruption. If uncorrected error (such as double bit ECC error) happened in memory operation and if the kernel is unaware of such an event happening, errorneous data may be propagated to the disk. Instead GHES memory region should be mapped with page protection type according to what is returned from arch_apei_get_mem_attribute(). Signed-off-by: NJonathan (Zhixiong) Zhang <zjzhang@codeaurora.org> Signed-off-by: NMatt Fleming <matt.fleming@intel.com> [ Small stylistic tweaks. ] Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk> Acked-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1441372302-23242-3-git-send-email-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 08 7月, 2015 1 次提交
-
-
由 Jarkko Nikula 提交于
There is no need to carry potentially outdated Free Software Foundation mailing address in file headers since the COPYING file includes it. Signed-off-by: NJarkko Nikula <jarkko.nikula@linux.intel.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 28 4月, 2015 5 次提交
-
-
由 Jiri Kosina 提交于
Since GHES sources are global, we theoretically need only a single CPU reading them per NMI instead of a thundering herd of CPUs waiting on a spinlock in NMI context for no reason at all. Do that. Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Borislav Petkov 提交于
There's no real need to iterate twice over the HW error sources in the NMI handler. With the previous cleanups, elliminating the second loop is almost trivial. Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Borislav Petkov 提交于
The moment we log an error of panic severity, there's no need to noodle through the ghes_nmi list anymore. So panic instead right then and there. Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Borislav Petkov 提交于
... into another function for more clarity. No functionality change. Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Borislav Petkov 提交于
Make the handler more readable. No functionality change. Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 22 10月, 2014 2 次提交
-
-
由 Borislav Petkov 提交于
It is used only in ghes.c. Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Chen, Gong 提交于
We have a generic function to reverse a lockless list, kill homegrown copy. Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> Link: http://lkml.kernel.org/r/1406530260-26078-2-git-send-email-gong.chen@linux.intel.comAcked-by: NTony Luck <tony.luck@intel.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> [ Boris: correct commit msg ] Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 20 10月, 2014 1 次提交
-
-
由 Wolfram Sang 提交于
A platform_driver does not need to set an owner, it will be populated by the driver core. Signed-off-by: NWolfram Sang <wsa@the-dreams.de>
-
- 23 7月, 2014 3 次提交
-
-
由 Tomasz Nowicki 提交于
GHES currently maps two pages with atomic_ioremap. From now on, NMI is architectural depended so there is no need to allocate an NMI page for platforms without NMI support. To make it possible to not use a second page, swap the existing page order so that the IRQ context page is first, and the optional NMI context page is second. Then, use HAVE_ACPI_APEI_NMI to decide how many pages are to be allocated. Signed-off-by: NTomasz Nowicki <tomasz.nowicki@linaro.org> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Tomasz Nowicki 提交于
Currently APEI depends on x86 architecture. It is because of NMI hardware error notification of GHES which is currently supported by x86 only. However, many other APEI features can be still used perfectly by other architectures. This commit adds two symbols: 1. HAVE_ACPI_APEI for those archs which support APEI. 2. HAVE_ACPI_APEI_NMI which is used for NMI code isolation in ghes.c file. NMI related data and functions are grouped so they can be wrapped inside one #ifdef section. Appropriate function stubs are provided for !NMI case. Note there is no functional changes for x86 due to hard selected HAVE_ACPI_APEI and HAVE_ACPI_APEI_NMI symbols. Signed-off-by: NTomasz Nowicki <tomasz.nowicki@linaro.org> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
由 Tomasz Nowicki 提交于
This commit abstracts MCE calls and provides weak corresponding default implementation for those architectures which do not need arch specific actions. Each platform willing to do additional architectural actions should provides desired function definition. It allows us to avoid wrap code into #ifdef in generic code and prevent new platform from introducing dummy stub function too. Initially, there are two APEI arch-specific calls: - arch_apei_enable_cmcff() - arch_apei_report_mem_error() Both interact with MCE driver for X86 architecture. Signed-off-by: NTomasz Nowicki <tomasz.nowicki@linaro.org> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 17 6月, 2014 1 次提交
-
-
由 Lv Zheng 提交于
ACPICA: Restore error table definitions to reduce code differences between Linux and ACPICA upstream. The following commit has changed ACPICA table header definitions: Commit: 88f074f4 Subject: ACPI, CPER: Update cper info While such definitions are currently maintained in ACPICA. As the modifications applying to the table definitions affect other OSPMs' drivers, it is very difficult for ACPICA to initiate a process to complete the merge. Thus this commit finally only leaves us divergences. Revert such naming modifications to reduce the source code differecnes between Linux and ACPICA upstream. No functional changes. Signed-off-by: NLv Zheng <lv.zheng@intel.com> Cc: Bob Moore <robert.moore@intel.com> Cc: Chen, Gong <gong.chen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Borislav Petkov <bp@alien8.de> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 21 12月, 2013 2 次提交
-
-
由 Chen, Gong 提交于
Cleanup the logic in ghes_handle_memory_failure(). While at it, add proper PFN validity check for UC error and cleanup the code logic to make it simpler and cleaner. Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> Acked-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1385363701-12387-2-git-send-email-gong.chen@linux.intel.com [ Boris: massage commit message. ] Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Chen, Gong 提交于
Currently SCI is employed to handle corrected errors - memory corrected errors, more specifically but in fact SCI still can be used to handle any errors, e.g. uncorrected or even fatal ones if enabled by the BIOS. Enable logging for those kinds of errors too. Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> Acked-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Link: http://lkml.kernel.org/r/1385363701-12387-1-git-send-email-gong.chen@linux.intel.com [ Boris: massage commit message, rename function arg. ] Signed-off-by: NBorislav Petkov <bp@suse.de>
-
- 07 12月, 2013 1 次提交
-
-
由 Lv Zheng 提交于
To avoid build problems and breaking dependencies between ACPI header files, <acpi/acpi.h> should not be included directly by code outside of the ACPI core subsystem. However, that is possible if <linux/acpi_io.h> is included, because that file contains a direct inclusion of <acpi/acpi.h>. For this reason, remove the direct <acpi/acpi.h> inclusion from <linux/acpi_io.h>, move that file from include/linux/ to include/acpi/ and make <linux/acpi.h> include it for CONFIG_ACPI set along with the other ACPI header files. Accordingly, Remove the inclusions of <linux/acpi_io.h> from everywhere. Of course, that causes the contents of the new <acpi/acpi_io.h> file to be available for CONFIG_ACPI set only, so intel_opregion.o that depends on it should also depend on CONFIG_ACPI (and it really should not be compiled for CONFIG_ACPI unset anyway). References: https://01.org/linuxgraphics/sites/default/files/documentation/acpi_igd_opregion_spec.pdf Cc: Matthew Garrett <mjg59@srcf.ucam.org> Signed-off-by: NLv Zheng <lv.zheng@intel.com> Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch> [rjw: Subject and changelog] Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 24 10月, 2013 1 次提交
-
-
由 Chen, Gong 提交于
In latest UEFI spec(by now it is 2.4) memory error definition for CPER (UEFI 2.4 Appendix N Common Platform Error Record) adds some new fields. These fields help people to locate memory error to an actual DIMM location. Original-author: Tony Luck <tony.luck@intel.com> Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> Reviewed-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NMauro Carvalho Chehab <m.chehab@samsung.com> Acked-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 22 10月, 2013 1 次提交
-
-
由 Chen, Gong 提交于
We have a lot of confusing names of functions and data structures in amongs the the error reporting code. In particular the "apei" prefix has been applied to many objects that are not part of APEI. Since we will be using these routines for extended error log reporting it will be clearer if we fix up the names first. Signed-off-by: NChen, Gong <gong.chen@linux.intel.com> Acked-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NMauro Carvalho Chehab <m.chehab@samsung.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 11 7月, 2013 1 次提交
-
-
由 Naveen N. Rao 提交于
If the firmware indicates in GHES error data entry that the error threshold has exceeded for a corrected error event, then we try to soft-offline the page. This could be called in interrupt context, so we queue this up similar to how we handle memory failure scenarios. Signed-off-by: NNaveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Acked-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 07 6月, 2013 1 次提交
-
-
由 Betty Dall 提交于
The CPER error record has a reset bit that indicates that the platform has reset the component. The reset bit can be set for any severity error including recoverable. From the AER code path's perspective, any error is fatal if the component has been reset. This patch upgrades the severity of the AER recovery to AER_FATAL whenever the CPER error record indicates that the component has been reset. [bhelgaas: s/bus has been reset/component has been reset/] Signed-off-by: NBetty Dall <betty.dall@hp.com> Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>
-
- 05 6月, 2013 1 次提交
-
-
由 Wei Yongjun 提交于
Fix to return a negative error code in the acpi_gsi_to_irq() and request_irq() error handling case instead of 0, as done elsewhere in this function. Signed-off-by: NWei Yongjun <yongjun_wei@trendmicro.com.cn> Reviewed-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 31 5月, 2013 1 次提交
-
-
由 Lance Ortiz 提交于
The following warning was seen on 3.9 when a corrected PCIe error was being handled by the AER subsystem. WARNING: at .../drivers/pci/search.c:214 pci_get_dev_by_id+0x8a/0x90() This occurred because a call to pci_get_domain_bus_and_slot() was added to cper_print_pcie() to setup for the call to cper_print_aer(). The warning showed up because cper_print_pcie() is called in an interrupt context and pci_get* functions are not supposed to be called in that context. The solution is to move the cper_print_aer() call out of the interrupt context and into aer_recover_work_func() to avoid any warnings when calling pci_get* functions. Signed-off-by: NLance Ortiz <lance.ortiz@hp.com> Acked-by: NBorislav Petkov <bp@suse.de> Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NTony Luck <tony.luck@intel.com>
-
- 26 2月, 2013 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
In order to allow reporting errors via EDAC, add hooks for: 1) register an EDAC driver; 2) unregister an EDAC driver; 3) report errors via EDAC. As the EDAC driver will need to access the ghes structure, adds it as one of the parameters for ghes_do_proc. Acked-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
-
- 22 2月, 2013 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
As a ghes_edac driver will need to access ghes structures, in order to properly handle the errors, move those structures to a separate header file. No functional changes. Signed-off-by: NMauro Carvalho Chehab <mchehab@redhat.com>
-
- 29 11月, 2012 1 次提交
-
-
由 Bill Pemberton 提交于
CONFIG_HOTPLUG is going away as an option so __devinit is no longer needed. Signed-off-by: NBill Pemberton <wfp5p@virginia.edu> Acked-by: NRafael J. Wysocki <rjw@sisk.pl> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 22 11月, 2012 1 次提交
-
-
由 Bill Pemberton 提交于
CONFIG_HOTPLUG is going away as an option so __devexit is no longer needed. Signed-off-by: NBill Pemberton <wfp5p@virginia.edu> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 12 6月, 2012 1 次提交
-
-
由 Huang Ying 提交于
This patch fixed the following bug. https://bugzilla.kernel.org/show_bug.cgi?id=43282 This is caused by a firmware bug checking (checking generic address register provided by firmware) in runtime. The checking should be done in address mapping time instead of runtime to avoid too much error reporting in runtime. Reported-by: NPawel Sikora <pluto@agmk.net> Signed-off-by: NHuang Ying <ying.huang@intel.com> Tested-by: NJean Delvare <khali@linux-fr.org> Cc: stable@vger.kernel.org Signed-off-by: NLen Brown <len.brown@intel.com>
-
- 17 1月, 2012 4 次提交
-
-
由 Myron Stowe 提交于
APEI needs memory access in interrupt context. The obvious choice is acpi_read(), but originally it couldn't be used in interrupt context because it makes temporary mappings with ioremap(). Therefore, we added drivers/acpi/atomicio.c, which provides: acpi_pre_map_gar() -- ioremap in process context acpi_atomic_read() -- memory access in interrupt context acpi_post_unmap_gar() -- iounmap Later we added acpi_os_map_generic_address() (29718521) and enhanced acpi_read() so it works in interrupt context as long as the address has been previously mapped (620242ae). Now this sequence: acpi_os_map_generic_address() -- ioremap in process context acpi_read()/apei_read() -- now OK in interrupt context acpi_os_unmap_generic_address() is equivalent to what atomicio.c provides. This patch introduces apei_read() and apei_write(), which currently are functional equivalents of acpi_read() and acpi_write(). This is mainly proactive, to prevent APEI breakages if acpi_read() and acpi_write() are ever augmented to support the 'bit_offset' field of GAS, as APEI's __apei_exec_write_register() precludes splitting up functionality related to 'bit_offset' and APEI's 'mask' (see its APEI_EXEC_PRESERVE_REGISTER block). With apei_read() and apei_write() in place, usages of atomicio routines are converted to apei_read()/apei_write() and existing calls within osl.c and the CA, based on the re-factoring that was done in an earlier patch series - http://marc.info/?l=linux-acpi&m=128769263327206&w=2: acpi_pre_map_gar() --> acpi_os_map_generic_address() acpi_post_unmap_gar() --> acpi_os_unmap_generic_address() acpi_atomic_read() --> apei_read() acpi_atomic_write() --> apei_write() Note that acpi_read() and acpi_write() currently use 'bit_width' for accessing GARs which seems incorrect. 'bit_width' is the size of the register, while 'access_width' is the size of the access the processor must generate on the bus. The 'access_width' may be larger, for example, if the hardware only supports 32-bit or 64-bit reads. I wanted to minimize any possible impacts with this patch series so I did *not* change this behavior. Signed-off-by: NMyron Stowe <myron.stowe@redhat.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
Because printk is not safe inside NMI handler, the recoverable error records received in NMI handler will be queued to be printked in a delayed IRQ context via irq_work. If a fatal error occurs after the recoverable error and before the irq_work processed, we lost a error report. To solve the issue, the queued error records are printked in NMI handler if system will go panic. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
In most cases, printk only guarantees messages from different printk calling will not be interleaved between each other. But, one APEI GHES hardware error report will involve multiple printk calling, normally each for one line. So it is possible that the hardware error report comes from different generic hardware error source will be interleaved. In this patch, a sequence number is prefixed to each line of error report. So that, even if they are interleaved, they still can be distinguished by the prefixed sequence number. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
aer_recover_queue() is called when recoverable PCIe AER errors are notified by firmware to do the recovery work. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
- 13 1月, 2012 1 次提交
-
-
由 Rusty Russell 提交于
module_param(bool) used to counter-intuitively take an int. In fddd5201 (mid-2009) we allowed bool or int/unsigned int using a messy trick. It's time to remove the int/unsigned int option. For this version it'll simply give a warning, but it'll break next kernel version. Acked-by: NMauro Carvalho Chehab <mchehab@redhat.com> Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
-
- 10 10月, 2011 1 次提交
-
-
由 Don Zickus 提交于
Just convert all the files that have an nmi handler to the new routines. Most of it is straight forward conversion. A couple of places needed some tweaking like kgdb which separates the debug notifier from the nmi handler and mce removes a call to notify_die. [Thanks to Ying for finding out the history behind that mce call https://lkml.org/lkml/2010/5/27/114 And Boris responding that he would like to remove that call because of it https://lkml.org/lkml/2011/9/21/163] The things that get converted are the registeration/unregistration routines and the nmi handler itself has its args changed along with code removal to check which list it is on (most are on one NMI list except for kgdb which has both an NMI routine and an NMI Unknown routine). Signed-off-by: NDon Zickus <dzickus@redhat.com> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: NCorey Minyard <minyard@acm.org> Cc: Jason Wessel <jason.wessel@windriver.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Robert Richter <robert.richter@amd.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Corey Minyard <minyard@acm.org> Cc: Jack Steiner <steiner@sgi.com> Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.comSigned-off-by: NIngo Molnar <mingo@elte.hu>
-
- 03 8月, 2011 4 次提交
-
-
由 Len Brown 提交于
drivers/acpi/apei/ghes.c:542: warning: integer overflow in expression drivers/acpi/apei/ghes.c:619: warning: integer overflow in expression ghes.c:(.text+0x46289): undefined reference to `__udivdi3' in function ghes_estatus_cache_add(). Reported-by: NRandy Dunlap <rdunlap@xenotime.net> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
memory_failure_queue() is called when recoverable memory errors are notified by firmware to do the recovery work. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
printk is used by GHES to report hardware errors. Ratelimit is enforced on the printk to avoid too many hardware error reports in kernel log. Because there may be thousands or even millions of corrected hardware errors during system running. Currently, a simple scheme is used. That is, the total number of hardware error reporting is ratelimited. This may cause some issues in practice. For example, there are two kinds of hardware errors occurred in system. One is corrected memory error, because the fault memory address is accessed frequently, there may be hundreds error report per-second. The other is corrected PCIe AER error, it will be reported once per-second. Because they share one ratelimit control structure, it is highly possible that only memory error is reported. To avoid the above issue, an error record content based throttle algorithm is implemented in the patch. Where after the first successful reporting, all error records that are same are throttled for some time, to let other kinds of error records have the opportunity to be reported. In above example, the memory errors will be throttled for some time, after being printked. Then the PCIe AER error will be printked successfully. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
Some APEI GHES recoverable errors are reported via NMI, but printk is not safe in NMI context. To solve the issue, a lock-less memory allocator is used to allocate memory in NMI handler, save the error record into the allocated memory, put the error record into a lock-less list. On the other hand, an irq_work is used to delay the operation from NMI context to IRQ context. The irq_work IRQ handler will remove nodes from lock-less list, printk the error record and do some further processing include recovery operation, then free the memory. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
- 14 7月, 2011 2 次提交
-
-
由 Huang Ying 提交于
APEI firmware first mode must be turned on explicitly on some machines, otherwise there may be no GHES hardware error record for hardware error notification. APEI bit in generic _OSC call can be used to do that, but on some machine, a special WHEA _OSC call must be used. This patch adds the support to that WHEA _OSC call. Signed-off-by: NHuang Ying <ying.huang@intel.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NMatthew Garrett <mjg@redhat.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-
由 Huang Ying 提交于
Some machine may have broken firmware so that GHES and firmware first mode should be disabled. This patch adds support to that. Signed-off-by: NHuang Ying <ying.huang@intel.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NMatthew Garrett <mjg@redhat.com> Signed-off-by: NLen Brown <len.brown@intel.com>
-