Commits · 5ba82ab534a325d310fe02af1c149f1072792c7b · openeuler / Kernel

17 Jan, 2012 · 2 commits

ACPI, APEI, GHES, Distinguish interleaved error report in kernel log · 5ba82ab5

Committed by Huang Ying on Dec 08, 2011

In most cases, printk only guarantees that messages from different
printk calls will not be interleaved with each other.  But one APEI
GHES hardware error report involves multiple printk calls, normally
one per line.  So hardware error reports that come from different
generic hardware error sources may be interleaved.

In this patch, a sequence number is prefixed to each line of an error
report, so that even if reports are interleaved, they can still be
distinguished by the prefixed sequence number.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
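
The idea in the commit above can be illustrated with a small user-space sketch. It is not the kernel code: the counter name, the helper names and the `{N}` prefix format are assumptions chosen for illustration. One sequence number is reserved per report and then prefixed to every line of that report, so lines from two reports remain distinguishable even when they are interleaved.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical global counter: one value is consumed per error report. */
static atomic_uint report_seqno;

/* Reserve a sequence number for a whole (multi-line) report. */
static unsigned int report_begin(void)
{
    return atomic_fetch_add(&report_seqno, 1);
}

/* Format one line of a report, tagged with that report's number.
 * The "{N}" prefix layout is an assumption made for this sketch. */
static int report_line(char *buf, size_t len, unsigned int seqno,
                       const char *text)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "{%u}[Hardware Error]: %s", seqno, text);
}
```

A caller would invoke `report_begin()` once per error record and pass the returned value to every `report_line()` call for that record.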

ACPI, APEI, GHES: Add PCIe AER recovery support · a654e5ee

Committed by Huang Ying on Dec 08, 2011

When firmware notifies recoverable PCIe AER errors, aer_recover_queue()
is called to do the recovery work.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

10 Oct, 2011 · 1 commit

x86, nmi: Wire up NMI handlers to new routines · 9c48f1c6

Committed by Don Zickus on Sep 30, 2011

Just convert all the files that have an nmi handler to the new routines.
Most of it is a straightforward conversion.  A couple of places needed some
tweaking, like kgdb, which separates the debug notifier from the nmi handler,
and mce, which removes a call to notify_die.

[Thanks to Ying for finding out the history behind that mce call

https://lkml.org/lkml/2010/5/27/114

And Boris responding that he would like to remove that call because of it

https://lkml.org/lkml/2011/9/21/163]

The things that get converted are the registration/unregistration routines;
the nmi handler itself has its args changed, along with removal of the code
that checks which list it is on (most are on one NMI list except for kgdb,
which has both an NMI routine and an NMI Unknown routine).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Corey Minyard <minyard@acm.org>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Link: http://lkml.kernel.org/r/1317409584-23662-4-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>

03 Aug, 2011 · 4 commits

APEI GHES: 32-bit buildfix · 70cb6e1d

Committed by Len Brown on Aug 02, 2011

drivers/acpi/apei/ghes.c:542: warning: integer overflow in expression
drivers/acpi/apei/ghes.c:619: warning: integer overflow in expression

ghes.c:(.text+0x46289): undefined reference to `__udivdi3'
  in function ghes_estatus_cache_add().
Reported-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI, APEI, GHES: Add hardware memory error recovery support · ba61ca4a

Committed by Huang Ying on Jul 13, 2011

When firmware notifies recoverable memory errors, memory_failure_queue()
is called to do the recovery work.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI, APEI, GHES, Error records content based throttle · 152cef40

Committed by Huang Ying on Jul 13, 2011

printk is used by GHES to report hardware errors.  A ratelimit is
enforced on the printk to avoid too many hardware error reports in the
kernel log, because there may be thousands or even millions of
corrected hardware errors while the system is running.

Currently, a simple scheme is used: the total number of hardware error
reports is ratelimited.  This may cause some issues in practice.

For example, suppose two kinds of hardware errors occur in the system.
One is a corrected memory error; because the faulty memory address is
accessed frequently, there may be hundreds of error reports per
second.  The other is a corrected PCIe AER error, which is reported
once per second.  Because they share one ratelimit control structure,
it is highly likely that only the memory error is reported.

To avoid the above issue, an error record content based throttle
algorithm is implemented in this patch.  After the first successful
report, all error records with the same content are throttled for some
time, to give other kinds of error records the opportunity to be
reported.

In the above example, the memory errors will be throttled for some
time after being printked.  Then the PCIe AER error will be printked
successfully.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
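
The content-based throttle described in the commit above can be sketched in user-space C as follows. This is only an illustration under stated assumptions: the cache size, the record size limit, the 10-second quiet period and all names are hypothetical, and the eviction policy (reuse a free slot, else the least recently reported one) is a simplification, not necessarily what the kernel does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SLOTS   4                 /* hypothetical cache size */
#define RECORD_MAX    64                /* hypothetical max record length */
#define THROTTLE_NSEC 10000000000ULL    /* hypothetical 10 s quiet period */

struct cache_slot {
    bool          used;
    size_t        len;
    uint64_t      last_reported;        /* time of last successful report */
    unsigned char data[RECORD_MAX];
};

static struct cache_slot cache[CACHE_SLOTS];

/* Return true if a record with this content may be reported at time
 * `now` (nanoseconds, assumed monotonic).  Identical records seen again
 * within THROTTLE_NSEC of their last report are suppressed. */
static bool should_report(const void *rec, size_t len, uint64_t now)
{
    struct cache_slot *victim = NULL;

    assert(len <= RECORD_MAX);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        struct cache_slot *s = &cache[i];

        if (s->used && s->len == len && memcmp(s->data, rec, len) == 0) {
            if (now - s->last_reported < THROTTLE_NSEC)
                return false;           /* same content, still throttled */
            s->last_reported = now;
            return true;
        }
        /* Track an eviction candidate: prefer a free slot, otherwise
         * the slot whose content was reported longest ago. */
        if (victim == NULL)
            victim = s;
        else if (victim->used &&
                 (!s->used || s->last_reported < victim->last_reported))
            victim = s;
    }

    /* New content: remember it and let it through. */
    victim->used = true;
    victim->len = len;
    victim->last_reported = now;
    memcpy(victim->data, rec, len);
    return true;
}
```

With this scheme a flood of identical memory error records consumes one cache slot and is printed once per quiet period, so a different record, such as a PCIe AER error, is still reported the first time it appears.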

ACPI, APEI, GHES, printk support for recoverable error via NMI · 67eb2e99

Committed by Huang Ying on Jul 13, 2011

Some APEI GHES recoverable errors are reported via NMI, but printk is
not safe in NMI context.

To solve the issue, a lock-less memory allocator is used to allocate
memory in the NMI handler, the error record is saved into the allocated
memory, and the error record is put on a lock-less list.  An irq_work
is then used to defer the operation from NMI context to IRQ context.
The irq_work IRQ handler removes nodes from the lock-less list, printks
the error record, does some further processing including the recovery
operation, and then frees the memory.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
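
The lock-less list at the heart of the commit above can be approximated in user space with C11 atomics. This is a sketch, not the kernel's implementation: the names are hypothetical, nodes are supplied by the caller rather than by a lock-less allocator, and the deferred handler is reduced to collecting record ids. The producer pushes with a compare-and-swap loop and never takes a lock; the consumer detaches the whole list with a single atomic exchange.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct err_node {
    struct err_node *next;
    int record_id;              /* stands in for the saved error record */
};

/* Head of the pending list, shared by producer and consumer. */
static _Atomic(struct err_node *) pending;

/* Producer side (the NMI handler in the commit): lock-free push. */
static void pending_push(struct err_node *n)
{
    struct err_node *head = atomic_load(&pending);

    assert(n != NULL);
    do {
        n->next = head;         /* head is refreshed when the CAS fails */
    } while (!atomic_compare_exchange_weak(&pending, &head, n));
}

/* Consumer side (the deferred handler): take every queued node at once. */
static struct err_node *pending_take_all(void)
{
    return atomic_exchange(&pending, (struct err_node *)NULL);
}

/* Walk the detached list, newest first, copying up to `max` record ids
 * into `out`.  Nodes beyond `max` are ignored in this sketch. */
static int pending_drain(int *out, int max)
{
    int n = 0;

    for (struct err_node *p = pending_take_all(); p != NULL; p = p->next) {
        if (n < max)
            out[n++] = p->record_id;
    }
    return n;
}
```

In the flow the commit describes, the step corresponding to `pending_drain()` is where the records would be printed, handed to recovery, and freed, since that code runs outside NMI context.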

14 Jul, 2011 · 3 commits

ACPI, APEI, Add WHEA _OSC support · 9fb0bfe1

Committed by Huang Ying on Jul 13, 2011

APEI firmware first mode must be turned on explicitly on some
machines, otherwise there may be no GHES hardware error record for
hardware error notification.  The APEI bit in the generic _OSC call can
be used to do that, but on some machines a special WHEA _OSC call must
be used.  This patch adds support for that WHEA _OSC call.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI, APEI, GHES, Support disable GHES at boot time · b6a95016

Committed by Huang Ying on Jul 13, 2011

Some machines may have broken firmware, so that GHES and firmware first
mode should be disabled.  This patch adds support for that.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Matthew Garrett <mjg@redhat.com>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI, APEI, GHES, Do not ratelimit fatal error printk before panic · 5588340d

Committed by Huang Ying on Jul 13, 2011

printk is used by GHES to report hardware errors.  Normally, the
printk is ratelimited to avoid too many hardware error reports in the
kernel log, because there may be thousands or even millions of
corrected hardware errors while the system is running.

That is different for fatal hardware errors: because the system will
panic as soon as possible, there will be no more than several error
records.  These error records are valuable for system fault
diagnosis, so they should not be ratelimited.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

31 Mar, 2011 · 1 commit

Fix common misspellings · 25985edc

Committed by Lucas De Marchi on Mar 30, 2011

Fixes generated by 'codespell' and manually reviewed.
Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>

12 Jan, 2011 · 1 commit

ACPI, APEI, Generic Hardware Error Source POLL/IRQ/NMI notification type support · 81e88fdc

Committed by Huang Ying on Jan 12, 2011

Generic Hardware Error Source provides a way to report platform
hardware errors (such as those from the chipset). It works in so-called
"Firmware First" mode, that is, hardware errors are reported to
firmware first, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware links
can be checked by firmware to produce more valuable hardware error
information for Linux.

This patch adds support for the POLL/IRQ/NMI notification types.

The memory area used to transfer hardware error information from BIOS
to Linux can be determined only in the NMI, IRQ or timer handler, but
general ioremap cannot be used in atomic context, so a special atomic
version of ioremap is implemented for that.

Known issue:

- Error information cannot be printed for recoverable errors notified
  via NMI, because printk is not NMI-safe. This will be fixed by
  delaying printing to IRQ context via irq_work or by making printk
  NMI-safe.

v2:

- adjust printk format per comments.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

14 Dec, 2010 · 1 commit

ACPI, APEI, Report GHES error information via printk · 32c361f5

Committed by Huang Ying on Dec 07, 2010

printk is one of the methods to report hardware errors to user space.
This patch implements hardware error reporting for GHES via printk.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

30 Sep, 2010 · 1 commit

ACPI, APEI, HEST Fix the unsuitable usage of platform_data · 1dd6b20e

Committed by Jin Dongming on Sep 29, 2010

platform_data in hest_parse_ghes() is used for saving the address of entry
information of erst_tab. When the device fails to be added, platform_data
will be freed by platform_device_put(). But the value saved in platform_data
should not be freed here. If it is freed, the system will panic.

So I think platform_data should save the address of allocated memory
which saves entry information of erst_tab.

This patch fixed it and I confirmed it on x86_64 next-tree.

v2:
    Transport the pointer of hest_hdr to platform_data using
    platform_device_add_data()
Signed-off-by: Jin Dongming <jin.dongming@np.css.fujitsu.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

09 Aug, 2010 · 2 commits

ACPI, APEI, Manage GHES as platform devices · 7ad6e943

Committed by Huang Ying on Aug 02, 2010

Register GHES during HEST initialization as platform devices, and make
the GHES driver a platform device driver, so that the GHES driver
module can be loaded automatically when there are GHES available.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

ACPI, APEI, Rename CPER and GHES severity constants · ad4ecef2

Committed by Huang Ying on Aug 02, 2010

The abbreviation of severity should be SEV instead of SER, so the CPER
severity constants are renamed accordingly. GHES severity constants
are renamed in the same way too.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>

20 May, 2010 · 1 commit

ACPI, APEI, Generic Hardware Error Source memory error support · d334a491

Committed by Huang Ying on May 18, 2010

Generic Hardware Error Source provides a way to report platform
hardware errors (such as those from the chipset). It works in so-called
"Firmware First" mode, that is, hardware errors are reported to
firmware first, then reported to Linux by firmware. This way, some
non-standard hardware error registers or non-standard hardware links
can be checked by firmware to produce more valuable hardware error
information for Linux.

Now, only the SCI notification type and memory errors are supported.
More notification types and hardware error types will be added later.
These memory errors are reported to user space through /dev/mcelog by
faking a corrected Machine Check, so that the erroneous memory page can
be offlined by /sbin/mcelog if the error count for one page is beyond
the threshold.

On some machines, Machine Check cannot report the physical address for
some corrected memory errors, but GHES can. So this simplified GHES is
implemented first.
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
