!74 SPR: HBM EDAC and MCA recovery enhancement and bug fix
Merge Pull Request from: @x56Jason

This is a cherry-pick of PR34 from the OLK-5.10 branch.

## Description

MCA recovery, with OS kernel assistance, isolates uncorrected data errors and recovers from them. At the same time, the EDAC driver translates the memory error address into a detailed location such as socket/MC/channel/DIMM/bank/row/column, which greatly benefits development and operations.

Upstream has recently gained several MCA recovery enhancements and bug fixes, and EDAC has gained support for SPR (Sapphire Rapids) HBM (High Bandwidth Memory) along with further enhancements and bug fixes.

Note: almost all of these backported patches are applied directly from the upstream commits.

## Upstream commits

```
ea6d0630 mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
a3f5d80e mm,hwpoison: send SIGBUS with error virutal address
a6e3cf70 x86/mce: Change to not send SIGBUS error during copy from user
bc1bb416 generic_perform_write()/iomap_write_actor(): saner logics for short copy
69065847 x86/mce: Drop copyin special case for #MC
33761363 x86/mce: Reduce number of machine checks taken during recovery
046545a6 mm/hwpoison: fix error page recovered but reported "not recovered"
bc1c99a5 EDAC: Add DDR5 new memory type
479f58dd EDAC/i10nm: Add Intel Sapphire Rapids server support
cf4e6d52 EDAC/i10nm: Retrieve and print retry_rd_err_log registers
2f4348e5 EDAC/skx_common: Add new ADXL components for 2-level memory
4bd4d32e EDAC/i10nm: Add detection of memory levels for ICX/SPR servers
c9450883 EDAC/i10nm: Add support for high bandwidth memory
e1ca90b7 EDAC/mc: Add new HBM2 memory type
7cb58db64ca7ab020850ef8543d9f583b820dde0 EDAC/skx_common: Set the memory type correctly for HBM memory
c370baa3 EDAC/i10nm: Release mdev/mbase when failing to detect HBM
```

## Testing

Kernel options:

```
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_X86_MCE_INTEL=m
CONFIG_ACPI_APEI_EINJ=m
CONFIG_EDAC=y
CONFIG_EDAC_I10NM=m
```
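Before running the tests, the kernel options listed above can be checked against a kernel config file. The following is a minimal sketch, not part of the PR: the function names are hypothetical, and treating either `y` or `m` as "enabled" is an assumption made here for convenience.

```python
import re

# Options copied from the kernel options list in this PR description.
REQUIRED_OPTIONS = [
    "CONFIG_X86_MCE",
    "CONFIG_MEMORY_FAILURE",
    "CONFIG_X86_MCE_INTEL",
    "CONFIG_ACPI_APEI_EINJ",
    "CONFIG_EDAC",
    "CONFIG_EDAC_I10NM",
]

# Matches "CONFIG_FOO=value"; "# CONFIG_FOO is not set" lines do not match.
_ASSIGNMENT = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")


def parse_kernel_config(text):
    """Return {option: value} for every CONFIG_x=value line in text."""
    options = {}
    for line in text.splitlines():
        match = _ASSIGNMENT.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2)
    return options


def missing_options(text, required=REQUIRED_OPTIONS):
    """List required options that are neither built in (y) nor modular (m).

    Accepting both y and m is an assumption of this sketch.
    """
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]
```

Passing the contents of the running kernel's config file to `missing_options` returns an empty list when every option is enabled. Where that file lives depends on the distribution.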
1. Using ras-tools (https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git):
   - `./einj_mem_uc -f -m 128:512:0 copyin`: the number of MCE events triggered decreases.
   - `./einj_mem_uc single -f`: run repeatedly; without the patches, "not recovered" is sometimes reported.
   - After injecting an error and getting MCA recovery, also run `dmesg` to check that the error address is decoded to a detailed location.
2. `modprobe einj`, then `./cmcistorm 1`: inject a CE at an HBM address and check that EDAC decodes the location for an address in the HBM range. The same test also applies to 2-level memory.

Note: SPR HBM EDAC retry_rd_err_log is not supported yet; it will be backported when the patches are ready upstream.

Link: !34:SPR: HBM EDAC and MCA recovery enhancement and bug fix
Link:https://gitee.com/openeuler/kernel/pulls/74
Reviewed-by: Zheng Zengkai <zhengzengkai@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>