- 06 Jan, 2023 1 commit
-
-
Committed by Hawking Zhang
amdgpu_ras_block_late_init is invoked from each IP-specific ras_late_init call as a common helper for all IP blocks. However, when amdgpu_ras_block_late_init calls amdgpu_ras_query_error_count to query RAS error counters, amdgpu_ras_query_error_count queries every IP block that supports the RAS query interface. This results in wrong error counters being cached in the software copies when RAS errors are detected at time zero or during a warm reset, e.g., the sdma_ras_late_init phase counts sdma/mmhub errors, and the mmhub_ras_late_init phase counts the same sdma/mmhub errors again. The change updates the amdgpu_ras_query_error_count interface to allow querying a specific IP block's error counters. It introduces a new input parameter, query_info: if query_info is NULL, all IP blocks are queried; otherwise, only the IP block specified by query_info is queried. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
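The NULL-versus-specific behaviour of query_info described in the commit above can be sketched in plain C. This is only an illustration under stated assumptions: the types, the counter table, and the function name query_error_count are hypothetical stand-ins, not the real amdgpu definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's types; not the real amdgpu structures. */
enum ras_block { RAS_BLOCK_SDMA, RAS_BLOCK_MMHUB, RAS_BLOCK_COUNT };

struct ras_query_info {
    enum ras_block block;   /* the single block the caller is interested in */
};

struct ras_err_totals {
    unsigned long ce;       /* correctable errors */
    unsigned long ue;       /* uncorrectable errors */
};

/* Illustrative per-block counters standing in for hardware state. */
static const struct ras_err_totals hw_counters[RAS_BLOCK_COUNT] = {
    [RAS_BLOCK_SDMA]  = { .ce = 2, .ue = 0 },
    [RAS_BLOCK_MMHUB] = { .ce = 1, .ue = 1 },
};

/*
 * query_info == NULL: sum the counters of every block.
 * query_info != NULL: count only the block it names.
 */
void query_error_count(const struct ras_query_info *query_info,
                       struct ras_err_totals *out)
{
    assert(out != NULL);
    out->ce = 0;
    out->ue = 0;

    for (int b = 0; b < RAS_BLOCK_COUNT; b++) {
        if (query_info != NULL && (int)query_info->block != b)
            continue;   /* skip blocks the caller did not ask for */
        out->ce += hw_counters[b].ce;
        out->ue += hw_counters[b].ue;
    }
}
```

With a filter in place, a per-block late-init step would pass its own block in query_info, so the sdma step no longer picks up mmhub errors and vice versa.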
-
- 04 Jan, 2023 1 commit
-
-
Committed by Stanley.Yang
[Why]
[ 41.285804] RIP: 0010:amdgpu_ras_feature_enable+0x15c/0x310 [amdgpu]
[ 41.285945] Code: 48 89 c1 48 c7 c2 b9 f2 88 c1 48 c7 c0 c0 f2 88 c1 49 8b 3c 24 48 0f 44 d0 48 c7 c6 98 33 80 c1 e8 5f 52 75 d9 e9 fa fe ff ff <0f> 0b e9 66 ff ff ff 48 8b 3d 86 8c 0f da ba 00 04 00 00 be c0 0d
[ 41.285946] RSP: 0018:ffffbccdc72efc90 EFLAGS: 00010246
[ 41.285948] RAX: 0000000000000004 RBX: ffff931897406980 RCX: 0000000000000002
[ 41.285949] RDX: 0000000000000dc0 RSI: 0000000000000002 RDI: ffff931500042b00
[ 41.285950] RBP: ffffbccdc72efcc0 R08: 0000000000000002 R09: ffff931885b87000
[ 41.285951] R10: 0000000000ffff10 R11: 0000000000000001 R12: ffff931893e20000
[ 41.285952] R13: 0000000000000001 R14: ffff931885b87000 R15: 0000000000000000
[ 41.285953] FS: 0000000000000000(0000) GS:ffff931c6f200000(0000) knlGS:0000000000000000
[ 41.285954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 41.285955] CR2: 000055dd6f532008 CR3: 000000061b010006 CR4: 00000000003706e0
[ 41.285956] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 41.285957] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 41.285958] Call Trace:
[ 41.285959] <TASK>
[ 41.285963] ? gfx_v11_0_early_init+0x250/0x250 [amdgpu]
[ 41.286117] gfx_v11_0_late_init+0x8c/0xb0 [amdgpu]
[ 41.286271] amdgpu_device_ip_late_init+0x8d/0x3c0 [amdgpu]
[ 41.286401] amdgpu_device_init.cold+0x1677/0x1fda [amdgpu]
[ 41.286616] ? pci_bus_read_config_word+0x4a/0x70
[ 41.286621] ? do_pci_enable_device+0xdb/0x110
[ 41.286625] amdgpu_driver_load_kms+0x1a/0x160 [amdgpu]
[ 41.286762] amdgpu_pci_probe+0x18d/0x3a0 [amdgpu]
[ 41.286898] local_pci_probe+0x4b/0x90
[ 41.286901] work_for_cpu_fn+0x1a/0x30
[ 41.286903] process_one_work+0x22b/0x3d0
[ 41.286905] worker_thread+0x223/0x420
[ 41.286907] ? process_one_work+0x3d0/0x3d0
[ 41.286908] kthread+0x12a/0x150
[ 41.286911] ? set_kthread_struct+0x50/0x50
[ 41.286913] ret_from_fork+0x22/0x30
[How] On some ASICs only memory ECC is enabled and SRAM ECC is not, but the RAS enable command still needs to be sent to the GFX block to support poison mode, so add a poison mode check. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 Dec, 2022 3 commits
-
-
Committed by Tao Zhou
1. No need to query poison mode on the SRIOV guest side; the host can handle it. 2. Define the function to simplify the code. v2: rename amdgpu_ras_poison_mode_query to amdgpu_ras_query_poison_mode. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Support VCN/JPEG RAS in both bare metal and SRIOV environments. v2: update commit description. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Injection on the guest is not allowed. v2: return directly in an SRIOV environment. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 Dec, 2022 1 commit
-
-
Committed by ye xingchen
Replace the open-coded implementation with sysfs_emit() to simplify the code. Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: ye xingchen <ye.xingchen@zte.com.cn> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 18 Nov, 2022 2 commits
-
-
Committed by Tao Zhou
Set support flag for VCN/JPEG 4.0. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by YiPeng Chai
The patch enables mode-1 reset for RAS recovery in fatal error mode. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 28 Oct, 2022 2 commits
-
-
Committed by Tao Zhou
Make the code simpler. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Make the code more readable. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 Oct, 2022 3 commits
-
-
Committed by YiPeng Chai
V2: Add SRIOV VF RAS support in amdgpu_ras_asic_supported. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by YiPeng Chai
V1: Enable RAS support for the CHIP_IP_DISCOVERY asic type. V2: 1. Change commit comment. 2. Enable RAS support for mp0 v13_0_0 and v13_0_10. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Victor Zhao
This reverts commit dac6b808. The commit reverts AMDGPU_SKIP_MODE2_RESET because it conflicts with the original design of the reset handler. It will be redesigned. Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 18 Oct, 2022 4 commits
-
-
Committed by Hawking Zhang
The RAS error address translation algorithm is common across dGPU and A + A platforms as long as the SoC integrates the same generation of UMC IP. On A + A platforms, UMC RAS is managed by x86 MCA, and umc_ras in the GPU driver is not initialized at all. In that case, any umc_ras callback implemented for the dGPU config shouldn't be invoked from an A + A specific callback. The change moves convert_error_address out of the dGPU umc_ras structure and shares it between the A + A and dGPU configs. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by YiPeng Chai
V2: Add SRIOV VF RAS support in amdgpu_ras_asic_supported. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by YiPeng Chai
V1: Enable RAS support for the CHIP_IP_DISCOVERY asic type. V2: 1. Change commit comment. 2. Enable RAS support for mp0 v13_0_0 and v13_0_10. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Victor Zhao
This reverts commit dac6b808. The commit reverts AMDGPU_SKIP_MODE2_RESET because it conflicts with the original design of the reset handler. It will be redesigned. Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure") Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 Oct, 2022 2 commits
-
-
Committed by Tao Zhou
Fix some issues found by the checkpatch script. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Make the code reusable and remove redundant code. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 Sep, 2022 2 commits
-
-
Committed by Tao Zhou
Use the convert interface to simplify code. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by YiPeng Chai
For ASICs using smu v13_0_2, the following warning appears when uninstalling amdgpu: amdgpu: ras disable gfx failed poison:1 ret:-22. [Why]: For ASICs using smu v13_0_2, the psp .suspend and mode1reset are called before amdgpu_ras_pre_fini during amdgpu uninstall, which has already disabled all RAS features and reset the psp. Since the psp has been reset, calling amdgpu_ras_disable_all_features in amdgpu_ras_pre_fini to disable RAS features fails. [How]: If all RAS features are already disabled, amdgpu_ras_disable_all_features is not called again. Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
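The guard described in the [How] part above can be sketched as follows. This is a simplified model, not the driver code: struct ras_state, its fields, and both function names are hypothetical, and the -22 return value only mirrors the ret:-22 quoted in the warning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified state; the real driver tracks this differently. */
struct ras_state {
    unsigned int enabled_features; /* bitmask of blocks with RAS enabled */
    bool psp_alive;                /* false once the psp has been reset */
};

/* Stand-in for the disable path: it needs a live psp to succeed. */
int ras_disable_all_features(struct ras_state *s)
{
    assert(s != NULL);
    if (!s->psp_alive)
        return -22; /* mirrors the ret:-22 seen in the warning */
    s->enabled_features = 0;
    return 0;
}

/* Only ask for a disable when something is still enabled. */
int ras_pre_fini(struct ras_state *s)
{
    assert(s != NULL);
    if (s->enabled_features == 0)
        return 0; /* already torn down by suspend/reset: nothing to do */
    return ras_disable_all_features(s);
}
```

The early return keeps the teardown path quiet when suspend and mode-1 reset have already cleared everything, while the normal path still disables features that remain enabled.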
-
- 14 Sep, 2022 2 commits
-
-
Committed by Candice Li
No need to reset error status since only UMC RAS is supported on psp v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Candice Li
No need to reset error status since only UMC RAS is supported on psp v13_0_0. Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 17 Aug, 2022 1 commit
-
-
Committed by Victor Zhao
- introduce the AMDGPU_SKIP_MODE2_RESET flag - let mode2 reset fall back to the default reset method if it fails v2: move this part out of the ASIC-specific part Signed-off-by: Victor Zhao <Victor.Zhao@amd.com> Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 13 Jul, 2022 1 commit
-
-
Committed by Likun Gao
Move reset_context out of the gpu recover function to make it configurable for different reset purposes. When the reset is triggered through the gpu_recovery sysfs, force the full reset method. Otherwise, try soft reset by default if the ASIC supports it; if soft reset fails, use full reset. Signed-off-by: Likun Gao <Likun.Gao@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
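The reset policy described above (caller-built context, forced full reset for the sysfs path, soft reset with fallback otherwise) can be illustrated with a small model. All names here are hypothetical; the real reset_context carries more state, and the two stub callbacks exist only to exercise both outcomes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for illustration; not the driver's definitions. */
enum reset_trigger { RESET_TRIGGER_SYSFS, RESET_TRIGGER_OTHER };
enum reset_method  { RESET_METHOD_SOFT, RESET_METHOD_FULL };

struct reset_context {
    bool need_full_reset; /* decided by the caller, not by the recover function */
};

/* The caller builds the context: the sysfs path always forces a full reset. */
struct reset_context make_reset_context(enum reset_trigger trigger)
{
    struct reset_context ctx = {
        .need_full_reset = (trigger == RESET_TRIGGER_SYSFS),
    };
    return ctx;
}

/* Stub soft-reset attempts used to exercise both outcomes. */
bool soft_reset_succeeds(void) { return true; }
bool soft_reset_fails(void)    { return false; }

/* Returns the method that ended up performing the recovery. */
enum reset_method gpu_recover(const struct reset_context *ctx,
                              bool asic_supports_soft_reset,
                              bool (*try_soft_reset)(void))
{
    assert(ctx != NULL);
    if (!ctx->need_full_reset && asic_supports_soft_reset &&
        try_soft_reset != NULL && try_soft_reset())
        return RESET_METHOD_SOFT;
    /* Forced, unsupported, or failed soft reset: use full reset. */
    return RESET_METHOD_FULL;
}
```

Passing the context in from the caller is what makes the policy configurable: each trigger decides up front whether a soft reset may be attempted at all.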
-
- 06 Jul, 2022 2 commits
-
-
Committed by Stanley.Yang
The whole RAS bad page framework should not be initialized on the SRIOV guest side because it is handled on the host side. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Stanley.Yang
GFX is the only IP block for which the RAS TA needs to program the hardware when receiving the enable_feature command. Changed from V1: remove the amdgpu_ras_need_send_ras_feature inline function and use the GFX RAS block check directly. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
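A direct block check of the kind mentioned in the commit above might look roughly like this. The sketch assumes a simplified context with a counter standing in for commands sent to the RAS TA; the enum, struct, and function names are illustrative, not the driver's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical block list and context, for illustration only. */
enum ras_block { RAS_BLOCK_UMC, RAS_BLOCK_SDMA, RAS_BLOCK_GFX, RAS_BLOCK_COUNT };

struct ras_ctx {
    unsigned int enabled_mask;     /* software view of enabled blocks */
    unsigned int ta_commands_sent; /* stands in for commands sent to the RAS TA */
};

int ras_feature_enable(struct ras_ctx *ctx, enum ras_block block, bool enable)
{
    assert(ctx != NULL);

    /* Only GFX requires the TA to program hardware, so only GFX sends a command. */
    if (block == RAS_BLOCK_GFX)
        ctx->ta_commands_sent++;

    /* Every block still updates the software state. */
    if (enable)
        ctx->enabled_mask |= 1u << block;
    else
        ctx->enabled_mask &= ~(1u << block);
    return 0;
}
```

Checking the block inline keeps the condition visible at the one place it matters, which is the simplification the V1-to-V2 note describes.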
-
- 11 Jun, 2022 2 commits
-
-
Committed by Andrey Grodzovsky
We removed the wrapper that queued the recover function into the reset domain queue and that was using this name. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andrey Grodzovsky
Save the extra useless work schedule. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 Jun, 2022 2 commits
-
-
Committed by Candice Li
Adjust the sequence for RAS late init and separate RAS reset error status from query status. v2: squash in fix from Candice Signed-off-by: Candice Li <candice.li@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Stanley.Yang
Fix the aldebaran RAS supported check on the SRIOV guest side; the previous check condition blocked all RAS features on bare metal. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 27 May, 2022 1 commit
-
-
Committed by Stanley.Yang
Support umc/gfx/sdma RAS on the guest side. Changed from V1: move the SRIOV check into amdgpu_ras_interrupt_fatal_error_handler. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 May, 2022 2 commits
-
-
Committed by Tao Zhou
Query RAS status before RAS poison consumption handling; add more comments and logs. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-and-tested-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Enable RAS IH if the poison consumption handler is implemented. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 26 Apr, 2022 1 commit
-
-
Committed by Haowen Bai
After an allocation failure, there is no need to kfree. Signed-off-by: Haowen Bai <baihaowen@meizu.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 Apr, 2022 3 commits
-
-
Committed by Tao Zhou
The fatal error handler is independent of the general RAS interrupt handler since there is no related IH ring. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Add support for a general RAS poison consumption handler. v2: remove callback function for poison consumption. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Prepare for the implementation of the poison consumption handler. v2: separate umc handler from poison creation. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 Mar, 2022 1 commit
-
-
Committed by Mohammad Zafar Ziya
Add vcn and jpeg ras support options. V2: vcn and jpeg ras flag enabled for aldebaran asic only. V3: vcn and jpeg ras flag disabled for error counter query. Generic poison query interface added. VCN and JPEG ras enabled based on IP version check. V4: vcn and jpeg ras flag moved under ecc flag for dGPU. Signed-off-by: Mohammad Zafar Ziya <Mohammadzafar.ziya@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 Mar, 2022 1 commit
-
-
Committed by Stanley.Yang
The SMU should be notified to update bad channel info when an uncorrectable error is detected in the UMC block. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-