- 20 Mar, 2020 1 commit
-
-
Committed by John Clements
MMHub EDC becomes dirty after BACO reset. EDC registers should be cleared early in the reset phase. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 13 Mar, 2020 2 commits
-
-
Committed by Stanley.Yang
Fix the warning "warn: variable dereferenced before check 'obj' (see line 1131)" by removing unnecessary checks, as amdgpu_ras_debugfs_create_all() is only called from amdgpu_debugfs_init(), where the obj member in the con->head list is not NULL. Use list_for_each_entry() instead of list_for_each_entry_safe(), as obj does not need to be freed or removed from the list during this process. Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
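The difference between the two iteration macros can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: `struct node`, `push_node`, `sum_nodes`, and `free_nodes` are hypothetical names, and a plain singly linked list stands in for the kernel's `struct list_head`. A read-only walk may read `cur->next` after the loop body, while a walk that frees nodes has to save the next pointer first, which is the job of the extra cursor in `list_for_each_entry_safe()`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object linked on a list. */
struct node {
    int value;
    struct node *next;
};

/* Push a new node in front of head; returns the new head,
 * or NULL if the allocation fails. */
static struct node *push_node(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return NULL;
    n->value = value;
    n->next = head;
    return n;
}

/* Read-only walk: nothing is freed or unlinked, so advancing with
 * cur->next after the body is safe (the list_for_each_entry() case). */
static int sum_nodes(const struct node *head)
{
    int sum = 0;

    for (const struct node *cur = head; cur != NULL; cur = cur->next)
        sum += cur->value;
    return sum;
}

/* Destructive walk: cur is freed in the body, so the next pointer is
 * saved beforehand (the list_for_each_entry_safe() case). */
static int free_nodes(struct node *head)
{
    int freed = 0;
    struct node *cur = head;

    while (cur != NULL) {
        struct node *next = cur->next;

        free(cur);
        cur = next;
        freed++;
    }
    return freed;
}
```

When the loop body only reads each entry, as in the debugfs creation path described above, the simpler form is enough.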
-
Committed by Guchun Chen
RAS support capability needs to be updated based on the memory ECC enablement state; also remove the redundant memory ECC check in the gmc module for vega20 and arcturus. v2: check HBM ECC enablement and set the ras mask accordingly. v3: avoid invoking the atomfirmware interface to query twice. Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 Mar, 2020 2 commits
-
-
Committed by Tao Zhou
and remove each ras IP's own debugfs creation. This is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Centralize all debugfs creation for ras in one place. This is required to fix ras when the driver does not use the drm load and unload callbacks due to ordering issues with the drm device node. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 Mar, 2020 1 commit
-
-
Committed by Hawking Zhang
The driver now reports XGMI/WAFL PCS errors through the sysfs xgmi_wafl_err_count node on Vega20. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 27 Feb, 2020 1 commit
-
-
Committed by Hawking Zhang
Centralize all the xgmi-related functions in amdgpu_xgmi.c. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 Feb, 2020 1 commit
-
-
Committed by Guchun Chen
Once a sync flood interrupt is triggered by a RAS error, it is necessary to log and print the non-zero error counters before the actual GPU recovery job. This helps the user quickly identify the source of the RAS error. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 Jan, 2020 1 commit
-
-
Committed by John Clements
Resolves an issue with RAS error injection in mGPU configurations. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: John Clements <john.clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 17 Jan, 2020 1 commit
-
-
Committed by Hawking Zhang
Give the user the flexibility to disable GPU recovery in the RAS recovery path through the module parameter amdgpu_gpu_recovery. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 14 Jan, 2020 1 commit
-
-
Committed by Hawking Zhang
Invoke the sdma query_ras_error_count to get the sdma single-bit error count. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 24 Dec, 2019 1 commit
-
-
Committed by Guchun Chen
The return value should be set before jumping to the error-handling label in the error case; this avoids a potential invalid array access by the caller. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
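The pattern behind this fix can be shown with a small user-space sketch. It is an illustration under assumptions, not the patched function: `read_entry` and its parameters are hypothetical. Every branch that jumps to the shared exit label assigns `ret` first, so a caller that checks the return value never goes on to use an output that was not filled in.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical lookup: copies table[index] into *out on success. */
static int read_entry(const int *table, size_t count, size_t index, int *out)
{
    int ret = 0;

    if (table == NULL || out == NULL) {
        ret = -EINVAL;
        goto out_err;
    }
    if (index >= count) {
        /* Without this assignment the function would return 0 and the
         * caller would treat the untouched *out as valid data. */
        ret = -ERANGE;
        goto out_err;
    }
    *out = table[index];

out_err:
    return ret;
}
```

A caller would check the result before touching the output, for example `if (read_entry(table, n, i, &v) == 0) use(v);`, where `use` is a placeholder.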
-
- 19 Dec, 2019 2 commits
-
-
Committed by zhengbin
Fixes coccicheck warning: drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Guchun Chen
The BACO reset mode strategy is determined by a later function when calling amdgpu_ras_reset_gpu. To avoid confusing readers, drop it. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 06 Dec, 2019 1 commit
-
-
Committed by Le Ma
Change it to an external interface. Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 Nov, 2019 1 commit
-
-
Committed by Hawking Zhang
Check hw ras capability via atomfirmware. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: John Clements <John.Clements@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 Nov, 2019 1 commit
-
-
Committed by Alex Deucher
Clarify some areas, clean up formatting, and add a section for unrecoverable error handling. v2: fix grammatical errors. Reviewed-by: Yong Zhao <yong.zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 30 Oct, 2019 1 commit
-
-
Committed by Le Ma
PSP loses its connection when err_event_athub occurs. This cleanup work can be skipped in BACO reset. v2: squash in missing include (Alex). Signed-off-by: Le Ma <le.ma@amd.com> Reviewed-by: Hawking Zhang <hawking.zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 26 Oct, 2019 2 commits
-
-
Committed by Guchun Chen
Eases maintenance. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Guchun Chen
The ras reboot debugfs node gives the user an easy control to avoid the gpu recovery hang problem and directly reboot the system on a per-card basis after a ras uncorrectable error happens. However, it is a common entry, so it should be moved out of the ras_ctrl node and should not depend on an ip when the user provides input. So add a new auto_reboot node in the ras debugfs dir to achieve this. v2: in the commit message, add justification for why the ras reboot debugfs node is needed. v3: use debugfs_create_bool to create the debugfs file for the boolean value. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 Oct, 2019 1 commit
-
-
Committed by Andrey Grodzovsky
Ignore the ERREVENT_ATHUB_INTERRUPT for systems without RAS. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-and-tested-by: Jack Zhang <Jack.Zhang1@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 Oct, 2019 3 commits
-
-
Committed by Tao Zhou
Check whether a page is a bad page before umc error injection; a bad page should not be accessed again. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
We recently added it, but never documented it. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Fix a couple of spelling typos. Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 08 Oct, 2019 1 commit
-
-
Committed by Felix Kuehling
Don't set a struct pointer to NULL before freeing its members. It's hard to see what's happening due to a local pointer-to-pointer data aliasing con->eh_data. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Philip Cox <Philip.Cox@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
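The ordering problem described here can be sketched in user-space C. This is a simplified illustration, not the driver's teardown path: `struct err_data`, `struct ras_ctx`, `ctx_init`, and `ctx_fini` are hypothetical names chosen to mirror the `con->eh_data` relationship. The owner's pointer is cleared only after the members and the struct itself have been freed, and a plain local copy is used instead of a pointer-to-pointer alias so the order stays visible.

```c
#include <stdlib.h>

/* Hypothetical error-handler data with one heap-allocated member. */
struct err_data {
    int *bad_pages;
    size_t count;
};

/* Hypothetical owner, standing in for the context that holds eh_data. */
struct ras_ctx {
    struct err_data *eh_data;
};

/* Allocate the data and its member; returns 0 on success, -1 on failure. */
static int ctx_init(struct ras_ctx *ctx, size_t count)
{
    struct err_data *data = calloc(1, sizeof(*data));

    if (data == NULL)
        return -1;
    data->bad_pages = calloc(count, sizeof(*data->bad_pages));
    if (data->bad_pages == NULL) {
        free(data);
        return -1;
    }
    data->count = count;
    ctx->eh_data = data;
    return 0;
}

/* Tear down in the safe order: members, then the struct, then the
 * owner's pointer. Clearing ctx->eh_data first through an alias would
 * leave nothing to free the members through. */
static void ctx_fini(struct ras_ctx *ctx)
{
    struct err_data *data = ctx->eh_data;

    if (data == NULL)
        return;
    free(data->bad_pages);
    free(data);
    ctx->eh_data = NULL;
}
```

Because `ctx_fini` returns early on a NULL pointer, calling it twice is harmless in this sketch.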
-
- 03 Oct, 2019 7 commits
-
-
Committed by Tao Zhou
Simplify the code that accesses the eeprom_control struct. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Remove mmhub_funcs in adev. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Add new sections to amdgpu.rst, fix up formatting issues, and add additional documentation to each section. Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Guchun Chen
The null pointer should be checked first to avoid a null pointer access. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
We are reserving vram pages, so they should be aligned to the GPU page size. Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
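Aligning an address to a page boundary is a mask operation when the page size is a power of two. The sketch below assumes a 4 KiB GPU page purely for illustration; `GPU_PAGE_SHIFT`, `GPU_PAGE_SIZE`, `gpu_page_align_down`, and `gpu_page_index` are hypothetical names, not the driver's macros.

```c
#include <stdint.h>

/* Assumed page geometry for this sketch: 4 KiB pages. */
#define GPU_PAGE_SHIFT 12
#define GPU_PAGE_SIZE  ((uint64_t)1 << GPU_PAGE_SHIFT)

/* Round an address down to the start of the page that contains it. */
static uint64_t gpu_page_align_down(uint64_t addr)
{
    return addr & ~(GPU_PAGE_SIZE - 1);
}

/* Convert an address to the index of the page that contains it. */
static uint64_t gpu_page_index(uint64_t addr)
{
    return addr >> GPU_PAGE_SHIFT;
}
```

With these helpers, a faulting address such as `0x12345` maps to the page starting at `0x12000`, which is the unit a reservation would cover.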
-
Committed by Tao Zhou
There are two cases in which a reserve error should be ignored: 1) a ras bad page has been allocated (used by someone); 2) a ras bad page has been reserved (duplicate error injection for one page). DRM_ERROR is unnecessary for a bad page reserve failure. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Adam Zerella
Some of the documentation formatting could be improved, which will resolve some Sphinx amdgpu build warnings, e.g. WARNING: Unexpected indentation. WARNING: Block quote ends without a blank line; unexpected unindent. WARNING: Inline emphasis start-string without end-string. Signed-off-by: Adam Zerella <adam.zerella@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 17 Sep, 2019 2 commits
-
-
Committed by Christian König
The placement is something TTM/BO internal and the RAS code should avoid touching that directly. Add a helper to create a BO at a fixed location and use that instead. v2: squash in fixes (Alex). Signed-off-by: Christian König <christian.koenig@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Guchun Chen
Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the nodes one by one. Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 Sep, 2019 5 commits
-
-
Committed by Guchun Chen
Call pcie bif ras query/inject in amdgpu ras. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Guchun Chen <guchun.chen@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Hawking Zhang
Allow injecting errors into the XGMI block via the debugfs node ras_ctrl. Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andrey Grodzovsky
The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow avoiding this by resetting the table header to empty if needed. v2: Switch to a debugfs entry instead of a load-time parameter. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
Move umc ras init from the ras module to the umc block; the generic ras module should pay less attention to specific ras blocks. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Andrey Grodzovsky
Fixes driver load regression on APUs. Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 14 Sep, 2019 1 commit
-
-
Committed by Tao Zhou
ras recovery_init should be called after ttm init, and bad page reserve should be done before gpu reset since i2c may be unstable during gpu reset. Add cleanup for recovery_init and recovery_fini. v2: add more comments and prints; remove cancel_work_sync in recovery_init. Signed-off-by: Tao Zhou <tao.zhou1@amd.com> Reviewed-by: Guchun Chen <guchun.chen@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-