- 03 10月, 2019 6 次提交
-
-
由 Tao Zhou 提交于
remove mmhub_funcs in adev Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Deucher 提交于
Add new sections to amdgpu.rst, fix up formatting issues, add additional documentation to each section. Acked-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Guchun Chen 提交于
null ptr should be checked first to avoid null ptr access Signed-off-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Deucher 提交于
We are reserving vram pages so they should be aligned to the GPU page size. Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
There are two cases of reserve error should be ignored: 1) a ras bad page has been allocated (used by someone); 2) a ras bad page has been reserved (duplicate error injection for one page); DRM_ERROR is unnecessary for the failure of bad page reserve Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Adam Zerella 提交于
Some of the documentation formatting could be improved which will resolve some Sphinx amdgpu build warnings e.g WARNING: Unexpected indentation. WARNING: Block quote ends without a blank line; unexpected unindent. WARNING: Inline emphasis start-string without end-string. Signed-off-by: NAdam Zerella <adam.zerella@gmail.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 17 9月, 2019 2 次提交
-
-
由 Christian König 提交于
The placement is something TTM/BO internal and the RAS code should avoid touching that directly. Add a helper to create a BO at a fixed location and use that instead. v2: squash in fixes (Alex) Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Guchun Chen 提交于
Use debugfs_remove_recursive to remove the whole debugfs directory instead of removing the node one by one. Signed-off-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 16 9月, 2019 5 次提交
-
-
由 Guchun Chen 提交于
Call pcie bif ras query/inject in amdgpu ras. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
allow inject error to XGMI block via debugfs node ras_ctrl Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Andrey Grodzovsky 提交于
The table grows quickly during debug/development effort when multiple RAS errors are injected. Allow to avoid this by setting table header back to empty if needed. v2: Switch to debugfs entry instead of load time parameter. Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
move umc ras init from ras module to umc block, generic ras module should pay less attention to specific ras block. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Andrey Grodzovsky 提交于
Fixes driver load regression on APUs. Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 14 9月, 2019 7 次提交
-
-
由 Tao Zhou 提交于
ras recovery_init should be called after ttm init, bad page reserve should be put in front of gpu reset since i2c may be unstable during gpu reset. add cleanup for recovery_init and recovery_fini v2: add more comment and print. remove cancel_work_sync in recovery_init. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
support eeprom records load and save for ras, move EEPROM records storing to bad page reserving v2: remove redundant check for con->eh_data Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
change bps type from retired page to eeprom table record, prepare for saving umc error records to eeprom Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Andrey Grodzovsky 提交于
In case of RAS error allow user configure auto system reboot through ras_ctrl. This is also part of the temproray work around for the RAS hang problem. v4: Use latest kernel API for disk sync. Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Andrey Grodzovsky 提交于
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
In late_init for ras, the helper function will be used to 1). disable ras feature if the IP block is masked as disabled 2). send enable feature command if the ip block was masked as enabled 3). create debugfs/sysfs node per IP block 4). register interrupt handler v2: check ih_info.cb to decide add interrupt handler or not v3: add ras_late_fini for cleanup all the ras fs node and remove interrupt handler Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
Ras controller interrupt and Ras err event athub interrupt are two dedicated interrupts for RAS support. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 24 8月, 2019 1 次提交
-
-
由 Guchun Chen 提交于
Use unsigned long type for the same ras count variable. This will avoid overflow on 64 bit system. Signed-off-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 13 8月, 2019 3 次提交
-
-
由 Tao Zhou 提交于
feature mask info is enough for rocm tool, "cat /sys/class/drm/card0/device/ras/features" will get the info like this: feature mask: 0x3ffb Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
call mmhub ras query/inject in amdgpu ras Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
ras sub block index could be passed from shell command Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 07 8月, 2019 1 次提交
-
-
由 Tao Zhou 提交于
remove confused ras error type info Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 02 8月, 2019 3 次提交
-
-
由 Tao Zhou 提交于
ce can also trigger interrupt, and even both ce and ue error can be found in one ras query, distinguishing between ce and ue in interrupt handler is uncessary. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Suggested-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
correctable error can also trigger interrupt in some ras blocks Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
umc error address query can get ce/ue error address and clear error status Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 01 8月, 2019 9 次提交
-
-
由 Dennis Li 提交于
check gfx error count in both ras querry function and ras interrupt handler. gfx ras is still disabled by default due to known stability issue found in gpu reset. Signed-off-by: NDennis Li <Dennis.Li@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
error injection address is not in gpu address space Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
only ue and ce errors are supported Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
add error data as parameter for ras interrupt cb and process it Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
more than one error address may be recorded in one query Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
create new amdgpu_umc structure to for more umc settings in future and switch to the new structure Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
v1: increase ras ce/ue error count v2: log the number of correctable and uncorrectable errors Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
check umc error count in both ras querry function and ras interrupt handler Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
These are common structures that can be included by IP specific source files Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 19 7月, 2019 3 次提交
-
-
由 Hawking Zhang 提交于
this function is not needed any more. error injection is the only way to validate ras but it can't be executed in amdgpu_ras_init, where gpu is even not initialized Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
error injection to other IP blocks (except UMC) will be enabled until RAS feature stablize on those IP blocks Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras hardware ability Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-