Commits · d65bf1f8a795e2748ab3ea2231ab896a9cac743c · openeuler / Kernel

03 Oct, 2019 6 commits

drm/amdgpu: replace mmhub_funcs with mmhub.funcs · d65bf1f8

Authored by Tao Zhou on Sep 12, 2019

remove mmhub_funcs in adev
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: fix and update the documentation for RAS · f77c7109

Authored by Alex Deucher on Sep 19, 2019

Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: avoid null pointer dereference · 8a3e801f

Authored by Guchun Chen on Sep 17, 2019

null ptr should be checked first to avoid null ptr access
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
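
Note: the commit message above does not show the code that was changed. As a minimal sketch of the general pattern it describes (check a pointer before its first dereference), the following uses hypothetical stand-in types and a hypothetical helper name; it is not the driver's actual code.

```c
#include <stddef.h>

/* Hypothetical stand-ins for a driver context that holds an optional
 * pointer to error-handler data; not the real amdgpu structures. */
struct err_handler_data {
    int count;
};

struct ras_context {
    struct err_handler_data *eh_data;
};

/* Hypothetical helper: returns the record count, or -1 if either
 * pointer is missing. */
static int ras_record_count(const struct ras_context *con)
{
    /* Validate both pointers before the first dereference, not after. */
    if (con == NULL || con->eh_data == NULL)
        return -1;

    return con->eh_data->count;
}
```

The point of the ordering is that the check guards every later access, so no path can read through a null pointer.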

drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pages · a142ba88

Authored by Alex Deucher on Sep 17, 2019

We are reserving vram pages so they should be aligned to the
GPU page size.
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
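
Note: to illustrate why the page size used for alignment matters, here is a minimal sketch that computes a page index and page base from a VRAM address using a GPU-specific shift rather than the CPU's page size. The macro names, the 4 KiB value, and the helper names are assumptions made for this example, not taken from the commit.

```c
#include <stdint.h>

/* Assumed for illustration: a 4 KiB GPU page, defined independently of
 * the CPU's PAGE_SIZE (which can be larger on some architectures). */
#define GPU_PAGE_SHIFT 12
#define GPU_PAGE_SIZE  (UINT64_C(1) << GPU_PAGE_SHIFT)

/* Hypothetical helper: index of the GPU page containing a VRAM address. */
static uint64_t gpu_page_index(uint64_t vram_addr)
{
    return vram_addr >> GPU_PAGE_SHIFT;
}

/* Hypothetical helper: start address of the GPU page containing a
 * VRAM address. */
static uint64_t gpu_page_base(uint64_t vram_addr)
{
    return vram_addr & ~(GPU_PAGE_SIZE - 1);
}
```

If the CPU page size differed from the GPU page size, shifting by the CPU value would reserve the wrong range, which is the kind of mismatch the commit message points at.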

drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pages · ae115c81

Authored by Tao Zhou on Sep 12, 2019

There are two cases in which a reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page);

DRM_ERROR is unnecessary for the failure of bad page reserve
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

docs: drm/amdgpu: Resolve build warnings · 879e723d

Authored by Adam Zerella on Sep 14, 2019

Some of the documentation formatting could be improved,
which will resolve some Sphinx amdgpu build warnings, e.g.:

WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Sep, 2019 2 commits

drm/amdgpu: cleanup creating BOs at fixed location (v2) · de7b45ba

Authored by Christian König on Sep 13, 2019

The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.

Add a helper to create a BO at a fixed location and use that instead.

v2: squash in fixes (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix ras ctrl debugfs node leak · 012dd14d

Authored by Guchun Chen on Sep 16, 2019

Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the nodes one by one.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Sep, 2019 5 commits

drm/amdgpu: support pcie bif ras query and inject · d7bd680d

Authored by Guchun Chen on Sep 11, 2019

Call pcie bif ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: enable error injection to XGMI block via debugfs · f3170352

Authored by Hawking Zhang on Sep 08, 2019

allow injecting errors into the XGMI block via the debugfs node ras_ctrl
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Allow to reset to EEPROM table. · 084fe13b

Authored by Andrey Grodzovsky on Sep 09, 2019

The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow to avoid this by setting
the table header back to empty if needed.

v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: move umc ras init to umc block · 4930aabe

Authored by Tao Zhou on Sep 05, 2019

move umc ras init from ras module to umc block; the generic ras module
should pay less attention to specific ras blocks.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Avoid RAS recovery init when no RAS support. · 4d1337d2

Authored by Andrey Grodzovsky on Sep 06, 2019

Fixes driver load regression on APUs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14 Sep, 2019 7 commits

drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place · 1a6fc071

Authored by Tao Zhou on Aug 30, 2019

ras recovery_init should be called after ttm init;
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini

v2: add more comments and prints.
    remove cancel_work_sync in recovery_init.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Hook EEPROM table to RAS · 78ad00c9

Authored by Tao Zhou on Aug 15, 2019

support eeprom records load and save for ras,
move EEPROM records storing to bad page reserving

v2: remove redundant check for con->eh_data
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: change ras bps type to eeprom table record structure · 9dc23a63

Authored by Tao Zhou on Aug 13, 2019

change bps type from retired page to eeprom table record, to prepare for
saving umc error records to eeprom
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Add system auto reboot to RAS. · d5ea093e

Authored by Andrey Grodzovsky on Aug 22, 2019

In case of a RAS error, allow the user to configure automatic system
reboot through ras_ctrl.
This is also part of the temporary workaround for the RAS
hang problem.

v4: Use latest kernel API for disk sync.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Avoid HW GPU reset for RAS. · 7c6e68c7

Authored by Andrey Grodzovsky on Sep 13, 2019

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until proper solution in PSP/SMU is ready:
When an uncorrectable error happens, the DF will unconditionally
broadcast error event packets to all its clients/slaves upon
receiving the fatal error event and freeze all its outbound queues;
the err_event_athub interrupt will be triggered.
In such a case we use this interrupt to issue a GPU reset. The GPU reset
code is modified for this case to avoid a HW reset: it only stops the
schedulers, detaches the fences of all in-progress and not-yet-scheduled
jobs, sets an error code on them and signals them.
It also rejects any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add helper function to do common ras_late_init/fini (v3) · b293e891

Authored by Hawking Zhang on Aug 30, 2019

In late_init for ras, the helper function will be used to:
1). disable ras feature if the IP block is masked as disabled
2). send enable feature command if the IP block was masked as enabled
3). create debugfs/sysfs node per IP block
4). register interrupt handler

v2: check ih_info.cb to decide whether to add the interrupt handler or not

v3: add ras_late_fini to clean up all the ras fs nodes and remove
the interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add ras_controller and err_event_athub interrupt support · 4e644fff

Authored by Hawking Zhang on Jun 05, 2019

The RAS controller interrupt and the RAS err event athub interrupt are two dedicated
interrupts for RAS support.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

24 Aug, 2019 1 commit

drm/amdgpu: correct ras error count type · 64cc5414

Authored by Guchun Chen on Aug 16, 2019

Use unsigned long type for the same ras count variable.
This will avoid overflow on 64-bit systems.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
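
Note: the commit message does not say which narrower type was previously mixed with unsigned long. As a minimal sketch of the problem class, assuming a 32-bit type on one side and a 64-bit type on the other, the following shows how a count is silently truncated when widths are mismatched. Fixed-width types are used so the behaviour does not depend on the platform; the function names are hypothetical.

```c
#include <stdint.h>

/* Storing a 64-bit count in a 32-bit variable drops the upper 32 bits.
 * The conversion is well defined for unsigned types (modulo 2^32), so
 * it happens silently. */
static uint32_t store_narrow(uint64_t count)
{
    return (uint32_t)count;
}

/* Keeping the same width everywhere preserves the value. */
static uint64_t store_wide(uint64_t count)
{
    return count;
}
```

Using one type consistently for the same logical counter, as the commit describes, removes this class of mismatch.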

13 Aug, 2019 3 commits

drm/amdgpu: remove ras block's feature status info in sysfs · 5212a3bd

Authored by Tao Zhou on Aug 09, 2019

feature mask info is enough for the rocm tool;
"cat /sys/class/drm/card0/device/ras/features" will print
info like this:

feature mask: 0x3ffb
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
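
Note: a tool reading this file has to decode the mask itself. The sketch below assumes that each bit of the mask corresponds to one RAS block and that a set bit means the feature is enabled; the commit message does not spell this out, and the helper names are hypothetical. Bit positions are deliberately not mapped to block names here.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the given bit is set in the mask,
 * 0 otherwise (including for out-of-range bit positions). */
static int feature_enabled(uint32_t mask, unsigned int bit)
{
    return bit < 32 && ((mask >> bit) & 1u);
}

/* Hypothetical helper: number of set bits in the mask. Each iteration
 * clears the lowest set bit. */
static unsigned int features_enabled_count(uint32_t mask)
{
    unsigned int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}
```

For the sample value 0x3ffb, bits 0 through 13 are set except bit 2, so under the stated assumption 13 of 14 features would be reported as enabled.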

drm/amdgpu: support mmhub ras in amdgpu ras · 9fb2d8de

Authored by Tao Zhou on Aug 06, 2019

call mmhub ras query/inject in amdgpu ras
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add sub block parameter in ras inject command · 44494f96

Authored by Tao Zhou on Aug 07, 2019

the ras sub block index can be passed from a shell command
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

07 Aug, 2019 1 commit

drm/amdgpu: update ras sysfs feature info · 2a3c7ff6

Authored by Tao Zhou on Aug 05, 2019

remove confusing ras error type info
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

02 Aug, 2019 3 commits

drm/amdgpu: replace AMDGPU_RAS_UE with AMDGPU_RAS_SUCCESS · bd2280da

Authored by Tao Zhou on Aug 01, 2019

ce can also trigger an interrupt, and both ce and ue errors can even be
found in one ras query, so distinguishing between ce and ue in the interrupt
handler is unnecessary.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Suggested-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: support ce interrupt in ras module · 51437623

Authored by Tao Zhou on Jul 29, 2019

a correctable error can also trigger an interrupt in some ras blocks
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add error address query for umc ras · 13b7c46c

Authored by Tao Zhou on Aug 01, 2019

umc error address query can get the ce/ue error address and clear the error
status
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

01 Aug, 2019 9 commits

drm/amdgpu: support gfx ras error injection and err_cnt query · 83b0582c

Authored by Dennis Li on Jul 31, 2019

check gfx error count in both the ras query function and
the ras interrupt handler.

gfx ras is still disabled by default due to a known stability
issue found in gpu reset.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: remove ras_reserve_vram in ras injection · 7cdc2ee3

Authored by Tao Zhou on Jul 24, 2019

error injection address is not in gpu address space
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add check for ras error type · e1063493

Authored by Tao Zhou on Jul 23, 2019

only ue and ce errors are supported
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: allow ras interrupt callback to return error data · cf04dfd0

Authored by Tao Zhou on Jul 22, 2019

add error data as a parameter for the ras interrupt cb and process it
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add support for recording ras error address · 6f102dba

Authored by Tao Zhou on Jul 22, 2019

more than one error address may be recorded in one query
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
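
Note: one way to hold several error addresses from a single query is a record with a counter and an address array. The sketch below is purely illustrative: the struct layout, the fixed capacity, and the helper name are assumptions for this example and are not the driver's actual data structure.

```c
#include <stddef.h>
#include <stdint.h>

/* Arbitrary capacity chosen for the example. */
#define MAX_ERR_ADDRS 8

/* Hypothetical per-query record: error counts plus every address
 * reported during that query. */
struct err_data {
    unsigned long ce_count;   /* correctable errors seen */
    unsigned long ue_count;   /* uncorrectable errors seen */
    size_t addr_count;        /* number of valid entries in addrs[] */
    uint64_t addrs[MAX_ERR_ADDRS];
};

/* Append one address; returns 0 on success, -1 if the record is
 * missing or already full. */
static int err_data_add_addr(struct err_data *d, uint64_t addr)
{
    if (d == NULL || d->addr_count >= MAX_ERR_ADDRS)
        return -1;

    d->addrs[d->addr_count++] = addr;
    return 0;
}
```

A caller would zero-initialise the record before each query and read back addr_count entries afterwards.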

drm/amdgpu: switch to amdgpu_umc structure · 045c0216

Authored by Tao Zhou on Jul 23, 2019

create a new amdgpu_umc structure for more umc
settings in the future and switch to the new structure
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add ras error count after each query (v2) · 05a58345

Authored by Tao Zhou on Jul 31, 2019

v1: increase ras ce/ue error count
v2: log the number of correctable and uncorrectable errors
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: query umc error count · 939e2258

Authored by Hawking Zhang on Jul 17, 2019

check umc error count in both the ras query function and
the ras interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: move some ras data structure to amdgpu_ras.h · 7af25d5b

Authored by Hawking Zhang on Jul 17, 2019

These are common structures that can be included by IP specific
source files
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

19 Jul, 2019 3 commits

drm/amdgpu: drop ras self test · 33c976c9

Authored by Hawking Zhang on Jul 18, 2019

this function is not needed any more. error injection is
the only way to validate ras but it can't be executed in
amdgpu_ras_init, where the gpu is not even initialized
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: only allow error injection to UMC IP block · a5dd40ca

Authored by Hawking Zhang on Jul 18, 2019

error injection to other IP blocks (except UMC) will not be enabled
until the RAS feature stabilizes on those IP blocks
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: do not create ras debugfs/sysfs node for ASICs that don't have ras ability · fb2a3607

Authored by Hawking Zhang on Jul 18, 2019

the driver shouldn't init any ras debugfs/sysfs node for ASICs that don't have ras
hardware ability
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
