Commits · 36a1707afda9abc704543d6b419a998c64df41ca · openeuler / Kernel

17 Jan, 2020 1 commit

drm/amdgpu: check if driver should try recovery in ras recovery path · 93af20f7

Authored by Hawking Zhang on Jan 16, 2020

Give users the flexibility to disable GPU recovery
in the RAS recovery path via the module parameter amdgpu_gpu_recovery.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
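
The gate described in this message can be pictured with a small userspace sketch. The helper name, the enum, and the `auto_default` argument are hypothetical; the parameter values are assumed to mean 0 = disabled, 1 = enabled, -1 = auto, and this is not the driver's actual code.

```c
#include <stdbool.h>

/* Possible outcomes of the RAS recovery path in this sketch. */
enum ras_action {
    RAS_ACTION_SKIP_RECOVERY,
    RAS_ACTION_RESET_GPU,
};

/*
 * Hypothetical helper, not the driver's function.
 * gpu_recovery mirrors the amdgpu_gpu_recovery module parameter,
 * assumed here to mean: 0 = disabled, 1 = enabled, -1 = auto.
 * auto_default is what "auto" resolves to on a given ASIC.
 */
static enum ras_action ras_recovery_action(int gpu_recovery, bool auto_default)
{
    if (gpu_recovery == 0)
        return RAS_ACTION_SKIP_RECOVERY;
    if (gpu_recovery == -1 && !auto_default)
        return RAS_ACTION_SKIP_RECOVERY;
    return RAS_ACTION_RESET_GPU;
}
```

With this shape, the RAS path asks the same question as any other reset path before it schedules a reset.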

14 Jan, 2020 1 commit

drm/amdgpu: support error reporting for sdma ip block · 3e81ee9a

Authored by Hawking Zhang on Jan 09, 2020

Invoke sdma query_ras_error_count to get the sdma single
bit error count.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

24 Dec, 2019 1 commit

drm/amdgpu: add missed return value set for error case · 46cf2fec

Authored by Guchun Chen on Dec 23, 2019

The return value should be set when jumping to the error-handling
label in the error case; this avoids a potential invalid array
access by the caller.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
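
A minimal sketch of the pattern, assuming a hypothetical `lookup_table` helper (not a function from the driver): if `ret` were left at 0 on the error path, the caller would treat the NULL array as valid and index into it.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: hand back a table and its length.
 * Returns 0 on success or a negative errno on failure.
 */
static int lookup_table(const int *table, size_t n,
                        const int **out, size_t *count)
{
    int ret = 0;

    *out = NULL;
    *count = 0;

    if (!table || n == 0) {
        ret = -EINVAL;   /* must be set before jumping to the label */
        goto done;
    }

    *out = table;
    *count = n;
done:
    return ret;
}
```

A caller that checks `lookup_table(...) == 0` before touching `out[i]` is then safe on both paths.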

19 Dec, 2019 2 commits

drm/amdgpu: Remove unneeded semicolon in amdgpu_ras.c · 374bf7bd

Authored by zhengbin on Dec 14, 2019

Fixes coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu · 61934624

Authored by Guchun Chen on Dec 13, 2019

The BACO reset mode strategy is determined by a later function when
calling amdgpu_ras_reset_gpu. To avoid confusing readers, drop
the argument.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

06 Dec, 2019 1 commit

drm/amdgpu: export amdgpu_ras_find_obj to use externally · f2a79be1

Authored by Le Ma on Nov 25, 2019

Change it to an external interface.
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

19 Nov, 2019 1 commit

drm/amdgpu: enable ras capablity check on arcturus · baaeb610

Authored by Hawking Zhang on Nov 13, 2019

Check hw ras capability via atomfirmware.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

07 Nov, 2019 1 commit

drm/amdgpu: Improve RAS documentation (v2) · ef177d11

Authored by Alex Deucher on Oct 30, 2019

Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

v2: fix grammatical errors
Reviewed-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

30 Oct, 2019 1 commit

drm/amdgpu: bypass some cleanup work after err_event_athub (v2) · bff77e86

Authored by Le Ma on Oct 25, 2019

PSP loses connection when err_event_athub occurs. This cleanup work can be
skipped in BACO reset.

v2: squash in missing include (Alex)
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

26 Oct, 2019 2 commits

drm/amdgpu: define macros for retire page reservation · 52dd95f2

Authored by Guchun Chen on Oct 22, 2019

Eases maintenance.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: refine reboot debugfs operation in ras case (v3) · c688a06b

Authored by Guchun Chen on Oct 21, 2019

The RAS reboot debugfs node gives the user an easy control to avoid
the gpu recovery hang problem and directly reboot the system on a
per-card basis after a RAS uncorrectable error happens. However, it is
a common entry, so it should move out of the ras_ctrl node and
drop the IP dependence in user input. So add a new
auto_reboot node in the ras debugfs dir to achieve this.

v2: in commit message, add justification why ras reboot debugfs
node is needed.
v3: use debugfs_create_bool to create debugfs file for boolean value
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Oct, 2019 1 commit

dmr/amdgpu: Fix crash on SRIOV for ERREVENT_ATHUB_INTERRUPT interrupt. · ed606f8a

Authored by Andrey Grodzovsky on Oct 11, 2019

Ignore the ERREVENT_ATHUB_INTERRUPT for systems without RAS.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-and-tested-by: Jack Zhang <Jack.Zhang1@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

11 Oct, 2019 3 commits

drm/amdgpu: avoid ras error injection for retired page · 6e4be987

Authored by Tao Zhou on Sep 30, 2019

Check whether a page is a bad page before umc error injection; a bad page
should not be accessed again.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
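
The check can be sketched as a lookup in a list of retired page numbers before the injection is allowed. All names here are hypothetical, the 4 KiB page shift is an assumption, and the driver's real data structures differ.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed 4 KiB pages, for illustration */

/* True if addr falls inside one of the retired pages. */
static bool is_retired_page(const uint64_t *retired_pfns, size_t n,
                            uint64_t addr)
{
    uint64_t pfn = addr >> SKETCH_PAGE_SHIFT;
    size_t i;

    for (i = 0; i < n; i++)
        if (retired_pfns[i] == pfn)
            return true;
    return false;
}

/* Returns 0 if injection may proceed, -EINVAL if the page is retired. */
static int check_umc_inject(const uint64_t *retired_pfns, size_t n,
                            uint64_t addr)
{
    return is_retired_page(retired_pfns, n, addr) ? -EINVAL : 0;
}
```

A linear scan is enough for a sketch; a real implementation may prefer a different lookup structure.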

drm/amdgpu/ras: document the reboot ras option · 54e9ab2e

Authored by Alex Deucher on Oct 08, 2019

We recently added it, but never documented it.
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: fix typos in documentation · a20bfd0f

Authored by Alex Deucher on Oct 08, 2019

Fix a couple of spelling typos.
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

08 Oct, 2019 1 commit

drm/amdgpu: Fix error handling in amdgpu_ras_recovery_init · 1995b3a3

Authored by Felix Kuehling on Oct 03, 2019

Don't set a struct pointer to NULL before freeing its members. It's
hard to see what's happening due to a local pointer-to-pointer data
aliasing con->eh_data.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Cox <Philip.Cox@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
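
The aliasing hazard can be reproduced in a few lines of userspace C. The struct and function names are hypothetical stand-ins (only `con->eh_data` comes from the message), so treat this as a sketch of the ordering rule rather than the driver's code.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the handler data and its owner. */
struct err_data {
    int *bps;                   /* member allocated separately */
};

struct ras_ctx {
    struct err_data *eh_data;
};

/* Allocate the handler data and its member; returns 0 or -1. */
static int ras_ctx_alloc(struct ras_ctx *con, size_t n)
{
    con->eh_data = calloc(1, sizeof(*con->eh_data));
    if (!con->eh_data)
        return -1;

    con->eh_data->bps = calloc(n, sizeof(int));
    if (!con->eh_data->bps) {
        free(con->eh_data);
        con->eh_data = NULL;
        return -1;
    }
    return 0;
}

static void ras_ctx_release(struct ras_ctx *con)
{
    struct err_data **data = &con->eh_data;   /* aliases con->eh_data */

    if (!*data)
        return;
    /*
     * Writing con->eh_data = NULL at this point would also clear *data,
     * so the free() calls below would dereference a NULL pointer.
     */
    free((*data)->bps);
    free(*data);
    con->eh_data = NULL;        /* clear the pointer last */
}
```

Because `data` and `con->eh_data` name the same storage, the order of the last three statements is the whole fix.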

03 Oct, 2019 7 commits

drm/amdgpu: simplify the access to eeprom_control struct · 0771b0bf

Authored by Tao Zhou on Sep 18, 2019

Simplify the code that accesses the eeprom_control struct.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: replace mmhub_funcs with mmhub.funcs · d65bf1f8

Authored by Tao Zhou on Sep 12, 2019

Remove mmhub_funcs in adev.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: fix and update the documentation for RAS · f77c7109

Authored by Alex Deucher on Sep 19, 2019

Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: avoid null pointer dereference · 8a3e801f

Authored by Guchun Chen on Sep 17, 2019

The null pointer should be checked first to avoid a null pointer access.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pages · a142ba88

Authored by Alex Deucher on Sep 17, 2019

We are reserving vram pages so they should be aligned to the
GPU page size.
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
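
As an illustration of why the page unit matters, the sketch below converts between VRAM byte addresses and GPU page numbers. The macro names are hypothetical and the 4 KiB GPU page is an assumption; the point is that the CPU page size can differ per architecture while the GPU page size does not follow it.

```c
#include <stdint.h>

/* Assumed values for illustration: a 4 KiB GPU page. */
#define SKETCH_GPU_PAGE_SHIFT 12
#define SKETCH_GPU_PAGE_SIZE  (UINT64_C(1) << SKETCH_GPU_PAGE_SHIFT)

/* GPU page number that contains the given VRAM byte address. */
static uint64_t gpu_page_number(uint64_t vram_addr)
{
    return vram_addr >> SKETCH_GPU_PAGE_SHIFT;
}

/* Byte address of the first byte of a GPU page; always page aligned. */
static uint64_t gpu_page_start(uint64_t page_number)
{
    return page_number << SKETCH_GPU_PAGE_SHIFT;
}
```

Reserving `SKETCH_GPU_PAGE_SIZE` bytes at `gpu_page_start(n)` covers exactly one GPU page regardless of the host's page size.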

drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pages · ae115c81

Authored by Tao Zhou on Sep 12, 2019

Two kinds of reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page).

DRM_ERROR is unnecessary for a failed bad page reserve.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

docs: drm/amdgpu: Resolve build warnings · 879e723d

Authored by Adam Zerella on Sep 14, 2019

Some of the documentation formatting could be improved,
which will resolve some Sphinx amdgpu build warnings, e.g.

WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Sep, 2019 2 commits

drm/amdgpu: cleanup creating BOs at fixed location (v2) · de7b45ba

Authored by Christian König on Sep 13, 2019

The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.

Add a helper to create a BO at a fixed location and use that instead.

v2: squash in fixes (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix ras ctrl debugfs node leak · 012dd14d

Authored by Guchun Chen on Sep 16, 2019

Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the nodes one by one.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Sep, 2019 5 commits

drm/amdgpu: support pcie bif ras query and inject · d7bd680d

Authored by Guchun Chen on Sep 11, 2019

Call pcie bif ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: enable error injection to XGMI block via debugfs · f3170352

Authored by Hawking Zhang on Sep 08, 2019

Allow injecting errors into the XGMI block via the debugfs node ras_ctrl.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Allow to reset to EERPOM table. · 084fe13b

Authored by Andrey Grodzovsky on Sep 09, 2019

The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow avoiding this by setting
the table header back to empty if needed.

v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: move umc ras init to umc block · 4930aabe

Authored by Tao Zhou on Sep 05, 2019

Move umc ras init from the ras module to the umc block; the generic ras module
should pay less attention to specific ras blocks.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Avoid RAS recovery init when no RAS support. · 4d1337d2

Authored by Andrey Grodzovsky on Sep 06, 2019

Fixes driver load regression on APUs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14 Sep, 2019 7 commits

drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place · 1a6fc071

Authored by Tao Zhou on Aug 30, 2019

ras recovery_init should be called after ttm init;
bad page reserve should be put in front of gpu reset since i2c
may be unstable during gpu reset.
Add cleanup for recovery_init and recovery_fini.

v2: add more comment and print.
    remove cancel_work_sync in recovery_init.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Hook EEPROM table to RAS · 78ad00c9

Authored by Tao Zhou on Aug 15, 2019

Support eeprom records load and save for ras;
move EEPROM records storing to bad page reserving.

v2: remove redundant check for con->eh_data
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: change ras bps type to eeprom table record structure · 9dc23a63

Authored by Tao Zhou on Aug 13, 2019

Change the bps type from retired page to eeprom table record, to prepare for
saving umc error records to eeprom.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dmr/amdgpu: Add system auto reboot to RAS. · d5ea093e

Authored by Andrey Grodzovsky on Aug 22, 2019

In case of a RAS error, allow the user to configure automatic system
reboot through ras_ctrl.
This is also part of the temporary workaround for the RAS
hang problem.

v4: Use latest kernel API for disk sync.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Avoid HW GPU reset for RAS. · 7c6e68c7

Authored by Andrey Grodzovsky on Sep 13, 2019

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until proper solution in PSP/SMU is ready:
When an uncorrectable error happens, the DF will unconditionally
broadcast error event packets to all its clients/slaves upon
receiving a fatal error event and freeze all its outbound queues;
the err_event_athub interrupt will be triggered.
In such a case we use this interrupt
to issue a GPU reset. The GPU reset code is modified for this case to avoid a HW
reset: it only stops the schedulers, detaches the fences of all in-progress and
not yet scheduled jobs, sets an error code on them and signals them.
It also rejects any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add helper function to do common ras_late_init/fini (v3) · b293e891

Authored by Hawking Zhang on Aug 30, 2019

In late_init for ras, the helper function will be used to
1). disable the ras feature if the IP block is masked as disabled
2). send the enable feature command if the IP block is masked as enabled
3). create a debugfs/sysfs node per IP block
4). register the interrupt handler

v2: check ih_info.cb to decide whether to add the interrupt handler

v3: add ras_late_fini to clean up all the ras fs nodes and remove
the interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
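
The four steps and the v2/v3 notes can be modelled as a toy state machine. Everything below is hypothetical (struct, field, and function names); the flags merely stand in for the real firmware command, fs node creation, and interrupt registration, and the steps follow the order in which the message lists them.

```c
#include <stdbool.h>

/* Hypothetical per-IP-block state; the real driver structures differ. */
struct ras_block {
    bool masked_enabled;    /* block is enabled in the feature mask */
    bool has_irq_cb;        /* block supplies an interrupt callback (v2) */
    bool feature_on;
    bool fs_nodes;
    bool irq_registered;
};

static void ras_block_late_init(struct ras_block *b)
{
    if (!b->masked_enabled) {
        b->feature_on = false;      /* step 1: disable and stop here */
        return;
    }
    b->feature_on = true;           /* step 2: stands in for the enable command */
    b->fs_nodes = true;             /* step 3: debugfs/sysfs nodes */
    if (b->has_irq_cb)
        b->irq_registered = true;   /* step 4: only when a callback exists */
}

/* v3: undo what late_init set up. */
static void ras_block_late_fini(struct ras_block *b)
{
    b->fs_nodes = false;
    b->irq_registered = false;
}
```

Keeping the sequence in one helper means each IP block only has to describe itself instead of repeating the steps.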

drm/amdgpu: add ras_controller and err_event_athub interrupt support · 4e644fff

Authored by Hawking Zhang on Jun 05, 2019

The RAS controller interrupt and the RAS err event athub interrupt are two dedicated
interrupts for RAS support.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

24 Aug, 2019 1 commit

drm/amdgpu: correct ras error count type · 64cc5414

Authored by Guchun Chen on Aug 16, 2019

Use unsigned long type for the same ras count variable.
This will avoid overflow on 64-bit systems.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
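
The effect of a too-narrow counter can be shown with fixed-width types. The message does not say which narrower type was in use, so the 32-bit accumulator below is an assumption chosen for contrast, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum per-block counts in a 64-bit accumulator. */
static uint64_t total_errors_wide(const uint64_t *counts, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += counts[i];
    return sum;
}

/* Same sum in a 32-bit accumulator: wraps modulo 2^32. */
static uint32_t total_errors_narrow(const uint64_t *counts, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += (uint32_t)counts[i];
    return sum;
}
```

Using one type for the same quantity everywhere avoids this kind of silent wrap at the boundary between two variables.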

13 Aug, 2019 2 commits

drm/amdgpu: remove ras block's feature status info in sysfs · 5212a3bd

Authored by Tao Zhou on Aug 09, 2019

The feature mask info is enough for the rocm tool;
"cat /sys/class/drm/card0/device/ras/features" will print
info like this:

feature mask: 0x3ffb
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
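
A tool consuming that line might parse it roughly as follows. The function names are hypothetical, and the sketch deliberately works with bit positions only, since the mapping from bits to RAS blocks is not given in the message.

```c
#include <stdbool.h>
#include <stdio.h>

/* Parse a line of the form "feature mask: 0x3ffb"; returns 0 or -1. */
static int parse_feature_mask(const char *line, unsigned long *mask)
{
    return sscanf(line, "feature mask: 0x%lx", mask) == 1 ? 0 : -1;
}

/* True if the given bit position is set in the mask. */
static bool feature_bit_set(unsigned long mask, unsigned int bit)
{
    return (mask >> bit) & 1UL;
}
```

For the example value 0x3ffb, bits 0 to 13 are set except bit 2.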

drm/amdgpu: support mmhub ras in amdgpu ras · 9fb2d8de

Authored by Tao Zhou on Aug 06, 2019

Call mmhub ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

openeuler / Kernel synced successfully more than 1 year ago