Commits · 66399248feaf4a2fa4cd76765412a4139aca28e9 · openeuler / Kernel

20 Mar, 2020 1 commit

drm/amdgpu: protect RAS sysfs during GPU reset · 43c4d576

Authored by John Clements on Mar 19, 2020

MMHub EDC becomes dirty after BACO reset

EDC registers should be cleared early in the reset phase
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

13 Mar, 2020 2 commits

drm/amdgpu: fix warning in ras_debugfs_create_all() · c1509f3f

Authored by Stanley.Yang on Mar 12, 2020

Fix the warning
"warn: variable dereferenced before check 'obj' (see line 1131)"
by removing unnecessary checks, as amdgpu_ras_debugfs_create_all()
is only called from amdgpu_debugfs_init(), where the obj member in
the con->head list is not NULL.
Use list_for_each_entry() instead of list_for_each_entry_safe(), as obj
does not need to be freed or removed from the list during this process.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
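The distinction drawn in this message can be illustrated outside the kernel. The following is a minimal user-space sketch, not the driver code: `struct node`, `push_node`, `sum_nodes`, and `free_nodes` are hypothetical stand-ins. A read-only walk needs only one cursor, which is the case where list_for_each_entry() suffices; a walk that frees entries must save the successor first, which is the case list_for_each_entry_safe() exists for.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object on a list; not the kernel's struct. */
struct node {
    int value;
    struct node *next;
};

/* Prepend a node; on allocation failure the list is returned unchanged. */
static struct node *push_node(struct node *head, int value)
{
    struct node *n = malloc(sizeof(*n));

    if (n == NULL)
        return head;
    n->value = value;
    n->next = head;
    return n;
}

/* Read-only walk: nothing is unlinked, so a single cursor is enough. */
static int sum_nodes(const struct node *head)
{
    int sum = 0;

    for (const struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}

/* Destructive walk: save the successor before freeing the current node. */
static int free_nodes(struct node *head)
{
    int freed = 0;
    struct node *n = head;

    while (n != NULL) {
        struct node *next = n->next; /* must be read before free(n) */

        free(n);
        n = next;
        freed++;
    }
    return freed;
}
```

In this sketch only `free_nodes` needs the extra `next` variable; carrying it through a read-only loop adds nothing.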

drm/amdgpu: update ras capability's query based on mem ecc configuration · 88474cca

Authored by Guchun Chen on Mar 10, 2020

RAS support capability needs to be updated based on the
memory ECC enablement; also remove the redundant memory ECC check
in the gmc module for vega20 and arcturus.

v2: check HBM ECC enablement and set ras mask accordingly.
v3: avoid invoking the atomfirmware interface to query twice.
Suggested-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

11 Mar, 2020 2 commits

drm/amdgpu: call ras_debugfs_create_all in debugfs_init · 204eaac6

Authored by Tao Zhou on Mar 06, 2020

and remove each ras IP's own debugfs creation

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add function to creat all ras debugfs node · f9317014

Authored by Tao Zhou on Mar 06, 2020

centralize all debugfs creation in one place for ras

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

07 Mar, 2020 1 commit

drm/amdgpu: enable PCS error report on VG20 · ec01fe2d

Authored by Hawking Zhang on Feb 21, 2020

The driver now reports XGMI/WAFL PCS errors through the
sysfs xgmi_wafl_err_count node on Vega20
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

27 Feb, 2020 1 commit

drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.c · 19744f5f

Authored by Hawking Zhang on Feb 24, 2020

centralize all the xgmi related functions in amdgpu_xgmi.c
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

19 Feb, 2020 1 commit

drm/amdgpu: log on non-zero error conter per IP before GPU reset · 313c8fd3

Authored by Guchun Chen on Feb 13, 2020

Once a sync flood interrupt is triggered by a RAS error, it is
necessary to log and print the non-zero error counters before the
actual GPU recovery job; this helps the user quickly find where
the RAS error comes from.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23 Jan, 2020 1 commit

drm/amdgpu: added support to get mGPU DRAM base · a6c44d25

Authored by John Clements on Jan 17, 2020

resolves issue with RAS error injection in mGPU configuration
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Jan, 2020 1 commit

drm/amdgpu: check if driver should try recovery in ras recovery path · 93af20f7

Authored by Hawking Zhang on Jan 16, 2020

Give the user the flexibility to disable gpu recovery
in the RAS recovery path via the module parameter amdgpu_gpu_recovery
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14 Jan, 2020 1 commit

drm/amdgpu: support error reporting for sdma ip block · 3e81ee9a

Authored by Hawking Zhang on Jan 09, 2020

invoke sdma query_ras_error_count to get the sdma single
bit error count
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

24 Dec, 2019 1 commit

drm/amdgpu: add missed return value set for error case · 46cf2fec

Authored by Guchun Chen on Dec 23, 2019

The return value should be set when jumping to the error handling
label in the error case; this avoids a potential invalid array
access by the caller.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
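The pattern this message describes, assigning the return code before every jump to a shared error label, can be sketched in plain C. This is an illustrative user-space example under stated assumptions, not the code touched by the commit; `copy_records` is a hypothetical helper.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `count` ints into a freshly allocated buffer. */
static int copy_records(const int *src, size_t count, int **out)
{
    int ret = 0;
    int *buf = NULL;

    *out = NULL;

    if (src == NULL || count == 0) {
        ret = -EINVAL; /* set the code before jumping to the label */
        goto out;
    }

    buf = malloc(count * sizeof(*buf));
    if (buf == NULL) {
        ret = -ENOMEM; /* likewise for every other error path */
        goto out;
    }

    memcpy(buf, src, count * sizeof(*buf));
    *out = buf;

out:
    /* Callers check ret before touching *out, so a stale 0 here
     * would let them index a buffer that was never filled in. */
    return ret;
}
```

If an error path jumped to `out` while `ret` was still 0, the caller would treat the call as successful and go on to use `*out`, which is the kind of invalid access the message warns about.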

19 Dec, 2019 2 commits

drm/amdgpu: Remove unneeded semicolon in amdgpu_ras.c · 374bf7bd

Authored by zhengbin on Dec 14, 2019

Fixes coccicheck warning:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:318:2-3: Unneeded semicolon
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu · 61934624

Authored by Guchun Chen on Dec 13, 2019

The BACO reset mode strategy is determined by a later function when
calling amdgpu_ras_reset_gpu. To avoid confusing readers, drop
the argument.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

06 Dec, 2019 1 commit

drm/amdgpu: export amdgpu_ras_find_obj to use externally · f2a79be1

Authored by Le Ma on Nov 25, 2019

Change it to an external interface.
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

19 Nov, 2019 1 commit

drm/amdgpu: enable ras capablity check on arcturus · baaeb610

Authored by Hawking Zhang on Nov 13, 2019

check hw ras capability via atomfirmware
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

07 Nov, 2019 1 commit

drm/amdgpu: Improve RAS documentation (v2) · ef177d11

Authored by Alex Deucher on Oct 30, 2019

Clarify some areas, clean up formatting, add section for
unrecoverable error handling.

v2: fix grammatical errors
Reviewed-by: Yong Zhao <yong.zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

30 Oct, 2019 1 commit

drm/amdgpu: bypass some cleanup work after err_event_athub (v2) · bff77e86

Authored by Le Ma on Oct 25, 2019

PSP loses connection when err_event_athub occurs. This cleanup work can be
skipped in BACO reset.

v2: squash in missing include (Alex)
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

26 Oct, 2019 2 commits

drm/amdgpu: define macros for retire page reservation · 52dd95f2

Authored by Guchun Chen on Oct 22, 2019

Makes maintenance easier.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: refine reboot debugfs operation in ras case (v3) · c688a06b

Authored by Guchun Chen on Oct 21, 2019

The RAS reboot debugfs node gives the user an easy control, on a
per-card basis, to avoid the gpu recovery hang problem and directly
reboot the system after a ras uncorrectable error happens. However,
it is a common entry, so it should not live under the ras_ctrl node
or depend on an IP being given in the user's input. Add a new
auto_reboot node in the ras debugfs dir to achieve this.

v2: in the commit message, add justification for why the ras reboot
debugfs node is needed.
v3: use debugfs_create_bool to create debugfs file for boolean value
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Oct, 2019 1 commit

dmr/amdgpu: Fix crash on SRIOV for ERREVENT_ATHUB_INTERRUPT interrupt. · ed606f8a

Authored by Andrey Grodzovsky on Oct 11, 2019

Ignore the ERREVENT_ATHUB_INTERRUPT for systems without RAS.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-and-tested-by: Jack Zhang <Jack.Zhang1@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

11 Oct, 2019 3 commits

drm/amdgpu: avoid ras error injection for retired page · 6e4be987

Authored by Tao Zhou on Sep 30, 2019

check whether a page is a bad page before umc error injection; a bad page
should not be accessed again
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: document the reboot ras option · 54e9ab2e

Authored by Alex Deucher on Oct 08, 2019

We recently added it, but never documented it.
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: fix typos in documentation · a20bfd0f

Authored by Alex Deucher on Oct 08, 2019

Fix a couple of spelling typos.
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

08 Oct, 2019 1 commit

drm/amdgpu: Fix error handling in amdgpu_ras_recovery_init · 1995b3a3

Authored by Felix Kuehling on Oct 03, 2019

Don't set a struct pointer to NULL before freeing its members. It's
hard to see what's happening due to a local pointer-to-pointer data
aliasing con->eh_data.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Cox <Philip.Cox@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
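The ordering problem named here is easier to see in a small standalone example. The sketch below is user-space C written for illustration only: `struct err_data`, `struct ras_ctx`, the `bps` member, `ctx_init`, and `ctx_fini` are hypothetical stand-ins, and only the name `eh_data` is taken from the message. Because `data` aliases `con->eh_data`, writing NULL through it too early would lose the only handle to the members.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the structures involved. */
struct err_data {
    int *bps;  /* a member that owns its own allocation */
    int count;
};

struct ras_ctx {
    struct err_data *eh_data;
};

static int ctx_init(struct ras_ctx *con, int count)
{
    struct err_data **data = &con->eh_data; /* aliases con->eh_data */

    *data = calloc(1, sizeof(**data));
    if (*data == NULL)
        return -1;

    (*data)->bps = calloc((size_t)count, sizeof(int));
    if ((*data)->bps == NULL) {
        free(*data);
        *data = NULL; /* cleared only after the struct was freed */
        return -1;
    }
    (*data)->count = count;
    return 0;
}

static void ctx_fini(struct ras_ctx *con)
{
    struct err_data **data = &con->eh_data;

    if (*data == NULL)
        return;

    free((*data)->bps); /* members first */
    free(*data);        /* then the struct itself */
    *data = NULL;       /* last: this also clears con->eh_data */
}
```

Moving `*data = NULL` above the two `free` calls in `ctx_fini` would dereference a NULL pointer on the next line, or leak both allocations if the frees were guarded.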

03 Oct, 2019 7 commits

drm/amdgpu: simplify the access to eeprom_control struct · 0771b0bf

Authored by Tao Zhou on Sep 18, 2019

simplify the code that accesses the eeprom_control struct
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: replace mmhub_funcs with mmhub.funcs · d65bf1f8

Authored by Tao Zhou on Sep 12, 2019

remove mmhub_funcs in adev
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: fix and update the documentation for RAS · f77c7109

Authored by Alex Deucher on Sep 19, 2019

Add new sections to amdgpu.rst, fix up formatting issues,
add additional documentation to each section.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: avoid null pointer dereference · 8a3e801f

Authored by Guchun Chen on Sep 17, 2019

The null pointer should be checked first to avoid a null pointer access
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu/ras: use GPU PAGE_SIZE/SHIFT for reserving pages · a142ba88

Authored by Alex Deucher on Sep 17, 2019

We are reserving vram pages so they should be aligned to the
GPU page size.
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
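The alignment arithmetic behind this change can be sketched as follows. This is an illustration only: the shift of 12 (4 KiB GPU pages) is an assumption made for the example, and `GPU_PAGE_SHIFT`, `GPU_PAGE_SIZE`, `gpu_page_index`, and `gpu_page_base` are hypothetical names rather than the driver's definitions.

```c
#include <stdint.h>

/* Assumed for illustration: 4 KiB GPU pages. */
#define GPU_PAGE_SHIFT 12
#define GPU_PAGE_SIZE  (1ULL << GPU_PAGE_SHIFT)

/* Page number that contains a byte address. */
static uint64_t gpu_page_index(uint64_t addr)
{
    return addr >> GPU_PAGE_SHIFT;
}

/* Byte address rounded down to the start of its GPU page. */
static uint64_t gpu_page_base(uint64_t addr)
{
    return addr & ~(GPU_PAGE_SIZE - 1);
}
```

On a host whose CPU page size differs from the GPU page size, deriving the index from the CPU constants would reserve the wrong range, which is why the two sets of constants are kept apart.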

drm/amdgpu: replace DRM_ERROR with DRM_WARN in ras_reserve_bad_pages · ae115c81

Authored by Tao Zhou on Sep 12, 2019

There are two cases where a reserve error should be ignored:
1) a ras bad page has been allocated (used by someone);
2) a ras bad page has been reserved (duplicate error injection for one page);

DRM_ERROR is unnecessary for a bad page reserve failure
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

docs: drm/amdgpu: Resolve build warnings · 879e723d

Authored by Adam Zerella on Sep 14, 2019

Some of the documentation formatting could be improved,
which will resolve some Sphinx amdgpu build warnings, e.g.:

WARNING: Unexpected indentation.
WARNING: Block quote ends without a blank line; unexpected unindent.
WARNING: Inline emphasis start-string without end-string.
Signed-off-by: Adam Zerella <adam.zerella@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Sep, 2019 2 commits

drm/amdgpu: cleanup creating BOs at fixed location (v2) · de7b45ba

Authored by Christian König on Sep 13, 2019

The placement is something TTM/BO internal and the RAS code should
avoid touching that directly.

Add a helper to create a BO at a fixed location and use that instead.

v2: squash in fixes (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix ras ctrl debugfs node leak · 012dd14d

Authored by Guchun Chen on Sep 16, 2019

Use debugfs_remove_recursive to remove the whole debugfs
directory instead of removing the nodes one by one.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Sep, 2019 5 commits

drm/amdgpu: support pcie bif ras query and inject · d7bd680d

Authored by Guchun Chen on Sep 11, 2019

Call pcie bif ras query/inject in amdgpu ras.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: enable error injection to XGMI block via debugfs · f3170352

Authored by Hawking Zhang on Sep 08, 2019

allow injecting errors into the XGMI block via the debugfs node ras_ctrl
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Allow to reset to EERPOM table. · 084fe13b

Authored by Andrey Grodzovsky on Sep 09, 2019

The table grows quickly during debug/development effort when
multiple RAS errors are injected. Allow avoiding this by setting
the table header back to empty if needed.

v2: Switch to debugfs entry instead of load time parameter.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: move umc ras init to umc block · 4930aabe

Authored by Tao Zhou on Sep 05, 2019

move umc ras init from the ras module to the umc block; the generic ras
module should pay less attention to specific ras blocks.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Avoid RAS recovery init when no RAS support. · 4d1337d2

Authored by Andrey Grodzovsky on Sep 06, 2019

Fixes driver load regression on APUs.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14 Sep, 2019 1 commit

drm/amdgpu: move the call of ras recovery_init and bad page reserve to proper place · 1a6fc071

Authored by Tao Zhou on Aug 30, 2019

ras recovery_init should be called after ttm init;
bad page reserve should be done before gpu reset, since i2c
may be unstable during gpu reset.
add cleanup for recovery_init and recovery_fini

v2: add more comments and prints.
    remove cancel_work_sync in recovery_init.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

openeuler / Kernel synced successfully more than 1 year ago
