提交 · f1403342ebdfcff3c3cf57ae476f19d3078f2767 · openeuler / Kernel

15 8月, 2020 4 次提交

drm/amdgpu: revert "fix system hang issue during GPU reset" · f1403342

由 Christian König 提交于 8月 12, 2020

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1a.
Signed-off-by: NChristian König <christian.koenig@amd.com>
Acked-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f1403342

drm/amdgpu: pass NULL pointer instead of 0 · 2f530724

由 Nirmoy Das 提交于 8月 11, 2020

Fixes: c030f2e4 ("drm/amdgpu: add amdgpu_ras.c to support ras (v2)")
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2f530724

drm/amdgpu: add debugfs node to toggle ras error cnt harvest · 66459e1d

由 Guchun Chen 提交于 8月 04, 2020

Before ras recovery is issued, user could operate this debugfs
node to enable/disable the harvest of all RAS IPs' ras error
count registers, which will help keep hardware's registers'
status instead of cleaning up them.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

66459e1d

drm/amdgpu: bypass querying ras error count registers · f75e94d8

由 Guchun Chen 提交于 8月 04, 2020

Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f75e94d8

05 8月, 2020 9 次提交

drm/amdgpu: enable RAS support for sienna cichlid · 0ad7a64d

由 John Clements 提交于 8月 03, 2020

enabled GECC error injection and query support
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0ad7a64d

drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0 · a219ecbb

由 Guchun Chen 提交于 7月 27, 2020

When amdgpu_bad_page_threshold = 0, bad page reservation stuffs
are skipped in either UMC ECC irq or page retirement calling of
sync flood isr.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a219ecbb

drm/amdgpu: decouple sysfs creating of bad page node · f848159b

由 Guchun Chen 提交于 7月 31, 2020

Bad page information should not be exposed by sysfs when
bad page retirement is disabled, so decouple it from ras
sysfs group creating, and add one guard before creating.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f848159b

drm/amdgpu: add one definition for RAS's sysfs/debugfs name(v2) · eb0c3cd4

由 Guchun Chen 提交于 7月 31, 2020

Add one definition for the RAS module's FS name. It's used
in both debugfs and sysfs cases.

v2: Use static variable instead of macro definition.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

eb0c3cd4

drm/amdgpu: restore ras flags when user resets eeprom(v2) · bf0b91b7

由 Guchun Chen 提交于 7月 22, 2020

RAS flags needs to be cleaned as well when user requires
one clean eeprom.

v2: RAS flags shall be restored after eeprom reset succeeds.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bf0b91b7

drm/amdgpu: break GPU recovery once it's in bad state(v4) · e8fbaf03

由 Guchun Chen 提交于 7月 23, 2020

When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e8fbaf03

drm/amdgpu: skip bad page reservation once issuing from eeprom write · 35cd2cda

由 Guchun Chen 提交于 7月 23, 2020

Once the ras recovery is issued from eeprom write itself,
bad page reservation should be ignored, otherwise, recursive
calling of writting to eeprom would happen.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

35cd2cda

drm/amdgpu: break driver init process when it's bad GPU(v5) · b82e65a9

由 Guchun Chen 提交于 7月 23, 2020

When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b82e65a9

drm/amdgpu: validate bad page threshold in ras(v3) · c84d4670

由 Guchun Chen 提交于 7月 22, 2020

Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
saved bad pages exceed threshold value, and proceed corresponding
actions.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c84d4670

28 7月, 2020 1 次提交

drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a

由 Dennis Li 提交于 7月 08, 2020

when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: NChristian König <christian.koenig@amd.com>
Suggested-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: NLijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: NLuben Tukov <luben.tuikov@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

df9c8d1a

23 7月, 2020 1 次提交

drm/amdgpu: add printing after executing page reservation to eeprom · b1628425

由 Guchun Chen 提交于 7月 20, 2020

This will tell users if the faulty page has been written to
external eeprom device in dmesg log.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b1628425

16 7月, 2020 1 次提交

drm/amdgpu: RAS emergency restart logic refine · bb5c7235

由 Wenhui Sheng 提交于 7月 13, 2020

If we are in RAS triggered situation and
BACO isn't support, emergency restart is needed,
and this code is only needed for some specific
cases(vega20 with given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset,
so in amdgpu_device_gpu_recover, we need differentiate
which mode1 reset we are using, then decide if it's
a full reset and then decide if emergency restart is needed,
the logic will become much more complex.

After discussion with Hawking, move emergency restart logic
to an independent function.
Signed-off-by: NLikun Gao <Likun.Gao@amd.com>
Signed-off-by: NWenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bb5c7235

01 7月, 2020 1 次提交

drm/amdgpu: label internally used symbols as static · f3167919

由 Nirmoy Das 提交于 6月 18, 2020

Used sparse(make C=1) to find these loose ends.

v2:
removed unwanted extra line
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f3167919

03 6月, 2020 2 次提交

drm/amdgpu: remove useless code in RAS · 9e69b1ee

由 Guchun Chen 提交于 6月 02, 2020

Module parameter amdgpu_ras_mask has been involved in
the calculation of ras support capability, so drop this
redundant code.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9e69b1ee

drm/amdgpu: fix RAS memory leak in error case · 5e91160a

由 Guchun Chen 提交于 6月 02, 2020

RAS context memory needs to freed in failure case.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5e91160a

29 5月, 2020 1 次提交

drm/amdgpu: print warning when input address is invalid · b0d4783a

由 Guchun Chen 提交于 5月 22, 2020

This will assist debug in error injection case.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b0d4783a

15 5月, 2020 1 次提交

drm/amdgpu: Update RAS XGMI error inject sequence · 5c23e9e0

由 John Clements 提交于 5月 13, 2020

Disable XGMI link power down prior to issuing a XGMI RAS error
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5c23e9e0

06 5月, 2020 1 次提交

drm/amdgpu: allocate large structures dynamically · 7fcffecf

由 Arnd Bergmann 提交于 5月 05, 2020

After the structure was padded to 1024 bytes, it is no longer
suitable for being a local variable, as the function surpasses
the warning limit for 32-bit architectures:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:587:5: error: stack frame size of 1072 bytes in function 'amdgpu_ras_feature_enable' [-Werror,-Wframe-larger-than=]
int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
    ^

Use kzalloc() instead to get it from the heap.

Fixes: a0d25482 ("drm/amdgpu: update RAS TA to Host interface")
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

7fcffecf

01 5月, 2020 1 次提交

drm/amdgpu: update RAS error handling · a200034b

由 John Clements 提交于 4月 30, 2020

Parse return status from TA to determine error severity
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a200034b

23 4月, 2020 2 次提交

drm/amdgpu: set error query ready after all IPs late init · a891d239

由 Dennis Li 提交于 4月 22, 2020

If set error query ready in amdgpu_ras_late_init, which will
cause some IP blocks aren't initialized, but their error query
is ready.
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a891d239

drm/amdgpu: fix kernel page fault issue by ras recovery on sGPU · 12c17b9d

由 Guchun Chen 提交于 4月 16, 2020

When running ras uncorrectable error injection and triggering GPU
reset on sGPU, below issue is observed. It's caused by the list
uninitialized when accessing.

[   80.047227] BUG: unable to handle page fault for address: ffffffffc0f4f750
[   80.047300] #PF: supervisor write access in kernel mode
[   80.047351] #PF: error_code(0x0003) - permissions violation
[   80.047404] PGD 12c20e067 P4D 12c20e067 PUD 12c210067 PMD 41c4ee067 PTE 404316061
[   80.047477] Oops: 0003 [#1] SMP PTI
[   80.047516] CPU: 7 PID: 377 Comm: kworker/7:2 Tainted: G           OE     5.4.0-rc7-guchchen #1
[   80.047594] Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
[   80.047888] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NJohn Clements <John.Clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

12c17b9d

14 4月, 2020 1 次提交

drm/amdgpu: refine ras related message print · 6952e99c

由 Guchun Chen 提交于 4月 10, 2020

Prefix ras related kernel message logging with PCI
device info by replacing DRM_INFO/WARN/ERROR with
dev_info/warn/err. This can clearly tell user about
GPU device information where ras is. And add some
other ras message printing to make it more clear
and friendly as well.
Suggested-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

6952e99c

09 4月, 2020 1 次提交

drm/amdgpu: resolve mGPU RAS query instability · b3dbd6d3

由 John Clements 提交于 4月 07, 2020

upon receiving uncorrectable error, query every GPU node for ras errors
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b3dbd6d3

08 4月, 2020 1 次提交

drm/amdgpu: resolve mGPU RAS query instability · 0b9ebd7e

由 John Clements 提交于 4月 07, 2020

upon receiving uncorrectable error, query every GPU node for ras errors
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0b9ebd7e

02 4月, 2020 2 次提交

drm/amdgpu: fix non-pointer dereference for non-RAS supported · a9d82d2f

由 Evan Quan 提交于 3月 27, 2020

Backtrace on gpu recover test on Navi10.

[ 1324.516681] RIP: 0010:amdgpu_ras_set_error_query_ready+0x15/0x20 [amdgpu]
[ 1324.523778] Code: 4c 89 f7 e8 cd a2 a0 d8 e9 99 fe ff ff 45 31 ff e9 91 fe ff ff 0f 1f 44 00 00 55 48 85 ff 48 89 e5 74 0e 48 8b 87 d8 2b 01 00 <40> 88 b0 38 01 00 00 5d c3 66 90 0f 1f 44 00 00 55 31 c0 48 85 ff
[ 1324.543452] RSP: 0018:ffffaa1040e4bd28 EFLAGS: 00010286
[ 1324.549025] RAX: 0000000000000000 RBX: ffff911198b20000 RCX: 0000000000000000
[ 1324.556217] RDX: 00000000000c0a01 RSI: 0000000000000000 RDI: ffff911198b20000
[ 1324.563514] RBP: ffffaa1040e4bd28 R08: 0000000000001000 R09: ffff91119d0028c0
[ 1324.570804] R10: ffffffff9a606b40 R11: 0000000000000000 R12: 0000000000000000
[ 1324.578413] R13: ffffaa1040e4bd70 R14: ffff911198b20000 R15: 0000000000000000
[ 1324.586464] FS:  00007f4441cbf540(0000) GS:ffff91119ed80000(0000) knlGS:0000000000000000
[ 1324.595434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1324.601345] CR2: 0000000000000138 CR3: 00000003fcdf8004 CR4: 00000000003606e0
[ 1324.608694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1324.616303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1324.623678] Call Trace:
[ 1324.626270]  amdgpu_device_gpu_recover+0x6e7/0xc50 [amdgpu]
[ 1324.632018]  ? seq_printf+0x4e/0x70
[ 1324.636652]  amdgpu_debugfs_gpu_recover+0x50/0x80 [amdgpu]
[ 1324.643371]  seq_read+0xda/0x420
[ 1324.647601]  full_proxy_read+0x5c/0x90
[ 1324.652426]  __vfs_read+0x1b/0x40
[ 1324.656734]  vfs_read+0x8e/0x130
[ 1324.660981]  ksys_read+0xa7/0xe0
[ 1324.665201]  __x64_sys_read+0x1a/0x20
[ 1324.669907]  do_syscall_64+0x57/0x1c0
[ 1324.674517]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1324.680654] RIP: 0033:0x7f44417cf081
Signed-off-by: NEvan Quan <evan.quan@amd.com>
Reviewed-by: NJohn Clements <John.Clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a9d82d2f

drm/amdgpu: disable ras query and iject during gpu reset · 61380faa

由 John Clements 提交于 3月 25, 2020

added flag to ras context to indicate if ras query functionality is ready
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

61380faa

20 3月, 2020 1 次提交

drm/amdgpu: protect RAS sysfs during GPU reset · 43c4d576

由 John Clements 提交于 3月 19, 2020

MMHub EDC becomes dirty after BACO reset

EDC registers should be cleared early on in reset phase
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

43c4d576

13 3月, 2020 2 次提交

drm/amdgpu: fix warning in ras_debugfs_create_all() · c1509f3f

由 Stanley.Yang 提交于 3月 12, 2020

Fix the warning
"warn: variable dereferenced before check 'obj' (see line 1131)"
by removing unnecessary checks as amdgpu_ras_debugfs_create_all()
is only called from amdgpu_debugfs_init() where obj member in
con->head list is not NULL.
Use list_for_each_entry() instead list_for_each_entry_safe() as obj
do not to be freeing or removing from list during this process.
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c1509f3f

drm/amdgpu: update ras capability's query based on mem ecc configuration · 88474cca

由 Guchun Chen 提交于 3月 10, 2020

RAS support capability needs to be updated on top of different
memeory ECC enablement, and remove redundant memory ecc check
in gmc module for vega20 and arcturus.

v2: check HBM ECC enablement and set ras mask accordingly.
v3: avoid to invoke atomfirmware interface to query twice.
Suggested-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

88474cca

11 3月, 2020 2 次提交

drm/amdgpu: call ras_debugfs_create_all in debugfs_init · 204eaac6

由 Tao Zhou 提交于 3月 06, 2020

and remove each ras IP's own debugfs creation

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

204eaac6

drm/amdgpu: add function to creat all ras debugfs node · f9317014

由 Tao Zhou 提交于 3月 06, 2020

centralize all debugfs creation in one place for ras

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f9317014

07 3月, 2020 1 次提交

drm/amdgpu: enable PCS error report on VG20 · ec01fe2d

由 Hawking Zhang 提交于 2月 21, 2020

Now driver will report XGMI/WAFL PCS error through
sysfs xgmi_wafl_err_count node on Vega20
Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

ec01fe2d

27 2月, 2020 1 次提交

drm/amdgpu: move get_xgmi_relative_phy_addr to amdgpu_xgmi.c · 19744f5f

由 Hawking Zhang 提交于 2月 24, 2020

centralize all the xgmi related function to amdgpu_xgmi.c
Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com>
Acked-by: NEvan Quan <evan.quan@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

19744f5f

19 2月, 2020 1 次提交

drm/amdgpu: log on non-zero error conter per IP before GPU reset · 313c8fd3

由 Guchun Chen 提交于 2月 13, 2020

Once sync flood interrupt is triggered by RAS error, before
actual GPU recovery job, it's necessary to log on and print
non-zero error counter, this will help user knows where the
RAS error source is from quickly.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

313c8fd3

23 1月, 2020 1 次提交

drm/amdgpu: added support to get mGPU DRAM base · a6c44d25

由 John Clements 提交于 1月 17, 2020

resolves issue with RAS error injection in mGPU configuration
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a6c44d25

17 1月, 2020 1 次提交

drm/amdgpu: check if driver should try recovery in ras recovery path · 93af20f7

由 Hawking Zhang 提交于 1月 16, 2020

To allow the flexibilty for user to disable gpu recovery
in RAS recovery path by module parameter amdgpu_gpu_recovery
Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

93af20f7

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功