提交 · 8a82b347e8732fd2b68d26a6e9f0d9a1c397560d · openeuler / Kernel

07 1月, 2021 1 次提交

drm/amdgpu: fix no bad_pages issue after umc ue injection · 8a82b347

由 Dennis Li 提交于 1月 05, 2021

old code wrongly used the bad page status as the function return value,
which cause amdgpu_ras_badpages_read always return failed.
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8a82b347

09 12月, 2020 2 次提交

drm/amdgpu: fix debugfs creation/removal, again · 2343e9d2

由 Arnd Bergmann 提交于 12月 04, 2020

There is still a warning when CONFIG_DEBUG_FS is disabled:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)

Change the code again to make the compiler actually drop
this code but not warn about it.

Fixes: ae2bf61f ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2343e9d2

drm/amdgpu: fix debugfs creation/removal, again · cedf7884

由 Arnd Bergmann 提交于 12月 04, 2020

There is still a warning when CONFIG_DEBUG_FS is disabled:

Change the code again to make the compiler actually drop
this code but not warn about it.

cedf7884

14 11月, 2020 2 次提交

drm/amd/amdgpu/amdgpu_ras: Make local function 'amdgpu_ras_error_status_query' static · cd92df93

由 Lee Jones 提交于 11月 13, 2020

Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1482:6: warning: no previous prototype for ‘amdgpu_ras_error_status_query’ [-Wmissing-prototypes]

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: NLee Jones <lee.jones@linaro.org>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

cd92df93

drm/amd/amdgpu/amdgpu_ras: Remove unused function 'amdgpu_ras_error_cure' · 32dc5348

由 Lee Jones 提交于 11月 13, 2020

Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:908:5: warning: no previous prototype for ‘amdgpu_ras_error_cure’ [-Wmissing-prototypes]

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Signed-off-by: NLee Jones <lee.jones@linaro.org>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

32dc5348

03 11月, 2020 1 次提交

drm/amdgpu/amdgpu: improve code indentation and alignment · f3729f7b

由 Deepak R Varma 提交于 11月 02, 2020

General code indentation and alignment changes such as replace spaces
by tabs or align function arguments as per the coding style
guidelines. The patch corrects issues for various amdgpu_*.c files
for this driver. Issue reported by checkpatch script.
Signed-off-by: NDeepak R Varma <mh12gx2825@gmail.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f3729f7b

30 10月, 2020 3 次提交

drm/amdgpu: remove unneeded semicolon · aec576f9

由 Tom Rix 提交于 10月 27, 2020

A semicolon is not needed after a switch statement.
Signed-off-by: NTom Rix <trix@redhat.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

aec576f9

drm/amdgpu: fix the issue of reserving bad pages failed · 676deb38

由 Dennis Li 提交于 10月 22, 2020

In amdgpu_ras_reset_gpu, because bad pages may not be freed,
it has high probability to reserve bad pages failed.

Change to reserve bad pages when freeing VRAM.

v2:
1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c
2. move bad page reserving into amdgpu_ras_add_bad_pages, if vram mgr
   reserve bad page failed, it will put it into pending list, otherwise
   put it into processed list;
3. remove amdgpu_ras_release_bad_pages, because retired page's info has
   been moved into amdgpu_vram_mgr

v3:
1. formate code style;
2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation;
3. rename scope_pending as reservations_pending;
4. rename scope_processed as reserved_pages;
5. change to iterate over all the pending ones and try to insert them
   with drm_mm_reserve_node();

v4:
1. rename amdgpu_vram_mgr_reserve_scope as
amdgpu_vram_mgr_reserve_range;
2. remove unused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.
Reviewed-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: NHawking Zhang <hawking.zhang@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NWenhui Sheng <Wenhui.Sheng@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

676deb38

drm/amdgpu: change to save bad pages in UMC error interrupt callback · 22503d80

由 Dennis Li 提交于 10月 19, 2020

Instead of saving bad pages in amdgpu_ras_reset_gpu, it will reduce
the unnecessary calling of amdgpu_ras_save_bad_pages.
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Reviewed-by: NHawking Zhang <hawking.zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

22503d80

22 10月, 2020 2 次提交

Revert drm/amdgpu: disable sienna chichlid UMC RAS · e4eeceb7

由 John Clements 提交于 10月 21, 2020

This reverts commit 265c280a.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e4eeceb7

Revert drm/amdgpu: disable sienna chichlid UMC RAS · 4d2aae33

由 John Clements 提交于 10月 21, 2020

This reverts commit 265c280a.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4d2aae33

26 9月, 2020 3 次提交

drm/amdgpu: fix a warning in amdgpu_ras.c (v2) · a069a9eb

由 Alex Deucher 提交于 9月 25, 2020

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c: In function ‘amdgpu_ras_fs_init’:
drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1284:2: warning: ignoring return value of ‘sysfs_create_group’, declared with attribute warn_unused_result [-Wunused-result]
 1284 |  sysfs_create_group(&adev->dev->kobj, &group);
      |  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

v2: just print an error for sysfs group creation failure
Acked-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NLuben Tuikov <luben.tuikov@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a069a9eb

drm/amdgpu: clean up ras sysfs creation (v2) · c3d4d45d

由 Guchun Chen 提交于 9月 24, 2020

Merge ras sysfs creation together by calling sysfs_create_group
once, as sysfs_update_group may not work properly as expected.

v2: improve commit message
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c3d4d45d

drm/amdgpu: disable sienna chichlid UMC RAS · 265c280a

由 John Clements 提交于 9月 24, 2020

disable UMC RAS in lieu of stability issues on certain sku
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

265c280a

23 9月, 2020 1 次提交

drm/amdgpu: update athub interrupt harvesting handle · 3f975d0f

由 Stanley.Yang 提交于 9月 15, 2020

GCEA/MMHUB EA error should not result to DF freeze, this is
fixed in next generation, but for some reasons the GCEA/MMHUB
EA error will result to DF freeze in previous generation,
diver should avoid to indicate GCEA/MMHUB EA error as hw fatal
error in kernel message by read GCEA/MMHUB err status registers.

Changed from V1:
    make query_ras_error_status function more general
    make read mmhub er status register more friendly

Changed from V2:
    move ras error status query function into do_recovery workqueue

Changed from V3:
    remove useless code from V2, print GCEA error status
    instance number
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

3f975d0f

27 8月, 2020 1 次提交

drm/amdkfd: fix set kfd node ras properties value · 5436ab94

由 Stanley.Yang 提交于 8月 17, 2020

The ctx->features are new RAS implementation which
is only available for Vega20 and onwards, it is not
available for vega10, vega10 should follow legacy
ECC implementation.

Changed from V1:
    wrap function to initialize kfd node properties

Changed from V2:
    remove wrap function and SDMA SRAM ECC check
Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5436ab94

25 8月, 2020 3 次提交

drm/amdgpu: Get DRM dev from adev by inline-f · 4a580877

由 Luben Tuikov 提交于 8月 24, 2020

Add a static inline adev_to_drm() to obtain
the DRM device pointer from an amdgpu_device pointer.
Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4a580877

drm/amdgpu: refine create and release logic of hive info · d95e8e97

由 Dennis Li 提交于 8月 18, 2020

Change to dynamically create and release hive info object,
which help driver support more hives in the future.

v2:
Change to save hive object pointer in adev, to avoid locking
xgmi_mutex every time when calling amdgpu_get_xgmi_hive.

v3:
1. Change type of hive object pointer in adev from void* to
amdgpu_hive_info*.
2. remove unnecessary variable initialization.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

d95e8e97

drm/amdgpu: refine codes to avoid reentering GPU recovery · 53b3f8f4

由 Dennis Li 提交于 8月 19, 2020

if other threads have holden the reset lock, recovery will
fail to try_lock. Therefore we introduce atomic hive->in_reset
and adev->in_gpu_reset, to avoid reentering GPU recovery.

v2:
drop "? true : false" in the definition of amdgpu_in_reset
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

53b3f8f4

19 8月, 2020 1 次提交

drm/amdgpu: fix NULL pointer access issue when unloading driver · 1a68d96f

由 Guchun Chen 提交于 8月 13, 2020

When unloading driver by "modprobe -r amdgpu", one NULL pointer
dereference bug occurs in ras debugfs releasing. The cause is the
duplicated debugfs_remove, as drm debugfs_root dir has been cleaned
up already by drm_minor_unregister.

BUG: kernel NULL pointer dereference, address: 00000000000000a0
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 11 PID: 1526 Comm: modprobe Tainted: G           OE     5.6.0-guchchen #1
Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
RIP: 0010:down_write+0x15/0x40
Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92
d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3
RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0
RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0
R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40
R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000
FS:  00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 simple_recursive_removal+0x63/0x370
 ? debugfs_remove+0x60/0x60
 debugfs_remove+0x40/0x60
 amdgpu_ras_fini+0x82/0x230 [amdgpu]
 ? __kernfs_remove.part.17+0x101/0x1f0
 ? kernfs_name_hash+0x12/0x80
 amdgpu_device_fini+0x1c0/0x580 [amdgpu]
 amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
 amdgpu_pci_remove+0x36/0x60 [amdgpu]
 pci_device_remove+0x3b/0xb0
 device_release_driver_internal+0xe5/0x1c0
 driver_detach+0x46/0x90
 bus_remove_driver+0x58/0xd0
 pci_unregister_driver+0x29/0x90
 amdgpu_exit+0x11/0x25 [amdgpu]
 __x64_sys_delete_module+0x13d/0x210
 do_syscall_64+0x5f/0x250
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

1a68d96f

15 8月, 2020 6 次提交

drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS · ae2bf61f

由 Guchun Chen 提交于 8月 13, 2020

It can avoid potential build warn/error when
CONFIG_DEBUG_FS is not set.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

ae2bf61f

drm/amdgpu: fix NULL pointer access issue when unloading driver · 2e2f5dd5

由 Guchun Chen 提交于 8月 13, 2020

When unloading driver by "modprobe -r amdgpu", one NULL pointer
dereference bug occurs in ras debugfs releasing. The cause is the
duplicated debugfs_remove, as drm debugfs_root dir has been cleaned
up already by drm_minor_unregister.

BUG: kernel NULL pointer dereference, address: 00000000000000a0
PGD 0 P4D 0
Oops: 0002 [#1] SMP PTI
CPU: 11 PID: 1526 Comm: modprobe Tainted: G           OE     5.6.0-guchchen #1
Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
RIP: 0010:down_write+0x15/0x40
Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92
d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3
RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0
RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0
R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40
R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000
FS:  00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 simple_recursive_removal+0x63/0x370
 ? debugfs_remove+0x60/0x60
 debugfs_remove+0x40/0x60
 amdgpu_ras_fini+0x82/0x230 [amdgpu]
 ? __kernfs_remove.part.17+0x101/0x1f0
 ? kernfs_name_hash+0x12/0x80
 amdgpu_device_fini+0x1c0/0x580 [amdgpu]
 amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
 amdgpu_pci_remove+0x36/0x60 [amdgpu]
 pci_device_remove+0x3b/0xb0
 device_release_driver_internal+0xe5/0x1c0
 driver_detach+0x46/0x90
 bus_remove_driver+0x58/0xd0
 pci_unregister_driver+0x29/0x90
 amdgpu_exit+0x11/0x25 [amdgpu]
 __x64_sys_delete_module+0x13d/0x210
 do_syscall_64+0x5f/0x250
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2e2f5dd5

drm/amdgpu: revert "fix system hang issue during GPU reset" · f1403342

由 Christian König 提交于 8月 12, 2020

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1a.
Signed-off-by: NChristian König <christian.koenig@amd.com>
Acked-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f1403342

drm/amdgpu: pass NULL pointer instead of 0 · 2f530724

由 Nirmoy Das 提交于 8月 11, 2020

Fixes: c030f2e4 ("drm/amdgpu: add amdgpu_ras.c to support ras (v2)")
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2f530724

drm/amdgpu: add debugfs node to toggle ras error cnt harvest · 66459e1d

由 Guchun Chen 提交于 8月 04, 2020

Before ras recovery is issued, user could operate this debugfs
node to enable/disable the harvest of all RAS IPs' ras error
count registers, which will help keep hardware's registers'
status instead of cleaning up them.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

66459e1d

drm/amdgpu: bypass querying ras error count registers · f75e94d8

由 Guchun Chen 提交于 8月 04, 2020

Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f75e94d8

07 8月, 2020 1 次提交

drm/amdgpu: add printing after executing page reservation to eeprom · ee10e06e

由 Guchun Chen 提交于 7月 20, 2020

This will tell users if the faulty page has been written to
external eeprom device in dmesg log.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

ee10e06e

05 8月, 2020 9 次提交

drm/amdgpu: enable RAS support for sienna cichlid · 0ad7a64d

由 John Clements 提交于 8月 03, 2020

enabled GECC error injection and query support
Reviewed-by: NGuchun Chen <guchun.chen@amd.com>
Signed-off-by: NJohn Clements <john.clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0ad7a64d

drm/amdgpu: disable page reservation when amdgpu_bad_page_threshold = 0 · a219ecbb

由 Guchun Chen 提交于 7月 27, 2020

When amdgpu_bad_page_threshold = 0, bad page reservation stuffs
are skipped in either UMC ECC irq or page retirement calling of
sync flood isr.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a219ecbb

drm/amdgpu: decouple sysfs creating of bad page node · f848159b

由 Guchun Chen 提交于 7月 31, 2020

Bad page information should not be exposed by sysfs when
bad page retirement is disabled, so decouple it from ras
sysfs group creating, and add one guard before creating.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f848159b

drm/amdgpu: add one definition for RAS's sysfs/debugfs name(v2) · eb0c3cd4

由 Guchun Chen 提交于 7月 31, 2020

Add one definition for the RAS module's FS name. It's used
in both debugfs and sysfs cases.

v2: Use static variable instead of macro definition.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

eb0c3cd4

drm/amdgpu: restore ras flags when user resets eeprom(v2) · bf0b91b7

由 Guchun Chen 提交于 7月 22, 2020

RAS flags needs to be cleaned as well when user requires
one clean eeprom.

v2: RAS flags shall be restored after eeprom reset succeeds.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bf0b91b7

drm/amdgpu: break GPU recovery once it's in bad state(v4) · e8fbaf03

由 Guchun Chen 提交于 7月 23, 2020

When GPU executes recovery and retriving bad GPU tag
from external eerpom device, the recovery will be broken
and error message is printed as well for user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e8fbaf03

drm/amdgpu: skip bad page reservation once issuing from eeprom write · 35cd2cda

由 Guchun Chen 提交于 7月 23, 2020

Once the ras recovery is issued from eeprom write itself,
bad page reservation should be ignored, otherwise, recursive
calling of writting to eeprom would happen.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

35cd2cda

drm/amdgpu: break driver init process when it's bad GPU(v5) · b82e65a9

由 Guchun Chen 提交于 7月 23, 2020

When retrieving bad gpu tag from eeprom, GPU init should
fail as the GPU needs to be retired for further check.

v2: Fix spelling typo, correct the condition to detect
    bad gpu tag and refine error message.

v3: Refine function argument name.

v4: Fix missing check of returning value of i2c
    initialization error case.

v5: Use dev_err to print PCI information in dmesg instead
    of DRM_ERROR.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b82e65a9

drm/amdgpu: validate bad page threshold in ras(v3) · c84d4670

由 Guchun Chen 提交于 7月 22, 2020

Bad page threshold value should be valid in the range between
-1 and max records length of eeprom. It could determine when
saved bad pages exceed threshold value, and proceed corresponding
actions.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c84d4670

28 7月, 2020 1 次提交

drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a

由 Dennis Li 提交于 7月 08, 2020

when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: NChristian König <christian.koenig@amd.com>
Suggested-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: NLijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: NLuben Tukov <luben.tuikov@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

df9c8d1a

23 7月, 2020 1 次提交

drm/amdgpu: add printing after executing page reservation to eeprom · b1628425

由 Guchun Chen 提交于 7月 20, 2020

This will tell users if the faulty page has been written to
external eeprom device in dmesg log.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b1628425

16 7月, 2020 1 次提交

drm/amdgpu: RAS emergency restart logic refine · bb5c7235

由 Wenhui Sheng 提交于 7月 13, 2020

If we are in RAS triggered situation and
BACO isn't support, emergency restart is needed,
and this code is only needed for some specific
cases(vega20 with given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset,
so in amdgpu_device_gpu_recover, we need differentiate
which mode1 reset we are using, then decide if it's
a full reset and then decide if emergency restart is needed,
the logic will become much more complex.

After discussion with Hawking, move emergency restart logic
to an independent function.
Signed-off-by: NLikun Gao <Likun.Gao@amd.com>
Signed-off-by: NWenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bb5c7235

01 7月, 2020 1 次提交

drm/amdgpu: label internally used symbols as static · f3167919

由 Nirmoy Das 提交于 6月 18, 2020

Used sparse(make C=1) to find these loose ends.

v2:
removed unwanted extra line
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f3167919

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功