    drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
    Committed by Dennis Li
    When the GPU hangs, the driver has multiple paths into
    amdgpu_device_gpu_recover. The atomics adev->in_gpu_reset and
    hive->in_reset are used to avoid re-entering GPU recovery.
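
The re-entry guard described above can be sketched in userspace C11. This is only an illustration under stated assumptions, not the driver code: struct fake_adev, try_begin_recovery, end_recovery and in_reset are hypothetical names, and a compare-and-exchange on a plain atomic_int stands in for whatever atomic operation the driver actually uses.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device structure; only the flag matters. */
struct fake_adev {
    atomic_int in_gpu_reset;   /* 0 = idle, 1 = recovery in progress */
};

/* Returns true only for the single caller that wins the 0 -> 1 transition.
 * Every other path that reports the same hang sees false and backs off. */
static bool try_begin_recovery(struct fake_adev *adev)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&adev->in_gpu_reset, &expected, 1);
}

/* Clears the flag; it is a bug to call this when no recovery is running. */
static void end_recovery(struct fake_adev *adev)
{
    int was = atomic_exchange(&adev->in_gpu_reset, 0);
    assert(was == 1);
}

/* Read-only query, in the spirit of the wrapper function listed under v4. */
static bool in_reset(struct fake_adev *adev)
{
    return atomic_load(&adev->in_gpu_reset) != 0;
}
```

With this shape, a second caller that reaches try_begin_recovery while the first is still recovering gets false instead of starting a nested recovery.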
    
    During GPU reset and resume, it is unsafe for other threads to access
    the GPU, because this may cause the reset to fail. Therefore a new
    rw_semaphore, adev->reset_sem, is introduced to protect the GPU from
    being accessed by external threads during recovery.
    
    v2:
    1. Add the rwlock to some ioctls, debugfs and the file-close function.
    2. Change to use dqm->is_resetting and dqm_lock for protection in the kfd
    driver.
    3. Remove try_lock and change adev->in_gpu_reset to an atomic, to avoid
    re-entering GPU recovery for the same GPU hang.
    
    v3:
    1. Change back to using adev->reset_sem to protect the kfd callback
    functions, because dqm_lock cannot protect all of the code. For example,
    free_mqd must be called outside of dqm_lock:
    
    [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
    [ 1230.177221] Call Trace:
    [ 1230.178249]  dump_stack+0x98/0xd5
    [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
    [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
    [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
    [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
    [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
    [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
    [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
    [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
    [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
    [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
    [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
    [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
    [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
    [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
    [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
    [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
    [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
    [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
    [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
    [ 1230.202831]  ksys_ioctl+0x98/0xb0
    [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
    [ 1230.205174]  do_syscall_64+0x5f/0x250
    [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
    
    2. Remove try_lock and introduce the atomic hive->in_reset, to avoid
    re-entering GPU recovery.
    
    v4:
    1. Remove an unnecessary whitespace change in kfd_chardev.c.
    2. Remove commented-out code in amdgpu_device.c.
    3. Add a more detailed description to the commit message.
    4. Define a wrapper function, amdgpu_in_reset.
    
    v5:
    1. Fix some style issues.
    Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
    Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
    Suggested-by: Christian König <christian.koenig@amd.com>
    Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
    Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
    Signed-off-by: Dennis Li <Dennis.Li@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>