1. 15 Aug, 2020 4 commits
  2. 05 Aug, 2020 9 commits
  3. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Dennis Li committed
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume, it is unsafe for other threads to access
      the GPU, because doing so may cause the GPU reset to fail. Therefore a
      new rw_semaphore, adev->reset_sem, is introduced, which protects the GPU
      from being accessed by external threads during recovery.
      
      v2:
      1. add the rwlock to some ioctls, debugfs and the file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in the
      kfd driver.
      3. remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. change back to using adev->reset_sem to protect the kfd callback
      functions, because dqm_lock cannot protect all of the code; for example,
      free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce the atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove commented-out code in amdgpu_device.c
      3. add more detailed comments to the commit message
      4. define a wrapper function, amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  4. 23 Jul, 2020 1 commit
  5. 16 Jul, 2020 1 commit
  6. 01 Jul, 2020 1 commit
  7. 03 Jun, 2020 2 commits
  8. 29 May, 2020 1 commit
  9. 15 May, 2020 1 commit
  10. 06 May, 2020 1 commit
  11. 01 May, 2020 1 commit
  12. 23 Apr, 2020 2 commits
  13. 14 Apr, 2020 1 commit
  14. 09 Apr, 2020 1 commit
  15. 08 Apr, 2020 1 commit
  16. 02 Apr, 2020 2 commits
    • drm/amdgpu: fix non-pointer dereference for non-RAS supported · a9d82d2f
      Evan Quan committed
      Backtrace on gpu recover test on Navi10.
      
      [ 1324.516681] RIP: 0010:amdgpu_ras_set_error_query_ready+0x15/0x20 [amdgpu]
      [ 1324.523778] Code: 4c 89 f7 e8 cd a2 a0 d8 e9 99 fe ff ff 45 31 ff e9 91 fe ff ff 0f 1f 44 00 00 55 48 85 ff 48 89 e5 74 0e 48 8b 87 d8 2b 01 00 <40> 88 b0 38 01 00 00 5d c3 66 90 0f 1f 44 00 00 55 31 c0 48 85 ff
      [ 1324.543452] RSP: 0018:ffffaa1040e4bd28 EFLAGS: 00010286
      [ 1324.549025] RAX: 0000000000000000 RBX: ffff911198b20000 RCX: 0000000000000000
      [ 1324.556217] RDX: 00000000000c0a01 RSI: 0000000000000000 RDI: ffff911198b20000
      [ 1324.563514] RBP: ffffaa1040e4bd28 R08: 0000000000001000 R09: ffff91119d0028c0
      [ 1324.570804] R10: ffffffff9a606b40 R11: 0000000000000000 R12: 0000000000000000
      [ 1324.578413] R13: ffffaa1040e4bd70 R14: ffff911198b20000 R15: 0000000000000000
      [ 1324.586464] FS:  00007f4441cbf540(0000) GS:ffff91119ed80000(0000) knlGS:0000000000000000
      [ 1324.595434] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 1324.601345] CR2: 0000000000000138 CR3: 00000003fcdf8004 CR4: 00000000003606e0
      [ 1324.608694] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1324.616303] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 1324.623678] Call Trace:
      [ 1324.626270]  amdgpu_device_gpu_recover+0x6e7/0xc50 [amdgpu]
      [ 1324.632018]  ? seq_printf+0x4e/0x70
      [ 1324.636652]  amdgpu_debugfs_gpu_recover+0x50/0x80 [amdgpu]
      [ 1324.643371]  seq_read+0xda/0x420
      [ 1324.647601]  full_proxy_read+0x5c/0x90
      [ 1324.652426]  __vfs_read+0x1b/0x40
      [ 1324.656734]  vfs_read+0x8e/0x130
      [ 1324.660981]  ksys_read+0xa7/0xe0
      [ 1324.665201]  __x64_sys_read+0x1a/0x20
      [ 1324.669907]  do_syscall_64+0x57/0x1c0
      [ 1324.674517]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      [ 1324.680654] RIP: 0033:0x7f44417cf081
      Signed-off-by: Evan Quan <evan.quan@amd.com>
      Reviewed-by: John Clements <John.Clements@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: disable ras query and iject during gpu reset · 61380faa
      John Clements committed
      Add a flag to the RAS context to indicate whether RAS query functionality is ready.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: John Clements <john.clements@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
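The two commits in this group fit together: 61380faa adds a "query ready" flag to the RAS context, and a9d82d2f deals with a crash in `amdgpu_ras_set_error_query_ready` on hardware without RAS support. In that backtrace, CR2 is 0x138 and RAX is zero, which is consistent with a write through a NULL context pointer. A minimal sketch of the guarded pattern follows, assuming the context pointer is simply NULL when RAS is unsupported. All `ras_sketch_*` names and the error code are hypothetical and are not the driver's actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical RAS context holding the readiness flag. */
struct ras_sketch_ctx {
    bool error_query_ready;
    unsigned long error_count;
};

/* Hypothetical device; ras is NULL when RAS is not supported (assumption). */
struct ras_sketch_dev {
    struct ras_sketch_ctx *ras;
};

/* Set the flag only if a RAS context exists; a no-op otherwise. */
static void ras_sketch_set_query_ready(struct ras_sketch_dev *dev, bool ready)
{
    if (dev && dev->ras)
        dev->ras->error_query_ready = ready;
}

/* Report "not ready" when there is no RAS context at all. */
static bool ras_sketch_get_query_ready(const struct ras_sketch_dev *dev)
{
    if (dev && dev->ras)
        return dev->ras->error_query_ready;
    return false;
}

/* Query path: refuse (return -1) while the flag is clear, for example
 * while a GPU reset is in progress. Returns 0 and fills *count otherwise. */
static int ras_sketch_query(const struct ras_sketch_dev *dev,
                            unsigned long *count)
{
    assert(count != NULL);
    if (!ras_sketch_get_query_ready(dev))
        return -1;
    *count = dev->ras->error_count;
    return 0;
}
```

A reset path would clear the flag before resetting and set it again afterwards. On a device without a RAS context both calls become no-ops instead of dereferencing NULL.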
  17. 20 Mar, 2020 1 commit
  18. 13 Mar, 2020 2 commits
  19. 11 Mar, 2020 2 commits
  20. 07 Mar, 2020 1 commit
  21. 27 Feb, 2020 1 commit
  22. 19 Feb, 2020 1 commit
  23. 23 Jan, 2020 1 commit
  24. 17 Jan, 2020 1 commit