1. 01 Feb, 2023 1 commit
  2. 21 Oct, 2021 1 commit
  3. 09 Apr, 2021 1 commit
  4. 01 Oct, 2020 3 commits
  5. 16 Sep, 2020 3 commits
  6. 27 Aug, 2020 3 commits
  7. 25 Aug, 2020 5 commits
  8. 19 Aug, 2020 1 commit
  9. 15 Aug, 2020 1 commit
  10. 07 Aug, 2020 1 commit
  11. 05 Aug, 2020 6 commits
  12. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Committed by Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume, it is unsafe for other threads to access
      the GPU, which may cause the GPU reset to fail. Therefore a new
      rw_semaphore, adev->reset_sem, is introduced, which protects the GPU
      from being accessed by external threads during recovery.
      
      v2:
      1. add rwlock for some ioctls, debugfs and the file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in the
      kfd driver.
      3. remove try_lock and change adev->in_gpu_reset to an atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. change back to using adev->reset_sem to protect kfd callback
      functions, because dqm_lock could not protect all code paths; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce the atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove commented-out code in amdgpu_device.c
      3. add a more detailed description to the commit message
      4. define a wrapper function amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tukov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  13. 16 Jul, 2020 1 commit
  14. 03 Jul, 2020 2 commits
  15. 01 Jul, 2020 3 commits
  16. 30 May, 2020 1 commit
  17. 23 May, 2020 1 commit
  18. 22 May, 2020 1 commit
  19. 18 May, 2020 1 commit
    • drm/amdgpu: Add autodump debugfs node for gpu reset v8 · 728e7e0c
      Committed by Jiange Zhao
      When the GPU times out, amdgpu notifies an interested party
      of an opportunity to dump info before the actual GPU reset.
      
      A usermode app opens the 'autodump' node under debugfs
      and poll()s it for readable/writable. When a GPU reset is due,
      amdgpu notifies the usermode app through a wait_queue_head and gives
      it 10 minutes to dump info.
      
      After the usermode app has done its work, it closes the 'autodump' node.
      On node closure, amdgpu learns that the dump is done through
      the completion that is triggered in release().
      
      There is no write or read callback because the necessary info can be
      obtained through dmesg and umr. Messages back and forth between
      the usermode app and amdgpu are unnecessary.
      
      v2: (1) changed 'registered' to 'app_listening'
          (2) add a mutex in open() to prevent race condition
      
      v3 (chk): grab the reset lock to avoid race in autodump_open,
                rename debugfs file to amdgpu_autodump,
                provide autodump_read as well,
                style and code cleanups
      
      v4: add 'bool app_listening' to differentiate situations, so that
          the node can be reopened; also, there is no need to wait for
          completion when no app is waiting for a dump.
      
      v5: change 'bool app_listening' to 'enum amdgpu_autodump_state'
          add 'app_state_mutex' for race conditions:
          (1) Only 1 user can open this file node
          (2) wait_dump() can only take effect after poll() executed.
          (3) eliminated the race condition between release() and
              wait_dump()
      
      v6: removed 'enum amdgpu_autodump_state' and 'app_state_mutex'
          removed state checking in amdgpu_debugfs_wait_dump
          Improve on top of version 3 so that the node can be reopened.
      
      v7: move reinit_completion into open() so that only one user
          can open it.
      
      v8: remove complete_all() from amdgpu_debugfs_wait_dump().
      Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
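The usermode side of the flow in commit 728e7e0c (open the node, poll(), dump, close) might look roughly like the sketch below. It is an illustration only: the function names, the `dump` callback, and the caller-supplied timeout are hypothetical, and the node path is left to the caller because the commit message gives only the file name `amdgpu_autodump`, not its full debugfs location.

```c
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Wait until fd reports readable.
 * Returns 1 when readable, 0 on timeout, -1 on error.
 */
static int wait_for_reset_notice(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ret = poll(&pfd, 1, timeout_ms);

    if (ret < 0)
        return -1;
    if (ret == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : -1;
}

/*
 * One autodump session: open the node, wait for the reset notice,
 * run the caller's dump routine, then close the node. Per the commit
 * message, closing is what tells amdgpu that the dump is finished.
 * node_path and dump are assumptions supplied by the caller.
 */
static int autodump_session(const char *node_path, int timeout_ms,
                            void (*dump)(void))
{
    int fd = open(node_path, O_RDONLY);
    int ready;

    if (fd < 0)
        return -1;
    ready = wait_for_reset_notice(fd, timeout_ms);
    if (ready == 1 && dump != NULL)
        dump();  /* e.g. collect dmesg or umr output */
    close(fd);
    return ready;
}
```

Since the commit message describes no message exchange over the node itself, the session needs nothing beyond open(), poll() and close(); the dump routine gathers its data from other sources.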
  20. 09 May, 2020 2 commits
  21. 02 May, 2020 1 commit