1. 15 Dec, 2021 1 commit
  2. 31 Aug, 2021 1 commit
  3. 13 Jul, 2021 1 commit
  4. 09 Jul, 2021 1 commit
  5. 14 Jan, 2021 1 commit
  6. 16 Dec, 2020 1 commit
    • drm/amdgpu/SRIOV: Extend VF reset request wait period · 3aa883ac
      Committed by Jiange Zhao
      Under virtualization, when one VF sends too many FLR requests,
      the hypervisor stops responding to that VF's requests for a
      long period of time. This is called event guard. During this
      cooling-off period the guest driver should wait instead of
      doing other things. Afterwards the guest driver resumes the
      reset process and returns to normal.
      
      Currently, the guest driver waits 12 seconds and returns failure
      if it gets no response from the host.
      
      Solution: extend the waiting time in the guest driver and poll
      for the response periodically. A poll happens every 6 seconds,
      for up to 60 seconds in total.
      
      v2: change the maximum repetition count from a number to a macro.
      Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
      Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
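The extended wait described in the commit above can be sketched as a bounded retry loop. This is a minimal user-space illustration, not the driver code: the macro names, `wait_for_host`, `poll_fn` and the `fake_host` stub are all hypothetical, and only the 6-second window and 60-second total come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values from the commit message; the macro names are hypothetical. */
#define POLL_WINDOW_MS   6000                    /* one poll window: 6 s  */
#define MAX_WAIT_MS      60000                   /* total budget: 60 s    */
#define MAX_POLL_ROUNDS  (MAX_WAIT_MS / POLL_WINDOW_MS)

/* Hypothetical callback: waits up to timeout_ms for a host response and
 * returns true if the host answered within that window. */
typedef bool (*poll_fn)(void *ctx, int timeout_ms);

/* Polls in fixed windows until the host answers or the budget is spent.
 * Returns the number of windows used on success, or -1 on timeout. */
static int wait_for_host(poll_fn poll, void *ctx)
{
    assert(poll != NULL);
    for (int round = 1; round <= MAX_POLL_ROUNDS; round++) {
        if (poll(ctx, POLL_WINDOW_MS))
            return round;
    }
    return -1;
}

/* Stub host for experiments: answers on the answer_on-th poll. */
struct fake_host {
    int calls;
    int answer_on;
};

static bool fake_poll(void *ctx, int timeout_ms)
{
    struct fake_host *h = ctx;
    (void)timeout_ms;            /* the stub does not actually sleep */
    return ++h->calls >= h->answer_on;
}
```

Expressing the retry limit as a macro derived from the two durations mirrors the v2 note: changing either duration adjusts the number of rounds without touching the loop.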
  7. 16 Sep, 2020 1 commit
  8. 25 Aug, 2020 2 commits
  9. 19 Aug, 2020 1 commit
  10. 15 Aug, 2020 1 commit
  11. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Committed by Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume it is unsafe for other threads to access
      the GPU, which may cause the reset to fail. Therefore a new
      rw_semaphore, adev->reset_sem, is introduced to protect the GPU from
      being accessed by external threads during recovery.
      
      v2:
      1. add rwlock for some ioctls, debugfs and the file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in the
      kfd driver.
      3. remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. change back to using adev->reset_sem to protect kfd callback
      functions, because dqm_lock couldn't protect all code paths; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove commented-out code in amdgpu_device.c
      3. add a more detailed description to the commit message
      4. define a wrapper function amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tukov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  12. 24 Dec, 2019 1 commit
  13. 12 Dec, 2019 1 commit
    • drm/amd/powerplay: enable pp one vf mode for vega10 · c9ffa427
      Committed by Yintian Tao
      Originally, due to restrictions in the PSP and SMU, a VF had
      to send a message to the hypervisor driver to handle powerplay
      changes, which is complicated and redundant. The SMU and PSP
      can now let the VF handle powerplay changes directly.
      Therefore the old VF-PF handshake code for powerplay is
      removed, and the VF uses the new registers below to
      handshake with the SMU:
      mmMP1_SMN_C2PMSG_101: register to handle SMU message
      mmMP1_SMN_C2PMSG_102: register to handle SMU parameter
      mmMP1_SMN_C2PMSG_103: register to handle SMU response
      
      v2: remove module parameter pp_one_vf
      v3: fix the parens
      v4: forbid the VF from changing SMU features
      v5: use hwmon_attributes_visible to skip specified hwmon attributes
      v6: change skip condition at vega10_copy_table_to_smc
      Signed-off-by: Yintian Tao <yttao@amd.com>
      Acked-by: Evan Quan <evan.quan@amd.com>
      Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
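A message/parameter/response register triple like the one listed in the commit above is typically driven by clearing the response, writing the parameter, writing the message, and then polling the response. The sketch below simulates that sequence against an in-memory register file. The ordering, the response value 1, the `fake_smu` type and the firmware stub are assumptions for illustration; the commit message names the three registers but does not spell out the protocol.

```c
#include <assert.h>
#include <stdint.h>

/* Indices into a simulated register file. They stand in for the three
 * registers named in the commit (message, parameter, response). */
enum { REG_MSG, REG_PARAM, REG_RESP, REG_COUNT };

struct fake_smu {
    uint32_t reg[REG_COUNT];
};

/* Hypothetical firmware stub: when a message is pending, consume it and
 * report success by writing 1 to the response register. */
static void fake_firmware_step(struct fake_smu *s)
{
    if (s->reg[REG_MSG] != 0) {
        s->reg[REG_MSG] = 0;
        s->reg[REG_RESP] = 1;
    }
}

/* Sends one message with a parameter and polls for the response.
 * Returns the response value, or -1 if nothing arrived in max_polls. */
static int smu_send(struct fake_smu *s, uint32_t msg, uint32_t param,
                    int max_polls)
{
    assert(msg != 0);             /* 0 means "no message" in this model */
    s->reg[REG_RESP] = 0;         /* clear any stale response first     */
    s->reg[REG_PARAM] = param;    /* parameter must be visible before   */
    s->reg[REG_MSG] = msg;        /* the message write triggers work    */
    for (int i = 0; i < max_polls; i++) {
        fake_firmware_step(s);    /* real hardware would run on its own */
        if (s->reg[REG_RESP] != 0)
            return (int)s->reg[REG_RESP];
    }
    return -1;
}
```

Clearing the response register before each send keeps a leftover value from an earlier message from being mistaken for the answer to the new one.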
  14. 02 Aug, 2019 1 commit
  15. 12 Jun, 2019 1 commit
  16. 25 May, 2019 2 commits
  17. 06 May, 2019 1 commit
  18. 11 Apr, 2019 1 commit
  19. 14 Feb, 2019 1 commit
  20. 15 Jan, 2019 1 commit
  21. 03 Jan, 2019 1 commit
  22. 28 Aug, 2018 1 commit
  23. 16 May, 2018 1 commit
  24. 23 Mar, 2018 1 commit
  25. 15 Mar, 2018 2 commits
    • drm/amdgpu: Move IH clientid defs to separate file · 3760f76c
      Committed by Oak Zeng
      This is preparation for sharing client ID definitions
      between amdgpu and amdkfd
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: refactoring mailbox to fix TDR handshake bugs(v2) · 48527e52
      Committed by Monk Liu
      This patch refactors the mailbox implementation; all of the changes
      below are needed together to fix the mailbox handshake issues
      exposed by heavy TDR testing.
      
      1) Refactor all mailbox functions to use byte access for mb_control.
      The reason is to avoid touching unrelated bits when writing the trn/rcv
      part of mailbox_control; this avoids sending incorrect INTRs to the
      hypervisor side and fixes a couple of handshake bugs.
      
      2) Re-implement the trans_msg function: put an invalidate step
      before transmitting a message to make sure the ACK bit is in
      a clear state; otherwise there is a chance that ACK is already asserted
      before the message is transmitted, which leads to a fake ACK when polling.
      (The hypervisor side has some tricks to work around the ACK bit being
      corrupted by VF FLR, with the side effect that the guest-side ACK bit
      may be asserted wrongly.) Also clear the TRANS_MSG words after the
      message is transferred.
      
      3) Rework mailbox_flr_work: it takes the mutex lock first when
      invoked, to keep GPU recovery from participating too early while the
      hypervisor side is doing VF FLR. (The hypervisor sends FLR_NOTIFY to
      the guest before doing VF FLR and sends FLR_COMPLETE after VF FLR is
      done; FLR_NOTIFY triggers an interrupt in the guest, which leads to
      mailbox_flr_work being invoked.)
      
      This avoids the mailbox trans msg being cleared by the VF's own FLR.
      
      4) The mailbox_rcv_irq IRQ routine should only peek at the msg and
      schedule mailbox_flr_work instead of ACKing the hypervisor itself,
      because the FLR_NOTIFY msg sent from the hypervisor side doesn't need
      the VF's ACK. (The VF's ACK would lead the hypervisor to clear its
      trans_valid/msg, and this causes a handshake bug if trans_valid/msg is
      cleared not by a correct VF ACK but by a wrong one such as this
      "FLR_NOTIFY" ACK.)
      
      This fixes the handshake bug where the guest sometimes never receives
      the "READY_TO_ACCESS_GPU" msg from the hypervisor.
      
      5) Separate the polling time limits accordingly:
      POLL ACK costs no more than 500ms
      POLL MSG costs no more than 12000ms
      POLL FLR finish costs no more than 500ms
      
      6) We still need to put adev into in_gpu_reset mode after receiving
      FLR_NOTIFY from the host side; this prevents an innocent app from
      wrongly succeeding in opening the amdgpu dri device.
      
      FLR_NOTIFY is received when the hypervisor side detects an IDLE hang,
      which indicates the GPU is already dead in this VF.
      
      v2:
      use a MACRO as the offset of the mailbox_control register
      don't test for the NOTIFY_CMPL event in rcv_msg since it won't
      receive that message anymore
      Signed-off-by: Monk Liu <Monk.Liu@amd.com>
      Reviewed-by: Pixel Ding <Pixel.Ding@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
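Item 1 of the commit above turns on the difference between rewriting a whole 32-bit control register and updating only the byte that holds the bits of interest. The sketch below models the control register as four individually addressable bytes. The byte positions, bit masks and helper names are hypothetical, chosen only to show that a byte-wide update to the transmit half cannot disturb the receive half.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: byte 0 carries transmit (trn) bits, byte 1
 * carries receive (rcv) bits, bytes 2..3 belong to something else. */
#define TRN_BYTE       0
#define RCV_BYTE       1
#define TRN_MSG_VALID  0x01u
#define RCV_MSG_ACK    0x02u

struct fake_mailbox {
    uint8_t control[4];   /* byte-addressable view of one 32-bit register */
};

/* Sets or clears the "message valid" bit by touching the trn byte only. */
static void mb_set_valid(struct fake_mailbox *mb, int valid)
{
    uint8_t v = mb->control[TRN_BYTE];
    if (valid)
        v = (uint8_t)(v | TRN_MSG_VALID);
    else
        v = (uint8_t)(v & ~TRN_MSG_VALID);
    mb->control[TRN_BYTE] = v;
}

/* Acknowledges a received message by touching the rcv byte only. */
static void mb_send_ack(struct fake_mailbox *mb)
{
    mb->control[RCV_BYTE] = (uint8_t)(mb->control[RCV_BYTE] | RCV_MSG_ACK);
}
```

With a 32-bit read-modify-write, a transmit-side update would also write back whatever happened to be in the receive byte at read time, which is the kind of unrelated-bit side effect the commit says it wants to avoid.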
  26. 18 Dec, 2017 1 commit
  27. 16 Dec, 2017 2 commits
  28. 09 Dec, 2017 1 commit
  29. 07 Dec, 2017 3 commits
  30. 05 Dec, 2017 3 commits
  31. 20 Oct, 2017 1 commit
  32. 14 Jul, 2017 1 commit