1. 18 Sep, 2020 1 commit
  2. 25 Aug, 2020 1 commit
  3. 19 Aug, 2020 1 commit
  4. 15 Aug, 2020 1 commit
  5. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      By Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume it is unsafe for other threads to access
      the GPU, which may cause the reset to fail. Therefore the new
      rw_semaphore adev->reset_sem is introduced to protect the GPU from
      being accessed by external threads during recovery (see the sketch
      after this entry).
      
      v2:
      1. Add an rwlock for some ioctls, debugfs and the file-close function.
      2. Switch to dqm->is_resetting and dqm_lock for protection in the kfd
      driver.
      3. Remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. Change back to adev->reset_sem to protect the kfd callback
      functions, because dqm_lock cannot cover all the code paths; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. Remove try_lock and introduce the atomic hive->in_reset to avoid
      re-entering GPU recovery.
      
      v4:
      1. Remove an unnecessary whitespace change in kfd_chardev.c.
      2. Remove commented-out code in amdgpu_device.c.
      3. Add a more detailed explanation to the commit message.
      4. Define a wrapper function amdgpu_in_reset.
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      df9c8d1a
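
      The locking pattern described in this commit message reduces to a
      classic reader/writer scheme. Below is a minimal C sketch of that
      pattern, using the names from the message (adev->reset_sem,
      adev->in_gpu_reset, amdgpu_in_reset); the simplified struct and the
      example functions are illustrative, not the actual amdgpu code.

          #include <linux/rwsem.h>
          #include <linux/atomic.h>

          /* Simplified stand-in for the real struct amdgpu_device. */
          struct amdgpu_device {
                  struct rw_semaphore reset_sem; /* write-held during reset */
                  atomic_t in_gpu_reset;         /* blocks re-entered recovery */
          };

          /* Wrapper function named in v4 of the message. */
          static bool amdgpu_in_reset(struct amdgpu_device *adev)
          {
                  return atomic_read(&adev->in_gpu_reset) != 0;
          }

          /* External threads (ioctls, debugfs, file close) hold the
           * semaphore for read while touching the hardware. */
          static void example_gpu_access(struct amdgpu_device *adev)
          {
                  down_read(&adev->reset_sem);
                  /* ... access GPU registers and queues safely here ... */
                  up_read(&adev->reset_sem);
          }

          /* The recovery path claims the hang atomically (no re-entry for
           * the same hang), then excludes all readers during the reset. */
          static int example_gpu_recover(struct amdgpu_device *adev)
          {
                  if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                          return 0; /* recovery already in progress */

                  down_write(&adev->reset_sem);
                  /* ... perform GPU reset and resume ... */
                  up_write(&adev->reset_sem);

                  atomic_set(&adev->in_gpu_reset, 0);
                  return 0;
          }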
  6. 01 Jul, 2020 4 commits
    • drm/amdkfd: Fix circular locking dependency warning · d69fd951
      By Mukul Joshi
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                       __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                      destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      with dqm_lock held while reading SDMA stats: get_user() can cause a
      page fault, which leads to the circular locking scenario above (a
      sketch of the fixed ordering follows this entry).
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      d69fd951
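
      The shape of the fix is an ordering change: do the user-space read
      (which may fault and take mmap_sem) before taking the dqm lock. A
      hedged sketch, with simplified signatures borrowed from the trace
      above rather than the exact upstream code:

          /* Illustrative only; the real code lives in destroy_queue_cpsch()
           * and friends. The key point is that a user-space read may
           * page-fault, so it must not run under dqm->lock_hidden. */
          static int destroy_queue_sketch(struct device_queue_manager *dqm,
                                          struct queue *q)
          {
                  uint64_t sdma_val = 0;
                  int ret = 0;

                  /* Read the SDMA activity counter from user space BEFORE
                   * taking the dqm lock; a page fault here cannot complete
                   * the cycle mmap_sem -> notifier_lock -> dqm lock. */
                  if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
                          ret = read_sdma_queue_counter(q->properties.read_ptr,
                                                        &sdma_val);

                  dqm_lock(dqm);
                  /* ... destroy the queue, fold sdma_val into the stats ... */
                  dqm_unlock(dqm);
                  return ret;
          }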
    • drm/amd: fix potential memleak in err branch · dc2f832e
      By Bernard Zhao
      The function kobject_init_and_add() allocates memory along the path
      kobject_init_and_add -> kobject_add_varg -> kobject_set_name_vargs
      -> kvasprintf_const -> kstrdup_const -> kstrdup -> kmalloc_track_caller
      -> kmalloc_slab; in the error branch this memory is not freed, and
      kmemleak may catch this path. These changes add kobject_put() in the
      kobject_init_and_add() failure branch to fix the potential memory
      leak (see the sketch after this entry).
      Signed-off-by: Bernard Zhao <bernard@vivo.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      dc2f832e
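
      The fix follows the standard kobject pattern: kobject_init_and_add()
      takes a reference and duplicates the name even when it fails, so the
      error branch must release it with kobject_put() rather than a plain
      kfree(). A minimal sketch (the wrapper and its arguments are
      illustrative):

          #include <linux/kobject.h>

          static int example_create_kobj(struct kobject *kobj,
                                         struct kobj_type *ktype,
                                         struct kobject *parent,
                                         const char *name)
          {
                  int ret = kobject_init_and_add(kobj, ktype, parent,
                                                 "%s", name);

                  if (ret) {
                          /* kobject_init_and_add() holds a reference and
                           * has duplicated 'name' even on failure;
                           * kobject_put() releases both through the
                           * ktype's release callback, fixing the leak
                           * that kmemleak reports. */
                          kobject_put(kobj);
                  }
                  return ret;
          }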
    • drm/amdkfd: label internally used symbols as static · 204d8998
      By Nirmoy Das
      Used sparse (make C=1) to find these loose ends (a small example
      follows this entry).
      
      v2:
      removed an unwanted extra line
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      204d8998
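
      For reference, this is the class of warning sparse emits and the
      one-word fix; the symbol name below is made up:

          /* make C=1 runs sparse, which reports for each file-local symbol:
           *
           *   warning: symbol 'kfd_example_helper' was not declared.
           *   Should it be static?
           *
           * The fix is internal linkage: */
          static int kfd_example_helper(void) /* was: int kfd_example_helper(void) */
          {
                  return 0;
          }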
    • drm/amdkfd: fix ref count leak when pm_runtime_get_sync fails · 1c1ada37
      By Alex Deucher
      The call to pm_runtime_get_sync() increments the usage counter even
      in case of failure, leading to an incorrect ref count. In case of
      failure, decrement the ref count before returning (see the sketch
      after this entry).
      Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      1c1ada37
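
      The canonical shape of this class of fix, sketched below; which put
      variant (noidle vs. autosuspend) the actual patch uses is not shown
      here, so treat that choice as illustrative:

          #include <linux/pm_runtime.h>

          static int example_resume_and_use(struct device *dev)
          {
                  int ret = pm_runtime_get_sync(dev);

                  if (ret < 0) {
                          /* The usage count was incremented even though
                           * resume failed; drop it so the count stays
                           * balanced. */
                          pm_runtime_put_noidle(dev);
                          return ret;
                  }

                  /* ... use the device ... */

                  pm_runtime_mark_last_busy(dev);
                  pm_runtime_put_autosuspend(dev);
                  return 0;
          }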
  7. 25 Jun, 2020 1 commit
  8. 30 May, 2020 1 commit
  9. 29 May, 2020 1 commit
  10. 01 May, 2020 2 commits
  11. 29 Apr, 2020 1 commit
  12. 11 Mar, 2020 1 commit
  13. 07 Mar, 2020 1 commit
  14. 13 Feb, 2020 1 commit
    • drm/amdkfd: refactor runtime pm for baco · 9593f4d6
      By Rajneesh Bhardwaj
      So far the kfd driver implemented the same routines for runtime and
      system-wide suspend and resume (s2idle or mem). During system-wide
      suspend the kfd acquires an atomic lock that prevents any more user
      processes from creating queues and interacting with the kfd driver
      and the AMD GPU. This mechanism created a problem when the amdgpu
      device is runtime-suspended with BACO enabled: any application that
      relies on the kfd driver fails to load, because the driver reports a
      locked kfd device while the GPU is runtime-suspended.
      
      However, in the ideal case, when the GPU is runtime-suspended the kfd
      driver should be able to:
      
       - auto-resume the amdgpu driver whenever a client requests compute services
       - prevent runtime suspend of amdgpu while kfd is in use
      
      This change refactors the amdgpu and amdkfd drivers to support BACO
      and runtime power management (see the sketch after this entry).
      Reviewed-by: Oak Zeng <oak.zeng@amd.com>
      Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      9593f4d6
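
      A hedged sketch of the two behaviors listed above, expressed with the
      standard runtime-PM calls; the hook names are illustrative, and the
      real refactor spans both amdgpu and amdkfd:

          #include <linux/pm_runtime.h>

          /* A compute client binds to kfd: resume the GPU if it is
           * runtime-suspended (e.g. in BACO) and pin it active while the
           * client exists. */
          static int example_kfd_client_bind(struct device *drm_dev)
          {
                  int ret = pm_runtime_get_sync(drm_dev);

                  if (ret < 0) {
                          pm_runtime_put_noidle(drm_dev);
                          return ret;
                  }
                  return 0;
          }

          /* The last client went away: release the reference so amdgpu
           * may enter BACO again through runtime suspend. */
          static void example_kfd_client_unbind(struct device *drm_dev)
          {
                  pm_runtime_mark_last_busy(drm_dev);
                  pm_runtime_put_autosuspend(drm_dev);
          }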
  15. 07 Feb, 2020 1 commit
  16. 17 Jan, 2020 1 commit
  17. 10 Jan, 2020 1 commit
  18. 14 Nov, 2019 1 commit
  19. 03 Oct, 2019 2 commits
  20. 22 Aug, 2019 1 commit
  21. 20 Aug, 2019 2 commits
  22. 15 Aug, 2019 1 commit
  23. 22 Jun, 2019 1 commit
  24. 21 Jun, 2019 1 commit
  25. 19 Feb, 2019 1 commit
  26. 06 Nov, 2018 1 commit
  27. 30 Aug, 2018 2 commits
  28. 12 Jul, 2018 1 commit
  29. 11 Apr, 2018 2 commits
  30. 24 Mar, 2018 2 commits
  31. 16 Mar, 2018 1 commit