1. 23 Sep, 2020 1 commit
  2. 18 Sep, 2020 3 commits
  3. 16 Sep, 2020 3 commits
    • drm/amdkfd: Fix -Wunused-const-variable warning · 2b3bbf23
      Committed by YueHaibing
      If KFD_SUPPORT_IOMMU_V2 is not set, gcc warns:
      
      drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device.c:121:37: warning: ‘raven_device_info’ defined but not used [-Wunused-const-variable=]
       static const struct kfd_device_info raven_device_info = {
                                           ^~~~~~~~~~~~~~~~~
      
      As Huang Rui suggested, Raven already has the fallback path,
      so it should not be guarded by the IOMMU v2 flag.
      Suggested-by: Huang Rui <ray.huang@amd.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Acked-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
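A minimal user-space sketch of how this kind of warning can arise and go away. It is not the kernel code: `struct device_info`, `raven_info`, `carrizo_info`, `lookup_device` and the `SUPPORT_IOMMU_V2` macro are hypothetical stand-ins. The idea is that a `static const` object defined unconditionally must also be referenced unconditionally, otherwise gcc reports it as unused whenever the guard macro is not defined.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kfd_device_info. */
struct device_info {
    const char *name;
    int needs_iommu_v2;
};

/* Defined unconditionally, like raven_device_info in the warning above. */
static const struct device_info raven_info = { "raven", 0 };

#ifdef SUPPORT_IOMMU_V2 /* hypothetical stand-in for KFD_SUPPORT_IOMMU_V2 */
static const struct device_info carrizo_info = { "carrizo", 1 };
#endif

/* Returns the descriptor for a device id, or NULL if it is unsupported. */
const struct device_info *lookup_device(int id)
{
    switch (id) {
#ifdef SUPPORT_IOMMU_V2
    case 1:
        return &carrizo_info;
#endif
    case 2:
        /* Outside the guard: raven_info is referenced in every build. */
        return &raven_info;
    default:
        return NULL;
    }
}
```

If the `case 2` branch were moved inside the `#ifdef`, building without `SUPPORT_IOMMU_V2` would leave `raven_info` defined but unreferenced, which is the situation the warning describes.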
    • drm/amdkfd: fix a memory leak issue · edb084f4
      Committed by Dennis Li
      In the resume stage of GPU recovery, start_cpsch calls pm_init,
      which sets pm->allocated to false, so the next pm_release_ib has
      no chance to release the IB memory.
      
      Add pm_release_ib to stop_cpsch, which is called in the suspend
      stage of GPU recovery.
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
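The leak described above can be modeled in user space as follows. This is a sketch under assumptions, not the driver code: the struct layout, the `live_buffers` counter and `buffers_leaked_by_recovery` are hypothetical, and `malloc`/`free` stand in for the IB allocation. It shows why re-running an init routine that only clears an "allocated" flag orphans a buffer unless the buffer is released first.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical model of a packet manager that owns one IB buffer. */
struct packet_manager {
    bool allocated;
    void *ib;
};

static int live_buffers; /* number of buffers allocated but not yet freed */

/* Resets the flag only; it does not look at any buffer still held. */
static void pm_init(struct packet_manager *pm)
{
    pm->allocated = false;
}

static int pm_allocate_ib(struct packet_manager *pm)
{
    if (pm->allocated)
        return -1;
    pm->ib = malloc(64);
    if (!pm->ib)
        return -1;
    pm->allocated = true;
    live_buffers++;
    return 0;
}

/* Frees the buffer only when the flag says one is held. */
static void pm_release_ib(struct packet_manager *pm)
{
    if (!pm->allocated)
        return;
    free(pm->ib);
    pm->ib = NULL;
    pm->allocated = false;
    live_buffers--;
}

/* Runs one suspend/resume cycle and reports how many buffers were orphaned. */
int buffers_leaked_by_recovery(bool release_on_stop)
{
    struct packet_manager pm = { false, NULL };
    int leaked;

    live_buffers = 0;
    pm_init(&pm);
    if (pm_allocate_ib(&pm) != 0)
        return -1;

    if (release_on_stop)
        pm_release_ib(&pm); /* suspend stage: release before the re-init */
    pm_init(&pm);           /* resume stage: the flag is cleared again */
    pm_release_ib(&pm);     /* later release: a no-op once the flag is false */

    leaked = live_buffers;
    if (leaked)
        free(pm.ib);        /* tidy up the demonstration's own leak */
    return leaked;
}
```

With `release_on_stop` false the model reports one orphaned buffer; with it true the count is zero.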
    • drm/kfd: fix a system crash issue during GPU recovery · a9a83a92
      Committed by Dennis Li
      The crash log is as follows:
      
      [Thu Aug 20 23:18:14 2020] general protection fault: 0000 [#1] SMP NOPTI
      [Thu Aug 20 23:18:14 2020] CPU: 152 PID: 1837 Comm: kworker/152:1 Tainted: G           OE     5.4.0-42-generic #46~18.04.1-Ubuntu
      [Thu Aug 20 23:18:14 2020] Hardware name: GIGABYTE G482-Z53-YF/MZ52-G40-00, BIOS R12 05/13/2020
      [Thu Aug 20 23:18:14 2020] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [Thu Aug 20 23:18:14 2020] RIP: 0010:evict_process_queues_cpsch+0xc9/0x130 [amdgpu]
      [Thu Aug 20 23:18:14 2020] Code: 49 8d 4d 10 48 39 c8 75 21 eb 44 83 fa 03 74 36 80 78 72 00 74 0c 83 ab 68 01 00 00 01 41 c6 45 41 00 48 8b 00 48 39 c8 74 25 <80> 78 70 00 c6 40 6d 01 74 ee 8b 50 28 c6 40 70 00 83 ab 60 01 00
      [Thu Aug 20 23:18:14 2020] RSP: 0018:ffffb29b52f6fc90 EFLAGS: 00010213
      [Thu Aug 20 23:18:14 2020] RAX: 1c884edb0a118914 RBX: ffff8a0d45ff3c00 RCX: ffff8a2d83e41038
      [Thu Aug 20 23:18:14 2020] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8a0e2e4178c0
      [Thu Aug 20 23:18:14 2020] RBP: ffffb29b52f6fcb0 R08: 0000000000001b64 R09: 0000000000000004
      [Thu Aug 20 23:18:14 2020] R10: ffffb29b52f6fb78 R11: 0000000000000001 R12: ffff8a0d45ff3d28
      [Thu Aug 20 23:18:14 2020] R13: ffff8a2d83e41028 R14: 0000000000000000 R15: 0000000000000000
      [Thu Aug 20 23:18:14 2020] FS:  0000000000000000(0000) GS:ffff8a0e2e400000(0000) knlGS:0000000000000000
      [Thu Aug 20 23:18:14 2020] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [Thu Aug 20 23:18:14 2020] CR2: 000055c783c0e6a8 CR3: 00000034a1284000 CR4: 0000000000340ee0
      [Thu Aug 20 23:18:14 2020] Call Trace:
      [Thu Aug 20 23:18:14 2020]  kfd_process_evict_queues+0x43/0xd0 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  kfd_suspend_all_processes+0x60/0xf0 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  kgd2kfd_suspend.part.7+0x43/0x50 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  kgd2kfd_pre_reset+0x46/0x60 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  amdgpu_amdkfd_pre_reset+0x1a/0x20 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  amdgpu_device_gpu_recover+0x377/0xf90 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
      [Thu Aug 20 23:18:14 2020]  process_one_work+0x20f/0x400
      [Thu Aug 20 23:18:14 2020]  worker_thread+0x34/0x410
      
      When the GPU hangs, a user process fails to create a compute queue,
      and the queue's struct object is freed later, but the driver wrongly
      adds this queue to the queue list of the process. kfd_process_evict_queues
      then accesses the freed memory, which causes a system crash.
      
      v2:
      A failure of execute_queues should probably not be reported to
      the caller of create_queue, because the queue has already been created.
      Therefore, ignore the return value from execute_queues.
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
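The invariant behind the v2 note can be sketched in user space like this. The types and function bodies are hypothetical simplifications, not the driver's: once a queue has been linked into the process list, the create function no longer returns an error, so a caller that frees the queue on failure can never free one that is still linked.

```c
#include <stddef.h>

/* Hypothetical, simplified queue and process bookkeeping. */
struct queue {
    struct queue *next;
    int active;
};

struct process {
    struct queue *queues;
    int queue_count;
};

/* Stand-in for the scheduling step that fails while the GPU is hung. */
static int execute_queues(int gpu_hung)
{
    return gpu_hung ? -1 : 0;
}

/*
 * After the queue is linked, no error is returned. The caller frees the
 * queue only on error, so a linked queue is never freed behind the list.
 */
int create_queue(struct process *p, struct queue *q, int gpu_hung)
{
    q->next = p->queues;
    p->queues = q;
    p->queue_count++;
    q->active = 1;

    (void)execute_queues(gpu_hung); /* result deliberately ignored */
    return 0;
}

/* Walks the list the way an eviction pass would; every node must be live. */
int evict_queues(struct process *p)
{
    int evicted = 0;

    for (struct queue *q = p->queues; q != NULL; q = q->next) {
        if (q->active) {
            q->active = 0;
            evicted++;
        }
    }
    return evicted;
}
```

An alternative design would unlink the queue before returning an error; the sketch follows the approach named in the v2 note instead.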
  4. 01 Sep, 2020 1 commit
  5. 27 Aug, 2020 4 commits
  6. 25 Aug, 2020 1 commit
  7. 19 Aug, 2020 1 commit
  8. 15 Aug, 2020 1 commit
  9. 11 Aug, 2020 3 commits
  10. 05 Aug, 2020 1 commit
  11. 28 Jul, 2020 4 commits
  12. 16 Jul, 2020 5 commits
  13. 08 Jul, 2020 1 commit
  14. 03 Jul, 2020 2 commits
  15. 01 Jul, 2020 9 commits
    • drm/amdkfd: Fix circular locking dependency warning · d69fd951
      Committed by Mukul Joshi
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                       __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                      destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      while reading SDMA stats with dqm_lock held as get_user() could cause a
      page fault which leads to the circular locking scenario.
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
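The ordering rule described in this commit message can be illustrated with a user-space sketch. Everything here is an assumption made for illustration: a pthread mutex stands in for the dqm lock, `read_user_counter` stands in for `get_user()`, and the struct is hypothetical. The point is only the ordering, with the access that may fault completed before the lock is taken.

```c
#include <pthread.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical queue state; the mutex stands in for the dqm lock. */
struct sdma_queue {
    pthread_mutex_t lock;
    const uint64_t *user_counter; /* stands in for a user-space pointer */
    uint64_t past_activity;
};

/*
 * Stand-in for get_user(): in the kernel this access may page-fault and
 * take the mm lock, so it must not run while the queue lock is held.
 */
static int read_user_counter(const uint64_t *src, uint64_t *dst)
{
    if (src == NULL)
        return -1;
    memcpy(dst, src, sizeof(*dst));
    return 0;
}

int destroy_queue(struct sdma_queue *q)
{
    uint64_t counter = 0;

    /* Step 1: perform the access that may fault, with no lock held. */
    int err = read_user_counter(q->user_counter, &counter);

    /* Step 2: take the lock only to update the bookkeeping. */
    pthread_mutex_lock(&q->lock);
    if (err == 0)
        q->past_activity += counter;
    pthread_mutex_unlock(&q->lock);

    return err;
}
```

Because the lock is never held across the user-memory access, the lock order reported in the trace above (queue lock first, then the mm lock) cannot occur in this model.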
    • drm/amd: fix potential memleak in err branch · dc2f832e
      Committed by Bernard Zhao
      kobject_init_and_add() allocates memory along this call path:
      kobject_init_and_add->kobject_add_varg->kobject_set_name_vargs
      ->kvasprintf_const->kstrdup_const->kstrdup->kmalloc_track_caller
      ->kmalloc_slab. In the error branch this memory is not freed, which
      kmemleak may catch.
      Add kobject_put() to the branch where kobject_init_and_add() fails
      to fix the potential memory leak.
      Signed-off-by: Bernard Zhao <bernard@vivo.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: label internally used symbols as static · 204d8998
      Committed by Nirmoy Das
      Used sparse (make C=1) to find these loose ends.
      
      v2:
      removed unwanted extra line
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: fix ref count leak when pm_runtime_get_sync fails · 1c1ada37
      Committed by Alex Deucher
      The call to pm_runtime_get_sync increments the counter even in case of
      failure, leading to incorrect ref count.
      In case of failure, decrement the ref count before returning.
      Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
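A small user-space model of the counting rule described above. The names `rpm_get_sync`, `rpm_put` and `open_device` are hypothetical and are not the kernel's runtime PM API; they only mimic the documented behavior that the "get" call bumps the usage counter whether or not the resume succeeds.

```c
/* Hypothetical device with a runtime-PM style usage counter. */
struct dev {
    int usage_count;
    int resume_fails; /* non-zero makes the modeled resume fail */
};

/* Like the call described above: increments the counter even on failure. */
static int rpm_get_sync(struct dev *d)
{
    d->usage_count++;
    return d->resume_fails ? -1 : 0;
}

static void rpm_put(struct dev *d)
{
    d->usage_count--;
}

/* Balances the counter on the error path before returning the error. */
int open_device(struct dev *d)
{
    int ret = rpm_get_sync(d);

    if (ret < 0) {
        rpm_put(d);
        return ret;
    }
    return 0;
}
```

Without the `rpm_put` on the error path, each failed open would leave the counter one higher than the number of successful users.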
    • drm/amdkfd: Fix reference count leaks. · 20eca012
      Committed by Qiushi Wu
      kobject_init_and_add() takes a reference even when it fails.
      If this function returns an error, kobject_put() must be called to
      properly clean up the memory associated with the object.
      Signed-off-by: Qiushi Wu <wu000273@umn.edu>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
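The cleanup rule for a failed kobject_init_and_add() can be modeled in user space as below. This is a sketch with hypothetical names (`struct obj`, `obj_init_and_add`, `obj_put`, `register_obj`), not the kobject API itself. It mimics the behavior stated in the commit message: the init-and-add step takes a reference and allocates a name even when it fails, so the error path has to drop that reference.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reference-counted object that owns a heap-allocated name. */
struct obj {
    int refcount;
    char *name;
};

/* Drops one reference; the last put frees what the object owns. */
void obj_put(struct obj *o)
{
    if (--o->refcount == 0) {
        free(o->name);
        o->name = NULL;
    }
}

/* Takes a reference and allocates the name even if the add step fails. */
int obj_init_and_add(struct obj *o, const char *name, int fail_add)
{
    o->refcount = 1;
    o->name = malloc(strlen(name) + 1);
    if (o->name == NULL)
        return -1;
    strcpy(o->name, name);
    return fail_add ? -1 : 0;
}

/* The error path pairs the failed init with a put so the name is freed. */
int register_obj(struct obj *o, const char *name, int fail_add)
{
    int ret = obj_init_and_add(o, name, fail_add);

    if (ret)
        obj_put(o);
    return ret;
}
```

Returning the error without the `obj_put` would leave the name allocation behind.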
    • drm/amdkfd: Add eviction debug messages · b2057956
      Committed by Felix Kuehling
      Use WARN to print messages with backtrace when evictions are triggered.
      This can help determine the root cause of evictions and help spot driver
      bugs triggering evictions unintentionally, or help with performance tuning
      by avoiding conditions that cause evictions in a specific workload.
      
      The messages are controlled by a new module parameter that can be changed
      at runtime:
      
        echo Y > /sys/module/amdgpu/parameters/debug_evictions
        echo N > /sys/module/amdgpu/parameters/debug_evictions
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Philip Yang <Philip.Yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Use correct major in devcgroup check · 7159562a
      Committed by Lorenz Brun
      The existing code used the major version number of the DRM driver
      instead of the device major number of the DRM subsystem for
      validating access for a devices cgroup.
      
      This meant that accesses allowed by the devices cgroup weren't
      permitted and certain accesses denied by the devices cgroup were
      permitted (if they matched the wrong major device number).
      Signed-off-by: Lorenz Brun <lorenz@brun.one>
      Fixes: 6b855f7b ("drm/amdkfd: Check against device cgroup")
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
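To make the distinction between the two "major" numbers concrete, here is a user-space sketch. The struct and function are hypothetical; `makedev`, `major` and `minor` are the glibc macros from `<sys/sysmacros.h>`, and 226 is the character-device major used by DRM device nodes on Linux. A devices-cgroup rule is matched against the device number of the node, which has nothing to do with the driver's own version number.

```c
#include <sys/types.h>
#include <sys/sysmacros.h>

#define DRM_CHAR_MAJOR 226 /* character-device major of DRM nodes on Linux */

/* Hypothetical view of a DRM render node as seen by the driver. */
struct drm_node {
    unsigned int driver_version_major; /* version of the driver, not a device number */
    dev_t devt;                        /* device number of the /dev/dri node */
};

/* The value a devices-cgroup rule should be compared against. */
unsigned int cgroup_check_major(const struct drm_node *n)
{
    return major(n->devt);
}
```

For a node created as `makedev(DRM_CHAR_MAJOR, 128)`, the function returns 226 regardless of what `driver_version_major` holds.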
    • drm/amdkfd: sienna_cichlid virtual function support · adab4dad
      Committed by shaoyunl
      Add amdkfd support for the sienna_cichlid virtual function.
      Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
      Reviewed-by: Yong Zhao <Yong.Zhao@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Support debugger in Navi1x trap handler · 3cefc718
      Committed by Jay Cornwall
      - Preserve scalar GPRs ttmp[4:11] and ttmp13
      - Add single step exception during context save workaround
      - Remove incorrect PC adjustment during context save
      Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
      Reviewed-by: Yong Zhao <Yong.Zhao@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>