1. 01 Jul 2020, 3 commits
    • drm/amdkfd: Fix circular locking dependency warning · d69fd951
      Mukul Joshi committed
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at: destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      while reading SDMA stats with the dqm lock held: get_user() can take a
      page fault, which leads to the circular locking scenario above.
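      A hedged sketch of that shape (illustrative only; dqm_lock()/dqm_unlock()
      and put_user() are real kernel primitives, but the function and its
      usr_counter parameter are hypothetical stand-ins for the patched path):

        /* needs <linux/uaccess.h> for put_user() */
        /* Sample the counter under the dqm lock; touch user memory after. */
        static int destroy_queue_sketch(struct device_queue_manager *dqm,
                                        struct queue *q,
                                        uint64_t __user *usr_counter)
        {
                uint64_t sdma_val = 0;

                dqm_lock(dqm);
                /* ... unmap the queue; read its SDMA counter into sdma_val ... */
                dqm_unlock(dqm);

                /* put_user() may fault and take mmap_sem; safe here because
                 * no dqm lock is held any more. */
                return put_user(sdma_val, usr_counter);
        }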
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: label internally used symbols as static · 204d8998
      Nirmoy Das committed
      Used sparse (make C=1) to find these loose ends.
      
      v2:
      removed unwanted extra line
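      A minimal illustration of the class of fix (the symbol name is
      hypothetical; the warning text is sparse's standard message):

        /* Before: sparse (make C=1) reports
         *   warning: symbol 'kfd_foo' was not declared. Should it be static?
         * because kfd_foo has no prototype and is used only in this file. */
        static int kfd_foo(void)        /* labeling it static silences sparse */
        {
                return 0;
        }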
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Support Sienna_Cichlid KFD v4 · 3a2f0c81
      Yong Zhao committed
      v4: drop get_tile_config, comment out other callbacks
      Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  2. 29 May 2020, 1 commit
  3. 02 May 2020, 1 commit
  4. 29 Apr 2020, 1 commit
  5. 29 Feb 2020, 1 commit
  6. 27 Feb 2020, 5 commits
  7. 04 Feb 2020, 1 commit
  8. 17 Jan 2020, 1 commit
  9. 08 Jan 2020, 3 commits
  10. 23 Nov 2019, 1 commit
  11. 14 Nov 2019, 1 commit
  12. 26 Oct 2019, 1 commit
    • drm/amdkfd: don't use dqm lock during device reset/suspend/resume · 2c99a547
      Philip Yang committed
      If device reset/suspend/resume fails for some reason, the dqm lock is
      held forever, which causes a deadlock. Below is a kernel backtrace from
      an application opening kfd after suspend/resume failed.
      
      Instead of holding the dqm lock in pre_reset and releasing it in
      post_reset, add a dqm->sched_running flag which is modified in
      dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection
      because all writes and reads happen inside the dqm lock.

      For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the
      sched_running flag before sending the updated runlist.
      
      v2: For the no-HWS case, when the device is stopped, don't call
      load/destroy_mqd for queue eviction, restore and create, and avoid the
      debugfs HQD dump.
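      A hedged sketch of the gating described above (dqm->sched_running is the
      flag named in this commit; the function bodies are illustrative, not the
      literal patch):

        /* dqm->ops.start: the flag is only ever written under the dqm lock. */
        static int start_cpsch_sketch(struct device_queue_manager *dqm)
        {
                dqm_lock(dqm);
                dqm->sched_running = true;
                /* ... set up and send the initial runlist ... */
                dqm_unlock(dqm);
                return 0;
        }

        /* Called with the dqm lock held. */
        static int map_queues_cpsch_sketch(struct device_queue_manager *dqm)
        {
                if (!dqm->sched_running)
                        return 0; /* scheduler stopped (reset/suspend): skip HW */
                /* ... send the updated runlist to the HWS ... */
                return 0;
        }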
      
      Backtrace of dqm lock deadlock:
      
      [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds.
      [Thu Oct 17 16:43:37 2019]       Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1
      [Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [Thu Oct 17 16:43:37 2019] rocminfo        D    0  3024   2947 0x80000000
      [Thu Oct 17 16:43:37 2019] Call Trace:
      [Thu Oct 17 16:43:37 2019]  ? __schedule+0x3d9/0x8a0
      [Thu Oct 17 16:43:37 2019]  schedule+0x32/0x70
      [Thu Oct 17 16:43:37 2019]  schedule_preempt_disabled+0xa/0x10
      [Thu Oct 17 16:43:37 2019]  __mutex_lock.isra.9+0x1e3/0x4e0
      [Thu Oct 17 16:43:37 2019]  ? __call_srcu+0x264/0x3b0
      [Thu Oct 17 16:43:37 2019]  ? process_termination_cpsch+0x24/0x2f0 [amdgpu]
      [Thu Oct 17 16:43:37 2019]  process_termination_cpsch+0x24/0x2f0 [amdgpu]
      [Thu Oct 17 16:43:37 2019]  kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu]
      [Thu Oct 17 16:43:37 2019]  kfd_process_notifier_release+0x1be/0x220 [amdgpu]
      [Thu Oct 17 16:43:37 2019]  __mmu_notifier_release+0x3e/0xc0
      [Thu Oct 17 16:43:37 2019]  exit_mmap+0x160/0x1a0
      [Thu Oct 17 16:43:37 2019]  ? __handle_mm_fault+0xba3/0x1200
      [Thu Oct 17 16:43:37 2019]  ? exit_robust_list+0x5a/0x110
      [Thu Oct 17 16:43:37 2019]  mmput+0x4a/0x120
      [Thu Oct 17 16:43:37 2019]  do_exit+0x284/0xb20
      [Thu Oct 17 16:43:37 2019]  ? handle_mm_fault+0xfa/0x200
      [Thu Oct 17 16:43:37 2019]  do_group_exit+0x3a/0xa0
      [Thu Oct 17 16:43:37 2019]  __x64_sys_exit_group+0x14/0x20
      [Thu Oct 17 16:43:37 2019]  do_syscall_64+0x4f/0x100
      [Thu Oct 17 16:43:37 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Philip Yang <Philip.Yang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  13. 08 Oct 2019, 2 commits
  14. 03 Oct 2019, 5 commits
  15. 16 Sep 2019, 2 commits
  16. 23 Aug 2019, 1 commit
  17. 19 Jul 2019, 2 commits
  18. 12 Jul 2019, 1 commit
  19. 22 Jun 2019, 2 commits
  20. 18 Jun 2019, 4 commits
    • drm/amdkfd: Fix sdma queue allocate race condition · 38bb4226
      Oak Zeng committed
      SDMA queue allocation requires the dqm lock, as it modifies global dqm
      members. Enclose it in the dqm lock, as sketched below.
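      A hedged sketch of the resulting locking (illustrative, not the literal
      diff; allocate_sdma_queue() and dqm_lock() exist in this driver, but the
      wrapper function is hypothetical):

        static int create_sdma_queue_sketch(struct device_queue_manager *dqm,
                                            struct queue *q)
        {
                int retval;

                dqm_lock(dqm);
                /* allocate_sdma_queue() updates dqm-wide bitmaps/counters and
                 * must not race with other queue create/destroy paths. */
                retval = allocate_sdma_queue(dqm, q);
                dqm_unlock(dqm);
                return retval;
        }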
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Philip Yang <philip.yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Fix a circular lock dependency · 6a6ef5ee
      Oak Zeng committed
      The idea to break the circular lock dependency is to temporarily drop
      the dqm lock before calling allocate_mqd. See callstack #1 below.
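      The shape of the fix, as a hedged sketch (illustrative; error unwinding
      is elided and the wrapper is hypothetical, but the lock-free window
      around allocate_mqd() is the point of this commit):

        static int create_queue_sketch(struct device_queue_manager *dqm,
                                       struct queue *q,
                                       struct mqd_manager *mqd_mgr)
        {
                int retval;

                dqm_lock(dqm);
                retval = allocate_sdma_queue(dqm, q); /* needs shared dqm state */
                dqm_unlock(dqm);
                if (retval)
                        return retval;

                /* allocate_mqd() can allocate GTT memory, taking
                 * reservation_ww_class_mutex and entering fs_reclaim, so no
                 * dqm lock may be held here (see chains #1 and #2 below). */
                q->mqd_mem_obj = mqd_mgr->allocate_mqd(mqd_mgr->dev,
                                                       &q->properties);
                if (!q->mqd_mem_obj)
                        return -ENOMEM;

                dqm_lock(dqm);
                /* ... init_mqd() and queue-list bookkeeping, back under lock ... */
                dqm_unlock(dqm);
                return 0;
        }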
      
      [   59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0
      
      [  513.604034] ======================================================
      [  513.604205] WARNING: possible circular locking dependency detected
      [  513.604375] 4.18.0-kfd-root #2 Tainted: G        W
      [  513.604530] ------------------------------------------------------
      [  513.604699] kswapd0/611 is trying to acquire lock:
      [  513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.605150]
                     but task is already holding lock:
      [  513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
      [  513.605540]
                     which lock already depends on the new lock.
      
      [  513.605747]
                     the existing dependency chain (in reverse order) is:
      [  513.605944]
                     -> #4 (&anon_vma->rwsem){++++}:
      [  513.606106]        __vma_adjust+0x147/0x7f0
      [  513.606231]        __split_vma+0x179/0x190
      [  513.606353]        mprotect_fixup+0x217/0x260
      [  513.606553]        do_mprotect_pkey+0x211/0x380
      [  513.606752]        __x64_sys_mprotect+0x1b/0x20
      [  513.606954]        do_syscall_64+0x50/0x1a0
      [  513.607149]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  513.607380]
                     -> #3 (&mapping->i_mmap_rwsem){++++}:
      [  513.607678]        rmap_walk_file+0x1f0/0x280
      [  513.607887]        page_referenced+0xdd/0x180
      [  513.608081]        shrink_page_list+0x853/0xcb0
      [  513.608279]        shrink_inactive_list+0x33b/0x700
      [  513.608483]        shrink_node_memcg+0x37a/0x7f0
      [  513.608682]        shrink_node+0xd8/0x490
      [  513.608869]        balance_pgdat+0x18b/0x3b0
      [  513.609062]        kswapd+0x203/0x5c0
      [  513.609241]        kthread+0x100/0x140
      [  513.609420]        ret_from_fork+0x24/0x30
      [  513.609607]
                     -> #2 (fs_reclaim){+.+.}:
      [  513.609883]        kmem_cache_alloc_trace+0x34/0x2e0
      [  513.610093]        reservation_object_reserve_shared+0x139/0x300
      [  513.610326]        ttm_bo_init_reserved+0x291/0x480 [ttm]
      [  513.610567]        amdgpu_bo_do_create+0x1d2/0x650 [amdgpu]
      [  513.610811]        amdgpu_bo_create+0x40/0x1f0 [amdgpu]
      [  513.611041]        amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu]
      [  513.611290]        amdgpu_bo_create_kernel+0x12/0x70 [amdgpu]
      [  513.611584]        amdgpu_ttm_init+0x2cb/0x560 [amdgpu]
      [  513.611823]        gmc_v9_0_sw_init+0x400/0x750 [amdgpu]
      [  513.612491]        amdgpu_device_init+0x14eb/0x1990 [amdgpu]
      [  513.612730]        amdgpu_driver_load_kms+0x78/0x290 [amdgpu]
      [  513.612958]        drm_dev_register+0x111/0x1a0
      [  513.613171]        amdgpu_pci_probe+0x11c/0x1e0 [amdgpu]
      [  513.613389]        local_pci_probe+0x3f/0x90
      [  513.613581]        pci_device_probe+0x102/0x1c0
      [  513.613779]        driver_probe_device+0x2a7/0x480
      [  513.613984]        __driver_attach+0x10a/0x110
      [  513.614179]        bus_for_each_dev+0x67/0xc0
      [  513.614372]        bus_add_driver+0x1eb/0x260
      [  513.614565]        driver_register+0x5b/0xe0
      [  513.614756]        do_one_initcall+0xac/0x357
      [  513.614952]        do_init_module+0x5b/0x213
      [  513.615145]        load_module+0x2542/0x2d30
      [  513.615337]        __do_sys_finit_module+0xd2/0x100
      [  513.615541]        do_syscall_64+0x50/0x1a0
      [  513.615731]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  513.615963]
                     -> #1 (reservation_ww_class_mutex){+.+.}:
      [  513.616293]        amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu]
      [  513.616554]        init_mqd+0x223/0x260 [amdgpu]
      [  513.616779]        create_queue_nocpsch+0x4d9/0x600 [amdgpu]
      [  513.617031]        pqm_create_queue+0x37c/0x520 [amdgpu]
      [  513.617270]        kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu]
      [  513.617522]        kfd_ioctl+0x202/0x350 [amdgpu]
      [  513.617724]        do_vfs_ioctl+0x9f/0x6c0
      [  513.617914]        ksys_ioctl+0x66/0x70
      [  513.618095]        __x64_sys_ioctl+0x16/0x20
      [  513.618286]        do_syscall_64+0x50/0x1a0
      [  513.618476]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  513.618695]
                     -> #0 (&dqm->lock_hidden){+.+.}:
      [  513.618984]        __mutex_lock+0x98/0x970
      [  513.619197]        evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.619459]        kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
      [  513.619710]        kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
      [  513.620103]        amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
      [  513.620363]        amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
      [  513.620614]        __mmu_notifier_invalidate_range_start+0x70/0xb0
      [  513.620851]        try_to_unmap_one+0x7fc/0x8f0
      [  513.621049]        rmap_walk_anon+0x121/0x290
      [  513.621242]        try_to_unmap+0x93/0xf0
      [  513.621428]        shrink_page_list+0x606/0xcb0
      [  513.621625]        shrink_inactive_list+0x33b/0x700
      [  513.621835]        shrink_node_memcg+0x37a/0x7f0
      [  513.622034]        shrink_node+0xd8/0x490
      [  513.622219]        balance_pgdat+0x18b/0x3b0
      [  513.622410]        kswapd+0x203/0x5c0
      [  513.622589]        kthread+0x100/0x140
      [  513.622769]        ret_from_fork+0x24/0x30
      [  513.622957]
                     other info that might help us debug this:
      
      [  513.623354] Chain exists of:
                       &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem
      
      [  513.623900]  Possible unsafe locking scenario:
      
      [  513.624189]        CPU0                    CPU1
      [  513.624397]        ----                    ----
      [  513.624594]   lock(&anon_vma->rwsem);
      [  513.624771]                                lock(&mapping->i_mmap_rwsem);
      [  513.625020]                                lock(&anon_vma->rwsem);
      [  513.625253]   lock(&dqm->lock_hidden);
      [  513.625433]
                      *** DEADLOCK ***
      
      [  513.625783] 3 locks held by kswapd0/611:
      [  513.625967]  #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
      [  513.626309]  #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250
      [  513.626671]  #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0
      [  513.627037]
                     stack backtrace:
      [  513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G        W         4.18.0-kfd-root #2
      [  513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
      [  513.627990] Call Trace:
      [  513.628143]  dump_stack+0x7c/0xbb
      [  513.628315]  print_circular_bug.isra.37+0x21b/0x228
      [  513.628581]  __lock_acquire+0xf7d/0x1470
      [  513.628782]  ? unwind_next_frame+0x6c/0x4f0
      [  513.628974]  ? lock_acquire+0xec/0x1e0
      [  513.629154]  lock_acquire+0xec/0x1e0
      [  513.629357]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.629587]  __mutex_lock+0x98/0x970
      [  513.629790]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.630047]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.630309]  ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.630562]  evict_process_queues_nocpsch+0x26/0x140 [amdgpu]
      [  513.630816]  kfd_process_evict_queues+0x3b/0xb0 [amdgpu]
      [  513.631057]  kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu]
      [  513.631288]  amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu]
      [  513.631536]  amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu]
      [  513.632076]  __mmu_notifier_invalidate_range_start+0x70/0xb0
      [  513.632299]  try_to_unmap_one+0x7fc/0x8f0
      [  513.632487]  ? page_lock_anon_vma_read+0x68/0x250
      [  513.632690]  rmap_walk_anon+0x121/0x290
      [  513.632875]  try_to_unmap+0x93/0xf0
      [  513.633050]  ? page_remove_rmap+0x330/0x330
      [  513.633239]  ? rcu_read_unlock+0x60/0x60
      [  513.633422]  ? page_get_anon_vma+0x160/0x160
      [  513.633613]  shrink_page_list+0x606/0xcb0
      [  513.633800]  shrink_inactive_list+0x33b/0x700
      [  513.633997]  shrink_node_memcg+0x37a/0x7f0
      [  513.634186]  ? shrink_node+0xd8/0x490
      [  513.634363]  shrink_node+0xd8/0x490
      [  513.634537]  balance_pgdat+0x18b/0x3b0
      [  513.634718]  kswapd+0x203/0x5c0
      [  513.634887]  ? wait_woken+0xb0/0xb0
      [  513.635062]  kthread+0x100/0x140
      [  513.635231]  ? balance_pgdat+0x3b0/0x3b0
      [  513.635414]  ? kthread_delayed_work_timer_fn+0x80/0x80
      [  513.635626]  ret_from_fork+0x24/0x30
      [  513.636042] Evicting PASID 32768 queues
      [  513.936236] Restoring PASID 32768 queues
      [  524.708912] Evicting PASID 32768 queues
      [  524.999875] Restoring PASID 32768 queues
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Philip Yang <philip.yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • Revert "drm/amdkfd: Fix a circular lock dependency" · d091bc0a
      Oak Zeng committed
      This reverts commit 06b89b38.
      This fix is not proper: allocate_mqd can't be moved before
      allocate_sdma_queue, as it depends on q->properties->sdma_id, which is
      set by the latter.
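      The data dependency behind the revert, as a hedged sketch (field and
      helper names follow the commit text; the details are assumptions):

        /* allocate_sdma_queue() picks the SDMA engine/queue and records the
         * id in q->properties; the MQD setup consumes that id, so
         * allocate_mqd() cannot be hoisted above it. */
        retval = allocate_sdma_queue(dqm, q);  /* sets the sdma id */
        if (retval)
                return retval;
        q->mqd_mem_obj = mqd_mgr->allocate_mqd(mqd_mgr->dev, &q->properties);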
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Philip Yang <philip.yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • Revert "drm/amdkfd: Fix sdma queue allocate race condition" · 70d488fb
      Oak Zeng committed
      This reverts commit f77dac6c.
      This fix is not proper: allocate_mqd can't be moved before
      allocate_sdma_queue, as it depends on q->properties->sdma_id, which is
      set by the latter.
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Philip Yang <philip.yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  21. 12 Jun 2019, 1 commit