- 01 July 2020, 3 commits
-
-
Submitted by Mukul Joshi
[ 150.887733] ====================================================== [ 150.893903] WARNING: possible circular locking dependency detected [ 150.905917] ------------------------------------------------------ [ 150.912129] kfdtest/4081 is trying to acquire lock: [ 150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at: __might_fault+0x3e/0x90 [ 150.924490] but task is already holding lock: [ 150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at: destroy_queue_cpsch+0x29/0x210 [amdgpu] [ 150.939432] which lock already depends on the new lock. [ 150.947603] the existing dependency chain (in reverse order) is: [ 150.955074] -> #3 (&dqm->lock_hidden){+.+.}: [ 150.960822] __mutex_lock+0xa1/0x9f0 [ 150.964996] evict_process_queues_cpsch+0x22/0x120 [amdgpu] [ 150.971155] kfd_process_evict_queues+0x3b/0xc0 [amdgpu] [ 150.977054] kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu] [ 150.982442] amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu] [ 150.988615] amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu] [ 150.994448] __mmu_notifier_invalidate_range_start+0xa4/0x240 [ 151.000714] copy_page_range+0xd70/0xd80 [ 151.005159] dup_mm+0x3ca/0x550 [ 151.008816] copy_process+0x1bdc/0x1c70 [ 151.013183] _do_fork+0x76/0x6c0 [ 151.016929] __x64_sys_clone+0x8c/0xb0 [ 151.021201] do_syscall_64+0x4a/0x1d0 [ 151.025404] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.030977] -> #2 (&adev->notifier_lock){+.+.}: [ 151.036993] __mutex_lock+0xa1/0x9f0 [ 151.041168] amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu] [ 151.047019] __mmu_notifier_invalidate_range_start+0xa4/0x240 [ 151.053277] copy_page_range+0xd70/0xd80 [ 151.057722] dup_mm+0x3ca/0x550 [ 151.061388] copy_process+0x1bdc/0x1c70 [ 151.065748] _do_fork+0x76/0x6c0 [ 151.069499] __x64_sys_clone+0x8c/0xb0 [ 151.073765] do_syscall_64+0x4a/0x1d0 [ 151.077952] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.083523] -> #1 (mmu_notifier_invalidate_range_start){+.+.}: [ 151.090833] change_protection+0x802/0xab0 [ 151.095448] mprotect_fixup+0x187/0x2d0 [ 151.099801] setup_arg_pages+0x124/0x250 [ 151.104251] load_elf_binary+0x3a4/0x1464 [ 151.108781] search_binary_handler+0x6c/0x210 [ 151.113656] __do_execve_file.isra.40+0x7f7/0xa50 [ 151.118875] do_execve+0x21/0x30 [ 151.122632] call_usermodehelper_exec_async+0x17e/0x190 [ 151.128393] ret_from_fork+0x24/0x30 [ 151.132489] -> #0 (&mm->mmap_sem#2){++++}: [ 151.138064] __lock_acquire+0x11a1/0x1490 [ 151.142597] lock_acquire+0x90/0x180 [ 151.146694] __might_fault+0x68/0x90 [ 151.150879] read_sdma_queue_counter+0x5f/0xb0 [amdgpu] [ 151.156693] update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu] [ 151.163725] destroy_queue_cpsch+0x1ae/0x210 [amdgpu] [ 151.169373] pqm_destroy_queue+0xf0/0x250 [amdgpu] [ 151.174762] kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu] [ 151.180577] kfd_ioctl+0x223/0x400 [amdgpu] [ 151.185284] ksys_ioctl+0x8f/0xb0 [ 151.189118] __x64_sys_ioctl+0x16/0x20 [ 151.193389] do_syscall_64+0x4a/0x1d0 [ 151.197569] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 151.203141] other info that might help us debug this: [ 151.211140] Chain exists of: &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden [ 151.222535] Possible unsafe locking scenario: [ 151.228447] CPU0 CPU1 [ 151.232971] ---- ---- [ 151.237502] lock(&dqm->lock_hidden); [ 151.241254] lock(&adev->notifier_lock); [ 151.247774] lock(&dqm->lock_hidden); [ 151.254038] lock(&mm->mmap_sem#2); This commit fixes the warning by ensuring get_user() is not called while reading SDMA stats with dqm_lock held as get_user() could cause a page fault which leads to the circular 
locking scenario. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
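
A minimal sketch of the ordering this fix implies, using userspace stand-ins (pthread mutex for the dqm lock, a plain pointer read for `get_user()`) and hypothetical names: the potentially faulting read of the user-space counter happens before the lock is taken, so the `dqm lock -> mmap_sem` edge from the lockdep chain above never forms.

```c
/* Sketch, not the driver's code: read the user counter before locking. */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER; /* stands in for dqm->lock_hidden */

struct queue { const uint64_t *user_counter; uint64_t past_activity; };

/* In the kernel this is a get_user()-style access that can fault and take mmap_sem. */
static int read_sdma_queue_counter(const uint64_t *usr_ptr, uint64_t *val)
{
    if (!usr_ptr)
        return -1;
    *val = *usr_ptr;
    return 0;
}

static int destroy_queue(struct queue *q)
{
    uint64_t count = 0;

    /* 1. The faulting read happens with no dqm lock held. */
    int ret = read_sdma_queue_counter(q->user_counter, &count);

    /* 2. Only then enter the critical section to update shared state. */
    pthread_mutex_lock(&dqm_lock);
    if (ret == 0)
        q->past_activity += count;
    /* ... unmap and free the queue under the lock ... */
    pthread_mutex_unlock(&dqm_lock);
    return ret;
}

int main(void)
{
    uint64_t counter = 1234;
    struct queue q = { .user_counter = &counter, .past_activity = 0 };
    destroy_queue(&q);
    printf("accumulated SDMA time: %llu us\n", (unsigned long long)q.past_activity);
    return 0;
}
```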
-
Submitted by Nirmoy Das
Used sparse (make C=1) to find these loose ends. v2: removed an unwanted extra line. Signed-off-by: Nirmoy Das <nirmoy.das@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
v4: drop get_tile_config, comment out other callbacks. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 May 2020, 1 commit
-
-
Submitted by Mukul Joshi
Track SDMA usage on a per-process basis and report it through sysfs. The value in the sysfs file indicates the amount of time SDMA has been in use by this process since the creation of the process. The value has microsecond granularity. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
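
A small sketch of the reporting side under stated assumptions: the field names and the split between already-destroyed queues and currently live queues are illustrative, not the driver's; what the commit guarantees is a single cumulative microsecond value per process exposed through a sysfs read.

```c
/* Sketch (hypothetical names): per-process SDMA-usage accumulator and the
 * kind of formatting a sysfs "show" callback would produce. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct kfd_process_stats {
    uint64_t sdma_past_activity_us;  /* time from queues already destroyed */
    uint64_t sdma_active_us;         /* snapshot for currently live queues */
};

/* Value reported to userspace: total SDMA busy time since process creation, in us. */
static int sdma_usage_show(const struct kfd_process_stats *s, char *buf, size_t len)
{
    uint64_t total = s->sdma_past_activity_us + s->sdma_active_us;
    return snprintf(buf, len, "%llu\n", (unsigned long long)total);
}

int main(void)
{
    struct kfd_process_stats stats = { .sdma_past_activity_us = 1500, .sdma_active_us = 250 };
    char buf[32];
    sdma_usage_show(&stats, buf, sizeof(buf));
    printf("%s", buf);   /* a reader of the sysfs file would see "1750" */
    return 0;
}
```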
-
- 02 May 2020, 1 commit
-
-
Submitted by Yong Zhao
The queue mask used for set_resources always assumes eight queues per pipe, so KFD needs to align with that by using the helper amdgpu_queue_mask_bit_to_set_resource_bit(). Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
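
A simplified sketch of the remapping idea, not the driver's exact helper (which also accounts for the MEC level): a bit index taken from a layout with fewer queues per pipe has to be re-based onto the fixed 8-slots-per-pipe layout that SET_RESOURCES expects.

```c
/* Sketch: re-base a queue-mask bit onto an 8-queues-per-pipe layout. */
#include <stdint.h>
#include <stdio.h>

static uint64_t queue_mask_bit_to_set_resource_bit(unsigned int bit,
                                                   unsigned int queues_per_pipe)
{
    unsigned int pipe  = bit / queues_per_pipe;
    unsigned int queue = bit % queues_per_pipe;

    return (uint64_t)1 << (pipe * 8 + queue);  /* 8 slots per pipe in SET_RESOURCES */
}

int main(void)
{
    /* e.g. with 4 queues per pipe, bit 5 means pipe 1 / queue 1,
     * which lands on bit 9 of the SET_RESOURCES mask (0x200). */
    printf("0x%llx\n",
           (unsigned long long)queue_mask_bit_to_set_resource_bit(5, 4));
    return 0;
}
```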
-
- 29 April 2020, 1 commit
-
-
Submitted by Joseph Greathouse
The current GWS usage model only allows a single GWS-enabled process to be active on the GPU at once. This ensures that a barrier-using kernel gets a known amount of GPU hardware, to prevent deadlock due to inability to make progress past the GWS barrier. The HWS watches how many GWS entries are assigned to each process, and goes into over-subscription mode when two processes need more than the 64 that are available. The current KFD method for working with this is to allocate all 64 GWS entries to each GWS-capable process. When more than one GWS-enabled process is in the runlist, we must make sure the runlist is in over-subscription mode, so that the HWS gets a chained RUN_LIST packet and continues scheduling kernels. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
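
A sketch of the decision the commit describes, with hypothetical types and names: since each GWS-capable process is handed all 64 entries, the runlist must be flagged as over-subscribed as soon as a second GWS-enabled process appears, so the HWS emits a chained RUN_LIST packet.

```c
/* Sketch: over-subscribe the runlist when more than one process uses GWS. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct process_entry { bool uses_gws; };

static bool runlist_is_oversubscribed(const struct process_entry *procs, size_t n)
{
    size_t gws_procs = 0;

    for (size_t i = 0; i < n; i++)
        if (procs[i].uses_gws)
            gws_procs++;

    /* Only one process can own the 64 GWS entries outright. */
    return gws_procs > 1;
}

int main(void)
{
    struct process_entry procs[] = { { true }, { false }, { true } };
    printf("oversubscribed: %d\n",
           runlist_is_oversubscribed(procs, sizeof(procs) / sizeof(procs[0])));
    return 0;
}
```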
-
- 29 February 2020, 1 commit
-
-
Submitted by Eric Huang
The SDMA MQD memory type is NC, which can cause MQD data to be accidentally overwritten by a stale cache line. Changing it to UC, the default for GART, fixes the issue. The mqd_gfx9 parameter is meant for control stacks that are allocated together with user-mode queue MQDs. Setting mqd_gfx9 to true maps the control stack pages as NC. Here it was accidentally applied to SDMA MQDs, which are allocated together with the HIQ MQD. Setting mqd_gfx9 to false avoids that. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 27 February 2020, 5 commits
-
-
Submitted by Yong Zhao
The previous way of using the SDMA queue count to infer whether we should unmap the SDMA engines has bugs. The only reason it did not cause issues is that the MEC firmware unmaps all queues (CP + SDMA) when an unmap packet for the compute engine is received. Because of that, only one unmap-queue packet is needed, instead of one unmap-queue packet for CP and one for each SDMA engine, which results in much simpler driver code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
These prints are duplicated or useless. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
The previous code for calculating active CP queues is wrong when some SDMA queues are inactive. Fix that by counting CP queues directly. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
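A sketch of the direct-counting approach, with hypothetical types: instead of deriving the CP count by subtraction, walk the queue list and count only compute queues that are actually active, which stays correct when some SDMA queues are inactive.

```c
/* Sketch: count active CP queues directly. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum queue_type { QUEUE_COMPUTE, QUEUE_SDMA };

struct queue { enum queue_type type; bool is_active; };

static unsigned int active_cp_queue_count(const struct queue *qs, size_t n)
{
    unsigned int count = 0;

    for (size_t i = 0; i < n; i++)
        if (qs[i].type == QUEUE_COMPUTE && qs[i].is_active)
            count++;

    return count;
}

int main(void)
{
    struct queue qs[] = {
        { QUEUE_COMPUTE, true }, { QUEUE_SDMA, false }, { QUEUE_COMPUTE, false },
    };
    printf("%u active CP queues\n", active_cp_queue_count(qs, 3));
    return 0;
}
```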
-
Submitted by Yong Zhao
The queues represented in queue_bitmap are only CP queues. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
The new name makes the code easier to understand. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 04 February 2020, 1 commit
-
-
Submitted by Yong Zhao
The sdma_queue_count increment should be done before execute_queues_cpsch(), which calls pm_calc_rlib_size(), where sdma_queue_count is used to calculate whether over_subscription is triggered. With the previous code, when an SDMA queue was created, compute_queue_count in pm_calc_rlib_size() was one more than the actual compute queue number, because queue_count had been incremented while sdma_queue_count had not. This patch fixes that. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
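
A sketch of why the ordering matters, with hypothetical field names: the runlist sizing derives the compute-queue count by subtracting the SDMA count from the total, so both counters must be updated before the recalculation runs.

```c
/* Sketch: keep the two counters consistent before runlist sizing. */
#include <stdio.h>

struct dqm_counters { unsigned int queue_count; unsigned int sdma_queue_count; };

static unsigned int compute_queue_count(const struct dqm_counters *c)
{
    return c->queue_count - c->sdma_queue_count;
}

int main(void)
{
    struct dqm_counters c = { .queue_count = 3, .sdma_queue_count = 0 };

    /* Creating an SDMA queue: bump BOTH counters before recalculating the
     * runlist; bumping only queue_count would over-count compute queues. */
    c.queue_count++;
    c.sdma_queue_count++;

    printf("compute queues seen by runlist sizing: %u\n", compute_queue_count(&c));
    return 0;
}
```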
-
- 17 January 2020, 1 commit
-
-
Submitted by Yong Zhao
The SW scheduler was previously called the non-HW scheduler, or non-HWS. This message is useful when triaging issues from dmesg. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 08 January 2020, 3 commits
-
-
Submitted by Felix Kuehling
Don't use the HWS if it's known to be hanging. During a reset, also don't try to destroy the HIQ, because that may hang on SRIOV if the KIQ is unresponsive. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Felix Kuehling
Move HWS hang detection into unmap_queues_cpsch to catch hangs in all cases. If this happens during a reset, don't schedule another reset, because the reset already in progress is expected to take care of it. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Felix Kuehling
dqm->pipeline_mem wasn't used anywhere. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 November 2019, 1 commit
-
-
Submitted by Yong Zhao
It is the same as KFD_MQD_TYPE_CP, so delete it. As a result, we will have one less mqd manager per device. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 14 November 2019, 1 commit
-
-
Submitted by Yong Zhao
A doorbell offset can mean either a byte offset or a dword offset, and its zero reference point also differs: sometimes it is the start of the PCI doorbell BAR, sometimes the start of the process doorbell pages. Use better names to avoid confusion. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
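
A tiny sketch of the two conventions the renaming untangles; the names here are illustrative, not the driver's. The point is that the unit (bytes vs dwords) and the base (doorbell BAR vs process doorbell pages) should be obvious from the identifier.

```c
/* Sketch: make the doorbell-offset unit explicit. */
#include <stdint.h>
#include <stdio.h>

static uint32_t doorbell_byte_off_to_dword_off(uint32_t byte_off)
{
    return byte_off / 4;   /* dword offset = byte offset / 4 */
}

int main(void)
{
    uint32_t off_in_bar_bytes  = 0x2000;   /* relative to the PCI doorbell BAR */
    uint32_t off_in_bar_dwords = doorbell_byte_off_to_dword_off(off_in_bar_bytes);

    printf("byte offset 0x%x == dword offset 0x%x\n",
           (unsigned)off_in_bar_bytes, (unsigned)off_in_bar_dwords);
    return 0;
}
```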
-
- 26 October 2019, 1 commit
-
-
Submitted by Philip Yang
If device reset/suspend/resume fails for some reason, the dqm lock is held forever, which causes a deadlock. Below is a kernel backtrace from an application opening KFD after a failed suspend/resume. Instead of holding the dqm lock in pre_reset and releasing it in post_reset, add a dqm->sched_running flag that is modified in dqm->ops.start and dqm->ops.stop. The flag doesn't need lock protection because all reads and writes happen inside the dqm lock. For the HWS case, map_queues_cpsch and unmap_queues_cpsch check the sched_running flag before sending the updated runlist. v2: For the non-HWS case, when the device is stopped, don't call load/destroy_mqd for eviction, restore and create queue, and avoid the debugfs dump of HQDs. Backtrace of the dqm lock deadlock: [Thu Oct 17 16:43:37 2019] INFO: task rocminfo:3024 blocked for more than 120 seconds. [Thu Oct 17 16:43:37 2019] Not tainted 5.0.0-rc1-kfd-compute-rocm-dkms-no-npi-1131 #1 [Thu Oct 17 16:43:37 2019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [Thu Oct 17 16:43:37 2019] rocminfo D 0 3024 2947 0x80000000 [Thu Oct 17 16:43:37 2019] Call Trace: [Thu Oct 17 16:43:37 2019] ? __schedule+0x3d9/0x8a0 [Thu Oct 17 16:43:37 2019] schedule+0x32/0x70 [Thu Oct 17 16:43:37 2019] schedule_preempt_disabled+0xa/0x10 [Thu Oct 17 16:43:37 2019] __mutex_lock.isra.9+0x1e3/0x4e0 [Thu Oct 17 16:43:37 2019] ? __call_srcu+0x264/0x3b0 [Thu Oct 17 16:43:37 2019] ? process_termination_cpsch+0x24/0x2f0 [amdgpu] [Thu Oct 17 16:43:37 2019] process_termination_cpsch+0x24/0x2f0 [amdgpu] [Thu Oct 17 16:43:37 2019] kfd_process_dequeue_from_all_devices+0x42/0x60 [amdgpu] [Thu Oct 17 16:43:37 2019] kfd_process_notifier_release+0x1be/0x220 [amdgpu] [Thu Oct 17 16:43:37 2019] __mmu_notifier_release+0x3e/0xc0 [Thu Oct 17 16:43:37 2019] exit_mmap+0x160/0x1a0 [Thu Oct 17 16:43:37 2019] ? __handle_mm_fault+0xba3/0x1200 [Thu Oct 17 16:43:37 2019] ? exit_robust_list+0x5a/0x110 [Thu Oct 17 16:43:37 2019] mmput+0x4a/0x120 [Thu Oct 17 16:43:37 2019] do_exit+0x284/0xb20 [Thu Oct 17 16:43:37 2019] ? handle_mm_fault+0xfa/0x200 [Thu Oct 17 16:43:37 2019] do_group_exit+0x3a/0xa0 [Thu Oct 17 16:43:37 2019] __x64_sys_exit_group+0x14/0x20 [Thu Oct 17 16:43:37 2019] do_syscall_64+0x4f/0x100 [Thu Oct 17 16:43:37 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Philip Yang <Philip.Yang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
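
A minimal sketch of the sched_running idea, with hypothetical names: the flag is flipped only in the start/stop paths (which, per the commit, already run under the dqm lock), and the map/unmap paths simply skip touching hardware while the scheduler is down, instead of pre_reset holding the dqm lock across the whole reset.

```c
/* Sketch: skip sending runlist packets while the scheduler is stopped. */
#include <stdbool.h>
#include <stdio.h>

struct device_queue_manager { bool sched_running; };

static int unmap_queues(struct device_queue_manager *dqm)
{
    if (!dqm->sched_running)
        return 0;            /* device stopped: nothing to send to the HWS */

    printf("sending unmap-queues runlist packet\n");
    return 0;
}

int main(void)
{
    struct device_queue_manager dqm = { .sched_running = false };
    unmap_queues(&dqm);      /* safe no-op while stopped */
    dqm.sched_running = true;
    unmap_queues(&dqm);
    return 0;
}
```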
-
- 08 October 2019, 2 commits
-
-
Submitted by Oak Zeng
Previously only the PCIe-optimized SDMA engine HQDs were exposed in debugfs. Print the HQDs of all SDMA engines. Reported-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Oak Zeng
On device initialization, a chunk of GTT memory is pre-allocated for the HIQ and all SDMA queue MQDs. The size of this allocation was wrong: the correct SDMA engine count is the number of PCIe-optimized SDMA engines plus the number of XGMI SDMA engines. Reported-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Jonathan Kim <Jonathan.Kim@amd.com> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
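
A sketch of the corrected sizing, with hypothetical field names and a made-up per-MQD size: the shared GTT chunk must cover the HIQ MQD plus one MQD per queue on every SDMA engine, counting both the PCIe-optimized and the XGMI engines.

```c
/* Sketch: size the shared HIQ + SDMA MQD allocation over all SDMA engines. */
#include <stddef.h>
#include <stdio.h>

struct sdma_info {
    unsigned int num_sdma_engines;        /* PCIe-optimized engines */
    unsigned int num_xgmi_sdma_engines;   /* XGMI engines */
    unsigned int queues_per_engine;
};

static size_t hiq_sdma_mqd_alloc_size(const struct sdma_info *info, size_t mqd_size)
{
    unsigned int engines = info->num_sdma_engines + info->num_xgmi_sdma_engines;

    return (size_t)engines * info->queues_per_engine * mqd_size  /* all SDMA MQDs */
           + mqd_size;                                           /* plus the HIQ MQD */
}

int main(void)
{
    struct sdma_info info = { .num_sdma_engines = 2, .num_xgmi_sdma_engines = 6,
                              .queues_per_engine = 2 };
    printf("%zu bytes\n", hiq_sdma_mqd_alloc_size(&info, 512));
    return 0;
}
```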
-
- 03 October 2019, 5 commits
-
-
Submitted by Yong Zhao
This makes it possible to query the VMID-PASID mapping through software. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
Since KFD PASIDs start at 0x8000 (32768 in decimal), they are easier to read as hex numbers. Meanwhile, change the pasid type from unsigned int to uint16_t to be consistent throughout the code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by shaoyunl
Keep the same use of CHIP_IDs for navi12 in kfd. Signed-off-by: shaoyunl <shaoyun.liu@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
Currently this function pointer is missing for GFX10. Since it has been a void function since GFX9, fix this by checking the function pointer before dereferencing it. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
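
A sketch of the defensive call pattern the commit uses, with illustrative names only: some ASICs simply do not provide the hook, and since it returns void there is nothing to do when it is absent, so a NULL check is enough.

```c
/* Sketch: guard an optional per-ASIC callback with a NULL check. */
#include <stdio.h>

struct mqd_manager {
    void (*asic_hook)(void *mqd);        /* optional per-ASIC hook (illustrative name) */
};

static void maybe_call_hook(struct mqd_manager *mm, void *mqd)
{
    if (mm->asic_hook)                   /* newer ASICs may leave this NULL */
        mm->asic_hook(mqd);
}

int main(void)
{
    struct mqd_manager mm = { .asic_hook = NULL };
    maybe_call_hook(&mm, NULL);          /* safe no-op instead of a NULL dereference */
    printf("no crash\n");
    return 0;
}
```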
-
Submitted by Yong Zhao
The packet manager is not needed in non-HWS mode except on Hawaii, so only initialize it for Hawaii in non-HWS mode. This simplifies debugging in non-HWS mode for all new ASICs, because it takes one variable out of the equation. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 September 2019, 2 commits
-
-
Submitted by Huang Rui
Renoir is GFX9, so enable the v9 device queue manager. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Yong Zhao
Initial support for Navi14 in KFD. The device IDs will be added later. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 August 2019, 1 commit
-
-
Submitted by YueHaibing
Fix sparse warning: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_device_queue_manager.c:1846:6: warning: symbol 'deallocate_hiq_sdma_mqd' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 July 2019, 2 commits
-
-
Submitted by Oak Zeng
In the original formula, when the SDMA queue number is 64, the left shift overflows. Use an equivalent expression that won't overflow. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
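
A sketch of the overflow and one safe equivalence (not necessarily the exact rewrite used in the commit): with 64 queues, `(1ULL << 64) - 1` is undefined behaviour in C, so the all-ones mask has to be produced without a full-width shift.

```c
/* Sketch: build an n-bit mask without a 64-bit shift overflow. */
#include <stdint.h>
#include <stdio.h>

static uint64_t sdma_queue_mask(unsigned int num_queues)
{
    if (num_queues == 0)
        return 0;
    if (num_queues >= 64)
        return UINT64_MAX;                    /* avoid the undefined 64-bit shift */

    return (UINT64_C(1) << num_queues) - 1;   /* safe for 1..63 */
}

int main(void)
{
    printf("mask(63) = 0x%llx\n", (unsigned long long)sdma_queue_mask(63));
    printf("mask(64) = 0x%llx\n", (unsigned long long)sdma_queue_mask(64));
    return 0;
}
```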
-
Submitted by Yong Zhao
Add initial support for ARCTURUS to KFD. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 12 July 2019, 1 commit
-
-
Submitted by Eric Huang
The CP hang occurs in the OCL conformance test only on a Supermicro platform with 40 cores, where the test generates 40 threads. The root cause is a race condition on unprotected flags. The fix is to move the is_evicted and is_active flag updates (in init_mqd()) into the protected area. Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
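
A sketch of the invariant the fix restores, using userspace stand-ins: the is_evicted and is_active flags are queue state that other threads read, so they must be written inside the same critical section as the rest of the queue setup rather than outside it.

```c
/* Sketch: publish the queue flags only under the lock. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;

struct queue_properties { bool is_evicted; bool is_active; };

static void init_queue_state(struct queue_properties *q, bool process_evicted)
{
    pthread_mutex_lock(&dqm_lock);
    /* Flag updates happen under the lock, so a concurrent eviction or
     * restore sees a consistent view of the queue. */
    q->is_evicted = process_evicted;
    q->is_active  = !process_evicted;
    /* ... rest of MQD initialization ... */
    pthread_mutex_unlock(&dqm_lock);
}

int main(void)
{
    struct queue_properties q = { 0 };
    init_queue_state(&q, false);
    printf("active=%d evicted=%d\n", q.is_active, q.is_evicted);
    return 0;
}
```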
-
- 22 June 2019, 2 commits
-
-
Submitted by Oak Zeng
Previously KFD did not use GWS, so this mask was set to 0. Set all 64 bits to 1 because KFD can now use all 64 GWS resources. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Philip Cox
KFD (Kernel Fusion Driver) is the kernel driver for the compute backend of the usermode compute stack. v2: squash in updates (Alex) v3: squash in rebase fixes (Alex) Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Signed-off-by: Philip Cox <Philip.Cox@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 18 June 2019, 4 commits
-
-
Submitted by Oak Zeng
SDMA queue allocation requires the dqm lock, as it modifies the global dqm members. Enclose it in the dqm lock. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Oak Zeng
The idea to break the circular lock dependency is to temporarily drop dqm lock before calling allocate_mqd. See callstack #1 below. [ 59.510149] [drm] Initialized amdgpu 3.30.0 20150101 for 0000:04:00.0 on minor 0 [ 513.604034] ====================================================== [ 513.604205] WARNING: possible circular locking dependency detected [ 513.604375] 4.18.0-kfd-root #2 Tainted: G W [ 513.604530] ------------------------------------------------------ [ 513.604699] kswapd0/611 is trying to acquire lock: [ 513.604840] 00000000d254022e (&dqm->lock_hidden){+.+.}, at: evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.605150] but task is already holding lock: [ 513.605307] 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250 [ 513.605540] which lock already depends on the new lock. [ 513.605747] the existing dependency chain (in reverse order) is: [ 513.605944] -> #4 (&anon_vma->rwsem){++++}: [ 513.606106] __vma_adjust+0x147/0x7f0 [ 513.606231] __split_vma+0x179/0x190 [ 513.606353] mprotect_fixup+0x217/0x260 [ 513.606553] do_mprotect_pkey+0x211/0x380 [ 513.606752] __x64_sys_mprotect+0x1b/0x20 [ 513.606954] do_syscall_64+0x50/0x1a0 [ 513.607149] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 513.607380] -> #3 (&mapping->i_mmap_rwsem){++++}: [ 513.607678] rmap_walk_file+0x1f0/0x280 [ 513.607887] page_referenced+0xdd/0x180 [ 513.608081] shrink_page_list+0x853/0xcb0 [ 513.608279] shrink_inactive_list+0x33b/0x700 [ 513.608483] shrink_node_memcg+0x37a/0x7f0 [ 513.608682] shrink_node+0xd8/0x490 [ 513.608869] balance_pgdat+0x18b/0x3b0 [ 513.609062] kswapd+0x203/0x5c0 [ 513.609241] kthread+0x100/0x140 [ 513.609420] ret_from_fork+0x24/0x30 [ 513.609607] -> #2 (fs_reclaim){+.+.}: [ 513.609883] kmem_cache_alloc_trace+0x34/0x2e0 [ 513.610093] reservation_object_reserve_shared+0x139/0x300 [ 513.610326] ttm_bo_init_reserved+0x291/0x480 [ttm] [ 513.610567] amdgpu_bo_do_create+0x1d2/0x650 [amdgpu] [ 513.610811] amdgpu_bo_create+0x40/0x1f0 [amdgpu] [ 513.611041] amdgpu_bo_create_reserved+0x249/0x2d0 [amdgpu] [ 513.611290] amdgpu_bo_create_kernel+0x12/0x70 [amdgpu] [ 513.611584] amdgpu_ttm_init+0x2cb/0x560 [amdgpu] [ 513.611823] gmc_v9_0_sw_init+0x400/0x750 [amdgpu] [ 513.612491] amdgpu_device_init+0x14eb/0x1990 [amdgpu] [ 513.612730] amdgpu_driver_load_kms+0x78/0x290 [amdgpu] [ 513.612958] drm_dev_register+0x111/0x1a0 [ 513.613171] amdgpu_pci_probe+0x11c/0x1e0 [amdgpu] [ 513.613389] local_pci_probe+0x3f/0x90 [ 513.613581] pci_device_probe+0x102/0x1c0 [ 513.613779] driver_probe_device+0x2a7/0x480 [ 513.613984] __driver_attach+0x10a/0x110 [ 513.614179] bus_for_each_dev+0x67/0xc0 [ 513.614372] bus_add_driver+0x1eb/0x260 [ 513.614565] driver_register+0x5b/0xe0 [ 513.614756] do_one_initcall+0xac/0x357 [ 513.614952] do_init_module+0x5b/0x213 [ 513.615145] load_module+0x2542/0x2d30 [ 513.615337] __do_sys_finit_module+0xd2/0x100 [ 513.615541] do_syscall_64+0x50/0x1a0 [ 513.615731] entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 513.615963] -> #1 (reservation_ww_class_mutex){+.+.}: [ 513.616293] amdgpu_amdkfd_alloc_gtt_mem+0xcf/0x2c0 [amdgpu] [ 513.616554] init_mqd+0x223/0x260 [amdgpu] [ 513.616779] create_queue_nocpsch+0x4d9/0x600 [amdgpu] [ 513.617031] pqm_create_queue+0x37c/0x520 [amdgpu] [ 513.617270] kfd_ioctl_create_queue+0x2f9/0x650 [amdgpu] [ 513.617522] kfd_ioctl+0x202/0x350 [amdgpu] [ 513.617724] do_vfs_ioctl+0x9f/0x6c0 [ 513.617914] ksys_ioctl+0x66/0x70 [ 513.618095] __x64_sys_ioctl+0x16/0x20 [ 513.618286] do_syscall_64+0x50/0x1a0 [ 513.618476] 
entry_SYSCALL_64_after_hwframe+0x49/0xbe [ 513.618695] -> #0 (&dqm->lock_hidden){+.+.}: [ 513.618984] __mutex_lock+0x98/0x970 [ 513.619197] evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.619459] kfd_process_evict_queues+0x3b/0xb0 [amdgpu] [ 513.619710] kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu] [ 513.620103] amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu] [ 513.620363] amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu] [ 513.620614] __mmu_notifier_invalidate_range_start+0x70/0xb0 [ 513.620851] try_to_unmap_one+0x7fc/0x8f0 [ 513.621049] rmap_walk_anon+0x121/0x290 [ 513.621242] try_to_unmap+0x93/0xf0 [ 513.621428] shrink_page_list+0x606/0xcb0 [ 513.621625] shrink_inactive_list+0x33b/0x700 [ 513.621835] shrink_node_memcg+0x37a/0x7f0 [ 513.622034] shrink_node+0xd8/0x490 [ 513.622219] balance_pgdat+0x18b/0x3b0 [ 513.622410] kswapd+0x203/0x5c0 [ 513.622589] kthread+0x100/0x140 [ 513.622769] ret_from_fork+0x24/0x30 [ 513.622957] other info that might help us debug this: [ 513.623354] Chain exists of: &dqm->lock_hidden --> &mapping->i_mmap_rwsem --> &anon_vma->rwsem [ 513.623900] Possible unsafe locking scenario: [ 513.624189] CPU0 CPU1 [ 513.624397] ---- ---- [ 513.624594] lock(&anon_vma->rwsem); [ 513.624771] lock(&mapping->i_mmap_rwsem); [ 513.625020] lock(&anon_vma->rwsem); [ 513.625253] lock(&dqm->lock_hidden); [ 513.625433] *** DEADLOCK *** [ 513.625783] 3 locks held by kswapd0/611: [ 513.625967] #0: 00000000f14edf84 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30 [ 513.626309] #1: 00000000961547fc (&anon_vma->rwsem){++++}, at: page_lock_anon_vma_read+0xe4/0x250 [ 513.626671] #2: 0000000067b5cd12 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x5/0xb0 [ 513.627037] stack backtrace: [ 513.627292] CPU: 0 PID: 611 Comm: kswapd0 Tainted: G W 4.18.0-kfd-root #2 [ 513.627632] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 513.627990] Call Trace: [ 513.628143] dump_stack+0x7c/0xbb [ 513.628315] print_circular_bug.isra.37+0x21b/0x228 [ 513.628581] __lock_acquire+0xf7d/0x1470 [ 513.628782] ? unwind_next_frame+0x6c/0x4f0 [ 513.628974] ? lock_acquire+0xec/0x1e0 [ 513.629154] lock_acquire+0xec/0x1e0 [ 513.629357] ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.629587] __mutex_lock+0x98/0x970 [ 513.629790] ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.630047] ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.630309] ? evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.630562] evict_process_queues_nocpsch+0x26/0x140 [amdgpu] [ 513.630816] kfd_process_evict_queues+0x3b/0xb0 [amdgpu] [ 513.631057] kgd2kfd_quiesce_mm+0x1c/0x40 [amdgpu] [ 513.631288] amdgpu_amdkfd_evict_userptr+0x38/0x70 [amdgpu] [ 513.631536] amdgpu_mn_invalidate_range_start_hsa+0xa6/0xc0 [amdgpu] [ 513.632076] __mmu_notifier_invalidate_range_start+0x70/0xb0 [ 513.632299] try_to_unmap_one+0x7fc/0x8f0 [ 513.632487] ? page_lock_anon_vma_read+0x68/0x250 [ 513.632690] rmap_walk_anon+0x121/0x290 [ 513.632875] try_to_unmap+0x93/0xf0 [ 513.633050] ? page_remove_rmap+0x330/0x330 [ 513.633239] ? rcu_read_unlock+0x60/0x60 [ 513.633422] ? page_get_anon_vma+0x160/0x160 [ 513.633613] shrink_page_list+0x606/0xcb0 [ 513.633800] shrink_inactive_list+0x33b/0x700 [ 513.633997] shrink_node_memcg+0x37a/0x7f0 [ 513.634186] ? shrink_node+0xd8/0x490 [ 513.634363] shrink_node+0xd8/0x490 [ 513.634537] balance_pgdat+0x18b/0x3b0 [ 513.634718] kswapd+0x203/0x5c0 [ 513.634887] ? wait_woken+0xb0/0xb0 [ 513.635062] kthread+0x100/0x140 [ 513.635231] ? 
balance_pgdat+0x3b0/0x3b0 [ 513.635414] ? kthread_delayed_work_timer_fn+0x80/0x80 [ 513.635626] ret_from_fork+0x24/0x30 [ 513.636042] Evicting PASID 32768 queues [ 513.936236] Restoring PASID 32768 queues [ 524.708912] Evicting PASID 32768 queues [ 524.999875] Restoring PASID 32768 queues Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
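
A sketch of the "temporarily drop the dqm lock before allocate_mqd" shape, using userspace stand-ins and hypothetical helpers: the SDMA id is reserved under the lock because it is shared dqm state, the lock is dropped for the memory allocation that may enter reclaim, and it is re-taken to publish the queue, so the dqm lock is never held across a reclaim path like the one in the trace above.

```c
/* Sketch: drop the dqm lock around the allocation that can reclaim memory. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

static pthread_mutex_t dqm_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned int next_sdma_id;             /* shared dqm state */

static int create_sdma_queue(void)
{
    pthread_mutex_lock(&dqm_lock);
    unsigned int sdma_id = next_sdma_id++;    /* needs the lock: shared counter */
    pthread_mutex_unlock(&dqm_lock);          /* drop before allocating memory */

    void *mqd = malloc(512);                  /* stands in for allocate_mqd() */
    if (!mqd)
        return -1;

    pthread_mutex_lock(&dqm_lock);            /* re-take to publish the queue */
    printf("queue %u ready, mqd %p\n", sdma_id, mqd);
    pthread_mutex_unlock(&dqm_lock);

    free(mqd);
    return 0;
}

int main(void)
{
    return create_sdma_queue();
}
```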
-
Submitted by Oak Zeng
This reverts commit 06b89b38. That fix was not proper: allocate_mqd can't be moved before allocate_sdma_queue, as it depends on q->properties->sdma_id, which is set later. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Submitted by Oak Zeng
This reverts commit f77dac6c. That fix was not proper: allocate_mqd can't be moved before allocate_sdma_queue, as it depends on q->properties->sdma_id, which is set later. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 12 June 2019, 1 commit
-
-
Submitted by Oak Zeng
SDMA queue allocation requires the dqm lock, as it modifies the global dqm members. Move up the dqm_lock so SDMA queue allocation is enclosed in the critical section. Move MQD allocation out of the critical section to avoid a circular lock dependency. Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-