- 15 4月, 2022 2 次提交
-
-
由 Felix Kuehling 提交于
Add the waiters to the wait queue during initialization, while holding the event spinlock. Otherwise the waiter will not get activated if the event signals before being added to the wait queue. Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Dan Carpenter 提交于
If lookup_event_by_id() returns a NULL "ev" pointer then the spin_lock(&ev->lock) will crash. This was detected by Smatch: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_events.c:644 kfd_set_event() error: we previously assumed 'ev' could be null (see line 639) Fixes: 5273e82c ("drm/amdkfd: Improve concurrency of event handling") Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 13 4月, 2022 3 次提交
-
-
由 Mukul Joshi 提交于
Currently, the IO-links to the device being removed from topology, are not cleared. As a result, there would be dangling links left in the KFD topology. This patch aims to fix the following: 1. Cleanup all IO links to the device being removed. 2. Ensure that node numbering in sysfs and nodes proximity domain values are consistent after the device is removed: a. Adding a device and removing a GPU device are made mutually exclusive. b. The global proximity domain counter is no longer required to be an atomic counter. A normal 32-bit counter can be used instead. 3. Update generation_count to let user-mode know that topology has changed due to device removal. CC: Shuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: NShuotao Xu <shuotaoxu@microsoft.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NMukul Joshi <mukul.joshi@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Lang Yu 提交于
A MAX_GPU_INSTANCE bits bitmap will suffice. Signed-off-by: NLang Yu <Lang.Yu@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Felix Kuehling 提交于
The synchronize_rcu call in destroy_events can take several ms, which noticeably slows down applications destroying many events. Use kfree_rcu to free the event structure asynchronously and eliminate the synchronize_rcu call in the user thread. Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NPhilip Yang <Philip.Yang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 12 4月, 2022 1 次提交
-
-
由 Philip Yang 提交于
Application could change XNACK enabled to disabled while KFD is draining stale retry fault, therefore the check for whether to drain retry faults must be before the check for whether xnack_enabled, to avoid report incorrect vm fault after application changes XNACK mode. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 08 4月, 2022 1 次提交
-
-
由 Felix Kuehling 提交于
Use rcu_read_lock to read p->event_idr concurrently with other readers and writers. Use p->event_mutex only for creating and destroying events and in kfd_wait_on_events. Protect the contents of the kfd_event structure with a per-event spinlock that can be taken inside the rcu_read_lock critical section. This eliminates contention of p->event_mutex in set_event, which tends to be on the critical path for dispatch latency even when busy waiting is used. It also eliminates lock contention in event interrupt handlers. Since the p->event_mutex is now used much less, the impact of requiring it in kfd_wait_on_events should also be much smaller. This should improve event handling latency for processes using multiple GPUs concurrently. v2: Reschedule the worker periodically to avoid soft lockup warnings Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Sean Keely <Sean.Keely@amd.com> # v1 Tested-by: NSanjay Tripathi <sanjay.tripathi@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 06 4月, 2022 1 次提交
-
-
由 Philip Yang 提交于
bo_adev is NULL for system memory mapping to GPU. Fixes: 30671b44 ("drm/amdgpu: fix TLB flushing during eviction") Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 05 4月, 2022 1 次提交
-
-
由 Christian König 提交于
Testing the valid bit is not enough to figure out if we need to invalidate the TLB or not. During eviction it is quite likely that we move a BO from VRAM to GTT and update the page tables immediately to the new GTT address. Rework the whole function to get all the necessary parameters directly as value. Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NPhilip Yang <Philip.Yang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 01 4月, 2022 2 次提交
-
-
由 Lee Jones 提交于
This ensures userspace cannot prematurely clean-up the client before it is fully initialised which has been proven to cause issues in the past. Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: "Christian König" <christian.koenig@amd.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: amd-gfx@lists.freedesktop.org Cc: dri-devel@lists.freedesktop.org Signed-off-by: NLee Jones <lee.jones@linaro.org> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
To support multi-thread update page table. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 26 3月, 2022 10 次提交
-
-
由 Christian König 提交于
Better to leave the decision when to flush the VM changes in the TLB to the VM code. Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Christian König 提交于
Instead of hand rolling the table_freed parameter. v2: add some changes suggested by Philip Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Christian König 提交于
Instead of trying to figure out if a TLB flush is necessary or not use the information provided by the VM subsystem now. Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: Philip Yang<Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
Print the status out when it passes, and also tell user gpu reset is triggered when we fall back to legacy way. v2: make the message more explicit. v3: change succeeds to succeeded. replace pr_warn with dev_warn. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
Do RAS page retirement and use gpu reset as fallback in UTCL2 fault handler. v2: replace vm fault event with posion consumed event in UTCL2 poison consumption. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
Client ID is more accruate here and we can deal with more different cases with client ID. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
Combine reading and setting poison flag as one atomic operation and add print message for the function. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 QintaoShen 提交于
As the kmalloc_array() may return null, the 'event_waiters[i].wait' would lead to null-pointer dereference. Therefore, it is better to check the return value of kmalloc_array() to avoid this confusion. Signed-off-by: NQintaoShen <unSimple1993@163.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Divya Shikre 提交于
Recently introduced commit 158a05a0 ("drm/amdgpu: Add use_xgmi_p2p module parameter") did not update XGMI iolinks when use_xgmi_p2p is disabled. Add fix to not create XGMI iolinks in KFD topology when this parameter is disabled. Fixes: 158a05a0 ("drm/amdgpu: Add use_xgmi_p2p module parameter") Signed-off-by: NDivya Shikre <DivyaUday.Shikre@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tushar Patel 提交于
Compute-only GPUs have more than 8 VMIDs allocated to KFD. Fix this by passing correct number of VMIDs to HWS v2: squash in warning fix (Alex) Signed-off-by: NTushar Patel <tushar.patel@amd.com> Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 16 3月, 2022 4 次提交
-
-
由 Philip Yang 提交于
Migrate vram to ram may fail to find the vma if process is exiting and vma is removed, evict svm bo worker sets prange->svm_bo to NULL and warn svm_bo ref count != 1 only if migrating vram to ram successfully. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 David Yat Sin 提交于
Export dmabuf handles for GTT BOs so that their contents can be accessed using SDMA during checkpoint/restore. v2: Squash in fix from David to set dmabuf handle to invalid for BOs that cannot be accessed using SDMA during checkpoint/restore. Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com> Reviewed-by : Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 David Yat Sin 提交于
Refactor CRIU restore BO to reduce identation. Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 David Yat Sin 提交于
When the process is getting restored, the queues are not mapped yet, so there is no VMID assigned for this process and no TLBs to flush. Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 10 3月, 2022 1 次提交
-
-
由 Yifan Zhang 提交于
it makes no sense to continue with an undefined vmid. Fixes: c8b0507f ("drm/amdkfd: judge get_atc_vmid_pasid_mapping_info before call") Signed-off-by: NYifan Zhang <yifan1.zhang@amd.com> Reported-by: NNathan Chancellor <nathan@kernel.org> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 08 3月, 2022 1 次提交
-
-
由 Philip Yang 提交于
To enable compiler type-checked against the format string in callers. All warnings (new ones prefixed by >>): >> warning: function 'kfd_smi_event_add' might be a candidate for 'gnu_printf' format attribute [-Wsuggest-attribute=format] Fixes: d58b8a99 ("drm/amdkfd: Add SMI add event helper") Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 05 3月, 2022 1 次提交
-
-
由 Yifan Zhang 提交于
Fix the NULL point issue: [ 3076.255609] BUG: kernel NULL pointer dereference, address: 0000000000000000 [ 3076.255624] #PF: supervisor instruction fetch in kernel mode [ 3076.255637] #PF: error_code(0x0010) - not-present page [ 3076.255649] PGD 0 P4D 0 [ 3076.255660] Oops: 0010 [#1] SMP NOPTI [ 3076.255669] CPU: 20 PID: 2415 Comm: kfdtest Tainted: G W OE 5.11.0-41-generic #45~20.04.1-Ubuntu [ 3076.255691] Hardware name: AMD Splinter/Splinter-RPL, BIOS VS2326337N.FD 02/07/2022 [ 3076.255706] RIP: 0010:0x0 [ 3076.255718] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6. [ 3076.255732] RSP: 0018:ffffb64283e3fc10 EFLAGS: 00010297 [ 3076.255744] RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000027 [ 3076.255759] RDX: ffffb64283e3fc1e RSI: 0000000000000008 RDI: ffff8c7a87f60000 [ 3076.255776] RBP: ffffb64283e3fc78 R08: ffff8c7d88518ac0 R09: ffffb64283e3fa60 [ 3076.255791] R10: 0000000000000001 R11: 0000000000000001 R12: 000000000000000f [ 3076.255805] R13: ffff8c7bdcea5800 R14: ffff8c7a9f3f3000 R15: ffff8c7a8696bc00 [ 3076.255820] FS: 0000000000000000(0000) GS:ffff8c7d88500000(0000) knlGS:0000000000000000 [ 3076.255839] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 3076.255851] CR2: ffffffffffffffd6 CR3: 0000000109e3c000 CR4: 0000000000750ee0 [ 3076.255866] PKRU: 55555554 [ 3076.255873] Call Trace: [ 3076.255884] dbgdev_wave_reset_wavefronts+0x72/0x160 [amdgpu] [ 3076.256025] process_termination_cpsch.cold+0x26/0x2f [amdgpu] [ 3076.256182] ? ktime_get_mono_fast_ns+0x4e/0xa0 [ 3076.256196] kfd_process_dequeue_from_all_devices+0x49/0x70 [amdgpu] [ 3076.256328] kfd_process_notifier_release+0x187/0x2b0 [amdgpu] [ 3076.256451] ? mn_itree_inv_end+0xdc/0x110 [ 3076.256463] __mmu_notifier_release+0x74/0x1f0 [ 3076.256474] exit_mmap+0x170/0x200 [ 3076.256484] ? __handle_mm_fault+0x677/0x920 [ 3076.256496] ? _cond_resched+0x19/0x30 [ 3076.256507] mmput+0x5d/0x130 [ 3076.256518] do_exit+0x332/0xaf0 [ 3076.256526] ? handle_mm_fault+0xd7/0x2b0 [ 3076.256537] do_group_exit+0x43/0xa0 [ 3076.256548] __x64_sys_exit_group+0x18/0x20 [ 3076.256559] do_syscall_64+0x38/0x90 [ 3076.256569] entry_SYSCALL_64_after_hwframe+0x44/0xa9 Signed-off-by: NYifan Zhang <yifan1.zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 03 3月, 2022 3 次提交
-
-
由 Philip Yang 提交于
To remove duplicate code, unify event message format and simplify new event add in the following patches. Use KFD_SMI_EVENT_MSG_SIZE to define msg size, the same size will be used in user space to alloc the msg receive buffer. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
sizeof(buf) is 8 bytes because it is defined as unsigned char *buf, each SMI event read only copy max 8 bytes to user buffer. Correct this by using the buf allocate size. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
This reverts commit 3abfe30d. To fix deadlock in kFDSVMEvictTest when xnack off. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 24 2月, 2022 3 次提交
-
-
由 Harish Kasiviswanathan 提交于
Print alloc node, peer node and memory domain when peer map fails. This is more useful v2: use dev_err instead of pr_err use bdf for identify peer gpu Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NAlex Deucher <Alexander.Deucher@amd.com> Signed-off-by: NHarish Kasiviswanathan <Harish.Kasiviswanathan@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Felix Kuehling 提交于
kfd_chardev() doesn't provide much useful information in dev_... messages on multi-GPU systems because there is only one KFD device, which doesn't correspond to any particular GPU. Use the actual GPU device to indicate the GPU that caused a message. Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 David Yat Sin 提交于
Fix for possible integer overflow when doing addition. Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 23 2月, 2022 3 次提交
-
-
由 Alex Deucher 提交于
The driver has a fallback so make the message informational rather than a warning. The driver has a fallback if the Component Resource Association Table (CRAT) is missing, so make this informational now. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1906Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Felix Kuehling 提交于
Clang static analysis reports this problem kfd_chardev.c:2327:2: warning: 1st function call argument is an uninitialized value kvfree(bo_privs); ^~~~~~~~~~~~~~~~ Make sure bo_buckets and bo_privs are initialized so freeing them in the error handling code path will never result in undefined behaviour. Fixes: 73fa13b6 ("drm/amdkfd: CRIU Implement KFD restore ioctl") Reported-by: NTom Rix <trix@redhat.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Kent Russell 提交于
When this was first implemented, overflows weren't expected in regular operations, and tests weren't in place to cause said overflow. Now there are cases where overflows occur with real workloads, but we know that the kernel can handle this robustly, so move the message to a debug message. Signed-off-by: NKent Russell <kent.russell@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 18 2月, 2022 1 次提交
-
-
由 Nathan Chancellor 提交于
Clang warns: drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_packet_manager_v9.c:267:3: error: implicit conversion from enumeration type 'enum mes_map_queues_extended_engine_sel_enum' to different enumeration type 'enum mes_unmap_queues_extended_engine_sel_enum' [-Werror,-Wenum-conversion] extended_engine_sel__mes_map_queues__sdma0_to_7_sel : ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. Use 'extended_engine_sel__mes_unmap_queues__sdma0_to_7_sel' to eliminate the warning, which is the same numeric value of the proper type. Fixes: 009e9a15 ("drm/amdkfd: navi2x requires extended engines to map and unmap sdma queues") Link: https://github.com/ClangBuiltLinux/linux/issues/1596Signed-off-by: NNathan Chancellor <nathan@kernel.org> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 17 2月, 2022 2 次提交
-
-
由 Tao Zhou 提交于
Otherwise gpu reset will be triggered unconditionally in poison consumption. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Changcheng Deng 提交于
There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use "flexible array members" for these cases. The older style of one-element or zero-length arrays should no longer be used. Reference: https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arraysReported-by: NZeal Robot <zealci@zte.com.cn> Signed-off-by: NChangcheng Deng <deng.changcheng@zte.com.cn> Acked-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-