- 09 October 2021, 1 commit
-
-
Committed by Alex Sierra
This fixes a deadlock on BO reservations during SVM_BO evictions while allocations in VRAM are performed concurrently. More specifically, while TTM waits for a fence to be signaled (ttm_bo_wait), it already holds the BO reservation. In parallel, the restore worker may be running, prefetching memory to VRAM; this also requires reserving the BO, but the worker takes the mmap semaphore first. The deadlock happens when the SVM_BO eviction worker kicks in and waits for the mmap semaphore held by the restore worker, preventing the fence from being signaled, so everything stalls until TTM times out. We no longer need to hold the BO reservation during validation and mapping, because the physical addresses are now taken from hmm_range_fault. We also take migrate_mutex to prevent range migration while validate_and_map updates the GPU page table.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 06 October 2021, 1 commit
-
-
Committed by Yifan Zhang
kfd_resume doesn't involve any IOMMU operation, so remove the redundant IOMMU cleanup code.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 05 October 2021, 3 commits
-
-
Committed by Alex Deucher
rather than asic type.

v2: fix up CZ case

Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
We can get the pdev and asic type from the adev. No need to pass them explicitly.

v2: squash in build fix for !CONFIG_HSA_AMD from Anson

Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Tao Zhou
In ras poison mode, page retirement will be handled by the irq handler of the module which consumes corrupted data.

v2: rename ras_process_cb to ras_poison_consumption_handler. Move the handler's implementation from the ASIC specific file to the common file.
v3: call gpu reset for xGMI connected mode.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 30 September 2021, 1 commit
-
-
Committed by Yang Li
Use the resource_size() helper on the resource object instead of computing the size explicitly. This cleans up a coccicheck warning:

./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:905:10-13: ERROR: Missing resource_size with res

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: Amos Kong <kongjianjun@gmail.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
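
For reference, resource_size() from <linux/ioport.h> returns the inclusive span of a struct resource (res->end - res->start + 1). A minimal sketch of the before/after pattern the coccicheck warning points at; the function and variable names are illustrative, not the actual kfd_migrate.c code:

#include <linux/ioport.h>

/* Illustrative only -- not the actual kfd_migrate.c code. */
static resource_size_t vram_span(const struct resource *res)
{
	/* Open-coded form that coccicheck flags:
	 *     return res->end - res->start + 1;
	 * Preferred form -- the helper encapsulates the inclusive range:
	 */
	return resource_size(res);
}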
-
- 24 September 2021, 4 commits
-
-
Committed by Philip Yang
The device manager releases device-specific resources when a driver disconnects from a device, so the devm_memunmap_pages and devm_release_mem_region calls in svm_migrate_fini are redundant. They cause the warning trace below after the patch "drm/amdgpu: Split amdgpu_device_fini into early and late", so remove svm_migrate_fini.

BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718

WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795 devm_release_action+0x51/0x60
Call Trace:
 ? memunmap_pages+0x360/0x360
 svm_migrate_fini+0x2d/0x60 [amdgpu]
 kgd2kfd_device_exit+0x23/0xa0 [amdgpu]
 amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu]
 amdgpu_device_fini_sw+0x45/0x290 [amdgpu]
 amdgpu_driver_release_kms+0x12/0x30 [amdgpu]
 drm_dev_release+0x20/0x40 [drm]
 release_nodes+0x196/0x1e0
 device_release_driver_internal+0x104/0x1d0
 driver_detach+0x47/0x90
 bus_remove_driver+0x7a/0xd0
 pci_unregister_driver+0x3d/0x90
 amdgpu_exit+0x11/0x20 [amdgpu]

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Philip Yang
If SVM migration init fails to create the pgmap for device memory, set the pgmap type to 0 to disable the device SVM support capability.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Philip Yang
For xnack off, the restore work dma-unmaps the previous system memory page and dma-maps the updated system memory page to update the GPU mapping. This is not a DMA mapping leak, so remove the WARN_ONCE for DMA mapping leaks. prange->dma_addr stores the VRAM page pfn after the range is migrated to VRAM; we should not dma-unmap the VRAM page when updating the GPU mapping or removing the prange. Add a helper svm_is_valid_dma_mapping_addr to check for VRAM pages and error cases.

Mask out the SVM_RANGE_VRAM_DOMAIN flag in dma_addr before calling the amdgpu vm update to avoid BUG_ON(*addr & 0xFFFF00000000003FULL), and set it again immediately afterwards. This flag is used later to tell the page type apart when dma-unmapping system memory pages.

Fixes: 1d5dbfe6 ("drm/amdkfd: classify and map mixed svm range pages in GPU")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
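
A minimal sketch of the mask-and-restore pattern described above, assuming SVM_RANGE_VRAM_DOMAIN is a low tag bit in each dma_addr entry; the loop structure, names, and flag value are illustrative, not the actual svm_range_map_to_gpu code:

#include <stdbool.h>
#include <stdint.h>

/* Assumed tag bit; the value here is illustrative. */
#define SVM_RANGE_VRAM_DOMAIN (1ULL << 0)

/* Illustrative sketch only. */
static void map_range_sketch(uint64_t *dma_addr, unsigned long npages)
{
	for (unsigned long i = 0; i < npages; i++) {
		bool is_vram = dma_addr[i] & SVM_RANGE_VRAM_DOMAIN;

		/* Strip the domain tag so the VM update helper only sees a
		 * clean address and its BUG_ON() sanity check passes.
		 */
		dma_addr[i] &= ~SVM_RANGE_VRAM_DOMAIN;

		/* ... call the amdgpu vm update helper for this page ... */

		/* Restore the tag right away so later dma-unmap code can
		 * still tell system memory pages from VRAM pages.
		 */
		if (is_vram)
			dma_addr[i] |= SVM_RANGE_VRAM_DOMAIN;
	}
}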
-
Committed by Philip Yang
An SVM range may include multiple VMAs with different vm_flags. If the prange page index is the last page of the VMA (offset + npages), update the GPU mapping to create the GPU page table with the same access permission as that VMA.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 16 September 2021, 2 commits
-
-
Committed by James Zhu
Separate kfd_iommu_resume from kfd_resume for fine-tuning of the amdgpu device init/resume/reset/recovery sequence.

v2: squash in fix for !CONFIG_HSA_AMD

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211277
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
-
Committed by Felix Kuehling
On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The minimum firmware version that works without atomics can be updated in the device_info structure for each GPU type.

Move PCIe atomic detection from kgd2kfd_probe into kgd2kfd_device_init because the MEC firmware is not loaded yet at the probe stage.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
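
A hedged sketch of the kind of per-GPU firmware-version check described here; the structure and field names (needs_pci_atomics, no_atomic_fw_version) are assumptions for illustration, not the actual kfd device code:

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch -- names are assumptions. */
struct device_info_sketch {
	bool needs_pci_atomics;          /* GPU family nominally wants atomics */
	uint32_t no_atomic_fw_version;   /* first MEC fw that copes without them */
};

static bool pcie_atomics_required(const struct device_info_sketch *info,
				  uint32_t mec_fw_version)
{
	/* Atomics are only mandatory if the loaded MEC firmware is older
	 * than the first version known to work without them (0 means no
	 * such version exists for this GPU type).
	 */
	return info->needs_pci_atomics &&
	       (info->no_atomic_fw_version == 0 ||
		mec_fw_version < info->no_atomic_fw_version);
}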
-
- 15 September 2021, 2 commits
-
-
Committed by James Zhu
Separate kfd_iommu_resume from kfd_resume for fine-tuning of the amdgpu device init/resume/reset/recovery sequence.

v2: squash in fix for !CONFIG_HSA_AMD

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211277
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Felix Kuehling
On some GPUs the PCIe atomic requirement for KFD depends on the MEC firmware version. Add a firmware version check for this. The minimum firmware version that works without atomics can be updated in the device_info structure for each GPU type.

Move PCIe atomic detection from kgd2kfd_probe into kgd2kfd_device_init because the MEC firmware is not loaded yet at the probe stage.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 September 2021, 1 commit
-
-
Committed by Alex Sierra
In the SVM restore-pages interrupt handler, the kfd_process reference count was never dropped when xnack was disabled, so the object was never released.

Fixes: 2383f56b ("drm/amdkfd: page table restore through svm API")
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Reviewed-by: Jonathan Kim <jonathan.kim@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
-
- 27 August 2021, 1 commit
-
-
Committed by Sean Keely
On systems with multiple SH per SE, compute_static_thread_mgmt_se# is split into independent masks, one for each SH, in the upper and lower 16 bits. We need to detect this and apply cu masking to each SH. The cu mask bits are assigned first to each SE, then to alternate SHs, then finally to higher CU ids. This ensures that the maximum number of SPIs are engaged as early as possible while balancing CU assignment to each SH.

v2: Use max SH/SE rather than max SH in cu_per_sh.
v3: Fix comment blocks, ensure se_mask is initially zero filled, and correctly assign se.sh.cu positions to unset bits in cu_mask.

Signed-off-by: Sean Keely <Sean.Keely@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
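
A rough sketch of that round-robin distribution (SE advances fastest, then SH, then CU id), using illustrative names and sizes rather than the actual mqd manager code:

#include <stdint.h>

#define NUM_SE     4   /* illustrative */
#define NUM_SH     2   /* illustrative: two SHs per SE, low/high 16-bit masks */

/* Place each requested CU bit into se_mask[] so that consecutive bits land
 * on different SEs first, then alternate SHs, then climb to higher CU ids.
 */
static void distribute_cu_mask(uint32_t cu_mask, uint32_t se_mask[NUM_SE])
{
	unsigned int se = 0, sh = 0, cu = 0;

	for (unsigned int i = 0; i < NUM_SE; i++)
		se_mask[i] = 0;   /* must start zero-filled */

	while (cu_mask) {
		if (cu_mask & 1)
			se_mask[se] |= 1u << (sh * 16 + cu);
		cu_mask >>= 1;

		/* Advance SE fastest, then SH, then CU id. */
		if (++se == NUM_SE) {
			se = 0;
			if (++sh == NUM_SH) {
				sh = 0;
				cu++;
			}
		}
	}
}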
-
- 25 August 2021, 2 commits
-
-
Committed by Philip Yang
When restoring a retry fault or prefetch range, or restoring an svm range after eviction, map the range to the GPU with the correct read or write access permission. A range may include multiple VMAs, so update the GPU page table using the prange offset and number of pages of each VMA, according to that VMA's access permission.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Philip Yang
Check the range access permission when restoring a GPU retry fault: if the GPU retry fault is on an address that belongs to a VMA, and the VMA does not grant the read or write permission requested by the GPU, fail to restore the address. The VM fault event is then passed back to user space.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
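
A minimal sketch of this kind of permission check against a VMA's vm_flags; the function shape is an assumption for illustration, not the actual restore-pages logic:

#include <linux/mm.h>

/* Illustrative sketch -- not the actual svm restore-pages code. */
static bool fault_allowed_by_vma(const struct vm_area_struct *vma,
				 bool write_fault)
{
	/* A write fault needs VM_WRITE on the VMA; a read fault needs VM_READ.
	 * If the permission is missing, the caller fails the restore and the
	 * VM fault event is reported back to user space.
	 */
	if (write_fault)
		return !!(vma->vm_flags & VM_WRITE);
	return !!(vma->vm_flags & VM_READ);
}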
-
- 17 August 2021, 2 commits
-
-
Committed by Yifan Zhang
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress tests:

Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
[       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
[          ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
  Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[          ] Setting/Getting atrributes
[  FAILED  ]

The root cause is that the svm work queue has not finished when svm_range_get_attr is called, so garbage svm interval tree data makes svm_range_get_attr return a wrong result. Flush the work queue before iterating the svm interval tree.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
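
A hedged sketch of the described fix: flush the deferred SVM work before walking the interval tree. The structure and field name (deferred_list_work) are assumptions for illustration:

#include <linux/workqueue.h>

/* Illustrative sketch -- the structure and field names are assumptions. */
struct svms_sketch {
	struct work_struct deferred_list_work;  /* deferred range-list updates */
	/* ... interval tree, locks, etc. ... */
};

static void get_attr_sketch(struct svms_sketch *svms)
{
	/* Make sure any pending deferred range updates have completed, so
	 * the interval tree is consistent before it is iterated.
	 */
	flush_work(&svms->deferred_list_work);

	/* ... walk the svm interval tree and collect the attributes ... */
}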
-
Committed by Yifan Zhang
KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress tests:

Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
[       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
[          ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
  Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[          ] Setting/Getting atrributes
[  FAILED  ]

The root cause is that the svm work queue has not finished when svm_range_get_attr is called, so garbage svm interval tree data makes svm_range_get_attr return a wrong result. Flush the work queue before iterating the svm interval tree.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 12 August 2021, 2 commits
-
-
Committed by Mukul Joshi
This patch adds support for programming trap handler settings when loading the driver with the software scheduler (sched_policy=2).

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Suggested-by: Jay Cornwall <Jay.Cornwall@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Philip Yang
For xnack on, if the range is ACCESS or ACCESS_IN_PLACE (AIP) by a single GPU, or the range is ACCESS_IN_PLACE by multiple GPUs that are all connected on the same XGMI hive, the best prefetch location is the prefetch_loc GPU. Otherwise the best prefetch location is always the CPU, because a GPU cannot coherently map the VRAM of other GPUs, even with a large-BAR PCIe connection.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
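
A hedged sketch of that decision as straight-line logic; the parameters, helper name, and the value used for the CPU location are illustrative assumptions, not the actual best-prefetch-location code:

#include <stdbool.h>
#include <stdint.h>

#define PREFETCH_LOC_CPU 0   /* illustrative: 0 stands for system memory */

/* Illustrative sketch of the best-prefetch-location rule. */
static uint32_t best_prefetch_location(uint32_t prefetch_loc_gpu,
				       unsigned int num_access_gpus,
				       bool all_access_in_place,
				       bool all_on_same_xgmi_hive)
{
	/* A single GPU with ACCESS or AIP: prefetch into that GPU's VRAM. */
	if (num_access_gpus == 1)
		return prefetch_loc_gpu;

	/* Multiple GPUs, all AIP and all on the same XGMI hive: still VRAM. */
	if (all_access_in_place && all_on_same_xgmi_hive)
		return prefetch_loc_gpu;

	/* Otherwise one GPU's VRAM is not coherently mappable by the others,
	 * so system memory (CPU) is the safe prefetch location.
	 */
	return PREFETCH_LOC_CPU;
}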
-
- 07 August 2021, 1 commit
-
-
Committed by Felix Kuehling
Currently the SVM get_attr call allows querying which flags are set in the entire address range. Add the opposite query: which flags are clear in the entire address range. Both queries can be combined in a single get_attr call, which allows answering questions such as "is this address range coherent, non-coherent, or a mix of both?"

Proposed userspace for the UAPI: https://github.com/RadeonOpenCompute/ROCR-Runtime/tree/memory_model_queries

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
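
To illustrate how the two query results combine, a small hedged sketch of the interpretation on the caller's side; the flag and enum names are placeholders, not actual KFD UAPI identifiers:

#include <stdint.h>

/* Placeholder names -- not actual KFD UAPI identifiers. */
#define FLAG_COHERENT (1u << 0)

enum range_coherency { RANGE_COHERENT, RANGE_NONCOHERENT, RANGE_MIXED };

/* set_flags:   flags that are set across the entire range
 * clear_flags: flags that are clear across the entire range
 */
static enum range_coherency classify(uint32_t set_flags, uint32_t clear_flags)
{
	if (set_flags & FLAG_COHERENT)
		return RANGE_COHERENT;     /* coherent everywhere */
	if (clear_flags & FLAG_COHERENT)
		return RANGE_NONCOHERENT;  /* coherent nowhere */
	return RANGE_MIXED;                /* set in some parts, clear in others */
}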
-
- 06 August 2021, 1 commit
-
-
Committed by Graham Sider
Add u32 gfx_target_version field to kfd_node_properties and kfd_device_info. Populate <asic>_device_info structs accordingly and expose to sysfs. This allows eliminating device-ID-based lookup tables in user mode for future ASICs.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 03 August 2021, 4 commits
-
-
Committed by Eric Huang
This works around a HW bug on other ASICs, and is based on reverting two earlier commits:

drm/amdkfd: Add heavy-weight TLB flush after unmapping
drm/amdkfd: Add memory sync before TLB flush on unmap

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 4bba567c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 7ed9876c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 430f8e6e.

Revert reason: Issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 July 2021, 3 commits
-
-
Committed by Eric Huang
This reverts commit 4bba567c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 7ed9876c.

Revert reason: The issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 430f8e6e.

Revert reason: Issue has been resolved.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 July 2021, 7 commits
-
-
Committed by Tao Zhou
Add KFD support for cyan_skillfish.

v2: whitespace fixes (Alex)

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Graham Sider
Update Arcturus/Aldebaran thermal throttle SMI event path to use ASIC-independent throttler bits when logging.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Oak Zeng
start_cpsch and stop_cpsch can be called during kfd device initialization or during gpu reset/recovery, so they can run concurrently. Currently in start_cpsch and stop_cpsch, pm_init and pm_uninit are not protected by the dpm lock. Imagine a case where a user uses a packet manager function to submit a pm4 packet to hang the hws (i.e. via cat /sys/class/kfd/kfd/topology/nodes/1/gpu_id | sudo tee /sys/kernel/debug/kfd/hang_hws) while the kfd device is under reset/recovery, so the packet manager may not be initialized; that results in an unpredictable protection fault. This patch moves pm_init/pm_uninit inside the dpm lock and checks that the packet manager is initialized before using packet manager functions.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
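
A hedged sketch of the locking pattern described here; the structure, lock, and flag names are illustrative assumptions, not the actual device queue manager or packet manager code:

#include <linux/errno.h>
#include <linux/mutex.h>

/* Illustrative sketch -- names are assumptions. */
struct pm_state_sketch {
	struct mutex lock;      /* stands in for the scheduler lock */
	bool pm_initialized;    /* set/cleared under the lock */
};

static void start_cpsch_sketch(struct pm_state_sketch *s)
{
	mutex_lock(&s->lock);
	/* The pm_init() equivalent runs under the lock so a concurrent
	 * stop/start cannot observe a half-initialized packet manager.
	 */
	s->pm_initialized = true;
	mutex_unlock(&s->lock);
}

static int submit_hang_hws_packet_sketch(struct pm_state_sketch *s)
{
	int ret = 0;

	mutex_lock(&s->lock);
	/* Refuse to touch the packet manager while it is torn down,
	 * e.g. during gpu reset/recovery.
	 */
	if (!s->pm_initialized)
		ret = -ENODEV;
	/* ... else build and submit the pm4 packet ... */
	mutex_unlock(&s->lock);
	return ret;
}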
-
Committed by Oak Zeng
This variable will be used to determine whether the packet manager is initialized or not, in a future patch.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Oak Zeng
Rename packets to packet_mgr to reflect the real meaning of this variable.

Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Acked-by: Christian Konig <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Jonathan Kim
Similar to xGMI reporting the min/max bandwidth between direct peers, PCIe will report the min/max bandwidth to the KFD.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Jonathan Kim
Report the min/max bandwidth in megabytes to the kfd for direct xgmi connections only. Indirect peers will report 0 since the indirect route is unknown.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 13 July 2021, 2 commits
-
-
Committed by Eric Huang
This reverts commit 1098d658.

Reason for revert: it causes regressions on several ASICs.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Eric Huang
This reverts commit 31f33243.

Reason for revert: it causes regressions on several ASICs.

Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-