- 18 Jun, 2020 1 commit
-
-
Committed by Lorenz Brun
The existing code used the major version number of the DRM driver instead of the device major number of the DRM subsystem when validating access against a devices cgroup. As a result, accesses allowed by the devices cgroup were not permitted, and certain accesses denied by the devices cgroup were permitted (if they matched the wrong major device number). Signed-off-by: Lorenz Brun <lorenz@brun.one> Fixes: 6b855f7b ("drm/amdkfd: Check against device cgroup") Reviewed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
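The mix-up described above can be modelled in a few lines of user-space Python. This is only a sketch: the allow list is a plain set of (major, minor) pairs, `devcgroup_allows` is a hypothetical helper rather than a kernel function, and the driver version and minor number are made-up values. The one real constant is the DRM character-device major, 226.

```python
DRM_MAJOR = 226  # character-device major number of the Linux DRM subsystem


def devcgroup_allows(allowed, major, minor):
    """Hypothetical model of a devices-cgroup lookup keyed on (major, minor)."""
    return (major, minor) in allowed


# Illustrative values only (assumptions, not taken from the commit).
driver_version_major = 3   # "major" in the sense of a driver version number
render_minor = 128         # minor number of some DRM node

# A cgroup that allows exactly one DRM node.
allowed = {(DRM_MAJOR, render_minor)}

# Before the fix: the lookup was keyed on the driver version, so it missed.
buggy_result = devcgroup_allows(allowed, driver_version_major, render_minor)

# After the fix: the lookup uses the subsystem's device major number.
fixed_result = devcgroup_allows(allowed, DRM_MAJOR, render_minor)
```

The same model shows the reverse failure: a cgroup that allows some unrelated device whose major happens to equal the driver version would have let the DRM access through.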
-
- 10 Jun, 2020 1 commit
-
-
Committed by Michel Lespinasse
This change converts the existing mmap_sem rwsem calls to use the new mmap locking API instead. The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
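To make the effect of the rule concrete, here is a rough regex approximation in Python. It is not how the change was produced (that was coccinelle, which understands C expressions); the `convert` helper and its pattern are illustrative assumptions and only handle simple call sites of the form `name(&expr->mmap_sem)`.

```python
import re

# Old rwsem call -> new mmap locking API call, as listed in the rule above.
RENAMES = {
    "init_rwsem": "mmap_init_lock",
    "down_write": "mmap_write_lock",
    "down_write_killable": "mmap_write_lock_killable",
    "down_write_trylock": "mmap_write_trylock",
    "up_write": "mmap_write_unlock",
    "downgrade_write": "mmap_write_downgrade",
    "down_read": "mmap_read_lock",
    "down_read_killable": "mmap_read_lock_killable",
    "down_read_trylock": "mmap_read_trylock",
    "up_read": "mmap_read_unlock",
}

# Match "<old_name>(&<expr>->mmap_sem)" and capture the name and <expr>.
_NAMES = "|".join(sorted(map(re.escape, RENAMES), key=len, reverse=True))
PATTERN = re.compile(r"\b(" + _NAMES + r")\(\s*&\s*([^()]+?)->mmap_sem\s*\)")


def convert(source):
    """Rewrite simple mmap_sem call sites; anything else is left untouched."""
    return PATTERN.sub(
        lambda m: f"{RENAMES[m.group(1)]}({m.group(2).strip()})", source
    )
```

Calls on other rw-semaphores do not match the pattern, so they pass through unchanged, which mirrors the rule's restriction to `mm->mmap_sem`.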
-
- 22 May, 2020 2 commits
-
-
Committed by Evan Quan
The PCI bus number retrieved by PCI_BUS_NUM(pdev->devfn) is wrong. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Aishwarya Ramakrishnan
Return statements in functions returning bool should use true/false instead of 1/0. drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c:40:9-10: WARNING: return of 0/1 in function 'event_interrupt_isr_v9' with return type bool. Generated by: scripts/coccinelle/misc/boolreturn.cocci Signed-off-by: Aishwarya Ramakrishnan <aishwaryarj100@gmail.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 May, 2020 1 commit
-
-
Committed by Yong Zhao
The queue mask used for set_resources always assumes that the number of queues per pipe is 8, so KFD needs to align with that by using the function amdgpu_queue_mask_bit_to_set_resource_bit(). Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
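The remapping can be illustrated with a small Python model. This is a sketch under stated assumptions, not the kernel helper itself: the driver-side bit is assumed to be laid out as (MEC, pipe, queue) using the hardware's actual queues per pipe, and the set_resources layout is assumed to use 4 pipes per MEC in addition to the 8 queues per pipe mentioned above. The function and parameter names are hypothetical.

```python
SET_RESOURCES_QUEUES_PER_PIPE = 8  # fixed by the set_resources layout (from the commit)
SET_RESOURCES_PIPES_PER_MEC = 4    # assumption made for this sketch


def queue_bit_to_set_resource_bit(queue_bit, queues_per_pipe, pipes_per_mec):
    """Translate a driver queue-mask bit into the set_resources bit position."""
    # Decompose the driver-side bit into (mec, pipe, queue).
    queue = queue_bit % queues_per_pipe
    pipe = (queue_bit // queues_per_pipe) % pipes_per_mec
    mec = queue_bit // (queues_per_pipe * pipes_per_mec)
    # Recompose using the fixed layout that set_resources expects.
    return (
        mec * SET_RESOURCES_PIPES_PER_MEC * SET_RESOURCES_QUEUES_PER_PIPE
        + pipe * SET_RESOURCES_QUEUES_PER_PIPE
        + queue
    )
```

When the hardware really has 8 queues per pipe and 4 pipes per MEC the mapping is the identity; with fewer queues per pipe the bits spread out, which is why copying the driver mask verbatim would select the wrong queues.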
-
- 01 May, 2020 3 commits
-
-
Committed by Felix Kuehling
Corrected two function names. Added a missing space. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Ori Messinger
The PCI domain has moved to 32 bits to accommodate virtualization, so a 32-bit integer is exposed for the domain to reflect this change. The domain can be found in /sys/class/kfd/kfd/topology/nodes/X/properties, where X is the card number. Signed-off-by: Ori Messinger <ori.messinger@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
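A user-space reader of that file might look like the sketch below. It assumes the properties file is a list of `name value` lines and that the new entry is named `domain`; both are assumptions about the file format rather than something stated in the commit message, and `read_pci_domain` is a hypothetical helper.

```python
def read_pci_domain(properties_text):
    """Return the PCI domain from the text of a node's properties file.

    Assumes one "name value" pair per line and a key called "domain".
    Returns None when no such line is present (for example on older kernels).
    """
    for line in properties_text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] == "domain":
            return int(fields[1])
    return None


# Example with made-up file contents; on a real system the text would come
# from /sys/class/kfd/kfd/topology/nodes/X/properties.
sample = "location_id 768\ndomain 65536\n"
domain = read_pci_domain(sample)
```

Python integers are unbounded, so a domain above 0xFFFF parses without any special handling; a C consumer would need a 32-bit type for the same value.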
-
Committed by Mukul Joshi
Track GPU VRAM usage on a per-process basis and report it through sysfs. Signed-off-by: Mukul Joshi <mukul.joshi@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 Apr, 2020 3 commits
-
-
Committed by Joseph Greathouse
The current GWS usage model only allows a single GWS-enabled process to be active on the GPU at once. This ensures that a barrier-using kernel gets a known amount of GPU hardware, preventing a deadlock caused by the inability to get past the GWS barrier. The HWS watches how many GWS entries are assigned to each process and goes into over-subscription mode when two processes need more than the 64 that are available. The current KFD method for working with this is to allocate all 64 GWS entries to each GWS-capable process. When more than one GWS-enabled process is in the runlist, we must make sure the runlist is in over-subscription mode, so that the HWS gets a chained RUN_LIST packet and continues scheduling kernels. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
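The over-subscription condition described above reduces to simple arithmetic, sketched here in Python. The function names are hypothetical and the model ignores every other reason a runlist may be over-subscribed; it only captures the GWS rule from this commit message.

```python
GWS_ENTRIES = 64  # total GWS entries available, per the commit message


def gws_entries_requested(process_uses_gws):
    """Each GWS-enabled process is allocated all 64 entries."""
    return sum(GWS_ENTRIES for uses_gws in process_uses_gws if uses_gws)


def needs_over_subscription(process_uses_gws):
    """True when the runlist's processes together need more than 64 entries."""
    return gws_entries_requested(process_uses_gws) > GWS_ENTRIES
```

Because every GWS-enabled process takes the full 64 entries, the check is equivalent to asking whether more than one such process is in the runlist.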
-
Committed by Joseph Greathouse
Rather than only enabling GWS support based on the hws_gws_support modparm, also check whether the GPU's HWS firmware supports GWS. Leave the old modparm in place in case users want to test GWS on GPUs not yet in the support list. v2: fix broken syntax from the first patch. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Oak Zeng
Add a new kfd ioctl to allocate queue GWS. Queue GWS is released on queue destroy. v2: re-introduce this API with the following fixes squashed in: "drm/amdkfd: fix null pointer dereference on dev"; "drm/amdkfd: Return proper error code for gws alloc API"; "drm/amdkfd: Remove GPU ID in GWS queue creation". Signed-off-by: Oak Zeng <Oak.Zeng@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 28 Apr, 2020 1 commit
-
-
Committed by Joseph Greathouse
In order to surface the ASIC revision to user level, we want to put it into the HSA topology. This is because different ASIC revisions may require user-level software to do different things (e.g. patch code for things that are changed in later hardware revisions). The ASIC revision from the hardware is at most 4 bits wide at this time, so put it into 4 of the open bits in the HSA capability. User-level software can then use this capability information to know, for each ASIC, what revision-based things must be done. Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
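Packing a 4-bit revision into spare bits of a capability word can be sketched as follows. The commit message does not say which 4 bits are used, so the shift of 22 below is an assumption for illustration (check the kernel's KFD sysfs header for the real mask and shift); the helper names are hypothetical.

```python
ASIC_REVISION_SHIFT = 22                         # assumed bit position
ASIC_REVISION_MASK = 0xF << ASIC_REVISION_SHIFT  # 4 bits, per the commit message


def set_asic_revision(capability, revision):
    """Return the capability word with the 4-bit ASIC revision stored in it."""
    if not 0 <= revision <= 0xF:
        raise ValueError("ASIC revision must fit in 4 bits")
    return (capability & ~ASIC_REVISION_MASK) | (revision << ASIC_REVISION_SHIFT)


def get_asic_revision(capability):
    """Extract the 4-bit ASIC revision from a capability word."""
    return (capability & ASIC_REVISION_MASK) >> ASIC_REVISION_SHIFT
```

User-level software would apply only the extraction step to the capability value it reads from the topology, then branch on the revision.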
-
- 23 Apr, 2020 1 commit
-
-
Committed by Yong Zhao
Delete two prints that are not very useful, and change one from pr_info() to pr_debug(). Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 14 Apr, 2020 1 commit
-
-
Committed by Odin Ugedal
The original cgroup v2 eBPF code for filtering device access made it possible to compile with CONFIG_CGROUP_DEVICE=n and still use the eBPF filtering. Commit 4b7d4d45 ("device_cgroup: Export devcgroup_check_permission") reverted this, making CONFIG_CGROUP_DEVICE=y a requirement. Since the device filtering (and all the docs) for cgroup v2 is no longer a "device controller" like it was in v1, someone might compile their kernel with CONFIG_CGROUP_DEVICE=n. Then (for Linux 5.5+) the eBPF filter will not be invoked, and all processes will be allowed access to all devices, no matter what the eBPF filter says. Signed-off-by: Odin Ugedal <odin@ugedal.com> Acked-by: Roman Gushchin <guro@fb.com> Signed-off-by: Tejun Heo <tj@kernel.org>
-
- 02 Apr, 2020 1 commit
-
-
Committed by Jack Zhang
Originally, the code kfreed the wrong pointer for mem_obj, which would cause a memory leak under stress testing. Signed-off-by: Jack Zhang <Jack.Zhang1@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 19 Mar, 2020 1 commit
-
-
Committed by Colin Ian King
There are spelling mistakes in pr_err messages and a comment. Fix these. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 11 Mar, 2020 2 commits
-
-
Committed by Yong Zhao
The ALLOC_MEM_FLAGS_* values are the same as the KFD_IOC_ALLOC_MEM_FLAGS_* values, but the two sets are used interchangeably in the kernel driver, which hurts readability. For example, KFD_IOC_ALLOC_MEM_FLAGS_COHERENT is not referenced in the kernel and only takes effect implicitly through ALLOC_MEM_FLAGS_COHERENT, causing unnecessary confusion. Replace all occurrences of ALLOC_MEM_FLAGS_* with KFD_IOC_ALLOC_MEM_FLAGS_* to solve the problem. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
People are inclined to think of the previous pr_warn message as an error, so use pr_debug instead. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 Mar, 2020 2 commits
-
-
Committed by Felix Kuehling
Otherwise BOs may wait for the fence indefinitely and never be destroyed. v2: Signal the fence right after destroying queues to avoid an unnecessary delayed-delete in kfd_process_wq_release. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: xinhui pan <xinhui.pan@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
Because too many things are involved in this workaround, we need more comments to avoid pitfalls. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Philip Yang <philip.yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 05 Mar, 2020 1 commit
-
-
Committed by Colin Ian King
There is a statement that is indented with spaces instead of a tab. Replace spaces with a tab. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 Feb, 2020 2 commits
-
-
Committed by Eric Huang
The SDMA MQD memory type is NC, which causes MQD data to be accidentally overwritten by an old stale cache line. Changing it to UC, the default for GART, fixes the issue. The mqd_gfx9 parameter is meant for control stacks that are allocated together with user mode queue MQDs. Setting mqd_gfx9 to true maps the control stack pages as NC. Here it was accidentally applied to SDMA MQDs, which are allocated together with the HIQ MQD. Setting mqd_gfx9 to false avoids that. Signed-off-by: Eric Huang <jinhuieric.huang@amd.com> Acked-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
Given that we can query all the ASIC-specific information from amdgpu_gfx_config, we can make get_tile_config() generic. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 27 Feb, 2020 7 commits
-
-
Committed by Yong Zhao
The previous way of using the SDMA queue count to infer whether we should unmap SDMA engines has bugs. It did not cause issues only because the MEC firmware unmaps all queues (CP + SDMA) when an unmap packet for the compute engine is received. Because of that, only one unmap queue packet is needed, instead of one unmap queue packet for CP and one for each SDMA engine, which results in much simpler driver code. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
Those prints are duplicated or useless. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
When queue creation failed, some resources were not freed. Fix it. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
The previous code for calculating active CP queues is problematic if some SDMA queues are inactive. Fix that by counting CP queues directly. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
The queues represented in queue_bitmap are only CP queues. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Yong Zhao
The new name makes the code easier to understand. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Divya Shikre
Devices from Arcturus onwards will have their UUID exposed to Thunk. Add the necessary functions to the kernel to propagate the UUID. Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 13 Feb, 2020 2 commits
-
-
Committed by Rajneesh Bhardwaj
So far the kfd driver has implemented the same routines for runtime and system-wide suspend and resume (s2idle or mem). During system-wide suspend, kfd acquires an atomic lock that prevents user processes from creating more queues and interacting with the kfd driver and the AMD GPU. This mechanism caused a problem when the amdgpu device is runtime suspended with BACO enabled: any application that relies on the kfd driver fails to load, because the driver reports a locked kfd device while the GPU is runtime suspended. In the ideal case, when the GPU is runtime suspended, the kfd driver should be able to (1) auto-resume the amdgpu driver whenever a client requests compute service, and (2) prevent runtime suspend of amdgpu while kfd is in use. This change refactors the amdgpu and amdkfd drivers to support BACO and runtime power management. Reviewed-by: Oak Zeng <oak.zeng@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Rajneesh Bhardwaj
During system suspend the kfd driver acquires a lock that prohibits further kfd actions until the GPU is resumed. This adds some info that can be useful while debugging. Reviewed-by: Oak Zeng <oak.zeng@amd.com> Reviewed-by: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 Feb, 2020 1 commit
-
-
Committed by Amber Lin
Provide compute queue information in sysfs under /sys/class/kfd/kfd/proc. The format is /sys/class/kfd/kfd/proc/<pid>/queues/<queue id>/XX, where XX is one of three files, size, type, and gpuid, representing the queue size, the queue type, and the GPU this queue uses. The <queue id> folder and the files underneath it are generated when a queue is created. They are removed when the queue is destroyed. Signed-off-by: Amber Lin <Amber.Lin@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
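The path layout above is easy to mirror from user space. The sketch below only builds the three paths for a given process and queue; `queue_attr_paths` is a hypothetical helper, and actually reading the files requires a kernel with this change and a live queue.

```python
from pathlib import PurePosixPath

KFD_PROC_ROOT = PurePosixPath("/sys/class/kfd/kfd/proc")
QUEUE_ATTRS = ("size", "type", "gpuid")  # the three per-queue files


def queue_attr_paths(pid, queue_id):
    """Map each per-queue attribute name to its sysfs path."""
    base = KFD_PROC_ROOT / str(pid) / "queues" / str(queue_id)
    return {attr: str(base / attr) for attr in QUEUE_ATTRS}
```

Since the queue directory disappears when the queue is destroyed, a reader should expect the files to vanish between listing and opening them.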
-
- 04 Feb, 2020 1 commit
-
-
Committed by Yong Zhao
The sdma_queue_count increment should be done before execute_queues_cpsch(), which calls pm_calc_rlib_size(), where sdma_queue_count is used to calculate whether over_subscription is triggered. With the previous code, when an SDMA queue is created, compute_queue_count in pm_calc_rlib_size() is one more than the actual number of compute queues, because queue_count has been incremented while sdma_queue_count has not. This patch fixes that. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
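The off-by-one is purely an ordering problem, which a small Python model makes visible. The counter names follow the commit message; the `Counters` class and the two `create_sdma_queue_*` functions are hypothetical stand-ins for the driver code, and the derived count is assumed to be `queue_count - sdma_queue_count`.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    queue_count: int = 0
    sdma_queue_count: int = 0


def compute_queue_count(c):
    """The value derived at the point where the runlist size is calculated."""
    return c.queue_count - c.sdma_queue_count


def create_sdma_queue_old(c):
    """Previous ordering: the runlist is sized before sdma_queue_count is bumped."""
    c.queue_count += 1
    seen = compute_queue_count(c)  # stands in for execute_queues_cpsch()
    c.sdma_queue_count += 1
    return seen


def create_sdma_queue_fixed(c):
    """Fixed ordering: both counters are updated before the runlist is sized."""
    c.queue_count += 1
    c.sdma_queue_count += 1
    return compute_queue_count(c)
```

Starting from two compute queues, the old ordering reports three while the fixed ordering reports two, even though both end with identical counters.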
-
- 17 Jan, 2020 4 commits
-
-
Committed by Yong Zhao
The SW scheduler was previously called the non-HW scheduler, or non-HWS. This message is useful when triaging issues from dmesg. Signed-off-by: Yong Zhao <Yong.Zhao@amd.com> Acked-by: Huang Rui <ray.huang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Huang Rui
To align with gfx v9, we use the map_queues packet to load the HIQ MQD. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Aaron Liu
The CP checks that the HIQ queue is configured and mapped with the KIQ ring; otherwise it cannot read back the secure buffer while gfxoff is enabled, even with trusted IP blocks.
v1 -> v2: Fix to remove surplus set_resources packets. Fill the whole configuration in the MQD. Change the author to Aaron because he addressed the key point of this issue. Add the KIQ ring lock.
v2 -> v3: Free the lock in the error return case. Remove the programming that is only needed when the queue is unmapped.
v3 -> v4: Remove doorbell programming because it is used for restarting the queue. Remove CP scheduler programming because the map_queue packet will handle this.
v4 -> v5: Remove cp_hqd_active because the MEC ucode will enable it when map_queues is used. Revise goto out_unlock. Correct the doorbell offset for the HIQ that the kfd driver assigned in the packet.
v5 -> v6: Merge the Arcturus fix into this patch because it would otherwise oops on the Arcturus platform.
Reported-by: Lisa Saturday <Lisa.Saturday@amd.com> Signed-off-by: Aaron Liu <aaron.liu@amd.com> Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-and-Tested-by: Aaron Liu <aaron.liu@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Sierra
[Why] The TLB flush method using the kfd2kgd interface has been deprecated. The implementation now lives in the amdgpu_amdkfd API. [How] TLB flush functions are now implemented in amdgpu_amdkfd. Signed-off-by: Alex Sierra <alex.sierra@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 10 Jan, 2020 1 commit
-
-
Committed by Felix Kuehling
Use filep->private_data to store a pointer to the kfd_process data structure. Take an extra reference for that, which gets released in the kfd_release callback. Check that the process calling kfd_ioctl is the same one that opened the file descriptor. Return -EBADF if it is not, so that this error can be distinguished in user mode. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: Philip Yang <Philip.Yang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
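The ownership check can be modelled in Python as below. This is a deliberately simplified sketch: `KfdFile` and `kfd_ioctl` are hypothetical stand-ins, a process is represented by an arbitrary object, and reference counting is left out entirely. Only the comparison and the -EBADF result come from the commit message.

```python
import errno


class KfdFile:
    """Stand-in for an open file whose private_data remembers the opener."""

    def __init__(self, opener):
        self.private_data = opener


def kfd_ioctl(filep, caller):
    """Return 0 if the caller opened the file, otherwise -EBADF."""
    if filep.private_data is not caller:
        return -errno.EBADF
    return 0
```

A descriptor inherited by or passed to another process therefore fails with a distinct error code instead of acting on the opener's state.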
-
- 08 Jan, 2020 1 commit
-
-
Committed by Felix Kuehling
Don't use the HWS if it's known to be hanging. In a reset, also don't try to destroy the HIQ, because that may hang on SRIOV if the KIQ is unresponsive. Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com> Tested-by: Emily Deng <Emily.Deng@amd.com> Reviewed-by: shaoyunl <shaoyun.liu@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-