Commits · d68cf992ded575928cf4ddf7c64faff0d8dcce14 · openeuler / Kernel

13 Apr, 2022 · 1 commit

drm/amdkfd: Cleanup IO links during KFD device removal · 46d18d51

Authored by Mukul Joshi on Apr 06, 2022

Currently, the IO links to a device being removed from the topology
are not cleared. As a result, dangling links are left in the KFD
topology. This patch aims to fix the following:
1. Clean up all IO links to the device being removed.
2. Ensure that node numbering in sysfs and the nodes' proximity domain
   values are consistent after the device is removed:
   a. Adding a device and removing a GPU device are made mutually
      exclusive.
   b. The global proximity domain counter is no longer required to be
      an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user mode know that the topology has
   changed due to device removal.

CC: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

46d18d51
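
The cleanup steps listed in this commit message can be illustrated with a small userspace model. This is only a sketch: the struct and function names (topo_node, topology_remove_node, and so on) are hypothetical, fixed-size arrays stand in for the kernel's linked lists, and the locking that makes add and remove mutually exclusive is not modeled.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LINKS 8

/* Hypothetical model of one IO link between two topology nodes. */
struct io_link {
    uint32_t node_from;
    uint32_t node_to;
};

/* Hypothetical topology node; nodes[i] has proximity domain i. */
struct topo_node {
    struct io_link links[MAX_LINKS];
    size_t num_links;
};

struct topology {
    struct topo_node *nodes;
    size_t num_nodes;
    uint32_t generation_count;
};

int topo_add_link(struct topo_node *n, uint32_t from, uint32_t to)
{
    if (n->num_links >= MAX_LINKS)
        return -1;
    n->links[n->num_links].node_from = from;
    n->links[n->num_links].node_to = to;
    n->num_links++;
    return 0;
}

/* Remove node `removed`, drop links that point at it, and renumber the rest. */
void topology_remove_node(struct topology *t, uint32_t removed)
{
    size_t i, j, kept;

    if (removed >= t->num_nodes)
        return;

    /* Close the gap so node numbering stays contiguous. */
    for (i = removed; i + 1 < t->num_nodes; i++)
        t->nodes[i] = t->nodes[i + 1];
    t->num_nodes--;

    for (i = 0; i < t->num_nodes; i++) {
        struct topo_node *n = &t->nodes[i];

        kept = 0;
        for (j = 0; j < n->num_links; j++) {
            struct io_link l = n->links[j];

            if (l.node_to == removed)
                continue;               /* would dangle: drop it */
            if (l.node_from > removed)
                l.node_from--;          /* shift to the new numbering */
            if (l.node_to > removed)
                l.node_to--;
            n->links[kept++] = l;
        }
        n->num_links = kept;
    }

    /* Signal to observers that the topology changed. */
    t->generation_count++;
}
```

Because removal and renumbering happen in one pass, a plain counter is enough here as long as the caller serializes add and remove, which is the point made in item 2 of the message.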

01 Apr, 2022 · 1 commit

drm/amdkfd: Use atomic64_t type for pdd->tlb_seq · 8fde0248

Authored by Philip Yang on Mar 25, 2022

To support multi-threaded page table updates.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8fde0248
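
Why an atomic type helps here can be shown with a userspace analogue. The sketch below uses C11 atomics in place of the kernel's atomic64_t, and the names pdd_model and flush_tlb_if_needed are hypothetical. It assumes the caller passes in the VM's current TLB sequence number and that a flush is needed only when that number differs from the one last recorded for the device.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-process-device data. */
struct pdd_model {
    _Atomic uint64_t tlb_seq;   /* sequence number of the last flush */
};

/*
 * Record the new sequence number and report whether a flush is needed.
 * The exchange is a single atomic step, so two threads updating page
 * tables concurrently cannot both observe the same stale value and
 * then overwrite each other's update.
 */
bool flush_tlb_if_needed(struct pdd_model *pdd, uint64_t vm_tlb_seq)
{
    if (atomic_exchange(&pdd->tlb_seq, vm_tlb_seq) == vm_tlb_seq)
        return false;   /* already flushed for this sequence number */

    /* A real driver would issue the TLB flush here. */
    return true;
}
```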

26 Mar, 2022 · 1 commit

drm/amdkfd: start using tlb_seq from the VM subsystem · bffa91da

Authored by Christian König on Mar 17, 2022

Instead of trying to figure out whether a TLB flush is necessary or not,
use the information provided by the VM subsystem now.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bffa91da

24 Feb, 2022 · 1 commit

drm/amdkfd: Use real device for messages · a0c5fd46

Authored by Felix Kuehling on Feb 18, 2022

kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a0c5fd46

17 Feb, 2022 · 1 commit

drm/amdkfd: Replace zero-length array with flexible-array member · d5c83156

Authored by Changcheng Deng on Feb 15, 2022

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use "flexible array members" for these cases. The older
style of one-element or zero-length arrays should no longer be used.
Reference:
https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d5c83156
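
The difference between the two declaration styles is small in source form. The following is a minimal userspace sketch with a hypothetical struct sample_buf; it does not reproduce the struct actually touched by this commit. Kernel code would typically size the allocation with the struct_size() helper instead of open-coded arithmetic.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Deprecated style, shown only as a comment:
 *
 *     struct sample_buf { uint32_t count; uint32_t data[0]; };
 *
 * Preferred style: a C99 flexible array member as the last field.
 */
struct sample_buf {
    uint32_t count;
    uint32_t data[];    /* trailing elements, sized at allocation time */
};

/* Allocate a buffer with room for `count` zeroed trailing elements. */
struct sample_buf *sample_buf_alloc(uint32_t count)
{
    struct sample_buf *buf;
    uint32_t i;

    buf = malloc(sizeof(*buf) + (size_t)count * sizeof(buf->data[0]));
    if (!buf)
        return NULL;

    buf->count = count;
    for (i = 0; i < count; i++)
        buf->data[i] = 0;
    return buf;
}
```

With the flexible array member, sizeof(struct sample_buf) covers only the fixed header, and the compiler can diagnose misuse such as placing the array in the middle of a struct.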

15 Feb, 2022 · 3 commits

drm/amdkfd: remove unneeded unmap single queue option · d2cb0b21

Authored by Jonathan Kim on Feb 10, 2022

The KFD only unmaps all queues, all dynamic queues or all process queues
since RUN_LIST is mapped with all KFD queues.

There's no need to provide a single-queue unmap type, so remove this option.

Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d2cb0b21

drm/amdkfd: Fix leftover errors and warnings · 2243f493

Authored by Rajneesh Bhardwaj on Feb 10, 2022

A number of errors and warnings have accumulated in KFD over the years.
Attempt to fix the errors and most warnings reported by the checkpatch
tool. A few warnings remain that may be false positives, so ignore them
for now.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2243f493

drm/amdkfd: update SPDX license header · d87f36a0

Authored by Rajneesh Bhardwaj on Feb 10, 2022

Update the SPDX License header for all the KFD files.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d87f36a0

10 Feb, 2022 · 2 commits

drm/amdkfd: Remove unused old debugger implementation · 5bdd3eb2

Authored by Mukul Joshi on Feb 04, 2022

Clean up the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API was discontinued years ago.
Only a small piece that resets wavefronts is kept and
is moved to kfd_device_queue_manager.c.

Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5bdd3eb2

drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid · 03e5b167

Authored by Tao Zhou on Feb 07, 2022

As the function is now used in more cases, use a more general
name.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

03e5b167

08 Feb, 2022 · 13 commits

drm/amdkfd: CRIU prepare for svm resume · c2db32ce

Authored by Rajneesh Bhardwaj on Nov 08, 2021

During the CRIU restore phase, the VMAs for the virtual address ranges
are not at their final location yet, so at this stage only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c2db32ce

drm/amdkfd: CRIU Discover svm ranges · 08a987a8

Authored by Rajneesh Bhardwaj on Nov 02, 2021

A KFD process may contain a number of virtual address ranges for shared
virtual memory management, and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the
CRIU IOCTL to discover the svm ranges in the target process. Future
patches bring in the required support for checkpoint and restore of
SVM ranges.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

08a987a8

drm/amdkfd: CRIU checkpoint and restore xnack mode · 4717fe3d

Authored by Rajneesh Bhardwaj on Nov 19, 2021

Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values, which are typically used for querying
the current xnack mode without modifying it.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4717fe3d

drm/amdkfd: CRIU implement gpu_id remapping · bef153b7

Authored by David Yat Sin on Apr 09, 2021

When doing a restore on a different node, the gpu_ids on the restore
node may be different, but the user space application will still use
the original gpu_ids in its ioctl calls. Add code to create a gpu_id
mapping so that kfd can determine the actual gpu_id during the user
ioctls.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bef153b7
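
The remapping idea can be sketched as a small lookup table from the gpu_id the application still uses to the gpu_id that exists on the restore node. Everything below is hypothetical (struct gpu_id_map, the function names, the fixed table size) and only models the concept, not the driver's actual data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_GPUS 8

/* Hypothetical per-process table built at restore time. */
struct gpu_id_map {
    uint32_t user_gpu_id[MAX_GPUS];     /* id recorded at checkpoint */
    uint32_t actual_gpu_id[MAX_GPUS];   /* id on the restore node */
    size_t count;
};

/* Register one mapping; returns 0 on success, -1 if the table is full. */
int gpu_id_map_add(struct gpu_id_map *map, uint32_t user_id,
                   uint32_t actual_id)
{
    if (map->count >= MAX_GPUS)
        return -1;
    map->user_gpu_id[map->count] = user_id;
    map->actual_gpu_id[map->count] = actual_id;
    map->count++;
    return 0;
}

/* Translate an id passed in by user space; returns -1 if it is unknown. */
int gpu_id_map_lookup(const struct gpu_id_map *map, uint32_t user_id,
                      uint32_t *actual_id)
{
    size_t i;

    for (i = 0; i < map->count; i++) {
        if (map->user_gpu_id[i] == user_id) {
            *actual_id = map->actual_gpu_id[i];
            return 0;
        }
    }
    return -1;
}
```

An ioctl handler in this model would call gpu_id_map_lookup() first and reject the request if the id is unknown. A process that was never checkpointed would simply have identical user and actual ids.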

drm/amdkfd: CRIU checkpoint and restore events · 40e8a766

Authored by David Yat Sin on Mar 05, 2021

Add support to existing CRIU ioctls to save and restore events during
CRIU checkpoint and restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

40e8a766

drm/amdkfd: CRIU checkpoint and restore queue control stack · 3a9822d7

Authored by David Yat Sin on Jan 25, 2021

Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3a9822d7

drm/amdkfd: CRIU checkpoint and restore queue mqds · 42c6c482

Authored by David Yat Sin on Jan 25, 2021

Checkpoint contents of queue MQDs on CRIU dump and restore them during
CRIU restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

42c6c482

drm/amdkfd: CRIU restore queue ids · 8668dfc3

Authored by David Yat Sin on Jan 25, 2021

When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8668dfc3

drm/amdkfd: CRIU add queues support · 626f7b31

Authored by David Yat Sin on Jan 25, 2021

Add support to existing CRIU ioctls to save the number of queues and the
queue properties for each queue during checkpoint and to re-create the
queues on restore.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

626f7b31

drm/amdkfd: CRIU Implement KFD unpause operation · cd9f7910

Authored by David Yat Sin on Aug 16, 2021

Introduce the UNPAUSE op. After the CRIU amdgpu plugin performs a
PROCESS_INFO op, the queues will stay in an evicted state. Once the
plugin is done draining BO contents, it is safe to perform an UNPAUSE op
for the queues to resume.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cd9f7910

drm/amdkfd: CRIU Implement KFD resume ioctl · 011bbb03

Authored by Rajneesh Bhardwaj on Jan 11, 2021

This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing a CRIU restore, MMU notifications can happen anytime after we
call amdgpu_mn_register. Prevent MMU notifications until we reach stage 4
of the restore process, i.e. the criu_resume ioctl op is received and the
process is ready to be resumed. This ioctl is different from other KFD
CRIU ioctls since it is called by the CRIU master restore process for all
the target processes being resumed by CRIU.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

011bbb03

drm/amdkfd: CRIU Implement KFD checkpoint ioctl · 5ccbb057

Authored by Rajneesh Bhardwaj on Nov 30, 2020

This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to the user space plugin running under the CRIU
master context, which then stores this info to recreate these buffer
objects during a restore operation.

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5ccbb057

drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs · 36988070

Authored by Rajneesh Bhardwaj on Aug 24, 2021

Checkpoint-Restore in userspace (CRIU) is a powerful tool that can
snapshot a running process and later restore it on the same or a remote
machine, but it expects processes that have a device file (e.g. GPU)
associated with them to provide the necessary driver support to assist
CRIU and its extensible plugin interface. Thus, in order to support
Checkpoint-Restore of any ROCm process, the AMD Radeon Open Compute
kernel driver needs to provide a set of new APIs that provide the
necessary VRAM metadata and its contents to a userspace component
(CRIU plugin) that can store it in the form of image files.

This introduces some new ioctls which will be used to checkpoint-restore
any KFD-bound user process. KFD only allows ioctl calls from the same
process that opened the KFD file descriptor. Since these ioctls are
expected to be called from a KFD CRIU plugin, which has elevated
ptrace-attached privileges and CAP_CHECKPOINT_RESTORE capabilities
attached with the file descriptors, modify KFD to allow such calls.

(API redesigned by David Yat Sin)

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

36988070

29 Dec, 2021 · 1 commit

drm/amdkfd: reset queue which consumes RAS poison (v2) · b6485bed

Authored by Tao Zhou on Dec 06, 2021

CP supports unmapping a queue with a reset mode that destroys only the
specific queue without affecting others.
Replacing a whole GPU reset with the reset queue mode for RAS poison
consumption saves much time, and we can also fall back to the GPU reset
solution if the queue reset fails.

v2: Return directly if process is NULL;
    Reset queue solution is not applicable to SDMA, fall back to the
    legacy way;
    Call kfd_unref_process after lookup process.

Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b6485bed
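
The recovery order described in this message (skip when no process is found, use a full GPU reset for SDMA, otherwise try a queue reset and fall back to a GPU reset on failure) can be written down as a small decision function. The enum, struct, and function names are hypothetical, and the outcome of the queue reset is passed in as a flag rather than performed.

```c
#include <stdbool.h>

/* Hypothetical outcome of handling one poison consumption event. */
enum poison_recovery {
    POISON_SKIPPED,       /* no owning process found: nothing to do */
    POISON_QUEUE_RESET,   /* only the offending queue was reset */
    POISON_GPU_RESET      /* legacy path: reset the whole GPU */
};

/* Hypothetical description of the event; flags are inputs to the model. */
struct poison_event {
    bool process_found;   /* process lookup succeeded */
    bool is_sdma;         /* poison was consumed by an SDMA queue */
    bool queue_reset_ok;  /* result a queue reset attempt would have */
};

enum poison_recovery handle_poison_consumed(const struct poison_event *ev)
{
    if (!ev->process_found)
        return POISON_SKIPPED;

    /* Queue reset is not applicable to SDMA: use the legacy way. */
    if (ev->is_sdma)
        return POISON_GPU_RESET;

    /* Prefer the cheap per-queue reset; fall back if it fails. */
    if (ev->queue_reset_ok)
        return POISON_QUEUE_RESET;

    return POISON_GPU_RESET;
}
```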

02 Dec, 2021 · 2 commits

drm/amdkfd: add kfd_device_info_init function · f0dc99a6

Authored by Graham Sider on Nov 17, 2021

Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f0dc99a6

drm/amdkfd: replace asic_name with amdgpu_asic_name · b7675b7b

Authored by Graham Sider on Nov 11, 2021

device_info->asic_name and amdgpu_asic_name[adev->asic_type] both
provide asic name strings, with the only difference being casing.
Remove asic_name from device_info and replace sysfs entry with lowercase
amdgpu_asic_name[]. Ensures string is null-terminated so that this
doesn't break if dev->node_props.name ever gets set anywhere else.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b7675b7b
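
Lowercasing a name into a fixed-size buffer while guaranteeing null termination can be sketched as follows. The helper name copy_name_lowercase and the buffer sizes in the usage note are hypothetical; the driver's own loop may differ.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Copy `src` into `dst` in lowercase, truncating if necessary.
 * `dst` is always null-terminated as long as dst_size is non-zero.
 */
void copy_name_lowercase(char *dst, size_t dst_size, const char *src)
{
    size_t i;

    if (dst_size == 0)
        return;

    /* Leave one byte for the terminator. */
    for (i = 0; i + 1 < dst_size && src[i] != '\0'; i++)
        dst[i] = (char)tolower((unsigned char)src[i]);

    dst[i] = '\0';
}
```

For example, copying "VEGA10" into an 8-byte buffer yields "vega10", while a 4-byte buffer yields the truncated but still terminated "veg".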

25 Nov, 2021 · 2 commits

drm/amdkfd: simplify drain retry fault · 6946be24

Authored by Philip Yang on Nov 19, 2021

Unmapping a range always increments the atomic svms->drain_pagefaults,
which simplifies both parent range and child range unmap. The page fault
handler ignores the retry fault if svms->drain_pagefaults is set, to
speed up interrupt handling. svm_range_drain_retry_fault restarts
draining if another range is unmapped from the CPU.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6946be24
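
The counter protocol described above can be modeled in userspace with C11 atomics. This is a sketch under assumptions: svms_model and the three functions are hypothetical names, and the actual draining work is replaced by a pass counter so the control flow stays visible.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-process SVM range list. */
struct svms_model {
    atomic_int drain_pagefaults;  /* non-zero: stale faults may be queued */
    int drain_passes;             /* example bookkeeping only */
};

/* Every unmap, parent or child range alike, just bumps the counter. */
void on_range_unmap(struct svms_model *s)
{
    atomic_fetch_add(&s->drain_pagefaults, 1);
}

/* Fast path in the fault handler: skip retry faults while draining is due. */
bool should_ignore_retry_fault(struct svms_model *s)
{
    return atomic_load(&s->drain_pagefaults) != 0;
}

/* Drain, and start over if another unmap arrived in the meantime. */
void drain_retry_faults(struct svms_model *s)
{
    int seen;

    for (;;) {
        seen = atomic_load(&s->drain_pagefaults);
        if (seen == 0)
            return;

        s->drain_passes++;  /* stands in for waiting on the fault queue */

        /* Clear only if no new unmap changed the counter. */
        if (atomic_compare_exchange_strong(&s->drain_pagefaults, &seen, 0))
            return;
    }
}
```

The compare-and-exchange is what makes the restart safe: if an unmap increments the counter during a drain pass, the exchange fails and the loop runs again instead of clearing a request it has not served.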

drm/amdkfd: simplify drain retry fault · 2e447728

Authored by Philip Yang on Nov 19, 2021

Unmapping a range always increments the atomic svms->drain_pagefaults,
which simplifies both parent range and child range unmap. The page fault
handler ignores the retry fault if svms->drain_pagefaults is set, to
speed up interrupt handling. svm_range_drain_retry_fault restarts
draining if another range is unmapped from the CPU.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2e447728

23 Nov, 2021 · 2 commits

drm/amdkfd: Remove unused entries in table · a0e7e140

Authored by Amber Lin on Nov 18, 2021

Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines
and num_sdma_queues_per_engine. They are calculated in
kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead.

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a0e7e140

drm/amdkfd: Retrieve SDMA numbers from amdgpu · ee2f17f4

Authored by Amber Lin on Nov 18, 2021

Instead of hard-coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the total number of SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non-static function names

Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ee2f17f4
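
The split rule stated in this message is simple arithmetic. The sketch below uses hypothetical function names and takes the total instance count and XGMI support as plain parameters instead of reading them from the amdgpu device.

```c
#include <stdbool.h>

/*
 * Number of regular SDMA engines: at most the first two instances when
 * XGMI is supported, otherwise every instance.
 */
unsigned int model_num_sdma_engines(unsigned int num_instances,
                                    bool xgmi_supported)
{
    if (!xgmi_supported)
        return num_instances;
    return num_instances < 2 ? num_instances : 2;
}

/* Whatever remains after the regular engines is treated as SDMA-XGMI. */
unsigned int model_num_xgmi_sdma_engines(unsigned int num_instances,
                                         bool xgmi_supported)
{
    return num_instances -
           model_num_sdma_engines(num_instances, xgmi_supported);
}
```

With 8 instances and XGMI support this gives 2 regular and 6 XGMI engines; with 4 instances and no XGMI support it gives 4 and 0.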

18 Nov, 2021 · 6 commits

drm/amdkfd: replace asic_family with asic_type · 7eb0502a

Authored by Graham Sider on Nov 10, 2021

asic_family was a duplicate of asic_type, both of type amd_asic_type.
Replace all instances of device_info->asic_family with adev->asic_type
and remove asic_family from device_info.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7eb0502a

drm/amdkfd: convert KFD_IS_SOC to IP version checking · dd0ae064

Authored by Graham Sider on Nov 09, 2021

Defined as GC HWIP >= IP_VERSION(9, 0, 1).

Also defines KFD_GC_VERSION to return GC HWIP version.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dd0ae064
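
A version check of this kind works because the major, minor, and revision numbers are packed into a single integer that compares in the right order. The sketch below assumes a packing of major in bits 16 and up, minor in bits 8 to 15, and revision in bits 0 to 7; the macro names with the MODEL_ prefix are hypothetical and only mirror the comparison stated in the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing: major << 16 | minor << 8 | revision. */
#define MODEL_IP_VERSION(mj, mn, rv) \
    (((uint32_t)(mj) << 16) | ((uint32_t)(mn) << 8) | (uint32_t)(rv))

/* True when the packed GC version is 9.0.1 or newer. */
#define MODEL_IS_SOC15(gc_version) \
    ((gc_version) >= MODEL_IP_VERSION(9, 0, 1))

/* Function form, convenient when the version comes from a variable. */
bool model_is_soc15(uint32_t gc_version)
{
    return MODEL_IS_SOC15(gc_version);
}
```

Since each field occupies its own bit range, a plain integer comparison orders versions lexicographically by major, then minor, then revision, as long as minor and revision stay below 256.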

drm/amdkfd: remove kgd_dev declaration and initialization · b5d1d755

Authored by Graham Sider on Oct 19, 2021

Completes removal of kgd_dev. Direct references to amdgpu_device objects
should now be used instead.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b5d1d755

drm/amdkfd: replace/remove remaining kgd_dev references · 56c5977e

Authored by Graham Sider on Oct 19, 2021

Remove get_amdgpu_device and other remaining kgd_dev references aside
from declaration/kfd struct entry and initialization.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

56c5977e

drm/amdkfd: replace kgd_dev in get amdgpu_amdkfd funcs · 574c4183

Authored by Graham Sider on Oct 19, 2021

Modified definitions:

- amdgpu_amdkfd_get_fw_version
- amdgpu_amdkfd_get_local_mem_info
- amdgpu_amdkfd_get_gpu_clock_counter
- amdgpu_amdkfd_get_max_engine_clock_in_mhz
- amdgpu_amdkfd_get_cu_info
- amdgpu_amdkfd_get_dmabuf_info
- amdgpu_amdkfd_get_vram_usage
- amdgpu_amdkfd_get_hive_id
- amdgpu_amdkfd_get_unique_id
- amdgpu_amdkfd_get_mmio_remap_phys_addr
- amdgpu_amdkfd_get_num_gws
- amdgpu_amdkfd_get_asic_rev_id
- amdgpu_amdkfd_get_noretry
- amdgpu_amdkfd_get_xgmi_hops_count
- amdgpu_amdkfd_get_xgmi_bandwidth_mbytes
- amdgpu_amdkfd_get_pcie_bandwidth_mbytes

Also replaces kfd_device_by_kgd with kfd_device_by_adev, now
searching via adev rather than kgd.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

574c4183

drm/amdkfd: add amdgpu_device entry to kfd_dev · c6c57446

Authored by Graham Sider on Oct 12, 2021

Patch series to remove kgd_dev struct and replace all instances with
amdgpu_device objects.

amdgpu_device needs to be declared in kgd_kfd_interface.h to be visible
to kfd2kgd_calls.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c6c57446

10 Nov, 2021 · 1 commit

drm/amdkfd: Fix retry fault drain race conditions · a44fe9ee

Authored by Felix Kuehling on Nov 05, 2021

The check for whether to drain retry faults must be under the mmap write
lock to serialize with munmap notifier callbacks.

We were also missing checks on child ranges. To fix that, simplify the
logic by using a flag rather than checking on each prange. That also
allows draining less frequently when many ranges are unmapped at once.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Philip Yang <Philip.Yang@amd.com>
Tested-by: Alex Sierra <Alex.Sierra@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a44fe9ee

06 Nov, 2021 · 1 commit

drm/amdkfd: avoid recursive lock in migrations back to RAM · a6283010

Authored by Alex Sierra on Oct 29, 2021

[Why]:
When we call hmm_range_fault to map memory after a migration, we don't
expect memory to be migrated again as a result of hmm_range_fault. The
driver ensures that all memory is in GPU-accessible locations so that
no migration should be needed. However, there is one corner case where
hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
back to system memory due to a write-fault when a system memory page in
the same range was mapped read-only (e.g. COW). Ranges with individual
pages in different locations are usually the result of failed page
migrations (e.g. page lock contention). The unexpected migration back
to system memory causes a deadlock from recursive locking in our
driver.

[How]:
Add a new task reference member to the svm_range_list struct. Set it to
the "current" reference right before hmm_range_fault is called. This
member is checked against the "current" reference in the
svm_migrate_to_ram callback function. If they are equal, the migration
is ignored.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a6283010
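
The [How] section amounts to a re-entrancy guard keyed on the calling task. A minimal userspace model is shown below. The names task_model, svm_range_list_model, begin_range_fault, end_range_fault, and migrate_to_ram_allowed are all hypothetical, and the task is passed explicitly because user space has no equivalent of the kernel's "current".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel task. */
struct task_model {
    int pid;
};

/* Hypothetical stand-in for the range list with the new member. */
struct svm_range_list_model {
    const struct task_model *faulting_task;  /* NULL when no fault is active */
};

/* Called right before the range fault that may trigger a migration. */
void begin_range_fault(struct svm_range_list_model *svms,
                       const struct task_model *current_task)
{
    svms->faulting_task = current_task;
}

/* Called once the range fault has returned. */
void end_range_fault(struct svm_range_list_model *svms)
{
    svms->faulting_task = NULL;
}

/*
 * Migration callback check: a migration requested by the very task that
 * is in the middle of the range fault would re-take locks that task
 * already holds, so it is ignored.
 */
bool migrate_to_ram_allowed(const struct svm_range_list_model *svms,
                            const struct task_model *current_task)
{
    return svms->faulting_task != current_task;
}
```

Migrations requested by any other task still proceed in this model, so only the self-inflicted recursion is suppressed.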

29 Oct, 2021 · 2 commits

drm/amdkfd: Remove cu mask from struct queue_properties(v2) · 7c695a2c

Authored by Lang Yu on Oct 08, 2021

Actually, cu_mask has been copied to mqd memory and
doesn't have to persist in queue_properties. Remove it
from queue_properties.

And use struct mqd_update_info to store such properties,
then pass it to the update queue operation.

v2:
* Rename pqm_update_queue to pqm_update_queue_properties.
* Rename struct queue_update_info to struct mqd_update_info.
* Rename pqm_set_cu_mask to pqm_update_mqd.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7c695a2c

drm/amdkfd: Add an optional argument into update queue operation(v2) · c6e559eb

Authored by Lang Yu on Oct 08, 2021

Currently, a queue is updated with data in queue_properties,
and all allocated resources in queue_properties are not
freed until the queue is destroyed.

But some properties (e.g., cu mask) bring memory
management headaches (e.g., memory leaks) and make the code
complex. Actually, they have been copied to the mqd and
don't have to persist in queue_properties.

Add an argument to the update queue operation to pass such
properties, so that we can remove them from queue_properties.

v2: Don't use void *.

Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c6e559eb
