Commits · c7f21978fa6aafaf7ad37155c7d3a217dc3d16b0 · openeuler / Kernel

01 Jul, 2022 1 commit

drm/amdkfd: Add user queue eviction restore SMI event · c7f21978

Committed by Philip Yang on Jan 13, 2022

Output user queue eviction and restore event. User queue eviction may be
triggered by svm or userptr MMU notifier, TTM eviction, device suspend
and CRIU checkpoint and restore.

User queue restore may be rescheduled if eviction happens again during
restore.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

24 Jun, 2022 1 commit

drm/amdkfd: Enable GFX11 usermode queue oversubscription · e77a541f

Committed by Graham Sider on May 11, 2022

Starting with GFX11, MES requires wptr BOs to be GTT allocated/mapped to
GART for usermode queues in order to support oversubscription. In the
case that work is submitted to an unmapped queue, MES must have a GART
wptr address to determine whether the queue should be mapped.

This change is accompanied by changes in MES and applies to
MES_API_VERSION >= 2.

v3:
- Use amdgpu_vm_bo_lookup_mapping for wptr_bo mapping lookup
- Move wptr_bo refcount increment to amdgpu_amdkfd_map_gtt_bo_to_gart
- Remove list_del_init from amdgpu_amdkfd_map_gtt_bo_to_gart
- Cleanup/fix create_queue wptr_bo error handling
v4:
- Add MES version shift/mask defines to amdgpu_mes.h
- Change version check from MES_VERSION to MES_API_VERSION
- Add check in kfd_ioctl_create_queue before wptr bo pin/GART map to
ensure bo is a single page.
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

04 May, 2022 1 commit

drm/amdkfd: Add KFD support for soc21 v3 · cc009e61

Committed by Mukul Joshi on Apr 26, 2022

Add initial support for soc21 in KFD compute
driver (Mukul)
- Add new definition for soc21 device.
- Add new file for amdgpu-kfd interface for GFX11 family.
- Add new file for queue management, interrupt handling,
  mqd management for GFX11 family in KFD driver.
- Related changes/updates for soc21 device in
  KFD driver.
- Repurpose last 2 entries of SDMA MQD for driver use.

v2: Add an optional argument into update queue operation (Mukul)

v3: Switch to ip version check, replace kgd_dev with
    amdgpu_device (Hawking)
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

29 Apr, 2022 1 commit

drm/amdkfd: Fix circular lock dependency warning · b179fc28

Committed by Mukul Joshi on Apr 22, 2022

[  168.544078] ======================================================
[  168.550309] WARNING: possible circular locking dependency detected
[  168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G            E
[  168.562558] ------------------------------------------------------
[  168.568764] kfdtest/3479 is trying to acquire lock:
[  168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at:
		kfd_topology_device_by_id+0x16/0x60 [amdgpu]
[  168.583663]
                but task is already holding lock:
[  168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at:
		vm_mmap_pgoff+0xa9/0x180
[  168.597755]
                which lock already depends on the new lock.

[  168.605970]
                the existing dependency chain (in reverse order) is:
[  168.613487]
                -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
[  168.619700]        lock_acquire+0xca/0x2e0
[  168.623814]        down_read+0x3e/0x140
[  168.627676]        do_user_addr_fault+0x40d/0x690
[  168.632399]        exc_page_fault+0x6f/0x270
[  168.636692]        asm_exc_page_fault+0x1e/0x30
[  168.641249]        filldir64+0xc8/0x1e0
[  168.645115]        call_filldir+0x7c/0x110
[  168.649238]        ext4_readdir+0x58e/0x940
[  168.653442]        iterate_dir+0x16a/0x1b0
[  168.657558]        __x64_sys_getdents64+0x83/0x140
[  168.662375]        do_syscall_64+0x35/0x80
[  168.666492]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.672095]
                -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}:
[  168.679008]        lock_acquire+0xca/0x2e0
[  168.683122]        down_read+0x3e/0x140
[  168.686982]        path_openat+0x5b2/0xa50
[  168.691095]        do_file_open_root+0xfc/0x190
[  168.695652]        file_open_root+0xd8/0x1b0
[  168.702010]        kernel_read_file_from_path_initns+0xc4/0x140
[  168.709542]        _request_firmware+0x2e9/0x5e0
[  168.715741]        request_firmware+0x32/0x50
[  168.721667]        amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu]
[  168.730060]        smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu]
[  168.738414]        fiji_start_smu+0xcf/0x4e0 [amdgpu]
[  168.745539]        pp_dpm_load_fw+0x21/0x30 [amdgpu]
[  168.752503]        amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu]
[  168.760698]        amdgpu_device_fw_loading+0xb8/0x140 [amdgpu]
[  168.768412]        amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu]
[  168.776285]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.784034]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.791161]        local_pci_probe+0x40/0x80
[  168.797027]        work_for_cpu_fn+0x10/0x20
[  168.802839]        process_one_work+0x273/0x5b0
[  168.808903]        worker_thread+0x20f/0x3d0
[  168.814700]        kthread+0x176/0x1a0
[  168.819968]        ret_from_fork+0x1f/0x30
[  168.825563]
                -> #1 (&adev->pm.mutex){+.+.}-{3:3}:
[  168.834721]        lock_acquire+0xca/0x2e0
[  168.840364]        __mutex_lock+0xa2/0x930
[  168.846020]        amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu]
[  168.853257]        amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu]
[  168.861547]        kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu]
[  168.869478]        kfd_create_crat_image_virtual+0x447/0x510 [amdgpu]
[  168.877884]        kfd_topology_add_device+0x5c8/0x6f0 [amdgpu]
[  168.885556]        kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu]
[  168.893347]        amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu]
[  168.901177]        amdgpu_device_init.cold+0x141b/0x1716 [amdgpu]
[  168.909025]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.916458]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.923442]        local_pci_probe+0x40/0x80
[  168.929249]        work_for_cpu_fn+0x10/0x20
[  168.935008]        process_one_work+0x273/0x5b0
[  168.940944]        worker_thread+0x20f/0x3d0
[  168.946623]        kthread+0x176/0x1a0
[  168.951765]        ret_from_fork+0x1f/0x30
[  168.957277]
                -> #0 (&topology_lock){++++}-{3:3}:
[  168.965993]        check_prev_add+0x8f/0xbf0
[  168.971613]        __lock_acquire+0x1299/0x1ca0
[  168.977485]        lock_acquire+0xca/0x2e0
[  168.982877]        down_read+0x3e/0x140
[  168.987975]        kfd_topology_device_by_id+0x16/0x60 [amdgpu]
[  168.995583]        kfd_device_by_id+0xa/0x20 [amdgpu]
[  169.002180]        kfd_mmap+0x95/0x200 [amdgpu]
[  169.008293]        mmap_region+0x337/0x5a0
[  169.013679]        do_mmap+0x3aa/0x540
[  169.018678]        vm_mmap_pgoff+0xdc/0x180
[  169.024095]        ksys_mmap_pgoff+0x186/0x1f0
[  169.029734]        do_syscall_64+0x35/0x80
[  169.035005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  169.041754]
                other info that might help us debug this:

[  169.053276] Chain exists of:
                  &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2

[  169.068389]  Possible unsafe locking scenario:

[  169.076661]        CPU0                    CPU1
[  169.082383]        ----                    ----
[  169.088087]   lock(&mm->mmap_lock#2);
[  169.092922]                                lock(&type->i_mutex_dir_key#6);
[  169.100975]                                lock(&mm->mmap_lock#2);
[  169.108320]   lock(&topology_lock);
[  169.112957]
                 *** DEADLOCK ***

This commit fixes the deadlock warning by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
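
The fix described in this commit (cache the memory info at device init so
that code holding the topology lock never needs pm.mutex) can be sketched
in user space as follows. This is a minimal illustration, not the kernel
code: `pm_mutex` and `topology_lock` are pthread mutexes standing in for
the kernel locks named in the report (the real topology lock is taken with
`down_read` in the trace), and `sketch_dev`, `query_local_mem` and the
16 GiB value are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-ins for the two locks named in the lockdep report. */
static pthread_mutex_t pm_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t topology_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical device structure holding the cached value. */
struct sketch_dev {
	uint64_t local_mem_size;	/* filled once during init */
	int cached;			/* non-zero once the cache is valid */
};

/* Query that needs pm_mutex (stands in for the power-management call). */
static uint64_t query_local_mem(void)
{
	uint64_t v;

	pthread_mutex_lock(&pm_mutex);
	v = 16ULL << 30;		/* made-up value for illustration */
	pthread_mutex_unlock(&pm_mutex);
	return v;
}

/* Device init: pm_mutex is taken here, while topology_lock is not held. */
static void sketch_dev_init(struct sketch_dev *dev)
{
	dev->local_mem_size = query_local_mem();
	dev->cached = 1;
}

/* Topology path: holds topology_lock but only reads the cached copy. */
static uint64_t sketch_topology_mem_size(const struct sketch_dev *dev)
{
	uint64_t v;

	assert(dev->cached);
	pthread_mutex_lock(&topology_lock);
	v = dev->local_mem_size;
	pthread_mutex_unlock(&topology_lock);
	return v;
}
```

With this shape, no code path acquires pm_mutex while holding
topology_lock, so the topology_lock -> pm.mutex edge of the reported chain
is not created.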

28 Apr, 2022 1 commit

drm/amdkfd: CRIU add support for GWS queues · f567656f

Committed by David Yat Sin on Apr 13, 2022

Add support to checkpoint/restore GWS (Global Wave Sync) queues.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

26 Apr, 2022 1 commit

drm/amdkfd: Ignore bogus signals from MEC efficiently · c3eb12df

Committed by Felix Kuehling on Apr 07, 2022

MEC firmware sometimes sends signal interrupts without a valid context ID
on end of pipe events that don't intend to signal any HSA signals.
This triggers the slow path in kfd_signal_event_interrupt that scans the
entire event page for signaled events. Detect these signals in the top
half interrupt handler to stop processing them as early as possible.

Because we now always treat event ID 0 as invalid, reserve that ID during
process initialization.

v2: Update firmware version checks to support more GPUs
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

21 Apr, 2022 1 commit

drm/amdkfd: CRIU add support for GWS queues · 747eea07

Committed by David Yat Sin on Apr 13, 2022

Add support to checkpoint/restore GWS (Global Wave Sync) queues.
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

20 Apr, 2022 1 commit

drm/amdkfd: move kfd_flush_tlb_after_unmap into kfd_priv.h · 459ccca5

Committed by Lang Yu on Apr 14, 2022

To make kfd_flush_tlb_after_unmap visible in kfd_svm.c,
move it into kfd_priv.h. And change it to an inline function.
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

13 Apr, 2022 1 commit

drm/amdkfd: Cleanup IO links during KFD device removal · 46d18d51

Committed by Mukul Joshi on Apr 06, 2022

Currently, the IO links to the device being removed from the topology
are not cleared. As a result, there would be dangling links left in
the KFD topology. This patch aims to fix the following:
1. Cleanup all IO links to the device being removed.
2. Ensure that node numbering in sysfs and nodes proximity domain
   values are consistent after the device is removed:
   a. Adding a device and removing a GPU device are made mutually
      exclusive.
   b. The global proximity domain counter is no longer required to be
      an atomic counter. A normal 32-bit counter can be used instead.
3. Update generation_count to let user-mode know that topology has
   changed due to device removal.

CC: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

01 Apr, 2022 1 commit

drm/amdkfd: Use atomic64_t type for pdd->tlb_seq · 8fde0248

Committed by Philip Yang on Mar 25, 2022

To support multi-threaded page table updates.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

26 Mar, 2022 1 commit

drm/amdkfd: start using tlb_seq from the VM subsystem · bffa91da

Committed by Christian König on Mar 17, 2022

Instead of trying to figure out whether a TLB flush is necessary, use
the information provided by the VM subsystem now.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

04 Mar, 2022 1 commit

mm: don't include <linux/memremap.h> in <linux/mm.h> · dc90f084

Committed by Christoph Hellwig on Feb 16, 2022

Move the check for the actual pgmap types that need the free at refcount
one behavior into the out of line helper, and thus avoid the need to
pull memremap.h into mm.h.

Link: https://lkml.kernel.org/r/20220210072828.2930359-7-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: "Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com>

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Chaitanya Kulkarni <kch@nvidia.com>
Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

24 Feb, 2022 1 commit

drm/amdkfd: Use real device for messages · a0c5fd46

Committed by Felix Kuehling on Feb 18, 2022

kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Feb, 2022 1 commit

drm/amdkfd: Replace zero-length array with flexible-array member · d5c83156

Committed by Changcheng Deng on Feb 15, 2022

There is a regular need in the kernel to provide a way to declare having
a dynamically sized set of trailing elements in a structure. Kernel code
should always use "flexible array members" for these cases. The older
style of one-element or zero-length arrays should no longer be used.
Reference:
https://www.kernel.org/doc/html/latest/process/deprecated.html#zero-length-and-one-element-arrays
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
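
The difference between the deprecated forms and a flexible array member
can be sketched in plain C as follows. `sample_buf` and its helpers are
hypothetical and are not KFD structures; the sketch uses user-space
`calloc` where kernel code would use a kernel allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record with a dynamically sized tail; not a KFD structure. */
struct sample_buf {
	uint32_t count;
	/* Deprecated forms: "uint32_t data[0];" or "uint32_t data[1];" */
	uint32_t data[];	/* C99 flexible array member, must come last */
};

/* Allocate a sample_buf with room for n trailing elements. */
static struct sample_buf *sample_buf_alloc(uint32_t n)
{
	struct sample_buf *b;

	/* sizeof(*b) covers only the header, so add the tail explicitly. */
	b = calloc(1, sizeof(*b) + (size_t)n * sizeof(b->data[0]));
	if (!b)
		return NULL;
	b->count = n;
	return b;
}

/* Store a value in the tail, checking the index against count. */
static void sample_buf_set(struct sample_buf *b, uint32_t i, uint32_t v)
{
	assert(i < b->count);
	b->data[i] = v;
}
```

In kernel code the size computation is usually written with the
`struct_size()` helper, which also guards against arithmetic overflow.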

15 Feb, 2022 3 commits

drm/amdkfd: remove unneeded unmap single queue option · d2cb0b21

Committed by Jonathan Kim on Feb 10, 2022

The KFD only unmaps all queues, all dynamic queues or all process queues
since RUN_LIST is mapped with all KFD queues.

There's no need to provide a single type unmap so remove this option.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Fix leftover errors and warnings · 2243f493

Committed by Rajneesh Bhardwaj on Feb 10, 2022

A bunch of errors and warnings have been left over in KFD over the years.
Attempt to fix the errors and most warnings reported by the checkpatch tool.
A few warnings remain which may be false positives, so ignore them for now.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: update SPDX license header · d87f36a0

Committed by Rajneesh Bhardwaj on Feb 10, 2022

Update the SPDX License header for all the KFD files.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

10 Feb, 2022 2 commits

drm/amdkfd: Remove unused old debugger implementation · 5bdd3eb2

Committed by Mukul Joshi on Feb 04, 2022

Clean up the KFD code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API was discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: rename kfd_process_vm_fault to kfd_dqm_evict_pasid · 03e5b167

Committed by Tao Zhou on Feb 07, 2022

As the function is used in more different cases, use a more general
name.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

08 Feb, 2022 13 commits

drm/amdkfd: CRIU prepare for svm resume · c2db32ce

Committed by Rajneesh Bhardwaj on Nov 08, 2021

During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU Discover svm ranges · 08a987a8

Committed by Rajneesh Bhardwaj on Nov 02, 2021

A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the
CRIU IOCTL to discover the SVM ranges in the target process. Future
patches bring in the required support for checkpoint and restore of
SVM ranges.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU checkpoint and restore xnack mode · 4717fe3d

Committed by Rajneesh Bhardwaj on Nov 19, 2021

Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU implement gpu_id remapping · bef153b7

Committed by David Yat Sin on Apr 09, 2021

When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still
use the original gpu_id's in the ioctl calls. Add code to create a
gpu_id mapping so that KFD can determine the actual gpu_id during the
user ioctls.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
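
The remapping idea can be sketched as a small lookup table that translates
the gpu_id an application still uses into the gpu_id valid on the restore
node. This is a minimal sketch under assumptions: `gpu_id_map` and
`gpu_id_translate` are hypothetical names, and the commit message does not
say how the kernel stores the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: ID seen by user space vs. ID on this node. */
struct gpu_id_map {
	uint32_t user_gpu_id;
	uint32_t actual_gpu_id;
};

/* Translate a user-visible gpu_id; returns 0 on success, -1 if unknown. */
static int gpu_id_translate(const struct gpu_id_map *map, size_t n,
			    uint32_t user_id, uint32_t *actual_id)
{
	assert(actual_id != NULL);

	for (size_t i = 0; i < n; i++) {
		if (map[i].user_gpu_id == user_id) {
			*actual_id = map[i].actual_gpu_id;
			return 0;
		}
	}
	return -1;	/* no mapping recorded for this ID */
}
```

For a process that was never checkpointed, each entry would simply map an
ID to itself, so the same lookup path serves both cases.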

drm/amdkfd: CRIU checkpoint and restore events · 40e8a766

Committed by David Yat Sin on Mar 05, 2021

Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU checkpoint and restore queue control stack · 3a9822d7

Committed by David Yat Sin on Jan 25, 2021

Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU checkpoint and restore queue mqds · 42c6c482

Committed by David Yat Sin on Jan 25, 2021

Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU restore queue ids · 8668dfc3

Committed by David Yat Sin on Jan 25, 2021

When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU add queues support · 626f7b31

Committed by David Yat Sin on Jan 25, 2021

Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU Implement KFD unpause operation · cd9f7910

Committed by David Yat Sin on Aug 16, 2021

Introduce the UNPAUSE op. After the CRIU amdgpu plugin performs a
PROCESS_INFO op, the queues will stay in an evicted state. Once the plugin
is done draining BO contents, it is safe to perform an UNPAUSE op for the
queues to resume.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU Implement KFD resume ioctl · 011bbb03

Committed by Rajneesh Bhardwaj on Jan 11, 2021

This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since it is called by the CRIU master restore process for all the target
processes being resumed by CRIU.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU Implement KFD checkpoint ioctl · 5ccbb057

Committed by Rajneesh Bhardwaj on Nov 30, 2020

This adds support to discover the buffer objects that belong to a
process being checkpointed. The data corresponding to these buffer
objects is returned to user space plugin running under criu master
context which then stores this info to recreate these buffer objects
during a restore operation.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: CRIU Introduce Checkpoint-Restore APIs · 36988070

Committed by Rajneesh Bhardwaj on Aug 24, 2021

Checkpoint-Restore in Userspace (CRIU) is a tool that can snapshot a
running process and later restore it on the same or a remote machine.
For processes that have a device file (e.g. a GPU) associated with them,
it expects the driver to provide the support needed to assist CRIU and
its extensible plugin interface. Thus, in order to support
checkpoint-restore of any ROCm process, the AMD Radeon Open Compute
kernel driver needs to provide a set of new APIs that expose the
necessary VRAM metadata and contents to a userspace component (the CRIU
plugin), which can store them in the form of image files.

This introduces new ioctls that will be used to checkpoint and restore
any KFD-bound user process. KFD only allows ioctl calls from the same
process that opened the KFD file descriptor. These ioctls are expected
to be called from a KFD CRIU plugin that has elevated ptrace-attached
privileges and CAP_CHECKPOINT_RESTORE capabilities attached to the file
descriptors, so modify KFD to allow such calls.

(API redesigned by David Yat Sin)
Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

29 Dec, 2021 1 commit

drm/amdkfd: reset queue which consumes RAS poison (v2) · b6485bed

Committed by Tao Zhou on Dec 06, 2021

CP supports unmapping a queue with reset mode, which destroys only the
specific queue without affecting others. Replacing a whole GPU reset with
reset queue mode for RAS poison consumption saves much time, and we can
still fall back to the GPU reset solution if the queue reset fails.

v2: Return directly if process is NULL;
    Reset queue solution is not applicable to SDMA, fall back to legacy
    way;
    Call kfd_unref_process after lookup process.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

02 Dec, 2021 2 commits

drm/amdkfd: add kfd_device_info_init function · f0dc99a6

Committed by Graham Sider on Nov 17, 2021

Initializes kfd->device_info given either asic_type (enum) if GFX
version is less than GFX9, or GC IP version if greater. Also takes in vf
and the target compiler gfx version. Uses SDMA version to determine
num_sdma_queues_per_engine.

Convert device_info to a non-pointer member of kfd, change references
accordingly.

Change unsupported asic condition to only probe f2g, move device_info
initialization post-switch.
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: replace asic_name with amdgpu_asic_name · b7675b7b

Committed by Graham Sider on Nov 11, 2021

device_info->asic_name and amdgpu_asic_name[adev->asic_type] both
provide asic name strings, with the only difference being casing.
Remove asic_name from device_info and replace sysfs entry with lowercase
amdgpu_asic_name[]. Ensures string is null-terminated so that this
doesn't break if dev->node_props.name ever gets set anywhere else.
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

25 Nov, 2021 2 commits

drm/amdkfd: simplify drain retry fault · 6946be24

Committed by Philip Yang on Nov 19, 2021

Unmapping a range always increments the atomic svms->drain_pagefaults to
simplify both parent range and child range unmap. The page fault handler
ignores the retry fault if svms->drain_pagefaults is set, to speed up
interrupt handling. svm_range_drain_retry_fault restarts draining if
another range is unmapped from the CPU.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: simplify drain retry fault · 2e447728

Committed by Philip Yang on Nov 19, 2021

Unmapping a range always increments the atomic svms->drain_pagefaults to
simplify both parent range and child range unmap. The page fault handler
ignores the retry fault if svms->drain_pagefaults is set, to speed up
interrupt handling. svm_range_drain_retry_fault restarts draining if
another range is unmapped from the CPU.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23 Nov, 2021 2 commits

drm/amdkfd: Remove unused entries in table · a0e7e140

Committed by Amber Lin on Nov 18, 2021

Remove unused entries in kfd_device_info table: num_xgmi_sdma_engines
and num_sdma_queues_per_engine. They are calculated in
kfd_get_num_sdma_engines and kfd_get_num_xgmi_sdma_engines instead.
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Retrieve SDMA numbers from amdgpu · ee2f17f4

由 Amber Lin 提交于 11月 18, 2021

Instead of hard coding the number of sdma engines and the number of
sdma_xgmi engines in the device_info table, get the number of total SDMA
instances from amdgpu. The first two engines are sdma engines and the
rest are sdma-xgmi engines unless the ASIC doesn't support XGMI.

v2: add kfd_ prefix to non static function names
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
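
The split rule stated in this commit message (the first two instances are
sdma engines, the rest are sdma-xgmi engines unless the ASIC lacks XGMI)
can be sketched as plain arithmetic. The function names and the
`has_xgmi` flag are hypothetical; the kernel helpers operate on a device
structure rather than on bare integers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: number of plain SDMA engines out of `total`. */
static unsigned int sketch_num_sdma_engines(unsigned int total, bool has_xgmi)
{
	/* Without XGMI support every instance is a plain SDMA engine. */
	if (!has_xgmi)
		return total;

	/* With XGMI, at most the first two instances are plain SDMA engines. */
	return total < 2 ? total : 2;
}

/* Hypothetical helper: the remaining instances are SDMA-XGMI engines. */
static unsigned int sketch_num_xgmi_sdma_engines(unsigned int total,
						 bool has_xgmi)
{
	unsigned int plain = sketch_num_sdma_engines(total, has_xgmi);

	assert(plain <= total);
	return total - plain;
}
```

Deriving both counts from one total keeps them consistent by
construction, which a pair of hard-coded table entries cannot guarantee.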

18 Nov, 2021 1 commit

drm/amdkfd: replace asic_family with asic_type · 7eb0502a

Committed by Graham Sider on Nov 10, 2021

asic_family was a duplicate of asic_type, both of type amd_asic_type.
Replace all instances of device_info->asic_family with adev->asic_type
and remove asic_family from device_info.
Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

openeuler / Kernel · synced successfully more than a year ago
