提交 · b91c23e099f0b65d62159da13458c5eefa76083f · openeuler / Kernel

10 11月, 2022 1 次提交

drm/amdkfd: Fix error handling in criu_checkpoint · b91c23e0

由 Felix Kuehling 提交于 11月 01, 2022

Checkpoint BOs last. That way we don't need to close dmabuf FDs if
something else fails later. This avoids problematic access to user mode
memory in the error handling code path.

criu_checkpoint_bos has its own error handling and cleanup that does not
depend on access to user memory.

In the private data, keep BOs before the remaining objects. This is
necessary to restore things in the correct order as restoring events
depends on the events-page BO being restored first.

Fixes: be072b06 ("drm/amdkfd: CRIU export BOs as prime dmabuf objects")
Reported-by: NJann Horn <jannh@google.com>
CC: Rajneesh Bhardwaj <Rajneesh.Bhardwaj@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-and-tested-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

b91c23e0

30 9月, 2022 1 次提交

drm/amdkfd: Track unified memory when switching xnack mode · 8a7c3ce1

由 Philip Yang 提交于 9月 07, 2022

Unified memory usage with xnack off is tracked to avoid oversubscribe
system memory, with xnack on, we don't track unified memory usage to
allow memory oversubscribe. When switching xnack mode from off to on,
subsequent free ranges allocated with xnack off will not unreserve
memory. When switching xnack mode from on to off, subsequent free ranges
allocated with xnack on will unreserve memory. Both cases cause memory
accounting unbalanced.

When switching xnack mode from on to off, need reserve already allocated
svm range memory. When switching xnack mode from off to on, need
unreserve already allocated svm range memory.

v6: Take prange lock to access range child list
v5: Handle prange child ranges
v4: Handle reservation memory failure
v3: Handle switching xnack mode race with svm_range_deferred_list_work
v2: Handle both switching xnack from on to off and from off to on cases
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8a7c3ce1

14 9月, 2022 1 次提交

drm/amdkfd: Fix CRIU restore op due to doorbell offset · a4a3798f

由 Rajneesh Bhardwaj 提交于 9月 07, 2022

Recently introduced change to allocate doorbells only when the first
queue is created or mapped for CPU / GPU access, did not consider
Checkpoint Restore scenario completely. This fix allows the CRIU restore
operation by extending the doorbell optimization to CRIU restore
scenario.

Fixes: 16f00131 ("drm/amdkfd: Allocate doorbells only when needed")
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a4a3798f

30 8月, 2022 1 次提交

drm/amdkfd: remove redundant variables err and ret · f4f5e507

由 Jinpeng Cui 提交于 8月 29, 2022

Return value from kfd_wait_on_events() and io_remap_pfn_range() directly
instead of taking this in another redundant variable.
Reported-by: NZeal Robot <zealci@zte.com.cn>
Signed-off-by: NJinpeng Cui <cui.jinpeng2@zte.com.cn>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f4f5e507

26 8月, 2022 1 次提交

drm/amdkfd: Allocate doorbells only when needed · 16f00131

由 Felix Kuehling 提交于 8月 03, 2022

Only allocate doorbells when the first queue is created on a GPU or the
doorbells need to be mapped into CPU or GPU virtual address space. This
avoids allocating doorbells unnecessarily and can allow more processes
to use KFD on multi-GPU systems.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NKent Russell <kent.Russell@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

16f00131

11 8月, 2022 1 次提交

drm/amdkfd: Handle restart of kfd_ioctl_wait_events · bea9a56a

由 Felix Kuehling 提交于 8月 04, 2022

When kfd_ioctl_wait_events needs to restart due to a signal, we need to
update the timeout to account for the time already elapsed. We also need
to undo auto_reset of events that have signaled already, so that the
restarted ioctl will be able to count those signals again.

This fixes infinite hangs when kfd_ioctl_wait_events is interrupted by a
signal.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-and-tested-by: NXiaogang Chen <Xiaogang.Chen@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bea9a56a

01 7月, 2022 1 次提交

drm/amdkfd: Add user queue eviction restore SMI event · c7f21978

由 Philip Yang 提交于 1月 13, 2022

Output user queue eviction and restore event. User queue eviction may be
triggered by svm or userptr MMU notifier, TTM eviction, device suspend
and CRIU checkpoint and restore.

User queue restore may be rescheduled if eviction happens again while
restore.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c7f21978

24 6月, 2022 1 次提交

drm/amdkfd: Enable GFX11 usermode queue oversubscription · e77a541f

由 Graham Sider 提交于 5月 11, 2022

Starting with GFX11, MES requires wptr BOs to be GTT allocated/mapped to
GART for usermode queues in order to support oversubscription. In the
case that work is submitted to an unmapped queue, MES must have a GART
wptr address to determine whether the queue should be mapped.

This change is accompanied with changes in MES and is applicable for
MES_API_VERSION >= 2.

v3:
- Use amdgpu_vm_bo_lookup_mapping for wptr_bo mapping lookup
- Move wptr_bo refcount increment to amdgpu_amdkfd_map_gtt_bo_to_gart
- Remove list_del_init from amdgpu_amdkfd_map_gtt_bo_to_gart
- Cleanup/fix create_queue wptr_bo error handling
v4:
- Add MES version shift/mask defines to amdgpu_mes.h
- Change version check from MES_VERSION to MES_API_VERSION
- Add check in kfd_ioctl_create_queue before wptr bo pin/GART map to
ensure bo is a single page.
Signed-off-by: NGraham Sider <Graham.Sider@amd.com>
Acked-by: NAlex Deucher <alexander.deucher@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e77a541f

15 6月, 2022 1 次提交

drm/amdkfd: Add available memory ioctl · 9731dd4c

由 Daniel Phillips 提交于 5月 30, 2022

Add a new KFD ioctl to return the largest possible memory size that
can be allocated as a buffer object using
kfd_ioctl_alloc_memory_of_gpu. It attempts to use exactly the same
accept/reject criteria as that function so that allocating a new
buffer object of the size returned by this new ioctl is guaranteed to
succeed, barring races with other allocating tasks.

This IOCTL will be used by libhsakmt:
https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg75743.htmlSigned-off-by: NDaniel Phillips <Daniel.Phillips@amd.com>
Signed-off-by: NDavid Yat Sin <David.YatSin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9731dd4c

29 4月, 2022 1 次提交

drm/amdkfd: Fix circular lock dependency warning · b179fc28

由 Mukul Joshi 提交于 4月 22, 2022

[  168.544078] ======================================================
[  168.550309] WARNING: possible circular locking dependency detected
[  168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G            E
[  168.562558] ------------------------------------------------------
[  168.568764] kfdtest/3479 is trying to acquire lock:
[  168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at:
		kfd_topology_device_by_id+0x16/0x60 [amdgpu] [  168.583663]
                but task is already holding lock:
[  168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at:
		vm_mmap_pgoff+0xa9/0x180 [  168.597755]
                which lock already depends on the new lock.

[  168.605970]
                the existing dependency chain (in reverse order) is:
[  168.613487]
                -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
[  168.619700]        lock_acquire+0xca/0x2e0
[  168.623814]        down_read+0x3e/0x140
[  168.627676]        do_user_addr_fault+0x40d/0x690
[  168.632399]        exc_page_fault+0x6f/0x270
[  168.636692]        asm_exc_page_fault+0x1e/0x30
[  168.641249]        filldir64+0xc8/0x1e0
[  168.645115]        call_filldir+0x7c/0x110
[  168.649238]        ext4_readdir+0x58e/0x940
[  168.653442]        iterate_dir+0x16a/0x1b0
[  168.657558]        __x64_sys_getdents64+0x83/0x140
[  168.662375]        do_syscall_64+0x35/0x80
[  168.666492]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  168.672095]
                -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}:
[  168.679008]        lock_acquire+0xca/0x2e0
[  168.683122]        down_read+0x3e/0x140
[  168.686982]        path_openat+0x5b2/0xa50
[  168.691095]        do_file_open_root+0xfc/0x190
[  168.695652]        file_open_root+0xd8/0x1b0
[  168.702010]        kernel_read_file_from_path_initns+0xc4/0x140
[  168.709542]        _request_firmware+0x2e9/0x5e0
[  168.715741]        request_firmware+0x32/0x50
[  168.721667]        amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu]
[  168.730060]        smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu]
[  168.738414]        fiji_start_smu+0xcf/0x4e0 [amdgpu]
[  168.745539]        pp_dpm_load_fw+0x21/0x30 [amdgpu]
[  168.752503]        amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu]
[  168.760698]        amdgpu_device_fw_loading+0xb8/0x140 [amdgpu]
[  168.768412]        amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu]
[  168.776285]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.784034]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.791161]        local_pci_probe+0x40/0x80
[  168.797027]        work_for_cpu_fn+0x10/0x20
[  168.802839]        process_one_work+0x273/0x5b0
[  168.808903]        worker_thread+0x20f/0x3d0
[  168.814700]        kthread+0x176/0x1a0
[  168.819968]        ret_from_fork+0x1f/0x30
[  168.825563]
                -> #1 (&adev->pm.mutex){+.+.}-{3:3}:
[  168.834721]        lock_acquire+0xca/0x2e0
[  168.840364]        __mutex_lock+0xa2/0x930
[  168.846020]        amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu]
[  168.853257]        amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu]
[  168.861547]        kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu]
[  168.869478]        kfd_create_crat_image_virtual+0x447/0x510 [amdgpu]
[  168.877884]        kfd_topology_add_device+0x5c8/0x6f0 [amdgpu]
[  168.885556]        kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu]
[  168.893347]        amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu]
[  168.901177]        amdgpu_device_init.cold+0x141b/0x1716 [amdgpu]
[  168.909025]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
[  168.916458]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
[  168.923442]        local_pci_probe+0x40/0x80
[  168.929249]        work_for_cpu_fn+0x10/0x20
[  168.935008]        process_one_work+0x273/0x5b0
[  168.940944]        worker_thread+0x20f/0x3d0
[  168.946623]        kthread+0x176/0x1a0
[  168.951765]        ret_from_fork+0x1f/0x30
[  168.957277]
                -> #0 (&topology_lock){++++}-{3:3}:
[  168.965993]        check_prev_add+0x8f/0xbf0
[  168.971613]        __lock_acquire+0x1299/0x1ca0
[  168.977485]        lock_acquire+0xca/0x2e0
[  168.982877]        down_read+0x3e/0x140
[  168.987975]        kfd_topology_device_by_id+0x16/0x60 [amdgpu]
[  168.995583]        kfd_device_by_id+0xa/0x20 [amdgpu]
[  169.002180]        kfd_mmap+0x95/0x200 [amdgpu]
[  169.008293]        mmap_region+0x337/0x5a0
[  169.013679]        do_mmap+0x3aa/0x540
[  169.018678]        vm_mmap_pgoff+0xdc/0x180
[  169.024095]        ksys_mmap_pgoff+0x186/0x1f0
[  169.029734]        do_syscall_64+0x35/0x80
[  169.035005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[  169.041754]
                other info that might help us debug this:

[  169.053276] Chain exists of:
                  &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2

[  169.068389]  Possible unsafe locking scenario:

[  169.076661]        CPU0                    CPU1
[  169.082383]        ----                    ----
[  169.088087]   lock(&mm->mmap_lock#2);
[  169.092922]                                lock(&type->i_mutex_dir_key#6);
[  169.100975]                                lock(&mm->mmap_lock#2);
[  169.108320]   lock(&topology_lock);
[  169.112957]
                 *** DEADLOCK ***

This commit fixes the deadlock warning by ensuring pm.mutex is not
held while holding the topology lock. For this, kfd_local_mem_info
is moved into the KFD dev struct and filled during device init.
This cached value can then be used instead of querying the value
again and again.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b179fc28

20 4月, 2022 1 次提交

drm/amdkfd: move kfd_flush_tlb_after_unmap into kfd_priv.h · 459ccca5

由 Lang Yu 提交于 4月 14, 2022

To make kfd_flush_tlb_after_unmap visible in kfd_svm.c,
move it into kfd_priv.h. And change it to an inline function.
Signed-off-by: NLang Yu <Lang.Yu@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

459ccca5

26 3月, 2022 1 次提交

drm/amdkfd: use tlb_seq from the VM subsystem for SVM as well v2 · 4d30a83c

由 Christian König 提交于 3月 17, 2022

Instead of hand rolling the table_freed parameter.

v2: add some changes suggested by Philip
Signed-off-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: Philip Yang<Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4d30a83c

16 3月, 2022 3 次提交

drm/amdkfd: CRIU export dmabuf handles for GTT BOs · 65722ff6

由 David Yat Sin 提交于 3月 08, 2022

Export dmabuf handles for GTT BOs so that their contents can be accessed
using SDMA during checkpoint/restore.

v2: Squash in fix from David to set dmabuf handle to invalid for BOs
that cannot be accessed using SDMA during checkpoint/restore.
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Reviewed-by : Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

65722ff6

drm/amdkfd: CRIU Refactor restore BO function · b38c074b

由 David Yat Sin 提交于 3月 08, 2022

Refactor CRIU restore BO to reduce identation.
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b38c074b

drm/amdkfd: CRIU remove sync and TLB flush on restore · 67a359d8

由 David Yat Sin 提交于 3月 08, 2022

When the process is getting restored, the queues are not mapped yet, so
there is no VMID assigned for this process and no TLBs to flush.
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

67a359d8

24 2月, 2022 2 次提交

drm/amdkfd: Print bdf in peer map failure message · 0c41b9b5

由 Harish Kasiviswanathan 提交于 2月 15, 2022

Print alloc node, peer node and memory domain when peer map fails. This
is more useful

v2: use dev_err instead of pr_err
    use bdf for identify peer gpu
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NAlex Deucher <Alexander.Deucher@amd.com>
Signed-off-by: NHarish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0c41b9b5

drm/amdkfd: Use real device for messages · a0c5fd46

由 Felix Kuehling 提交于 2月 18, 2022

kfd_chardev() doesn't provide much useful information in dev_... messages
on multi-GPU systems because there is only one KFD device, which doesn't
correspond to any particular GPU. Use the actual GPU device to indicate
the GPU that caused a message.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a0c5fd46

23 2月, 2022 1 次提交

drm/amdkfd: Fix criu_restore_bo error handling · 22804e03

由 Felix Kuehling 提交于 2月 18, 2022

Clang static analysis reports this problem
kfd_chardev.c:2327:2: warning: 1st function call argument
  is an uninitialized value
  kvfree(bo_privs);
  ^~~~~~~~~~~~~~~~

Make sure bo_buckets and bo_privs are initialized so freeing them in the
error handling code path will never result in undefined behaviour.

Fixes: 73fa13b6 ("drm/amdkfd: CRIU Implement KFD restore ioctl")
Reported-by: NTom Rix <trix@redhat.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

22804e03

15 2月, 2022 2 次提交

drm/amdkfd: Fix leftover errors and warnings · 2243f493

由 Rajneesh Bhardwaj 提交于 2月 10, 2022

A bunch of errors and warnings are leftover KFD over the years, attempt
to fix the errors and most warnings reported by checkpatch tool. Still a
few warnings remain which may be false positives so ignore them for now.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2243f493

drm/amdkfd: update SPDX license header · d87f36a0

由 Rajneesh Bhardwaj 提交于 2月 10, 2022

Update the SPDX License header for all the KFD files.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

d87f36a0

12 2月, 2022 4 次提交

drm/amdkfd: Fix prototype warning for get_process_num_bos · 24992ab0

由 Rajneesh Bhardwaj 提交于 2月 10, 2022

Fix the warning: no previous prototype for 'get_process_num_bos'
[-Wmissing-prototypes]
Reported-by: Nkernel test robot <lkp@intel.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

24992ab0

drm/amdkfd: fix loop error handling · d8a25e48

由 Tom Rix 提交于 2月 10, 2022

Clang static analysis reports this problem
kfd_chardev.c:2594:16: warning: The expression is an uninitialized value.
  The computed value will also be garbage
        while (ret && i--) {
                      ^~~

i is a loop variable and this block unwinds a problem in the loop.
When the error happens before the loop, this value is garbage.
Move the initialization of i to its decalaration.

Fixes: be072b06 ("drm/amdkfd: CRIU export BOs as prime dmabuf objects")
Signed-off-by: NTom Rix <trix@redhat.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

d8a25e48

drm/amdkfd: fix freeing an unset pointer · 574ff46f

由 Tom Rix 提交于 2月 09, 2022

clang static analysis reports this problem
kfd_chardev.c:2092:2: warning: 1st function call argument
  is an uninitialized value
        kvfree(bo_privs);
        ^~~~~~~~~~~~~~~~

When bo_buckets alloc fails, it jumps to an error handler
that frees the yet to be allocated bo_privs.  Because
bo_buckets is the first error, return directly.

Fixes: 5ccbb057 ("drm/amdkfd: CRIU Implement KFD checkpoint ioctl")
Signed-off-by: NTom Rix <trix@redhat.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

574ff46f

drm/amdkfd: CRIU fix a NULL vs IS_ERR() check · e5af61ff

由 Dan Carpenter 提交于 2月 09, 2022

The kfd_process_device_data_by_id() does not return error pointers,
it returns NULL.

Fixes: bef153b7 ("drm/amdkfd: CRIU implement gpu_id remapping")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NDavid Yat Sin <david.yatsin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e5af61ff

10 2月, 2022 1 次提交

drm/amdkfd: Remove unused old debugger implementation · 5bdd3eb2

由 Mukul Joshi 提交于 2月 04, 2022

Cleanup the kfd code by removing the unused old debugger
implementation.
The address watch was only ever implemented in the upstream
driver for GFXv7 (Kaveri). The user mode tools runtime using
this API was never open-sourced. Work on the old debugger
prototype that used this API has been discontinued years ago.
Only a small piece of resetting wavefronts is kept and
is moved to kfd_device_queue_manager.c.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5bdd3eb2

08 2月, 2022 15 次提交

drm/amdkfd: CRIU resume shared virtual memory ranges · 2a909ae7

由 Rajneesh Bhardwaj 提交于 11月 08, 2021

In CRIU resume stage, resume all the shared virtual memory ranges from
the data stored inside the resuming kfd process during CRIU restore
phase. Also setup xnack mode and free up the resources.

KFD_IOCTL_SVM_ATTR_CLR_FLAGS is not available for querying via get_attr
interface but we must clear the flags during restore as there might be
some default flags set when the prange is created. Also handle the
invalid PREFETCH atribute values saved during checkpoint by replacing
them with another dummy KFD_IOCTL_SVM_ATTR_SET_FLAGS attribute.

(rajneesh: Fixed the checkpatch reported problems)
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2a909ae7

drm/amdkfd: CRIU prepare for svm resume · c2db32ce

由 Rajneesh Bhardwaj 提交于 11月 08, 2021

During CRIU restore phase, the VMAs for the virtual address ranges are
not at their final location yet so in this stage, only cache the data
required to successfully resume the svm ranges during an imminent CRIU
resume phase.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c2db32ce

drm/amdkfd: CRIU Save Shared Virtual Memory ranges · 9d5dabfe

由 Rajneesh Bhardwaj 提交于 11月 03, 2021

During checkpoint stage, save the shared virtual memory ranges and
attributes for the target process. A process may contain a number of svm
ranges and each range might contain a number of attributes. While not
all attributes may be applicable for a given prange but during
checkpoint we store all possible values for the max possible attribute
types.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9d5dabfe

drm/amdkfd: CRIU Discover svm ranges · 08a987a8

由 Rajneesh Bhardwaj 提交于 11月 02, 2021

A KFD process may contain a number of virtual address ranges for shared
virtual memory management and each such range can have many SVM
attributes spanning across various nodes within the process boundary.
This change reports the total number of such SVM ranges and
their total private data size by extending the PROCESS_INFO op of the the
CRIU IOCTL to discover the svm ranges in the target process and a future
patches brings in the required support for checkpoint and restore for
SVM ranges.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

08a987a8

drm/amdkfd: CRIU checkpoint and restore xnack mode · 4717fe3d

由 Rajneesh Bhardwaj 提交于 11月 19, 2021

Recoverable page faults are represented by the xnack mode setting inside
a kfd process and are used to represent the device page faults. For CR,
we don't consider negative values which are typically used for querying
the current xnack mode without modifying it.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4717fe3d

drm/amdkfd: CRIU export BOs as prime dmabuf objects · be072b06

由 Rajneesh Bhardwaj 提交于 6月 29, 2021

KFD buffer objects do not associate a GEM handle with them so cannot
directly be used with libdrm to initiate a system dma (sDMA) operation
to speedup the checkpoint and restore operation so export them as dmabuf
objects and use with libdrm helper (amdgpu_bo_import) to further process
the sdma command submissions.

With sDMA, we see huge improvement in checkpoint and restore operations
compared to the generic pci based access via host data path.
Suggested-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

be072b06

drm/amdkfd: CRIU implement gpu_id remapping · bef153b7

由 David Yat Sin 提交于 4月 09, 2021

When doing a restore on a different node, the gpu_id's on the restore
node may be different. But the user space application will still refer
use the original gpu_id's in the ioctl calls. Adding code to create a
gpu id mapping so that kfd can determine actual gpu_id during the user
ioctl's.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bef153b7

drm/amdkfd: CRIU checkpoint and restore events · 40e8a766

由 David Yat Sin 提交于 3月 05, 2021

Add support to existing CRIU ioctl's to save and restore events during
criu checkpoint and restore.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

40e8a766

drm/amdkfd: CRIU checkpoint and restore queue control stack · 3a9822d7

由 David Yat Sin 提交于 1月 25, 2021

Checkpoint contents of queue control stacks on CRIU dump and restore them
during CRIU restore.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

3a9822d7

drm/amdkfd: CRIU checkpoint and restore queue mqds · 42c6c482

由 David Yat Sin 提交于 1月 25, 2021

Checkpoint contents of queue MQD's on CRIU dump and restore them during
CRIU restore.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

42c6c482

drm/amdkfd: CRIU restore queue ids · 8668dfc3

由 David Yat Sin 提交于 1月 25, 2021

When re-creating queues during CRIU restore, restore the queue with the
same queue id value used during CRIU dump.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8668dfc3

drm/amdkfd: CRIU add queues support · 626f7b31

由 David Yat Sin 提交于 1月 25, 2021

Add support to existing CRIU ioctl's to save number of queues and queue
properties for each queue during checkpoint and re-create queues on
restore.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

626f7b31

drm/amdkfd: CRIU Implement KFD unpause operation · cd9f7910

由 David Yat Sin 提交于 8月 16, 2021

Introducing UNPAUSE op. After CRIU amdgpu plugin performs a PROCESS_INFO
op the queues will be stay in an evicted state. Once the plugin is done
draining BO contents, it is safe to perform an UNPAUSE op for the queues
to resume.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

cd9f7910

drm/amdkfd: CRIU Implement KFD resume ioctl · 011bbb03

由 Rajneesh Bhardwaj 提交于 1月 11, 2021

This adds support to create userptr BOs on restore and introduces a new
ioctl op to restart memory notifiers for the restored userptr BOs.
When doing CRIU restore MMU notifications can happen anytime after we call
amdgpu_mn_register. Prevent MMU notifications until we reach stage-4 of the
restore process i.e. criu_resume ioctl op is received, and the process is
ready to be resumed. This ioctl is different from other KFD CRIU ioctls
since its called by CRIU master restore process for all the target
processes being resumed by CRIU.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

011bbb03

drm/amdkfd: CRIU Implement KFD restore ioctl · 73fa13b6

由 Rajneesh Bhardwaj 提交于 12月 01, 2020

This implements the KFD CRIU Restore ioctl that lays the basic
foundation for the CRIU restore operation. It provides support to
create the buffer objects corresponding to the checkpointed image.
This ioctl creates various types of buffer objects such as VRAM,
MMIO, Doorbell, GTT based on the date sent from the userspace plugin.
The data mostly contains the previously checkpointed KFD images from
some KFD processs.

While restoring a criu process, attach old IDR values to newly
created BOs. This also adds the minimal gpu mapping support for a single
gpu checkpoint restore use case.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NDavid Yat Sin <david.yatsin@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

73fa13b6

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功