1. 18 Sep, 2020 1 commit
  2. 25 Aug, 2020 1 commit
  3. 19 Aug, 2020 1 commit
  4. 15 Aug, 2020 1 commit
  5. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      By Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume it is unsafe for other threads to access
      the GPU, which may cause the reset to fail. Therefore the new
      rw_semaphore adev->reset_sem is introduced to protect the GPU from
      being accessed by external threads during recovery (see the sketch
      after this entry).
      
      v2:
      1. Add an rwlock for some ioctls, debugfs and the file-close function.
      2. Switch to dqm->is_resetting and dqm_lock for protection in the kfd
      driver.
      3. Remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. Change back to adev->reset_sem to protect the kfd callback
      functions, because dqm_lock cannot cover all the code paths; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. Remove try_lock and introduce the atomic hive->in_reset to avoid
      re-entering GPU recovery.
      
      v4:
      1. Remove an unnecessary whitespace change in kfd_chardev.c.
      2. Remove commented-out code in amdgpu_device.c.
      3. Add a more detailed explanation to the commit message.
      4. Define a wrapper function amdgpu_in_reset.
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      df9c8d1a
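
      The locking pattern described in this commit message reduces to a
      classic reader/writer scheme. Below is a minimal C sketch of that
      pattern, using the names from the message (adev->reset_sem,
      adev->in_gpu_reset, amdgpu_in_reset); the simplified struct and the
      example functions are illustrative, not the actual amdgpu code.

          #include <linux/rwsem.h>
          #include <linux/atomic.h>

          /* Simplified stand-in for the real struct amdgpu_device. */
          struct amdgpu_device {
                  struct rw_semaphore reset_sem; /* write-held during reset */
                  atomic_t in_gpu_reset;         /* blocks re-entered recovery */
          };

          /* Wrapper function named in v4 of the message. */
          static bool amdgpu_in_reset(struct amdgpu_device *adev)
          {
                  return atomic_read(&adev->in_gpu_reset) != 0;
          }

          /* External threads (ioctls, debugfs, file close) hold the
           * semaphore for read while touching the hardware. */
          static void example_gpu_access(struct amdgpu_device *adev)
          {
                  down_read(&adev->reset_sem);
                  /* ... access GPU registers and queues safely here ... */
                  up_read(&adev->reset_sem);
          }

          /* The recovery path claims the hang atomically (no re-entry for
           * the same hang), then excludes all readers during the reset. */
          static int example_gpu_recover(struct amdgpu_device *adev)
          {
                  if (atomic_cmpxchg(&adev->in_gpu_reset, 0, 1) != 0)
                          return 0; /* recovery already in progress */

                  down_write(&adev->reset_sem);
                  /* ... perform GPU reset and resume ... */
                  up_write(&adev->reset_sem);

                  atomic_set(&adev->in_gpu_reset, 0);
                  return 0;
          }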
  6. 01 Jul, 2020 4 commits
    • drm/amdkfd: Fix circular locking dependency warning · d69fd951
      By Mukul Joshi
      [  150.887733] ======================================================
      [  150.893903] WARNING: possible circular locking dependency detected
      [  150.905917] ------------------------------------------------------
      [  150.912129] kfdtest/4081 is trying to acquire lock:
      [  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                       __might_fault+0x3e/0x90
      [  150.924490]
                     but task is already holding lock:
      [  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                      destroy_queue_cpsch+0x29/0x210 [amdgpu]
      [  150.939432]
                     which lock already depends on the new lock.
      
      [  150.947603]
                     the existing dependency chain (in reverse order) is:
      [  150.955074]
                     -> #3 (&dqm->lock_hidden){+.+.}:
      [  150.960822]        __mutex_lock+0xa1/0x9f0
      [  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
      [  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
      [  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
      [  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
      [  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
      [  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.000714]        copy_page_range+0xd70/0xd80
      [  151.005159]        dup_mm+0x3ca/0x550
      [  151.008816]        copy_process+0x1bdc/0x1c70
      [  151.013183]        _do_fork+0x76/0x6c0
      [  151.016929]        __x64_sys_clone+0x8c/0xb0
      [  151.021201]        do_syscall_64+0x4a/0x1d0
      [  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.030977]
                     -> #2 (&adev->notifier_lock){+.+.}:
      [  151.036993]        __mutex_lock+0xa1/0x9f0
      [  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
      [  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
      [  151.053277]        copy_page_range+0xd70/0xd80
      [  151.057722]        dup_mm+0x3ca/0x550
      [  151.061388]        copy_process+0x1bdc/0x1c70
      [  151.065748]        _do_fork+0x76/0x6c0
      [  151.069499]        __x64_sys_clone+0x8c/0xb0
      [  151.073765]        do_syscall_64+0x4a/0x1d0
      [  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.083523]
                     -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
      [  151.090833]        change_protection+0x802/0xab0
      [  151.095448]        mprotect_fixup+0x187/0x2d0
      [  151.099801]        setup_arg_pages+0x124/0x250
      [  151.104251]        load_elf_binary+0x3a4/0x1464
      [  151.108781]        search_binary_handler+0x6c/0x210
      [  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
      [  151.118875]        do_execve+0x21/0x30
      [  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
      [  151.128393]        ret_from_fork+0x24/0x30
      [  151.132489]
                     -> #0 (&mm->mmap_sem#2){++++}:
      [  151.138064]        __lock_acquire+0x11a1/0x1490
      [  151.142597]        lock_acquire+0x90/0x180
      [  151.146694]        __might_fault+0x68/0x90
      [  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
      [  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
      [  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
      [  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
      [  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
      [  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
      [  151.185284]        ksys_ioctl+0x8f/0xb0
      [  151.189118]        __x64_sys_ioctl+0x16/0x20
      [  151.193389]        do_syscall_64+0x4a/0x1d0
      [  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  151.203141]
                     other info that might help us debug this:
      
      [  151.211140] Chain exists of:
                       &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden
      
      [  151.222535]  Possible unsafe locking scenario:
      
      [  151.228447]        CPU0                    CPU1
      [  151.232971]        ----                    ----
      [  151.237502]   lock(&dqm->lock_hidden);
      [  151.241254]                                lock(&adev->notifier_lock);
      [  151.247774]                                lock(&dqm->lock_hidden);
      [  151.254038]   lock(&mm->mmap_sem#2);
      
      This commit fixes the warning by ensuring get_user() is not called
      with dqm_lock held while reading SDMA stats: get_user() can cause a
      page fault, which leads to the circular locking scenario above (a
      sketch of the fixed ordering follows this entry).
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      d69fd951
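
      The shape of the fix is an ordering change: do the user-space read
      (which may fault and take mmap_sem) before taking the dqm lock. A
      hedged sketch, with simplified signatures borrowed from the trace
      above rather than the exact upstream code:

          /* Illustrative only; the real code lives in destroy_queue_cpsch()
           * and friends. The key point is that a user-space read may
           * page-fault, so it must not run under dqm->lock_hidden. */
          static int destroy_queue_sketch(struct device_queue_manager *dqm,
                                          struct queue *q)
          {
                  uint64_t sdma_val = 0;
                  int ret = 0;

                  /* Read the SDMA activity counter from user space BEFORE
                   * taking the dqm lock; a page fault here cannot complete
                   * the cycle mmap_sem -> notifier_lock -> dqm lock. */
                  if (q->properties.type == KFD_QUEUE_TYPE_SDMA)
                          ret = read_sdma_queue_counter(q->properties.read_ptr,
                                                        &sdma_val);

                  dqm_lock(dqm);
                  /* ... destroy the queue, fold sdma_val into the stats ... */
                  dqm_unlock(dqm);
                  return ret;
          }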
    • drm/amd: fix potential memleak in err branch · dc2f832e
      By Bernard Zhao
      The function kobject_init_and_add() allocates memory along the path
      kobject_init_and_add -> kobject_add_varg -> kobject_set_name_vargs
      -> kvasprintf_const -> kstrdup_const -> kstrdup -> kmalloc_track_caller
      -> kmalloc_slab; in the error branch this memory is not freed, and
      kmemleak may catch this path. These changes add kobject_put() in the
      kobject_init_and_add() failure branch to fix the potential memory
      leak (see the sketch after this entry).
      Signed-off-by: Bernard Zhao <bernard@vivo.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      dc2f832e
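
      The fix follows the standard kobject pattern: kobject_init_and_add()
      takes a reference and duplicates the name even when it fails, so the
      error branch must release it with kobject_put() rather than a plain
      kfree(). A minimal sketch (the wrapper and its arguments are
      illustrative):

          #include <linux/kobject.h>

          static int example_create_kobj(struct kobject *kobj,
                                         struct kobj_type *ktype,
                                         struct kobject *parent,
                                         const char *name)
          {
                  int ret = kobject_init_and_add(kobj, ktype, parent,
                                                 "%s", name);

                  if (ret) {
                          /* kobject_init_and_add() holds a reference and
                           * has duplicated 'name' even on failure;
                           * kobject_put() releases both through the
                           * ktype's release callback, fixing the leak
                           * that kmemleak reports. */
                          kobject_put(kobj);
                  }
                  return ret;
          }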
    • drm/amdkfd: label internally used symbols as static · 204d8998
      By Nirmoy Das
      Used sparse (make C=1) to find these loose ends (a small example
      follows this entry).
      
      v2:
      removed an unwanted extra line
      Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      204d8998
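
      For reference, this is the class of warning sparse emits and the
      one-word fix; the symbol name below is made up:

          /* make C=1 runs sparse, which reports for each file-local symbol:
           *
           *   warning: symbol 'kfd_example_helper' was not declared.
           *   Should it be static?
           *
           * The fix is internal linkage: */
          static int kfd_example_helper(void) /* was: int kfd_example_helper(void) */
          {
                  return 0;
          }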
    • drm/amdkfd: fix ref count leak when pm_runtime_get_sync fails · 1c1ada37
      By Alex Deucher
      The call to pm_runtime_get_sync() increments the usage counter even
      in case of failure, leading to an incorrect ref count. In case of
      failure, decrement the ref count before returning (see the sketch
      after this entry).
      Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      1c1ada37
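
      The canonical shape of this class of fix, sketched below; which put
      variant (noidle vs. autosuspend) the actual patch uses is not shown
      here, so treat that choice as illustrative:

          #include <linux/pm_runtime.h>

          static int example_resume_and_use(struct device *dev)
          {
                  int ret = pm_runtime_get_sync(dev);

                  if (ret < 0) {
                          /* The usage count was incremented even though
                           * resume failed; drop it so the count stays
                           * balanced. */
                          pm_runtime_put_noidle(dev);
                          return ret;
                  }

                  /* ... use the device ... */

                  pm_runtime_mark_last_busy(dev);
                  pm_runtime_put_autosuspend(dev);
                  return 0;
          }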
  7. 25 Jun, 2020 1 commit
  8. 30 May, 2020 1 commit
  9. 29 May, 2020 1 commit
  10. 01 May, 2020 2 commits
  11. 29 Apr, 2020 1 commit
  12. 11 Mar, 2020 1 commit
  13. 07 Mar, 2020 1 commit
  14. 13 Feb, 2020 1 commit
    • drm/amdkfd: refactor runtime pm for baco · 9593f4d6
      By Rajneesh Bhardwaj
      So far the kfd driver implemented the same routines for runtime and
      system-wide suspend and resume (s2idle or mem). During system-wide
      suspend the kfd acquires an atomic lock that prevents any more user
      processes from creating queues and interacting with the kfd driver
      and the AMD GPU. This mechanism created a problem when the amdgpu
      device is runtime-suspended with BACO enabled: any application that
      relies on the kfd driver fails to load, because the driver reports a
      locked kfd device while the GPU is runtime-suspended.
      
      However, in the ideal case, when the GPU is runtime-suspended the kfd
      driver should be able to:
      
       - auto-resume the amdgpu driver whenever a client requests compute services
       - prevent runtime suspend of amdgpu while kfd is in use
      
      This change refactors the amdgpu and amdkfd drivers to support BACO
      and runtime power management (see the sketch after this entry).
      Reviewed-by: Oak Zeng <oak.zeng@amd.com>
      Reviewed-by: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      9593f4d6
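
      A hedged sketch of the two behaviors listed above, expressed with the
      standard runtime-PM calls; the hook names are illustrative, and the
      real refactor spans both amdgpu and amdkfd:

          #include <linux/pm_runtime.h>

          /* A compute client binds to kfd: resume the GPU if it is
           * runtime-suspended (e.g. in BACO) and pin it active while the
           * client exists. */
          static int example_kfd_client_bind(struct device *drm_dev)
          {
                  int ret = pm_runtime_get_sync(drm_dev);

                  if (ret < 0) {
                          pm_runtime_put_noidle(drm_dev);
                          return ret;
                  }
                  return 0;
          }

          /* The last client went away: release the reference so amdgpu
           * may enter BACO again through runtime suspend. */
          static void example_kfd_client_unbind(struct device *drm_dev)
          {
                  pm_runtime_mark_last_busy(drm_dev);
                  pm_runtime_put_autosuspend(drm_dev);
          }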
  15. 07 Feb, 2020 1 commit
  16. 17 Jan, 2020 1 commit
  17. 10 Jan, 2020 1 commit
  18. 14 Nov, 2019 1 commit
  19. 03 Oct, 2019 2 commits
  20. 22 Aug, 2019 1 commit
  21. 20 Aug, 2019 2 commits
  22. 15 Aug, 2019 1 commit
  23. 22 Jun, 2019 1 commit
  24. 21 Jun, 2019 1 commit
  25. 19 Feb, 2019 1 commit
  26. 06 Nov, 2018 1 commit
  27. 30 Aug, 2018 2 commits
  28. 12 Jul, 2018 1 commit
  29. 11 Apr, 2018 2 commits
  30. 24 Mar, 2018 2 commits
  31. 16 Mar, 2018 1 commit