提交 · 817154c1a2035fdecead78873245f5b528da451e · openeuler / Kernel

27 8月, 2020 2 次提交

drm/amdkfd: call amdgpu_amdkfd_get_unique_id directly · 817154c1

由 Felix Kuehling 提交于 8月 24, 2020

No need to use a function pointer because the implementation is not
ASIC-specific. This fixes missing support due to a missing function
pointer on Arcturus.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

817154c1

drm/amdkfd: implement the dGPU fallback path for apu (v6) · 6127896f

由 Huang Rui 提交于 8月 18, 2020

We still have a few iommu issues which need to address, so force raven
as "dgpu" path for the moment.

This is to add the fallback path to bypass IOMMU if IOMMU v2 is disabled
or ACPI CRAT table not correct.

v2: Use ignore_crat parameter to decide whether it will go with IOMMUv2.
v3: Align with existed thunk, don't change the way of raven, only renoir
    will use "dgpu" path by default.
v4: don't update global ignore_crat in the driver, and revise fallback
    function if CRAT is broken.
v5: refine acpi crat good but no iommu support case, and rename the
    title.
v6: fix the issue of dGPU initialized firstly, just modify the report
    value in the node_show().
Signed-off-by: NHuang Rui <ray.huang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

6127896f

25 8月, 2020 1 次提交

drm/amdkfd: sparse: Fix warning in reading SDMA counters · 818b0324

由 Mukul Joshi 提交于 8月 18, 2020

Add __user annotation to fix related sparse warning while reading
SDMA counters from userland.
Also, rework the read SDMA counters function by removing redundant
checks.
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

818b0324

19 8月, 2020 1 次提交

drm/amdkfd: Initialize SDMA activity counter to 0 · 5960e022

由 Mukul Joshi 提交于 8月 17, 2020

To prevent reporting erroneous SDMA usage, initialize SDMA
activity counter to 0 before using.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5960e022

15 8月, 2020 1 次提交

drm/amdgpu: revert "fix system hang issue during GPU reset" · f1403342

由 Christian König 提交于 8月 12, 2020

The whole approach wasn't thought through till the end.

We already had a reset lock like this in the past and it caused the same problems like this one.

Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.

This reverts commit df9c8d1a.
Signed-off-by: NChristian König <christian.koenig@amd.com>
Acked-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f1403342

11 8月, 2020 3 次提交

drm/amdkfd: Fix spurious debug exception on gfx10 · b60646a2

由 Jay Cornwall 提交于 7月 24, 2020

s_barrier triggers a debug exception when issued with PRIV=1,
DEBUG_EN=1. This causes spurious notifications to rocm-gdb.

Clear MODE before issuing s_barrier and restore MODE afterwards
in the context restore handler.
Signed-off-by: NJay Cornwall <jay.cornwall@amd.com>
Tested-by: NLaurent Morichetti <laurent.morichetti@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b60646a2

Revert "drm/amdkfd: Unify gfx9/gfx10 context save area layouts" · c342d7c5

由 Felix Kuehling 提交于 8月 07, 2020

This reverts commit 0a5baee4.

The change introduced a regression on some chips. Reverting until
a proper solution can be found.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c342d7c5

Revert "drm/amdkfd: Fix spurious debug exception on gfx10" · 52189922

由 Felix Kuehling 提交于 8月 07, 2020

This reverts commit ea368183.

Needed due to conflicts when reverting "drm/amdkfd: Unify gfx9/gfx10
context save area layouts".
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

52189922

05 8月, 2020 1 次提交

drm/amdkfd: Replace bitmask with event idx in SMI event msg · 522ec6e0

由 Mukul Joshi 提交于 7月 30, 2020

Event bitmask is a 64-bit mask with only 1 bit set. Sending this
event bitmask in KFD SMI event message is both wasteful of memory
and potentially limiting to only 64 events. Instead send event
index in SMI event message.
Please note this change does not break the ABI for the two event
types defined so far. The new index is identical to the mask used
before.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Suggested-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

522ec6e0

28 7月, 2020 4 次提交

drm/amdkfd: Fix spurious debug exception on gfx10 · ea368183

由 Jay Cornwall 提交于 7月 24, 2020

s_barrier triggers a debug exception when issued with PRIV=1,
DEBUG_EN=1. This causes spurious notifications to rocm-gdb.

Clear MODE before issuing s_barrier and restore MODE afterwards
in the context restore handler.
Signed-off-by: NJay Cornwall <jay.cornwall@amd.com>
Tested-by: NLaurent Morichetti <laurent.morichetti@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

ea368183

drm/amdkfd: Add thermal throttling SMI event · 2c2b0d88

由 Mukul Joshi 提交于 7月 23, 2020

Add support for reporting thermal throttling events through SMI.
Also, add a counter to count the number of throttling interrupts
observed and report the count in the SMI event message.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2c2b0d88

drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a

由 Dennis Li 提交于 7月 08, 2020

when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
re-entering GPU recovery.

During GPU reset and resume, it is unsafe that other threads access GPU,
which maybe cause GPU reset failed. Therefore the new rw_semaphore
adev->reset_sem is introduced, which protect GPU from being accessed by
external threads during recovery.

v2:
1. add rwlock for some ioctls, debugfs and file-close function.
2. change to use dqm->is_resetting and dqm_lock for protection in kfd
driver.
3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
re-enter GPU recovery for the same GPU hang.

v3:
1. change back to use adev->reset_sem to protect kfd callback
functions, because dqm_lock couldn't protect all codes, for example:
free_mqd must be called outside of dqm_lock;

[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249]  dump_stack+0x98/0xd5
[ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831]  ksys_ioctl+0x98/0xb0
[ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
[ 1230.205174]  do_syscall_64+0x5f/0x250
[ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

2. remove try_lock and introduce atomic hive->in_reset, to avoid
re-enter GPU recovery.

v4:
1. remove an unnecessary whitespace change in kfd_chardev.c
2. remove comment codes in amdgpu_device.c
3. add more detailed comment in commit message
4. define a wrap function amdgpu_in_reset

v5:
1. Fix some style issues.
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: NChristian König <christian.koenig@amd.com>
Suggested-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Suggested-by: NLijo Lazar <Lijo.Lazar@amd.com>
Suggested-by: NLuben Tukov <luben.tuikov@amd.com>
Signed-off-by: NDennis Li <Dennis.Li@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

df9c8d1a

drm/amdkfd: Unify gfx9/gfx10 context save area layouts · 0a5baee4

由 Laurent Morichetti 提交于 7月 06, 2020

Add some padding before the MODE register in the HWREGs block to
preserve the same layout as gfx9. This simplifies implementation of a
user-mode debugger.
Signed-off-by: NLaurent Morichetti <laurent.morichetti@amd.com>
Reviewed-by: NJay Cornwall <jay.cornwall@amd.com>
Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0a5baee4

16 7月, 2020 5 次提交

drm/amd/amdkfd: Fix large framesize for kfd_smi_ev_read() · 6e14adea

由 Aurabindo Pillai 提交于 5月 19, 2020

The buffer allocated is of 1024 bytes. Allocate this from
heap instead of stack.

Also remove check for stack size since we're allocating from heap
Signed-off-by: NAurabindo Pillai <aurabindo.pillai@amd.com>
Tested-by: NAmber Lin <Amber.Lin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

6e14adea

drm/amdkfd: Provide SMI events watch · 938a0650

由 Amber Lin 提交于 5月 13, 2020

When the compute is malfunctioning or performance drops, the system admin
will use SMI (System Management Interface) tool to monitor/diagnostic what
went wrong. This patch provides an event watch interface for the user
space to register devices and subscribe events they are interested. After
registered, the user can use annoymous file descriptor's poll function
with wait-time specified and wait for events to happen. Once an event
happens, the user can use read() to retrieve information related to the
event.

VM fault event is done in this patch.

v2: - remove UNREGISTER and add event ENABLE/DISABLE
    - correct kfifo usage
    - move event message API to kfd_ioctl.h
v3: send the event msg in text than in binary
v4: support multiple clients
v5: move events enablement from ioctl to fd write
v6: sparse fix
Signed-off-by: NAmber Lin <Amber.Lin@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

938a0650

drm/amdkfd: Add kfd2kgd_funcs for navy_flounder kfd support · 09759e13

由 Chengming Gui 提交于 6月 05, 2020

Add callbacks to KGD for navy flounder.
Signed-off-by: NChengming Gui <Jack.Gui@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

09759e13

drm/amdkfd: Support navy_flounder KFD · de89b2e4

由 Chengming Gui 提交于 6月 02, 2020

Add KFD support for Navy Flounder.
Signed-off-by: NChengming Gui <Jack.Gui@amd.com>
Reviewed-by: NTao Zhou <tao.zhou1@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

de89b2e4

drm/amdkfd: fix kernel-doc and cleanup · a4497974

由 Rajneesh Bhardwaj 提交于 7月 13, 2020

 - fix some styling issues
 - fixes for kernel-doc type
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a4497974

08 7月, 2020 1 次提交

drm/amdkfd: Remove redundant kfd2kgd interface lookup · c1213911

由 Felix Kuehling 提交于 6月 30, 2020

kfd_pasid.c isn't using the kfd2kgd interface any more. Remove redundant
code trying to look up a device for finding that interface.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NKent Russell <kent.russell@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c1213911

03 7月, 2020 2 次提交

drm/amdkfd: Add Arcturus GWS support and fix VG10 · fea7d919

由 Joseph Greathouse 提交于 6月 29, 2020

Add support for GWS in Arcturus, which needs MEC2 firmware #48
or above. Fix the MEC2 version check for Vega 10 GWS support,
since Vega 10 firmware adds 0x8000 to the actual firmware
revision. We were previously declaring support where it did not
exist.
Signed-off-by: NJoseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: NKent Russell <kent.russell@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

fea7d919

drm/amdkfd: Update hardware scheduling time quanta · 5d7c6f18

由 Joseph Greathouse 提交于 6月 29, 2020

Update PROCESS_QUANTUM, the time the hardware scheduler allows
processes to run before switching to other processes when it becomes
over-subscribed. Increase this to 10ms, to allow processes to better
amortize their task switch times.

Update HQD Quantum, the amount of time that an active queue stays
attached to the CP before we forcibly switch it for another active
queue for fairness.

Setting these so that HQD < PROCESS makes it easier to ensure that
we get fairness when we have multiple active queues on the device.
Otherwise we may start process-swapping before we get to all the
queues in a CP.
Signed-off-by: NJoseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5d7c6f18

01 7月, 2020 12 次提交

drm/amdkfd: Fix circular locking dependency warning · d69fd951

由 Mukul Joshi 提交于 6月 24, 2020

[  150.887733] ======================================================
[  150.893903] WARNING: possible circular locking dependency detected
[  150.905917] ------------------------------------------------------
[  150.912129] kfdtest/4081 is trying to acquire lock:
[  150.917002] ffff8f7f3762e118 (&mm->mmap_sem#2){++++}, at:
                                 __might_fault+0x3e/0x90
[  150.924490]
               but task is already holding lock:
[  150.930320] ffff8f7f49d229e8 (&dqm->lock_hidden){+.+.}, at:
                                destroy_queue_cpsch+0x29/0x210 [amdgpu]
[  150.939432]
               which lock already depends on the new lock.

[  150.947603]
               the existing dependency chain (in reverse order) is:
[  150.955074]
               -> #3 (&dqm->lock_hidden){+.+.}:
[  150.960822]        __mutex_lock+0xa1/0x9f0
[  150.964996]        evict_process_queues_cpsch+0x22/0x120 [amdgpu]
[  150.971155]        kfd_process_evict_queues+0x3b/0xc0 [amdgpu]
[  150.977054]        kgd2kfd_quiesce_mm+0x25/0x60 [amdgpu]
[  150.982442]        amdgpu_amdkfd_evict_userptr+0x35/0x70 [amdgpu]
[  150.988615]        amdgpu_mn_invalidate_hsa+0x41/0x60 [amdgpu]
[  150.994448]        __mmu_notifier_invalidate_range_start+0xa4/0x240
[  151.000714]        copy_page_range+0xd70/0xd80
[  151.005159]        dup_mm+0x3ca/0x550
[  151.008816]        copy_process+0x1bdc/0x1c70
[  151.013183]        _do_fork+0x76/0x6c0
[  151.016929]        __x64_sys_clone+0x8c/0xb0
[  151.021201]        do_syscall_64+0x4a/0x1d0
[  151.025404]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  151.030977]
               -> #2 (&adev->notifier_lock){+.+.}:
[  151.036993]        __mutex_lock+0xa1/0x9f0
[  151.041168]        amdgpu_mn_invalidate_hsa+0x30/0x60 [amdgpu]
[  151.047019]        __mmu_notifier_invalidate_range_start+0xa4/0x240
[  151.053277]        copy_page_range+0xd70/0xd80
[  151.057722]        dup_mm+0x3ca/0x550
[  151.061388]        copy_process+0x1bdc/0x1c70
[  151.065748]        _do_fork+0x76/0x6c0
[  151.069499]        __x64_sys_clone+0x8c/0xb0
[  151.073765]        do_syscall_64+0x4a/0x1d0
[  151.077952]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  151.083523]
               -> #1 (mmu_notifier_invalidate_range_start){+.+.}:
[  151.090833]        change_protection+0x802/0xab0
[  151.095448]        mprotect_fixup+0x187/0x2d0
[  151.099801]        setup_arg_pages+0x124/0x250
[  151.104251]        load_elf_binary+0x3a4/0x1464
[  151.108781]        search_binary_handler+0x6c/0x210
[  151.113656]        __do_execve_file.isra.40+0x7f7/0xa50
[  151.118875]        do_execve+0x21/0x30
[  151.122632]        call_usermodehelper_exec_async+0x17e/0x190
[  151.128393]        ret_from_fork+0x24/0x30
[  151.132489]
               -> #0 (&mm->mmap_sem#2){++++}:
[  151.138064]        __lock_acquire+0x11a1/0x1490
[  151.142597]        lock_acquire+0x90/0x180
[  151.146694]        __might_fault+0x68/0x90
[  151.150879]        read_sdma_queue_counter+0x5f/0xb0 [amdgpu]
[  151.156693]        update_sdma_queue_past_activity_stats+0x3b/0x90 [amdgpu]
[  151.163725]        destroy_queue_cpsch+0x1ae/0x210 [amdgpu]
[  151.169373]        pqm_destroy_queue+0xf0/0x250 [amdgpu]
[  151.174762]        kfd_ioctl_destroy_queue+0x32/0x70 [amdgpu]
[  151.180577]        kfd_ioctl+0x223/0x400 [amdgpu]
[  151.185284]        ksys_ioctl+0x8f/0xb0
[  151.189118]        __x64_sys_ioctl+0x16/0x20
[  151.193389]        do_syscall_64+0x4a/0x1d0
[  151.197569]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  151.203141]
               other info that might help us debug this:

[  151.211140] Chain exists of:
                 &mm->mmap_sem#2 --> &adev->notifier_lock --> &dqm->lock_hidden

[  151.222535]  Possible unsafe locking scenario:

[  151.228447]        CPU0                    CPU1
[  151.232971]        ----                    ----
[  151.237502]   lock(&dqm->lock_hidden);
[  151.241254]                                lock(&adev->notifier_lock);
[  151.247774]                                lock(&dqm->lock_hidden);
[  151.254038]   lock(&mm->mmap_sem#2);

This commit fixes the warning by ensuring get_user() is not called
while reading SDMA stats with dqm_lock held as get_user() could cause a
page fault which leads to the circular locking scenario.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

d69fd951

drm/amd: fix potential memleak in err branch · dc2f832e

由 Bernard Zhao 提交于 6月 20, 2020

The function kobject_init_and_add alloc memory like:
kobject_init_and_add->kobject_add_varg->kobject_set_name_vargs
->kvasprintf_const->kstrdup_const->kstrdup->kmalloc_track_caller
->kmalloc_slab, in err branch this memory not free. If use
kmemleak, this path maybe catched.
These changes are to add kobject_put in kobject_init_and_add
failed branch, fix potential memleak.
Signed-off-by: NBernard Zhao <bernard@vivo.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

dc2f832e

drm/amdkfd: label internally used symbols as static · 204d8998

由 Nirmoy Das 提交于 6月 18, 2020

Used sparse(make C=1) to find these loose ends.

v2:
removed unwanted extra line
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

204d8998

drm/amdkfd: fix ref count leak when pm_runtime_get_sync fails · 1c1ada37

由 Alex Deucher 提交于 6月 17, 2020

The call to pm_runtime_get_sync increments the counter even in case of
failure, leading to incorrect ref count.
In case of failure, decrement the ref count before returning.
Reviewed-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

1c1ada37

drm/amdkfd: Fix reference count leaks. · 20eca012

由 Qiushi Wu 提交于 6月 13, 2020

kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object.
Signed-off-by: NQiushi Wu <wu000273@umn.edu>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

20eca012

drm/amdkfd: Add eviction debug messages · b2057956

由 Felix Kuehling 提交于 6月 11, 2020

Use WARN to print messages with backtrace when evictions are triggered.
This can help determine the root cause of evictions and help spot driver
bugs triggering evictions unintentionally, or help with performance tuning
by avoiding conditions that cause evictions in a specific workload.

The messages are controlled by a new module parameter that can be changed
at runtime:

  echo Y > /sys/module/amdgpu/parameters/debug_evictions
  echo N > /sys/module/amdgpu/parameters/debug_evictions
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b2057956

drm/amdkfd: Use correct major in devcgroup check · 7159562a

由 Lorenz Brun 提交于 6月 11, 2020

The existing code used the major version number of the DRM driver
instead of the device major number of the DRM subsystem for
validating access for a devices cgroup.

This meant that accesses allowed by the devices cgroup weren't
permitted and certain accesses denied by the devices cgroup were
permitted (if they matched the wrong major device number).
Signed-off-by: NLorenz Brun <lorenz@brun.one>
Fixes: 6b855f7b ("drm/amdkfd: Check against device cgroup")
Reviewed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

7159562a

drm/amdkfd: sienna_cichlid virtual function support · adab4dad

由 shaoyunl 提交于 2月 07, 2020

amdkfd add support for sienna_cichlid virtual function
Signed-off-by: Nshaoyunl <shaoyun.liu@amd.com>
Reviewed-by: NYong Zhao <Yong.Zhao@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

adab4dad

drm/amdkfd: Support debugger in Navi1x trap handler · 3cefc718

由 Jay Cornwall 提交于 11月 21, 2019

- Preserve scalar GPRs ttmp[4:11] and ttmp13
- Add single step exception during context save workaround
- Remove incorrect PC adjustment during context save
Signed-off-by: NJay Cornwall <jay.cornwall@amd.com>
Reviewed-by: NYong Zhao <Yong.Zhao@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

3cefc718

drm/amdkfd: Support newer assemblers in gfx10 trap handler · d0f1a853

由 Jay Cornwall 提交于 11月 20, 2019

The contents of macros are parsed by the assembler before conditions
have been tested. This causes assembly errors when using IP-specific
instructions in the IP-unified trap handler.

Add a preprocessing step to filter IP-specific code.

Also guard a Navi1x-specific instruction (no effect on Sienna_Cichlid).
Signed-off-by: NJay Cornwall <jay.cornwall@amd.com>
Reviewed-by: NYong Zhao <Yong.Zhao@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

d0f1a853

drm/amdkfd: Add Sienna_Cichlid trap handler support · 80b6cfed

由 Jay Cornwall 提交于 11月 01, 2019

- Replace SQC stores with TCP stores
- Synchronize with MSG_SAVEWAVE via lgkmcnt
- HW_REG_IB_STS is now read-only
Signed-off-by: NJay Cornwall <jay.cornwall@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

80b6cfed

drm/amdkfd: Support Sienna_Cichlid KFD v4 · 3a2f0c81

由 Yong Zhao 提交于 10月 01, 2019

v4: drop get_tile_config, comment out other callbacks
Signed-off-by: NYong Zhao <Yong.Zhao@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

3a2f0c81

18 6月, 2020 1 次提交

drm/amdkfd: Use correct major in devcgroup check · 99c7b309

由 Lorenz Brun 提交于 6月 11, 2020

The existing code used the major version number of the DRM driver
instead of the device major number of the DRM subsystem for
validating access for a devices cgroup.

This meant that accesses allowed by the devices cgroup weren't
permitted and certain accesses denied by the devices cgroup were
permitted (if they matched the wrong major device number).
Signed-off-by: NLorenz Brun <lorenz@brun.one>
Fixes: 6b855f7b ("drm/amdkfd: Check against device cgroup")
Reviewed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

99c7b309

10 6月, 2020 1 次提交

mmap locking API: use coccinelle to convert mmap_sem rwsem call sites · d8ed45c5

由 Michel Lespinasse 提交于 6月 08, 2020

This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:

// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)
Signed-off-by: NMichel Lespinasse <walken@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: NLaurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d8ed45c5

30 5月, 2020 1 次提交

drm/amdkfd: fix a dereference of pdd before it is null checked · 2652bda7

由 Colin Ian King 提交于 5月 28, 2020

Currently pointer pdd is being dereferenced when assigning pointer
dpm and then pdd is being null checked.  Fix this by checking if
pdd is null before the dereference of pdd occurs.

Addresses-Coverity: ("Dereference before null check")
Fixes: 32cb59f3 ("drm/amdkfd: Track SDMA utilization per process")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2652bda7

29 5月, 2020 2 次提交

drm/amdkfd: Fix GCC 10 compiler warning · 83a13ef5

由 Felix Kuehling 提交于 5月 22, 2020

GCC 10 was complaining about how we append data to a buffer using snprintf:

drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c: In function ‘perf_show’:
drivers/gpu/drm/amd/amdgpu/../amdkfd/kfd_topology.c:214:3: warning: ‘snprintf’ argument 4 overlaps destination object ‘buf’ [-Wrestrict]
  214 |   snprintf(buffer, PAGE_SIZE, "%s"fmt, buffer, __VA_ARGS__)
      |   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This patch fixes the warnings and makes the sysfs code more efficient
by remembering the offset in the buffer between append operations.
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Acked-by: NAaron Liu <aaron.liu@amd.com>
Tested-by: NAaron Liu <aaron.liu@amd.com>
Reviewed-by: NAmber Lin <Amber.Lin@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

83a13ef5

drm/amdkfd: Track SDMA utilization per process · 32cb59f3

由 Mukul Joshi 提交于 5月 26, 2020

Track SDMA usage on a per process basis and report it through sysfs.
The value in the sysfs file indicates the amount of time SDMA has
been in-use by this process since the creation of the process.
This value is in microsecond granularity.
Signed-off-by: NMukul Joshi <mukul.joshi@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

32cb59f3

22 5月, 2020 2 次提交

drm/amdkfd: report the real PCI bus number · 997769fa

由 Evan Quan 提交于 5月 21, 2020

Since the PCI bus number retrieved by PCI_BUS_NUM(pdev->devfn)
is wrong.
Signed-off-by: NEvan Quan <evan.quan@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

997769fa

drm/amdkfd: Fix boolreturn.cocci warnings · 8c8e1f69

由 Aishwarya Ramakrishnan 提交于 5月 18, 2020

Return statements in functions returning bool should use
true/false instead of 1/0.

drivers/gpu/drm/amd/amdkfd/kfd_int_process_v9.c:40:9-10:
WARNING: return of 0/1 in function 'event_interrupt_isr_v9' with return type bool

Generated by: scripts/coccinelle/misc/boolreturn.cocci
Signed-off-by: NAishwarya Ramakrishnan <aishwaryarj100@gmail.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8c8e1f69

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功