提交 · 5ad7bbf3dba5c4a684338df1f285080f2588b535 · openeuler / Kernel

09 2月, 2023 1 次提交

drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini · 5ad7bbf3

由 Guilherme G. Piccoli 提交于 2月 02, 2023

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck
recently, and the function drm_sched_fini() was called even without
its counter-part had been previously called, causing the following oops:

amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 <TASK>
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as the latest
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
field - in the above oops for example, it was a GFX ring causing the crash, and
the sched.ready field was set to true in the ring init routine, regardless of
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
Christian's suggestion [0], and also removed the no_scheduler check [1].

[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/

Fixes: 067f44c8 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: NChristian König <christian.koenig@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: NLuben Tuikov <luben.tuikov@amd.com>
Signed-off-by: NGuilherme G. Piccoli <gpiccoli@igalia.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

5ad7bbf3

02 12月, 2022 1 次提交

drm/amdgpu: MCBP based on DRM scheduler (v9) · 3f4c175d

由 Jiadong.Zhu 提交于 9月 07, 2022

Trigger Mid-Command Buffer Preemption according to the priority of the software
rings and the hw fence signalling condition.

The muxer saves the locations of the indirect buffer frames from the software
ring together with the fence sequence number in its fifo queue, and pops out
those records when the fences are signalled. The locations are used to resubmit
packages in preemption scenarios by coping the chunks from the software ring.

v2: Update comment style.
v3: Fix conflict caused by previous modifications.
v4: Remove unnecessary prints.
v5: Fix corner cases for resubmission cases.
v6: Refactor functions for resubmission, calling fence_process in irq handler.
v7: Solve conflict for removing amdgpu_sw_ring.c.
v8: Add time threshold to judge if preemption request is needed.
v9: Correct comment spelling. Set fence emit timestamp before rsu assignment.

Cc: Christian Koenig <Christian.Koenig@amd.com>
Cc: Luben Tuikov <Luben.Tuikov@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: Michel Dänzer <michel@daenzer.net>
Signed-off-by: NJiadong.Zhu <Jiadong.Zhu@amd.com>
Acked-by: NLuben Tuikov <luben.tuikov@amd.com>
Acked-by: NHuang Rui <ray.huang@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

3f4c175d

29 9月, 2022 1 次提交

drm/amdgpu: Remove fence_process in count_emitted · e7b8e90a

由 Jiadong.Zhu 提交于 9月 23, 2022

The function amdgpu_fence_count_emitted used in work_hander should not call
amdgpu_fence_process which must be used in irq handler.
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NJiadong.Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e7b8e90a

28 9月, 2022 1 次提交

drm/amdgpu: Remove fence_process in count_emitted · 11e38360

由 Jiadong.Zhu 提交于 9月 23, 2022

The function amdgpu_fence_count_emitted used in work_hander should not call
amdgpu_fence_process which must be used in irq handler.
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NJiadong.Zhu <Jiadong.Zhu@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

11e38360

25 7月, 2022 1 次提交

drm/amd: Fix typo 'the the' in comment · 9dd4545f

由 Slark Xiao 提交于 7月 21, 2022

Replace 'the the' with 'the' in the comment.
Signed-off-by: NSlark Xiao <slark_xiao@163.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9dd4545f

13 7月, 2022 1 次提交

drm/amdgpu: support reset flag set for gpu reset · f1549c09

由 Likun Gao 提交于 7月 08, 2022

Move reset_context out of gpu recover function to make it configurable
for different reset purpose.
For the reset way of call gpu_recovery sysfs, force to use full reset
method. Otherwise, try soft reset by default if the related ASIC
supportted, if soft reset failed, will use full reset.
Signed-off-by: NLikun Gao <Likun.Gao@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f1549c09

28 6月, 2022 3 次提交

drm/amdgpu: Follow up change to previous drm scheduler change. · 9ae55f03

由 Andrey Grodzovsky 提交于 6月 20, 2022

Align refcount behaviour for amdgpu_job embedded HW fence with
classic pointer style HW fences by increasing refcount each
time emit is called so amdgpu code doesn't need to make workarounds
using amdgpu_job.job_run_counter to keep the HW fence refcount balanced.

Also since in the previous patch we resumed setting s_fence->parent to NULL
in drm_sched_stop switch to directly checking if job->hw_fence is
signaled to short circuit reset if already signed.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Tested-by: NYiqing Yao <yiqing.yao@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9ae55f03

drm/amdgpu: Prevent race between late signaled fences and GPU reset. · 9e225fb9

由 Andrey Grodzovsky 提交于 6月 18, 2022

Problem:
After we start handling timed out jobs we assume there fences won't be
signaled but we cannot be sure and sometimes they fire late. We need
to prevent concurrent accesses to fence array from
amdgpu_fence_driver_clear_job_fences during GPU reset and amdgpu_fence_process
from a late EOP interrupt.

Fix:
Before accessing fence array in GPU disable EOP interrupt and flush
all pending interrupt handlers for amdgpu device's interrupt line.

v2: Switch from irq_get/put to full enable/disable_irq for amdgpu
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9e225fb9

drm/amdgpu: Add put fence in amdgpu_fence_driver_clear_job_fences · dd70748e

由 Andrey Grodzovsky 提交于 6月 20, 2022

This function should drop the fence refcount when it extracts the
fence from the fence array, just as it's done in amdgpu_fence_process.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

dd70748e

11 6月, 2022 2 次提交

drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover · cf727044

由 Andrey Grodzovsky 提交于 5月 17, 2022

We removed the wrapper that was queueing the recover function
into reset domain queue who was using this name.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

cf727044

drm/amdgpu: Add work_struct for GPU reset from debugfs · 2f83658f

由 Andrey Grodzovsky 提交于 5月 17, 2022

We need to have a work_struct to cancel this reset if another
already in progress.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2f83658f

04 5月, 2022 1 次提交

drm/amdgpu: assign the cpu/gpu address of fence from ring · ae9fd76f

由 Jack Xiao 提交于 3月 20, 2020

assign the cpu/gpu address of fence for the normal or mes ring
from ring structure.
Signed-off-by: NJack Xiao <Jack.Xiao@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

ae9fd76f

10 2月, 2022 1 次提交

drm/amdgpu: Move scheduler init to after XGMI is ready · 5fd8518d

由 Andrey Grodzovsky 提交于 12月 06, 2021

Before we initialize schedulers we must know which reset
domain are we in - for single device there iis a single
domain per device and so single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74112.html

5fd8518d

10 1月, 2022 1 次提交

Revert "drm/amdgpu: stop scheduler when calling hw_fini (v2)" · df5bc0aa

由 Len Brown 提交于 1月 09, 2022

This reverts commit f7d6779d.

This bisected regression has impacted suspend-resume stability
since 5.15-rc1. It regressed -stable via 5.14.10.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=215315
Fixes: f7d6779d ("drm/amdgpu: stop scheduler when calling hw_fini (v2)")
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: Christian Koenig <christian.koenig@amd.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: <stable@vger.kernel.org> # 5.14+
Signed-off-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

df5bc0aa

18 12月, 2021 1 次提交

drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · bf67014d

由 Huang Rui 提交于 12月 16, 2021

The job embedded fence donesn't initialize the flags at
dma_fence_init(). Then we will go a wrong way in
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once we enabled the trace event here. So introduce new amdgpu_fence
object to indicate the job embedded fence.

[  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
[  156.131804] #PF: supervisor read access in kernel mode
[  156.131811] #PF: error_code(0x0000) - not-present page
[  156.131817] PGD 0 P4D 0
[  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
[  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
[  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  156.131848] RIP: 0010:strlen+0x0/0x20
[  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
[  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
[  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
[  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
[  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
[  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
[  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
[  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
[  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  156.131949] Call Trace:
[  156.131953]  <TASK>
[  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
[  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
[  156.131982]  dma_fence_init+0x92/0xb0
[  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
[  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
[  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]

v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
Signed-off-by: NHuang Rui <ray.huang@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bf67014d

17 12月, 2021 1 次提交

drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · 5c1e6fa4

由 Huang Rui 提交于 12月 16, 2021

The job embedded fence donesn't initialize the flags at
dma_fence_init(). Then we will go a wrong way in
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once we enabled the trace event here. So introduce new amdgpu_fence
object to indicate the job embedded fence.

[  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
[  156.131804] #PF: supervisor read access in kernel mode
[  156.131811] #PF: error_code(0x0000) - not-present page
[  156.131817] PGD 0 P4D 0
[  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
[  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
[  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  156.131848] RIP: 0010:strlen+0x0/0x20
[  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
[  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
[  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
[  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
[  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
[  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
[  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
[  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
[  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  156.131949] Call Trace:
[  156.131953]  <TASK>
[  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
[  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
[  156.131982]  dma_fence_init+0x92/0xb0
[  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
[  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
[  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]

v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
Signed-off-by: NHuang Rui <ray.huang@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5c1e6fa4

09 10月, 2021 1 次提交

drm/amdgpu: use adev_to_drm for consistency when accessing drm_device · c58a863b

由 Guchun Chen 提交于 10月 08, 2021

adev_to_drm is used everywhere, so improve recent changes
when accessing drm_device pointer from amdgpu_device.
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c58a863b

02 9月, 2021 1 次提交

dma-buf: nuke DMA_FENCE_TRACE macros v2 · d72277b6

由 Christian König 提交于 7月 29, 2021

Only the DRM GPU scheduler, radeon and amdgpu where using them and they depend
on a non existing config option to actually emit some code.

v2: keep the signal path as is for now
Signed-off-by: NChristian König <christian.koenig@amd.com>
Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
Link: https://patchwork.freedesktop.org/patch/msgid/20210818105443.1578-1-christian.koenig@amd.com

d72277b6

01 9月, 2021 1 次提交

drm/amdgpu: stop scheduler when calling hw_fini (v2) · f7d6779d

由 Guchun Chen 提交于 8月 27, 2021

This gurantees no more work on the ring can be submitted
to hardware in suspend/resume case, otherwise a potential
race will occur and the ring will get no chance to stay
empty before suspend.

v2: Call drm_sched_resubmit_job before drm_sched_start to
restart jobs from the pending list.
Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

f7d6779d

17 8月, 2021 1 次提交

drm/amd/amdgpu embed hw_fence into amdgpu_job · c530b02f

由 Jack Zhang 提交于 5月 12, 2021

Why: Previously hw fence is alloced separately with job.
It caused historical lifetime issues and corner cases.
The ideal situation is to take fence to manage both job
and fence's lifetime, and simplify the design of gpu-scheduler.

How:
We propose to embed hw_fence into amdgpu_job.
1. We cover the normal job submission by this method.
2. For ib_test, and submit without a parent job keep the
legacy way to create a hw fence separately.
v2:
use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
embedded in a job.
v3:
remove redundant variable ring in amdgpu_job
v4:
add tdr sequence support for this feature. Add a job_run_counter to
indicate whether this job is a resubmit job.
v5
add missing handling in amdgpu_fence_enable_signaling
Signed-off-by: NJingwen Chen <Jingwen.Chen2@amd.com>
Signed-off-by: NJack Zhang <Jack.Zhang7@hotmail.com>
Reviewed-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed by: Monk Liu <monk.liu@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c530b02f

06 8月, 2021 1 次提交

drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2) · 067f44c8

由 Guchun Chen 提交于 7月 29, 2021

In amdgpu_fence_driver_hw_fini, no need to call drm_sched_fini to stop
scheduler in s3 test, otherwise, fence related failure will arrive
after resume. To fix this and for a better clean up, move drm_sched_fini
from fence_hw_fini to fence_sw_fini, as it's part of driver shutdown, and
should never be called in hw_fini.

v2: rename amdgpu_fence_driver_init to amdgpu_fence_driver_sw_init,
to keep sw_init and sw_fini paired.

Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1668
Fixes: 8d35a259 ("drm/amdgpu: adjust fence driver enable sequence")
Suggested-by: NChristian König <christian.koenig@amd.com>
Tested-by: NMike Lothian <mike@fireburn.co.uk>
Signed-off-by: NGuchun Chen <guchun.chen@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

067f44c8

29 7月, 2021 1 次提交

drm/amdgpu: adjust fence driver enable sequence · 8d35a259

由 Likun Gao 提交于 7月 26, 2021

Fence driver was enabled per ring when sw init on per IP block before.
Change to enable all the fence driver at the same time after
amdgpu_device_ip_init finished.
Rename some function related to fence to make it reasonable for read.
Signed-off-by: NLikun Gao <Likun.Gao@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8d35a259

01 7月, 2021 1 次提交

drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr · 78efe21b

由 Boris Brezillon 提交于 6月 30, 2021

Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

v5:
* Add a new paragraph to the timedout_job() method

v3:
* New patch

v4:
* Actually use the timeout_wq to queue the timeout work
Suggested-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: NBoris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: NSteven Price <steven.price@arm.com>
Reviewed-by: NLucas Stach <l.stach@pengutronix.de>
Acked-by: NDaniel Vetter <daniel.vetter@ffwll.ch>
Acked-by: NChristian König <christian.koenig@amd.com>
Cc: Qiang Yu <yuq825@gmail.com>
Cc: Emma Anholt <emma@anholt.net>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210630062751.2832545-3-boris.brezillon@collabora.com

78efe21b

20 5月, 2021 2 次提交

drm/amdgpu: Fix hang on device removal. · 54a85db8

由 Andrey Grodzovsky 提交于 5月 12, 2021

If removing while commands in flight you cannot wait to flush the
HW fences on a ring since the device is gone.
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-13-andrey.grodzovsky@amd.com

54a85db8

drm/amdgpu: Split amdgpu_device_fini into early and late · 72c8c97b

由 Andrey Grodzovsky 提交于 5月 12, 2021

Some of the stuff in amdgpu_device_fini such as HW interrupts
disable and pending fences finilization must be done right away on
pci_remove while most of the stuff which relates to finilizing and
releasing driver data structures can be kept until
drm_driver.release hook is called, i.e. when the last device
reference is dropped.

v4: Change functions prefix early->hw and late->sw
Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: NChristian König <christian.koenig@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210512142648.666476-3-andrey.grodzovsky@amd.com

72c8c97b

21 4月, 2021 1 次提交

drm/amd/amdgpu/amdgpu_fence: Provide description for 'sched_score' · b16cc4bb

由 Lee Jones 提交于 4月 16, 2021

Fixes the following W=1 kernel build warning(s):

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:444: warning: Function parameter or member 'sched_score' not described in 'amdgpu_fence_driver_init_ring'

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jerome Glisse <glisse@freedesktop.org>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NLee Jones <lee.jones@linaro.org>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b16cc4bb

10 4月, 2021 1 次提交

drm/amdgpu: add the sched_score to amdgpu_ring_init · c107171b

由 Christian König 提交于 2月 02, 2021

Allow separate ring to share the same scheduler score.

No functional change.
Signed-off-by: NChristian König <christian.koenig@amd.com>
Reviewed-and-Tested-by: NLeo Liu <leo.liu@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c107171b

06 3月, 2021 1 次提交

drm/amdgpu: Fix some unload driver issues · bb0cd09b

由 Emily Deng 提交于 3月 04, 2021

When unloading driver after killing some applications, it will hit sdma
flush tlb job timeout which is called by ttm_bo_delay_delete. So
to avoid the job submit after fence driver fini, call ttm_bo_lock_delayed_workqueue
before fence driver fini. And also put drm_sched_fini before waiting fence.
Signed-off-by: NEmily Deng <Emily.Deng@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bb0cd09b

19 2月, 2021 1 次提交

drm/amdgpu: do not use drm middle layer for debugfs · 98d28ac2

由 Nirmoy Das 提交于 2月 15, 2021

Use debugfs API directly instead of drm middle layer.

This also includes following debugfs file output changes:
1 amdgpu_evict_vram/amdgpu_evict_gtt output will not contain any braces.
  e.g. (0) --> 0
2 amdgpu_gpu_recover output will print return value of
  amdgpu_device_gpu_recover() instead of not so important "gpu recover"
  message.

v2: * checkpatch.pl: use '0444' instead of S_IRUGO.
    * remove S_IFREG from mode.
    * remove mode variable.
Signed-off-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

98d28ac2

05 2月, 2021 1 次提交

drm/scheduler: provide scheduler score externally · f2f12eb9

由 Christian König 提交于 2月 02, 2021

Allow multiple schedulers to share the load balancing score.

This is useful when one engine has different hw rings.
Signed-off-by: NChristian König <christian.koenig@amd.com>
Reviewed-and-Tested-by: NLeo Liu <leo.liu@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210204144405.2737-1-christian.koenig@amd.com

f2f12eb9

13 11月, 2020 1 次提交

drm/amd/amdgpu/amdgpu_fence: Fix some issues pertaining to function documentation · f02f8c32

由 Lee Jones 提交于 11月 12, 2020

Fixes the following W=1 kernel build warning(s):

drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:139: warning: Function parameter or member 'flags' not described in 'amdgpu_fence_emit'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:197: warning: Function parameter or member 'timeout' not described in 'amdgpu_fence_emit_polling'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:302: warning: Function parameter or member 't' not described in 'amdgpu_fence_fallback'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:302: warning: Excess function parameter 'work' description in 'amdgpu_fence_fallback'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:320: warning: Excess function parameter 'adev' description in 'amdgpu_fence_wait_empty'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:649: warning: Function parameter or member 'f' not described in 'amdgpu_fence_enable_signaling'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:649: warning: Excess function parameter 'fence' description in 'amdgpu_fence_enable_signaling'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:684: warning: Function parameter or member 'f' not described in 'amdgpu_fence_release'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:684: warning: Excess function parameter 'fence' description in 'amdgpu_fence_release'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:749: warning: Function parameter or member 'm' not described in 'amdgpu_debugfs_gpu_recover'
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c:749: warning: Function parameter or member 'data' not described in 'amdgpu_debugfs_gpu_recover'

Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: "Christian König" <christian.koenig@amd.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Jerome Glisse <glisse@freedesktop.org>
Cc: amd-gfx@lists.freedesktop.org
Cc: dri-devel@lists.freedesktop.org
Cc: linux-media@vger.kernel.org
Cc: linaro-mm-sig@lists.linaro.org
Signed-off-by: NLee Jones <lee.jones@linaro.org>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f02f8c32

25 8月, 2020 2 次提交

drm/amdgpu: Get DRM dev from adev by inline-f · 4a580877

由 Luben Tuikov 提交于 8月 24, 2020

Add a static inline adev_to_drm() to obtain
the DRM device pointer from an amdgpu_device pointer.
Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4a580877

drm/amdgpu: drm_device to amdgpu_device by inline-f (v2) · 1348969a

由 Luben Tuikov 提交于 8月 24, 2020

Get the amdgpu_device from the DRM device by use
of an inline function, drm_to_adev(). The inline
function resolves a pointer to struct drm_device
to a pointer to struct amdgpu_device.

v2: Use a typed visible static inline function
    instead of an invisible macro.
Signed-off-by: NLuben Tuikov <luben.tuikov@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

1348969a

15 7月, 2020 1 次提交

drm/amdgpu: use ARRAY_SIZE() to add amdgpu debugfs files · 9987d70d

由 Xiaojie Yuan 提交于 7月 13, 2020

to easily add new debugfs file w/o changing the hardcoded list count.
Signed-off-by: NXiaojie Yuan <xiaojie.yuan@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

9987d70d

08 7月, 2020 1 次提交

gpu/drm: Remove debug info about CPU address · e241df69

由 Tiezhu Yang 提交于 7月 02, 2020

When I update the latest kernel, I see the following "____ptrval____" boot
messages.

[    1.872600] radeon 0000:01:05.0: fence driver on ring 0 use gpu addr 0x0000000048000c00 and cpu addr 0x(____ptrval____)
[    1.879095] radeon 0000:01:05.0: fence driver on ring 5 use gpu addr 0x0000000040056038 and cpu addr 0x(____ptrval____)

Both radeon_fence_driver_start_ring() and amdgpu_fence_driver_start_ring()
have the similar issue, there exists the following two methods to solve it:
(1) Use "%pK" instead of "%p" so that the CPU address can be printed when
the kptr_restrict sysctl is set to 1.
(2) Just completely drop the CPU address suggested by Christian, because
the CPU address was useful in the past, but isn't any more. We now have a
debugfs file to read the current fence values.

Since the CPU address is not much useful, just remove the debug info about
CPU address.
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NTiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e241df69

01 7月, 2020 5 次提交

drm/amdgpu: restrict the hw sched jobs number to power of two · 5d5bd5e3

由 Kevin Wang 提交于 1月 19, 2020

the module parameter sched_hw_submission is probably from user mode,
and the kernel need to check whether it is legal.

1. align hw sched jobs to power of 2 and set minimum number is 2.
2. use kernel api is_power_of_2() to simplify driver code.
Signed-off-by: NKevin Wang <kevin1.wang@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

5d5bd5e3

drm/amdgpu/fence: fix ref count leak when pm_runtime_get_sync fails · e520d3e0

由 Alex Deucher 提交于 6月 17, 2020

The call to pm_runtime_get_sync increments the counter even in case of
failure, leading to incorrect ref count.
In case of failure, decrement the ref count before returning.
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Acked-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e520d3e0

drm/amdgpu/fence: use the no_scheduler flag · 730c2eb9

由 Alex Deucher 提交于 6月 02, 2020

Rather than checking the ring type manually.  We already set
this for MES and KIQ (and a few other special cases).
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

730c2eb9

drm/amdgpu: skip GPU scheduler setup for KIQ and MES ring · 51450501

由 Likun Gao 提交于 4月 15, 2020

Fix the coding error to skip GPU scheduler setup for KIQ and MES ring.
Signed-off-by: NLikun Gao <Likun.Gao@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

51450501

drm/amdgpu: no need to set up GPU scheduler for mes ring · 03195e80

由 Jack Xiao 提交于 10月 21, 2019

As mes ring directly submits to hardwared,
it's no need to set up GPU scheduler for mes ring.
Signed-off-by: NJack Xiao <Jack.Xiao@amd.com>
Acked-by: NAlex Deucher <alexander.deucher@amd.com>
Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NChristian König <christian.koenig@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

03195e80

openeuler / Kernel 2 年多 前同步成功

openeuler / Kernel
2 年多前同步成功