1. 17 Aug, 2022 1 commit
  2. 13 Jul, 2022 1 commit
  3. 11 Jun, 2022 1 commit
  4. 10 Feb, 2022 4 commits
  5. 08 Jan, 2022 1 commit
  6. 15 Dec, 2021 1 commit
  7. 31 Aug, 2021 1 commit
  8. 13 Jul, 2021 1 commit
  9. 09 Jul, 2021 1 commit
  10. 14 Jan, 2021 1 commit
  11. 16 Dec, 2020 1 commit
    • drm/amdgpu/SRIOV: Extend VF reset request wait period · 3aa883ac
      Authored by Jiange Zhao
      In virtualization, when one VF sends too many FLR
      requests, the hypervisor stops responding to that VF's
      requests for a long period of time. This is called an
      event guard. During this cooling-off period the guest
      driver should wait instead of doing other things; once
      the period ends, the guest driver can resume the reset
      process and return to normal.
      
      Currently, the guest driver waits 12 seconds and returns
      failure if it gets no response from the host.
      
      Solution: extend the waiting time in the guest driver and
      poll for a response periodically. Polling happens every
      6 seconds and lasts for up to 60 seconds (sketched below).
      
      v2: replace the hard-coded maximum repetition count with a macro.
      Signed-off-by: Jiange Zhao <Jiange.Zhao@amd.com>
      Acked-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
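      As a rough illustration of the polling scheme above, here is a
      minimal user-space C sketch; the names (FLR_POLL_INTERVAL_MS,
      FLR_POLL_REPEAT_MAX, host_ack_received, wait_for_flr_ack) are
      hypothetical stand-ins, not the driver's actual symbols.

      #include <stdbool.h>
      #include <unistd.h>

      #define FLR_POLL_INTERVAL_MS 6000  /* poll once every 6 seconds        */
      #define FLR_POLL_REPEAT_MAX  10    /* 10 polls * 6 s = 60 s total wait */

      /* Stub: the real driver would read the mailbox response register. */
      static bool host_ack_received(void)
      {
          return false;
      }

      /* Wait out the hypervisor's event guard: poll periodically instead
       * of failing after a single 12-second wait. */
      static int wait_for_flr_ack(void)
      {
          int i;

          for (i = 0; i < FLR_POLL_REPEAT_MAX; i++) {
              if (host_ack_received())
                  return 0;                        /* host responded      */
              sleep(FLR_POLL_INTERVAL_MS / 1000);  /* keep waiting        */
          }
          return -1;                               /* no response in 60 s */
      }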
  12. 16 Sep, 2020 1 commit
  13. 25 Aug, 2020 2 commits
  14. 19 Aug, 2020 1 commit
  15. 15 Aug, 2020 1 commit
  16. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Authored by Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume it is unsafe for other threads to
      access the GPU, which may cause the reset to fail. Therefore the
      new rw_semaphore adev->reset_sem is introduced to protect the GPU
      from being accessed by external threads during recovery (a
      simplified model of this scheme follows after this message).
      
      v2:
      1. Add an rwlock for some ioctls, debugfs and the file-close
      function.
      2. Use dqm->is_resetting and dqm_lock for protection in the kfd
      driver.
      3. Remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. Change back to using adev->reset_sem to protect the kfd
      callback functions, because dqm_lock cannot cover all code paths;
      for example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. Remove try_lock and introduce the atomic hive->in_reset, to
      avoid re-entering GPU recovery.
      
      v4:
      1. Remove an unnecessary whitespace change in kfd_chardev.c.
      2. Remove commented-out code in amdgpu_device.c.
      3. Add a more detailed explanation to the commit message.
      4. Define a wrapper function amdgpu_in_reset.
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tuikov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
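      A simplified user-space model of the locking scheme described
      above, with pthread and C11 atomics standing in for the kernel's
      rw_semaphore and atomic_t; the function names here are
      illustrative, not the driver's.

      #include <pthread.h>
      #include <stdatomic.h>

      static atomic_int in_gpu_reset;  /* models adev->in_gpu_reset */
      static pthread_rwlock_t reset_sem = PTHREAD_RWLOCK_INITIALIZER;
                                       /* models adev->reset_sem    */

      /* Recovery path: the atomic flag rejects re-entry for the same
       * hang, the write lock keeps every other thread off the GPU. */
      static int gpu_recover(void)
      {
          int expected = 0;

          if (!atomic_compare_exchange_strong(&in_gpu_reset, &expected, 1))
              return 0;  /* recovery already running, do not re-enter */

          pthread_rwlock_wrlock(&reset_sem);
          /* ... reset and resume the hardware here ... */
          pthread_rwlock_unlock(&reset_sem);

          atomic_store(&in_gpu_reset, 0);
          return 1;
      }

      /* Any path that touches the hardware (ioctl, debugfs, file close)
       * takes the read side, so it blocks while recovery holds the
       * write side. */
      static void gpu_access_path(void)
      {
          pthread_rwlock_rdlock(&reset_sem);
          /* ... register access ... */
          pthread_rwlock_unlock(&reset_sem);
      }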
  17. 24 Dec, 2019 1 commit
  18. 12 Dec, 2019 1 commit
    • drm/amd/powerplay: enable pp one vf mode for vega10 · c9ffa427
      Authored by Yintian Tao
      Originally, due to restrictions in the PSP and SMU, the VF had to
      send a message to the hypervisor driver to handle a powerplay
      change, which was complicated and redundant. Now the SMU and PSP
      allow the VF to handle powerplay changes directly by itself.
      Therefore the old handshake code between VF and PF for powerplay
      is removed, and the VF uses the new registers below to handshake
      with the SMU (sketched after this message):
      mmMP1_SMN_C2PMSG_101: register to handle SMU messages
      mmMP1_SMN_C2PMSG_102: register to handle SMU parameters
      mmMP1_SMN_C2PMSG_103: register to handle SMU responses
      
      v2: remove the module parameter pp_one_vf
      v3: fix the parens
      v4: forbid the VF from changing SMU features
      v5: use hwmon_attributes_visible to skip specified hwmon attributes
      v6: change the skip condition in vega10_copy_table_to_smc
      Signed-off-by: Yintian Tao <yttao@amd.com>
      Acked-by: Evan Quan <evan.quan@amd.com>
      Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
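      A hedged sketch of the new handshake: write the parameter and
      message, then poll the response register. Only the three register
      names come from the commit; the offsets, the accessor stubs and
      the smu_send_msg helper are assumptions made for illustration.

      #include <stdint.h>

      /* Register names from the commit; offsets are placeholders. */
      #define mmMP1_SMN_C2PMSG_101 0x0065  /* SMU message   */
      #define mmMP1_SMN_C2PMSG_102 0x0066  /* SMU parameter */
      #define mmMP1_SMN_C2PMSG_103 0x0067  /* SMU response  */

      /* Stubs standing in for the driver's MMIO accessors. */
      static void reg_write(uint32_t reg, uint32_t val) { (void)reg; (void)val; }
      static uint32_t reg_read(uint32_t reg) { (void)reg; return 1; }

      /* One-VF powerplay: the VF talks to the SMU directly instead of
       * round-tripping through the hypervisor. */
      static int smu_send_msg(uint32_t msg, uint32_t param)
      {
          int i;

          reg_write(mmMP1_SMN_C2PMSG_103, 0);      /* clear stale response  */
          reg_write(mmMP1_SMN_C2PMSG_102, param);  /* parameter goes first  */
          reg_write(mmMP1_SMN_C2PMSG_101, msg);    /* message kicks the SMU */

          for (i = 0; i < 1000; i++) {             /* poll for the response */
              if (reg_read(mmMP1_SMN_C2PMSG_103))
                  return 0;
          }
          return -1;                               /* SMU never answered    */
      }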
  19. 02 Aug, 2019 1 commit
  20. 12 Jun, 2019 1 commit
  21. 25 May, 2019 2 commits
  22. 06 May, 2019 1 commit
  23. 11 Apr, 2019 1 commit
  24. 14 Feb, 2019 1 commit
  25. 15 Jan, 2019 1 commit
  26. 03 Jan, 2019 1 commit
  27. 28 Aug, 2018 1 commit
  28. 16 May, 2018 1 commit
  29. 23 Mar, 2018 1 commit
  30. 15 Mar, 2018 2 commits
    • drm/amdgpu: Move IH clientid defs to separate file · 3760f76c
      Authored by Oak Zeng
      This is preparation for sharing the client ID definitions between
      amdgpu and amdkfd (a minimal illustration follows below).
      Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
      Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
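      A minimal sketch of what such a shared header might look like;
      the enum values shown are examples, not the actual contents of
      the file.

      /* Illustrative excerpt of a shared client ID header; the real
       * file carries the complete list so amdgpu and amdkfd include
       * one source of truth instead of each keeping its own copy. */
      #ifndef SOC15_IH_CLIENTID_H
      #define SOC15_IH_CLIENTID_H

      enum soc15_ih_clientid {
          SOC15_IH_CLIENTID_IH   = 0x00,
          SOC15_IH_CLIENTID_VCE0 = 0x07, /* example value */
          SOC15_IH_CLIENTID_MAX
      };

      #endif /* SOC15_IH_CLIENTID_H */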
    • drm/amdgpu: refactoring mailbox to fix TDR handshake bugs(v2) · 48527e52
      Authored by Monk Liu
      This patch refactors the mailbox implementation; all of the
      changes below are needed together to fix the mailbox handshake
      issues exposed by heavy TDR testing. A sketch of the key
      mechanisms follows after this message.
      
      1) Refactor all mailbox functions to use byte access for
      mb_control. The reason is to avoid touching unrelated bits when
      writing the trn/rcv part of mailbox_control; this way spurious
      INTRs to the hypervisor side are avoided, and it fixes a couple
      of handshake bugs.
      
      2) Re-implement the trans_msg function: invalidate the ACK bit
      before transmitting a message to make sure it is clear, otherwise
      there is a chance the ACK is already asserted before the message
      is sent, leading to fake ACK polling. (The hypervisor side has
      some tricks to work around the ACK bit being corrupted by VF FLR,
      with the side effect that the guest-side ACK bit may be asserted
      wrongly.) Also clear the TRANS_MSG words after the message is
      transferred.
      
      3) Rework mailbox_flr_work: it now takes the mutex lock first
      when invoked, to block GPU recovery from participating too early
      while the hypervisor side is doing the VF FLR. (The hypervisor
      sends FLR_NOTIFY to the guest before doing the VF FLR and sends
      FLR_COMPLETE after the VF FLR is done; the FLR_NOTIFY triggers an
      interrupt in the guest, which leads to mailbox_flr_work being
      invoked.)

      This avoids the issue of the mailbox trans msg being cleared by
      the VF FLR.
      
      4) The mailbox_rcv_irq IRQ routine should only peek at the msg
      and schedule mailbox_flr_work, instead of ACKing the hypervisor
      itself, because the FLR_NOTIFY msg sent from the hypervisor side
      does not need the VF's ACK. (The VF's ACK would lead the
      hypervisor to clear its trans_valid/msg, which would cause a
      handshake bug if trans_valid/msg were cleared not by a correct VF
      ACK but by a wrong one, like an ACK to this "FLR_NOTIFY".)

      This fixes a handshake bug where the guest sometimes could never
      receive the "READY_TO_ACCESS_GPU" msg from the hypervisor.
      
      5) Separate the polling time limits accordingly:
      POLL ACK: no more than 500 ms
      POLL MSG: no more than 12000 ms
      POLL FLR finish: no more than 500 ms
      
      6) We still need to put adev into in_gpu_reset mode after
      receiving FLR_NOTIFY from the host side; this prevents an
      innocent app from wrongly succeeding in opening the amdgpu dri
      device.

      FLR_NOTIFY is received when the hypervisor side detects an IDLE
      hang, which indicates the GPU is already dead in this VF.
      
      v2:
      use a MACRO for the offset of the mailbox_control register
      don't test for the NOTIFY_CMPL event in rcv_msg, since it won't
      receive that message anymore
      Signed-off-by: Monk Liu <Monk.Liu@amd.com>
      Reviewed-by: Pixel Ding <Pixel.Ding@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
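      A sketch tying together points 1), 2) and 5) above: byte-granular
      access to mailbox_control, and a transmit path that clears any
      stale ACK before sending and then polls within the stated budget.
      All names, offsets and bit masks here are illustrative
      assumptions, not the driver's actual definitions.

      #include <stdint.h>
      #include <stdbool.h>

      #define MB_POLL_ACK_MS 500    /* POLL ACK budget: 500 ms   */
      #define MB_POLL_MSG_MS 12000  /* POLL MSG budget: 12000 ms */

      /* Byte-wide MMIO stubs: writing only the TRN byte of
       * mailbox_control never disturbs the RCV bits (and vice versa),
       * so no spurious INTR reaches the hypervisor. */
      #define MB_CONTROL_TRN_BYTE 0x0
      #define MB_CONTROL_RCV_BYTE 0x1

      static void    mb_write_byte(uint32_t off, uint8_t v) { (void)off; (void)v; }
      static uint8_t mb_read_byte(uint32_t off) { (void)off; return 0; }

      static bool ack_asserted(void)
      {
          return mb_read_byte(MB_CONTROL_RCV_BYTE) & 0x02;
      }

      static int mb_trans_msg(uint8_t msg)
      {
          int ms;

          /* Invalidate any stale ACK (e.g. left over from a VF FLR) so
           * the poll below cannot succeed on a fake ACK. */
          mb_write_byte(MB_CONTROL_RCV_BYTE, 0);

          mb_write_byte(MB_CONTROL_TRN_BYTE, msg);       /* transmit message */

          for (ms = 0; ms < MB_POLL_ACK_MS; ms++) {
              if (ack_asserted()) {
                  mb_write_byte(MB_CONTROL_TRN_BYTE, 0); /* clear TRANS_MSG  */
                  return 0;
              }
              /* the real driver would mdelay(1) here */
          }
          return -1;  /* no ACK within the 500 ms budget */
      }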
  31. 18 Dec, 2017 1 commit
  32. 16 Dec, 2017 2 commits
  33. 09 Dec, 2017 1 commit