Commits · 54f329cc7a7a7ea265c45b206d45e3d09192aba7 · openeuler / Kernel

10 Feb, 2022 · 3 commits

drm/amdgpu: Serialize non TDR gpu recovery with TDRs · 54f329cc

Committed by Andrey Grodzovsky on Dec 17, 2021

Use reset domain wq also for non TDR gpu recovery triggers
such as sysfs and RAS. We must serialize all possible
GPU recoveries to guarantee no concurrency there.
For TDR call the original recovery function directly since
it's already executed from within the wq. For others just
use a wrapper to queue work and wait on it to finish.

v2: Rename to amdgpu_recover_work_struct
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74113.html

54f329cc

drm/amdgpu: Move scheduler init to after XGMI is ready · 5fd8518d

Committed by Andrey Grodzovsky on Dec 06, 2021

Before we initialize schedulers we must know which reset
domain we are in: for a single device there is a single
domain per device and so a single wq per device. For XGMI
the reset domain spans the entire XGMI hive and so the
reset wq is per hive.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74112.html

5fd8518d

drm/amdgpu: Introduce reset domain · a4c63caf

Committed by Andrey Grodzovsky on Nov 30, 2021

Define a reset_domain struct such that
all the entities that go through reset
together will be serialized one against
another. Do it for both single device and
XGMI hive cases.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Suggested-by: Christian König <ckoenig.leichtzumerken@gmail.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://www.spinics.net/lists/amd-gfx/msg74111.html

a4c63caf

19 Jan, 2022 · 1 commit

drm/amd/amdgpu: fixing read wrong pf2vf data in SRIOV · 9a458402

Committed by Jingwen Chen on Jan 13, 2022

[Why]
This fixes 892deb48 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange").
We should read pf2vf data based at mman.fw_vram_usage_va after gmc
sw_init. Commit 892deb48 breaks this logic.

[How]
Call amdgpu_virt_exchange_data in amdgpu_virt_init_data_exchange to
set the right base in the right sequence.

v2:
call amdgpu_virt_init_data_exchange after gmc sw_init to make data
exchange workqueue run

v3:
clean up the code logic

v4:
add some comments and make the code more readable

Fixes: 892deb48 ("drm/amdgpu: Separate vf2pf work item init from virt data exchange")
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Reviewed-by: Horace Chen <horace.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9a458402

15 Jan, 2022 · 2 commits

drm/amdgpu: invert the logic in amdgpu_device_should_recover_gpu() · 0ffb1fd1

Committed by Alex Deucher on Jan 11, 2022

Rather than opting into GPU recovery support, default to on, and
opt out if it's not working on a particular GPU.  This avoids the
need to add new ASICs to this list since this is a core feature.
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0ffb1fd1
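
The difference between the opt-in and opt-out styles described in this commit can be sketched in user-space C. This is only an illustration under stated assumptions: the chip identifiers and both function names are hypothetical and are not taken from the amdgpu driver.

```c
#include <stdbool.h>

/* Hypothetical chip identifiers, for illustration only. */
enum chip_id { CHIP_A, CHIP_B, CHIP_C, CHIP_NEW };

/* Opt-in style: every supported chip must be listed explicitly. */
static bool should_recover_opt_in(enum chip_id chip)
{
	switch (chip) {
	case CHIP_A:
	case CHIP_B:
		return true;
	default:
		return false;	/* a new chip stays unsupported until it is added */
	}
}

/* Opt-out style: recovery is on unless a chip is known not to work. */
static bool should_recover_opt_out(enum chip_id chip)
{
	switch (chip) {
	case CHIP_C:
		return false;	/* assumed not to work on this chip */
	default:
		return true;	/* a new chip gets recovery without a list update */
	}
}
```

With the opt-out form, a chip that is not mentioned anywhere (CHIP_NEW here) is covered by the default branch, which is the maintenance saving the commit message points at.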

drm/amdgpu: Enable recovery on yellow carp · 4175c32b

Committed by CHANDAN VURDIGERE NATARAJ on Jan 11, 2022

Add yellow carp to devices which support recovery.
Signed-off-by: CHANDAN VURDIGERE NATARAJ <chandan.vurdigerenataraj@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4175c32b

12 Jan, 2022 · 5 commits

drm/amdgpu: not return error on the init_apu_flags · 4eaf21b7

Committed by Prike Liang on Nov 26, 2021

In some APU projects we don't always need to assign flags to tell the APUs apart,
so we need not return an error.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4eaf21b7

drm/amd/amdgpu: Add pcie indirect support to amdgpu_mm_wreg_mmio_rlc() · 4cc9f86f

Committed by Tom St Denis on Jan 07, 2022

The function amdgpu_mm_wreg_mmio_rlc() is used by debugfs to write to
MMIO registers.  It didn't support registers beyond the BAR mapped MMIO
space.  This adds pcie indirect write support.
Signed-off-by: Tom St Denis <tom.stdenis@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4cc9f86f

drm/amdgpu: recover gart table at resume · 575e55ee

Committed by Nirmoy Das on Jan 07, 2022

Get rid of pin/unpin of gart BO at resume/suspend and
instead pin only once and try to recover gart content
at resume time. This is much more stable in case there
is an OOM situation at the 2nd call to amdgpu_device_evict_resources()
while evicting the GART table.

v3: remove gart recovery from other places
v2: pin gart at amdgpu_gart_table_vram_alloc()
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

575e55ee

drm/amdgpu: do not pass ttm_resource_manager to gtt_mgr · 1dd8b1b9

Committed by Nirmoy Das on Jan 07, 2022

Do not allow exported amdgpu_gtt_mgr_*() to accept
any ttm_resource_manager pointer. Also there is no need
to force other modules to call a ttm function just to
eventually call gtt_mgr functions.

v4: remove unused adev.
v3: upcast mgr from ttm resource manager instead of
getting it from adev.
v2: pass adev's gtt_mgr instead of adev.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1dd8b1b9

drm/amdgpu: Unmap MMIO mappings when device is not unplugged · 62d5f9f7

Committed by Leslie Shi on Jan 05, 2022

Patch: 3efb17ae7e92 ("drm/amdgpu: Call amdgpu_device_unmap_mmio() if device
is unplugged to prevent crash in GPU initialization failure") makes the call to
amdgpu_device_unmap_mmio() conditional on the device being unplugged. This patch unmaps
MMIO mappings even when the device is not unplugged.

v2: Add condition of drm_dev_enter() to deleted unmaps in patch
"drm/amdgpu: Unmap all MMIO mappings"
Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

62d5f9f7

08 Jan, 2022 · 1 commit

drm/amdgpu: explicitly check for s0ix when evicting resources · e53d9665

Committed by Mario Limonciello on Dec 29, 2021

This codepath should be running in both s0ix and s3, but only does
currently because s3 and s0ix are both set in the s0ix case.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e53d9665

30 Dec, 2021 · 3 commits

drm/amdgpu: no DC support for headless chips · 0637d417

Committed by Alex Deucher on Dec 23, 2021

Chips with no display hardware should return false for
DC support.

v2: drop Arcturus and Aldebaran

Fixes: f7f12b25 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reported-by: Tareque Md.Hanif <tarequemd.hanif@yahoo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0637d417

drm/amdgpu: Check the memory can be accesssed by ttm_device_clear_dma_mappings. · b6fd6e0f

Committed by Surbhi Kakarya on Dec 17, 2021

If the event guard is enabled and the VF doesn't receive an ack from the PF for full access,
the guest driver load crashes.
This is caused by the call to ttm_device_clear_dma_mappings with an uninitialized
mman during driver teardown.

This patch adds the necessary condition to check whether the mman initialization passed
and takes the path based on the condition output.
Signed-off-by: Surbhi Kakarya <Surbhi.Kakarya@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b6fd6e0f

drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent... · 87172e89

Committed by Leslie Shi on Dec 16, 2021

drm/amdgpu: Call amdgpu_device_unmap_mmio() if device is unplugged to prevent crash in GPU initialization failure

[Why]
In amdgpu_driver_load_kms, when amdgpu_device_init returns an error during driver modprobe, it
will start the error handling path immediately and call into amdgpu_device_unmap_mmio as well
to release mapped VRAM. However, in the following release callback, the driver still visits the
unmapped memory like vcn.inst[i].fw_shared_cpu_addr in vcn_v3_0_sw_fini. So a kernel crash occurs.

[How]
Call amdgpu_device_unmap_mmio() if the device is unplugged to prevent an invalid memory address in
vcn_v3_0_sw_fini() when GPU initialization fails.
Signed-off-by: Leslie Shi <Yuliang.Shi@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

87172e89

29 Dec, 2021 · 2 commits

drm/amdgpu: Send Message to SMU on aldebaran passthrough for sbr handling · 4da8b639

Committed by sashank saye on Dec 17, 2021

For the Aldebaran chip passthrough case we need to notify the SMU
about special handling for SBR. On older chips we send
LightSBR to the SMU; enable the same for Aldebaran. The slight
difference compared to previous chips is that on Aldebaran the SMU
does a heavy reset on SBR. Hence the word Heavy
instead of Light SBR is used for the SMU to differentiate.

Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: sashank saye <sashank.saye@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4da8b639

drm/amdgpu: get xgmi info before ip_init · 4a0165f0

Committed by Victor Skvortsov on Dec 15, 2021

The driver needs to call get_xgmi_info() before ip_init
to determine whether it needs to handle a pending hive reset.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed-by: David Nieto <david.nieto@amd.com>
Reviewed by: shaoyun.liu <Shaoyun.lui@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4a0165f0

28 Dec, 2021 · 1 commit

drm/amdgpu: no DC support for headless chips · ebae8973

Committed by Alex Deucher on Dec 23, 2021

Chips with no display hardware should return false for
DC support.

v2: drop Arcturus and Aldebaran

Fixes: f7f12b25 ("drm/amdgpu: default to true in amdgpu_device_asic_has_dc_support")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Reported-by: Tareque Md.Hanif <tarequemd.hanif@yahoo.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ebae8973

18 Dec, 2021 · 1 commit

drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · bf67014d

Committed by Huang Rui on Dec 16, 2021

The job embedded fence doesn't initialize the flags at
dma_fence_init(). We then take the wrong path in the
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once the trace event is enabled. So introduce a new amdgpu_fence
object to indicate the job embedded fence.

[  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
[  156.131804] #PF: supervisor read access in kernel mode
[  156.131811] #PF: error_code(0x0000) - not-present page
[  156.131817] PGD 0 P4D 0
[  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
[  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
[  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  156.131848] RIP: 0010:strlen+0x0/0x20
[  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
[  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
[  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
[  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
[  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
[  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
[  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
[  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
[  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  156.131949] Call Trace:
[  156.131953]  <TASK>
[  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
[  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
[  156.131982]  dma_fence_init+0x92/0xb0
[  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
[  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
[  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]

v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bf67014d

17 Dec, 2021 · 2 commits

drm/amdgpu: Separate vf2pf work item init from virt data exchange · 892deb48

Committed by Victor Skvortsov on Dec 16, 2021

We want to be able to call virt data exchange conditionally
after gmc sw init to reserve bad pages as early as possible.
Since this is a conditional call, we will need
to call it again unconditionally later in the init sequence.

Refactor the data exchange function so it can be
called multiple times without re-initializing the work item.

v2: Cleaned up the code. Kept the original call to init_exchange_data()
inside early init to initialize the work item, afterwards call
exchange_data() when needed.
Signed-off-by: Victor Skvortsov <victor.skvortsov@amd.com>
Reviewed By: Shaoyun.liu <Shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

892deb48
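
The split this commit describes, one-time work item setup versus a repeatable data exchange, might look roughly like the following user-space sketch. The struct and both function names are hypothetical stand-ins and are not the driver's actual code; counters replace the real work item and the real exchange.

```c
/* Hypothetical stand-in for the driver's virt state. */
struct virt_state_sketch {
	int work_init_count;	/* times the work item was set up */
	int exchange_count;	/* times data was exchanged */
};

/* One-time setup of the work item, called once from early init. */
static void init_data_exchange_sketch(struct virt_state_sketch *v)
{
	v->work_init_count++;	/* stands in for initializing the work item */
}

/* Data exchange only: touches no work item state, so it can be repeated. */
static void exchange_data_sketch(struct virt_state_sketch *v)
{
	v->exchange_count++;	/* stands in for reading the pf2vf data */
}
```

Calling the exchange function once conditionally and once unconditionally then leaves the work item initialized exactly once.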

drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · 5c1e6fa4

Committed by Huang Rui on Dec 16, 2021

The job embedded fence doesn't initialize the flags at
dma_fence_init(). We then take the wrong path in the
amdgpu_fence_get_timeline_name callback and trigger a null pointer panic
once the trace event is enabled. So introduce a new amdgpu_fence
object to indicate the job embedded fence.

[  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
[  156.131804] #PF: supervisor read access in kernel mode
[  156.131811] #PF: error_code(0x0000) - not-present page
[  156.131817] PGD 0 P4D 0
[  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
[  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
[  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
[  156.131848] RIP: 0010:strlen+0x0/0x20
[  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
[  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
[  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
[  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
[  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
[  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
[  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
[  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
[  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
[  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  156.131949] Call Trace:
[  156.131953]  <TASK>
[  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
[  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
[  156.131982]  dma_fence_init+0x92/0xb0
[  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
[  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
[  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]

v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5c1e6fa4

15 Dec, 2021 · 2 commits

amdgpu: fix some kernel-doc markup · 03f2abb0

Committed by Yann Dirson on Dec 14, 2021

These are not currently pulled into the Sphinx docs, but better to be ready.
Signed-off-by: Yann Dirson <ydirson@free.fr>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

03f2abb0

drm/amdgpu: use adev_to_drm to get drm_device pointer · e0f943b4

Committed by Guchun Chen on Dec 13, 2021

Updated for consistency when accessing drm_device from amdgpu driver.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e0f943b4

14 Dec, 2021 · 6 commits

drm/amdgpu: Detect if amdgpu in IOMMU direct map mode · 4a74c38c

Committed by Philip Yang on Dec 06, 2021

If the host and amdgpu IOMMU is not enabled, or the IOMMU is in passthrough mode,
set the adev->ram_is_direct_mapped flag, which will be used to optimize
memory usage for multi GPU mappings.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4a74c38c
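
The condition in this commit message can be sketched as a small predicate. This is a user-space illustration only: the domain type enum and the struct are hypothetical and do not mirror the kernel's IOMMU definitions, and a NULL pointer stands in for "no IOMMU enabled".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical domain types, for illustration only. */
enum domain_type_sketch { DOMAIN_PASSTHROUGH, DOMAIN_TRANSLATED };

struct iommu_domain_sketch {
	enum domain_type_sketch type;
};

/*
 * RAM counts as direct mapped when there is no IOMMU domain at all
 * (modelled as NULL) or when the domain is in passthrough mode.
 */
static bool ram_is_direct_mapped_sketch(const struct iommu_domain_sketch *domain)
{
	return domain == NULL || domain->type == DOMAIN_PASSTHROUGH;
}
```

A caller would evaluate the predicate once at init time and cache the result in a flag, as the commit does with adev->ram_is_direct_mapped.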

drm/amdgpu: introduce a kind of halt state for amdgpu device · 34f3a4a9

Committed by Lang Yu on Dec 09, 2021

It is useful to maintain error context when debugging
SW/FW issues. Introduce amdgpu_device_halt() for this
purpose. It will bring hardware to a kind of halt state,
so that no one can touch it any more.

Compared to a simple hang, the system will stay stable,
at least for SSH access. Then it should be trivial to
inspect the hardware state and see what's going on.

v2:
 - Set adev->no_hw_access earlier to avoid potential crashes. (Christian)
Suggested-by: Christian Koenig <christian.koenig@amd.com>
Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Christian Koenig <christian.koenig@amd.co>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

34f3a4a9

drm/amdgpu: only hw fini SMU fisrt for ASICs need that · 613aa3ea

Committed by Lang Yu on Dec 03, 2021

We ran into some problems on ASICs that don't need that,
so remove it for them.
Suggested-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Lang Yu <lang.yu@amd.com>
Reviewed-by: Kevin Wang <kevinyang.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

613aa3ea

drm/amd: fix improper docstring syntax · bbe04dec

Committed by Isabella Basso on Dec 07, 2021

This fixes various warnings relating to erroneous docstring syntax, of
which some are listed below:

 warning: Function parameter or member 'adev' not described in
 'amdgpu_atomfirmware_ras_rom_addr'
 ...
 warning: expecting prototype for amdgpu_atpx_validate_functions().
 Prototype was for amdgpu_atpx_validate() instead
 ...
 warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new'
 ...
 warning: Cannot understand  * @kfd_get_cu_occupancy - Collect number of
 waves in-flight on this device
 ...
 warning: This comment starts with '/**', but isn't a kernel-doc
 comment. Refer Documentation/doc-guide/kernel-doc.rst
Signed-off-by: Isabella Basso <isabbasso@riseup.net>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bbe04dec

drm/amdgpu: recover XGMI topology for SRIOV VF after reset · a5f67c93

Committed by Zhigang Luo on Dec 06, 2021

For SRIOV VF, the XGMI topology was not recovered after reset. This
change adds code to the SRIOV VF reset function to update the XGMI topology
for SRIOV VF after reset.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a5f67c93

drm/amdgpu: skip reset other device in the same hive if it's SRIOV VF · 175ac6ec

Committed by Zhigang Luo on Nov 26, 2021

On SRIOV, the host driver can support FLR (function level reset) on an individual VF
within the hive, which might bring that device back to normal without
the need to execute a hive reset. If the FLR fails, the host driver will
trigger the hive reset, and each guest VF will get a reset notification before the
real hive reset is executed. The VF device can handle the reset request
individually in its reset work handler.

This change updates the gpu recover sequence to skip resetting other devices in
the same hive for SRIOV VF.
Signed-off-by: Zhigang Luo <zhigang.luo@amd.com>
Reviewed-by: Shaoyun Liu <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

175ac6ec

02 Dec, 2021 · 5 commits

drm/amdgpu: adjust the kfd reset sequence in reset sriov function · 428890a3

Committed by shaoyunl on Nov 29, 2021

This change reverts previous commits:
9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function")

This change moves the amdgpu_amdkfd_pre_reset to an earlier place
in amdgpu_device_reset_sriov, presumably to address the sequence issue
that the first patch was originally meant to fix.

Some register accesses (GRBM_GFX_CNTL) are only allowed in full access
mode. Move kfd_pre_reset and kfd_post_reset back inside the reset_sriov
function.

Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
Fixes: 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function")
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

428890a3

drm/amdgpu: check atomic flag to differeniate with legacy path · 1053b9c9

Committed by Flora Cui on Nov 18, 2021

since vkms supports the atomic KMS interface
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Alex Deucher <aleander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1053b9c9

drm/amdgpu: adjust the kfd reset sequence in reset sriov function · 992110d7

Committed by shaoyunl on Nov 29, 2021

This change reverts previous commits:
9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function")

This change moves the amdgpu_amdkfd_pre_reset to an earlier place
in amdgpu_device_reset_sriov, presumably to address the sequence issue
that the first patch was originally meant to fix.

Some register accesses (GRBM_GFX_CNTL) are only allowed in full access
mode. Move kfd_pre_reset and kfd_post_reset back inside the reset_sriov
function.

Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")
Fixes: 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function")
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

992110d7

drm/amdgpu: fix disable ras feature failed when unload drvier v2 · 232d1d43

Committed by Stanley.Yang on Nov 26, 2021

v2:
    still need to call ras_disable_all_features to handle
    the ras initialization failure case.

Function amdgpu_device_fini_hw is called before amdgpu_device_fini_sw,
so the ras ta would be unloaded before the ras disable command is sent. The ras disable operation
must happen before hw fini.
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

232d1d43

drm/amdgpu: check atomic flag to differeniate with legacy path · 700de2c8

Committed by Flora Cui on Nov 18, 2021

since vkms supports the atomic KMS interface
Signed-off-by: Flora Cui <flora.cui@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Acked-by: Alex Deucher <aleander.deucher@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

700de2c8

25 Nov, 2021 · 2 commits

drm/amdgpu: move kfd post_reset out of reset_sriov function · 271fd38c

Committed by shaoyunl on Nov 18, 2021

Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

For the sriov XGMI configuration, the host driver handles the hive reset,
so on the guest side reset_sriov is only called once, on one device. This
makes kfd post_reset unbalanced with kfd pre_reset, since kfd pre_reset has already
been moved out of the reset_sriov function. Move kfd post_reset out of the reset_sriov
function to balance them.
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

271fd38c

drm/amdgpu: move kfd post_reset out of reset_sriov function · 4f30d920

Committed by shaoyunl on Nov 18, 2021

Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov")

4f30d920

23 Nov, 2021 · 1 commit

drm/amd/pm: avoid duplicate powergate/ungate setting · 6c08e0ef

Committed by Evan Quan on Nov 05, 2021

Just bail out if the target IP block is already in the desired
powergate/ungate state. This can avoid some duplicate settings
which sometimes may cause unexpected issues.

Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789
Signed-off-by: Evan Quan <evan.quan@amd.com>
Tested-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6c08e0ef
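
The early bail-out described in this commit can be illustrated with a small user-space sketch. The enum, struct, and function below are hypothetical names chosen for the example, not the driver's real API; a write counter stands in for programming the hardware.

```c
/* Hypothetical powergate states, for illustration only. */
enum pg_state_sketch { PG_SKETCH_UNGATE = 0, PG_SKETCH_GATE = 1 };

struct ip_block_sketch {
	enum pg_state_sketch state;	/* last state applied to the block */
	int hw_writes;			/* counts simulated hardware settings */
};

/* Returns 0 on success; skips the hardware when nothing would change. */
static int set_powergating_state_sketch(struct ip_block_sketch *ip,
					enum pg_state_sketch target)
{
	if (ip->state == target)
		return 0;	/* already in the desired state: bail out */

	ip->hw_writes++;	/* stands in for the real hardware programming */
	ip->state = target;
	return 0;
}
```

Requesting the same state twice in a row then reaches the hardware only once, which is the duplicate setting the commit aims to avoid.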

18 Nov, 2021 · 2 commits

drm/amd/pm: avoid duplicate powergate/ungate setting · 6ee27ee2

Committed by Evan Quan on Nov 05, 2021

Just bail out if the target IP block is already in the desired
powergate/ungate state. This can avoid some duplicate settings
which sometimes may cause unexpected issues.

Link: https://lore.kernel.org/all/YV81vidWQLWvATMM@zn.tnic/
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=214921
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=215025
Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1789
Fixes: bf756fb8 ("drm/amdgpu: add missing cleanups for Polaris12 UVD/VCE on suspend")
Signed-off-by: Evan Quan <evan.quan@amd.com>
Tested-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

6ee27ee2

drm/amdgpu: use generic fb helpers instead of setting up AMD own's. · 087451f3

Committed by Evan Quan on Oct 19, 2021

With the shadow buffer support from generic framebuffer emulation, it's
now possible to have runpm kick in when there is no update for the console.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

087451f3

10 Nov, 2021 · 1 commit

drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov · 9f4f2c1a

Committed by shaoyunl on Nov 05, 2021

The KFD pre_reset should be called before the reset is executed. It
holds the lock to prevent other ROCm processes from sending packets to the HIQ
while the host executes the real reset on the HW.
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9f4f2c1a
