Commits · b85e285e3d6352b02947fc1b72303673dfacb0aa · openeuler / Kernel

23 Nov, 2022 · 1 commit

drm/amdgpu: fix pci device refcount leak · b85e285e

Committed by Yang Yingliang on Nov 17, 2022

As the comment on pci_get_domain_bus_and_slot() says, it returns
a PCI device with its reference count incremented. When the caller
has finished using the device, it must decrement the reference
count by calling pci_dev_put().

So call pci_dev_put() before returning from
amdgpu_device_resume_display_audio() and
amdgpu_device_suspend_display_audio() to avoid the refcount leak.
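
The get/put pairing described above can be illustrated with a small userspace sketch. This is not the kernel code: `struct fake_dev`, `fake_dev_get()`, `fake_dev_put()` and the two `use_audio_dev_*()` functions are hypothetical stand-ins that only model a reference count, assuming the lookup increments it the way pci_get_domain_bus_and_slot() is documented to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only the reference count. */
struct fake_dev {
	int refcount;
};

/* Models a lookup that returns the device with its refcount incremented. */
struct fake_dev *fake_dev_get(struct fake_dev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Models pci_dev_put(): drops one reference; NULL is a no-op. */
void fake_dev_put(struct fake_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* Leaky variant: takes a reference and returns without dropping it. */
int use_audio_dev_leaky(struct fake_dev *bus_dev)
{
	struct fake_dev *p = fake_dev_get(bus_dev);

	if (!p)
		return -1;
	/* ... use p ... */
	return 0;
}

/* Fixed variant: the successful get is paired with a put before return. */
int use_audio_dev_fixed(struct fake_dev *bus_dev)
{
	struct fake_dev *p = fake_dev_get(bus_dev);

	if (!p)
		return -1;
	/* ... use p ... */
	fake_dev_put(p);
	return 0;
}
```

Calling the leaky variant leaves the count one higher than before the call, while the fixed variant leaves it unchanged.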

Fixes: 3f12acc8 ("drm/amdgpu: put the audio codec into suspend state before gpu reset V3")
Reviewed-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

18 Nov, 2022 · 1 commit

drm/amdgpu: Enable mode-1 reset for RAS recovery in fatal error mode · 1a11a65d

Committed by YiPeng Chai on Nov 08, 2022

Enable mode-1 reset for RAS recovery in fatal error mode.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 Nov, 2022 · 7 commits

drm/amdgpu: stop resubmittting jobs in amdgpu_pci_resume · 0788a47e

Committed by Christian König on Oct 26, 2022

The state of VRAM is unreliable due to a PCI event like AER, link reset
or DPC.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: stop resubmitting jobs for GPU reset v2 · 6868a2c4

Committed by Christian König on Oct 26, 2022

Re-submitting IBs by the kernel has many problems because the
prerequisite state is not automatically re-created as well. In
other words, neither binary semaphores nor things like ring
buffer pointers are in the state they should be in when the
hardware starts to work on the IBs again.

In addition, even after more than 5 years of developing this
feature it is still not stable, and we have massive problems
getting the reference counts right.

As discussed with user space developers, this behavior is not
helpful in the first place. For graphics and multimedia
workloads it makes much more sense to either completely
re-create the context or at least re-submit the IBs
from userspace.

For compute use cases re-submitting is also not very
helpful, since userspace must rely on the accuracy of
the result.

Because of this we stop this practice and instead just
properly note that the fence submission was canceled. The
only use cases we keep the re-submission for, for now, are
SRIOV and function level resets.

v2: as suggested by Shaoyun, stop resubmitting jobs even for SRIOV
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: revert "implement tdr advanced mode" · 06a2d7cc

Committed by Christian König on Oct 26, 2022

This reverts commit e6c6338f.

This feature basically re-submits one job after another to
figure out which one was the one causing a hang.

This is obviously incompatible with gang-submit which requires
that multiple jobs run at the same time. It's also absolutely
not helpful to crash the hardware multiple times if a clean
recovery is desired.

For testing and debugging environments we should rather disable
recovery altogether to be able to inspect the state with a hw
debugger.

In addition, the sw implementation is clearly buggy and causes
reference count issues for the hardware fence.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Fixed the problem that ras error can't be queried after gpu recovery is completed · d293470e

Committed by YiPeng Chai on Nov 04, 2022

amdgpu_ras_set_error_query_ready is called at the start of
amdgpu_device_gpu_recover to disable RAS error queries, but the
code that follows re-enables them only in the full reset path,
not in the soft reset path, the emergency restart path, or the
path that skips the hardware reset.
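
The problem described above is a flag that is cleared on entry but restored on only one of several exits. A minimal userspace sketch of one way to avoid that, assuming a single shared exit, is given here. All names (`fake_ras`, `fake_recover_path`, `fake_gpu_recover`) are hypothetical and do not come from the driver. The commit message does not say how the actual patch restores the flag or what it does on error paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical recovery paths, named after those in the commit message. */
enum fake_recover_path {
	FAKE_PATH_FULL_RESET,
	FAKE_PATH_SOFT_RESET,
	FAKE_PATH_EMERGENCY_RESTART,
	FAKE_PATH_SKIP_HW_RESET,
};

/* Hypothetical stand-in for the RAS state; only the query flag is modeled. */
struct fake_ras {
	bool query_ready;
};

/* Disables error queries on entry and re-enables them on one shared exit. */
int fake_gpu_recover(struct fake_ras *ras, enum fake_recover_path path)
{
	int ret = 0;

	ras->query_ready = false;

	switch (path) {
	case FAKE_PATH_FULL_RESET:
		/* ... full reset work ... */
		break;
	case FAKE_PATH_SOFT_RESET:
	case FAKE_PATH_EMERGENCY_RESTART:
	case FAKE_PATH_SKIP_HW_RESET:
		/* ... lighter paths that previously left the flag cleared ... */
		break;
	default:
		ret = -1;
		break;
	}

	/* Single exit: every path in this sketch re-enables queries. */
	ras->query_ready = true;
	return ret;
}
```

Funnelling every path through one exit makes it hard to add a new path that forgets to restore the flag.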
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: there is no vbios fb on devices with no display hw (v2) · 220c8cc8

Committed by Alex Deucher on Nov 11, 2022

If we enable virtual display functionality on parts with
no display hardware we can end up trying to check for and
reserve the vbios FB area on devices where it doesn't exist.
Check if display hardware is actually present on the hardware
before trying to reserve the memory.

v2: move the check into common code
Acked-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: clarify DC checks · d09ef243

Committed by Alex Deucher on Jul 19, 2022

There are several places where we don't want to check
if a particular asic could support DC, but rather, if
DC is enabled.  Set a flag if DC is enabled and check
for that rather than if a device supports DC or not.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: rework SR-IOV virtual display handling · 25263da3

Committed by Alex Deucher on Jul 19, 2022

Virtual display is enabled unconditionally in SR-IOV, but
without specifying the virtual_display module parameter, the
number of crtcs defaults to 0.  Set a single display by default
for SR-IOV if the virtual_display parameter is not set.
Only enable virtual display by default on SR-IOV on asics
which actually have display hardware.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

06 Nov, 2022 · 1 commit

drm/fb-helper: Remove unnecessary include statements · 45b64fd9

Committed by Thomas Zimmermann on Nov 03, 2022

Remove include statements for <drm/drm_fb_helper.h> where it is not
required (i.e., most of them). In a few places include other header
files that are required by the source code.

v3:
	* fix amdgpu include statements
	* fix rockchip include statements
Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
Reviewed-by: Javier Martinez Canillas <javierm@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20221103151446.2638-23-tzimmermann@suse.de

05 Nov, 2022 · 2 commits

drm/amdgpu: fix for suspend/resume sequence under sriov · ec4927d4

Committed by Victor Zhao on Oct 26, 2022

- clear the kiq ring after suspend/resume under sriov to avoid a kiq ring
  test failure
- update irq after resume to fix kiq interrupt loss
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Disable MCBP from soc21 for SRIOV · 8a1fbb4a

Committed by Yiqing Yao on Oct 28, 2022

[why]
Starting from soc21, CP does not support MCBP, so disable it.

[how]
Use the amdgpu_mcbp flag alone, instead of checking whether the
device is in SRIOV, to enable/disable MCBP.
Only set the flag to enabled in SRIOV on asic_type prior to soc21.
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

28 Oct, 2022 · 1 commit

drm/amd: Fail the suspend if resources can't be evicted · 7863c155

Committed by Mario Limonciello on Oct 26, 2022

If a system does not have swap and memory is at 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on and proceeds to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.
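
The control-flow change can be sketched in userspace as follows. This is only an illustration under assumed names: `struct fake_gpu`, `fake_evict_resources()` and `fake_suspend()` are hypothetical, and the later suspend phases are reduced to a single flag. The error value -ENOMEM matches the -12 seen in the first log above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical device model: a canned eviction result and a reset marker. */
struct fake_gpu {
	int evict_result;	/* 0 on success, negative errno on failure */
	int was_reset;		/* set once the later suspend phases run */
};

/* Stand-in for the eviction step; returns the canned result. */
int fake_evict_resources(struct fake_gpu *gpu)
{
	return gpu->evict_result;
}

/* Suspend that aborts as soon as eviction fails, before any reset. */
int fake_suspend(struct fake_gpu *gpu)
{
	int r = fake_evict_resources(gpu);

	if (r)
		return r;	/* propagate the error so the suspend is failed */

	/* ... remaining suspend phases, reduced here to one marker ... */
	gpu->was_reset = 1;
	return 0;
}
```

With the early return, a failed eviction leaves the device untouched and hands the error back to the caller, which can then abort the system suspend.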

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

25 Oct, 2022 · 3 commits

amd/amdgpu: fix repeated words in comments · 12024b17

Committed by wangjianli on Oct 22, 2022

Delete the redundant word 'the'.
Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume · f543d286

Committed by Prike Liang on Oct 21, 2022

In the S2idle suspend/resume phase gfxoff stays functional, so
some IP blocks are likely to reinitialize at gfxoff entry, which
results in failures to program GC registers. Therefore, disallow
gfxoff until the AMDGPU IPs are completely reinitialized.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: skip mes self test for gc 11.0.3 in recover · 693073a0

Committed by YuBiao Wang on Oct 19, 2022

Temporarily disable the mes self test for gc 11.0.3 during gpu_recovery.
Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

19 Oct, 2022 · 2 commits

drm/amd/pm: disable cstate feature for gpu reset scenario · 3059cd8c

Committed by Evan Quan on Sep 29, 2022

Suggested by the PMFW team; this is the same as what was done for the
gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amdgpu: let mode2 reset fallback to default when failure" · a340847b

Committed by Victor Zhao on Oct 13, 2022

This reverts commit dac6b808.

Revert AMDGPU_SKIP_MODE2_RESET because it conflicts with the
original design of the reset handler. It will be redesigned.

Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

18 Oct, 2022 · 2 commits

drm/amd/pm: disable cstate feature for gpu reset scenario · b31d6ada

Committed by Evan Quan on Sep 29, 2022

Suggested by the PMFW team; this is the same as what was done for the
gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x

Revert "drm/amdgpu: let mode2 reset fallback to default when failure" · b98a1648

Committed by Victor Zhao on Oct 13, 2022

This reverts commit dac6b808.

Revert AMDGPU_SKIP_MODE2_RESET because it conflicts with the
original design of the reset handler. It will be redesigned.

Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

29 Sep, 2022 · 3 commits

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV · d7274ec7

Committed by Bokun Zhang on Sep 28, 2022

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
  during the suspend time. Furthermore, we cannot request a
  mode 1 reset under SRIOV as VF. Therefore, we will skip it
  as it is called in suspend_noirq() function.

- In the resume code path, we need to send REQ_GPU_INIT to the
  hypervisor and also resume PSP IP block under SRIOV.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Use simplified API for p2p dist calc · bb66ecbf

Committed by Lijo Lazar on Sep 21, 2022

Use the simplified API that calculates the distance between two devices.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Disable verbose for p2p dist calc · d0fa84f1

Committed by Lijo Lazar on Sep 21, 2022

Disable verbose output while getting the p2p distance. With verbose
enabled, a warning is shown if ACS redirect is set between the devices.
This adds noise to dmesg logs when several GPU devices are on the same
platform.

Example log:

amdgpu 0000:34:00.0: ACS redirect is set between the client and provider (0000:31:00.0)
amdgpu 0000:34:00.0: to disable ACS redirect for this path, add the kernel parameter:
	pci=disable_acs_redir=0000:30:00.0;0000:2e:00.0;0000:33:00.0;0000:2e:10.0
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

28 Sep, 2022 · 1 commit

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV · 3b7329cf

Committed by Bokun Zhang on Sep 28, 2022

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
  during the suspend time. Furthermore, we cannot request a
  mode 1 reset under SRIOV as VF. Therefore, we will skip it
  as it is called in suspend_noirq() function.

- In the resume code path, we need to send REQ_GPU_INIT to the
  hypervisor and also resume PSP IP block under SRIOV.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

21 Sep, 2022 · 1 commit

drm/amdgpu: add gang submit backend v2 · 68ce8b24

Committed by Christian König on Mar 02, 2022

Allow submitting jobs as a gang which needs to run on multiple
engines at the same time.

The basic idea is that we have a global gang submit fence representing when
the gang leader is finally pushed to run on the hardware last.

Jobs submitted as a gang are never re-submitted in case of a GPU reset, since
this won't work and will just deadlock the hardware immediately again.

v2: fix logic inversion, improve documentation, fix rcu
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

20 Sep, 2022 · 2 commits

drm/amdgpu: Fixed psp fence and memory issues when removing amdgpu device · 83d29a5f

Committed by YiPeng Chai on Sep 08, 2022

V3:
Fixed psp fence and memory issues for the asic
using smu v13_0_2 when removing amdgpu device.

[Why]:
1. psp_suspend->psp_free_shared_bufs->
       psp_ta_free_shared_buf->
           amdgpu_bo_free_kernel->
             ...->amdgpu_bo_release_notify->
                    amdgpu_fill_buffer
   psp will free vram memory used by psp when psp_suspend
   is called. But for the asic using smu v13_0_2, because
   psp_suspend is called before adev->shutdown is set to
   true when removing the first hive device, amdgpu_fill_buffer
   will be called, which will cause fence issues when evicting
   all vram resources in amdgpu_vram_mgr_fini.
2. Since psp_hw_fini is not called after calling psp_suspend
   and psp_suspend only calls psp_ring_stop, the psp ring memory
   will not be released when amdgpu device is removed.

[How]:
1. Set shutdown to true before calling amdgpu_device_gpu_recover,
   then amdgpu_fill_buffer will not be called when psp_suspend is
   called.
2. Free psp ring memory in psp_sw_fini.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Adjust removal control flow for smu v13_0_2 · f5c7e779

Committed by YiPeng Chai on Sep 07, 2022

Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

V3:
 1. Update commit comments.
 2. Split a patch into multiple patches.
 3. The current patch does:
    a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
       the existing gpu recover path, which makes all devices
       in the hive list only have a HW reset but no resume (except
       the base IP).
    b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
       AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
       in amdgpu_pci_remove when removing the first device in
       hive list.
    c. When removing the first device, the IP blocks keyword
       function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
   ^                           |
   |-<----------<---------<----|
	The first three sequences are because of a call to
        amdgpu_device_gpu_recover. The three sequences will be
        executed in a loop until all devices in the hive list
        are iterated.
        The sequences starting from .hw_fini only apply to the
        first device. Since .suspend has been called before,
        except the resumed phase1 basic ip blocks, all other ip
        blocks .hw_fini of current device will do nothing.
     d. When removing other devices, the calling sequence is the
        same as legacy:
	   .hw_fini -> .early_fini -> .sw_fini.
	Since .suspend has been called when removing the first device,
        except the resumed phase1 basic ip blocks, all of other ip
        blocks .hw_fini of current device will do nothing.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

15 Sep, 2022 · 2 commits

drm/amdgpu: make sure to init common IP before gmc · a8671493

Committed by Alex Deucher on Aug 30, 2022

Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.

This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.

Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: make sure to init common IP before gmc · c1c39032

Committed by Alex Deucher on Aug 30, 2022

Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.

This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.

Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

14 Sep, 2022 · 2 commits

drm/amdgpu: Fix hive reference count leak · 2efc30f0

Committed by Vignesh Chander on Sep 09, 2022

Both get_xgmi_hive and put_xgmi_hive can be skipped, since the
reset domain is not necessary for a VF.
Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Use per device reset_domain for XGMI on sriov configuration · 46c67660

Committed by shaoyunl on Sep 06, 2022

For SRIOV configurations, the host driver controls the reset method (either
FLR or a heavier chain reset). The host will notify the guest individually
with an FLR message if an individual GPU within the hive needs to be reset.
So on the guest side there is no need to use hive->reset_domain to replace
the original per-device reset_domain.
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

08 Sep, 2022 · 1 commit

drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled · fac53471

Committed by YiPeng Chai on Aug 18, 2022

V1:
  The psp_cmd_submit_buf function is called by psp_hw_fini to send
TA unload messages to psp to terminate ras, asd and tmr. But when
amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove, the calling order as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
	drm_dev_unplug
	......
	amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
		->.hw_fini->psp_hw_fini->...
		->psp_ta_unload->psp_cmd_submit_buf
	......
}
The program will return when calling drm_dev_enter in psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be
removed, so that the TA unload messages can be sent to the psp
when amdgpu is uninstalled.

V2:
1. Restore psp_cmd_submit_buf to its original code.
2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in
   amdgpu_pci_remove.
3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms,
   remove the unplug check to release device mmio resource in
   amdgpu_device_fini_hw before calling drm_dev_unplug.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

31 Aug, 2022 · 1 commit

drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks · b97e9145

Committed by Alex Sierra on Aug 25, 2022

[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

30 Aug, 2022 · 1 commit

drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks · ab23c5b9

Committed by Alex Sierra on Aug 25, 2022

[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

26 Aug, 2022 · 1 commit

drm/amd/amdgpu: avoid soft reset check when gpu recovery disabled · d3ef9d57

Committed by Chengming Gui on Aug 05, 2022

Avoid the soft reset, and even the ip hang check (ring/ib test), when
gpu recovery is disabled.

v2: add missing "}"
Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

23 Aug, 2022 · 2 commits

drm/amdgpu: Remove the additional kfd pre reset call for sriov · 947f63f1

Committed by shaoyunl on Aug 18, 2022

The additional call was caused by a merge conflict.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix hive reference leak when adding xgmi device · 9dfa4860

Committed by YiPeng Chai on Aug 12, 2022

amdgpu_get_xgmi_hive is called without a matching
amdgpu_put_xgmi_hive, which leaks the hive reference.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

20 Aug, 2022 · 2 commits

drm/amdgpu: Remove the additional kfd pre reset call for sriov · 06671734

Committed by shaoyunl on Aug 18, 2022

The additional call was caused by a merge conflict.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix hive reference leak when adding xgmi device · f5994da7

Committed by YiPeng Chai on Aug 12, 2022

amdgpu_get_xgmi_hive is called without a matching
amdgpu_put_xgmi_hive, which leaks the hive reference.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

17 Aug, 2022 · 1 commit

drm/amd: Add detailed GFXOFF stats to debugfs · 0ad7347a

Committed by André Almeida on Aug 10, 2022

Add debugfs interface to log GFXOFF statistics:

- Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the
  time of query since system power-up

- Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop.
  Read it to get average GFXOFF residency % multiplied by 100
  during the last logging interval.
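
Since the residency value is described as a percentage multiplied by 100, a reader has to divide by 100 to get a human-readable figure. A minimal sketch of that conversion follows, assuming the raw value has already been obtained as an unsigned integer. How the file is read is not covered here, and `struct residency`, `split_residency()` and `print_residency()` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Whole and fractional parts of a percentage with two decimal places. */
struct residency {
	unsigned int whole;	/* integer percent */
	unsigned int frac;	/* hundredths of a percent, 0..99 */
};

/* Splits a raw reading (percent multiplied by 100) into its two parts. */
struct residency split_residency(unsigned int raw)
{
	struct residency r = { raw / 100, raw % 100 };

	return r;
}

/* Prints the reading as "NN.NN%", zero-padding the fractional part. */
void print_residency(unsigned int raw)
{
	struct residency r = split_residency(raw);

	printf("%u.%02u%%\n", r.whole, r.frac);
}
```

Under this reading, a raw value of 1234 corresponds to 12.34% residency, and 505 to 5.05%.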

Both features are designed to keep their values persistent across
suspends.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
