Commits · d09ef243035b75a6d403ebfeb7e87fa20d7e25c6 · openeuler / Kernel

16 Nov, 2022 · 2 commits

drm/amdgpu: clarify DC checks · d09ef243

Committed by Alex Deucher on Jul 19, 2022

There are several places where we don't want to check
if a particular asic could support DC, but rather, if
DC is enabled.  Set a flag if DC is enabled and check
for that rather than if a device supports DC or not.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d09ef243
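
The distinction drawn in "clarify DC checks" (capability versus actual use) can be sketched in plain C. This is only an illustration under stated assumptions: `struct demo_device` and every `demo_*` name are hypothetical and are not the driver's real types or functions.

```c
#include <stdbool.h>

/* Hypothetical, simplified device state; not the real struct amdgpu_device. */
struct demo_device {
    bool asic_supports_dc; /* what the hardware could do */
    bool dc_requested;     /* what the configuration asks for */
    bool dc_enabled;       /* decided once at init, checked everywhere else */
};

/* Decide once, at init time, whether DC is actually in use. */
static void demo_init_display(struct demo_device *dev)
{
    dev->dc_enabled = dev->asic_supports_dc && dev->dc_requested;
}

/* Later code paths test the flag instead of re-deriving the capability. */
static bool demo_uses_dc(const struct demo_device *dev)
{
    return dev->dc_enabled;
}
```

A device that supports DC but was configured without it reports `false` from `demo_uses_dc()`, which is the case a pure capability check would get wrong.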

drm/amdgpu: rework SR-IOV virtual display handling · 25263da3

Committed by Alex Deucher on Jul 19, 2022

Virtual display is enabled unconditionally in SR-IOV, but
without specifying the virtual_display module parameter, the
number of crtcs defaults to 0.  Set a single display by default
for SR-IOV if the virtual_display parameter is not set.
Only enable virtual display by default on SR-IOV on asics
which actually have display hardware.
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

25263da3
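
One possible reading of the defaulting rule in "rework SR-IOV virtual display handling" is sketched below. The struct, its fields and the function name are assumptions made for illustration; the real driver parses the parameter differently.

```c
#include <stdbool.h>

/* Hypothetical configuration snapshot; all fields are illustrative. */
struct demo_cfg {
    bool is_sriov_vf;               /* running as an SR-IOV virtual function */
    bool has_display_hw;            /* the asic has display hardware */
    bool virtual_display_param_set; /* user passed the virtual_display parameter */
    int  param_crtcs;               /* crtc count requested through the parameter */
};

/* Number of virtual crtcs to create. */
static int demo_virtual_crtc_count(const struct demo_cfg *cfg)
{
    if (cfg->virtual_display_param_set)
        return cfg->param_crtcs; /* an explicit request always wins */
    if (cfg->is_sriov_vf && cfg->has_display_hw)
        return 1;                /* default: one display under SR-IOV */
    return 0;                    /* no virtual display by default */
}
```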

05 Nov, 2022 · 2 commits

drm/amdgpu: fix for suspend/resume sequence under sriov · ec4927d4

Committed by Victor Zhao on Oct 26, 2022

- clear kiq ring after suspend/resume under sriov to avoid kiq ring
  test failure
- update irq after resume to fix kiq interrupt loss
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ec4927d4

drm/amdgpu: Disable MCBP from soc21 for SRIOV · 8a1fbb4a

Committed by Yiqing Yao on Oct 28, 2022

[why]
Starting from soc21, CP does not support MCBP, so disable it.

[how]
Use the amdgpu_mcbp flag alone, instead of checking whether we are in
SRIOV, to enable/disable MCBP.
Only set the flag to enable on asic_type prior to soc21 in SRIOV.
Signed-off-by: Yiqing Yao <yiqing.yao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8a1fbb4a

28 Oct, 2022 · 1 commit

drm/amd: Fail the suspend if resources can't be evicted · 7863c155

Committed by Mario Limonciello on Oct 26, 2022

If a system does not have swap and memory is under 100% usage,
amdgpu will fail to evict resources.  Currently the suspend
carries on proceeding to reset the GPU:

```
[drm] evicting device resources failed
[drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <vcn_v3_0> failed -12
[drm] free PSP TMR buffer
[TTM] Failed allocating page table
[drm] evicting device resources failed
amdgpu 0000:03:00.0: amdgpu: MODE1 reset
amdgpu 0000:03:00.0: amdgpu: GPU mode1 reset
amdgpu 0000:03:00.0: amdgpu: GPU smu mode1 reset
```

At this point if the suspend actually succeeded I think that amdgpu
would have recovered because the GPU would have power cut off and
restored.  However the kernel fails to continue the suspend from the
memory pressure and amdgpu fails to run the "resume" from the aborted
suspend.

```
ACPI: PM: Preparing to enter system sleep state S3
SLUB: Unable to allocate memory on node -1, gfp=0xdc0(GFP_KERNEL|__GFP_ZERO)
  cache: Acpi-State, object size: 80, buffer size: 80, default order: 0, min order: 0
  node 0: slabs: 22, objs: 1122, free: 0
ACPI Error: AE_NO_MEMORY, Could not update object reference count (20210730/utdelete-651)

[drm:psp_hw_start [amdgpu]] *ERROR* PSP load kdb failed!
[drm:psp_resume [amdgpu]] *ERROR* PSP resume failed
[drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* resume of IP block <psp> failed -62
amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-62).
PM: dpm_run_callback(): pci_pm_resume+0x0/0x100 returns -62
amdgpu 0000:03:00.0: PM: failed to resume async: error -62
```

To avoid this series of unfortunate events, fail amdgpu's suspend
when the memory eviction fails.  This will let the system gracefully
recover and the user can try suspend again when the memory pressure
is relieved.

Reported-by: post@davidak.de
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2223
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7863c155
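
The behavioural change in "Fail the suspend if resources can't be evicted" comes down to propagating the eviction error instead of carrying on. A minimal sketch, assuming a hypothetical `struct demo_gpu` whose `evict_result` field stands in for the outcome of the real eviction step:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical GPU state used only for this illustration. */
struct demo_gpu {
    int  evict_result; /* 0 on success, negative errno on failure */
    bool hw_suspended; /* set once the hardware suspend step has run */
};

/* Stand-in for evicting device resources to system memory. */
static int demo_evict_resources(const struct demo_gpu *gpu)
{
    return gpu->evict_result;
}

/* Abort the suspend as soon as eviction fails, leaving the GPU running. */
static int demo_suspend(struct demo_gpu *gpu)
{
    int r = demo_evict_resources(gpu);

    if (r)
        return r; /* e.g. -ENOMEM, matching the -12 in the log above */

    gpu->hw_suspended = true;
    return 0;
}
```

Returning the error lets the caller cancel the system suspend cleanly, so the device is never reset while its resources are still resident.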

25 Oct, 2022 · 3 commits

amd/amdgpu: fix repeated words in comments · 12024b17

Committed by wangjianli on Oct 22, 2022

Delete the redundant word 'the'.
Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

12024b17

drm/amdgpu: disallow gfxoff until GC IP blocks complete s2idle resume · f543d286

Committed by Prike Liang on Oct 21, 2022

In the S2idle suspend/resume phase gfxoff remains functional, so
some IP blocks are likely to reinitialize at gfxoff entry, which
results in failures to program GC registers. Therefore, disallow
gfxoff until the AMDGPU IPs have been reinitialized completely.
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f543d286
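
The idea of holding gfxoff off across the s2idle resume can be sketched with a simple disallow counter. Everything here is an assumption for illustration: the `demo_*` names are hypothetical, and the real driver's gfxoff control has more state (locking, delayed work) than this model.

```c
#include <stdbool.h>

/* Hypothetical model: gfxoff may be entered only when the counter is zero. */
struct demo_gfx {
    unsigned int off_disallow_count;
    bool gfxoff_possible_during_resume; /* recorded for inspection */
};

static bool demo_gfx_off_permitted(const struct demo_gfx *gfx)
{
    return gfx->off_disallow_count == 0;
}

/* allow == false takes a "keep GFX on" request, allow == true drops one. */
static void demo_gfx_off_ctrl(struct demo_gfx *gfx, bool allow)
{
    if (!allow)
        gfx->off_disallow_count++;
    else if (gfx->off_disallow_count > 0)
        gfx->off_disallow_count--;
}

/* Stand-in for reinitializing the IP blocks; records whether gfxoff could hit. */
static void demo_resume_ip_blocks(struct demo_gfx *gfx)
{
    gfx->gfxoff_possible_during_resume = demo_gfx_off_permitted(gfx);
}

/* Resume path: disallow gfxoff, reinitialize, then allow it again. */
static void demo_s2idle_resume(struct demo_gfx *gfx)
{
    demo_gfx_off_ctrl(gfx, false);
    demo_resume_ip_blocks(gfx);
    demo_gfx_off_ctrl(gfx, true);
}
```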

drm/amdgpu: skip mes self test for gc 11.0.3 in recover · 693073a0

Committed by YuBiao Wang on Oct 19, 2022

Temporarily disable the MES self test for gc 11.0.3 during gpu_recovery.
Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

693073a0

19 Oct, 2022 · 2 commits

drm/amd/pm: disable cstate feature for gpu reset scenario · 3059cd8c

Committed by Evan Quan on Sep 29, 2022

Suggested by the PMFW team; same as what was done for the gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3059cd8c

Revert "drm/amdgpu: let mode2 reset fallback to default when failure" · a340847b

Committed by Victor Zhao on Oct 13, 2022

This reverts commit dac6b808.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a340847b

18 Oct, 2022 · 2 commits

drm/amd/pm: disable cstate feature for gpu reset scenario · b31d6ada

Committed by Evan Quan on Sep 29, 2022

Suggested by the PMFW team; same as what was done for the gfxoff feature.
This can address some Mode1Reset failures observed on SMU13.0.0.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x

b31d6ada

Revert "drm/amdgpu: let mode2 reset fallback to default when failure" · b98a1648

Committed by Victor Zhao on Oct 13, 2022

This reverts commit dac6b808.

This commit reverted the AMDGPU_SKIP_MODE2_RESET as it conflicts with
the original design of reset handler. Will redesign it.

Fixes: dac6b808 ("drm/amdgpu: let mode2 reset fallback to default when failure")
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b98a1648

29 Sep, 2022 · 3 commits

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV · d7274ec7

Committed by Bokun Zhang on Sep 28, 2022

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
  during the suspend time. Furthermore, we cannot request a
  mode 1 reset under SRIOV as VF. Therefore, we will skip it
  as it is called in suspend_noirq() function.

- In the resume code path, we need to send REQ_GPU_INIT to the
  hypervisor and also resume PSP IP block under SRIOV.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d7274ec7

drm/amdgpu: Use simplified API for p2p dist calc · bb66ecbf

Committed by Lijo Lazar on Sep 21, 2022

Use the simplified API that calculates distance between two devices.
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bb66ecbf

drm/amdgpu: Disable verbose for p2p dist calc · d0fa84f1

Committed by Lijo Lazar on Sep 21, 2022

Disable verbose while getting p2p distance. With verbose, it shows a
warning if ACS redirect is set between the devices. This adds noise
to dmesg logs when a few GPU devices are on the same platform.

Example log:

amdgpu 0000:34:00.0: ACS redirect is set between the client and provider (0000:31:00.0)
amdgpu 0000:34:00.0: to disable ACS redirect for this path, add the kernel parameter:
	pci=disable_acs_redir=0000:30:00.0;0000:2e:00.0;0000:33:00.0;0000:2e:10.0
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d0fa84f1

28 Sep, 2022 · 1 commit

drm/amdgpu: Add amdgpu suspend-resume code path under SRIOV · 3b7329cf

Committed by Bokun Zhang on Sep 28, 2022

- Under SRIOV, we need to send REQ_GPU_FINI to the hypervisor
  during the suspend time. Furthermore, we cannot request a
  mode 1 reset under SRIOV as VF. Therefore, we will skip it
  as it is called in suspend_noirq() function.

- In the resume code path, we need to send REQ_GPU_INIT to the
  hypervisor and also resume PSP IP block under SRIOV.
Signed-off-by: Bokun Zhang <Bokun.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

3b7329cf

21 Sep, 2022 · 1 commit

drm/amdgpu: add gang submit backend v2 · 68ce8b24

Committed by Christian König on Mar 02, 2022

Allows submitting jobs as gang which needs to run on multiple
engines at the same time.

Basic idea is that we have a global gang submit fence representing when the
gang leader is finally pushed to run on the hardware last.

Jobs submitted as gang are never re-submitted in case of a GPU reset since this
won't work and will just deadlock the hardware immediately again.

v2: fix logic inversion, improve documentation, fix rcu
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

68ce8b24

20 Sep, 2022 · 2 commits

drm/amdgpu: Fixed psp fence and memory issues when removing amdgpu device · 83d29a5f

Committed by YiPeng Chai on Sep 08, 2022

V3:
Fixed psp fence and memory issues for the asic
using smu v13_0_2 when removing amdgpu device.

[Why]:
1. psp_suspend->psp_free_shared_bufs->
       psp_ta_free_shared_buf->
           amdgpu_bo_free_kernel->
             ...->amdgpu_bo_release_notify->
                    amdgpu_fill_buffer
   psp will free vram memory used by psp when psp_suspend
   is called. But for the asic using smu v13_0_2, because
   psp_suspend is called before adev->shutdown is set to
   true when removing the first hive device, amdgpu_fill_buffer
   will be called, which will cause fence issues when evicting
   all vram resources in amdgpu_vram_mgr_fini.
2. Since psp_hw_fini is not called after calling psp_suspend
   and psp_suspend only calls psp_ring_stop, the psp ring memory
   will not be released when amdgpu device is removed.

[How]:
1. Set shutdown to true before calling amdgpu_device_gpu_recover,
   then amdgpu_fill_buffer will not be called when psp_suspend is
   called.
2. Free psp ring memory in psp_sw_fini.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

83d29a5f

drm/amdgpu: Adjust removal control flow for smu v13_0_2 · f5c7e779

Committed by YiPeng Chai on Sep 07, 2022

Adjust removal control flow for smu v13_0_2:
   During amdgpu uninstallation, when removing the first
device, the kernel needs to first send a mode1reset message
to all gpu devices. Otherwise, smu initialization will fail
the next time amdgpu is installed.

V2:
1. Update commit comments.
2. Remove the global variable amdgpu_device_remove_cnt
   and add a variable to the structure amdgpu_hive_info.
3. Use hive to detect the first removed device instead of
   a global variable.

V3:
 1. Update commit comments.
 2. Split a patch into multiple patches.
 3. The current patch does:
    a. Add a work mode of AMDGPU_RESET_FOR_DEVICE_REMOVE into
       the existing gpu recover path, which makes all devices
       in the hive list only do a HW reset but no resume (except
       the base IP).
    b. Call AMDGPU_RESET_FOR_DEVICE_REMOVE and
       AMDGPU_NEED_FULL_RESET mode of amdgpu_device_gpu_recover
       in amdgpu_pci_remove when removing the first device in
       hive list.
    c. When removing the first device, the IP blocks keyword
       function call sequence is as follows:
.suspend->mode1reset->.resume(basic ip)->.hw_fini->.early_fini->.sw_fini.
   ^                           |
   |-<----------<---------<----|
	The first three sequences are because of a call to
        amdgpu_device_gpu_recover. The three sequences will be
        executed in a loop until all devices in the hive list
        are iterated.
        The sequences starting from .hw_fini only apply to the
        first device. Since .suspend has been called before,
        except the resumed phase1 basic ip blocks, all other ip
        blocks .hw_fini of current device will do nothing.
     d. When removing other devices, the calling sequence is the
        same as the legacy one:
	   .hw_fini -> .early_fini -> .sw_fini.
	Since .suspend has been called when removing the first device,
        except the resumed phase1 basic ip blocks, all of other ip
        blocks .hw_fini of current device will do nothing.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f5c7e779

15 Sep, 2022 · 2 commits

drm/amdgpu: make sure to init common IP before gmc · a8671493

Committed by Alex Deucher on Aug 30, 2022

Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.

This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.

Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

a8671493

drm/amdgpu: make sure to init common IP before gmc · c1c39032

Committed by Alex Deucher on Aug 30, 2022

Move common IP init before GMC init so that HDP gets
remapped before GMC init which uses it.

This fixes the Unsupported Request error reported through
AER during driver load. The error happens as a write happens
to the remap offset before real remapping is done.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=216373

The error was unnoticed before and got visible because of the commit
referenced below. This doesn't fix anything in the commit below, rather
fixes the issue in amdgpu exposed by the commit. The reference is only
to associate this commit with below one so that both go together.

Fixes: 8795e182 ("PCI/portdrv: Don't disable AER reporting in get_port_device_capability()")
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Lijo Lazar <lijo.lazar@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c1c39032

14 Sep, 2022 · 2 commits

drm/amdgpu: Fix hive reference count leak · 2efc30f0

Committed by Vignesh Chander on Sep 09, 2022

Both get_xgmi_hive and put_xgmi_hive can be skipped since the
reset domain is not necessary for VF.
Signed-off-by: Vignesh Chander <Vignesh.Chander@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2efc30f0

drm/amdgpu: Use per device reset_domain for XGMI on sriov configuration · 46c67660

Committed by shaoyunl on Sep 06, 2022

For SRIOV configuration, the host driver controls the reset method (either
FLR or heavier chain reset). The host will notify the guest individually with
an FLR message if an individual GPU within the hive needs to be reset. So on
the guest side, there is no need to use hive->reset_domain to replace the
original per device reset_domain.
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

46c67660

08 Sep, 2022 · 1 commit

drm/amdgpu: TA unload messages are not actually sent to psp when amdgpu is uninstalled · fac53471

Committed by YiPeng Chai on Aug 18, 2022

V1:
  The psp_cmd_submit_buf function is called by psp_hw_fini to send
TA unload messages to psp to terminate ras, asd and tmr. But when
amdgpu is uninstalled, drm_dev_unplug is called earlier than
psp_hw_fini in amdgpu_pci_remove; the calling order is as follows:
static void amdgpu_pci_remove(struct pci_dev *pdev) {
	drm_dev_unplug
	......
	amdgpu_driver_unload_kms->amdgpu_device_fini_hw->...
		->.hw_fini->psp_hw_fini->...
		->psp_ta_unload->psp_cmd_submit_buf
	......
}
The program will return when calling drm_dev_enter in psp_cmd_submit_buf.

So the call to drm_dev_enter in psp_cmd_submit_buf should be
removed, so that the TA unload messages can be sent to the psp
when amdgpu is uninstalled.

V2:
1. Restore psp_cmd_submit_buf to its original code.
2. Move drm_dev_unplug call after amdgpu_driver_unload_kms in
   amdgpu_pci_remove.
3. Since amdgpu_device_fini_hw is called by amdgpu_driver_unload_kms,
   remove the unplug check to release device mmio resource in
   amdgpu_device_fini_hw before calling drm_dev_unplug.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

fac53471
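
Why the ordering in this commit matters can be modelled in a few lines: once the device is marked unplugged, a guarded submit path returns early and the unload message is never sent. This is a simplified sketch; `demo_dev_enter()` is a hypothetical stand-in for the guard that `drm_dev_enter()` provides, and none of the `demo_*` functions exist in the driver.

```c
#include <stdbool.h>

/* Hypothetical device model for this illustration only. */
struct demo_dev {
    bool unplugged;      /* set by the unplug step */
    bool ta_unload_sent; /* set when the unload message actually goes out */
};

/* Stand-in for the unplug guard: refuses entry once the device is unplugged. */
static bool demo_dev_enter(const struct demo_dev *dev)
{
    return !dev->unplugged;
}

/* Guarded submit: the message is silently dropped after unplug. */
static void demo_cmd_submit(struct demo_dev *dev)
{
    if (!demo_dev_enter(dev))
        return;
    dev->ta_unload_sent = true;
}

static void demo_unplug(struct demo_dev *dev) { dev->unplugged = true; }
static void demo_unload(struct demo_dev *dev) { demo_cmd_submit(dev); }

/* Original order: unplug first, so the unload message is lost. */
static void demo_remove_unplug_first(struct demo_dev *dev)
{
    demo_unplug(dev);
    demo_unload(dev);
}

/* V2 order: unload first, then unplug, so the message is delivered. */
static void demo_remove_unload_first(struct demo_dev *dev)
{
    demo_unload(dev);
    demo_unplug(dev);
}
```

The V2 approach keeps the guard in the submit path untouched and fixes the caller's ordering instead, which is the less invasive of the two options described in the message.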

31 Aug, 2022 · 1 commit

drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks · b97e9145

Committed by Alex Sierra on Aug 25, 2022

[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b97e9145

30 Aug, 2022 · 1 commit

drm/amdgpu: ensure no PCIe peer access for CPU XGMI iolinks · ab23c5b9

Committed by Alex Sierra on Aug 25, 2022

[Why] Devices with CPU XGMI iolink do not support PCIe peer access.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ab23c5b9

26 Aug, 2022 · 1 commit

drm/amd/amdgpu: avoid soft reset check when gpu recovery disabled · d3ef9d57

Committed by Chengming Gui on Aug 05, 2022

Avoid soft reset, and even the IP hang check (ring/IB test), when GPU
recovery is disabled.

v2: add missing "}"
Signed-off-by: Chengming Gui <Jack.Gui@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d3ef9d57

23 Aug, 2022 · 2 commits

drm/amdgpu: Remove the additional kfd pre reset call for sriov · 947f63f1

Committed by shaoyunl on Aug 18, 2022

The additional call is caused by a merge conflict.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

947f63f1

drm/amdgpu: fix hive reference leak when adding xgmi device · 9dfa4860

Committed by YiPeng Chai on Aug 12, 2022

amdgpu_get_xgmi_hive is called without a matching
amdgpu_put_xgmi_hive, which leaks the hive reference.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9dfa4860
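
The leak fixed by "fix hive reference leak when adding xgmi device" is the classic unbalanced get/put pattern. A minimal sketch, assuming a hypothetical `struct demo_hive` with a plain integer refcount (the real hive object is reference-counted through kernel infrastructure, not a bare int):

```c
/* Hypothetical hive object with a plain reference counter. */
struct demo_hive {
    int refcount;
};

/* Take a reference and hand the object back to the caller. */
static struct demo_hive *demo_get_hive(struct demo_hive *hive)
{
    hive->refcount++;
    return hive;
}

/* Drop a reference taken with demo_get_hive(). */
static void demo_put_hive(struct demo_hive *hive)
{
    hive->refcount--;
}

/* Leaky variant: takes a reference and never drops it. */
static void demo_add_device_leaky(struct demo_hive *shared)
{
    struct demo_hive *hive = demo_get_hive(shared);

    (void)hive; /* ... use hive ..., but no put */
}

/* Fixed variant: every get is paired with a put on the way out. */
static void demo_add_device_balanced(struct demo_hive *shared)
{
    struct demo_hive *hive = demo_get_hive(shared);

    /* ... use hive ... */
    demo_put_hive(hive);
}
```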

20 Aug, 2022 · 2 commits

drm/amdgpu: Remove the additional kfd pre reset call for sriov · 06671734

Committed by shaoyunl on Aug 18, 2022

The additional call is caused by a merge conflict.
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

06671734

drm/amdgpu: fix hive reference leak when adding xgmi device · f5994da7

Committed by YiPeng Chai on Aug 12, 2022

amdgpu_get_xgmi_hive is called without a matching
amdgpu_put_xgmi_hive, which leaks the hive reference.
Signed-off-by: YiPeng Chai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f5994da7

17 Aug, 2022 · 3 commits

drm/amd: Add detailed GFXOFF stats to debugfs · 0ad7347a

Committed by André Almeida on Aug 10, 2022

Add debugfs interface to log GFXOFF statistics:

- Read amdgpu_gfxoff_count to get the total GFXOFF entry count at the
  time of query since system power-up

- Write 1 to amdgpu_gfxoff_residency to start logging, and 0 to stop.
  Read it to get average GFXOFF residency % multiplied by 100
  during the last logging interval.

Both features are designed to keep the values persistent between
suspends.
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0ad7347a
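
Since the residency value is described as a percentage multiplied by 100, a reader of `amdgpu_gfxoff_residency` has to scale it back. A small sketch of that conversion, using integer arithmetic only; the struct and function names are hypothetical, and how the raw value is read from debugfs is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical split representation of a percentage with two decimals. */
struct demo_percent {
    uint32_t whole;      /* integer part of the percentage */
    uint32_t hundredths; /* fractional part, 0..99 */
};

/* Convert a raw "percent multiplied by 100" reading into its two parts. */
static struct demo_percent demo_residency_percent(uint32_t raw)
{
    struct demo_percent p = { raw / 100, raw % 100 };

    return p;
}
```

For example, a raw reading of 9876 corresponds to 98.76 % residency over the last logging interval.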

drm/amdgpu: revert context to stop engine before mode2 reset · 72fadb13

Committed by Victor Zhao on Jun 24, 2022

For some hangs caused by slow tests, the engine cannot be stopped, which
may cause resume failure after reset. In this case, force-halt the
engine by reverting context addresses.
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

72fadb13

drm/amdgpu: let mode2 reset fallback to default when failure · dac6b808

Committed by Victor Zhao on Jul 28, 2022

- introduce AMDGPU_SKIP_MODE2_RESET flag
- let mode2 reset fallback to default reset method if failed

v2: move this part out from the asic specific part
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dac6b808

11 Aug, 2022 · 1 commit

drm/amdgpu: Avoid another list of reset devices · 0a83bb35

Committed by Lijo Lazar on Aug 03, 2022

A list of devices to be reset is already created in
amdgpu_device_gpu_recover function. Creating another list with the
same nodes is incorrect and not supported in list_head. Instead, pass
the device list as part of reset context.

Fixes: 9e085647 (drm/amdgpu: Refactor mode2 reset logic for v13.0.2)
Signed-off-by: Lijo Lazar <lijo.lazar@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0a83bb35
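
The remark that putting the same nodes on a second list is "not supported in list_head" follows from each node carrying a single pair of next/prev pointers. The sketch below is a userspace imitation of a circular doubly linked list, written for illustration; `struct demo_list` and its helpers are hypothetical and only mimic the shape of the kernel's `list_head`.

```c
#include <stdbool.h>

/* Userspace stand-in for a circular doubly linked list node. */
struct demo_list {
    struct demo_list *next, *prev;
};

static void demo_list_init(struct demo_list *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node just before head, i.e. at the tail of the list. */
static void demo_list_add_tail(struct demo_list *node, struct demo_list *head)
{
    struct demo_list *last = head->prev;

    node->next = head;
    node->prev = last;
    last->next = node;
    head->prev = node;
}

/* True if walking forward from head gets back to head within max_steps. */
static bool demo_list_walk_returns(const struct demo_list *head, int max_steps)
{
    const struct demo_list *pos = head->next;

    for (int i = 0; i < max_steps; i++) {
        if (pos == head)
            return true;
        pos = pos->next;
    }
    return false;
}
```

Adding one node to list `a` and then to list `b` without removing it first overwrites the node's links: `a` still points at the node, but the node now points into `b`, so a walk that starts at `a` never returns to it. Passing the already built list along, as the commit does with the reset context, avoids the problem.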

29 Jul, 2022 · 2 commits

drm/amdgpu: move mes self test after drm sched re-started · ed67f729

Committed by Jack Xiao on Jul 20, 2022

The MES self test relies on VM mapping; move it after
drm sched is re-started so that VM mapping can work
during gpu reset.
Signed-off-by: Jack Xiao <Jack.Xiao@amd.com>
Acked-and-tested-by: Evan Quan <evan.quan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ed67f729

drm/amdgpu: drop non-necessary call trace dump · 0da0def7

Committed by Evan Quan on Jul 20, 2022

This extra call trace dump comes out in every gpu reset.
It gives people the wrong impression that something
went wrong, although nothing actually did.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

0da0def7

19 Jul, 2022 · 1 commit

drm/amdgpu: Get rid of amdgpu_job->external_hw_fence · f6a3f660

Committed by Andrey Grodzovsky on Jul 13, 2022

This is a follow-up cleanup to [1]. See below for the refcount balancing
when calling amdgpu_job_submit_direct after this cleanup, as far
as I calculated.

amdgpu_fence_emit
	dma_fence_init 1
	dma_fence_get(fence) 2
	rcu_assign_pointer(*ptr, dma_fence_get(fence)) 3

---> amdgpu_job_submit_direct completes before fence signaled
			amdgpu_sa_bo_free
				(*sa_bo)->fence = dma_fence_get(fence) 4

			amdgpu_job_free
				dma_fence_put 3

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 4
				dma_fence_put(f); 3

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 2

			amdgpu_fence_process
				dma_fence_put 1

			amdgpu_sa_bo_remove_locked
				dma_fence_put 0

---> amdgpu_job_submit_direct completes after fence signaled
			amdgpu_fence_process
				dma_fence_put 2

			amdgpu_job_free
				dma_fence_put 1

			amdgpu_vcn_enc_get_destroy_msg
				*fence = dma_fence_get(f) 2
				dma_fence_put(f); 1

			amdgpu_vcn_enc_ring_test_ib
				dma_fence_put(fence) 0

[1] - https://patchwork.kernel.org/project/dri-devel/cover/20220624180955.485440-1-andrey.grodzovsky@amd.com/
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Suggested-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f6a3f660
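
The two refcount traces in this commit message can be checked mechanically by replaying them as +1 (get) and -1 (put) steps. The helper below is a hypothetical checker written for illustration; the step arrays are transcribed from the traces above, starting from the count of 3 reached at the end of `amdgpu_fence_emit`.

```c
#include <stddef.h>

/*
 * Replay +1 (get) / -1 (put) steps from a starting count.
 * Returns the final count, or -1 if the count goes negative or
 * reaches zero before the last step (object freed while still used).
 */
static int demo_replay_refcount(int start, const int *ops, size_t n)
{
    int count = start;

    for (size_t i = 0; i < n; i++) {
        count += ops[i];
        if (count < 0 || (count == 0 && i + 1 < n))
            return -1;
    }
    return count;
}

/* Steps when amdgpu_job_submit_direct completes before the fence signals. */
static const int demo_before_signal[] = { +1, -1, +1, -1, -1, -1, -1 };

/* Steps when amdgpu_job_submit_direct completes after the fence signals. */
static const int demo_after_signal[] = { -1, -1, +1, -1, -1 };
```

Both sequences end at exactly zero, which agrees with the balance claimed in the message.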

13 Jul, 2022 · 1 commit

drm/amdgpu: support reset flag set for gpu reset · f1549c09

Committed by Likun Gao on Jul 08, 2022

Move reset_context out of the gpu recover function to make it configurable
for different reset purposes.
For a reset requested through the gpu_recovery sysfs call, force the full
reset method. Otherwise, try a soft reset by default if the related ASIC
supports it; if the soft reset fails, use a full reset.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f1549c09
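
The selection policy described in "support reset flag set for gpu reset" can be sketched as a caller-supplied context that may force a full reset, with soft reset as the default and full reset as the fallback. All names below are hypothetical; the `soft_reset_succeeds` field is an assumption that stands in for the outcome of actually attempting the soft reset.

```c
#include <stdbool.h>

enum demo_reset_method { DEMO_RESET_SOFT, DEMO_RESET_FULL };

/* Hypothetical caller-provided context, built outside the recover function. */
struct demo_reset_context {
    bool need_full_reset; /* set by callers that must force a full reset */
};

/* Hypothetical asic description for this illustration. */
struct demo_asic {
    bool supports_soft_reset;
    bool soft_reset_succeeds; /* models the result of trying a soft reset */
};

/* Returns the reset method that ends up being used. */
static enum demo_reset_method
demo_gpu_recover(const struct demo_asic *asic,
                 const struct demo_reset_context *ctx)
{
    if (!ctx->need_full_reset &&
        asic->supports_soft_reset &&
        asic->soft_reset_succeeds)
        return DEMO_RESET_SOFT;

    /* Forced by the caller, unsupported, or the soft reset failed. */
    return DEMO_RESET_FULL;
}
```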

30 Jun, 2022 · 1 commit

drm/amdgpu: fix documentation warning · 6e9c65f7

Committed by Alex Deucher on Jun 23, 2022

Fixes this issue:
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:5094: warning: expecting prototype for amdgpu_device_gpu_recover_imp(). Prototype was for amdgpu_device_gpu_recover() instead

Fixes: cf727044 ("drm/amdgpu: Rename amdgpu_device_gpu_recover_imp back to amdgpu_device_gpu_recover")
Reviewed-by: Kent Russell <kent.russell@amd.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

6e9c65f7

openeuler / Kernel · synced successfully about 2 years ago