1. 18 Dec 2021, 1 commit
    • drm/amdgpu: introduce new amdgpu_fence object to indicate the job embedded fence · bf67014d
      Committed by Huang Rui
      The job-embedded fence doesn't initialize its flags at
      dma_fence_init(). As a result, the amdgpu_fence_get_timeline_name
      callback takes the wrong path and triggers a NULL pointer panic
      once the trace event is enabled. So introduce a new amdgpu_fence
      object to indicate the job-embedded fence.
      
      [  156.131790] BUG: kernel NULL pointer dereference, address: 00000000000002a0
      [  156.131804] #PF: supervisor read access in kernel mode
      [  156.131811] #PF: error_code(0x0000) - not-present page
      [  156.131817] PGD 0 P4D 0
      [  156.131824] Oops: 0000 [#1] PREEMPT SMP PTI
      [  156.131832] CPU: 6 PID: 1404 Comm: sdma0 Tainted: G           OE     5.16.0-rc1-custom #1
      [  156.131842] Hardware name: Gigabyte Technology Co., Ltd. Z170XP-SLI/Z170XP-SLI-CF, BIOS F20 11/04/2016
      [  156.131848] RIP: 0010:strlen+0x0/0x20
      [  156.131859] Code: 89 c0 c3 0f 1f 80 00 00 00 00 48 01 fe eb 0f 0f b6 07 38 d0 74 10 48 83 c7 01 84 c0 74 05 48 39 f7 75 ec 31 c0 c3 48 89 f8 c3 <80> 3f 00 74 10 48 89 f8 48 83 c0 01 80 38 00 75 f7 48 29 f8 c3 31
      [  156.131872] RSP: 0018:ffff9bd0018dbcf8 EFLAGS: 00010206
      [  156.131880] RAX: 00000000000002a0 RBX: ffff8d0305ef01b0 RCX: 000000000000000b
      [  156.131888] RDX: ffff8d03772ab924 RSI: ffff8d0305ef01b0 RDI: 00000000000002a0
      [  156.131895] RBP: ffff9bd0018dbd60 R08: ffff8d03002094d0 R09: 0000000000000000
      [  156.131901] R10: 000000000000005e R11: 0000000000000065 R12: ffff8d03002094d0
      [  156.131907] R13: 000000000000001f R14: 0000000000070018 R15: 0000000000000007
      [  156.131914] FS:  0000000000000000(0000) GS:ffff8d062ed80000(0000) knlGS:0000000000000000
      [  156.131923] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  156.131929] CR2: 00000000000002a0 CR3: 000000001120a005 CR4: 00000000003706e0
      [  156.131937] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  156.131942] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  156.131949] Call Trace:
      [  156.131953]  <TASK>
      [  156.131957]  ? trace_event_raw_event_dma_fence+0xcc/0x200
      [  156.131973]  ? ring_buffer_unlock_commit+0x23/0x130
      [  156.131982]  dma_fence_init+0x92/0xb0
      [  156.131993]  amdgpu_fence_emit+0x10d/0x2b0 [amdgpu]
      [  156.132302]  amdgpu_ib_schedule+0x2f9/0x580 [amdgpu]
      [  156.132586]  amdgpu_job_run+0xed/0x220 [amdgpu]
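      
      The fix, per the title, adds a dedicated amdgpu_fence object so the
      timeline-name callback can tell a job-embedded fence from a
      standalone one. A minimal sketch of the idea; the struct layout,
      the to_amdgpu_ring() helper, and the flag name (taken from the
      related hw_fence commit below) are assumptions:
      
      /* Sketch only: a dedicated object for fences that are NOT embedded
       * in a job, so callbacks can tell the two cases apart. */
      struct amdgpu_fence {
              struct dma_fence base;
              struct amdgpu_ring *ring;       /* ring this fence belongs to */
      };
      
      static const char *amdgpu_fence_get_timeline_name(struct dma_fence *f)
      {
              struct amdgpu_ring *ring;
      
              if (test_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, &f->flags)) {
                      /* The fence lives inside an amdgpu_job; recover the
                       * ring through the job instead of misinterpreting
                       * the fence as a standalone amdgpu_fence. */
                      struct amdgpu_job *job =
                              container_of(f, struct amdgpu_job, hw_fence);
      
                      ring = to_amdgpu_ring(job->base.sched);
              } else {
                      ring = container_of(f, struct amdgpu_fence, base)->ring;
              }
      
              return (const char *)ring->name;
      }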
      
      v2: fix mismatch warning between the prototype and function name (Ray, kernel test robot)
      Signed-off-by: Huang Rui <ray.huang@amd.com>
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  2. 02 Dec 2021, 2 commits
  3. 25 Nov 2021, 1 commit
  4. 18 Nov 2021, 1 commit
  5. 10 Nov 2021, 1 commit
  6. 06 Nov 2021, 1 commit
  7. 04 Nov 2021, 2 commits
  8. 29 Oct 2021, 1 commit
  9. 20 Oct 2021, 1 commit
  10. 14 Oct 2021, 1 commit
  11. 09 Oct 2021, 1 commit
  12. 07 Oct 2021, 1 commit
  13. 06 Oct 2021, 6 commits
  14. 05 Oct 2021, 7 commits
  15. 30 Sep 2021, 1 commit
  16. 24 Sep 2021, 1 commit
  17. 16 Sep 2021, 1 commit
  18. 15 Sep 2021, 2 commits
  19. 21 Aug 2021, 2 commits
    • drm/amdgpu: Cancel delayed work when GFXOFF is disabled · 32bc8f83
      Committed by Michel Dänzer
      schedule_delayed_work does not push back the work if it was already
      scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
      after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
      was disabled and re-enabled again during those 100 ms.
      
      This resulted in frame drops / stutter with the upcoming mutter 41
      release on Navi 14, due to constantly enabling GFXOFF in the HW and
      disabling it again (for getting the GPU clock counter).
      
      To fix this, call cancel_delayed_work_sync when the disable count
      transitions from 0 to 1, and only schedule the delayed work on the
      reverse transition, not if the disable count was already 0. This makes
      sure the delayed work doesn't run at unexpected times, and allows it to
      be lock-free.
      
      v2:
      * Use cancel_delayed_work_sync & mutex_trylock instead of
        mod_delayed_work.
      v3:
      * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
      v4:
      * Fix race condition between amdgpu_gfx_off_ctrl incrementing
        adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
        checking for it to be 0 (Evan Quan)
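      
      A minimal sketch of the resulting v4 logic; the function and field
      names follow the commit message, while the delay constant and
      remaining details are assumptions:
      
      /* Sketch only: count GFXOFF disable requests and only touch the
       * delayed work on 0 <-> 1 transitions. */
      void amdgpu_gfx_off_ctrl(struct amdgpu_device *adev, bool enable)
      {
              mutex_lock(&adev->gfx.gfx_off_mutex);
      
              if (!enable) {
                      /* 0 -> 1 transition: make sure the delayed re-enable
                       * work is not running and cannot start. Safe under
                       * the mutex because the handler is lock-free (v3). */
                      if (adev->gfx.gfx_off_req_count == 0)
                              cancel_delayed_work_sync(&adev->gfx.gfx_off_delay_work);
                      adev->gfx.gfx_off_req_count++;
              } else if (adev->gfx.gfx_off_req_count &&
                         --adev->gfx.gfx_off_req_count == 0) {
                      /* 1 -> 0 transition only: re-enable GFXOFF after a
                       * delay; never reschedule when the count was already
                       * 0, which caused the early ~100 ms expiry. */
                      schedule_delayed_work(&adev->gfx.gfx_off_delay_work,
                                            msecs_to_jiffies(100));
              }
      
              mutex_unlock(&adev->gfx.gfx_off_mutex);
      }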
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
      Acked-by: Christian König <christian.koenig@amd.com> # v3
      Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdgpu: Cancel delayed work when GFXOFF is disabled · 90a92662
      Committed by Michel Dänzer
      schedule_delayed_work does not push back the work if it was already
      scheduled before, so amdgpu_device_delay_enable_gfx_off ran ~100 ms
      after the first time GFXOFF was disabled and re-enabled, even if GFXOFF
      was disabled and re-enabled again during those 100 ms.
      
      This resulted in frame drops / stutter with the upcoming mutter 41
      release on Navi 14, due to constantly enabling GFXOFF in the HW and
      disabling it again (for getting the GPU clock counter).
      
      To fix this, call cancel_delayed_work_sync when the disable count
      transitions from 0 to 1, and only schedule the delayed work on the
      reverse transition, not if the disable count was already 0. This makes
      sure the delayed work doesn't run at unexpected times, and allows it to
      be lock-free.
      
      v2:
      * Use cancel_delayed_work_sync & mutex_trylock instead of
        mod_delayed_work.
      v3:
      * Make amdgpu_device_delay_enable_gfx_off lock-free (Christian König)
      v4:
      * Fix race condition between amdgpu_gfx_off_ctrl incrementing
        adev->gfx.gfx_off_req_count and amdgpu_device_delay_enable_gfx_off
        checking for it to be 0 (Evan Quan)
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Evan Quan <evan.quan@amd.com>
      Reviewed-by: Lijo Lazar <lijo.lazar@amd.com> # v3
      Acked-by: Christian König <christian.koenig@amd.com> # v3
      Signed-off-by: Michel Dänzer <mdaenzer@redhat.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  20. 19 Aug 2021, 1 commit
  21. 17 Aug 2021, 1 commit
    • drm/amd/amdgpu: embed hw_fence into amdgpu_job · c530b02f
      Committed by Jack Zhang
      Why: Previously the hw fence was allocated separately from the job,
      which caused lifetime issues and corner cases. The ideal situation
      is to let the fence manage both the job's and its own lifetime, and
      to simplify the design of the gpu-scheduler.
      
      How:
      Embed the hw_fence into amdgpu_job.
      1. Normal job submission is covered by this method.
      2. For ib_test and submissions without a parent job, keep the
      legacy way of creating a hw fence separately.
      v2:
      use AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT to show that the fence is
      embedded in a job.
      v3:
      remove redundant variable ring in amdgpu_job
      v4:
      add tdr sequence support for this feature. Add a job_run_counter to
      indicate whether this job is a resubmit job.
      v5:
      add missing handling in amdgpu_fence_enable_signaling
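      
      A minimal sketch of the embedding; the flag comes from the v2 note
      above, while the struct layout and emit-path details are
      assumptions:
      
      struct amdgpu_job {
              struct drm_sched_job base;
              struct dma_fence hw_fence;      /* embedded, no separate alloc */
              unsigned int job_run_counter;   /* >0 means resubmitted (v4) */
              /* ... */
      };
      
      /* In amdgpu_fence_emit(): use the embedded fence when a parent job
       * exists, otherwise keep the legacy separate allocation. */
      if (job) {
              fence = &job->hw_fence;
      } else {
              struct amdgpu_fence *af = kmem_cache_alloc(amdgpu_fence_slab,
                                                         GFP_KERNEL);
              if (!af)
                      return -ENOMEM;
              fence = &af->base;
      }
      /* ... */
      dma_fence_init(fence, /* ops, lock, context, seq */ ...);
      if (job)        /* set after init, which clears the flags */
              set_bit(AMDGPU_FENCE_FLAG_EMBED_IN_JOB_BIT, &fence->flags);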
      Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
      Signed-off-by: Jack Zhang <Jack.Zhang7@hotmail.com>
      Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Reviewed-by: Monk Liu <monk.liu@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  22. 10 Aug 2021, 1 commit
  23. 06 Aug 2021, 1 commit
  24. 29 Jul 2021, 1 commit
  25. 27 Jul 2021, 1 commit
    • drm/amdgpu: Fix resource leak on probe error path · d47255d3
      Committed by Jiri Kosina
      This reverts commit 4192f7b5.
      
      It is not true (as stated in the reverted commit changelog) that we never
      unmap the BAR on failure; it actually does happen properly on
      amdgpu_driver_load_kms() -> amdgpu_driver_unload_kms() ->
      amdgpu_device_fini() error path.
      
      What's worse, this commit actually completely breaks resource freeing on
      probe failure (like e.g. failure to load microcode), as
      amdgpu_driver_unload_kms() notices adev->rmmio being NULL and bails too
      early, leaving all the resources that'd normally be freed in
      amdgpu_acpi_fini() and amdgpu_device_fini() still hanging around, leading
      to all sorts of oopses when someone tries to, for example, access the
      sysfs and procfs resources which are still around while the driver is
      gone.
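      
      A sketch of the failure mode described above; the rmmio check is
      named in the message, while the surrounding teardown calls are
      assumptions:
      
      /* Sketch only: with 4192f7b5 applied, a probe failure had already
       * unmapped the BAR and cleared adev->rmmio, so unload bailed here
       * and skipped all of the real teardown. */
      void amdgpu_driver_unload_kms(struct drm_device *dev)
      {
              struct amdgpu_device *adev = dev->dev_private;
      
              if (adev->rmmio == NULL)
                      return; /* leaks sysfs/procfs entries and everything
                               * else the calls below would have freed */
      
              amdgpu_acpi_fini(adev);
              amdgpu_device_fini(adev);
              /* ... */
      }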
      
      Fixes: 4192f7b5 ("drm/amdgpu: unmap register bar on device init failure")
      Reported-by: Vojtech Pavlik <vojtech@ucw.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org