Commits · a2b4785f01280a4291edb9fda69032fc2e4bfd3f · openeuler / Kernel

20 May, 2021 · 9 commits

drm/amdgpu: stop touching sched.ready in the backend · a2b4785f

Committed by Christian König on May 18, 2021

This unfortunately comes up at regular intervals and breaks
GPU reset for the engine in question.

The sched.ready flag is cleared when an engine fails to come up
during hw_init, but it should never be set to false during hw_fini.

v2: squash in unused variable fix (Alex)
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/amdgpu: fix a potential deadlock in gpu reset · 9c2876d5

Committed by Lang Yu on May 17, 2021

When amdgpu_ib_ring_tests failed, the reset logic called
amdgpu_device_ip_suspend twice, and a deadlock occurred.
Deadlock log:

[  805.655192] amdgpu 0000:04:00.0: amdgpu: ib ring test failed (-110).
[  806.290952] [drm] free PSP TMR buffer

[  806.319406] ============================================
[  806.320315] WARNING: possible recursive locking detected
[  806.321225] 5.11.0-custom #1 Tainted: G        W  OEL
[  806.322135] --------------------------------------------
[  806.323043] cat/2593 is trying to acquire lock:
[  806.323825] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.325668]
               but task is already holding lock:
[  806.326664] ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.328430]
               other info that might help us debug this:
[  806.329539]  Possible unsafe locking scenario:

[  806.330549]        CPU0
[  806.330983]        ----
[  806.331416]   lock(&adev->dm.dc_lock);
[  806.332086]   lock(&adev->dm.dc_lock);
[  806.332738]
                *** DEADLOCK ***

[  806.333747]  May be due to missing lock nesting notation

[  806.334899] 3 locks held by cat/2593:
[  806.335537]  #0: ffff888100d3f1b8 (&attr->mutex){+.+.}-{3:3}, at: simple_attr_read+0x4e/0x110
[  806.337009]  #1: ffff888136b1fd78 (&adev->reset_sem){++++}-{3:3}, at: amdgpu_device_lock_adev+0x42/0x94 [amdgpu]
[  806.339018]  #2: ffff888136b1cdc8 (&adev->dm.dc_lock){+.+.}-{3:3}, at: dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.340869]
               stack backtrace:
[  806.341621] CPU: 6 PID: 2593 Comm: cat Tainted: G        W  OEL    5.11.0-custom #1
[  806.342921] Hardware name: AMD Celadon-CZN/Celadon-CZN, BIOS WLD0C23N_Weekly_20_12_2 12/23/2020
[  806.344413] Call Trace:
[  806.344849]  dump_stack+0x93/0xbd
[  806.345435]  __lock_acquire.cold+0x18a/0x2cf
[  806.346179]  lock_acquire+0xca/0x390
[  806.346807]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.347813]  __mutex_lock+0x9b/0x930
[  806.348454]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.349434]  ? amdgpu_device_indirect_rreg+0x58/0x70 [amdgpu]
[  806.350581]  ? _raw_spin_unlock_irqrestore+0x47/0x50
[  806.351437]  ? dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.352437]  ? rcu_read_lock_sched_held+0x4f/0x80
[  806.353252]  ? rcu_read_lock_sched_held+0x4f/0x80
[  806.354064]  mutex_lock_nested+0x1b/0x20
[  806.354747]  ? mutex_lock_nested+0x1b/0x20
[  806.355457]  dm_suspend+0xb8/0x1d0 [amdgpu]
[  806.356427]  ? soc15_common_set_clockgating_state+0x17d/0x19 [amdgpu]
[  806.357736]  amdgpu_device_ip_suspend_phase1+0x78/0xd0 [amdgpu]
[  806.360394]  amdgpu_device_ip_suspend+0x21/0x70 [amdgpu]
[  806.362926]  amdgpu_device_pre_asic_reset+0xb3/0x270 [amdgpu]
[  806.365560]  amdgpu_device_gpu_recover.cold+0x679/0x8eb [amdgpu]
Signed-off-by: Lang Yu <Lang.Yu@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: update sdma golden setting for Navi12 · 77194d86

Committed by Guchun Chen on May 17, 2021

The current golden setting is out of date.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: update gc golden setting for Navi12 · 99c45ba5

Committed by Guchun Chen on May 17, 2021

The current golden setting is out of date.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Kenneth Feng <kenneth.feng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: Fix a use-after-free · 1e5c3738

Committed by xinhui pan on May 18, 2021

It looks like we forgot to set ttm->sg to NULL,
which triggers the panic below.

[ 1235.844104] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b7b4b: 0000 [#1] SMP DEBUG_PAGEALLOC NOPTI
[ 1235.989074] Call Trace:
[ 1235.991751]  sg_free_table+0x17/0x20
[ 1235.995667]  amdgpu_ttm_backend_unbind.cold+0x4d/0xf7 [amdgpu]
[ 1236.002288]  amdgpu_ttm_backend_destroy+0x29/0x130 [amdgpu]
[ 1236.008464]  ttm_tt_destroy+0x1e/0x30 [ttm]
[ 1236.013066]  ttm_bo_cleanup_memtype_use+0x51/0xa0 [ttm]
[ 1236.018783]  ttm_bo_release+0x262/0xa50 [ttm]
[ 1236.023547]  ttm_bo_put+0x82/0xd0 [ttm]
[ 1236.027766]  amdgpu_bo_unref+0x26/0x50 [amdgpu]
[ 1236.032809]  amdgpu_amdkfd_gpuvm_alloc_memory_of_gpu+0x7aa/0xd90 [amdgpu]
[ 1236.040400]  kfd_ioctl_alloc_memory_of_gpu+0xe2/0x330 [amdgpu]
[ 1236.046912]  kfd_ioctl+0x463/0x690 [amdgpu]
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
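
The pattern behind this kind of fix can be sketched in user-space C. This is only an illustration under assumptions: `struct tt`, `tt_unbind()` and the `free_calls` counter are hypothetical stand-ins, not the driver's actual types or functions. The point is that clearing the pointer after freeing it turns a second teardown into a no-op instead of a second free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for an object that owns a separately allocated table. */
struct tt {
    int *sg;
};

/* Counts how many times the table was actually released (for illustration). */
static int free_calls;

/* Release the table and clear the pointer so a later call cannot free it again. */
static void tt_unbind(struct tt *tt)
{
    assert(tt != NULL);
    if (!tt->sg)
        return;            /* already released, nothing to do */
    free(tt->sg);
    free_calls++;
    tt->sg = NULL;         /* the step the commit message says was forgotten */
}
```

Calling `tt_unbind()` twice on the same object releases the table once; without the `tt->sg = NULL` line the second call would pass a dangling pointer to `free()`.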

drm/amdgpu: add video_codecs query support for aldebaran · ab95cb3e

Committed by James Zhu on May 18, 2021

Add video_codecs query support for aldebaran.
Signed-off-by: James Zhu <James.Zhu@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amd/amdgpu: fix refcount leak · fa7e6abc

Committed by Jingwen Chen on May 17, 2021

[Why]
The GEM object rfb->base.obj[0] is referenced (get) according to num_planes
in amdgpufb_create, but is not released (put) according to num_planes.

[How]
Put rfb->base.obj[0] in amdgpu_fbdev_destroy according to num_planes.
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
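
A minimal user-space sketch of balanced get/put, assuming hypothetical `struct obj` and `struct fb` types and a plain integer refcount (the kernel uses its own GEM helpers, which are not reproduced here). Creation takes one reference per plane, so destruction has to drop one reference per plane as well.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PLANES 4

/* Hypothetical refcounted object and framebuffer, for illustration only. */
struct obj {
    int refcount;
};

struct fb {
    struct obj *obj[MAX_PLANES];
    int num_planes;
};

/* Take one reference on the object for every plane that points at it. */
static void fb_create(struct fb *fb, struct obj *o, int num_planes)
{
    assert(num_planes >= 0 && num_planes <= MAX_PLANES);
    fb->num_planes = num_planes;
    for (int i = 0; i < num_planes; i++) {
        fb->obj[i] = o;
        o->refcount++;
    }
}

/* Drop exactly as many references as fb_create() took. */
static void fb_destroy(struct fb *fb)
{
    for (int i = 0; i < fb->num_planes; i++) {
        fb->obj[i]->refcount--;
        fb->obj[i] = NULL;
    }
    fb->num_planes = 0;
}
```

If `fb_destroy()` dropped only a single reference regardless of `num_planes`, every multi-plane framebuffer would leave the count one or more too high, which is the kind of leak the commit describes.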

drm/amdgpu: disable 3DCGCG on picasso/raven1 to avoid compute hang · dbd1003d

Committed by Changfeng on May 14, 2021

There is a problem with the 3DCGCG firmware that causes compute tests
to hang on picasso/raven1. Disable 3DCGCG in the driver to avoid the
compute hang.
Signed-off-by: Changfeng <Changfeng.Zhu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: Fix GPU TLB update error when PAGE_SIZE > AMDGPU_PAGE_SIZE · d5375156

Committed by Yi Li on May 14, 2021

When PAGE_SIZE is larger than AMDGPU_PAGE_SIZE, the number of GPU TLB
entries that need to be updated in amdgpu_map_buffer() should be multiplied
by AMDGPU_GPU_PAGES_IN_CPU_PAGE (PAGE_SIZE / AMDGPU_PAGE_SIZE).
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Yi Li <liyi@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
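
The arithmetic in this message can be shown with a small worked sketch. The 4 KiB GPU page size and the function name `gpu_entries_for()` are assumptions made for illustration; the CPU page size is passed in as a parameter because it depends on the kernel configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed GPU page size for this sketch (4 KiB). */
#define GPU_PAGE_SIZE 4096u

/* Number of GPU page entries needed to cover num_cpu_pages CPU pages. */
static uint64_t gpu_entries_for(uint64_t num_cpu_pages, uint32_t cpu_page_size)
{
    /* The CPU page size must be a whole multiple of the GPU page size. */
    assert(cpu_page_size >= GPU_PAGE_SIZE);
    assert(cpu_page_size % GPU_PAGE_SIZE == 0);

    uint32_t gpu_pages_in_cpu_page = cpu_page_size / GPU_PAGE_SIZE;
    return num_cpu_pages * gpu_pages_in_cpu_page;
}
```

With 4 KiB CPU pages the factor is 1 and nothing changes; with 16 KiB CPU pages each CPU page spans four GPU pages, so 8 CPU pages need 32 entries rather than 8.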

13 May, 2021 · 4 commits

drm/amdgpu: update vcn1.0 Non-DPG suspend sequence · 5c1efb5f

Committed by Sathishkumar S on May 03, 2021

Update suspend register settings in Non-DPG mode.
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: set vcn mgcg flag for picasso · 3666f83a

Committed by Sathishkumar S on May 03, 2021

Enable the vcn mgcg flag for picasso.
Signed-off-by: Sathishkumar S <sathishkumar.sundararaju@amd.com>
Reviewed-by: Leo Liu <leo.liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: update the method for harvest IP for specific SKU · 5c1a3768

Committed by Likun Gao on May 07, 2021

Update the method of disabling the VCN IP for specific SKUs of navi1x ASICs:
whether the related IP should be added is now decided in
amdgpu_device_ip_block_add().
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: add judgement when add ip blocks (v2) · 83a0b863

Committed by Likun GAO on April 29, 2021

Decide whether to add a SW IP according to the harvest info.

v2: fix indentation (Alex)
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

06 May, 2021 · 2 commits

drm/amdgpu: Use device specific BO size & stride check. · 234055fd

Committed by Bas Nieuwenhuizen on May 04, 2021

The builtin size check isn't really the right thing for AMD
modifiers due to a couple of reasons:

1) In the format structs we don't set any of the tilesize / blocks
etc. to avoid having format arrays per modifier/GPU
2) The pitch on the main plane is pixel_pitch * bytes_per_pixel even
for tiled ...
3) The pitch for the DCC planes is really the pixel pitch of the main
surface that would be covered by it ...

Note that we only handle GFX9+ case but we do this after converting
the implicit modifier to an explicit modifier, so on GFX9+ all
framebuffers should be checked here.

There is a TODO about DCC alignment, but it isn't worse than before
and I'd need to dig a bunch into the specifics. Getting this out in
a reasonable timeframe to make sure it gets the appropriate testing
seemed more important.

Finally as I've found that debugging addfb2 failures is a pita I was
generous adding explicit error messages to every failure case.

Fixes: f258907f ("drm/amdgpu: Verify bo size can fit framebuffer size on init.")
Tested-by: Simon Ser <contact@emersion.fr>
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Init GFX10_ADDR_CONFIG for VCN v3 in DPG mode. · 8bf073ca

Committed by Bas Nieuwenhuizen on May 05, 2021

Otherwise tiling modes that require the values from this field
(in particular _*_X) would be corrupted upon video decode.

Copied from the VCN v2 code.

Fixes: 99541f39 ("drm/amdgpu: add mc resume DPG mode for VCN3.0")
Reviewed-and-Tested by: Leo Liu <leo.liu@amd.com>
Signed-off-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

05 May, 2021 · 1 commit

drm/amdgpu: add new MC firmware for Polaris12 32bit ASIC · c83c4e19

Committed by Evan Quan on April 28, 2021

Polaris12 32bit ASIC needs a special MC firmware.
Signed-off-by: Evan Quan <evan.quan@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

29 April, 2021 · 5 commits

amdgpu: fix GEM obj leak in amdgpu_display_user_framebuffer_create · e0c16eb4

Committed by Simon Ser on April 21, 2021

This error code-path is missing a drm_gem_object_put call. Other
error code-paths are fine.
Signed-off-by: Simon Ser <contact@emersion.fr>
Fixes: 1769152a ("drm/amdgpu: Fail fb creation from imported dma-bufs. (v2)")
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Harry Wentland <hwentlan@amd.com>
Cc: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Cc: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: Register VGA clients after init can no longer fail · 8c3dd61c

Committed by Kai-Heng Feng on April 26, 2021

When an amdgpu device fails to initialize, another VGA device triggers
a kernel splat:
kernel: amdgpu 0000:08:00.0: amdgpu: amdgpu_device_ip_init failed
kernel: amdgpu 0000:08:00.0: amdgpu: Fatal error during GPU init
kernel: amdgpu: probe of 0000:08:00.0 failed with error -110
...
kernel: amdgpu 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
kernel: BUG: kernel NULL pointer dereference, address: 0000000000000018
kernel: #PF: supervisor read access in kernel mode
kernel: #PF: error_code(0x0000) - not-present page
kernel: PGD 0 P4D 0
kernel: Oops: 0000 [#1] SMP NOPTI
kernel: CPU: 6 PID: 1080 Comm: Xorg Tainted: G W 5.12.0-rc8+ #12
kernel: Hardware name: HP HP EliteDesk 805 G6/872B, BIOS S09 Ver. 02.02.00 12/30/2020
kernel: RIP: 0010:amdgpu_device_vga_set_decode+0x13/0x30 [amdgpu]
kernel: Code: 06 31 c0 c3 b8 ea ff ff ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 55 48 8b 87 90 06 00 00 48 89 e5 53 89 f3 <48> 8b 40 18 40 0f b6 f6 e8 40 58 39 fd 80 fb 01 5b 5d 19 c0 83 e0
kernel: RSP: 0018:ffffae3c0246bd68 EFLAGS: 00010002
kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
kernel: RDX: ffff8dd1af5a8560 RSI: 0000000000000000 RDI: ffff8dce8c160000
kernel: RBP: ffffae3c0246bd70 R08: ffff8dd1af5985c0 R09: ffffae3c0246ba38
kernel: R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000246
kernel: R13: 0000000000000000 R14: 0000000000000003 R15: ffff8dce81490000
kernel: FS: 00007f9303d8fa40(0000) GS:ffff8dd1af580000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 0000000000000018 CR3: 0000000103cfa000 CR4: 0000000000350ee0
kernel: Call Trace:
kernel: vga_arbiter_notify_clients.part.0+0x4a/0x80
kernel: vga_get+0x17f/0x1c0
kernel: vga_arb_write+0x121/0x6a0
kernel: ? apparmor_file_permission+0x1c/0x20
kernel: ? security_file_permission+0x30/0x180
kernel: vfs_write+0xca/0x280
kernel: ksys_write+0x67/0xe0
kernel: __x64_sys_write+0x1a/0x20
kernel: do_syscall_64+0x38/0x90
kernel: entry_SYSCALL_64_after_hwframe+0x44/0xae
kernel: RIP: 0033:0x7f93041e02f7
kernel: Code: 75 05 48 83 c4 58 c3 e8 f7 33 ff ff 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
kernel: RSP: 002b:00007fff60e49b28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
kernel: RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f93041e02f7
kernel: RDX: 000000000000000b RSI: 00007fff60e49b40 RDI: 000000000000000f
kernel: RBP: 00007fff60e49b40 R08: 00000000ffffffff R09: 00007fff60e499d0
kernel: R10: 00007f93049350b5 R11: 0000000000000246 R12: 000056111d45e808
kernel: R13: 0000000000000000 R14: 000056111d45e7f8 R15: 000056111d46c980
kernel: Modules linked in: nls_iso8859_1 snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_seq input_leds snd_seq_device snd_timer snd soundcore joydev kvm_amd serio_raw k10temp mac_hid hp_wmi ccp kvm sparse_keymap wmi_bmof ucsi_acpi efi_pstore typec_ucsi rapl typec video wmi sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear dm_mirror dm_region_hash dm_log hid_generic usbhid hid amdgpu drm_ttm_helper ttm iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops ghash_clmulni_intel cec rc_core aesni_intel crypto_simd psmouse cryptd r8169 i2c_piix4 drm ahci xhci_pci realtek libahci xhci_pci_renesas gpio_amdpt gpio_generic
kernel: CR2: 0000000000000018
kernel: ---[ end trace 76d04313d4214c51 ]---

Commit 4192f7b5 ("drm/amdgpu: unmap register bar on device init
failure") makes amdgpu_driver_unload_kms() skip amdgpu_device_fini(),
so the VGA clients remain registered. When
vga_arbiter_notify_clients() then iterates over the registered clients,
it causes a NULL pointer dereference.

Since there's no reason to register VGA clients that early, solve
the issue by registering them after all the goto cleanups.

v2:
- Remove redundant vga_switcheroo cleanup in failed: label.

Fixes: 4192f7b5 ("drm/amdgpu: unmap register bar on device init failure")
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Handling of amdgpu_device_resume return value for graceful teardown · b45aeb2d

Committed by Pavan Kumar Ramayanam on April 27, 2021

The runtime resume PM op disregards the return value from
amdgpu_device_resume(), masking errors for failed resumes at the PM
layer.
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Pavan Kumar Ramayanam <pavan.ramayanam@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix r initial values · 4b12ee6f

Committed by Victor Zhao on April 27, 2021

Under SR-IOV, suspend of IP block <dce_virtual> failed because the
return value was not initialized.

v2: return 0 directly to align with the original code semantics before
this was broken out into a separate helper function, instead of setting
initial values
Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: fix concurrent VM flushes on Vega/Navi v2 · 20a5f5a9

Committed by Christian König on April 22, 2021

Starting with Vega the hardware supports concurrent flushes
of VMID which can be used to implement per process VMID
allocation.

But concurrent flushes are mutually exclusive with back to
back VMID allocations. Fix this to avoid a VMID being used in
two ways at the same time.

v2: don't set ring to NULL
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: James Zhu <James.Zhu@amd.com>
Tested-by: James Zhu <James.Zhu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

21 April, 2021 · 14 commits

drm/amdgpu: fix GCR_GENERAL_CNTL offset for dimgrey_cavefish · 24d03452

Committed by Jiansong Chen on April 19, 2021

dimgrey_cavefish has a gc_10_3 IP similar to that of sienna_cichlid,
so follow its register offset setting.
Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

drm/amdgpu: reserve fence slot to update page table · d42a5b63

Committed by Philip Yang on April 01, 2021

A fence slot was not reserved before using SDMA to update the page table,
which causes the kernel BUG backtrace below when handling a VM retry fault
while the application is exiting.

[  133.048143] kernel BUG at /home/yangp/git/compute_staging/kernel/drivers/dma-buf/dma-resv.c:281!
[  133.048487] Workqueue: events amdgpu_irq_handle_ih1 [amdgpu]
[  133.048506] RIP: 0010:dma_resv_add_shared_fence+0x204/0x280
[  133.048672]  amdgpu_vm_sdma_commit+0x134/0x220 [amdgpu]
[  133.048788]  amdgpu_vm_bo_update_range+0x220/0x250 [amdgpu]
[  133.048905]  amdgpu_vm_handle_fault+0x202/0x370 [amdgpu]
[  133.049031]  gmc_v9_0_process_interrupt+0x1ab/0x310 [amdgpu]
[  133.049165]  ? kgd2kfd_interrupt+0x9a/0x180 [amdgpu]
[  133.049289]  ? amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[  133.049408]  amdgpu_irq_dispatch+0xb6/0x240 [amdgpu]
[  133.049534]  amdgpu_ih_process+0x9b/0x1c0 [amdgpu]
[  133.049657]  amdgpu_irq_handle_ih1+0x21/0x60 [amdgpu]
[  133.049669]  process_one_work+0x29f/0x640
[  133.049678]  worker_thread+0x39/0x3f0
[  133.049685]  ? process_one_work+0x640/0x640
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 5.11.x

drm/amdgpu/gmc9: remove dummy read workaround for newer chips · 7845d80d

Committed by Alex Deucher on April 16, 2021

Aldebaran has a hw fix so no longer requires the workaround.
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Add mem sync flag for IB allocated by SA · 5c88e3b8

Committed by Jinzhou Su on April 20, 2021

The SA BO buffer is used in many cases, so it is better
to invalidate the cache of an indirect buffer allocated by SA before
committing the IB.
Signed-off-by: Jinzhou Su <Jinzhou.Su@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Fix SDMA RAS error reporting on Aldebaran · ceb47e0d

Committed by Mukul Joshi on March 24, 2021

Fix the following issues with SDMA RAS error reporting:
1. Read the EDC_COUNTER2 register also to fetch error counts
   for all sub-blocks in SDMA.
2. SDMA RAS on Aldebaran supports single-bit uncorrectable errors
   only. So, report error count in UE count instead of CE count.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Reset RAS error count and status regs · 1f0d8e37

Committed by Mukul Joshi on March 24, 2021

Reset the RAS error count and error status registers after
reading to prevent over-reporting error counts on Aldebaran.
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-By: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Revert "drm/amdgpu: workaround the TMR MC address issue (v2)" · 5f41741a

Committed by Oak Zeng on March 11, 2021

This reverts commit 2f055097.
2f055097 was a driver workaround
when PSP firmware was not ready. Now the PSP fw is ready so we
revert this driver workaround.
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix GCR_GENERAL_CNTL offset for dimgrey_cavefish · 7c49ee9e

Committed by Jiansong Chen on April 19, 2021

dimgrey_cavefish has a gc_10_3 IP similar to that of sienna_cichlid,
so follow its register offset setting.
Signed-off-by: Jiansong Chen <Jiansong.Chen@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: resolve erroneous gfx_v9_4_2 prints · f9727922

Committed by John Clements on April 19, 2021

Resolve a bug on aldebaran where gfx error counts are
printed on driver load even when no errors are present.
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: fix a error injection failed issue · 6df23f4c

Committed by Dennis Li on April 16, 2021

Because sscanf(str, "retire_page") always returns 0, an application that
uses raw data for error injection always wrongly falls into "op ==
3". Use strstr instead.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
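
The sscanf() behaviour described here is standard C: the return value counts assigned conversions, so a format string with no conversion specifier yields 0 for any non-empty input, whether or not the literal text matched. The sketch below contrasts that with a strstr() check. The op codes and the "inject" command are hypothetical placeholders, not the driver's actual parser.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical op codes for this sketch; the driver's values may differ. */
enum { OP_UNKNOWN = -1, OP_INJECT = 2, OP_RETIRE_PAGE = 3 };

/* Buggy variant: "retire_page" has no conversion specifier, so sscanf()
 * assigns nothing and returns 0 for any non-empty input string. */
static int parse_op_buggy(const char *str)
{
    char block[33];

    assert(str != NULL);
    if (sscanf(str, "inject %32s", block) == 1)
        return OP_INJECT;
    if (sscanf(str, "retire_page") == 0)
        return OP_RETIRE_PAGE;
    return OP_UNKNOWN;
}

/* Fixed variant: test for the keyword with strstr() instead. */
static int parse_op_fixed(const char *str)
{
    char block[33];

    assert(str != NULL);
    if (sscanf(str, "inject %32s", block) == 1)
        return OP_INJECT;
    if (strstr(str, "retire_page") != NULL)
        return OP_RETIRE_PAGE;
    return OP_UNKNOWN;
}
```

Input that matches no command, such as raw data, is classified as a page retirement by the buggy variant and as unknown by the fixed one.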

drm/amdgpu: only harvest gcea/mmea error status in aldebaran · 1f8d3ad2

Committed by Hawking Zhang on April 16, 2021

On aldebaran, the driver only needs to harvest SDP
RdRspStatus, WrRspStatus and the first parity error
on RdRsp data. Check the error type before harvesting
error information.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: only harvest gcea/mmea error status in arcturus · 53ee6609

Committed by Hawking Zhang on April 16, 2021

SDP RdRspStatus/WrRspStatus or the first parity error on
RdRsp data can cause a fatal system error on arcturus.
The GPU freezes in such a case.

The driver needs to harvest this error information before
resetting the GPU. Check the error type to avoid harvesting normal
gcea/mmea information.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Stanley Yang <Stanley.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: enable tmz on renoir asics · 9406d39b

Committed by Huang Rui on April 14, 2021

The tmz functions are verified on renoir chips as well. So enable it by
default.
Signed-off-by: Huang Rui <ray.huang@amd.com>
Tested-by: Lang Yu <Lang.Yu@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: correct default gfx wdt timeout setting · 28a5d7a5

Committed by Hawking Zhang on April 16, 2021

When gfx wdt is configured to fatal_disable, the
timeout period should be configured to 0x0 (timeout
disabled).
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16 April, 2021 · 5 commits

drm/amdgpu: fix an error code in init_pmu_entry_by_type_and_add() · 90cb3d8a

Committed by Dan Carpenter on April 14, 2021

If the kmemdup() fails then this should return a negative error code
but it currently returns success.

Fixes: b4a7db71 ("drm/amdgpu: add per device user friendly xgmi events for vega20")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
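
The bug class is a shared error path that returns a status variable nobody set. Below is a user-space sketch under assumptions: `dup_mem()` is a malloc-based stand-in for kmemdup(), its `simulate_failure` argument is only a test hook, and `init_entry()` is a hypothetical function rather than the driver's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* User-space stand-in for kmemdup(); simulate_failure is a test hook only. */
static void *dup_mem(const void *src, size_t len, int simulate_failure)
{
    void *p = simulate_failure ? NULL : malloc(len);

    if (p)
        memcpy(p, src, len);
    return p;
}

/* Duplicate name into *out; returns 0 on success or a negative errno value. */
static int init_entry(const char *name, char **out, int simulate_failure)
{
    int ret = 0;

    assert(name != NULL && out != NULL);
    *out = dup_mem(name, strlen(name) + 1, simulate_failure);
    if (!*out) {
        /* Without this assignment ret would still be 0 and the caller
         * would see success even though nothing was allocated. */
        ret = -ENOMEM;
        goto err;
    }
    return 0;

err:
    /* Cleanup of earlier allocations would go here. */
    return ret;
}
```

A caller that checks `init_entry(...) < 0` now sees the allocation failure instead of going on to use a NULL pointer.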

Revert "Revert "drm/amdgpu: Ensure that the modifier requested is supported by plane."" · fe180178

Committed by Qingqing Zhuo on April 14, 2021

This reverts commit 55fa622f.

The regression caused by the original patch has been
resolved, so reintroduce the change.
Signed-off-by: Qingqing Zhuo <qingqing.zhuo@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Copy MEC FW version to MEC2 if we skipped loading MEC2 · 47e5d79a

Committed by Joseph Greathouse on April 15, 2021

If we skipped loading MEC2 firmware separately
from MEC, then MEC2 will be running the same
firmware image. Copy the MEC version and feature
numbers into MEC2 version and feature numbers.
This is needed for things like GWS support, where
we rely on knowing what version of firmware is
running on MEC2. Leaving these MEC2 entries blank
breaks our ability to version-check enables and
workarounds.
Signed-off-by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdkfd: Remove legacy code not acquiring VMs · f45e6b9d

Committed by Felix Kuehling on April 07, 2021

ROCm user mode has acquired VMs from DRM file descriptors for as long
as it supported the upstream KFD. Legacy code to support older versions
of ROCm is not needed any more.
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

drm/amdgpu: Use iterator methods exposed by amdgpu_res_cursor.h in building... · ba5b662c

Committed by Ramesh Errabolu on April 12, 2021

drm/amdgpu: Use iterator methods exposed by amdgpu_res_cursor.h in building SG_TABLE's for a VRAM BO

Extend current implementation of SG_TABLE construction method to
allow exportation of sub-buffers of a VRAM BO. This capability will
enable logical partitioning of a VRAM BO into multiple non-overlapping
sub-buffers. One example of this use case is to partition a VRAM BO
into two sub-buffers, one for SRC and another for DST.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
