- 25 Aug, 2020 1 commit
-
-
Committed by Dennis Li
If other threads already hold the reset lock, recovery fails at try_lock. Therefore, introduce the atomic hive->in_reset and adev->in_gpu_reset to avoid re-entering GPU recovery. v2: drop "? true : false" in the definition of amdgpu_in_reset. Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
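The re-entry guard described in this message can be sketched in userspace C11 as follows. This is only an illustration of the idea, not the driver's code: `struct fake_device` and the `fake_*` helpers are hypothetical names, and the kernel uses its own atomic API rather than `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the device structure; only the flag matters. */
struct fake_device {
    atomic_int in_gpu_reset; /* 0 = idle, 1 = recovery in progress */
};

/*
 * Try to become the single thread that runs recovery.
 * Returns true if the caller won the race, false if recovery is
 * already running and the caller should back off.
 */
static bool fake_begin_recovery(struct fake_device *dev)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&dev->in_gpu_reset, &expected, 1);
}

/* Mark recovery as finished so that a later hang can be handled. */
static void fake_end_recovery(struct fake_device *dev)
{
    atomic_store(&dev->in_gpu_reset, 0);
}

/* Wrapper in the spirit of amdgpu_in_reset: no "? true : false" needed. */
static bool fake_in_reset(struct fake_device *dev)
{
    return atomic_load(&dev->in_gpu_reset) != 0;
}
```

Unlike a try_lock on a lock that other code paths also take, the compare-and-exchange only fails when another recovery is genuinely in flight.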
-
- 19 Aug, 2020 2 commits
-
-
Committed by Alex Deucher
We can get this on RENOIR and newer via the SMU metrics table. Reviewed-by: Evan Quan <evan.quan@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
FAMILY_KV consists of APUs, and we already check for APUs. Reviewed-by: Evan Quan <evan.quan@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 15 Aug, 2020 3 commits
-
-
Committed by Evan Quan
The goal is to provide a clear entry point (for power routines). This also helps maintain a clear view of the frameworks used on different ASICs. Hopefully all of this makes the power code friendlier to work with. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Evan Quan
As with other power interfaces. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Christian König
The whole approach wasn't thought through to the end. We already had a reset lock like this in the past and it caused the same problems as this one. Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary. This reverts commit df9c8d1a. Signed-off-by: Christian König <christian.koenig@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 07 Aug, 2020 1 commit
-
-
Committed by Evan Quan
A new interface for UMD to retrieve GPU metrics data. V2: enrich the documentation. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 31 Jul, 2020 2 commits
-
-
Committed by Alex Deucher
This regressed some working configurations, so revert it. Will fix this properly for 5.9 and backport then. This reverts commit 38e0c89a. Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
-
Committed by Huang Rui
The PPTable descriptor is not exposed on APU platforms, so max/min temperature values cannot be retrieved there. v2: Stoney needs to skip the crit temperature as well. Signed-off-by: Huang Rui <ray.huang@amd.com> Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 28 Jul, 2020 2 commits
-
-
Committed by Evan Quan
The current output of the amdgpu_pm_info debugfs file starts with the clock gating status, followed by the current clock/power information. However, retrieving the clock gating status may pull GFX out of the CG state, which makes the clock/power information retrieved afterwards inaccurate. To overcome this with minimum impact, the output is updated to show the current clock/power information first. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Dennis Li
When the GPU hangs, the driver has multiple paths into amdgpu_device_gpu_recover; the atomic adev->in_gpu_reset and hive->in_reset are used to avoid re-entering GPU recovery. During GPU reset and resume, it is unsafe for other threads to access the GPU, which may cause the GPU reset to fail. Therefore the new rw_semaphore adev->reset_sem is introduced, which protects the GPU from being accessed by external threads during recovery.
v2: 1. add rwlock for some ioctls, debugfs and the file-close function. 2. change to use dqm->is_resetting and dqm_lock for protection in the kfd driver. 3. remove try_lock and change adev->in_gpu_reset to atomic, to avoid re-entering GPU recovery for the same GPU hang.
v3: 1. change back to using adev->reset_sem to protect the kfd callback functions, because dqm_lock couldn't protect all of the code; for example, free_mqd must be called outside of dqm_lock:
[ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
[ 1230.177221] Call Trace:
[ 1230.178249] dump_stack+0x98/0xd5
[ 1230.179443] amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
[ 1230.180673] gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
[ 1230.181882] amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
[ 1230.183098] amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
[ 1230.184239] ? ttm_bo_put+0x171/0x5f0 [ttm]
[ 1230.185394] ttm_tt_unbind+0x21/0x40 [ttm]
[ 1230.186558] ttm_tt_destroy.part.12+0x12/0x60 [ttm]
[ 1230.187707] ttm_tt_destroy+0x13/0x20 [ttm]
[ 1230.188832] ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
[ 1230.189979] ttm_bo_put+0x1be/0x5f0 [ttm]
[ 1230.191230] amdgpu_bo_unref+0x1e/0x30 [amdgpu]
[ 1230.192522] amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
[ 1230.193833] free_mqd+0x25/0x40 [amdgpu]
[ 1230.195143] destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
[ 1230.196475] pqm_destroy_queue+0x105/0x260 [amdgpu]
[ 1230.197819] kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
[ 1230.199154] kfd_ioctl+0x277/0x500 [amdgpu]
[ 1230.200458] ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
[ 1230.201656] ? tomoyo_file_ioctl+0x19/0x20
[ 1230.202831] ksys_ioctl+0x98/0xb0
[ 1230.204004] __x64_sys_ioctl+0x1a/0x20
[ 1230.205174] do_syscall_64+0x5f/0x250
[ 1230.206339] entry_SYSCALL_64_after_hwframe+0x49/0xbe
2. remove try_lock and introduce atomic hive->in_reset, to avoid re-entering GPU recovery.
v4: 1. remove an unnecessary whitespace change in kfd_chardev.c 2. remove commented-out code in amdgpu_device.c 3. add a more detailed comment in the commit message 4. define a wrapper function amdgpu_in_reset
v5: 1. Fix some style issues.
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com> Suggested-by: Christian König <christian.koenig@amd.com> Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com> Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com> Suggested-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Dennis Li <Dennis.Li@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 Jul, 2020 1 commit
-
-
Committed by Alex Deucher
We expose the actual memory controller clock rate in Linux, not the effective memory clock of the DRAMs. To translate between them, use the following formulas:
Clock conversion (MHz):
HBM: effective_memory_clock = memory_controller_clock * 1
G5: effective_memory_clock = memory_controller_clock * 1
G6: effective_memory_clock = memory_controller_clock * 2
DRAM data rate (MT/s):
HBM: effective_memory_clock * 2 = data_rate
G5: effective_memory_clock * 4 = data_rate
G6: effective_memory_clock * 8 = data_rate
Bandwidth (MB/s):
data_rate * vram_bit_width / 8 = memory_bandwidth
Some examples:
G5 on RX460:
memory_controller_clock = 1750 MHz
effective_memory_clock = 1750 MHz * 1 = 1750 MHz
data rate = 1750 * 4 = 7000 MT/s
memory_bandwidth = 7000 * 128 bits / 8 = 112000 MB/s
G6 on RX5600:
memory_controller_clock = 900 MHz
effective_memory_clock = 900 MHz * 2 = 1800 MHz
data rate = 1800 * 8 = 14400 MT/s
memory_bandwidth = 14400 * 192 bits / 8 = 345600 MB/s
Acked-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
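The conversion in this message can be written out as a small C helper. This is a sketch for checking the arithmetic only; the enum and function names are made up for illustration and are not driver symbols.

```c
#include <stdint.h>

/* Hypothetical memory type tags, matching the HBM/G5/G6 rows above. */
enum vram_type { VRAM_HBM, VRAM_GDDR5, VRAM_GDDR6 };

/* Effective memory clock in MHz: only G6 doubles the controller clock. */
static uint32_t effective_memory_clock_mhz(enum vram_type type,
                                           uint32_t mc_clock_mhz)
{
    return mc_clock_mhz * (type == VRAM_GDDR6 ? 2u : 1u);
}

/* DRAM data rate in MT/s: x2 for HBM, x4 for G5, x8 for G6. */
static uint32_t data_rate_mts(enum vram_type type, uint32_t mc_clock_mhz)
{
    uint32_t eff = effective_memory_clock_mhz(type, mc_clock_mhz);

    switch (type) {
    case VRAM_HBM:
        return eff * 2u;
    case VRAM_GDDR5:
        return eff * 4u;
    case VRAM_GDDR6:
        return eff * 8u;
    }
    return 0;
}

/* Bandwidth in MB/s: data_rate * vram_bit_width / 8. */
static uint64_t memory_bandwidth_mbs(enum vram_type type,
                                     uint32_t mc_clock_mhz,
                                     uint32_t vram_bit_width)
{
    return (uint64_t)data_rate_mts(type, mc_clock_mhz) * vram_bit_width / 8u;
}
```

With the two examples from the message, `memory_bandwidth_mbs(VRAM_GDDR5, 1750, 128)` gives 112000 and `memory_bandwidth_mbs(VRAM_GDDR6, 900, 192)` gives 345600.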
-
- 22 Jul, 2020 1 commit
-
-
Committed by Paweł Gronowski
A NULL dereference occurs when a string that does not end with a space or newline is written to some dpm sysfs interfaces (for example pp_dpm_sclk). This happens because strsep replaces tmp with NULL if the delimiter is not present in the string, and tmp is then dereferenced by tmp[0]. Reproduction example: sudo sh -c 'echo -n 1 > /sys/class/drm/card0/device/pp_dpm_sclk' Signed-off-by: Paweł Gronowski <me@woland.xyz> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
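The strsep behaviour behind this bug can be reproduced in userspace. The sketch below uses the C library's `strsep` (glibc/BSD), not the kernel's, and the helper names are hypothetical. It shows that the remainder pointer becomes NULL when no delimiter is found, and one way to loop without dereferencing it.

```c
#define _DEFAULT_SOURCE /* exposes strsep() on glibc */
#include <stddef.h>
#include <string.h>

/*
 * Returns 1 if strsep() left the remainder pointer NULL after taking
 * the first token, which is the state the buggy code dereferenced.
 */
static int remainder_is_null_after_first_token(char *buf)
{
    char *rest = buf;

    (void)strsep(&rest, " \n");
    return rest == NULL;
}

/*
 * Count non-empty tokens. The loop relies only on strsep()'s return
 * value, so a NULL remainder is never dereferenced.
 */
static int count_levels(char *buf)
{
    char *rest = buf;
    char *tok;
    int n = 0;

    while ((tok = strsep(&rest, " \n")) != NULL) {
        if (*tok != '\0')
            n++;
    }
    return n;
}
```

Input written with `echo -n 1` corresponds to the buffer "1" with no trailing newline, which is the case where the remainder is NULL.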
-
- 16 Jul, 2020 1 commit
-
-
Committed by Evan Quan
Leftover from the previous performance level setting cleanups. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 08 Jul, 2020 1 commit
-
-
Committed by Alex Jivin
Move the mutex lock/unlock outside of the if(), as the mutex is always taken: either in the if() branch or in the else branch. Signed-off-by: Alex Jivin <alex.jivin@amd.com> Suggested-By: Luben Tuikov <luben.tuikov@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 03 Jul, 2020 2 commits
-
-
Committed by Alex Deucher
Large clock values may overflow and show up as negative. Reported by prOMiNd on IRC. Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Jivin
Port functionality from the Radeon driver to support UVD and VCE power management. Signed-off-by: Alex Jivin <alex.jivin@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 01 Jul, 2020 6 commits
-
-
Committed by Wenhui Sheng
On the Navi12 platform, the power_dpm_force_performance_level node doesn't work correctly in one-VF mode, because at least three SMU messages are not supported: SMU_MSG_SetSoftMaxByFreq, SMU_MSG_SetSoftMinByFreq and SMU_MSG_TransferTableDram2Smu. Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Colin Ian King
The variable ret is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
The call to pm_runtime_get_sync increments the counter even in case of failure, leading to an incorrect ref count. In case of failure, decrement the ref count before returning. Reviewed-by: Evan Quan <evan.quan@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
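The reference-count imbalance described here can be illustrated with a userspace mock. Everything below is hypothetical: `struct fake_dev`, `fake_pm_runtime_get_sync` and `fake_pm_runtime_put` only imitate the counting behaviour described in the message and are not the kernel API. The sketch shows the shape of the fix, which is to drop the reference on the failure path too.

```c
/* Hypothetical device with a usage counter and a switch to force failure. */
struct fake_dev {
    int usage_count;
    int fail_resume;
};

/* Mock: like the real call, it bumps the counter even when it fails. */
static int fake_pm_runtime_get_sync(struct fake_dev *d)
{
    d->usage_count++;
    return d->fail_resume ? -5 : 0; /* -5 stands in for an errno value */
}

/* Mock: drops one reference. */
static void fake_pm_runtime_put(struct fake_dev *d)
{
    d->usage_count--;
}

/* A sysfs-style handler that keeps the counter balanced on every path. */
static int fake_sysfs_read(struct fake_dev *d)
{
    int ret = fake_pm_runtime_get_sync(d);

    if (ret < 0) {
        /* The failed get still took a reference, so release it. */
        fake_pm_runtime_put(d);
        return ret;
    }

    /* ... hardware access would happen here ... */

    fake_pm_runtime_put(d);
    return 0;
}
```

Without the put in the error branch, each failed call would leave `usage_count` one higher than before.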
-
Committed by Alex Deucher
Rename the gpu busy percentage for consistency and add the mem busy percentage documentation. Reviewed-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Vega10 and previous ASICs use one interface; Vega20 and newer use another. Reviewed-by: Evan Quan <evan.quan@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Evan Quan
Drop unused APIs, variables and arguments. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 18 Jun, 2020 2 commits
-
-
Committed by Alex Deucher
Rename the gpu busy percentage for consistency and add the mem busy percentage documentation. Reviewed-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Vega10 and previous ASICs use one interface; Vega20 and newer use another. Reviewed-by: Evan Quan <evan.quan@amd.com> Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 03 Jun, 2020 1 commit
-
-
Committed by Kent Russell
Add support for unique_id and serial_number, as these are now the same value, and will be for future ASICs as well. v2: Explicitly create unique_id only for VG10/20/ARC. v3: Change set_unique_id to get_unique_id for clarity. Signed-off-by: Kent Russell <kent.russell@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 30 May, 2020 3 commits
-
-
Committed by Evan Quan
Users can check and set whether throttling logging is enabled and the interval between log entries. V2: simplify the sysfs interface (no string parsing). V3: add proper lock protection when updating throttling_logging_rs.interval. V4: cosmetic documentation changes per Luben's suggestion. Signed-off-by: Evan Quan <evan.quan@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Return an error for sysfs and debugfs power interfaces during GPU reset and suspend. This prevents access to the hardware while it may be in an unusable state. v2: squash in fix to drop suspend check. Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Return an error for sysfs and debugfs power interfaces during GPU reset and suspend. This prevents access to the hardware while it may be in an unusable state. v2: squash in fix to drop suspend check. Acked-by: Nirmoy Das <nirmoy.das@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 27 May, 2020 1 commit
-
-
Committed by Kevin Wang
The original design uses the variable "attr->states" to save the node states supported on the current GPU device. With multiple GPU devices, however, when the second GPU is probed, the driver checks the attribute node states left by the previous GPU to decide whether to create an attribute node. This causes attribute node creation to fail on the other GPU devices. 1. add member attr_list to amdgpu_device to link the supported device attribute nodes. 2. add new structure "struct amdgpu_device_attr_entry{}" to track device attribute state. 3. drop member "states" from amdgpu_device_attr. v2: 1. move "attr_list" into amdgpu_pm and rename it to "pm_attr_list". 2. refine the parameters of the create and remove device node functions. fix: drm/amdgpu: optimize amdgpu device attribute code Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 23 May, 2020 3 commits
-
-
Committed by Alex Deucher
Add some APU flags to simplify handling of different APU variants. It's easier to understand the special cases if we use named flags rather than checking device ids and silicon revisions. v2: rebase on latest code. Acked-by: Evan Quan <evan.quan@amd.com> Acked-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by chen gong
[Problem description]
1. Boot up the Picasso platform, launch the desktop, and don't do anything (the APU enters the "gfxoff" state).
2. Log in to the platform remotely over SSH, then run:
sudo su -c "echo manual > /sys/class/drm/card0/device/power_dpm_force_performance_level"
sudo su -c "echo 2 > /sys/class/drm/card0/device/pp_dpm_sclk" (fix SCLK to 1400MHz)
3. Move the mouse around in a window.
4. Phenomenon: the screen freezes.
The tester switches the sclk level during a glmark2 run. The APU enters the "gfxoff" state intermittently during the glmark2 run. The system hangs if GFXCLK is fixed to 1400MHz while the APU is in the "gfxoff" state.
[Debug]
1. Fix SCLK to X MHz:
1400: screen frozen, screen black, then OS will reboot.
1300: screen frozen.
1200: screen frozen, screen black.
1100: screen frozen, screen black, then OS will reboot.
1000: screen frozen, screen black.
900: screen frozen, screen black, then OS will reboot.
800: situation normal, issue disappears.
700: situation normal, issue disappears.
2. SBIOS setting: AMD CBS --> SMU Debug Options --> SMU Debug --> "GFX DLDO Psm Margin Control":
50: situation normal, issue disappears.
45: situation normal, issue disappears.
40: situation normal, issue disappears.
35: situation normal, issue disappears.
30: screen black.
25: screen frozen, then blurred screen.
20: screen frozen.
15: screen black.
10: screen frozen.
5: screen frozen, then blurred screen.
3. Disable the GFXOFF feature: situation normal, issue disappears.
[Why]
After a period of debugging with the Sys Eng team and the SMU team, the Sys Eng team said this is a voltage/frequency marginal issue, not a F/W or H/W bug. This experiment proves that the default targetPsm [for f=1400MHz] is not sufficient when GFXOFF is enabled on Picasso. The SMU team thinks it is an odd test condition to force sclk="1400MHz" while the GPU is in the "gfxoff" state and then wake up GFX. SCLK should be at the "lowest frequency" when in gfxoff.
[How]
Disable gfxoff when setting manual mode. Enable gfxoff again when setting another mode (exiting manual mode). Also, from the user's point of view, once the user switches to manual mode and forces the SCLK frequency, they don't want SCLK to be controlled by the workload. Switching to manual mode becomes meaningless if the APU enters "gfxoff" due to lack of workload at this point.
Tips: The same issue is observed on Raven.
Signed-off-by: chen gong <curry.gong@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Alex Deucher
Fix typos that prevented them from showing up. v2: switch other files in addition to pp_clk_voltage. Fixes: 4e01847c ("drm/amdgpu: optimize amdgpu device attribute code") Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1150 Signed-off-by: Alex Deucher <alexander.deucher@amd.com> Acked-by: Evan Quan <evan.quan@amd.com>
-
- 22 May, 2020 3 commits
-
-
Committed by Alex Deucher
1. Initialize the counters to 0 in case the callback fails to initialize them. 2. The counters don't exist on APUs so return an error for them. 3. Return an error if the callback doesn't exist. Reviewed-by: Yong Zhao <Yong.Zhao@amd.com> Reviewed-By: Kent Russell <kent.russell@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by Dan Carpenter
This loop in the error handling code should start at "i - 1" and end at "i == 0". Currently it starts at "i" and ends at "i == 1". The result is that it removes one attribute that wasn't created yet, and leaks the zeroth attribute. Fixes: 4e01847c ("drm/amdgpu: optimize amdgpu device attribute code") Acked-by: Michael J. Ruhl <michael.j.ruhl@intel.com> Reviewed-by: Christian König <christian.koenig@amd.com> Reviewed-by: Kevin Wang <kevin1.wang@amd.com> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
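The unwind pattern described here can be sketched in plain C. The `fake_attr` type and helpers are hypothetical stand-ins for the attribute create/remove calls; the point is only the loop bounds on the error path, which must cover indices i - 1 down to 0.

```c
/* Hypothetical attribute with a flag recording whether it exists. */
struct fake_attr {
    int created;
};

/* Mock create: fails on request, otherwise marks the attribute created. */
static int fake_create(struct fake_attr *a, int fail)
{
    if (fail)
        return -1;
    a->created = 1;
    return 0;
}

/* Mock remove: marks the attribute as gone. */
static void fake_remove(struct fake_attr *a)
{
    a->created = 0;
}

/*
 * Create n attributes; fail_at selects the index whose creation fails
 * (use -1 for no failure). On failure, undo exactly the attributes
 * that were created, i.e. indices i - 1 down to 0.
 */
static int create_all(struct fake_attr *attrs, int n, int fail_at)
{
    int i;

    for (i = 0; i < n; i++) {
        if (fake_create(&attrs[i], i == fail_at))
            goto failed;
    }
    return 0;

failed:
    /* "while (i--)" visits i - 1, ..., 0 and never touches attrs[i]. */
    while (i--)
        fake_remove(&attrs[i]);
    return -1;
}
```

A loop written as `for (; i > 0; i--) fake_remove(&attrs[i]);` would show the bug from the message: it touches the attribute that failed to be created and never reaches index 0.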
-
Committed by Kevin Wang
The amdgpu device attribute nodes are created according to the SR-IOV VF mode at runtime, so clean up the unnecessary SR-IOV checks in the attribute operation functions. Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 18 May, 2020 1 commit
-
-
Committed by Kevin Wang
Unify the amdgpu device attribute node functions: 1. add some helper functions to create amdgpu device attribute nodes. 2. create device nodes according to the device attr flags in different VF modes. 3. rename some functions to fit the new interface. v2: 1. remove the ATTR_STATE_DEAD, ATTR_STATE_ALIVE enum. 2. rename the callback function perform to attr_update. 3. modify some variable names. Signed-off-by: Kevin Wang <kevin1.wang@amd.com> Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 24 Apr, 2020 2 commits
-
-
Committed by Monk Liu
Signed-off-by: Monk Liu <Monk.Liu@amd.com> Acked-by: Yintian Tao <yttao@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
Committed by limingyu
For chips like CHIP_OLAND with SI support enabled (amdgpu.si_support=1), amdgpu exposes pp_num_states in the /sys directory. Reading the pp_num_states file at this point executes the amdgpu_get_pp_num_states function. In our case the data hasn't been initialized, so the kernel accesses an illegal address, triggers a segmentation fault, and the system reboots soon afterwards: uos@uos-PC:~$ cat /sys/devices/pci0000\:00/0000\:00\:00.0/0000\:01\:00.0/pp_num_states Message from syslogd@uos-PC at Apr 22 09:26:20 ... kernel:[ 82.154129] Internal error: Oops: 96000004 [#1] SMP This patch fixes the problem so that reading the file no longer triggers the kernel segmentation fault. Signed-off-by: limingyu <limingyu@uniontech.com> Signed-off-by: zhoubinbin <zhoubinbin@uniontech.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 09 Apr, 2020 1 commit
-
-
Committed by Aaron Ma
On ARCTURUS and RENOIR, powerplay is not supported yet. When the power jack is plugged in or unplugged, an ACPI event is issued, which then triggers a kernel NULL pointer BUG. Check for NULL pointers before calling. Signed-off-by: Aaron Ma <aaron.ma@canonical.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-