- 15 12月, 2021 1 次提交
-
-
由 Victor Skvortsov 提交于
Host initiated VF FLR may fail if someone else is already holding a read_lock. Change from down_write_trylock to down_write to guarantee the reset goes through. Signed-off-by: NVictor Skvortsov <victor.skvortsov@amd.com> Reviewed by: Shaoyun.liu <Shaoyun.liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 14 12月, 2021 27 次提交
-
-
由 chiminghao 提交于
return value form directly instead of taking this in another redundant variable. Reported-by: NZeal Robot <zealci@zte.com.cm> Signed-off-by: Nchiminghao <chi.minghao@zte.com.cn> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
Fix the warning below: warning: Cannot understand * \file amdgpu_ioc32.c on line 2 - I thought it was a doc line Changes since v1: - As suggested by Alexander Deucher: 1. Reduce diff to minimum as this DOC section doesn't provide much value. Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
This commit fixes the compile-time warning below: warning: no previous prototype for ‘amdgpu_ras_mca_query_error_status’ [-Wmissing-prototypes] Changes since v1: - As suggested by Alexander Deucher: 1. Make function static instead of adding prototype. Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
For userptr bo, if adev is not in IOMMU isolation mode, RAM direct map to GPU, multiple GPUs use same system memory dma mapping address, they can share the original mem->bo in attachment to reduce dma address array memory usage. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
If host and amdgpu IOMMU is not enabled or IOMMU is pass through mode, set adev->ram_is_direct_mapped flag which will be used to optimize memory usage for multi GPU mappings. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Lang Yu 提交于
SMU firmware expects the driver maintains error context and doesn't interact with SMU any more when SMU errors occurred. That will aid in debugging SMU firmware issues. Add SMU debug option support for this request, it can be enabled or disabled via amdgpu_smu_debug debugfs file. Use a 32-bit mask to indicate corresponding debug modes. Currently, only one mode(HALT_ON_ERROR) is supported. When enabled, it brings hardware to a kind of halt state so that no one can touch it any more in the envent of SMU errors. The dirver interacts with SMU via sending messages. And threre are three ways to sending messages to SMU in current implementation. Handle them respectively as following: 1, smu_cmn_send_smc_msg_with_param() for normal timeout cases Halt on any error. 2, smu_cmn_send_msg_without_waiting()/smu_cmn_wait_for_response() for longer timeout cases Halt on errors apart from ETIME. Otherwise this way won't work. Let the user handle ETIME error in such a case. 3, smu_cmn_send_msg_without_waiting() for no waiting cases Halt on errors apart from ETIME. Otherwise second way won't work. == Command Guide == 1, enable HALT_ON_ERROR mode # echo 0x1 > /sys/kernel/debug/dri/0/amdgpu_smu_debug 2, disable HALT_ON_ERROR mode # echo 0x0 > /sys/kernel/debug/dri/0/amdgpu_smu_debug v5: - Use bit mask to allow more debug features.(Evan) - Use WRAN() instead of BUG().(Evan) v4: - Set to halt state instead of a simple hang.(Christian) v3: - Use debugfs_create_bool().(Christian) - Put variable into smu_context struct. - Don't resend command when timeout. v2: - Resend command when timeout.(Lijo) - Use debugfs file instead of module parameter. Signed-off-by: NLang Yu <lang.yu@amd.com> Reviewed-by: NLijo Lazar <lijo.lazar@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Lang Yu 提交于
It is useful to maintain error context when debugging SW/FW issues. Introduce amdgpu_device_halt() for this purpose. It will bring hardware to a kind of halt state, so that no one can touch it any more. Compare to a simple hang, the system will keep stable at least for SSH access. Then it should be trivial to inspect the hardware state and see what's going on. v2: - Set adev->no_hw_access earlier to avoid potential crashes.(Christian) Suggested-by: NChristian Koenig <christian.koenig@amd.com> Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Signed-off-by: NLang Yu <lang.yu@amd.com> Reviewed-by: NChristian Koenig <christian.koenig@amd.co> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
in case they are not avaiable in early phase Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NLe Ma <Le.Ma@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
Leave this bit as hardware default setting Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
should count on GC IP base address Signed-off-by: NLe Ma <le.ma@amd.com> Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
read and authenticate ip discovery binary getting from vram first, if it is not valid, read and authenticate the one getting from file Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
To be used to check ip discovery binary signature Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
add _from_vram in the funciton name to diffrentiate the one used to read from file Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
To be used when ip_discovery binary is not carried by vbios Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Leslie Shi 提交于
Guest OS will setup VCN instance 1 which is disabled as an enabled instance and execute initialization work on it, but this causes VCN ib ring test failure on the disabled VCN instance during modprobe: amdgpu 0000:00:08.0: amdgpu: ring vcn_enc_1.0 uses VM inv eng 5 on hub 1 amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_dec_0 (-110). amdgpu 0000:00:08.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on vcn_enc_0.0 (-110). [drm:amdgpu_device_delayed_init_work_handler [amdgpu]] *ERROR* ib ring test failed (-110). v2: drop amdgpu_discovery_get_vcn_version and rename sriov_config to vcn_config v3: modify VCN's revision in SR-IOV and bare-metal Fixes: baf3f8f3 ("drm/amdgpu: handle SRIOV VCN revision parsing") Signed-off-by: NLeslie Shi <Yuliang.Shi@amd.com> Reviewed-by: NLijo Lazar <lijo.lazar@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Leslie Shi 提交于
Fix following warning in SRIOV during modprobe: amdgpu 0000:00:08.0: GFX9+ requires FB check based on format modifier WARNING: CPU: 0 PID: 1023 at drivers/gpu/drm/amd/amdgpu/amdgpu_display.c:1150 amdgpu_display_framebuffer_init+0x8e7/0xb40 [amdgpu] Signed-off-by: NLeslie Shi <Yuliang.Shi@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Lang Yu 提交于
We found some headaches on ASICs don't need that, so remove that for them. Suggested-by: NLijo Lazar <lijo.lazar@amd.com> Signed-off-by: NLang Yu <lang.yu@amd.com> Reviewed-by: NKevin Wang <kevinyang.wang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
Remove not unique timestamp WARNING as same timestamp interrupt happens on some chips, Drain fault need to wait for the processed_timestamp to be truly greater than the checkpoint or the ring to be empty to be sure no stale faults are handled. Bug: https://gitlab.freedesktop.org/drm/amd/-/issues/1818Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
This fixes the warning below by changing the prototype to a location that's actually included by the .c files that call amdgpu_kms_compat_ioctl: warning: no previous prototype for ‘amdgpu_kms_compat_ioctl’ [-Wmissing-prototypes] 37 | long amdgpu_kms_compat_ioctl(struct file *filp, unsigned int cmd, unsigned long arg) | ^~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
This fixes warnings caused by global functions lacking prototypes:, such as: warning: no previous prototype for 'dcn303_hw_sequencer_construct' [-Wmissing-prototypes] 12 | void dcn303_hw_sequencer_construct(struct dc *dc) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ... warning: no previous prototype for ‘amdgpu_has_atpx’ [-Wmissing-prototypes] 76 | bool amdgpu_has_atpx(void) { | ^~~~~~~~~~~~~~~ Reviewed-by: NRodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
This turns previously global functions into static, thus removing compile-time warnings such as: warning: no previous prototype for 'amdgpu_vkms_output_init' [-Wmissing-prototypes] 399 | int amdgpu_vkms_output_init(struct drm_device *dev, | ^~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Isabella Basso 提交于
This fixes various warnings relating to erroneous docstring syntax, of which some are listed below: warning: Function parameter or member 'adev' not described in 'amdgpu_atomfirmware_ras_rom_addr' ... warning: expecting prototype for amdgpu_atpx_validate_functions(). Prototype was for amdgpu_atpx_validate() instead ... warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new' ... warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of waves in-flight on this device ... warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Zhigang Luo 提交于
For the ASIC has big FB, it need more time to clear FB during reset. This change extended SRIOV VF waiting reset completion timeout from 5s to 10s. Signed-off-by: NZhigang Luo <zhigang.luo@amd.com> Acked-by: NShaoyun Liu <shaoyun.liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Zhigang Luo 提交于
For SRIOV VF, the XGMI topology was not recovered after reset. This change added code to SRIOV VF reset function to update XGMI topology for SRIOV VF after reset. Signed-off-by: NZhigang Luo <zhigang.luo@amd.com> Reviewed-by: NShaoyun Liu <shaoyun.liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Zhigang Luo 提交于
For SRIOV VF, XGMI was not initialized in PSP during recover. This change added PSP XGMI initialization for SRIOV VF during recover. Signed-off-by: NZhigang Luo <zhigang.luo@amd.com> Reviewed-by: NShaoyun Liu <shaoyun.liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Zhigang Luo 提交于
On SRIOV, host driver can support FLR(function level reset) on individual VF within the hive which might bring the individual device back to normal without the necessary to execute the hive reset. If the FLR failed , host driver will trigger the hive reset, each guest VF will get reset notification before the real hive reset been executed. The VF device can handle the reset request individually in it's reset work handler. This change updated gpu recover sequence to skip reset other device in the same hive for SRIOV VF. Signed-off-by: NZhigang Luo <zhigang.luo@amd.com> Reviewed-by: NShaoyun Liu <shaoyun.liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
The RAS poison mode is enabled by default on the platform. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 08 12月, 2021 6 次提交
-
-
由 Claudio Suarez 提交于
Once EDID is parsed, the monitor HDMI support information is available through drm_display_info.is_hdmi. The amdgpu driver still calls drm_detect_hdmi_monitor() to retrieve the same information, which is less efficient. Change to drm_display_info.is_hdmi This is a TODO task in Documentation/gpu/todo.rst Reviewed-by: NHarry Wentland <harry.wentland@amd.com> Signed-off-by: NClaudio Suarez <cssk@net-c.es> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Claudio Suarez 提交于
drm_display_info is updated by drm_get_edid() or drm_connector_update_edid_property(). In the amdgpu driver it is almost always updated when the edid is read in amdgpu_connector_get_edid(), but not always. Change amdgpu_connector_get_edid() and amdgpu_connector_free_edid() to keep drm_display_info updated. Reviewed-by: NHarry Wentland <harry.wentland@amd.com> Signed-off-by: NClaudio Suarez <cssk@net-c.es> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
remove in recovery stat check, skip umc ras err cnt harvest in amdgpu_ras_log_on_err_counter Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Flora Cui 提交于
Signed-off-by: NFlora Cui <flora.cui@amd.com> Reviewed-by: NLeslie Shi <Yuliang.Shi@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Flora Cui 提交于
Signed-off-by: NFlora Cui <flora.cui@amd.com> Reviewed-by: NLeslie Shi <Yuliang.Shi@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
skip get ecc info for aldebarn through check ip version do not affect other asic type Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 03 12月, 2021 3 次提交
-
-
由 Zhou Qingyang 提交于
In amdgpu_connector_lcd_native_mode(), the return value of drm_mode_duplicate() is assigned to mode, and there is a dereference of it in amdgpu_connector_lcd_native_mode(), which will lead to a NULL pointer dereference on failure of drm_mode_duplicate(). Fix this bug add a check of mode. This bug was found by a static analyzer. The analysis employs differential checking to identify inconsistent security operations (e.g., checks or kfrees) between two code paths and confirms that the inconsistent operations are not recovered in the current function or the callers, so they constitute bugs. Note that, as a bug found by static analysis, it can be a false positive or hard to trigger. Multiple researchers have cross-reviewed the bug. Builds with CONFIG_DRM_AMDGPU=m show no new warnings, and our static analyzer no longer warns about this code. Fixes: d38ceaf9 ("drm/amdgpu: add core driver (v4)") Signed-off-by: NZhou Qingyang <zhou1615@umn.edu> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Deucher 提交于
For SR-IOV, the IP discovery revision number encodes additional information. Handle that case here. v2: drop additional IP versions Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Acked-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Stanley.Yang 提交于
this is a workaround due to get ecc info failed during gpu recovery [ 700.236122] amdgpu 0000:09:00.0: amdgpu: Failed to export SMU ecc table! [ 700.236128] amdgpu 0000:09:00.0: amdgpu: GPU reset begin! [ 704.331171] amdgpu: qcm fence wait loop timeout expired [ 704.331194] amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption [ 704.332445] amdgpu 0000:09:00.0: amdgpu: GPU reset begin! [ 704.332448] amdgpu 0000:09:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress [ 704.332456] amdgpu: Pasid 0x8000 destroy queue 0 failed, ret -62 [ 710.360924] amdgpu 0000:09:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x00000013 SMN_C2PMSG_82:0x00000007 [ 710.360964] amdgpu 0000:09:00.0: amdgpu: Failed to disable smu features. [ 710.361002] amdgpu 0000:09:00.0: amdgpu: Fail to disable dpm features! [ 710.361014] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62 Signed-off-by: NStanley.Yang <Stanley.Yang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 02 12月, 2021 3 次提交
-
-
由 Yann Dirson 提交于
amdgpu_ucode_get_load_type() does not interpret this parameter as documented. It is ignored for many ASIC types (which presumably only support one load_type), and when not ignored it is only used to force direct loading instead of PSP loading. SMU loading is only available for ASICs for which the parameter is ignored. Signed-off-by: NYann Dirson <ydirson@free.fr> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
Refactor userptr and pin_bo path to make it less confusing, move err_pin_bo label up to remove mem from process_info kfd_bo_list. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 shaoyunl 提交于
This change revert previous commits: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") This change moves the amdgpu_amdkfd_pre_reset to an earlier place in amdgpu_device_reset_sriov, presumably to address the sequence issue that the first patch was originally meant to fix. Some register access(GRBM_GFX_CNTL) only be allowed on full access mode. Move kfd_pre_reset and kfd_post_reset back inside reset_sriov function. Fixes: 9f4f2c1a ("drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov") Fixes: 271fd38c ("drm/amdgpu: move kfd post_reset out of reset_sriov function") Signed-off-by: Nshaoyunl <shaoyun.liu@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-