- 14 9月, 2019 11 次提交
-
-
由 Tao Zhou 提交于
save umc error records to ras bad page array v2: add bad pages before gpu reset v3: add NULL check for adev->umc.funcs Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
change r type from bool to int, suitable for both bool and int return value Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
amdgpu_gmc_ras_late_init is used to init gmc specfic ras debugfs/sysfs node and gmc specific interrupt handler. It can be shared among gmc generations. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
to prevent access to dangling pointers Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Andrey Grodzovsky 提交于
Problem: Under certain conditions, when some IP bocks take a RAS error, we can get into a situation where a GPU reset is not possible due to issues in RAS in SMU/PSP. Temporary fix until proper solution in PSP/SMU is ready: When uncorrectable error happens the DF will unconditionally broadcast error event packets to all its clients/slave upon receiving fatal error event and freeze all its outbound queues, err_event_athub interrupt will be triggered. In such case and we use this interrupt to issue GPU reset. THe GPU reset code is modified for such case to avoid HW reset, only stops schedulers, deatches all in progress and not yet scheduled job's fences, set error code on them and signals. Also reject any new incoming job submissions from user space. All this is done to notify the applications of the problem. v2: Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c Remove print param from amdgpu_ras_query_error_count v3: Update based on prevoius bug fixing patch to properly call amdgpu_amdkfd_pre_reset for other XGMI hive memebers. Signed-off-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
free ras_if if ras is not supported Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
mmhub callback functions are not initialized for all the ASICs Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Oak Zeng 提交于
Arcturus repurposed mtype WC to RW. Modify gmc functions to support the new mtype Signed-off-by: NOak Zeng <Oak.Zeng@amd.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NShaoyun Liu <Shaoyun.Liu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
The function will be called in late init phase to do mmhub ras init v2: check ras_late_init function pointer before invoking the function Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
call helper function in late init phase to handle ras init for gmc ip block v2: call ras_late_fini to do clean up when fail to enable interrupt Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
no functional change, just switch to new structures Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 30 8月, 2019 1 次提交
-
-
由 Tianci.Yin 提交于
stolen memory should be fixed in visible region. Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NTianci.Yin <tianci.yin@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 27 8月, 2019 1 次提交
-
-
由 Xiaojie Yuan 提交于
The GRBM register interface is now capable of bursting 1 cycle per register wr->wr, wr->rd much faster than previous muticycle per transaction done interface. This has caused a problem where status registers requiring HW to update have a 1 cycle delay, due to the register update having to go through GRBM. SW may operate on an incorrect value if they write a register and immediately check the corresponding status register. Registers requiring HW to clear or set fields may be delayed by 1 cycle. For example, 1. write VM_INVALIDATE_ENG0_REQ mask = 5a 2. read VM_INVALIDATE_ENG0_ACK till the ack is same as the request mask = 5a a. HW will reset VM_INVALIDATE_ENG0_ACK = 0 until invalidation is complete 3. write VM_INVALIDATE_ENG0_REQ mask = 5a 4. read VM_INVALIDATE_ENG0_ACK till the ack is same as the request mask = 5a a. First read of VM_INVALIDATE_ENG0_ACK = 5a instead of 0 b. Second read of VM_INVALIDATE_ENG0_ACK = 0 because the remote GRBM h/w register takes one extra cycle to be cleared c. In this case, SW will see a false ACK if they exit on first read Affected registers (only GC variant) | Recommended Dummy Read --------------------------------------+---------------------------- VM_INVALIDATE_ENG*_ACK | VM_INVALIDATE_ENG*_REQ VM_L2_STATUS | VM_L2_STATUS VM_L2_PROTECTION_FAULT_STATUS | VM_L2_PROTECTION_FAULT_STATUS VM_L2_PROTECTION_FAULT_ADDR_HI/LO32 | VM_L2_PROTECTION_FAULT_ADDR_HI/LO32 VM_L2_IH_LOG_BUSY | VM_L2_IH_LOG_BUSY MC_VM_L2_PERFCOUNTER_HI/LO | MC_VM_L2_PERFCOUNTER_HI/LO ATC_L2_PERFCOUNTER_HI/LO | ATC_L2_PERFCOUNTER_HI/LO ATC_L2_PERFCOUNTER2_HI/LO | ATC_L2_PERFCOUNTER2_HI/LO Signed-off-by: NXiaojie Yuan <xiaojie.yuan@amd.com> Reviewed-by: NJack Xiao <Jack.Xiao@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 23 8月, 2019 2 次提交
-
-
由 Frank.Min 提交于
arcturus for sriov would use the unified mc base address Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NFrank.Min <Frank.Min@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Frank.Min 提交于
Since agp is not used for sriov, just disable it Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NFrank.Min <Frank.Min@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 15 8月, 2019 3 次提交
-
-
由 Yong Zhao 提交于
RW is also useful in most cases. Signed-off-by: NYong Zhao <Yong.Zhao@amd.com> Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Oak Zeng 提交于
This is for kfd to reuse amdgpu TLB invalidation function. On gfx10, kfd only needs to flush TLB on gfx hub but not on mm hub. So export a function for KFD flush TLB only on specific hub. Signed-off-by: NOak Zeng <Oak.Zeng@amd.com> Reviewed-by: NChristian Konig <christian.koenig@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Christoph Hellwig 提交于
Use dma_set_mask_and_coherent to set both masks in one go, and remove the no longer required fallback, as the kernel now always accepts larger than required DMA masks. Fail the driver probe if we can't set the DMA mask, as that means the system can only support a larger mask. Reviewed-by: NChristian König <christian.koenig@amd.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 13 8月, 2019 5 次提交
-
-
由 Huang Rui 提交于
Add gfx memory controller support for renoir. Acked-by: NHuang Rui <ray.huang@amd.com> Signed-off-by: NHuang Rui <ray.huang@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
Add 2 mmhub instances CG Signed-off-by: NLe Ma <le.ma@amd.com> Reviewed-by: NKevin Wang <kevin1.wang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
Untie the bind of get/set athub CG state from mmhub, for cosmetic fix and Asic not using mmhub 1.0. Besides, also fix wrong athub CG state in amdgpu_pm_info. Signed-off-by: NLe Ma <le.ma@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
enable mmhub ras feature and create sysfs/debugfs node for mmhub Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
add amdgpu_mmhub_funcs definition and initialize it, prepare for mmhub ras enablement Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 02 8月, 2019 5 次提交
-
-
由 Tao Zhou 提交于
ce can also trigger interrupt, and even both ce and ue error can be found in one ras query, distinguishing between ce and ue in interrupt handler is uncessary. Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Suggested-by: NGuchun Chen <guchun.chen@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
we only read error information for correctable error in interrupt handler, gpu reset is unnecessary since there is no data lost in correctable error Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
umc error address query can get ce/ue error address and clear error status Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
add initialization for new members of amdgpu_umc structure Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Monk Liu 提交于
we can simplify all those unnecessary function under SRIOV for vega10 since: 1) PSP L1 policy is by force enabled in SRIOV 2) original logic always set all flags which make itself a dummy step besides, 1) the ih_doorbell_range set should also be skipped for VEGA10 SRIOV. 2) the gfx_common registers should also be skipped for VEGA10 SRIOV. Signed-off-by: NMonk Liu <Monk.Liu@amd.com> Reviewed-by: NEmily Deng <Emily.Deng@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 01 8月, 2019 4 次提交
-
-
由 Tao Zhou 提交于
add err_data parameter in interrupt cb for ras clients Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Tao Zhou 提交于
create new amdgpu_umc structure to for more umc settings in future and switch to the new structure Signed-off-by: NTao Zhou <tao.zhou1@amd.com> Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
check umc error count in both ras querry function and ras interrupt handler Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
init umc callback function for vega20 in sw early init phase Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NDennis Li <dennis.li@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 19 7月, 2019 8 次提交
-
-
由 Yong Zhao 提交于
With the printing, we don't need to parse the value on our own any more. Signed-off-by: NYong Zhao <Yong.Zhao@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Hawking Zhang 提交于
Any dce register read back from arct is invalid. use hard code stolen memory for arct until we validate the s3. Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: NLe Ma <Le.Ma@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
There's no LUT register for second mmhub to convert pasid since it has no ATC. Signed-off-by: NLe Ma <le.ma@amd.com> Reviewed-by: NFeifei Xu <Feifei.Xu@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
There is one more vmc interrupt and mmhub on Arcturus. Signed-off-by: NLe Ma <le.ma@amd.com> Acked-by: Snow Zhang < Snow.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
New IH client id for VMC1. Signed-off-by: NLe Ma <le.ma@amd.com> Acked-by: Snow Zhang < Snow.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
Arcturus has two MMHUBs. Signed-off-by: NLe Ma <le.ma@amd.com> Acked-by: Snow Zhang < Snow.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
v2: set mmhub num under CHIP_ARCTURUS switch case and add one more mmhub id_mgr Signed-off-by: NLe Ma <le.ma@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Le Ma 提交于
It aims to replace AMDGPU_MAX_VMHUBS in for loop to initialize registers. Signed-off-by: NLe Ma <le.ma@amd.com> Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-