Commits · 87d2b92f1e9df64a74f7fda0691d4041ba2727f9 · openeuler / Kernel

14 Sep, 2019 11 commits

drm/amdgpu: save umc error records · 87d2b92f

Committed by Tao Zhou on Aug 15, 2019

save umc error records to ras bad page array

v2: add bad pages before gpu reset
v3: add NULL check for adev->umc.funcs
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Reviewed-by: Guchun Chen <guchun.chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

87d2b92f

drm/amdgpu: change r type to int in gmc_v9_0_late_init · c5b6e585

Committed by Tao Zhou on Sep 02, 2019

change the type of r from bool to int so that it can hold both bool and int
return values
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c5b6e585
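
The reason for the type change can be seen in a small user-space sketch. It is not the code from gmc_v9_0_late_init; the helper names below are hypothetical and only illustrate that a bool collapses any non-zero value to 1, so a negative error code is lost.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical init step that fails with a negative errno value. */
static int enable_interrupt_model(void)
{
    return -EINVAL;
}

/* With a bool, every non-zero result becomes true (1). */
static int late_init_with_bool(void)
{
    bool r = enable_interrupt_model();

    return r; /* the original error code is gone */
}

/* With an int, the error code reaches the caller unchanged. */
static int late_init_with_int(void)
{
    int r = enable_interrupt_model();

    return r;
}
```

A caller that checks for `r < 0` would miss the failure in the bool variant, because the value it sees is 1.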

drm/amdgpu/gmc: switch to amdgpu_gmc_ras_late_init helper function · a85eff14

Committed by Hawking Zhang on Sep 03, 2019

amdgpu_gmc_ras_late_init is used to init the gmc specific
ras debugfs/sysfs node and the gmc specific interrupt handler.
It can be shared among gmc generations.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a85eff14

drm/amdgpu: set ip specific ras interface pointer to NULL after free it · d094aea3

Committed by Hawking Zhang on Sep 03, 2019

to prevent access to dangling pointers
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

d094aea3
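
A minimal user-space sketch of the free-then-NULL pattern this commit describes. The struct and function names are hypothetical stand-ins, not the driver's own types; the kernel code uses kfree rather than free.

```c
#include <stdlib.h>

/* Hypothetical stand-in for an IP-specific ras interface object. */
struct ras_if_model {
    int block;
};

/* Hypothetical stand-in for the owning IP block state. */
struct ip_block_model {
    struct ras_if_model *ras_if;
};

static int ras_init(struct ip_block_model *ip)
{
    ip->ras_if = malloc(sizeof(*ip->ras_if));
    if (!ip->ras_if)
        return -1;
    ip->ras_if->block = 0;
    return 0;
}

static void ras_fini(struct ip_block_model *ip)
{
    free(ip->ras_if);
    /* Clear the pointer so later checks see "no interface" instead of
     * a dangling address; a second ras_fini() is then harmless. */
    ip->ras_if = NULL;
}
```

Because free(NULL) is a no-op, calling ras_fini twice is safe once the pointer is cleared.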

drm/amdgpu: Avoid HW GPU reset for RAS. · 7c6e68c7

Committed by Andrey Grodzovsky on Sep 13, 2019

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until a proper solution in PSP/SMU is ready:
When an uncorrectable error happens, the DF will unconditionally
broadcast error event packets to all its clients/slaves upon
receiving the fatal error event and freeze all its outbound queues;
the err_event_athub interrupt will then be triggered.
In that case we use this interrupt to issue a GPU reset.
The GPU reset code is modified for this case to avoid a HW
reset: it only stops the schedulers, detaches the fences of all in-progress
and not yet scheduled jobs, sets an error code on them and signals them.
It also rejects any new incoming job submissions from user space.
All this is done to notify the applications of the problem.

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.
Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7c6e68c7

drm/amdgpu: fix memory leak when ras is not supported on specific ip block · 8bf2485a

Committed by Hawking Zhang on Aug 31, 2019

free ras_if if ras is not supported
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8bf2485a

drm/amdgpu: check mmhub_funcs pointer before refering to it · 4ce71be6

Committed by Hawking Zhang on Aug 31, 2019

mmhub callback functions are not initialized for all the ASICs
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4ce71be6

drm/amdgpu: Support new arcturus mtype · 093e48c0

Committed by Oak Zeng on Jul 26, 2019

Arcturus repurposed mtype WC to RW. Modify gmc functions
to support the new mtype
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

093e48c0

drm/amdgpu: add mmhub ras_late_init callback function (v2) · dda79907

Committed by Hawking Zhang on Aug 30, 2019

The function will be called in late init phase to do mmhub
ras init

v2: check ras_late_init function pointer before invoking the
function
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dda79907
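
The v2 note (check the function pointer before invoking it) follows a common optional-callback pattern. The sketch below is a user-space model under assumed names; the structs and the sample callback are hypothetical and do not reproduce the driver's definitions.

```c
#include <stddef.h>

/* Hypothetical callback table; not every ASIC fills it in. */
struct mmhub_funcs_model {
    int (*ras_late_init)(void *ctx);
};

/* Hypothetical device state holding an optional callback table. */
struct dev_model {
    const struct mmhub_funcs_model *mmhub_funcs;
};

/* Sample callback used for illustration: counts how often it ran. */
static int sample_ras_late_init(void *ctx)
{
    *(int *)ctx += 1;
    return 0;
}

static int late_init(struct dev_model *dev, void *ctx)
{
    /* Both the table and the individual callback may be absent. */
    if (dev->mmhub_funcs && dev->mmhub_funcs->ras_late_init)
        return dev->mmhub_funcs->ras_late_init(ctx);

    return 0; /* nothing to do on this ASIC */
}
```

Treating a missing callback as success keeps the late init path usable on ASICs that never register mmhub functions.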

drm/amdgpu: switch to amdgpu_ras_late_init for gmc v9 block (v2) · 2452e778

Committed by Hawking Zhang on Aug 29, 2019

call helper function in late init phase to handle ras init
for gmc ip block

v2: call ras_late_fini to clean up when enabling the interrupt fails
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2452e778

drm/amdgpu: switch to new amdgpu_nbio structure · bebc0762

Committed by Hawking Zhang on Aug 23, 2019

no functional change, just switch to new structures
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bebc0762

30 Aug, 2019 1 commit

drm/amdgpu: keep the stolen memory in visible vram region · 994dcfaa

Committed by Tianci.Yin on Aug 28, 2019

stolen memory should be fixed in visible region.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Tianci.Yin <tianci.yin@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

994dcfaa

27 Aug, 2019 1 commit

drm/amdgpu: add dummy read for some GCVM status registers · 53499173

Committed by Xiaojie Yuan on Aug 16, 2019

The GRBM register interface is now capable of bursting 1 cycle per
register wr->wr, wr->rd much faster than previous multicycle per
transaction done interface.  This has caused a problem where status
registers requiring HW to update have a 1 cycle delay, due to the
register update having to go through GRBM.

SW may operate on an incorrect value if they write a register and
immediately check the corresponding status register.

Registers requiring HW to clear or set fields may be delayed by 1 cycle.
For example,

1. write VM_INVALIDATE_ENG0_REQ mask = 5a
2. read VM_INVALIDATE_ENG0_ACK till the ack is same as the request mask = 5a
    a. HW will reset VM_INVALIDATE_ENG0_ACK = 0 until invalidation is complete
3. write VM_INVALIDATE_ENG0_REQ mask = 5a
4. read VM_INVALIDATE_ENG0_ACK till the ack is same as the request mask = 5a
    a. First read of VM_INVALIDATE_ENG0_ACK = 5a instead of 0
    b. Second read of VM_INVALIDATE_ENG0_ACK = 0 because
       the remote GRBM h/w register takes one extra cycle to be cleared
    c. In this case, SW will see a false ACK if they exit on first read

Affected registers (only GC variant)  |  Recommended Dummy Read
--------------------------------------+----------------------------
VM_INVALIDATE_ENG*_ACK                |  VM_INVALIDATE_ENG*_REQ
VM_L2_STATUS                          |  VM_L2_STATUS
VM_L2_PROTECTION_FAULT_STATUS         |  VM_L2_PROTECTION_FAULT_STATUS
VM_L2_PROTECTION_FAULT_ADDR_HI/LO32   |  VM_L2_PROTECTION_FAULT_ADDR_HI/LO32
VM_L2_IH_LOG_BUSY                     |  VM_L2_IH_LOG_BUSY
MC_VM_L2_PERFCOUNTER_HI/LO            |  MC_VM_L2_PERFCOUNTER_HI/LO
ATC_L2_PERFCOUNTER_HI/LO              |  ATC_L2_PERFCOUNTER_HI/LO
ATC_L2_PERFCOUNTER2_HI/LO             |  ATC_L2_PERFCOUNTER2_HI/LO
Signed-off-by: Xiaojie Yuan <xiaojie.yuan@amd.com>
Reviewed-by: Jack Xiao <Jack.Xiao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

53499173
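
The false ACK scenario in this commit message can be reproduced with a toy register model. This is a simulation under stated assumptions, not the driver code: the struct, the one-access delay before the ACK clear lands, and the three-access invalidation latency are all invented for illustration. Only the sequence (write REQ, dummy-read REQ, then poll ACK) comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one invalidation engine (all fields hypothetical). */
struct vm_hub_model {
    uint32_t req;
    uint32_t ack;
    bool clear_pending; /* HW clear of ACK is still in flight */
    int busy;           /* accesses left until invalidation completes */
};

/* Every register access advances the model by one step. */
static void tick(struct vm_hub_model *hub)
{
    if (hub->clear_pending) {
        hub->ack = 0;
        hub->clear_pending = false;
    } else if (hub->busy > 0 && --hub->busy == 0) {
        hub->ack = hub->req;
    }
}

static void write_req(struct vm_hub_model *hub, uint32_t mask)
{
    hub->req = mask;
    hub->clear_pending = true; /* the clear lands one access later */
    hub->busy = 3;             /* arbitrary latency for the sketch */
}

static uint32_t read_req(struct vm_hub_model *hub)
{
    uint32_t v = hub->req;

    tick(hub);
    return v;
}

static uint32_t read_ack(struct vm_hub_model *hub)
{
    uint32_t v = hub->ack;

    tick(hub);
    return v;
}

/* Returns the number of ACK polls until the ACK matched the mask. */
static int invalidate(struct vm_hub_model *hub, uint32_t mask, bool dummy_read)
{
    int polls = 0;

    write_req(hub, mask);
    if (dummy_read)
        (void)read_req(hub); /* absorbs the delayed ACK clear */

    for (;;) {
        polls++;
        if ((read_ack(hub) & mask) == mask)
            break;
    }
    return polls;
}
```

In this model, a second invalidation without the dummy read returns after a single poll while the model is still busy, which is the false ACK from step 4. With the dummy read, the stale ACK is cleared before the first poll.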

23 Aug, 2019 2 commits

drm/amdgpu: unity mc base address for arcturus · 9d4f837a

Committed by Frank.Min on Aug 21, 2019

arcturus for sriov would use the unified mc base address
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Frank.Min <Frank.Min@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

9d4f837a

drm/amdgpu: disable agp for sriov · 81c274c4

Committed by Frank.Min on Aug 21, 2019

Since agp is not used for sriov, just disable it
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Frank.Min <Frank.Min@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

81c274c4

15 Aug, 2019 3 commits

drm/amdgpu: Add printing for RW extracted from VM_L2_PROTECTION_FAULT_STATUS · 4e0ae5e2

Committed by Yong Zhao on Aug 13, 2019

RW is also useful in most cases.
Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4e0ae5e2

drm/amdgpu: Export function to flush TLB of specific vm hub · 3ff98548

Committed by Oak Zeng on Aug 01, 2019

This is for kfd to reuse the amdgpu TLB invalidation function.
On gfx10, kfd only needs to flush the TLB on the gfx hub but not
on the mm hub. So export a function that lets KFD flush the TLB
only on a specific hub.
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
Reviewed-by: Christian Konig <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3ff98548

drm/amdgpu: simplify and cleanup setting the dma mask · 244511f3

Committed by Christoph Hellwig on Aug 15, 2019

Use dma_set_mask_and_coherent to set both masks in one go, and remove
the no longer required fallback, as the kernel now always accepts
larger than required DMA masks.  Fail the driver probe if we can't
set the DMA mask, as that means the system can only support a larger
mask.
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

244511f3

13 Aug, 2019 5 commits

drm/amdgpu: add gmc v9 supports for renoir · 8787ee01

Committed by Huang Rui on Jul 24, 2019

Add gfx memory controller support for renoir.
Acked-by: Huang Rui <ray.huang@amd.com>
Signed-off-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8787ee01

drm/amdgpu: add mmhub clock gating for Arcturus · cb15e804

Committed by Le Ma on Aug 09, 2019

Add 2 mmhub instances CG
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Kevin Wang <kevin1.wang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

cb15e804

drm/amdgpu: split athub clock gating from mmhub · bee7b51a

Committed by Le Ma on Aug 08, 2019

Decouple getting/setting the athub CG state from mmhub, both as a cosmetic fix
and for ASICs not using mmhub 1.0. Also fix the wrong athub CG state in amdgpu_pm_info.
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bee7b51a

drm/amdgpu: create mmhub ras framework · 145b03eb

Committed by Tao Zhou on Aug 07, 2019

enable mmhub ras feature and create sysfs/debugfs node for mmhub
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

145b03eb

drm/amdgpu: add amdgpu_mmhub_funcs definition · 3d093da0

Committed by Tao Zhou on Aug 06, 2019

add amdgpu_mmhub_funcs definition and initialize it,
prepare for mmhub ras enablement
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3d093da0

02 Aug, 2019 5 commits

drm/amdgpu: replace AMDGPU_RAS_UE with AMDGPU_RAS_SUCCESS · bd2280da

Committed by Tao Zhou on Aug 01, 2019

ce can also trigger an interrupt, and both ce and ue errors can even be
found in one ras query, so distinguishing between ce and ue in the interrupt
handler is unnecessary.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Suggested-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bd2280da

drm/amdgpu: only uncorrectable error needs gpu reset · 91ba68f8

Committed by Tao Zhou on Aug 01, 2019

for a correctable error we only read the error information in the interrupt
handler; a gpu reset is unnecessary since no data is lost
in a correctable error
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

91ba68f8
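
The decision described in this commit can be sketched as a small predicate. The struct and function names are hypothetical; the sketch only assumes that a ras query yields separate correctable (ce) and uncorrectable (ue) counts.

```c
#include <stdbool.h>

/* Hypothetical result of one ras query. */
struct err_counts {
    unsigned long ce_count; /* correctable errors: no data lost */
    unsigned long ue_count; /* uncorrectable errors: data lost */
};

/* Only an uncorrectable error warrants a gpu reset. */
static bool needs_gpu_reset(const struct err_counts *e)
{
    return e->ue_count > 0;
}
```

A query that reports both kinds of error still requests a reset, because the predicate looks at the ue count alone.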

drm/amdgpu: add error address query for umc ras · 13b7c46c

Committed by Tao Zhou on Aug 01, 2019

umc error address query can get ce/ue error address and clear error
status
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

13b7c46c

drm/amdgpu: initialize new parameters and functions for amdgpu_umc structure · 3aacf4ea

Committed by Tao Zhou on Jul 29, 2019

add initialization for new members of amdgpu_umc structure
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3aacf4ea

drm/amdgpu: cleanup vega10 SRIOV code path · 4cd4c5c0

Committed by Monk Liu on Jul 30, 2019

we can remove all those unnecessary functions under
SRIOV for vega10 since:
1) PSP L1 policy is forcibly enabled in SRIOV
2) the original logic always sets all flags, which makes it
   a dummy step

besides,
1) the ih_doorbell_range set should also be skipped
for VEGA10 SRIOV.
2) the gfx_common registers should also be skipped
for VEGA10 SRIOV.
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
Reviewed-by: Emily Deng <Emily.Deng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

4cd4c5c0

01 Aug, 2019 4 commits

drm/amdgpu: update interrupt callback for all ras clients · 81e02619

Committed by Tao Zhou on Jul 22, 2019

add err_data parameter in interrupt cb for ras clients
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

81e02619

drm/amdgpu: switch to amdgpu_umc structure · 045c0216

Committed by Tao Zhou on Jul 23, 2019

create new amdgpu_umc structure for more umc
settings in future and switch to the new structure
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

045c0216

drm/amdgpu: querry umc error count · 939e2258

Committed by Hawking Zhang on Jul 17, 2019

check umc error count in both ras query function and
ras interrupt handler
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

939e2258

drm/amdgpu: init umc v6_1 functions for vega20 · 5b6b35aa

Committed by Hawking Zhang on Jul 17, 2019

init umc callback function for vega20 in sw early init phase
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <dennis.li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5b6b35aa

19 Jul, 2019 8 commits

drm/amdgpu: Add more detail to the VM fault printing · 5ddd4a9a

Committed by Yong Zhao on Jul 01, 2019

With the printing, we don't need to parse the value on our own any more.
Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5ddd4a9a

drm/amdgpu: keep stolen memory for arct · bfa3a9bb

Committed by Hawking Zhang on Jun 28, 2019

Any dce register read back from arct is invalid. Use hard-coded
stolen memory for arct until we validate the s3.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Le Ma <Le.Ma@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bfa3a9bb

drm/amdgpu: skip pasid mapping for second mmhub on Arcturus · f2d66571

Committed by Le Ma on Sep 11, 2018

There's no LUT register for second mmhub to convert pasid since it has no ATC.
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Feifei Xu <Feifei.Xu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f2d66571

drm/amdgpu: update vmc interrupt routine to support 3 vmhubs · 51c60898

Committed by Le Ma on Sep 06, 2018

There is one more vmc interrupt and mmhub on Arcturus.
Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Snow Zhang <Snow.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

51c60898

drm/amdgpu: add VMC1 interrupt client id for Arcturus · 7d19b15f

Committed by Le Ma on Sep 06, 2018

New IH client id for VMC1.
Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Snow Zhang <Snow.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7d19b15f

drm/amdgpu: use new mmhub interfaces for Arcturus · 51cce480

Committed by Le Ma on Sep 04, 2018

Arcturus has two MMHUBs.
Signed-off-by: Le Ma <le.ma@amd.com>
Acked-by: Snow Zhang <Snow.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

51cce480

drm/amdgpu: add one more mmhub instance for Arcturus (v2) · c8a6e2a3

Committed by Le Ma on Aug 31, 2018

v2: set mmhub num under CHIP_ARCTURUS switch case and add one more mmhub id_mgr
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c8a6e2a3

drm/amdgpu: add new member in amdgpu_device for vmhub counts per asic chip · 1daa2bfa

Committed by Le Ma on Aug 31, 2018

The new member replaces AMDGPU_MAX_VMHUBS as the bound of the for loops that initialize registers.
Signed-off-by: Le Ma <le.ma@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

1daa2bfa
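
The idea of bounding the loop by a per-device hub count instead of a compile-time maximum can be sketched as follows. The constant, struct and field names are hypothetical stand-ins; the value 3 is only an assumption for the sketch.

```c
/* Hypothetical compile-time upper bound on the number of hubs. */
#define MAX_VMHUBS_MODEL 3

/* Hypothetical device state with a per-ASIC hub count. */
struct device_model {
    unsigned int num_vmhubs;
    int hub_ready[MAX_VMHUBS_MODEL];
};

/* Initialize only the hubs this ASIC actually has; returns how many. */
static unsigned int init_hubs(struct device_model *dev)
{
    unsigned int i;

    for (i = 0; i < dev->num_vmhubs && i < MAX_VMHUBS_MODEL; i++)
        dev->hub_ready[i] = 1;

    return i;
}
```

An ASIC with two hubs then leaves the third slot untouched rather than programming registers for a hub that does not exist.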

openeuler / Kernel: synced successfully over 1 year ago
