Commits · 892a57a975c3bd51834ddb0afa5f27baa19a785b · openeuler / Kernel

15 Feb, 2022 · 1 commit

drm/amdgpu: Optimize xxx_ras_late_init/xxx_ras_late_fini for each ras block · bdb3489c

Committed by yipechai on Jan 30, 2022

1. Define amdgpu_ras_block_late_init to create sysfs nodes
   and interrupt handles.
2. Define amdgpu_ras_block_late_fini to remove sysfs nodes
   and interrupt handles.
3. Replace ras block variable members in struct
   amdgpu_ras_block_object with struct ras_common_if, which
   can make it easy to associate each ras block instance
   with each ras block functional interface.
4. Add .ras_cb to struct amdgpu_ras_block_object.
5. Adapt each ras block to the changes in struct
   amdgpu_ras_block_object.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


08 Feb, 2022 · 1 commit

drm/amdgpu: Fixed the defect of soft lock caused by infinite loop · d5e8ff5f

Committed by yipechai on Jan 29, 2022

1. The infinite loop only occurs on systems with multiple cards that
   support ras functions.
2. For an explanation of the root cause, refer to commit 76641cbbf196
   ("drm/amdgpu: Add judgement to avoid infinite loop").
3. Create a new node to manage each unique ras instance, to guarantee
   that each device's .ras_list is completely independent.
4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
   interface for each ras block").
5. The soft lockup logs are as follows:
[  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
[  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
[  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
[  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
[  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
[  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
[  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
[  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
[  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
[  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
[  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
[  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
[  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
[  262.166267] Call Trace:
[  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
[  262.166529]  ? psi_task_switch+0xd2/0x250
[  262.166537]  ? __switch_to+0x11d/0x460
[  262.166542]  ? __switch_to_asm+0x36/0x70
[  262.166549]  process_one_work+0x220/0x3c0
[  262.166556]  worker_thread+0x4d/0x3f0
[  262.166560]  ? process_one_work+0x3c0/0x3c0
[  262.166563]  kthread+0x12b/0x150
[  262.166568]  ? set_kthread_struct+0x40/0x40
[  262.166571]  ret_from_fork+0x22/0x30
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
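
Item 3 of this commit message describes giving every device its own list node for each ras instance. The idea can be sketched in userspace C as follows. All names here (`demo_ras_block`, `demo_ras_node`, `demo_device`, and the functions) are hypothetical and are not the driver's actual types. The sketch assumes the problem is the usual one with intrusive lists: if a single shared object carried its own link and were added to two device lists, the second insertion would overwrite the link the first list depends on. A separately allocated wrapper node per device avoids that.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a ras block object shared by all devices. */
struct demo_ras_block {
    const char *name;
};

/* Wrapper node that belongs to exactly one device list. */
struct demo_ras_node {
    struct demo_ras_block *block;
    struct demo_ras_node *next;
};

/* Hypothetical per-device state: the head of that device's own list. */
struct demo_device {
    struct demo_ras_node *ras_list;
};

/* Link a shared block into one device's list through a freshly allocated node. */
int demo_register_block(struct demo_device *dev, struct demo_ras_block *blk)
{
    struct demo_ras_node *node;

    assert(dev != NULL && blk != NULL);
    node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->block = blk;
    node->next = dev->ras_list;
    dev->ras_list = node;
    return 0;
}

/* Walk one device's list; it terminates because no node is shared between lists. */
size_t demo_count_blocks(const struct demo_device *dev)
{
    const struct demo_ras_node *node;
    size_t n = 0;

    for (node = dev->ras_list; node != NULL; node = node->next)
        n++;
    return n;
}

/* Free the wrapper nodes only; the shared block objects are not owned by the list. */
void demo_unregister_all(struct demo_device *dev)
{
    while (dev->ras_list != NULL) {
        struct demo_ras_node *node = dev->ras_list;

        dev->ras_list = node->next;
        free(node);
    }
}
```

Registering the same `demo_ras_block` on two `demo_device` instances leaves both lists intact, because each registration allocates its own `demo_ras_node`.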


19 Jan, 2022 · 1 commit

drm/amdgpu: Fix the code style warnings in amdgpu_ras · b6efdb02

Committed by yipechai on Jan 14, 2022

Fix the code style warnings in amdgpu_ras:
1. ERROR: space required before the open parenthesis '('.
2. WARNING: line length of xxx exceeds 100 columns.
3. ERROR: "foo* bar" should be "foo *bar".
4. WARNING: unnecessary whitespace before a quoted newline.
5. WARNING: space prohibited before semicolon.
6. WARNING: suspect code indent for conditional statements.
7. WARNING: braces {} are not necessary for single statement blocks.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


15 Jan, 2022 · 2 commits

drm/amdgpu: Modify the compilation failed problem when other ras blocks' .h include amdgpu_ras.h · 7cab2124

Committed by yipechai on Jan 04, 2022

Fix the compilation failure that occurs when other ras blocks' .h files include amdgpu_ras.h.

v2: squash in forward declaration warning fix (Alex)
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: Unify ras block interface for each ras block · 6492e1b0

Committed by yipechai on Jan 04, 2022

1. Define a unified ops interface for each block.
2. Add a ras_block_match function pointer to the ops interface, so each ras block can customize a special match function to identify itself.
3. Add a new function, amdgpu_ras_block_match_default. If a ras block doesn't define .ras_block_match, amdgpu_ras_block_match_default is executed by default to identify this ras block.
4. Define unified basic ras block data for each ras block.
5. Create a dedicated amdgpu device ras block linked list to manage all of the ras blocks.
6. Add a new function interface, amdgpu_ras_register_ras_block, for each ras block to register itself with the ras controlling block.
Signed-off-by: yipechai <YiPeng.Chai@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
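
The registration and matching scheme listed in this commit message can be illustrated with a small userspace sketch. It is not the driver's code: the `demo_` names, the field layout, and the boolean return convention are assumptions made for the example. Each registered object may supply its own match callback; when it does not, the lookup falls back to a default that compares the block id only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_block_obj;

/* Hypothetical signature of the optional per-block match callback. */
typedef bool (*demo_match_fn)(const struct demo_block_obj *obj,
                              unsigned int block, unsigned int sub_block);

struct demo_block_obj {
    unsigned int block;
    unsigned int sub_block;
    demo_match_fn match;            /* NULL means "use the default match" */
    struct demo_block_obj *next;    /* single list only, for brevity */
};

/* Default rule: compare the block id and ignore the sub-block. */
bool demo_match_default(const struct demo_block_obj *obj,
                        unsigned int block, unsigned int sub_block)
{
    (void)sub_block;
    return obj->block == block;
}

/* Custom rule for blocks that come in several sub-blocks. */
bool demo_match_with_sub_block(const struct demo_block_obj *obj,
                               unsigned int block, unsigned int sub_block)
{
    return obj->block == block && obj->sub_block == sub_block;
}

/* Register a block by pushing it onto the head of the list. */
void demo_register(struct demo_block_obj **head, struct demo_block_obj *obj)
{
    assert(head != NULL && obj != NULL);
    obj->next = *head;
    *head = obj;
}

/* Look up a block, using each object's own callback when it has one. */
struct demo_block_obj *demo_find(struct demo_block_obj *head,
                                 unsigned int block, unsigned int sub_block)
{
    struct demo_block_obj *obj;

    for (obj = head; obj != NULL; obj = obj->next) {
        demo_match_fn fn = obj->match ? obj->match : demo_match_default;

        if (fn(obj, block, sub_block))
            return obj;
    }
    return NULL;
}
```

Keeping the callback optional means simple blocks need no extra code, while blocks with sub-blocks can refine the lookup without changing the caller.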


23 Nov, 2021 · 1 commit

drm/amdgpu: add new query interface for umc block v2 · 8882f90a

Committed by Stanley.Yang on Nov 16, 2021

add message smu to query error information

v2:
    rename message_smu to ecc_info
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


28 Sep, 2021 · 1 commit

drm/amdgpu: set poison supported flag for RAS (v2) · e4348849

Committed by Tao Zhou on Sep 17, 2021

Add RAS poison supported flag and tell PSP RAS TA about the info.

v2: rename poison mode to poison supported, since we can also disable
poison mode even if we support it.
    print value of poison supported if ras feature enablement fails.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


24 Sep, 2021 · 2 commits

drm/amdgpu: Remove all code paths under the EAGAIN path in RAS late init · 9080a18f

Committed by Candice Li on Sep 15, 2021

All code paths under the EAGAIN path in RAS late init are unused.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: Updated RAS infrastructure · 640ae42e

Committed by John Clements on Sep 22, 2021

Update RAS infrastructure to support RAS query for MCA subblocks
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


02 Sep, 2021 · 1 commit

drm/amd/amdgpu: add mpio to ras block · a0a2f7bb

Committed by Candice Li on Aug 27, 2021

Add MPIO to RAS block
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


25 Aug, 2021 · 2 commits

drm/amdgpu: Add driver infrastructure for MCA RAS · 3907c492

Committed by John Clements on Aug 24, 2021

Add MCA specific IP blocks targeting RAS features
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amd/amdgpu: add name field back to ras_common_if · 355e3e4c

Committed by Candice Li on Aug 23, 2021

Adding name field back to ras_common_if to work around error
injection failure with amdgpuras tool.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


17 Aug, 2021 · 2 commits

drm/amd/amdgpu: remove unnecessary RAS context field · 893cf382

Committed by Candice Li on Aug 13, 2021

Delete ras_if->name in the RAS ctx structure and remove related lines.
Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amd/amdgpu: consolidate PSP TA context · 6457205c

Committed by Candice Li on Aug 13, 2021

Signed-off-by: Candice Li <candice.li@amd.com>
Reviewed-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


13 Jul, 2021 · 1 commit

drm/amdgpu: Return error if no RAS · 43a44c53

Committed by Luben Tuikov on Jul 02, 2021

In amdgpu_ras_query_error_count() return an error
if the device doesn't support RAS. This prevents
that function from having to always set the values
of the integer pointers (if set), and thus
prevents function side effects--always to have to
set values of integers if integer pointers set,
regardless of whether RAS is supported or
not--with this change this side effect is
mitigated.

Also, if no pointers are set, don't count, since
we've no way of reporting the counts.

Also, give this function a kernel-doc.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: Tom Rix <trix@redhat.com>
Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
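
The contract described in this commit message can be sketched as below. This is a userspace illustration under stated assumptions, not the driver function: `demo_dev`, `demo_query_error_count`, the fixed four-block layout, and the `DEMO_ERR_NO_RAS` value are all hypothetical. It shows the three behaviours the message names: an error when RAS is unsupported (outputs left untouched), no counting when neither output pointer is set, and otherwise one pass that fills whichever pointers were given.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ERR_NO_RAS (-1)    /* hypothetical error code */
#define DEMO_NUM_BLOCKS 4       /* hypothetical number of blocks */

struct demo_dev {
    bool ras_supported;
    unsigned long ce_per_block[DEMO_NUM_BLOCKS];
    unsigned long ue_per_block[DEMO_NUM_BLOCKS];
};

/*
 * Report correctable (ce) and uncorrectable (ue) error totals.
 * Either output pointer may be NULL if the caller does not need that value.
 */
int demo_query_error_count(const struct demo_dev *dev,
                           unsigned long *ce_count,
                           unsigned long *ue_count)
{
    unsigned long ce = 0, ue = 0;
    size_t i;

    assert(dev != NULL);

    /* No RAS: fail without touching the caller's variables. */
    if (!dev->ras_supported)
        return DEMO_ERR_NO_RAS;

    /* Nowhere to report the counts, so skip the counting work. */
    if (ce_count == NULL && ue_count == NULL)
        return 0;

    for (i = 0; i < DEMO_NUM_BLOCKS; i++) {
        ce += dev->ce_per_block[i];
        ue += dev->ue_per_block[i];
    }

    if (ce_count != NULL)
        *ce_count = ce;
    if (ue_count != NULL)
        *ue_count = ue;
    return 0;
}
```

A caller that checks the return value never has to guess whether its output variables were written.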


09 Jul, 2021 · 1 commit

drm/amdgpu: Return error if no RAS · 4d9f771e

Committed by Luben Tuikov on Jul 02, 2021

In amdgpu_ras_query_error_count() return an error
if the device doesn't support RAS. This prevents
that function from having to always set the values
of the integer pointers (if set), and thus
prevents function side effects--always to have to
set values of integers if integer pointers set,
regardless of whether RAS is supported or
not--with this change this side effect is
mitigated.

Also, if no pointers are set, don't count, since
we've no way of reporting the counts.

Also, give this function a kernel-doc.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Reported-by: Tom Rix <trix@redhat.com>
Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


01 Jul, 2021 · 1 commit

drm/amdgpu: RAS EEPROM table is now in debugfs · c65b0805

Committed by Luben Tuikov on Apr 08, 2021

Add "ras_eeprom_size" file in debugfs, which
reports the maximum size allocated to the RAS
table in EEPROM, as the number of bytes and the
number of records it could store. For instance,

$cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size
262144 bytes or 10921 records
$_

Add "ras_eeprom_table" file in debugfs, which
dumps the RAS table stored in EEPROM, in a formatted
way. For instance,

$cat ras_eeprom_table
 Signature    Version  FirstOffs       Size   Checksum
0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA
Index  Offset ErrType Bank/CU          TimeStamp      Offs/Addr MemChl MCUMCID    RetiredPage
    0 0x00014      ue    0x00 0x00000000607608DC 0x000000000000   0x00    0x00 0x000000000000
    1 0x0002C      ue    0x00 0x00000000607608DC 0x000000001000   0x00    0x00 0x000000000001
    2 0x00044      ue    0x00 0x00000000607608DC 0x000000002000   0x00    0x00 0x000000000002
    3 0x0005C      ue    0x00 0x00000000607608DC 0x000000003000   0x00    0x00 0x000000000003
    4 0x00074      ue    0x00 0x00000000607608DC 0x000000004000   0x00    0x00 0x000000000004
    5 0x0008C      ue    0x00 0x00000000607608DC 0x000000005000   0x00    0x00 0x000000000005
    6 0x000A4      ue    0x00 0x00000000607608DC 0x000000006000   0x00    0x00 0x000000000006
    7 0x000BC      ue    0x00 0x00000000607608DC 0x000000007000   0x00    0x00 0x000000000007
    8 0x000D4      ue    0x00 0x00000000607608DD 0x000000008000   0x00    0x00 0x000000000008
$_

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Cc: Xinhui Pan <xinhui.pan@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
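
The numbers in the two sample outputs above are consistent with a simple layout, which the following sketch works through. The two sizes are inferred from the sample dump rather than taken from the driver source: a 20-byte table header (the FirstOffs value 0x14) and 24-byte records (the gap between consecutive offsets, 0x2C - 0x14). The `DEMO_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes inferred from the sample output, not read from the driver. */
#define DEMO_TABLE_HEADER_SIZE 20u   /* 0x14, the FirstOffs column */
#define DEMO_TABLE_RECORD_SIZE 24u   /* 0x2C - 0x14 between records */

/* How many whole records fit after the header in a table of this size. */
uint32_t demo_max_records(uint32_t table_bytes)
{
    assert(table_bytes >= DEMO_TABLE_HEADER_SIZE);
    return (table_bytes - DEMO_TABLE_HEADER_SIZE) / DEMO_TABLE_RECORD_SIZE;
}

/* Byte offset of record number `index`, counting from zero. */
uint32_t demo_record_offset(uint32_t index)
{
    return DEMO_TABLE_HEADER_SIZE + index * DEMO_TABLE_RECORD_SIZE;
}

/* Bytes in use for a table that currently holds `num_records` records. */
uint32_t demo_table_size(uint32_t num_records)
{
    return DEMO_TABLE_HEADER_SIZE + num_records * DEMO_TABLE_RECORD_SIZE;
}
```

With these assumptions, 262144 bytes gives (262144 - 20) / 24 = 10921 records, record 8 sits at offset 0xD4, and a table of nine records occupies 0xEC bytes, matching the Size column in the dump.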


28 May, 2021 · 2 commits

drm/amdgpu: Use delayed work to collect RAS error counters · 05adfd80

Committed by Luben Tuikov on May 21, 2021

On Context Query2 IOCTL return the correctable and
uncorrectable errors in O(1) fashion, from cached
values, and schedule a delayed work function to
calculate and cache them for the next such IOCTL.

v2: Cancel pending delayed work at ras_fini().
v3: Remove conditionals when dealing with delayed
    work manipulation as they're inherently racy.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
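
The caching pattern in this commit message can be outlined in plain C. The sketch is single-threaded and only models the data flow: the query returns the cached totals in constant time and records that a refresh was requested, and a separate function stands in for the delayed work that does the expensive recount. The names (`demo_counter_cache`, `demo_query_cached`, `demo_refresh_work`) are hypothetical, and a real driver would hand the refresh to the kernel workqueue machinery instead of calling it directly.

```c
#include <assert.h>
#include <stddef.h>

struct demo_counter_cache {
    unsigned long cached_ce;        /* last computed correctable total */
    unsigned long cached_ue;        /* last computed uncorrectable total */
    unsigned int refresh_requests;  /* stands in for scheduling delayed work */
};

/* O(1) query: return the cached totals and ask for a refresh. */
void demo_query_cached(struct demo_counter_cache *cache,
                       unsigned long *ce, unsigned long *ue)
{
    assert(cache != NULL && ce != NULL && ue != NULL);
    *ce = cache->cached_ce;
    *ue = cache->cached_ue;
    cache->refresh_requests++;
}

/* Stand-in for the work function: do the full recount and cache it. */
void demo_refresh_work(struct demo_counter_cache *cache,
                       const unsigned long *ce_per_block,
                       const unsigned long *ue_per_block,
                       size_t nblocks)
{
    unsigned long ce = 0, ue = 0;
    size_t i;

    assert(cache != NULL);
    for (i = 0; i < nblocks; i++) {
        ce += ce_per_block[i];
        ue += ue_per_block[i];
    }
    cache->cached_ce = ce;
    cache->cached_ue = ue;
}
```

The trade-off is that a query sees the totals from the previous refresh, not the current moment, in exchange for never paying the counting cost on the query path.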


drm/amdgpu: Fix RAS function interface · a46751fb

Committed by Luben Tuikov on May 18, 2021

The correctable and uncorrectable errors
are calculated at each invocation of this
function. Therefore, it is highly inefficient to
return just one of them based on a Boolean
input. If the caller wants both, twice the work
would be done. (And this work is O(n^3) on
Vega20.)

Fix this "interface" to simply return what it had
calculated--both values. Let the caller choose
what it wants to record, inspect, use.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


20 May, 2021 · 1 commit

drm/amdgpu: Conditionally reset RAS counters on boot · 8f6368a9

Committed by John Clements on May 17, 2021

Only clear RAS error counters if persistent EDC harvesting is not supported
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


11 May, 2021 · 3 commits

drm/amdgpu: Rename to ras_*_enabled · 8ab0d6f0

Committed by Luben Tuikov on May 04, 2021

Rename,
  ras_hw_supported --> ras_hw_enabled, and
  ras_features     --> ras_enabled,
to show that ras_enabled is a subset of
ras_hw_enabled, which itself is a subset
of the ASIC capability.

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
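
The subset relationship stated in this commit message maps naturally onto bit masks, one bit per ras block. The sketch below shows one way such masks could be derived so that the relationship holds by construction. Only the field names `ras_hw_enabled` and `ras_enabled` come from the message; the struct, the function names, and the idea that the inputs are a hardware mask and a requested mask are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when every bit set in `a` is also set in `b`. */
bool demo_is_subset(uint32_t a, uint32_t b)
{
    return (a & ~b) == 0;
}

struct demo_ras_masks {
    uint32_t asic_capability;   /* what the ASIC could do */
    uint32_t ras_hw_enabled;    /* what the hardware has switched on */
    uint32_t ras_enabled;       /* what the driver actually uses */
};

/* Derive each mask by ANDing with the one above it. */
struct demo_ras_masks demo_derive_masks(uint32_t asic_capability,
                                        uint32_t hw_mask,
                                        uint32_t requested)
{
    struct demo_ras_masks m;

    m.asic_capability = asic_capability;
    m.ras_hw_enabled = asic_capability & hw_mask;
    m.ras_enabled = m.ras_hw_enabled & requested;

    /* The subset chain holds no matter what the inputs were. */
    assert(demo_is_subset(m.ras_hw_enabled, m.asic_capability));
    assert(demo_is_subset(m.ras_enabled, m.ras_hw_enabled));
    return m;
}
```

Because each level is an AND with the level above, a bit can never appear in `ras_enabled` without also being present in `ras_hw_enabled`.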


drm/amdgpu: Move up ras_hw_supported · e509965e

Committed by Luben Tuikov on May 03, 2021

Move ras_hw_supported into struct amdgpu_dev.
The dependency is:
struct amdgpu_ras <== struct amdgpu_dev <== ASIC,
read as "struct amdgpu_ras depends on struct
amdgpu_dev, which depends on the hardware."

This can be loosely understood as, "if RAS is
supported, which is a property of the ASIC (struct
amdgpu_dev), then we can access struct
amdgpu_ras."

v2: Fix a typo: must binary AND in ternary cond
    in amdgpu_ras.c

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: Remove redundant ras->supported · acdae216

Committed by Luben Tuikov on May 03, 2021

Remove redundant ras->supported, as this value
is also stored in adev->ras_features.

Use adev->ras_features, as that supersedes "ras",
since the latter is its member.

The dependency goes like this:
ras <== adev->ras_features <== hw_supported,
and is read as "ras depends on ras_features, which
depends on hw_supported." The arrows show the flow
of information, i.e. the dependency update.

"hw_supported" should also live in "adev".

Cc: Alexander Deucher <Alexander.Deucher@amd.com>
Cc: John Clements <john.clements@amd.com>
Cc: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: John Clements <John.Clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


24 Mar, 2021 · 2 commits

drm/amdgpu: fix send ras disable cmd when asic not support ras · 970fd197

Committed by Stanley.Yang on Mar 10, 2021

    cause:
	A ras disable command must be sent to ras-ta during gfx
	block ras late init, because on vega20 gaming the ras capability
	read from vbios is disabled. However, the ras context is released
	during the ras init process, so sending the ras disable command
	to ras-ta fails.
    how:
	Delay releasing the ras context; it is released
	after gfx block late init is done.

Changed from V1:
    move release_ras_context into ras_resume

Changed from V2:
    checking BIT(UMC) before accessing the eeprom table is more reasonable
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: harvest edc status when connected to host via xGMI · 761d86d3

Committed by Dennis Li on Feb 04, 2021

When connected to a host via xGMI, system fatal errors may trigger
warm reset, so the driver has no chance to query edc status before reset.
Therefore in this case, the driver should harvest the previous error logging
registers during boot, instead of only resetting them.

v2:
1. IP's ras_manager object is created when its ras feature is enabled,
so change to query edc status after amdgpu_ras_late_init called

2. change to enable watchdog timer after finishing gfx edc init
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


27 Feb, 2021 · 1 commit

drm/amdgpu: remove unnecessary reading for epprom header · 11003c68

Committed by Dennis Li on Feb 26, 2021

If the number of bad page records exceeds the threshold, the driver has
already updated both the eeprom header and control->tbl_hdr.header before gpu reset,
so the GPU recovery thread does not need to read the eeprom header directly.

v2: merge amdgpu_ras_check_err_threshold into amdgpu_ras_eeprom_check_err_threshold
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


19 Feb, 2021 · 1 commit

drm/amdgpu: do not keep debugfs dentry · 88293c03

Committed by Nirmoy Das on Feb 10, 2021

Cleanup unnecessary debugfs dentries and surrounding functions.

v3: remove return value check for debugfs_create_file()
v2: remove ttm_debugfs_entries array.
    do not init variables.
Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


09 Dec, 2020 · 2 commits

drm/amdgpu: fix debugfs creation/removal, again · 2343e9d2

Committed by Arnd Bergmann on Dec 04, 2020

There is still a warning when CONFIG_DEBUG_FS is disabled:

drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1145:13: error: 'amdgpu_ras_debugfs_create_ctrl_node' defined but not used [-Werror=unused-function]
1145 | static void amdgpu_ras_debugfs_create_ctrl_node(struct amdgpu_device *adev)

Change the code again to make the compiler actually drop
this code but not warn about it.

Fixes: ae2bf61f ("drm/amdgpu: guard ras debugfs creation/removal based on CONFIG_DEBUG_FS")
Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
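
The commit message does not show the change itself, so the following is only a general illustration of one common way to get the effect it describes, "make the compiler actually drop this code but not warn about it". Instead of wrapping the call site in `#if`, the call is guarded by a plain `if` on a compile-time constant. The static helper is then always referenced, so `-Wunused-function` stays quiet, while an optimizing compiler can still discard the dead branch. `DEMO_DEBUG_FS` and the function names are hypothetical stand-ins; in kernel code the analogous test is usually written with the `IS_ENABLED()` macro.

```c
#include <assert.h>

/* Stand-in for a Kconfig option; build with -DDEMO_DEBUG_FS=1 to enable it. */
#ifndef DEMO_DEBUG_FS
#define DEMO_DEBUG_FS 0
#endif

static int demo_nodes_created;

/*
 * Static helper. If its only caller sat inside an #if block that was
 * compiled out, the compiler would report it as defined but not used.
 */
static void demo_debugfs_create_ctrl_node(void)
{
    demo_nodes_created++;
}

/* Returns how many nodes exist after initialization. */
int demo_fs_init(void)
{
    assert(DEMO_DEBUG_FS == 0 || DEMO_DEBUG_FS == 1);

    /*
     * A plain `if` on a constant: the call is always parsed and
     * type-checked, so the helper counts as used in every configuration.
     */
    if (DEMO_DEBUG_FS)
        demo_debugfs_create_ctrl_node();

    return demo_nodes_created;
}
```

A side benefit of this style is that the guarded code is compiled in every configuration, so it cannot silently rot when the option is off.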


drm/amdgpu: fix debugfs creation/removal, again · cedf7884

Committed by Arnd Bergmann on Dec 04, 2020

There is still a warning when CONFIG_DEBUG_FS is disabled:

Change the code again to make the compiler actually drop
this code but not warn about it.


30 Oct, 2020 · 3 commits

drm/amdgpu: fix the issue of reserving bad pages failed · 676deb38

Committed by Dennis Li on Oct 22, 2020

In amdgpu_ras_reset_gpu, because bad pages may not have been freed,
reserving them has a high probability of failing.

Change to reserve bad pages when freeing VRAM.

v2:
1. avoid allocating the drm_mm node outside of amdgpu_vram_mgr.c
2. move bad page reserving into amdgpu_ras_add_bad_pages; if the vram mgr
   fails to reserve a bad page, it puts the page into the pending list,
   otherwise into the processed list;
3. remove amdgpu_ras_release_bad_pages, because retired page's info has
   been moved into amdgpu_vram_mgr

v3:
1. format code style;
2. rename amdgpu_vram_reserve_scope as amdgpu_vram_reservation;
3. rename scope_pending as reservations_pending;
4. rename scope_processed as reserved_pages;
5. change to iterate over all the pending ones and try to insert them
   with drm_mm_reserve_node();

v4:
1. rename amdgpu_vram_mgr_reserve_scope as
amdgpu_vram_mgr_reserve_range;
2. remove unused include "amdgpu_ras.h";
3. rename amdgpu_vram_mgr_check_and_reserve as
amdgpu_vram_mgr_do_reserve;
4. refine amdgpu_vram_mgr_reserve_range to call
amdgpu_vram_mgr_do_reserve.
Reviewed-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: remove redundant GPU reset · 5eeb4593

Committed by Dennis Li on Oct 19, 2020

Bad page saving has been moved to the UMC error interrupt callback,
which triggers a new GPU reset after saving, so this GPU reset is redundant.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: change to save bad pages in UMC error interrupt callback · 22503d80

Committed by Dennis Li on Oct 19, 2020

Save bad pages in the UMC error interrupt callback instead of in
amdgpu_ras_reset_gpu; this reduces unnecessary calls to amdgpu_ras_save_bad_pages.
Signed-off-by: Dennis Li <Dennis.Li@amd.com>
Reviewed-by: Hawking Zhang <hawking.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


15 Aug, 2020 · 1 commit

drm/amdgpu: bypass querying ras error count registers · f75e94d8

Committed by Guchun Chen on Aug 04, 2020

Once ras recovery is issued by ras sync flood interrupt or
ras controller interrupt, add this guard to bypass or execute
ras error count register harvest of all IPs.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: Dennis Li <Dennis.Li@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


05 Aug, 2020 · 3 commits

drm/amdgpu: break GPU recovery once it's in bad state(v4) · e8fbaf03

Committed by Guchun Chen on Jul 23, 2020

When the GPU executes recovery and retrieves a bad GPU tag
from the external eeprom device, the recovery is aborted
and an error message is printed for the user's awareness.

v2: Refine warning message in threshold reaching case, and
    fix spelling typo.

v3: Fix explicit calling of bad gpu.

v4: Rename function names.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: skip bad page reservation once issuing from eeprom write · 35cd2cda

Committed by Guchun Chen on Jul 23, 2020

Once ras recovery is issued from the eeprom write itself,
bad page reservation should be skipped; otherwise, writing
to eeprom would be called recursively.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


drm/amdgpu: validate bad page threshold in ras(v3) · c84d4670

Committed by Guchun Chen on Jul 22, 2020

The bad page threshold value must lie in the range between
-1 and the max records length of eeprom. It determines when
saved bad pages exceed the threshold value, so that the corresponding
actions can be taken.

v2: When using the default typical value, it should be min
value between typical value and eeprom max records length.

v3: drop the case of setting bad_page_cnt_threshold to be
    0xFFFFFFFF, as it confuses user.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
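
The validation rule in this commit message can be written out as a short function. This is a sketch under assumptions, not the driver's implementation: the names are hypothetical, and it assumes that -1 is the value that selects the default, which the message implies but does not state outright. The v2 note is reflected in the default branch, where the typical value is capped at the eeprom's maximum record count.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ERR_RANGE (-1)   /* hypothetical error code */

/*
 * Turn a requested threshold into an effective one.
 *   requested == -1             : use the default (assumed meaning of -1)
 *   0 <= requested <= max       : use the requested value as-is
 *   anything else               : reject, leaving *effective untouched
 */
int demo_resolve_threshold(long requested,
                           unsigned long max_records,
                           unsigned long typical,
                           unsigned long *effective)
{
    assert(effective != NULL);

    if (requested < -1)
        return DEMO_ERR_RANGE;
    if (requested >= 0 && (unsigned long)requested > max_records)
        return DEMO_ERR_RANGE;

    if (requested == -1)
        /* Default: the typical value, but never more than the eeprom holds. */
        *effective = typical < max_records ? typical : max_records;
    else
        *effective = (unsigned long)requested;

    return 0;
}
```

Capping the default matters on parts whose eeprom holds fewer records than the typical value, since a threshold above the capacity could never be reached.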


16 Jul, 2020 · 1 commit

drm/amdgpu: RAS emergency restart logic refine · bb5c7235

Committed by Wenhui Sheng on Jul 13, 2020

If we are in a RAS-triggered situation and
BACO isn't supported, emergency restart is needed,
and this code is only needed for some specific
cases (vega20 with a given smu fw version).

After we add smu mode1 reset for sienna cichlid, we
need to share AMD_RESET_METHOD_MODE1 with psp mode1 reset,
so in amdgpu_device_gpu_recover, we need differentiate
which mode1 reset we are using, then decide if it's
a full reset and then decide if emergency restart is needed,
the logic will become much more complex.

After discussion with Hawking, move emergency restart logic
to an independent function.
Signed-off-by: Likun Gao <Likun.Gao@amd.com>
Signed-off-by: Wenhui Sheng <Wenhui.Sheng@amd.com>
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


02 Apr, 2020 · 1 commit

drm/amdgpu: disable ras query and iject during gpu reset · 61380faa

Committed by John Clements on Mar 25, 2020

added flag to ras context to indicate if ras query functionality is ready
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
Signed-off-by: John Clements <john.clements@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


11 Mar, 2020 · 1 commit

drm/amdgpu: add function to creat all ras debugfs node · f9317014

Committed by Tao Zhou on Mar 06, 2020

centralize all debugfs creation in one place for ras

this is required to fix ras when the driver does not use the drm load
and unload callbacks due to ordering issues with the drm device node.
Signed-off-by: Tao Zhou <tao.zhou1@amd.com>
Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


19 Dec, 2019 · 1 commit

drm/amdgpu: drop useless BACO arg in amdgpu_ras_reset_gpu · 61934624

Committed by Guchun Chen on Dec 13, 2019

The BACO reset mode strategy is determined by a later function when
amdgpu_ras_reset_gpu is called. To avoid confusing readers, drop
the argument.
Signed-off-by: Guchun Chen <guchun.chen@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>


openeuler / Kernel · synced successfully more than 1 year ago