1. 15 Feb, 2022 1 commit
  2. 08 Feb, 2022 1 commit
    • drm/amdgpu: Fixed the defect of soft lock caused by infinite loop · d5e8ff5f
      yipechai committed
      1. The infinite loop only occurs when multiple cards support
         RAS functions.
      2. For an explanation of the root cause, refer to commit 76641cbbf196
         ("drm/amdgpu: Add judgement to avoid infinite loop").
      3. Create a new node to manage each unique RAS instance, which
         guarantees that each device's .ras_list is completely independent.
      4. Fixes: commit 7a6b8ab3231b51 ("drm/amdgpu: Unify ras block
         interface for each ras block").
      5. The soft lockup logs are as follows:
      [  262.165690] CPU: 93 PID: 758 Comm: kworker/93:1 Tainted: G           OE     5.13.0-27-generic #29~20.04.1-Ubuntu
      [  262.165695] Hardware name: Supermicro AS -4124GS-TNR/H12DSG-O-CPU, BIOS T20200717143848 07/17/2020
      [  262.165698] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [  262.165980] RIP: 0010:amdgpu_ras_get_ras_block+0x86/0xd0 [amdgpu]
      [  262.166239] Code: 68 d8 4c 8d 71 d8 48 39 c3 74 54 49 8b 45 38 48 85 c0 74 32 44 89 fa 44 89 e6 4c 89 ef e8 82 e4 9b dc 85 c0 74 3c 49 8b 46 28 <49> 8d 56 28 4d 89 f5 48 83 e8 28 48 39 d3 74 25 49 89 c6 49 8b 45
      [  262.166243] RSP: 0018:ffffac908fa87d80 EFLAGS: 00000202
      [  262.166247] RAX: ffffffffc1394248 RBX: ffff91e4ab8d6e20 RCX: ffffffffc1394248
      [  262.166249] RDX: ffff91e4aa356e20 RSI: 000000000000000e RDI: ffff91e4ab8c0000
      [  262.166252] RBP: ffffac908fa87da8 R08: 0000000000000007 R09: 0000000000000001
      [  262.166254] R10: ffff91e4930b64ec R11: 0000000000000000 R12: 000000000000000e
      [  262.166256] R13: ffff91e4aa356df8 R14: ffffffffc1394320 R15: 0000000000000003
      [  262.166258] FS:  0000000000000000(0000) GS:ffff92238fb40000(0000) knlGS:0000000000000000
      [  262.166261] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  262.166264] CR2: 00000001004865d0 CR3: 000000406d796000 CR4: 0000000000350ee0
      [  262.166267] Call Trace:
      [  262.166272]  amdgpu_ras_do_recovery+0x130/0x290 [amdgpu]
      [  262.166529]  ? psi_task_switch+0xd2/0x250
      [  262.166537]  ? __switch_to+0x11d/0x460
      [  262.166542]  ? __switch_to_asm+0x36/0x70
      [  262.166549]  process_one_work+0x220/0x3c0
      [  262.166556]  worker_thread+0x4d/0x3f0
      [  262.166560]  ? process_one_work+0x3c0/0x3c0
      [  262.166563]  kthread+0x12b/0x150
      [  262.166568]  ? set_kthread_struct+0x40/0x40
      [  262.166571]  ret_from_fork+0x22/0x30
      Signed-off-by: yipechai <YiPeng.Chai@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
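Item 3 of the commit above can be illustrated with a small userspace sketch. It assumes the loop arises when a single list node embedded in a driver-wide object is linked into more than one device's list; the referenced commit 76641cbbf196 remains the authoritative explanation. All names here (`ras_block`, `ras_block_node`, `device_register_block`, and so on) are hypothetical and are not the amdgpu driver's own; the list helpers only imitate the kernel's `list_head`.

```c
#include <stdlib.h>

/* Minimal circular doubly linked list, modeled on the kernel's list_head. */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Hypothetical shared object: one instance per RAS block type, driver-wide. */
struct ras_block { int id; };

/* Hypothetical per-device wrapper: it owns the list linkage, so the shared
 * object itself is never linked into two lists at once. */
struct ras_block_node {
    struct list_head node;      /* must stay the first member (see cast below) */
    struct ras_block *obj;
};

struct device { struct list_head ras_list; };

/* Register a shared block on one device by allocating a private node. */
static int device_register_block(struct device *dev, struct ras_block *blk)
{
    struct ras_block_node *n = malloc(sizeof(*n));

    if (!n)
        return -1;
    n->obj = blk;
    list_add_tail(&n->node, &dev->ras_list);
    return 0;
}

/* Walk one device's list; the walk ends at that device's own list head
 * because every node belongs to exactly one list. */
static struct ras_block *device_find_block(const struct device *dev, int id)
{
    const struct list_head *p;

    for (p = dev->ras_list.next; p != &dev->ras_list; p = p->next) {
        /* Valid because node is the first member of ras_block_node. */
        const struct ras_block_node *n = (const struct ras_block_node *)p;

        if (n->obj->id == id)
            return n->obj;
    }
    return NULL;
}

/* Count the blocks registered on one device. */
static int device_count_blocks(const struct device *dev)
{
    const struct list_head *p;
    int count = 0;

    for (p = dev->ras_list.next; p != &dev->ras_list; p = p->next)
        count++;
    return count;
}
```

Registering the same `ras_block` on two devices allocates two separate nodes, so a lookup on one device never wanders into the other device's list. Freeing the nodes on teardown is omitted to keep the sketch short.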
  3. 19 Jan, 2022 1 commit
  4. 15 Jan, 2022 2 commits
  5. 23 Nov, 2021 1 commit
  6. 28 Sep, 2021 1 commit
  7. 24 Sep, 2021 2 commits
  8. 02 Sep, 2021 1 commit
  9. 25 Aug, 2021 2 commits
  10. 17 Aug, 2021 2 commits
  11. 13 Jul, 2021 1 commit
    • drm/amdgpu: Return error if no RAS · 43a44c53
      Luben Tuikov committed
      In amdgpu_ras_query_error_count(), return an error
      if the device doesn't support RAS. The function
      then no longer has to set the values behind the
      integer pointers (if set) regardless of whether
      RAS is supported or not, which mitigates that
      side effect.
      
      Also, if no pointers are set, don't count, since
      we've no way of reporting the counts.
      
      Also, give this function a kernel-doc.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Reported-by: Tom Rix <trix@redhat.com>
      Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
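The behavior described in the commit above can be sketched in plain C. This is a sketch under stated assumptions, not the driver's code: `fake_dev` and `query_error_count` are hypothetical stand-ins, and `-EOPNOTSUPP` is assumed here as the error code since the commit message does not name one.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for device state; not the real amdgpu_device. */
struct fake_dev {
    bool ras_supported;
    unsigned long ce;   /* correctable errors seen so far */
    unsigned long ue;   /* uncorrectable errors seen so far */
};

/*
 * Report error counts through the optional output pointers.
 * Returns 0 on success, or a negative errno if RAS is unsupported
 * (the specific code is an assumption of this sketch).
 */
static int query_error_count(const struct fake_dev *dev,
                             unsigned long *ce_count,
                             unsigned long *ue_count)
{
    /* No RAS: fail early and leave the caller's variables untouched. */
    if (!dev->ras_supported)
        return -EOPNOTSUPP;

    /* No output pointers: there is no way to report, so skip counting. */
    if (!ce_count && !ue_count)
        return 0;

    if (ce_count)
        *ce_count = dev->ce;
    if (ue_count)
        *ue_count = dev->ue;
    return 0;
}
```

A caller checks the return value before trusting the outputs, instead of relying on the function to zero them when RAS is absent.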
  12. 09 Jul, 2021 1 commit
    • drm/amdgpu: Return error if no RAS · 4d9f771e
      Luben Tuikov committed
      In amdgpu_ras_query_error_count(), return an error
      if the device doesn't support RAS. The function
      then no longer has to set the values behind the
      integer pointers (if set) regardless of whether
      RAS is supported or not, which mitigates that
      side effect.
      
      Also, if no pointers are set, don't count, since
      we've no way of reporting the counts.
      
      Also, give this function a kernel-doc.
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Reported-by: Tom Rix <trix@redhat.com>
      Fixes: a46751fb ("drm/amdgpu: Fix RAS function interface")
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Reviewed-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  13. 01 Jul, 2021 1 commit
    • drm/amdgpu: RAS EEPROM table is now in debugfs · c65b0805
      Luben Tuikov committed
      Add "ras_eeprom_size" file in debugfs, which
      reports the maximum size allocated to the RAS
      table in EEPROM, as the number of bytes and the
      number of records it could store. For instance,
      
      $cat /sys/kernel/debug/dri/0/ras/ras_eeprom_size
      262144 bytes or 10921 records
      $_
      
      Add "ras_eeprom_table" file in debugfs, which
      dumps the RAS table stored in EEPROM, in a formatted
      way. For instance,
      
      $cat ras_eeprom_table
       Signature    Version  FirstOffs       Size   Checksum
      0x414D4452 0x00010000 0x00000014 0x000000EC 0x000000DA
      Index  Offset ErrType Bank/CU          TimeStamp      Offs/Addr MemChl MCUMCID    RetiredPage
          0 0x00014      ue    0x00 0x00000000607608DC 0x000000000000   0x00    0x00 0x000000000000
          1 0x0002C      ue    0x00 0x00000000607608DC 0x000000001000   0x00    0x00 0x000000000001
          2 0x00044      ue    0x00 0x00000000607608DC 0x000000002000   0x00    0x00 0x000000000002
          3 0x0005C      ue    0x00 0x00000000607608DC 0x000000003000   0x00    0x00 0x000000000003
          4 0x00074      ue    0x00 0x00000000607608DC 0x000000004000   0x00    0x00 0x000000000004
          5 0x0008C      ue    0x00 0x00000000607608DC 0x000000005000   0x00    0x00 0x000000000005
          6 0x000A4      ue    0x00 0x00000000607608DC 0x000000006000   0x00    0x00 0x000000000006
          7 0x000BC      ue    0x00 0x00000000607608DC 0x000000007000   0x00    0x00 0x000000000007
          8 0x000D4      ue    0x00 0x00000000607608DD 0x000000008000   0x00    0x00 0x000000000008
      $_
      
      Cc: Alexander Deucher <Alexander.Deucher@amd.com>
      Cc: Andrey Grodzovsky <Andrey.Grodzovsky@amd.com>
      Cc: John Clements <john.clements@amd.com>
      Cc: Hawking Zhang <Hawking.Zhang@amd.com>
      Cc: Xinhui Pan <xinhui.pan@amd.com>
      Signed-off-by: Luben Tuikov <luben.tuikov@amd.com>
      Acked-by: Alexander Deucher <Alexander.Deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
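The numbers in the two sample outputs above are mutually consistent, which a short calculation can show. The sizes below are inferred from the sample dump only (a 0x14-byte header from FirstOffs, and 0x18 bytes per record from the spacing of the Offset column); they are assumptions of this sketch, not constants taken from the driver, and the function names are hypothetical.

```c
#include <stdint.h>

/* Sizes inferred from the sample dump; assumptions, not driver constants. */
enum {
    TABLE_HEADER_SIZE = 0x14,   /* FirstOffs: where record 0 starts */
    TABLE_RECORD_SIZE = 0x18,   /* 0x0002C - 0x00014, one record's stride */
};

/* How many whole records fit in a table of table_bytes bytes. */
static uint32_t max_records(uint32_t table_bytes)
{
    if (table_bytes < TABLE_HEADER_SIZE)
        return 0;
    return (table_bytes - TABLE_HEADER_SIZE) / TABLE_RECORD_SIZE;
}

/* Byte offset of record number index, as shown in the Offset column. */
static uint32_t record_offset(uint32_t index)
{
    return TABLE_HEADER_SIZE + index * TABLE_RECORD_SIZE;
}

/* Bytes in use for a table holding num_records records (the Size field). */
static uint32_t table_size(uint32_t num_records)
{
    return TABLE_HEADER_SIZE + num_records * TABLE_RECORD_SIZE;
}
```

With these assumed sizes, `max_records(262144)` gives 10921, matching the "262144 bytes or 10921 records" line; `record_offset(8)` gives 0xD4, matching the last row; and `table_size(9)` gives 0xEC, matching the Size field for the nine records shown.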
  14. 28 May, 2021 2 commits
  15. 20 May, 2021 1 commit
  16. 11 May, 2021 3 commits
  17. 24 Mar, 2021 2 commits
  18. 27 Feb, 2021 1 commit
  19. 19 Feb, 2021 1 commit
  20. 09 Dec, 2020 2 commits
  21. 30 Oct, 2020 3 commits
  22. 15 Aug, 2020 1 commit
  23. 05 Aug, 2020 3 commits
  24. 16 Jul, 2020 1 commit
  25. 02 Apr, 2020 1 commit
  26. 11 Mar, 2020 1 commit
  27. 19 Dec, 2019 1 commit