1. 09 Dec, 2020 1 commit
  2. 22 Oct, 2020 1 commit
  3. 26 Sep, 2020 3 commits
  4. 23 Sep, 2020 1 commit
    • drm/amdgpu: update athub interrupt harvesting handle · 3f975d0f
      Committed by Stanley.Yang
      A GCEA/MMHUB EA error should not result in a DF freeze. This is
      fixed in the next generation, but in the previous generation a
      GCEA/MMHUB EA error does, for some reason, result in a DF freeze.
      The driver should read the GCEA/MMHUB error status registers so that
      it avoids reporting a GCEA/MMHUB EA error as a hw fatal error in the
      kernel message.
      
      Changed from V1:
          make the query_ras_error_status function more general
          make reading the mmhub error status register more friendly
      
      Changed from V2:
          move the ras error status query function into the do_recovery workqueue
      
      Changed from V3:
          remove useless code from V2, print the GCEA error status
          instance number
      Signed-off-by: Stanley.Yang <Stanley.Yang@amd.com>
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
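The idea in commit 3f975d0f (consult per-instance error status before calling an interrupt fatal, and report the instance number) can be sketched in plain C. This is only an illustration under stated assumptions: the names `err_class`, `first_error_instance` and `classify_interrupt` are hypothetical, the status words stand in for register reads, and the rule "nonzero status means an EA error" is a simplification, not the driver's real register layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical classification; not the driver's real types. */
enum err_class { ERR_EA_NONFATAL, ERR_FATAL };

/*
 * Scan per-instance status words (stand-ins for register reads).
 * Returns the index of the first instance with a nonzero status,
 * or -1 if every instance reads back clean.
 */
static int first_error_instance(const uint32_t *status, size_t n)
{
    assert(status != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (status[i] != 0)
            return (int)i;
    }
    return -1;
}

/*
 * Assumption for this sketch: if some instance latched an EA error,
 * the event is reported as non-fatal; otherwise it is treated as fatal.
 * The instance index is handed back so the caller can log it.
 */
static enum err_class classify_interrupt(const uint32_t *status, size_t n,
                                         int *instance)
{
    *instance = first_error_instance(status, n);
    return *instance >= 0 ? ERR_EA_NONFATAL : ERR_FATAL;
}
```

Returning the instance index alongside the classification mirrors the V3 changelog item about printing the error status instance number.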
  5. 27 Aug, 2020 1 commit
  6. 25 Aug, 2020 3 commits
  7. 19 Aug, 2020 1 commit
    • drm/amdgpu: fix NULL pointer access issue when unloading driver · 1a68d96f
      Committed by Guchun Chen
      When unloading the driver with "modprobe -r amdgpu", a NULL pointer
      dereference occurs while releasing the ras debugfs. The cause is a
      duplicated debugfs_remove, as the drm debugfs_root dir has already
      been cleaned up by drm_minor_unregister.
      
      BUG: kernel NULL pointer dereference, address: 00000000000000a0
      PGD 0 P4D 0
      Oops: 0002 [#1] SMP PTI
      CPU: 11 PID: 1526 Comm: modprobe Tainted: G           OE     5.6.0-guchchen #1
      Hardware name: System manufacturer System Product Name/TUF Z370-PLUS GAMING II, BIOS 0411 09/21/2018
      RIP: 0010:down_write+0x15/0x40
      Code: eb de e8 7e 17 72 ff cc cc cc cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 53 48 89 fb e8 92
      d8 ff ff 31 c0 ba 01 00 00 00 <f0> 48 0f b1 13 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89 43 08 5b c3
      RSP: 0018:ffffb1590386fcd0 EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 00000000000000a0 RCX: 0000000000000000
      RDX: 0000000000000001 RSI: ffffffff85b2fcc2 RDI: 00000000000000a0
      RBP: ffffb1590386fd30 R08: ffffffff85b2fcc2 R09: 000000000002b3c0
      R10: ffff97a330618c40 R11: 00000000000005f6 R12: ffff97a3481beb40
      R13: 00000000000000a0 R14: ffff97a3481beb40 R15: 0000000000000000
      FS:  00007fb11a717540(0000) GS:ffff97a376cc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00000000000000a0 CR3: 00000004066d6006 CR4: 00000000003606e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       simple_recursive_removal+0x63/0x370
       ? debugfs_remove+0x60/0x60
       debugfs_remove+0x40/0x60
       amdgpu_ras_fini+0x82/0x230 [amdgpu]
       ? __kernfs_remove.part.17+0x101/0x1f0
       ? kernfs_name_hash+0x12/0x80
       amdgpu_device_fini+0x1c0/0x580 [amdgpu]
       amdgpu_driver_unload_kms+0x3e/0x70 [amdgpu]
       amdgpu_pci_remove+0x36/0x60 [amdgpu]
       pci_device_remove+0x3b/0xb0
       device_release_driver_internal+0xe5/0x1c0
       driver_detach+0x46/0x90
       bus_remove_driver+0x58/0xd0
       pci_unregister_driver+0x29/0x90
       amdgpu_exit+0x11/0x25 [amdgpu]
       __x64_sys_delete_module+0x13d/0x210
       do_syscall_64+0x5f/0x250
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Guchun Chen <guchun.chen@amd.com>
      Reviewed-by: Tao Zhou <tao.zhou1@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
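The double-removal problem described in commit 1a68d96f can be modelled in user space. The sketch below is not the kernel patch: `dir_model`, `component`, `owner_unregister` and `component_fini` are hypothetical names, and clearing the cached handle is just one way to make a second removal a no-op once the owner has torn down the whole tree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory entry owned by a parent tree. */
struct dir_model {
    bool removed; /* set once the entry has been torn down */
};

/* A component that caches a handle to its directory inside the tree. */
struct component {
    struct dir_model *dir;
};

/*
 * Owner-side teardown: removes the root and everything beneath it,
 * then clears the component's cached handle so it cannot be reused.
 */
static void owner_unregister(struct dir_model *root, struct component *c)
{
    assert(root != NULL && c != NULL);
    root->removed = true;
    if (c->dir != NULL) {
        c->dir->removed = true;
        c->dir = NULL;
    }
}

/*
 * Component-side teardown: removes its directory only if it still
 * holds a handle. Returns true if it performed a removal.
 */
static bool component_fini(struct component *c)
{
    assert(c != NULL);
    if (c->dir == NULL)
        return false; /* already cleaned up by the owner */
    c->dir->removed = true;
    c->dir = NULL;
    return true;
}
```

With this ordering rule, calling `component_fini` after `owner_unregister` does nothing, while calling it first still performs exactly one removal.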
  8. 15 Aug, 2020 6 commits
  9. 07 Aug, 2020 1 commit
  10. 05 Aug, 2020 9 commits
  11. 28 Jul, 2020 1 commit
    • drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Committed by Dennis Li
      When the GPU hangs, the driver has multiple paths into
      amdgpu_device_gpu_recover; the atomics adev->in_gpu_reset and
      hive->in_reset are used to avoid re-entering GPU recovery.
      
      During GPU reset and resume, it is unsafe for other threads to access
      the GPU, as this may cause the GPU reset to fail. Therefore a new
      rw_semaphore, adev->reset_sem, is introduced, which protects the GPU
      from being accessed by external threads during recovery.
      
      v2:
      1. add rwlock for some ioctls, debugfs and the file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in the
      kfd driver.
      3. remove try_lock and change adev->in_gpu_reset to atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. change back to using adev->reset_sem to protect kfd callback
      functions, because dqm_lock couldn't protect all of the code; for
      example, free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove commented-out code in amdgpu_device.c
      3. add a more detailed comment in the commit message
      4. define a wrapper function amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tukov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  12. 23 Jul, 2020 1 commit
  13. 16 Jul, 2020 1 commit
  14. 01 Jul, 2020 1 commit
  15. 03 Jun, 2020 2 commits
  16. 29 May, 2020 1 commit
  17. 15 May, 2020 1 commit
  18. 06 May, 2020 1 commit
  19. 01 May, 2020 1 commit
  20. 23 Apr, 2020 2 commits
  21. 14 Apr, 2020 1 commit