1. 25 Aug, 2020 6 commits
    • D
      drm/amdgpu: annotate a false positive recursive locking · 08ebb485
      Committed by Dennis Li
      Re-apply commit 72e14ebf
      
      [  584.110304] ============================================
      [  584.110590] WARNING: possible recursive locking detected
      [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
      [  584.111164] --------------------------------------------
      [  584.111456] kworker/38:1/553 is trying to acquire lock:
      [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.112112]
                     but task is already holding lock:
      [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.113068]
                     other info that might help us debug this:
      [  584.113689]  Possible unsafe locking scenario:
      
      [  584.114350]        CPU0
      [  584.114685]        ----
      [  584.115014]   lock(&adev->reset_sem);
      [  584.115349]   lock(&adev->reset_sem);
      [  584.115678]
                      *** DEADLOCK ***
      
      [  584.116624]  May be due to missing lock nesting notation
      
      [  584.117284] 4 locks held by kworker/38:1/553:
      [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
      [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
      [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
      [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.119222]
                     stack backtrace:
      [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
      [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [  584.121638] Call Trace:
      [  584.122050]  dump_stack+0x98/0xd5
      [  584.122499]  __lock_acquire+0x1139/0x16e0
      [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
      [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
      [  584.123771]  lock_acquire+0xb8/0x1c0
      [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.124599]  down_write+0x49/0x120
      [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
      [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
      [  584.126789]  process_one_work+0x29e/0x630
      [  584.127208]  worker_thread+0x3c/0x3f0
      [  584.127621]  ? __kthread_parkme+0x61/0x90
      [  584.128014]  kthread+0x12f/0x150
      [  584.128402]  ? process_one_work+0x630/0x630
      [  584.128790]  ? kthread_park+0x90/0x90
      [  584.129174]  ret_from_fork+0x3a/0x50
      
      Each adev has its own lock_class_key to avoid the false positive
      recursive locking report.
      
      v2:
      1. Register adev->lock_key with lockdep; otherwise lockdep
      reports the warning below:
      
      [ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
      [ 1216.705924] ------------[ cut here ]------------
      [ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
      [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
      
      v3:
      Use down_write_nest_lock to annotate the false deadlock
      warning.
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      08ebb485
    • D
      drm/amdgpu: refine create and release logic of hive info · d95e8e97
      Committed by Dennis Li
      Create and release the hive info object dynamically,
      which helps the driver support more hives in the future.
      
      v2:
      Save the hive object pointer in adev, to avoid locking
      xgmi_mutex on every call to amdgpu_get_xgmi_hive.
      
      v3:
      1. Change the type of the hive object pointer in adev from void* to
      amdgpu_hive_info*.
      2. Remove unnecessary variable initialization.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      d95e8e97
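The create/release lifecycle and the typed pointer cached in the device can be illustrated with a small userspace sketch. Everything here is hypothetical (`fake_hive`, `fake_adev`, and the three helpers are illustration names, not the driver's API), and the plain integer reference count is an assumption chosen for brevity; it is not thread-safe and does not claim to mirror how the driver tracks the object's lifetime.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; not the driver's structures. */
struct fake_hive {
    unsigned long id;
    int refcount;              /* assumption: simple, non-atomic count */
};

struct fake_adev {
    struct fake_hive *hive;    /* typed pointer rather than void* */
};

/* Allocate a hive object on demand; the caller owns one reference. */
struct fake_hive *fake_hive_create(unsigned long id)
{
    struct fake_hive *hive = calloc(1, sizeof(*hive));

    if (!hive)
        return NULL;
    hive->id = id;
    hive->refcount = 1;
    return hive;
}

/* Fast path: reuse the pointer cached in the device, no global lookup. */
struct fake_hive *fake_hive_get(struct fake_adev *adev)
{
    if (adev->hive)
        adev->hive->refcount++;
    return adev->hive;
}

/* Drop one reference; returns 1 if the object was freed, 0 otherwise. */
int fake_hive_put(struct fake_hive *hive)
{
    assert(hive && hive->refcount > 0);
    if (--hive->refcount == 0) {
        free(hive);
        return 1;
    }
    return 0;
}
```

A caller would pair every `fake_hive_get` with a `fake_hive_put`; the object disappears once the last reference is dropped, so no fixed-size table of hives is needed.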
    • D
      drm/amdgpu: refine message print for devices of hive · aac89168
      Committed by Dennis Li
      Use dev_xxx instead of DRM_xxx/pr_xxx so that each message
      indicates which device of a hive it refers to.
      Reviewed-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      aac89168
    • D
      drm/amdgpu: fix the nullptr issue when reenter GPU recovery · cbfd17f7
      Committed by Dennis Li
      In a single-GPU system, if the driver re-enters GPU recovery,
      amdgpu_device_lock_adev returns false, but hive is a null
      pointer at that point.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      cbfd17f7
    • D
      drm/amdgpu: change reset lock from mutex to rw_semaphore · 6049db43
      Committed by Dennis Li
      Clients don't need the reset lock for synchronization when no
      GPU recovery is in progress.
      
      v2:
      Return the return value of down_read_killable.
      
      v3:
      If GPU recovery has begun, the VF ignores the FLR notification.
      Reviewed-by: Monk Liu <monk.liu@amd.com>
      Acked-by: Christian König <christian.koenig@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      6049db43
    • D
      drm/amdgpu: refine codes to avoid reentering GPU recovery · 53b3f8f4
      Committed by Dennis Li
      If other threads hold the reset lock, recovery fails at
      try_lock. Therefore we introduce the atomics hive->in_reset
      and adev->in_gpu_reset, to avoid re-entering GPU recovery.
      
      v2:
      Drop "? true : false" in the definition of amdgpu_in_reset.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      53b3f8f4
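The idea of guarding re-entry with an atomic flag instead of a try-lock can be sketched in userspace with C11 atomics. This is a minimal illustration under stated assumptions: `fake_dev` and its helpers are hypothetical names, and `<stdatomic.h>` stands in for the kernel's atomic helpers. Only the thread that flips the flag from 0 to 1 proceeds with recovery; any other caller sees the flag already set and backs off.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device state; not the driver's struct. */
struct fake_dev {
    atomic_int in_gpu_reset;    /* 0 = idle, 1 = recovery in progress */
};

/* Returns true only for the single caller that wins the 0 -> 1 transition. */
bool fake_dev_begin_reset(struct fake_dev *dev)
{
    int expected = 0;

    return atomic_compare_exchange_strong(&dev->in_gpu_reset, &expected, 1);
}

/* Must be called by the thread that won fake_dev_begin_reset(). */
void fake_dev_end_reset(struct fake_dev *dev)
{
    assert(atomic_load(&dev->in_gpu_reset) == 1);
    atomic_store(&dev->in_gpu_reset, 0);
}

/* The comparison already yields a bool; no "? true : false" is needed. */
bool fake_dev_in_reset(struct fake_dev *dev)
{
    return atomic_load(&dev->in_gpu_reset) != 0;
}
```

Unlike a failed try-lock, a failed `fake_dev_begin_reset` says specifically that another recovery is running, not merely that some other thread holds the lock for an unrelated reason.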
  2. 19 Aug, 2020 1 commit
  3. 15 Aug, 2020 3 commits
    • C
      drm/amdgpu: revert "fix system hang issue during GPU reset" · f1403342
      Committed by Christian König
      The whole approach wasn't thought through till the end.
      
      We already had a reset lock like this in the past and it caused the same problems as this one.
      
      Completely revert the patch for now and add individual trylock protection to the hardware access functions as necessary.
      
      This reverts commit df9c8d1a.
      Signed-off-by: Christian König <christian.koenig@amd.com>
      Acked-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      f1403342
    • E
      drm/amd/powerplay: optimize the interface for mgpu fan boost enablement · f10bb940
      Committed by Evan Quan
      Hide the implementation details from outside (of power). This also
      prepares for expanding this to swSMU.
      Signed-off-by: Evan Quan <evan.quan@amd.com>
      Acked-by: Nirmoy Das <nirmoy.das@amd.com>
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      f10bb940
    • D
      drm/amdgpu: annotate a false positive recursive locking · 72e14ebf
      Committed by Dennis Li
      [  584.110304] ============================================
      [  584.110590] WARNING: possible recursive locking detected
      [  584.110876] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
      [  584.111164] --------------------------------------------
      [  584.111456] kworker/38:1/553 is trying to acquire lock:
      [  584.111721] ffff9b15ff0a47a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.112112]
                     but task is already holding lock:
      [  584.112673] ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.113068]
                     other info that might help us debug this:
      [  584.113689]  Possible unsafe locking scenario:
      
      [  584.114350]        CPU0
      [  584.114685]        ----
      [  584.115014]   lock(&adev->reset_sem);
      [  584.115349]   lock(&adev->reset_sem);
      [  584.115678]
                      *** DEADLOCK ***
      
      [  584.116624]  May be due to missing lock nesting notation
      
      [  584.117284] 4 locks held by kworker/38:1/553:
      [  584.117616]  #0: ffff9ad635c1d348 ((wq_completion)events){+.+.}, at: process_one_work+0x21f/0x630
      [  584.117967]  #1: ffffac708e1c3e58 ((work_completion)(&con->recovery_work)){+.+.}, at: process_one_work+0x21f/0x630
      [  584.118358]  #2: ffffffffc1c2a5d0 (&tmp->hive_lock){+.+.}, at: amdgpu_device_gpu_recover+0xae/0x1030 [amdgpu]
      [  584.118786]  #3: ffff9b1603d247a0 (&adev->reset_sem){++++}, at: amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.119222]
                     stack backtrace:
      [  584.119990] CPU: 38 PID: 553 Comm: kworker/38:1 Kdump: loaded Tainted: G           OE     5.6.0-deli-v5.6-2848-g3f3109b0e75f #1
      [  584.120782] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [  584.121223] Workqueue: events amdgpu_ras_do_recovery [amdgpu]
      [  584.121638] Call Trace:
      [  584.122050]  dump_stack+0x98/0xd5
      [  584.122499]  __lock_acquire+0x1139/0x16e0
      [  584.122931]  ? trace_hardirqs_on+0x3b/0xf0
      [  584.123358]  ? cancel_delayed_work+0xa6/0xc0
      [  584.123771]  lock_acquire+0xb8/0x1c0
      [  584.124197]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.124599]  down_write+0x49/0x120
      [  584.125032]  ? amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.125472]  amdgpu_device_gpu_recover+0x262/0x1030 [amdgpu]
      [  584.125910]  ? amdgpu_ras_error_query+0x1b8/0x2a0 [amdgpu]
      [  584.126367]  amdgpu_ras_do_recovery+0x159/0x190 [amdgpu]
      [  584.126789]  process_one_work+0x29e/0x630
      [  584.127208]  worker_thread+0x3c/0x3f0
      [  584.127621]  ? __kthread_parkme+0x61/0x90
      [  584.128014]  kthread+0x12f/0x150
      [  584.128402]  ? process_one_work+0x630/0x630
      [  584.128790]  ? kthread_park+0x90/0x90
      [  584.129174]  ret_from_fork+0x3a/0x50
      
      Each adev has its own lock_class_key to avoid the false positive
      recursive locking report.
      
      v2:
      1. Register adev->lock_key with lockdep; otherwise lockdep
      reports the warning below:
      
      [ 1216.705820] BUG: key ffff890183b647d0 has not been registered!
      [ 1216.705924] ------------[ cut here ]------------
      [ 1216.705972] DEBUG_LOCKS_WARN_ON(1)
      [ 1216.705997] WARNING: CPU: 20 PID: 541 at kernel/locking/lockdep.c:3743 lockdep_init_map+0x150/0x210
      
      v3:
      Use down_write_nest_lock to annotate the false deadlock
      warning.
      Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      72e14ebf
  4. 07 Aug, 2020 1 commit
  5. 05 Aug, 2020 4 commits
  6. 31 Jul, 2020 1 commit
  7. 28 Jul, 2020 2 commits
    • M
      drm/amdgpu: enable DC support for SI parts (v2) · 64200c46
      Committed by Mauro Rossi
      [Why]
      amdgpu_device.c requires changes for SI chipset support
      si.c requires changes to enable the Display Manager IP block
      
      [How]
      amdgpu_device.c: add SI families in amdgpu_device_asic_has_dc_support()
      si.c: changes in si_set_ip_blocks() for Display Manager IP blocks enablement
      
      (v1) NOTE: As with Kaveri and older parts, the amdgpu.dc=1 kernel cmdline is required
      
      (v2) fix for bc011f9350 ("drm/amdgpu: Change SI/CI gfx/sdma/smu init sequence")
           remove CHIP_HAINAN support since it has no physical DCE6 module
      Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Mauro Rossi <issor.oruam@gmail.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      64200c46
    • D
      drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Committed by Dennis Li
      When the GPU hangs, the driver has multiple paths into amdgpu_device_gpu_recover;
      the atomics adev->in_gpu_reset and hive->in_reset are used to avoid
      re-entering GPU recovery.
      
      During GPU reset and resume, it is unsafe for other threads to access the GPU,
      because that may cause the GPU reset to fail. Therefore the new rw_semaphore
      adev->reset_sem is introduced, which protects the GPU from being accessed by
      external threads during recovery.
      
      v2:
      1. Add the rwlock to some ioctls, debugfs and the file-close function.
      2. Use dqm->is_resetting and dqm_lock for protection in the kfd
      driver.
      3. Remove try_lock and make adev->in_gpu_reset atomic, to avoid
      re-entering GPU recovery for the same GPU hang.
      
      v3:
      1. Change back to using adev->reset_sem to protect the kfd callback
      functions, because dqm_lock cannot protect all of the code. For example,
      free_mqd must be called outside of dqm_lock:
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. Remove try_lock and introduce the atomic hive->in_reset, to avoid
      re-entering GPU recovery.
      
      v4:
      1. Remove an unnecessary whitespace change in kfd_chardev.c.
      2. Remove commented-out code in amdgpu_device.c.
      3. Add a more detailed explanation to the commit message.
      4. Define a wrapper function, amdgpu_in_reset.
      
      v5:
      1. Fix some style issues.
      Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: Christian König <christian.koenig@amd.com>
      Suggested-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: Lijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: Luben Tukov <luben.tuikov@amd.com>
      Signed-off-by: Dennis Li <Dennis.Li@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      df9c8d1a
  8. 16 Jul, 2020 5 commits
  9. 11 Jul, 2020 2 commits
  10. 03 Jul, 2020 4 commits
  11. 01 Jul, 2020 7 commits
  12. 04 Jun, 2020 3 commits
  13. 30 May, 2020 1 commit