1. 05 8月, 2020 5 次提交
  2. 28 7月, 2020 1 次提交
    • D
      drm/amdgpu: fix system hang issue during GPU reset · df9c8d1a
      Dennis Li 提交于
      when GPU hang, driver has multi-paths to enter amdgpu_device_gpu_recover,
      the atomic adev->in_gpu_reset and hive->in_reset are used to avoid
      re-entering GPU recovery.
      
      During GPU reset and resume, it is unsafe that other threads access GPU,
      which maybe cause GPU reset failed. Therefore the new rw_semaphore
      adev->reset_sem is introduced, which protect GPU from being accessed by
      external threads during recovery.
      
      v2:
      1. add rwlock for some ioctls, debugfs and file-close function.
      2. change to use dqm->is_resetting and dqm_lock for protection in kfd
      driver.
      3. remove try_lock and change adev->in_gpu_reset as atomic, to avoid
      re-enter GPU recovery for the same GPU hang.
      
      v3:
      1. change back to use adev->reset_sem to protect kfd callback
      functions, because dqm_lock couldn't protect all codes, for example:
      free_mqd must be called outside of dqm_lock;
      
      [ 1230.176199] Hardware name: Supermicro SYS-7049GP-TRT/X11DPG-QT, BIOS 3.1 05/23/2019
      [ 1230.177221] Call Trace:
      [ 1230.178249]  dump_stack+0x98/0xd5
      [ 1230.179443]  amdgpu_virt_kiq_reg_write_reg_wait+0x181/0x190 [amdgpu]
      [ 1230.180673]  gmc_v9_0_flush_gpu_tlb+0xcc/0x310 [amdgpu]
      [ 1230.181882]  amdgpu_gart_unbind+0xa9/0xe0 [amdgpu]
      [ 1230.183098]  amdgpu_ttm_backend_unbind+0x46/0x180 [amdgpu]
      [ 1230.184239]  ? ttm_bo_put+0x171/0x5f0 [ttm]
      [ 1230.185394]  ttm_tt_unbind+0x21/0x40 [ttm]
      [ 1230.186558]  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
      [ 1230.187707]  ttm_tt_destroy+0x13/0x20 [ttm]
      [ 1230.188832]  ttm_bo_cleanup_memtype_use+0x36/0x80 [ttm]
      [ 1230.189979]  ttm_bo_put+0x1be/0x5f0 [ttm]
      [ 1230.191230]  amdgpu_bo_unref+0x1e/0x30 [amdgpu]
      [ 1230.192522]  amdgpu_amdkfd_free_gtt_mem+0xaf/0x140 [amdgpu]
      [ 1230.193833]  free_mqd+0x25/0x40 [amdgpu]
      [ 1230.195143]  destroy_queue_cpsch+0x1a7/0x270 [amdgpu]
      [ 1230.196475]  pqm_destroy_queue+0x105/0x260 [amdgpu]
      [ 1230.197819]  kfd_ioctl_destroy_queue+0x37/0x70 [amdgpu]
      [ 1230.199154]  kfd_ioctl+0x277/0x500 [amdgpu]
      [ 1230.200458]  ? kfd_ioctl_get_clock_counters+0x60/0x60 [amdgpu]
      [ 1230.201656]  ? tomoyo_file_ioctl+0x19/0x20
      [ 1230.202831]  ksys_ioctl+0x98/0xb0
      [ 1230.204004]  __x64_sys_ioctl+0x1a/0x20
      [ 1230.205174]  do_syscall_64+0x5f/0x250
      [ 1230.206339]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      2. remove try_lock and introduce atomic hive->in_reset, to avoid
      re-enter GPU recovery.
      
      v4:
      1. remove an unnecessary whitespace change in kfd_chardev.c
      2. remove comment codes in amdgpu_device.c
      3. add more detailed comment in commit message
      4. define a wrap function amdgpu_in_reset
      
      v5:
      1. Fix some style issues.
      Reviewed-by: NHawking Zhang <Hawking.Zhang@amd.com>
      Suggested-by: NAndrey Grodzovsky <andrey.grodzovsky@amd.com>
      Suggested-by: NChristian König <christian.koenig@amd.com>
      Suggested-by: NFelix Kuehling <Felix.Kuehling@amd.com>
      Suggested-by: NLijo Lazar <Lijo.Lazar@amd.com>
      Suggested-by: NLuben Tukov <luben.tuikov@amd.com>
      Signed-off-by: NDennis Li <Dennis.Li@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      df9c8d1a
  3. 03 7月, 2020 1 次提交
  4. 01 7月, 2020 6 次提交
  5. 25 6月, 2020 1 次提交
  6. 10 6月, 2020 1 次提交
  7. 11 5月, 2020 4 次提交
  8. 02 5月, 2020 2 次提交
  9. 29 4月, 2020 9 次提交
  10. 02 4月, 2020 1 次提交
  11. 28 3月, 2020 1 次提交
    • J
      mm/hmm: remove HMM_FAULT_SNAPSHOT · 6bfef2f9
      Jason Gunthorpe 提交于
      Now that flags are handled on a fine-grained per-page basis this global
      flag is redundant and has a confusing overlap with the pfn_flags_mask and
      default_flags.
      
      Normalize the HMM_FAULT_SNAPSHOT behavior into one place. Callers needing
      the SNAPSHOT behavior should set a pfn_flags_mask and default_flags that
      always results in a cleared HMM_PFN_VALID. Then no pages will be faulted,
      and HMM_FAULT_SNAPSHOT is not a special flow that overrides the masking
      mechanism.
      
      As this is the last flag, also remove the flags argument. If future flags
      are needed they can be part of the struct hmm_range function arguments.
      
      Link: https://lore.kernel.org/r/20200327200021.29372-5-jgg@ziepe.caReviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJason Gunthorpe <jgg@mellanox.com>
      6bfef2f9
  12. 27 3月, 2020 1 次提交
  13. 26 3月, 2020 2 次提交
  14. 07 3月, 2020 1 次提交
  15. 29 2月, 2020 1 次提交
    • Y
      drm/amdgpu: no need to clean debugfs at amdgpu · d2790e10
      Yintian Tao 提交于
      drm_minor_unregister will invoke drm_debugfs_cleanup
      to clean all the child node under primary minor node.
      We don't need to invoke amdgpu_debugfs_fini and
      amdgpu_debugfs_regs_cleanup to clean agian.
      Otherwise, it will raise the NULL pointer like below.
      [   45.046029] BUG: unable to handle kernel NULL pointer dereference at 00000000000000a8
      [   45.047256] PGD 0 P4D 0
      [   45.047713] Oops: 0002 [#1] SMP PTI
      [   45.048198] CPU: 0 PID: 2796 Comm: modprobe Tainted: G        W  OE     4.18.0-15-generic #16~18.04.1-Ubuntu
      [   45.049538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
      [   45.050651] RIP: 0010:down_write+0x1f/0x40
      [   45.051194] Code: 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb e8 ce d9 ff ff 48 ba 01 00 00 00 ff ff ff ff 48 89 d8 <f0> 48 0f c1 10 85 d2 74 05 e8 53 1c ff ff 65 48 8b 04 25 00 5c 01
      [   45.053702] RSP: 0018:ffffad8f4133fd40 EFLAGS: 00010246
      [   45.054384] RAX: 00000000000000a8 RBX: 00000000000000a8 RCX: ffffa011327dd814
      [   45.055349] RDX: ffffffff00000001 RSI: 0000000000000001 RDI: 00000000000000a8
      [   45.056346] RBP: ffffad8f4133fd48 R08: 0000000000000000 R09: ffffffffc0690a00
      [   45.057326] R10: ffffad8f4133fd58 R11: 0000000000000001 R12: ffffa0113cff0300
      [   45.058266] R13: ffffa0113c0a0000 R14: ffffffffc0c02a10 R15: ffffa0113e5c7860
      [   45.059221] FS:  00007f60d46f9540(0000) GS:ffffa0113fc00000(0000) knlGS:0000000000000000
      [   45.060809] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   45.061826] CR2: 00000000000000a8 CR3: 0000000136250004 CR4: 00000000003606f0
      [   45.062913] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   45.064404] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   45.065897] Call Trace:
      [   45.066426]  debugfs_remove+0x36/0xa0
      [   45.067131]  amdgpu_debugfs_ring_fini+0x15/0x20 [amdgpu]
      [   45.068019]  amdgpu_debugfs_fini+0x2c/0x50 [amdgpu]
      [   45.068756]  amdgpu_pci_remove+0x49/0x70 [amdgpu]
      [   45.069439]  pci_device_remove+0x3e/0xc0
      [   45.070037]  device_release_driver_internal+0x18a/0x260
      [   45.070842]  driver_detach+0x3f/0x80
      [   45.071325]  bus_remove_driver+0x59/0xd0
      [   45.071850]  driver_unregister+0x2c/0x40
      [   45.072377]  pci_unregister_driver+0x22/0xa0
      [   45.073043]  amdgpu_exit+0x15/0x57c [amdgpu]
      [   45.073683]  __x64_sys_delete_module+0x146/0x280
      [   45.074369]  do_syscall_64+0x5a/0x120
      [   45.074916]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      v2: remove all debugfs cleanup/fini code at amdgpu
      v3: squash in unused variable removal
      Signed-off-by: NYintian Tao <yttao@amd.com>
      Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
      Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
      d2790e10
  16. 27 2月, 2020 1 次提交
  17. 13 2月, 2020 1 次提交
  18. 08 2月, 2020 1 次提交