1. 12 Jan 2023, 1 commit
    • drm/amdkfd: Fix kernel warning during topology setup · 306888b1
      Mukul Joshi authored
      commit cf97eb7e upstream.
      
      This patch fixes the following kernel warning seen during
      driver load by correctly initializing the p2plink attr before
      creating the sysfs file:
      
      [  +0.002865] ------------[ cut here ]------------
      [  +0.002327] kobject: '(null)' (0000000056260cfb): is not initialized, yet kobject_put() is being called.
      [  +0.004780] WARNING: CPU: 32 PID: 1006 at lib/kobject.c:718 kobject_put+0xaa/0x1c0
      [  +0.001361] Call Trace:
      [  +0.001234]  <TASK>
      [  +0.001067]  kfd_remove_sysfs_node_entry+0x24a/0x2d0 [amdgpu]
      [  +0.003147]  kfd_topology_update_sysfs+0x3d/0x750 [amdgpu]
      [  +0.002890]  kfd_topology_add_device+0xbd7/0xc70 [amdgpu]
      [  +0.002844]  ? lock_release+0x13c/0x2e0
      [  +0.001936]  ? smu_cmn_send_smc_msg_with_param+0x1e8/0x2d0 [amdgpu]
      [  +0.003313]  ? amdgpu_dpm_get_mclk+0x54/0x60 [amdgpu]
      [  +0.002703]  kgd2kfd_device_init.cold+0x39f/0x4ed [amdgpu]
      [  +0.002930]  amdgpu_amdkfd_device_init+0x13d/0x1f0 [amdgpu]
      [  +0.002944]  amdgpu_device_init.cold+0x1464/0x17b4 [amdgpu]
      [  +0.002970]  ? pci_bus_read_config_word+0x43/0x80
      [  +0.002380]  amdgpu_driver_load_kms+0x15/0x100 [amdgpu]
      [  +0.002744]  amdgpu_pci_probe+0x147/0x370 [amdgpu]
      [  +0.002522]  local_pci_probe+0x40/0x80
      [  +0.001896]  work_for_cpu_fn+0x10/0x20
      [  +0.001892]  process_one_work+0x26e/0x5a0
      [  +0.002029]  worker_thread+0x1fd/0x3e0
      [  +0.001890]  ? process_one_work+0x5a0/0x5a0
      [  +0.002115]  kthread+0xea/0x110
      [  +0.001618]  ? kthread_complete_and_exit+0x20/0x20
      [  +0.002422]  ret_from_fork+0x1f/0x30
      [  +0.001808]  </TASK>
      [  +0.001103] irq event stamp: 59837
      [  +0.001718] hardirqs last  enabled at (59849): [<ffffffffb30fab12>] __up_console_sem+0x52/0x60
      [  +0.004414] hardirqs last disabled at (59860): [<ffffffffb30faaf7>] __up_console_sem+0x37/0x60
      [  +0.004414] softirqs last  enabled at (59654): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
      [  +0.004205] softirqs last disabled at (59649): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
      [  +0.004203] ---[ end trace 0000000000000000 ]---
      
      Fixes: 0f28cca8 ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links")
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
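
      The rule behind this fix can be shown with a minimal, hedged kobject
      sketch (the general pattern only, not the literal upstream diff; the
      example_* names are made up):

          #include <linux/kobject.h>
          #include <linux/slab.h>

          static void example_release(struct kobject *kobj)
          {
                  kfree(kobj);
          }

          static struct kobj_type example_ktype = {
                  .release = example_release,
          };

          static int example_add_p2plink_entry(struct kobject *parent)
          {
                  struct kobject *kobj;
                  int ret;

                  kobj = kzalloc(sizeof(*kobj), GFP_KERNEL);
                  if (!kobj)
                          return -ENOMEM;

                  /* Initialize before any error path can reach
                   * kobject_put(); putting an uninitialized kobject is
                   * exactly what the lib/kobject.c warning above
                   * complains about. */
                  ret = kobject_init_and_add(kobj, &example_ktype, parent,
                                             "p2p_links");
                  if (ret)
                          kobject_put(kobj); /* safe: already initialized */
                  return ret;
          }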
  2. 17 Aug 2022, 1 commit
  3. 15 Jun 2022, 1 commit
  4. 11 Jun 2022, 1 commit
  5. 08 Jun 2022, 1 commit
    • drm/amdkfd: Extend KFD device topology to surface peer-to-peer links · 0f28cca8
      Ramesh Errabolu authored
      Extend KFD device topology to surface peer-to-peer links among
      GPU devices connected over PCIe or xGMI. Enabling HSA_AMD_P2P is
      REQUIRED to surface peer-to-peer links.
      
      Prior to this, KFD did not expose to user mode any P2P links, or
      any indirect links that span two or more direct hops. Old versions
      of the Thunk made up their own P2P and indirect links without the
      peer-accessibility and chipset-support information that is available
      to the kernel mode driver. In this patch we expose P2P links in a
      new sysfs directory to provide more reliable P2P link information
      to user mode.
      
      Old versions of the Thunk will continue to work as before and ignore
      the new directory. This avoids conflicts between P2P links exposed by
      KFD and P2P links created by the Thunk itself. New versions of the Thunk
      will use only the P2P links provided in the new p2p_links directory, if
      it exists, or fall back to the old code path on older KFDs that don't
      expose p2p_links.
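
      A hedged userspace sketch of that fallback, assuming the usual KFD
      topology root under sysfs (the helper name and node index are
      illustrative):

          #include <stdio.h>
          #include <sys/stat.h>

          /* Returns 1 if this KFD exposes the new p2p_links directory. */
          static int has_p2p_links(int node)
          {
                  char path[256];
                  struct stat st;

                  snprintf(path, sizeof(path),
                           "/sys/devices/virtual/kfd/kfd/topology/nodes/%d/p2p_links",
                           node);
                  return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
          }

          int main(void)
          {
                  if (has_p2p_links(0))
                          puts("new KFD: read P2P links from p2p_links");
                  else
                          puts("old KFD: fall back to Thunk-built links");
                  return 0;
          }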
      Signed-off-by: Ramesh Errabolu <Ramesh.Errabolu@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  6. 27 May 2022, 1 commit
  7. 04 May 2022, 1 commit
  8. 29 Apr 2022, 2 commits
    • drm/amdkfd: Fix circular lock dependency warning · b179fc28
      Mukul Joshi authored
      [  168.544078] ======================================================
      [  168.550309] WARNING: possible circular locking dependency detected
      [  168.556523] 5.16.0-kfd-fkuehlin #148 Tainted: G            E
      [  168.562558] ------------------------------------------------------
      [  168.568764] kfdtest/3479 is trying to acquire lock:
      [  168.573672] ffffffffc0927a70 (&topology_lock){++++}-{3:3}, at: kfd_topology_device_by_id+0x16/0x60 [amdgpu]
      [  168.583663]
                     but task is already holding lock:
      [  168.589529] ffff97d303dee668 (&mm->mmap_lock#2){++++}-{3:3}, at: vm_mmap_pgoff+0xa9/0x180
      [  168.597755]
                     which lock already depends on the new lock.
      
      [  168.605970]
                      the existing dependency chain (in reverse order) is:
      [  168.613487]
                      -> #3 (&mm->mmap_lock#2){++++}-{3:3}:
      [  168.619700]        lock_acquire+0xca/0x2e0
      [  168.623814]        down_read+0x3e/0x140
      [  168.627676]        do_user_addr_fault+0x40d/0x690
      [  168.632399]        exc_page_fault+0x6f/0x270
      [  168.636692]        asm_exc_page_fault+0x1e/0x30
      [  168.641249]        filldir64+0xc8/0x1e0
      [  168.645115]        call_filldir+0x7c/0x110
      [  168.649238]        ext4_readdir+0x58e/0x940
      [  168.653442]        iterate_dir+0x16a/0x1b0
      [  168.657558]        __x64_sys_getdents64+0x83/0x140
      [  168.662375]        do_syscall_64+0x35/0x80
      [  168.666492]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  168.672095]
                      -> #2 (&type->i_mutex_dir_key#6){++++}-{3:3}:
      [  168.679008]        lock_acquire+0xca/0x2e0
      [  168.683122]        down_read+0x3e/0x140
      [  168.686982]        path_openat+0x5b2/0xa50
      [  168.691095]        do_file_open_root+0xfc/0x190
      [  168.695652]        file_open_root+0xd8/0x1b0
      [  168.702010]        kernel_read_file_from_path_initns+0xc4/0x140
      [  168.709542]        _request_firmware+0x2e9/0x5e0
      [  168.715741]        request_firmware+0x32/0x50
      [  168.721667]        amdgpu_cgs_get_firmware_info+0x370/0xdd0 [amdgpu]
      [  168.730060]        smu7_upload_smu_firmware_image+0x53/0x190 [amdgpu]
      [  168.738414]        fiji_start_smu+0xcf/0x4e0 [amdgpu]
      [  168.745539]        pp_dpm_load_fw+0x21/0x30 [amdgpu]
      [  168.752503]        amdgpu_pm_load_smu_firmware+0x4b/0x80 [amdgpu]
      [  168.760698]        amdgpu_device_fw_loading+0xb8/0x140 [amdgpu]
      [  168.768412]        amdgpu_device_init.cold+0xdf6/0x1716 [amdgpu]
      [  168.776285]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
      [  168.784034]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
      [  168.791161]        local_pci_probe+0x40/0x80
      [  168.797027]        work_for_cpu_fn+0x10/0x20
      [  168.802839]        process_one_work+0x273/0x5b0
      [  168.808903]        worker_thread+0x20f/0x3d0
      [  168.814700]        kthread+0x176/0x1a0
      [  168.819968]        ret_from_fork+0x1f/0x30
      [  168.825563]
                      -> #1 (&adev->pm.mutex){+.+.}-{3:3}:
      [  168.834721]        lock_acquire+0xca/0x2e0
      [  168.840364]        __mutex_lock+0xa2/0x930
      [  168.846020]        amdgpu_dpm_get_mclk+0x37/0x60 [amdgpu]
      [  168.853257]        amdgpu_amdkfd_get_local_mem_info+0xba/0xe0 [amdgpu]
      [  168.861547]        kfd_create_vcrat_image_gpu+0x1b1/0xbb0 [amdgpu]
      [  168.869478]        kfd_create_crat_image_virtual+0x447/0x510 [amdgpu]
      [  168.877884]        kfd_topology_add_device+0x5c8/0x6f0 [amdgpu]
      [  168.885556]        kgd2kfd_device_init.cold+0x385/0x4c5 [amdgpu]
      [  168.893347]        amdgpu_amdkfd_device_init+0x138/0x180 [amdgpu]
      [  168.901177]        amdgpu_device_init.cold+0x141b/0x1716 [amdgpu]
      [  168.909025]        amdgpu_driver_load_kms+0x15/0x120 [amdgpu]
      [  168.916458]        amdgpu_pci_probe+0x19b/0x3a0 [amdgpu]
      [  168.923442]        local_pci_probe+0x40/0x80
      [  168.929249]        work_for_cpu_fn+0x10/0x20
      [  168.935008]        process_one_work+0x273/0x5b0
      [  168.940944]        worker_thread+0x20f/0x3d0
      [  168.946623]        kthread+0x176/0x1a0
      [  168.951765]        ret_from_fork+0x1f/0x30
      [  168.957277]
                      -> #0 (&topology_lock){++++}-{3:3}:
      [  168.965993]        check_prev_add+0x8f/0xbf0
      [  168.971613]        __lock_acquire+0x1299/0x1ca0
      [  168.977485]        lock_acquire+0xca/0x2e0
      [  168.982877]        down_read+0x3e/0x140
      [  168.987975]        kfd_topology_device_by_id+0x16/0x60 [amdgpu]
      [  168.995583]        kfd_device_by_id+0xa/0x20 [amdgpu]
      [  169.002180]        kfd_mmap+0x95/0x200 [amdgpu]
      [  169.008293]        mmap_region+0x337/0x5a0
      [  169.013679]        do_mmap+0x3aa/0x540
      [  169.018678]        vm_mmap_pgoff+0xdc/0x180
      [  169.024095]        ksys_mmap_pgoff+0x186/0x1f0
      [  169.029734]        do_syscall_64+0x35/0x80
      [  169.035005]        entry_SYSCALL_64_after_hwframe+0x44/0xae
      [  169.041754]
                      other info that might help us debug this:
      
      [  169.053276] Chain exists of:
                        &topology_lock --> &type->i_mutex_dir_key#6 --> &mm->mmap_lock#2
      
      [  169.068389]  Possible unsafe locking scenario:
      
      [  169.076661]        CPU0                    CPU1
      [  169.082383]        ----                    ----
      [  169.088087]   lock(&mm->mmap_lock#2);
      [  169.092922]                                lock(&type->i_mutex_dir_key#6);
      [  169.100975]                                lock(&mm->mmap_lock#2);
      [  169.108320]   lock(&topology_lock);
      [  169.112957]
                       *** DEADLOCK ***
      
      This commit fixes the deadlock warning by ensuring pm.mutex is not
      taken while the topology lock is held. To do this, kfd_local_mem_info
      is moved into the KFD dev struct and filled in during device init.
      The cached value can then be used instead of querying the hardware
      repeatedly.
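
      A minimal sketch of that caching pattern (illustrative types and
      names, not the actual amdgpu/KFD symbols):

          #include <stdint.h>

          struct local_mem_info_ex {
                  uint64_t size_public;
                  uint64_t size_private;
          };

          struct kfd_dev_ex {
                  struct local_mem_info_ex local_mem_info; /* cached at init */
          };

          /* Stand-in for the pm.mutex-protected hardware query. */
          static void query_local_mem_info(struct local_mem_info_ex *out)
          {
                  out->size_public = 1ULL << 33; /* illustrative values */
                  out->size_private = 0;
          }

          /* Device init: take the pm.mutex-protected snapshot exactly once,
           * before the topology lock enters the picture. */
          static void device_init_ex(struct kfd_dev_ex *kfd)
          {
                  query_local_mem_info(&kfd->local_mem_info);
          }

          /* Topology code: read the cached copy. pm.mutex is never taken
           * here, so no pm.mutex -> topology_lock ordering is created. */
          static uint64_t total_local_mem(const struct kfd_dev_ex *kfd)
          {
                  return kfd->local_mem_info.size_public +
                         kfd->local_mem_info.size_private;
          }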
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
    • drm/amdkfd: Fix updating IO links during device removal · 98447635
      Mukul Joshi authored
      The logic that updates the IO links when a KFD device is removed
      was incorrect: it missed updating the proximity domain values of
      nodes whose node_from and node_to values were both greater than
      the proximity domain value of the KFD device being removed from
      the topology.
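
      A sketch of the renumbering rule the fix restores (illustrative
      types; the real driver walks its topology lists rather than an
      array):

          #include <stddef.h>
          #include <stdint.h>

          struct iolink_ex {
                  uint32_t node_from;
                  uint32_t node_to;
          };

          static void shift_proximity_domains(struct iolink_ex *links,
                                              size_t n, uint32_t removed)
          {
                  for (size_t i = 0; i < n; i++) {
                          /* Fix each endpoint independently. The bug was
                           * skipping links whose endpoints were *both*
                           * greater than the removed domain. */
                          if (links[i].node_from > removed)
                                  links[i].node_from--;
                          if (links[i].node_to > removed)
                                  links[i].node_to--;
                  }
          }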
      
      Fixes: 46d18d51 ("drm/amdkfd: Cleanup IO links during KFD device removal")
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  9. 13 Apr 2022, 1 commit
    • drm/amdkfd: Cleanup IO links during KFD device removal · 46d18d51
      Mukul Joshi authored
      Currently, the IO links to the device being removed from the
      topology are not cleared. As a result, dangling links would be left
      in the KFD topology. This patch aims to fix the following (a sketch
      of the removal sequence appears after the list):
      1. Cleanup all IO links to the device being removed.
      2. Ensure that node numbering in sysfs and nodes proximity domain
         values are consistent after the device is removed:
         a. Adding a device and removing a GPU device are made mutually
            exclusive.
         b. The global proximity domain counter is no longer required to be
            an atomic counter. A normal 32-bit counter can be used instead.
      3. Update generation_count to let user-mode know that topology has
         changed due to device removal.
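
      A hedged sketch of that removal sequence (lock, counter, and helper
      names are illustrative; the elided steps are comments only):

          #include <pthread.h>
          #include <stdint.h>

          static pthread_mutex_t topology_update_lock_ex =
                          PTHREAD_MUTEX_INITIALIZER;
          static uint32_t next_proximity_domain_ex; /* plain counter (2b) */
          static uint64_t generation_count_ex;

          static void remove_device_ex(uint32_t removed_domain)
          {
                  /* 2a: device add and GPU removal are mutually exclusive */
                  pthread_mutex_lock(&topology_update_lock_ex);
                  /* 1: drop every IO link pointing at removed_domain ... */
                  /* 2: shift higher node_from/node_to values down by one,
                   *    keeping sysfs numbering and proximity domains in
                   *    sync ... */
                  next_proximity_domain_ex--;
                  generation_count_ex++; /* 3: tell user mode it changed */
                  pthread_mutex_unlock(&topology_update_lock_ex);
          }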
      
      CC: Shuotao Xu <shuotaoxu@microsoft.com>
      Reviewed-by: Shuotao Xu <shuotaoxu@microsoft.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  10. 15 Feb 2022, 2 commits
  11. 02 Dec 2021, 2 commits
  12. 23 Nov 2021, 1 commit
  13. 18 Nov 2021, 6 commits
  14. 20 Oct 2021, 1 commit
  15. 06 Aug 2021, 1 commit
  16. 23 Jul 2021, 2 commits
  17. 19 Jun 2021, 1 commit
  18. 16 Jun 2021, 1 commit
    • drm/amdkfd: Disable SVM per GPU, not per process · 5a75ea56
      Felix Kuehling authored
      When some GPUs don't support SVM, don't disable it for the entire process.
      That would be inconsistent with the information the process got from the
      topology, which indicates SVM support per GPU.
      
      Instead disable SVM support only for the unsupported GPUs. This is done
      by checking any per-device attributes against the bitmap of supported
      GPUs. Also use the supported GPU bitmap to initialize access bitmaps for
      new SVM address ranges.
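
      A minimal sketch of that gating, assuming a 64-GPU cap and a plain
      64-bit bitmap (the real driver has its own bitmap type and limits):

          #include <stdbool.h>
          #include <stdint.h>

          #define MAX_GPUS_EX 64

          /* Honor a per-device SVM attribute only on GPUs that support
           * SVM, instead of failing the whole process. */
          static bool svm_attr_allowed(uint64_t supported, unsigned gpu)
          {
                  return gpu < MAX_GPUS_EX &&
                         (supported & (1ULL << gpu)) != 0;
          }

          /* A new SVM range starts with access on exactly the supported
           * GPUs, matching what the topology reported to the process. */
          static uint64_t svm_range_initial_access(uint64_t supported)
          {
                  return supported;
          }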
      
      Don't handle recoverable page faults from unsupported GPUs. (I don't
      think there will be unsupported GPUs that can generate recoverable page
      faults. But better safe than sorry.)
      Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Reviewed-by: Philip Yang <philip.yang@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  19. 05 Jun 2021, 1 commit
  20. 20 May 2021, 1 commit
  21. 11 May 2021, 3 commits
  22. 21 Apr 2021, 1 commit
  23. 10 Mar 2021, 1 commit
  24. 10 Feb 2021, 1 commit
    • drm/amdkfd: Get unique_id dynamically v2 · 11964258
      Kent Russell authored
      Instead of caching the value during amdgpu_device_init, just call the
      function directly. This avoids issues where the unique_id hasn't been
      saved by the time that KFD's topology snapshot is done (e.g. Arcturus).
      
      KFD's topology information from the amdgpu_device was initially cached
      at KFD initialization due to amdkfd and amdgpu being separate modules.
      Now that they are combined together, we can directly call the functions
      that we need and avoid this unnecessary duplication and complexity.
      
      As a side-effect of this change, we also remove unique_id=0 for CPUs,
      which is obviously not unique.
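
      A small sketch of the shift in approach (illustrative names; the
      real code calls into amdgpu at sysfs-read time):

          #include <stdint.h>

          struct gpu_dev_ex {
                  uint64_t unique_id; /* populated late, during GPU init */
          };

          /* Old: copied into the topology while it could still be 0.
           * New: resolved when sysfs is read, after the GPU set it. */
          static uint64_t topology_unique_id(const struct gpu_dev_ex *gpu)
          {
                  /* CPUs pass NULL and simply expose no unique_id file. */
                  return gpu ? gpu->unique_id : 0;
          }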
      
      v2: Drop previous patch printing unique_id in hex
      Signed-off-by: Kent Russell <kent.russell@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
  25. 30 Oct 2020, 1 commit
  26. 13 Oct 2020, 1 commit
  27. 06 Oct 2020, 1 commit
  28. 27 Aug 2020, 2 commits