1. 12 Jan, 2023 3 commits
    • drm/amdkfd: Fix kernel warning during topology setup · 306888b1
      Committed by Mukul Joshi
      commit cf97eb7e upstream.
      
      This patch fixes the following kernel warning seen during
      driver load by correctly initializing the p2plink attr before
      creating the sysfs file:
      
      [  +0.002865] ------------[ cut here ]------------
      [  +0.002327] kobject: '(null)' (0000000056260cfb): is not initialized, yet kobject_put() is being called.
      [  +0.004780] WARNING: CPU: 32 PID: 1006 at lib/kobject.c:718 kobject_put+0xaa/0x1c0
      [  +0.001361] Call Trace:
      [  +0.001234]  <TASK>
      [  +0.001067]  kfd_remove_sysfs_node_entry+0x24a/0x2d0 [amdgpu]
      [  +0.003147]  kfd_topology_update_sysfs+0x3d/0x750 [amdgpu]
      [  +0.002890]  kfd_topology_add_device+0xbd7/0xc70 [amdgpu]
      [  +0.002844]  ? lock_release+0x13c/0x2e0
      [  +0.001936]  ? smu_cmn_send_smc_msg_with_param+0x1e8/0x2d0 [amdgpu]
      [  +0.003313]  ? amdgpu_dpm_get_mclk+0x54/0x60 [amdgpu]
      [  +0.002703]  kgd2kfd_device_init.cold+0x39f/0x4ed [amdgpu]
      [  +0.002930]  amdgpu_amdkfd_device_init+0x13d/0x1f0 [amdgpu]
      [  +0.002944]  amdgpu_device_init.cold+0x1464/0x17b4 [amdgpu]
      [  +0.002970]  ? pci_bus_read_config_word+0x43/0x80
      [  +0.002380]  amdgpu_driver_load_kms+0x15/0x100 [amdgpu]
      [  +0.002744]  amdgpu_pci_probe+0x147/0x370 [amdgpu]
      [  +0.002522]  local_pci_probe+0x40/0x80
      [  +0.001896]  work_for_cpu_fn+0x10/0x20
      [  +0.001892]  process_one_work+0x26e/0x5a0
      [  +0.002029]  worker_thread+0x1fd/0x3e0
      [  +0.001890]  ? process_one_work+0x5a0/0x5a0
      [  +0.002115]  kthread+0xea/0x110
      [  +0.001618]  ? kthread_complete_and_exit+0x20/0x20
      [  +0.002422]  ret_from_fork+0x1f/0x30
      [  +0.001808]  </TASK>
      [  +0.001103] irq event stamp: 59837
      [  +0.001718] hardirqs last  enabled at (59849): [<ffffffffb30fab12>] __up_console_sem+0x52/0x60
      [  +0.004414] hardirqs last disabled at (59860): [<ffffffffb30faaf7>] __up_console_sem+0x37/0x60
      [  +0.004414] softirqs last  enabled at (59654): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
      [  +0.004205] softirqs last disabled at (59649): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
      [  +0.004203] ---[ end trace 0000000000000000 ]---
      
      Fixes: 0f28cca8 ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links")
      Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • drm/amdkfd: Fix double release compute pasid · a02c07b6
      Committed by Philip Yang
      [ Upstream commit 1a799c4c ]
      
      If kfd_process_device_init_vm returns failure after the vm has been
      converted to a compute vm and vm->pasid has been set to the compute
      pasid, KFD does not take the pdd->drm_file reference. As a result, the
      drm close file handler may be called to release the compute pasid
      before the KFD process destroy worker releases the same pasid and sets
      vm->pasid to zero. This generates the WARNING backtrace and NULL
      pointer access below.
      
      Add the helper amdgpu_amdkfd_gpuvm_set_vm_pasid and call it as the last
      step of kfd_process_device_init_vm. This ensures that the vm pasid is
      either the original pasid, if acquiring the vm failed, or the compute
      pasid with the pdd->drm_file reference taken, which avoids releasing
      the same pasid twice.
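
      The ordering argument above can be illustrated with a small userspace
      sketch. This is not the amdgpu code: toy_vm, toy_pasid_free, toy_init_vm
      and toy_run_failed_init are hypothetical names, and a flag array stands
      in for the kernel IDA. It only shows that when the pasid switch is the
      last step, a failed init leaves the vm with its original pasid, so each
      pasid is released exactly once.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_PASID 8

/* Hypothetical stand-in for the kernel IDA: one flag per allocated id. */
static bool toy_allocated[TOY_MAX_PASID];
static int toy_bad_frees; /* frees of ids that were not allocated */

static void toy_pasid_alloc(int id)
{
    assert(id >= 0 && id < TOY_MAX_PASID);
    toy_allocated[id] = true;
}

static void toy_pasid_free(int id)
{
    assert(id >= 0 && id < TOY_MAX_PASID);
    if (!toy_allocated[id]) {
        toy_bad_frees++; /* models the "ida_free called for id=..." warning */
        return;
    }
    toy_allocated[id] = false;
}

struct toy_vm {
    int pasid; /* the pasid that the DRM file close path will release */
};

/*
 * Hypothetical model of the init path. later_step_fails simulates an
 * error after the vm was converted to a compute vm. The pasid switch is
 * the last step, so a failure leaves vm->pasid untouched.
 */
static bool toy_init_vm(struct toy_vm *vm, int compute_pasid,
                        bool later_step_fails)
{
    if (later_step_fails)
        return false;          /* vm->pasid is still the original pasid */
    toy_pasid_free(vm->pasid); /* the original pasid is no longer needed */
    vm->pasid = compute_pasid; /* last step: hand over the compute pasid */
    return true;
}

/* Runs the failure scenario and returns the number of bad frees seen. */
static int toy_run_failed_init(void)
{
    struct toy_vm vm = { .pasid = 1 };
    const int compute_pasid = 2;

    toy_bad_frees = 0;
    toy_pasid_alloc(vm.pasid);
    toy_pasid_alloc(compute_pasid);

    if (!toy_init_vm(&vm, compute_pasid, true))
        toy_pasid_free(compute_pasid); /* process teardown frees its pasid */

    toy_pasid_free(vm.pasid);          /* file close frees the vm's pasid */
    return toy_bad_frees;
}
```

      In this model the two release paths never see the same id, which is
      the property the commit message describes.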
      
       amdgpu: Failed to create process VM object
       ida_free called for id=32770 which is not allocated.
       WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140
       RIP: 0010:ida_free+0x96/0x140
       Call Trace:
        amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
        amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
        drm_file_free.part.13+0x216/0x270 [drm]
        drm_close_helper.isra.14+0x60/0x70 [drm]
        drm_release+0x6e/0xf0 [drm]
        __fput+0xcc/0x280
        ____fput+0xe/0x20
        task_work_run+0x96/0xc0
        do_exit+0x3d0/0xc10
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       RIP: 0010:ida_free+0x76/0x140
       Call Trace:
        amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
        amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
        drm_file_free.part.13+0x216/0x270 [drm]
        drm_close_helper.isra.14+0x60/0x70 [drm]
        drm_release+0x6e/0xf0 [drm]
        __fput+0xcc/0x280
        ____fput+0xe/0x20
        task_work_run+0x96/0xc0
        do_exit+0x3d0/0xc10
      Signed-off-by: Philip Yang <Philip.Yang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • drm/amdkfd: Fix kfd_process_device_init_vm error handling · 9d74d1f5
      Committed by Philip Yang
      [ Upstream commit 29d48b87 ]
      
      The error path should only destroy the ib_mem and let the process
      cleanup worker free the outstanding BOs. Reset the pointer in the
      pdd->qpd structure to avoid a NULL pointer access in the process
      destroy worker.
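
      The pointer reset can be sketched in plain userspace C. The names
      toy_qpd, toy_init_error_path and toy_cleanup_worker are hypothetical
      and malloc/free stand in for the real buffer management. The point is
      only that the error path clears the pointer it has released, so a
      later cleanup pass sees there is nothing left to tear down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-device queue data. */
struct toy_qpd {
    void *ib_mem; /* NULL means "no IB buffer to release" */
};

/* Error path: release only the IB buffer and clear the stale pointer. */
static void toy_init_error_path(struct toy_qpd *qpd)
{
    assert(qpd != NULL);
    free(qpd->ib_mem);
    qpd->ib_mem = NULL;
}

/*
 * Later cleanup pass. Returns 1 if it released a buffer and 0 if the
 * pointer had already been reset and there was nothing to do.
 */
static int toy_cleanup_worker(struct toy_qpd *qpd)
{
    assert(qpd != NULL);
    if (qpd->ib_mem == NULL)
        return 0;
    free(qpd->ib_mem);
    qpd->ib_mem = NULL;
    return 1;
}
```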
      
       BUG: kernel NULL pointer dereference, address: 0000000000000010
       Call Trace:
        amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel+0x46/0xb0 [amdgpu]
        kfd_process_device_destroy_cwsr_dgpu+0x40/0x70 [amdgpu]
        kfd_process_destroy_pdds+0x71/0x190 [amdgpu]
        kfd_process_wq_release+0x2a2/0x3b0 [amdgpu]
        process_one_work+0x2a1/0x600
        worker_thread+0x39/0x3d0
      Signed-off-by: Philip Yang <Philip.Yang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 10 Nov, 2022 2 commits
  3. 03 Nov, 2022 2 commits
  4. 25 Oct, 2022 2 commits
  5. 13 Oct, 2022 2 commits
    • mm: free device private pages have zero refcount · ef233450
      Committed by Alistair Popple
      Since 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount") device private pages have no longer had an extra reference
      count when the page is in use.  However before handing them back to the
      owning device driver we add an extra reference count such that free pages
      have a reference count of one.
      
      This makes it difficult to tell if a page is free or not because both free
      and in use pages will have a non-zero refcount.  Instead we should return
      pages to the driver's page allocator with a zero reference count.  Kernel
      code can then safely use kernel functions such as get_page_unless_zero().
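
      Why a zero refcount for free pages helps can be seen in a userspace
      sketch of the "get unless zero" pattern. toy_page, toy_get_unless_zero
      and toy_put are hypothetical names built on C11 atomics, not the
      kernel's struct page or get_page_unless_zero() implementation. With
      free pages at zero, a failed attempt to take a reference reliably
      means the page is free.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page with a reference count. */
struct toy_page {
    atomic_int refcount; /* 0 means the page is free */
};

/* Take a reference only if the page is in use (count is non-zero). */
static bool toy_get_unless_zero(struct toy_page *page)
{
    int old = atomic_load(&page->refcount);

    while (old != 0) {
        /* On failure, old is reloaded with the current count. */
        if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
            return true;
    }
    return false; /* the page is free, no reference was taken */
}

/* Drop a reference. Returns true when the page just became free. */
static bool toy_put(struct toy_page *page)
{
    int old = atomic_fetch_sub(&page->refcount, 1);

    assert(old > 0); /* dropping a reference on a free page is a bug */
    return old == 1;
}
```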
      
      Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/memory.c: fix race when faulting a device private page · 16ce101d
      Committed by Alistair Popple
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace. 
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting. 
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code. 
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
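
      The two halves of this fix can be sketched in userspace C. The names
      toy_page, toy_pte, toy_fault_get_page and toy_can_migrate are
      hypothetical, a pthread mutex stands in for the ptl, and the base
      count of one is a simplifying assumption of this model. The fault path
      pins the page while the lock is held, and the migration check allows
      one extra reference only when it belongs to the faulting page.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a device private page. */
struct toy_page {
    int refcount; /* only changed while the toy_pte lock is held */
};

struct toy_pte {
    pthread_mutex_t lock;  /* stands in for the page table lock (ptl) */
    struct toy_page *page; /* NULL once the page has been migrated away */
};

/* Fault path: look up the page and pin it before dropping the lock. */
static struct toy_page *toy_fault_get_page(struct toy_pte *pte)
{
    struct toy_page *page;

    assert(pte != NULL);
    pthread_mutex_lock(&pte->lock);
    page = pte->page;
    if (page)
        page->refcount++; /* the page cannot be freed under us now */
    pthread_mutex_unlock(&pte->lock);
    return page;
}

/*
 * Migration check: in this model a mapped page normally holds one
 * reference. The pin taken by the fault path is expected, any other
 * extra reference is not.
 */
static bool toy_can_migrate(const struct toy_page *page,
                            const struct toy_page *fault_page)
{
    int expected = 1 + (page == fault_page ? 1 : 0);

    return page->refcount == expected;
}
```

      Passing NULL as fault_page models a migration that was not started by
      a fault, in which case the elevated count is treated as unexpected.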
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. 07 Oct, 2022 1 commit
  7. 30 Sep, 2022 2 commits
  8. 29 Sep, 2022 4 commits
  9. 28 Sep, 2022 3 commits
  10. 20 Sep, 2022 2 commits
  11. 14 Sep, 2022 5 commits
  12. 31 Aug, 2022 1 commit
  13. 30 Aug, 2022 1 commit
  14. 26 Aug, 2022 3 commits
  15. 17 Aug, 2022 3 commits
  16. 11 Aug, 2022 1 commit
  17. 30 Jul, 2022 1 commit
  18. 29 Jul, 2022 2 commits