Commits · 306888b1246bf44e703b6f1ccc746c2746c1a981 · openeuler / Kernel

12 Jan, 2023 · 3 commits

drm/amdkfd: Fix kernel warning during topology setup · 306888b1

Committed by Mukul Joshi on Dec 20, 2022

commit cf97eb7e upstream.

This patch fixes the following kernel warning seen during
driver load by correctly initializing the p2plink attr before
creating the sysfs file:

[  +0.002865] ------------[ cut here ]------------
[  +0.002327] kobject: '(null)' (0000000056260cfb): is not initialized, yet kobject_put() is being called.
[  +0.004780] WARNING: CPU: 32 PID: 1006 at lib/kobject.c:718 kobject_put+0xaa/0x1c0
[  +0.001361] Call Trace:
[  +0.001234]  <TASK>
[  +0.001067]  kfd_remove_sysfs_node_entry+0x24a/0x2d0 [amdgpu]
[  +0.003147]  kfd_topology_update_sysfs+0x3d/0x750 [amdgpu]
[  +0.002890]  kfd_topology_add_device+0xbd7/0xc70 [amdgpu]
[  +0.002844]  ? lock_release+0x13c/0x2e0
[  +0.001936]  ? smu_cmn_send_smc_msg_with_param+0x1e8/0x2d0 [amdgpu]
[  +0.003313]  ? amdgpu_dpm_get_mclk+0x54/0x60 [amdgpu]
[  +0.002703]  kgd2kfd_device_init.cold+0x39f/0x4ed [amdgpu]
[  +0.002930]  amdgpu_amdkfd_device_init+0x13d/0x1f0 [amdgpu]
[  +0.002944]  amdgpu_device_init.cold+0x1464/0x17b4 [amdgpu]
[  +0.002970]  ? pci_bus_read_config_word+0x43/0x80
[  +0.002380]  amdgpu_driver_load_kms+0x15/0x100 [amdgpu]
[  +0.002744]  amdgpu_pci_probe+0x147/0x370 [amdgpu]
[  +0.002522]  local_pci_probe+0x40/0x80
[  +0.001896]  work_for_cpu_fn+0x10/0x20
[  +0.001892]  process_one_work+0x26e/0x5a0
[  +0.002029]  worker_thread+0x1fd/0x3e0
[  +0.001890]  ? process_one_work+0x5a0/0x5a0
[  +0.002115]  kthread+0xea/0x110
[  +0.001618]  ? kthread_complete_and_exit+0x20/0x20
[  +0.002422]  ret_from_fork+0x1f/0x30
[  +0.001808]  </TASK>
[  +0.001103] irq event stamp: 59837
[  +0.001718] hardirqs last  enabled at (59849): [<ffffffffb30fab12>] __up_console_sem+0x52/0x60
[  +0.004414] hardirqs last disabled at (59860): [<ffffffffb30faaf7>] __up_console_sem+0x37/0x60
[  +0.004414] softirqs last  enabled at (59654): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
[  +0.004205] softirqs last disabled at (59649): [<ffffffffb307d9c7>] irq_exit_rcu+0xd7/0x130
[  +0.004203] ---[ end trace 0000000000000000 ]---

Fixes: 0f28cca8 ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links")
Signed-off-by: Mukul Joshi <mukul.joshi@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

306888b1

drm/amdkfd: Fix double release compute pasid · a02c07b6

Committed by Philip Yang on Dec 13, 2022

[ Upstream commit 1a799c4c ]

If kfd_process_device_init_vm returns failure after the vm is converted to
a compute vm and vm->pasid is set to the compute pasid, KFD will not take
the pdd->drm_file reference. As a result, the drm close file handler may be
called to release the compute pasid before the KFD process destroy worker
releases the same pasid and sets vm->pasid to zero. This generates the
WARNING backtrace and NULL pointer access below.

Add helper amdgpu_amdkfd_gpuvm_set_vm_pasid and call it at the last step
of kfd_process_device_init_vm, to ensure the vm pasid is the original pasid
if acquiring the vm failed, or is the compute pasid with the pdd->drm_file
reference taken, to avoid a double release of the same pasid.

 amdgpu: Failed to create process VM object
 ida_free called for id=32770 which is not allocated.
 WARNING: CPU: 57 PID: 72542 at ../lib/idr.c:522 ida_free+0x96/0x140
 RIP: 0010:ida_free+0x96/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

 BUG: kernel NULL pointer dereference, address: 0000000000000000
 RIP: 0010:ida_free+0x76/0x140
 Call Trace:
  amdgpu_pasid_free_delayed+0xe1/0x2a0 [amdgpu]
  amdgpu_driver_postclose_kms+0x2d8/0x340 [amdgpu]
  drm_file_free.part.13+0x216/0x270 [drm]
  drm_close_helper.isra.14+0x60/0x70 [drm]
  drm_release+0x6e/0xf0 [drm]
  __fput+0xcc/0x280
  ____fput+0xe/0x20
  task_work_run+0x96/0xc0
  do_exit+0x3d0/0xc10

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

a02c07b6

drm/amdkfd: Fix kfd_process_device_init_vm error handling · 9d74d1f5

Committed by Philip Yang on Dec 14, 2022

[ Upstream commit 29d48b87 ]

Only destroy the ib_mem and let the process cleanup worker free
the outstanding BOs. Reset the pointer in the pdd->qpd structure to avoid
a NULL pointer access in the process destroy worker.

 BUG: kernel NULL pointer dereference, address: 0000000000000010
 Call Trace:
  amdgpu_amdkfd_gpuvm_unmap_gtt_bo_from_kernel+0x46/0xb0 [amdgpu]
  kfd_process_device_destroy_cwsr_dgpu+0x40/0x70 [amdgpu]
  kfd_process_destroy_pdds+0x71/0x190 [amdgpu]
  kfd_process_wq_release+0x2a2/0x3b0 [amdgpu]
  process_one_work+0x2a1/0x600
  worker_thread+0x39/0x3d0

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>

9d74d1f5

10 Nov, 2022 · 2 commits

drm/amdkfd: Fix error handling in criu_checkpoint · b91c23e0

Committed by Felix Kuehling on Nov 01, 2022

Checkpoint BOs last. That way we don't need to close dmabuf FDs if
something else fails later. This avoids problematic access to user mode
memory in the error handling code path.

criu_checkpoint_bos has its own error handling and cleanup that does not
depend on access to user memory.

In the private data, keep BOs before the remaining objects. This is
necessary to restore things in the correct order as restoring events
depends on the events-page BO being restored first.

Fixes: be072b06 ("drm/amdkfd: CRIU export BOs as prime dmabuf objects")
Reported-by: Jann Horn <jannh@google.com>
CC: Rajneesh Bhardwaj <Rajneesh.Bhardwaj@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-and-tested-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

b91c23e0

drm/amdkfd: Fix error handling in kfd_criu_restore_events · 66f79037

Committed by Felix Kuehling on Nov 03, 2022

Call mutex_unlock before the exit label, because none of the error code
paths that jump there take that lock. This fixes unbalanced locking errors
in case of restore errors.

Fixes: 40e8a766 ("drm/amdkfd: CRIU checkpoint and restore events")
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

66f79037
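
The locking rule described in the commit above can be illustrated with a small userspace sketch. This is not the driver's code: `toy_lock`, `toy_unlock`, `lock_depth` and `restore_event_sketch` are hypothetical names, and a depth counter stands in for the kernel mutex. It only shows why the unlock belongs before a label that is also reached by paths that never took the lock.

```c
#include <assert.h>
#include <errno.h>

/* Toy lock: a depth counter stands in for a kernel mutex (assumption). */
static int lock_depth;

static void toy_lock(void)   { lock_depth++; }
static void toy_unlock(void) { assert(lock_depth > 0); lock_depth--; }

/*
 * Hypothetical restore routine. Early validation failures jump to `out`
 * without holding the lock, so the unlock sits before the label.
 */
static int restore_event_sketch(int input_ok, int id_available)
{
	int ret = 0;

	if (!input_ok) {
		ret = -EINVAL;	/* lock not taken on this path */
		goto out;
	}

	toy_lock();
	if (!id_available)
		ret = -EEXIST;	/* failure while holding the lock */
	toy_unlock();		/* every locked path unlocks here */
out:
	return ret;
}
```

If the unlock were placed after `out:`, the early `-EINVAL` path would release a lock it never acquired, which is the kind of imbalance the commit message describes.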

03 Nov, 2022 · 2 commits

drm/amdkfd: update GFX11 CWSR trap handler · 6640f8e5

Committed by Jay Cornwall on Oct 13, 2022

Together with the corresponding FW change, this fixes an issue where
triggering CWSR on a workgroup with waves in s_barrier wouldn't lead to a
back-off and would therefore cause a hang.

Signed-off-by: Jay Cornwall <jay.cornwall@amd.com>
Tested-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x

6640f8e5

drm/amdkfd: Fix NULL pointer dereference in svm_migrate_to_ram() · 5b994354

Committed by Yang Li on Oct 26, 2022

./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:985:58-62: ERROR: p is NULL but dereferenced.

Link: https://bugzilla.openanolis.cn/show_bug.cgi?id=2549
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

5b994354

25 Oct, 2022 · 2 commits

drm/amdkfd: correct the cache info for gfx1036 · 969758bb

Committed by Jesse Zhang on Oct 11, 2022

Correct the cache information for gfx1036.

Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Jesse Zhang <jesse.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

969758bb

drm/amdkfd: update gfx1037 Lx cache setting · 9656db1b

Committed by Prike Liang on Oct 20, 2022

Update the gfx1037 L1/L2 cache setting.

Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org

9656db1b

13 Oct, 2022 · 2 commits

mm: free device private pages have zero refcount · ef233450

Committed by Alistair Popple on Sep 28, 2022

Since 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
refcount") device private pages have no longer had an extra reference
count when the page is in use.  However before handing them back to the
owning device driver we add an extra reference count such that free pages
have a reference count of one.

This makes it difficult to tell if a page is free or not because both free
and in-use pages will have a non-zero refcount.  Instead we should return
pages to the driver's page allocator with a zero reference count.  Kernel
code can then safely use kernel functions such as get_page_unless_zero().

Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

ef233450
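
The benefit of returning free pages with a zero reference count can be sketched in userspace with C11 atomics. This is only an illustration of the "take a reference unless the count is zero" idea. `struct page_sketch` and `get_unless_zero_sketch` are hypothetical names, not the kernel's `struct page` or `get_page_unless_zero()` implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the refcount is modelled. */
struct page_sketch {
	atomic_int refcount;	/* 0 means "free, owned by the allocator" */
};

/* Take a reference only if the page is currently in use. */
static bool get_unless_zero_sketch(struct page_sketch *page)
{
	int old;

	assert(page != NULL);
	old = atomic_load(&page->refcount);
	while (old != 0) {
		/* On failure, `old` is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&page->refcount, &old, old + 1))
			return true;
	}
	return false;	/* page was free; no reference taken */
}
```

With free pages held at a count of one, a helper like this could not tell a free page from a page with a single user, which is the ambiguity the commit message points out.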

mm/memory.c: fix race when faulting a device private page · 16ce101d

Committed by Alistair Popple on Sep 28, 2022

Patch series "Fix several device private page reference counting issues",
v2

This series aims to fix a number of page reference counting issues in
drivers dealing with device private ZONE_DEVICE pages.  These result in
use-after-free type bugs, either from accessing a struct page which no
longer exists because it has been removed or accessing fields within the
struct page which are no longer valid because the page has been freed.

During normal usage it is unlikely these will cause any problems.  However
without these fixes it is possible to crash the kernel from userspace. 
These crashes can be triggered either by unloading the kernel module or
unbinding the device from the driver prior to a userspace task exiting. 
In modules such as Nouveau it is also possible to trigger some of these
issues by explicitly closing the device file-descriptor prior to the task
exiting and then accessing device private memory.

This involves some minor changes to both PowerPC and AMD GPU code.
Unfortunately I lack hardware to test either of those so any help there
would be appreciated.  The changes mimic what is done for both Nouveau
and hmm-tests though so I doubt they will cause problems.


This patch (of 8):

When the CPU tries to access a device private page the migrate_to_ram()
callback associated with the pgmap for the page is called.  However no
reference is taken on the faulting page.  Therefore a concurrent migration
of the device private page can free the page and possibly the underlying
pgmap.  This results in a race which can crash the kernel due to the
migrate_to_ram() function pointer becoming invalid.  It also means drivers
can't reliably read the zone_device_data field because the page may have
been freed with memunmap_pages().

Close the race by getting a reference on the page while holding the ptl to
ensure it has not been freed.  Unfortunately the elevated reference count
will cause the migration required to handle the fault to fail.  To avoid
this failure pass the faulting page into the migrate_vma functions so that
if an elevated reference count is found it can be checked to see if it's
expected or not.

[mpe@ellerman.id.au: fix build]
  Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Alex Deucher <alexander.deucher@amd.com>
Cc: Alex Sierra <alex.sierra@amd.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

16ce101d

07 Oct, 2022 · 1 commit

drm/amdgpu: Enable F32_WPTR_POLL_ENABLE in mqd · 21a550de

Committed by Ruili Ji on Oct 03, 2022

This patch fixes the missing SDMA user queue doorbell issue on
SDMA 6.0. F32_WPTR_POLL_ENABLE has to be set if doorbell mode is
used. Otherwise ringing the SDMA user queue doorbell can't wake the
system up from gfxoff.

Signed-off-by: Ruili Ji <ruiliji2@amd.com>
Reviewed-by: Yifan Zhang <yifan1.zhang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cc: stable@vger.kernel.org # 6.0.x

21a550de

30 Sep, 2022 · 2 commits

drm/amdkfd: Fix UBSAN shift-out-of-bounds warning · b292cafe

Committed by Felix Kuehling on Sep 21, 2022

This was fixed in initialize_cpsch before, but not in initialize_nocpsch.
Factor sdma bitmap initialization into a helper function to apply the
correct implementation in both cases without duplicating it.

v2: Added a range check

Reported-by: Ellis Michael <ellis@ellismichael.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Graham Sider <Graham.Sider@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b292cafe
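
As a rough illustration of why a shared helper with a range check avoids this class of warning, the sketch below builds a mask of the lowest N bits without ever shifting by the full width of the type. It is an assumption-laden example, not the driver's bitmap code: `low_bits_mask_sketch` is a hypothetical name, and the 64-bit width is chosen only for the demonstration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return a mask with the lowest `nbits` bits set.
 * Shifting a 64-bit value by 64 or more is undefined in C, so the edge
 * cases are handled before any shift happens.
 */
static uint64_t low_bits_mask_sketch(unsigned int nbits)
{
	if (nbits == 0)
		return 0;
	if (nbits >= 64)	/* range check: clamp instead of over-shifting */
		return UINT64_MAX;
	return (UINT64_C(1) << nbits) - 1;
}
```

Putting the edge-case handling in one function means every caller gets the same behavior, which matches the commit's stated goal of applying the correct implementation in both initialization paths without duplicating it.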

drm/amdkfd: Track unified memory when switching xnack mode · 8a7c3ce1

Committed by Philip Yang on Sep 07, 2022

Unified memory usage with xnack off is tracked to avoid oversubscribing
system memory. With xnack on, we don't track unified memory usage, to
allow memory oversubscription. When switching xnack mode from off to on,
subsequently freed ranges that were allocated with xnack off will not
unreserve memory. When switching xnack mode from on to off, subsequently
freed ranges that were allocated with xnack on will unreserve memory. Both
cases leave the memory accounting unbalanced.

When switching xnack mode from on to off, we need to reserve the already
allocated svm range memory. When switching xnack mode from off to on, we
need to unreserve the already allocated svm range memory.

v6: Take prange lock to access range child list
v5: Handle prange child ranges
v4: Handle reservation memory failure
v3: Handle switching xnack mode race with svm_range_deferred_list_work
v2: Handle both switching xnack from on to off and from off to on cases

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

8a7c3ce1

29 Sep, 2022 · 4 commits

drm/amdkfd: fix dropped interrupt in kfd_int_process_v11 · 15afe323

Committed by Graham Sider on Sep 23, 2022

Shader wave interrupts were getting dropped in event_interrupt_wq_v11
if the PRIV bit was set to 1. This would often lead to a hang. Until
debugger logic is upstreamed, expand comment to stop early return.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

15afe323

drm/amdgpu: pass queue size and is_aql_queue to MES · 3e9cf234

Committed by Graham Sider on Sep 19, 2022

Update mes_v11_api_def.h add_queue API with is_aql_queue parameter. Also
re-use gds_size for the queue size (unused for KFD). MES requires the
queue size in order to compute the actual wptr offset within the queue
RB since it increases monotonically for AQL queues.

v2: Make is_aql_queue assign clearer

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3e9cf234

drm/amdkfd: fix MQD init for GFX11 in init_mqd · 7971b5c2

Committed by Graham Sider on Sep 20, 2022

Set remaining compute_static_thread_mgmt_se* accordingly.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7971b5c2

drm/amdgpu: Enable SA software trap. · 585a8261

Committed by David Belanger on Aug 25, 2022

Enables support for software trap for MES >= 4.
Adapted from implementation from Jay Cornwall.

v2: Add IP version check in conditions.
v3: Remove debugger code changes.

Signed-off-by: Jay Cornwall <Jay.Cornwall@amd.com>
Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

585a8261

28 Sep, 2022 · 3 commits

drm/amdkfd: fix dropped interrupt in kfd_int_process_v11 · 664883dd

Committed by Graham Sider on Sep 23, 2022

664883dd

drm/amdgpu: pass queue size and is_aql_queue to MES · 91ef6cfd

Committed by Graham Sider on Sep 19, 2022

Update mes_v11_api_def.h add_queue API with is_aql_queue parameter. Also
re-use gds_size for the queue size (unused for KFD). MES requires the
queue size in order to compute the actual wptr offset within the queue
RB since it increases monotonically for AQL queues.

v2: Make is_aql_queue assign clearer

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

91ef6cfd

drm/amdkfd: fix MQD init for GFX11 in init_mqd · a9b47002

Committed by Graham Sider on Sep 20, 2022

Set remaining compute_static_thread_mgmt_se* accordingly.

Signed-off-by: Graham Sider <Graham.Sider@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a9b47002

20 Sep, 2022 · 2 commits

drm/amdkfd: Fix spelling mistake "detroyed" -> "destroyed" · e6a7746e

Committed by Colin Ian King on Sep 14, 2022

There is a spelling mistake in a pr_debug message. Fix it.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e6a7746e

drm/amdkfd: Use the consolidated MQD manager functions for GFX11 · b98451dc

Committed by Shiwu Zhang on Sep 07, 2022

To remove duplication for GFX11 as well, use the common MQD manager
functions defined in kfd_mqd_manager.c for all versions of managers.

Signed-off-by: Shiwu Zhang <shiwu.zhang@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Mukul Joshi <mukul.joshi@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

b98451dc

14 Sep, 2022 · 5 commits

drm/amdkfd: Migrate in CPU page fault use current mm · 3a876060

Committed by Philip Yang on Sep 08, 2022

migrate_vma_setup shows the warning below because we don't hold the other
process's mm mmap_lock. We should use the current vmf->vma->vm_mm instead;
the caller already holds the current mmap lock inside the CPU page fault
handler.

 WARNING: CPU: 10 PID: 3054 at include/linux/mmap_lock.h:155 find_vma
 Call Trace:
  walk_page_range+0x76/0x150
  migrate_vma_setup+0x18a/0x640
  svm_migrate_vram_to_ram+0x245/0xa10 [amdgpu]
  svm_migrate_to_ram+0x36f/0x470 [amdgpu]
  do_swap_page+0xcfe/0xec0
  __handle_mm_fault+0x96b/0x15e0
  handle_mm_fault+0x13f/0x3e0
  do_user_addr_fault+0x1e7/0x690

Fixes: e1f84eef ("drm/amdkfd: handle CPU fault on COW mapping")
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3a876060

drm/amdkfd: Remove prefault before migrating to VRAM · c969c5fd

Committed by Philip Yang on Jul 26, 2022

Prefaulting potentially allocates system memory pages before a
migration. This adds unnecessary overhead. Instead we can skip
unallocated pages in the migration and just point migrate->dst to a
0-initialized VRAM page directly. Then the VRAM page will be inserted
to the PTE. A subsequent CPU page fault will migrate the page back to
system memory.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c969c5fd

drm/amdkfd: handle CPU fault on COW mapping · e1f84eef

Committed by Philip Yang on Sep 07, 2022

If a CPU page fault occurs in a page whose zone_device_data svm_bo belongs
to another process, it is a COW mapping in the child process and the
range was migrated to VRAM by the parent process. Migrate the parent
process range back to system memory to recover from the CPU page fault.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e1f84eef

amd/amdkfd: fix repeated words in comments · 7a3f8b7c

Committed by wangjianli on Sep 08, 2022

Delete the redundant word 'to'.

Signed-off-by: wangjianli <wangjianli@cdjrlc.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7a3f8b7c

drm/amdkfd: Fix CRIU restore op due to doorbell offset · a4a3798f

Committed by Rajneesh Bhardwaj on Sep 07, 2022

The recently introduced change to allocate doorbells only when the first
queue is created or mapped for CPU / GPU access did not fully consider
the Checkpoint Restore scenario. This fix allows the CRIU restore
operation by extending the doorbell optimization to the CRIU restore
scenario.

Fixes: 16f00131 ("drm/amdkfd: Allocate doorbells only when needed")
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

a4a3798f

31 Aug, 2022 · 1 commit

drm/amdkfd: Added GFX 11.0.3 Support · 5ddb5fe9

Committed by David Belanger on Jul 26, 2022

Added missing cases for GFX 11.0.3 code in a few switch statements.

Signed-off-by: David Belanger <david.belanger@amd.com>
Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

5ddb5fe9

30 Aug, 2022 · 1 commit

drm/amdkfd: remove redundant variables err and ret · f4f5e507

Committed by Jinpeng Cui on Aug 29, 2022

Return value from kfd_wait_on_events() and io_remap_pfn_range() directly
instead of taking this in another redundant variable.

Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Jinpeng Cui <cui.jinpeng2@zte.com.cn>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f4f5e507

26 Aug, 2022 · 3 commits

drm/amdkfd: Fix isa version for the GC 10.3.7 · ee8086db

Committed by Prike Liang on Aug 24, 2022

Correct the isa version for handling KFD test.

Fixes: 7c4f4f19 ("drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

ee8086db

drm/amdkfd: Fix isa version for the GC 10.3.7 · 2724efa3

Committed by Prike Liang on Aug 24, 2022

Correct the isa version for handling KFD test.

Fixes: 7c4f4f19 ("drm/amdkfd: Add GC 10.3.6 and 10.3.7 KFD definitions")
Signed-off-by: Prike Liang <Prike.Liang@amd.com>
Reviewed-by: Aaron Liu <aaron.liu@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

2724efa3

drm/amdkfd: Allocate doorbells only when needed · 16f00131

Committed by Felix Kuehling on Aug 03, 2022

Only allocate doorbells when the first queue is created on a GPU or the
doorbells need to be mapped into CPU or GPU virtual address space. This
avoids allocating doorbells unnecessarily and can allow more processes
to use KFD on multi-GPU systems.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Kent Russell <kent.Russell@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

16f00131

17 Aug, 2022 · 3 commits

drm/amdkfd: potential crash in kfd_create_indirect_link_prop() · 7d50b92d

Committed by Dan Carpenter on Aug 12, 2022

This code has two bugs.  If kfd_topology_device_by_proximity_domain()
failed on the first iteration through the loop then "cpu_link" is
uninitialized and should not be dereferenced.

The second bug is that we cannot dereference a list iterator when it
points to the list head.  In other words, if we exit the
list_for_each_entry() loop without hitting a break then "cpu_link"
is not a valid pointer and should not be dereferenced.

Fix both of these problems by setting "cpu_link" to NULL when it is invalid
and non-NULL when it is valid.  That makes it easier to test for
valid vs invalid.

Fixes: 0f28cca8 ("drm/amdkfd: Extend KFD device topology to surface peer-to-peer links")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

7d50b92d
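
The "NULL when invalid, non-NULL when valid" pattern from the commit above can be sketched in userspace as follows. The list type and all names here (`node_sketch`, `link_sketch`, `link_of`, `find_link_sketch`) are hypothetical stand-ins, not the kernel's `list_head` API or the KFD topology structures.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular list with a sentinel head (userspace stand-in). */
struct node_sketch {
	struct node_sketch *next;
};

/* Hypothetical entry type embedding the list node. */
struct link_sketch {
	int node_to;
	struct node_sketch list;
};

#define link_of(ptr) \
	((struct link_sketch *)((char *)(ptr) - offsetof(struct link_sketch, list)))

/*
 * Return the entry whose node_to matches, or NULL if the walk reaches
 * the head again. The result pointer is only set on a hit, so it never
 * aliases the sentinel head.
 */
static struct link_sketch *find_link_sketch(struct node_sketch *head, int node_to)
{
	struct link_sketch *found = NULL;
	struct node_sketch *pos;

	assert(head != NULL);
	for (pos = head->next; pos != head; pos = pos->next) {
		struct link_sketch *entry = link_of(pos);

		if (entry->node_to == node_to) {
			found = entry;
			break;
		}
	}
	return found;
}
```

Callers then test the returned pointer against NULL instead of reusing the loop cursor, which after a full walk would point at memory computed from the list head rather than at a real entry.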

drm/amdkfd: reserve 2 queues for sdma 6.0.1 in bitmap · e48e6a13

Committed by Yifan Zhang on Aug 10, 2022

There is only one engine in sdma 6.0.1, so the total number of
reserved queues should be 2. Reflect this number in the bitmap as well.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Tim Huang <Tim.Huang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

e48e6a13

drm/amdkfd: Fix mm reference in SVM eviction worker · c0289557

Committed by Felix Kuehling on Aug 08, 2022

Use the mm reference from the fence. This allows removing the
svm_bo->svms pointer, which was problematic because we cannot assume
that the struct kfd_process containing the svms is still allocated
without holding a refcount on the process.

Use mmget_not_zero to ensure the mm is still valid, and drop the svm_bo
reference if it isn't.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

c0289557

11 Aug, 2022 · 1 commit

drm/amdkfd: Handle restart of kfd_ioctl_wait_events · bea9a56a

Committed by Felix Kuehling on Aug 04, 2022

When kfd_ioctl_wait_events needs to restart due to a signal, we need to
update the timeout to account for the time already elapsed. We also need
to undo auto_reset of events that have signaled already, so that the
restarted ioctl will be able to count those signals again.

This fixes infinite hangs when kfd_ioctl_wait_events is interrupted by a
signal.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-and-tested-by: Xiaogang Chen <Xiaogang.Chen@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

bea9a56a
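
The two restart steps named in the commit above (shrinking the timeout by the time already spent, and undoing the auto-reset of events that had already signaled) might look roughly like this in a userspace model. Every type and function name here is hypothetical, and the sketch makes no claim about how the KFD event code is actually structured.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event: auto-reset events clear `signaled` when consumed. */
struct event_sketch {
	bool auto_reset;
	bool signaled;
};

/* Hypothetical per-wait record: `activated` means this wait saw the event fire. */
struct waiter_sketch {
	struct event_sketch *event;
	bool activated;
};

/* Time left for the restarted wait; clamps at zero once overdue. */
static uint32_t remaining_timeout_ms_sketch(uint32_t timeout_ms, uint32_t elapsed_ms)
{
	if (elapsed_ms >= timeout_ms)
		return 0;
	return timeout_ms - elapsed_ms;
}

/* Re-signal auto-reset events this wait already consumed, so a restart sees them. */
static void undo_auto_reset_sketch(struct waiter_sketch *waiters, size_t count)
{
	size_t i;

	assert(waiters != NULL || count == 0);
	for (i = 0; i < count; i++) {
		if (waiters[i].activated && waiters[i].event->auto_reset)
			waiters[i].event->signaled = true;
	}
}
```

Without the first step a repeatedly interrupted wait would keep restarting with its full timeout, and without the second a restarted wait could block on an event that had in fact already fired.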

30 Jul, 2022 · 1 commit

drm/amdkfd: use time_is_before_jiffies(a + b) to replace "jiffies - a > b" · dcfe584b

Committed by Yu Zhe on Jul 28, 2022

time_is_before_jiffies deals with timer wrapping correctly.

Signed-off-by: Yu Zhe <yuzhe@nfschina.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

dcfe584b
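
For readers unfamiliar with the jiffies helpers, the wrap-safe comparison they rely on can be modelled in userspace as below. The `_sketch` functions are hypothetical stand-ins written for this illustration. They mirror the signed-difference idea behind the kernel's `time_after()` family, and `now` is passed in explicitly instead of reading a global tick counter.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Stand-in for time_after(a, b): true if tick `a` is later than tick `b`.
 * The unsigned subtraction wraps, and the signed view of the difference
 * gives the ordering as long as the two ticks are less than half the
 * counter range apart.
 */
static bool time_after_sketch(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Stand-in for time_is_before_jiffies(t): true once `now` has passed `t`. */
static bool time_is_before_sketch(unsigned long t, unsigned long now)
{
	return time_after_sketch(now, t);
}
```

A deadline computed as `start + timeout` may wrap past zero. The comparison still orders it correctly relative to `now`, so a timeout check can be written as `time_is_before_sketch(start + timeout, now)`.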

29 Jul, 2022 · 2 commits

drm/amdgpu: add debugfs for kfd system and ttm mem used · 3d2af401

Committed by Alex Sierra on Jun 13, 2022

This keeps track of kfd system mem used and kfd ttm mem used.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

3d2af401

drm/amdkfd: track unified memory reservation with xnack off · f9af3c16

Committed by Alex Sierra on May 17, 2022

[WHY]
Unified memory with xnack off should be tracked, as userptr mappings
and legacy allocations are, to avoid oversubscribing system memory when
xnack is off.
[HOW]
Expose functions reserve_mem_limit and unreserve_mem_limit to the SVM
API and call them on every prange creation and free.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

f9af3c16

openeuler / Kernel · synced successfully nearly 2 years ago
