- 13 10月, 2022 2 次提交
-
-
由 Alistair Popple 提交于
Since 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page refcount") device private pages have no longer had an extra reference count when the page is in use. However before handing them back to the owning device driver we add an extra reference count such that free pages have a reference count of one. This makes it difficult to tell if a page is free or not because both free and in use pages will have a non-zero refcount. Instead we should return pages to the drivers page allocator with a zero reference count. Kernel code can then safely use kernel functions such as get_page_unless_zero(). Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.comSigned-off-by: NAlistair Popple <apopple@nvidia.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Christian König <christian.koenig@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Alistair Popple 提交于
Patch series "Fix several device private page reference counting issues", v2 This series aims to fix a number of page reference counting issues in drivers dealing with device private ZONE_DEVICE pages. These result in use-after-free type bugs, either from accessing a struct page which no longer exists because it has been removed or accessing fields within the struct page which are no longer valid because the page has been freed. During normal usage it is unlikely these will cause any problems. However without these fixes it is possible to crash the kernel from userspace. These crashes can be triggered either by unloading the kernel module or unbinding the device from the driver prior to a userspace task exiting. In modules such as Nouveau it is also possible to trigger some of these issues by explicitly closing the device file-descriptor prior to the task exiting and then accessing device private memory. This involves some minor changes to both PowerPC and AMD GPU code. Unfortunately I lack hardware to test either of those so any help there would be appreciated. The changes mimic what is done in for both Nouveau and hmm-tests though so I doubt they will cause problems. This patch (of 8): When the CPU tries to access a device private page the migrate_to_ram() callback associated with the pgmap for the page is called. However no reference is taken on the faulting page. Therefore a concurrent migration of the device private page can free the page and possibly the underlying pgmap. This results in a race which can crash the kernel due to the migrate_to_ram() function pointer becoming invalid. It also means drivers can't reliably read the zone_device_data field because the page may have been freed with memunmap_pages(). Close the race by getting a reference on the page while holding the ptl to ensure it has not been freed. Unfortunately the elevated reference count will cause the migration required to handle the fault to fail. To avoid this failure pass the faulting page into the migrate_vma functions so that if an elevated reference count is found it can be checked to see if it's expected or not. [mpe@ellerman.id.au: fix build] Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.comSigned-off-by: NAlistair Popple <apopple@nvidia.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Lyude Paul <lyude@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alex Sierra <alex.sierra@amd.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian König <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Yang Shi <shy828301@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 29 7月, 2022 1 次提交
-
-
由 Philip Yang 提交于
This will be used to split giant svm range into smaller ranges, to support VRAM overcommitment by giant range and improve GPU retry fault recover on giant range. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 18 7月, 2022 1 次提交
-
-
由 Alex Sierra 提交于
When CPU is connected throug XGMI, it has coherent access to VRAM resource. In this case that resource is taken from a table in the device gmc aperture base. This resource is used along with the device type, which could be DEVICE_PRIVATE or DEVICE_COHERENT to create the device page map region. Also, MIGRATE_VMA_SELECT_DEVICE_COHERENT flag is selected for coherent type case during migration to device. Link: https://lkml.kernel.org/r/20220715150521.18165-8-alex.sierra@amd.comSigned-off-by: NAlex Sierra <alex.sierra@amd.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
- 01 7月, 2022 1 次提交
-
-
由 Philip Yang 提交于
For migration start and end event, output timestamp when migration starts, ends, svm range address and size, GPU id of migration source and destination and svm range attributes, Migration trigger could be prefetch, CPU or GPU page fault and TTM eviction. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 04 6月, 2022 1 次提交
-
-
由 Philip Yang 提交于
Migration range from system memory to VRAM, if system page can not be locked or unmapped, we do partial migration and leave some pages in system memory. Several bugs found to copy pages and update GPU mapping for this situation: 1. copy to vram should use migrate->npage which is total pages of range as npages, not migrate->cpages which is number of pages can be migrated. 2. After partial copy, set VRAM res cursor as j + 1, j is number of system pages copied plus 1 page to skip copy. 3. copy to ram, should collect all continuous VRAM pages and copy together. 4. Call amdgpu_vm_update_range, should pass in offset as bytes, not as number of pages. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com> Cc: stable@vger.kernel.org
-
- 23 4月, 2022 1 次提交
-
-
由 Yang Wang 提交于
simplify programming with existing functions. Signed-off-by: NYang Wang <KevinYang.Wang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 16 3月, 2022 1 次提交
-
-
由 Philip Yang 提交于
Migrate vram to ram may fail to find the vma if process is exiting and vma is removed, evict svm bo worker sets prange->svm_bo to NULL and warn svm_bo ref count != 1 only if migrating vram to ram successfully. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 04 3月, 2022 2 次提交
-
-
由 Christoph Hellwig 提交于
ZONE_DEVICE struct pages have an extra reference count that complicates the code for put_page() and several places in the kernel that need to check the reference count to see that a page is not being used (gup, compaction, migration, etc.). Clean up the code so the reference count doesn't need to be treated specially for ZONE_DEVICE pages. Note that this excludes the special idle page wakeup for fsdax pages, which still happens at refcount 1. This is a separate issue and will be sorted out later. Given that only fsdax pages require the notifiacation when the refcount hits 1 now, the PAGEMAP_OPS Kconfig symbol can go away and be replaced with a FS_DAX check for this hook in the put_page fastpath. Based on an earlier patch from Ralph Campbell <rcampbell@nvidia.com>. Link: https://lkml.kernel.org/r/20220210072828.2930359-8-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NRalph Campbell <rcampbell@nvidia.com> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Reviewed-by: NDan Williams <dan.j.williams@intel.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Tested-by: N"Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Chaitanya Kulkarni <kch@nvidia.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
-
由 Christoph Hellwig 提交于
hmm.h pulls in the world for no good reason at all. Remove the includes and push a few ones into the users instead. Link: https://lkml.kernel.org/r/20220210072828.2930359-4-hch@lst.deSigned-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Tested-by: N"Sierra Guiza, Alejandro (Alex)" <alex.sierra@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Ben Skeggs <bskeggs@redhat.com> Cc: Christian Knig <christian.koenig@amd.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Felix Kuehling <Felix.Kuehling@amd.com> Cc: Karol Herbst <kherbst@redhat.com> Cc: Lyude Paul <lyude@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: "Pan, Xinhui" <Xinhui.Pan@amd.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
-
- 15 2月, 2022 1 次提交
-
-
由 Rajneesh Bhardwaj 提交于
A bunch of errors and warnings are leftover KFD over the years, attempt to fix the errors and most warnings reported by checkpatch tool. Still a few warnings remain which may be false positives so ignore them for now. Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NRajneesh Bhardwaj <rajneesh.bhardwaj@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 12 2月, 2022 1 次提交
-
-
由 Alex Sierra 提交于
Avoid spam the kernel log on application memory allocation failures. __func__ argument was also removed from dev_fmt macro due to parameter conflicts with dynamic_dev_dbg. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NPhilip Yang <Philip.Yang@amd.comi> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 20 1月, 2022 1 次提交
-
-
由 Christian König 提交于
That's just a leftover from old radeon days and was preventing CS and GART bindings before the hardware was initialized. But nowdays that is perfectly valid. The only thing we need to warn about are GART binding before the table is even allocated. Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NGuchun Chen <guchun.chen@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 17 12月, 2021 1 次提交
-
-
由 Philip Yang 提交于
Add svm_range_bo_unref_async to schedule work to wait for svm_bo eviction work done and then free svm_bo. __do_munmap put_page is atomic context, call svm_range_bo_unref_async to avoid warning invalid wait context. Other non atomic context call svm_range_bo_unref. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 14 12月, 2021 1 次提交
-
-
由 Isabella Basso 提交于
This fixes various warnings relating to erroneous docstring syntax, of which some are listed below: warning: Function parameter or member 'adev' not described in 'amdgpu_atomfirmware_ras_rom_addr' ... warning: expecting prototype for amdgpu_atpx_validate_functions(). Prototype was for amdgpu_atpx_validate() instead ... warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new' ... warning: Cannot understand * @kfd_get_cu_occupancy - Collect number of waves in-flight on this device ... warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst Signed-off-by: NIsabella Basso <isabbasso@riseup.net> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 18 11月, 2021 1 次提交
-
-
由 Graham Sider 提交于
Switch to IP version checking instead of asic_type on various KFD version checks. Signed-off-by: NGraham Sider <Graham.Sider@amd.com> Reviewed-by: NAlex Deucher <alexander.deucher@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 12 11月, 2021 1 次提交
-
-
由 Alistair Popple 提交于
MIGRATE_PFN_LOCKED is used to indicate to migrate_vma_prepare() that a source page was already locked during migrate_vma_collect(). If it wasn't then the a second attempt is made to lock the page. However if the first attempt failed it's unlikely a second attempt will succeed, and the retry adds complexity. So clean this up by removing the retry and MIGRATE_PFN_LOCKED flag. Destination pages are also meant to have the MIGRATE_PFN_LOCKED flag set, but nothing actually checks that. Link: https://lkml.kernel.org/r/20211025041608.289017-1-apopple@nvidia.comSigned-off-by: NAlistair Popple <apopple@nvidia.com> Reviewed-by: NRalph Campbell <rcampbell@nvidia.com> Acked-by: NFelix Kuehling <Felix.Kuehling@amd.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ben Skeggs <bskeggs@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 11月, 2021 1 次提交
-
-
由 Alex Sierra 提交于
[Why]: When we call hmm_range_fault to map memory after a migration, we don't expect memory to be migrated again as a result of hmm_range_fault. The driver ensures that all memory is in GPU-accessible locations so that no migration should be needed. However, there is one corner case where hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE back to system memory due to a write-fault when a system memory page in the same range was mapped read-only (e.g. COW). Ranges with individual pages in different locations are usually the result of failed page migrations (e.g. page lock contention). The unexpected migration back to system memory causes a deadlock from recursive locking in our driver. [How]: Creating a task reference new member under svm_range_list struct. Setting this with "current" reference, right before the hmm_range_fault is called. This member is checked against "current" reference at svm_migrate_to_ram callback function. If equal, the migration will be ignored. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 04 11月, 2021 1 次提交
-
-
由 Felix Kuehling 提交于
If some pages fail to migrate to system memory, don't update prange->actual_loc = 0. This prevents endless CPU page faults after partial migration failures due to contested page locks. Migration to RAM must be complete during migrations from VRAM to VRAM and during evictions. Implement retry and fail if the migration to RAM fails. Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NPhilip Yang <Philip.Yang@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 22 10月, 2021 2 次提交
-
-
由 Philip Yang 提交于
Not all migrate.cpages returned from migrate_vma_setup can be migrated, for example non anonymous page, or out of device memory. So after migrate_vma_pages returns, add debug message to count pages are successfully migrated which has MIGRATE_PFN_VALID and MIGRATE_PFN_MIGRATE flag set. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
cpages is only updated by migrate_vma_setup. So capture its value at that point to clarify the significance of the number. The next patch will add counting of actually migrated pages after migrate_vma_pages for debug purposes. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 14 10月, 2021 2 次提交
-
-
由 Philip Yang 提交于
migrate_vma_setup may return cpages 0, means 0 page can be migrated, treat this as error case to skip the rest of vma migration steps. Change svm_migrate_vma_to_vram and svm_migrate_vma_to_ram to return the number of pages migrated successfully or error code. The caller add up all the successful migration pages and update prange->actual_loc only if the total migrated pages is not 0. This also removes the warning message "VRAM BO missing during validation" if migration cpages is 0. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
No function change, use pr_debug_ratelimited to avoid per page debug message overflowing dmesg buf and console log. use dev_err to show error message from unexpected situation, to provide clue to help debug without enabling dynamic debug log. Define dev_fmt to output function name in error message. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 30 9月, 2021 1 次提交
-
-
由 Yang Li 提交于
Use resource_size function on resource object instead of explicit computation. Clean up coccicheck warning: ./drivers/gpu/drm/amd/amdkfd/kfd_migrate.c:905:10-13: ERROR: Missing resource_size with res Reported-by: NAbaci Robot <abaci@linux.alibaba.com> Reviewed-by: Amos Kong <kongjianjun@gmail.com> Signed-off-by: NYang Li <yang.lee@linux.alibaba.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 24 9月, 2021 4 次提交
-
-
由 Philip Yang 提交于
Device manager releases device-specific resources when a driver disconnects from a device, devm_memunmap_pages and devm_release_mem_region calls in svm_migrate_fini are redundant. It causes below warning trace after patch "drm/amdgpu: Split amdgpu_device_fini into early and late", so remove function svm_migrate_fini. BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718 WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795 devm_release_action+0x51/0x60 Call Trace: ? memunmap_pages+0x360/0x360 svm_migrate_fini+0x2d/0x60 [amdgpu] kgd2kfd_device_exit+0x23/0xa0 [amdgpu] amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu] amdgpu_device_fini_sw+0x45/0x290 [amdgpu] amdgpu_driver_release_kms+0x12/0x30 [amdgpu] drm_dev_release+0x20/0x40 [drm] release_nodes+0x196/0x1e0 device_release_driver_internal+0x104/0x1d0 driver_detach+0x47/0x90 bus_remove_driver+0x7a/0xd0 pci_unregister_driver+0x3d/0x90 amdgpu_exit+0x11/0x20 [amdgpu] Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
If svm migration init failed to create pgmap for device memory, set pgmap type to 0 to disable device SVM support capability. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
Device manager releases device-specific resources when a driver disconnects from a device, devm_memunmap_pages and devm_release_mem_region calls in svm_migrate_fini are redundant. It causes below warning trace after patch "drm/amdgpu: Split amdgpu_device_fini into early and late", so remove function svm_migrate_fini. BUG: https://gitlab.freedesktop.org/drm/amd/-/issues/1718 WARNING: CPU: 1 PID: 3646 at drivers/base/devres.c:795 devm_release_action+0x51/0x60 Call Trace: ? memunmap_pages+0x360/0x360 svm_migrate_fini+0x2d/0x60 [amdgpu] kgd2kfd_device_exit+0x23/0xa0 [amdgpu] amdgpu_amdkfd_device_fini_sw+0x1d/0x30 [amdgpu] amdgpu_device_fini_sw+0x45/0x290 [amdgpu] amdgpu_driver_release_kms+0x12/0x30 [amdgpu] drm_dev_release+0x20/0x40 [drm] release_nodes+0x196/0x1e0 device_release_driver_internal+0x104/0x1d0 driver_detach+0x47/0x90 bus_remove_driver+0x7a/0xd0 pci_unregister_driver+0x3d/0x90 amdgpu_exit+0x11/0x20 [amdgpu] Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Philip Yang 提交于
If svm migration init failed to create pgmap for device memory, set pgmap type to 0 to disable device SVM support capability. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 01 7月, 2021 6 次提交
-
-
由 Alex Sierra 提交于
Each zone-device page holds a reference to the SVM BO that manages its backing storage. This is necessary to correctly hold on to the BO in case zone_device pages are shared with a child-process. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Sierra 提交于
This is for debug purposes only. It conditionally generates partial migrations to test mixed CPU/GPU memory domain pages in a prange easily. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Sierra 提交于
Migration skipped for pages that are already in VRAM domain. These could be the result of previous partial migrations to SYS RAM, and prefetch back to VRAM. Ex. Coherent pages in VRAM that were not written/invalidated after a copy-on-write. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Sierra 提交于
Invalid pages can be the result of pages that have been migrated already due to copy-on-write procedure or pages that were never migrated to VRAM in first place. This is not an issue anymore, as pranges now support mixed memory domains (CPU/GPU). Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Sierra 提交于
svm_range_prefault is called right before migrations to VRAM, to make sure pages are resident in system memory before the migration. With partial migrations, this reference is used by hmm range get pages to avoid migrating pages that are already in the same VRAM domain. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Alex Sierra 提交于
GPUs in the same XGMI hive have direct access to all members'VRAM. When mapping memory to a GPU, we don't need hmm_range_fault to fault device-private pages in the same hive back to the host. Identifying the page owner as the hive, rather than the individual GPU, accomplishes this. Signed-off-by: NAlex Sierra <alex.sierra@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 30 6月, 2021 1 次提交
-
-
由 Philip Yang 提交于
Add helper function to get process device data structure from adev to update counters. Update vm faults, page_in, page_out counters will no be executed in parallel, use WRITE_ONCE to avoid any form of compiler optimizations. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 04 6月, 2021 1 次提交
-
-
由 Christian König 提交于
Access to the mm_node is now forbidden. So instead of hand wiring that use the cursor functionality. v2: fix handling as pointed out by Philip. Signed-off-by: NChristian König <christian.koenig@amd.com> Reviewed-by: NMatthew Auld <matthew.auld@intel.com> Reviewed-by and Tested-by: Philip Yang <philip.yang@amd.com> Link: https://patchwork.freedesktop.org/patch/msgid/20210602100914.46246-5-christian.koenig@amd.com
-
- 11 5月, 2021 2 次提交
-
-
由 Philip Yang 提交于
If migration copy failed because process is killed, or out of VRAM or system memory, pass error code back to caller to handle error gracefully. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
由 Felix Kuehling 提交于
This function is only used in this source file. Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Reviewed-by: NRodrigo Siqueira <Rodrigo.Siqueira@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 29 4月, 2021 1 次提交
-
-
由 Philip Yang 提交于
Use devm_memunmap_pages instead of memunmap_pages to release pgmap and remove pgmap from device action, to avoid double free pgmap when unloading driver module. Release device memory region if failed to create device memory pages structure. Signed-off-by: NPhilip Yang <Philip.Yang@amd.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-
- 24 4月, 2021 1 次提交
-
-
由 Colin Ian King 提交于
The variable r is being initialized with a value that is never read and it is being updated later with a new value. The initialization is redundant and can be removed. Addresses-Coverity: ("Unused value") Signed-off-by: NColin Ian King <colin.king@canonical.com> Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>
-