1. 13 Oct 2022, 2 commits
    •
      mm: free device private pages have zero refcount · ef233450
      Authored by Alistair Popple
      Since commit 27674ef6 ("mm: remove the extra ZONE_DEVICE struct page
      refcount"), device private pages no longer have an extra reference
      count while the page is in use.  However, before handing them back to
      the owning device driver we add an extra reference count, so that free
      pages have a reference count of one.
      
      This makes it difficult to tell whether a page is free or not, because
      both free and in-use pages have a non-zero refcount.  Instead we should
      return pages to the driver's page allocator with a zero reference
      count.  Kernel code can then safely use kernel functions such as
      get_page_unless_zero().
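
      The free-at-zero rule above can be sketched in a small userspace model.
      C11 atomics stand in for the kernel's page refcount; the names mirror
      get_page_unless_zero() but this is an illustrative sketch, not kernel
      code:

      ```c
      #include <assert.h>
      #include <stdatomic.h>

      /* Userspace model: a free device private page sits at refcount 0,
       * an in-use page at >= 1. */
      struct page { atomic_int _refcount; };

      /* Try to pin a page: succeed only if the count is already non-zero,
       * so a free (zero-refcount) page can never be pinned by accident. */
      static int get_page_unless_zero(struct page *p)
      {
          int c = atomic_load(&p->_refcount);
          while (c != 0) {
              if (atomic_compare_exchange_weak(&p->_refcount, &c, c + 1))
                  return 1; /* pinned, caller must drop the reference */
          }
          return 0; /* page is free and must not be touched */
      }

      int main(void)
      {
          struct page free_page = { 0 }; /* free: refcount 0 */
          struct page in_use   = { 1 }; /* in use: refcount 1 */

          assert(get_page_unless_zero(&free_page) == 0);
          assert(get_page_unless_zero(&in_use) == 1);
          assert(atomic_load(&in_use._refcount) == 2);
          return 0;
      }
      ```

      With the old scheme, where free pages sat at refcount one, the same
      check could not distinguish free from in-use pages.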
      
      Link: https://lkml.kernel.org/r/cf70cf6f8c0bdb8aaebdbfb0d790aea4c683c3c6.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ef233450
    •
      mm/memory.c: fix race when faulting a device private page · 16ce101d
      Authored by Alistair Popple
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.
      However, without these fixes it is possible to crash the kernel from
      userspace.  These crashes can be triggered either by unloading the
      kernel module or by unbinding the device from the driver before a
      userspace task exits.  With drivers such as Nouveau it is also possible
      to trigger some of these issues by explicitly closing the device
      file descriptor before the task exits and then accessing device private
      memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code.
      Unfortunately I lack the hardware to test either of those, so any help
      there would be appreciated.  The changes mimic what is done for both
      Nouveau and hmm-tests, though, so I doubt they will cause problems.
      
      
      This patch (of 8):
      
      When the CPU tries to access a device private page, the migrate_to_ram()
      callback associated with the page's pgmap is called.  However, no
      reference is taken on the faulting page.  A concurrent migration of the
      device private page can therefore free the page, and possibly the
      underlying pgmap.  This results in a race which can crash the kernel
      because the migrate_to_ram() function pointer becomes invalid.  It also
      means drivers can't reliably read the zone_device_data field, because
      the page may have been freed by memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
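
      A minimal userspace sketch of the fix described above, with a mutex
      standing in for the PTL and simplified stand-in types (this models the
      idea, it is not the actual mm code):

      ```c
      #include <assert.h>
      #include <pthread.h>
      #include <stdatomic.h>

      struct page { atomic_int _refcount; };

      static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;

      /* Pin the page only if its refcount is non-zero (see
       * get_page_unless_zero() in the kernel). */
      static int try_get_page(struct page *p)
      {
          int c = atomic_load(&p->_refcount);
          while (c != 0)
              if (atomic_compare_exchange_weak(&p->_refcount, &c, c + 1))
                  return 1;
          return 0;
      }

      /* Fault path model: take the reference while holding the PTL, so a
       * concurrent migration cannot free the page (and its pgmap) between
       * the PTE check and the migrate_to_ram() call. */
      static int do_swap_page_model(struct page *p)
      {
          int pinned;

          pthread_mutex_lock(&ptl);
          pinned = try_get_page(p);   /* reference taken under the "ptl" */
          pthread_mutex_unlock(&ptl);

          if (!pinned)
              return -1;              /* page already freed: retry the fault */

          /* ...safe to call the driver's migrate_to_ram() callback here... */
          atomic_fetch_sub(&p->_refcount, 1);  /* put_page() equivalent */
          return 0;
      }

      int main(void)
      {
          struct page live = { 1 }, freed = { 0 };
          assert(do_swap_page_model(&live) == 0);
          assert(do_swap_page_model(&freed) == -1);
          return 0;
      }
      ```

      The elevated count is exactly why the faulting page must also be passed
      into the migrate_vma functions, so they can tell an expected extra
      reference from an unexpected one.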
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      16ce101d
  2. 29 Jul 2022, 1 commit
  3. 18 Jul 2022, 1 commit
  4. 01 Jul 2022, 1 commit
  5. 04 Jun 2022, 1 commit
    •
      drm/amdkfd: Fix partial migration bugs · 88467db6
      Authored by Philip Yang
      When migrating a range from system memory to VRAM, if a system page
      cannot be locked or unmapped, we do a partial migration and leave some
      pages in system memory.  Several bugs were found in how pages are
      copied and the GPU mapping updated in this situation:
      
      1. The copy to VRAM should use migrate->npages, the total number of
      pages in the range, not migrate->cpages, the number of pages that can
      be migrated.
      
      2. After a partial copy, set the VRAM res cursor to j + 1, where j is
      the number of system pages copied, plus one page to skip the copy.
      
      3. When copying to RAM, all contiguous VRAM pages should be collected
      and copied together.
      
      4. When calling amdgpu_vm_update_range, the offset should be passed in
      bytes, not as a number of pages.
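
      Bugs (1) and (3) can be illustrated with a toy model of the copy loop,
      using simplified stand-ins for the migrate_vma fields (npages, cpages,
      src); this is not the amdgpu code itself:

      ```c
      #include <assert.h>

      #define MIGRATE_PFN_MIGRATE 0x1UL

      /* Simplified stand-in for struct migrate_vma. */
      struct migrate_model {
          unsigned long npages;   /* total pages in the range */
          unsigned long cpages;   /* pages that could be collected */
          unsigned long *src;     /* per-page flags */
      };

      /* Walk ALL npages (not just cpages), skip pages left in system
       * memory, and batch contiguous migrating pages into one copy.
       * Returns the number of copy operations issued. */
      static int copy_ranges(struct migrate_model *m)
      {
          int copies = 0;
          unsigned long i = 0;

          while (i < m->npages) {              /* npages, NOT cpages */
              if (!(m->src[i] & MIGRATE_PFN_MIGRATE)) {
                  i++;                         /* page stays in system RAM */
                  continue;
              }
              unsigned long run = i;
              while (run < m->npages && (m->src[run] & MIGRATE_PFN_MIGRATE))
                  run++;                       /* extend the contiguous run */
              copies++;                        /* one copy for [i, run) */
              i = run;
          }
          return copies;
      }

      int main(void)
      {
          /* 6-page range; pages 2 and 3 failed to migrate (partial). */
          unsigned long src[] = { 1, 1, 0, 0, 1, 1 };
          struct migrate_model m = { .npages = 6, .cpages = 4, .src = src };
          assert(copy_ranges(&m) == 2);  /* two contiguous runs, two copies */
          return 0;
      }
      ```

      Looping only over cpages here would silently drop the trailing run,
      which is the class of bug the patch fixes.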
      Signed-off-by: Philip Yang <Philip.Yang@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Cc: stable@vger.kernel.org
      88467db6
  6. 23 Apr 2022, 1 commit
  7. 16 Mar 2022, 1 commit
  8. 04 Mar 2022, 2 commits
  9. 15 Feb 2022, 1 commit
  10. 12 Feb 2022, 1 commit
  11. 20 Jan 2022, 1 commit
  12. 17 Dec 2021, 1 commit
  13. 14 Dec 2021, 1 commit
    •
      drm/amd: fix improper docstring syntax · bbe04dec
      Authored by Isabella Basso
      This fixes various warnings relating to erroneous docstring syntax, of
      which some are listed below:
      
       warning: Function parameter or member 'adev' not described in
       'amdgpu_atomfirmware_ras_rom_addr'
       ...
       warning: expecting prototype for amdgpu_atpx_validate_functions().
       Prototype was for amdgpu_atpx_validate() instead
       ...
       warning: Excess function parameter 'mem' description in 'amdgpu_preempt_mgr_new'
       ...
       warning: Cannot understand  * @kfd_get_cu_occupancy - Collect number of
       waves in-flight on this device
       ...
       warning: This comment starts with '/**', but isn't a kernel-doc
       comment. Refer Documentation/doc-guide/kernel-doc.rst
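
      For reference, a kernel-doc comment that avoids all of the warning
      classes above names the exact function, documents every parameter with
      @name, and has a Return: section.  The function here is a made-up
      placeholder, not one from the patch:

      ```c
      #include <assert.h>

      /**
       * amdgpu_example_lookup() - Find the index of @val in @tbl.
       * @tbl: table to search.
       * @len: number of entries in @tbl.
       * @val: value to look for.
       *
       * Return: index of @val within @tbl, or -1 if @val is not present.
       */
      static int amdgpu_example_lookup(const int *tbl, int len, int val)
      {
          for (int i = 0; i < len; i++)
              if (tbl[i] == val)
                  return i;
          return -1;
      }

      int main(void)
      {
          const int tbl[] = { 3, 5, 7 };
          assert(amdgpu_example_lookup(tbl, 3, 5) == 1);
          assert(amdgpu_example_lookup(tbl, 3, 9) == -1);
          return 0;
      }
      ```

      A comment starting with /** that does not follow this shape is what
      triggers the "isn't a kernel-doc comment" warning.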
      Signed-off-by: Isabella Basso <isabbasso@riseup.net>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      bbe04dec
  14. 18 Nov 2021, 1 commit
  15. 12 Nov 2021, 1 commit
  16. 06 Nov 2021, 1 commit
    •
      drm/amdkfd: avoid recursive lock in migrations back to RAM · a6283010
      Authored by Alex Sierra
      [Why]:
      When we call hmm_range_fault to map memory after a migration, we don't
      expect memory to be migrated again as a result of hmm_range_fault. The
      driver ensures that all memory is in GPU-accessible locations so that
      no migration should be needed. However, there is one corner case where
      hmm_range_fault can unexpectedly cause a migration from DEVICE_PRIVATE
      back to system memory due to a write-fault when a system memory page in
      the same range was mapped read-only (e.g. COW). Ranges with individual
      pages in different locations are usually the result of failed page
      migrations (e.g. page lock contention). The unexpected migration back
      to system memory causes a deadlock from recursive locking in our
      driver.
      
      [How]:
      Add a new task reference member to the svm_range_list struct, and set
      it to "current" right before hmm_range_fault is called.  The
      svm_migrate_to_ram callback checks this member against "current"; if
      they are equal, the migration is ignored.
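
      A standalone sketch of that guard, with a plain pointer standing in
      for the kernel's current task (the member and function names are
      illustrative, not the exact ones from the patch):

      ```c
      #include <assert.h>
      #include <stddef.h>

      struct task { int pid; };

      /* Stand-in for the svm_range_list member added by the patch: set to
       * the faulting task around the hmm_range_fault() call. */
      struct svm_range_list {
          struct task *faulting_task;
      };

      /* Returns 1 if a migrate_to_ram fault should proceed, 0 if it must
       * be ignored because the faulting task itself triggered it from
       * inside hmm_range_fault (which would recurse on driver locks). */
      static int svm_migrate_to_ram_allowed(struct svm_range_list *svms,
                                            struct task *current)
      {
          return svms->faulting_task != current;
      }

      int main(void)
      {
          struct task a = { 100 }, b = { 200 };
          struct svm_range_list svms = { .faulting_task = NULL };

          svms.faulting_task = &a;    /* task a enters hmm_range_fault */
          assert(svm_migrate_to_ram_allowed(&svms, &a) == 0); /* recursive */
          assert(svm_migrate_to_ram_allowed(&svms, &b) == 1); /* unrelated */
          svms.faulting_task = NULL;  /* cleared after hmm_range_fault */
          return 0;
      }
      ```

      Unrelated tasks still migrate normally; only the write-fault raised by
      the faulting task's own hmm_range_fault call is suppressed.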
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      a6283010
  17. 04 Nov 2021, 1 commit
  18. 22 Oct 2021, 2 commits
  19. 14 Oct 2021, 2 commits
  20. 30 Sep 2021, 1 commit
  21. 24 Sep 2021, 4 commits
  22. 01 Jul 2021, 6 commits
  23. 30 Jun 2021, 1 commit
  24. 04 Jun 2021, 1 commit
  25. 11 May 2021, 2 commits
  26. 29 Apr 2021, 1 commit
  27. 24 Apr 2021, 1 commit