- 25 August 2021, 2 commits
-
-
By Philip Yang

Restore a retry-fault or prefetch range, or restore an svm range after eviction, to map the range to the GPU with the correct read or write access permission. A range may include multiple VMAs; update the GPU page table with the offset into the prange and the number of pages for each VMA, according to each VMA's access permission.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
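A minimal sketch of the per-VMA walk described above; map_pages_to_gpu and the exact svm_range field handling are simplified stand-ins for the driver's real helpers:

```c
/* Sketch: split a prange across the VMAs it covers and map each piece
 * with that VMA's access permission. Helper names are illustrative. */
static int map_range_per_vma(struct mm_struct *mm, struct svm_range *prange)
{
	unsigned long start = prange->start << PAGE_SHIFT;
	unsigned long end = (prange->last + 1) << PAGE_SHIFT;
	unsigned long addr = start;

	while (addr < end) {
		struct vm_area_struct *vma = find_vma(mm, addr);
		unsigned long next, offset, npages;
		bool readonly;

		if (!vma || addr < vma->vm_start)
			return -EFAULT;

		next = min(vma->vm_end, end);
		offset = (addr - start) >> PAGE_SHIFT;  /* offset into prange */
		npages = (next - addr) >> PAGE_SHIFT;   /* pages in this VMA */
		readonly = !(vma->vm_flags & VM_WRITE);

		map_pages_to_gpu(prange, offset, npages, readonly);
		addr = next;
	}
	return 0;
}
```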
-
By Philip Yang

Check the range's access permission when restoring a GPU retry fault: if the GPU retry fault is on an address that belongs to a VMA, and the VMA does not grant the read or write permission requested by the GPU, fail to restore the address. The vm fault event is then passed back to user space.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
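The check boils down to comparing the GPU's requested access against the VMA flags; a sketch under that assumption:

```c
/* Sketch: recovery is only allowed when the VMA grants everything the
 * faulting GPU asked for (a write fault also implies read access). */
static bool svm_fault_allowed(struct vm_area_struct *vma, bool write_fault)
{
	unsigned long requested = write_fault ? VM_WRITE | VM_READ : VM_READ;

	return (vma->vm_flags & requested) == requested;
}
```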
-
- 17 August 2021, 1 commit
-
-
By Yifan Zhang

KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress tests:

Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN ] KFDSVMRangeTest.BasicSystemMemTest
[ OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN ] KFDSVMRangeTest.SetGetAttributesTest
[ ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i] Actual: 4294967295 Expected: outputAttributes[i].value Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i] Actual: 4294967295 Expected: outputAttributes[i].value Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i] Actual: 4 Expected: outputAttributes[i].type Which is: 2
[ ] Setting/Getting atrributes
[ FAILED ]

The root cause is that the svm work queue has not finished when svm_range_get_attr is called, so garbage svm interval tree data makes svm_range_get_attr return wrong results. Flush the work queue before iterating the svm interval tree.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
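The fix is essentially a one-line flush; a sketch of where it lands, using the deferred work item of the process's SVM range list:

```c
/* Sketch: svm_range_get_attr must not race with deferred range-list
 * work, so flush it before walking the interval tree. */
static void svm_range_get_attr_prepare(struct kfd_process *p)
{
	flush_work(&p->svms.deferred_list_work);
}
```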
-
- 12 August 2021, 1 commit
-
-
By Philip Yang

For xnack on, if the range is ACCESS or ACCESS_IN_PLACE (AIP) by a single GPU, or the range is ACCESS_IN_PLACE by multiple GPUs and all of those GPUs are connected over XGMI in the same hive, the best prefetch location is the prefetch_loc GPU. Otherwise, the best prefetch location is always the CPU, because a GPU does not have a coherent mapping of other GPUs' VRAM, even with a large-BAR PCIe connection.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
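A sketch of that policy; same_hive_or_self is a hypothetical stand-in for the driver's XGMI topology query:

```c
/* Sketch: prefetch to the requested GPU only when every GPU that can
 * access the range can also reach that GPU's VRAM coherently (same
 * XGMI hive); otherwise keep the range in system memory (location 0). */
static uint32_t best_prefetch_location(struct svm_range *prange,
				       uint32_t prefetch_loc)
{
	DECLARE_BITMAP(bitmap, MAX_GPU_INSTANCE);
	uint32_t gpuidx;

	/* GPUs with ACCESS or ACCESS_IN_PLACE on this range */
	bitmap_or(bitmap, prange->bitmap_access, prange->bitmap_aip,
		  MAX_GPU_INSTANCE);

	for_each_set_bit(gpuidx, bitmap, MAX_GPU_INSTANCE)
		if (!same_hive_or_self(gpuidx, prefetch_loc))
			return 0;	/* CPU */

	return prefetch_loc;
}
```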
-
- 07 August 2021, 1 commit
-
-
By Felix Kuehling

Currently the SVM get_attr call allows querying which flags are set in the entire address range. Add the opposite query: which flags are clear in the entire address range. Both queries can be combined in a single get_attr call, which allows answering questions such as "is this address range coherent, non-coherent, or a mix of both?"

Proposed userspace for the UAPI: https://github.com/RadeonOpenCompute/ROCR-Runtime/tree/memory_model_queries

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
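A userspace sketch of the combined query, assuming the attribute names and variable-size argument layout of the kfd_ioctl.h UAPI:

```c
#include <stdint.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kfd_ioctl.h>

/* Sketch: one get_attr call returns both which flags are set and which
 * are clear across the whole range. A flag in neither answer is mixed. */
static int query_flags(int kfd_fd, uint64_t addr, uint64_t size,
		       uint32_t *set_flags, uint32_t *clr_flags)
{
	size_t sz = sizeof(struct kfd_ioctl_svm_args) +
		    2 * sizeof(struct kfd_ioctl_svm_attribute);
	struct kfd_ioctl_svm_args *args = calloc(1, sz);
	int r;

	args->start_addr = addr;
	args->size = size;
	args->op = KFD_IOCTL_SVM_OP_GET_ATTR;
	args->nattr = 2;
	args->attrs[0].type = KFD_IOCTL_SVM_ATTR_SET_FLAGS;
	args->attrs[1].type = KFD_IOCTL_SVM_ATTR_CLR_FLAGS;

	r = ioctl(kfd_fd, AMDKFD_IOC_SVM, args);
	if (!r) {
		*set_flags = args->attrs[0].value;
		*clr_flags = args->attrs[1].value;
	}
	free(args);
	return r;
}
```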
-
- 13 July 2021, 1 commit
-
-
By Philip Yang

prange is NULL if a vm fault retry arrives on an invalid address. In this case we cannot use prange to get the pdd; use adev to get the gpuidx and then the pdd instead, then increase the pdd's vm fault counter.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
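A sketch of the adev-to-pdd lookup described here; gpuid_of() is a hypothetical stand-in for the driver's adev-to-gpuid mapping:

```c
/* Sketch: resolve the per-process device data from the faulting device
 * when there is no prange to read it from. */
static struct kfd_process_device *pdd_from_adev(struct kfd_process *p,
						struct amdgpu_device *adev)
{
	/* gpuid_of() stands in for the driver's adev-to-gpuid lookup */
	int32_t gpuidx = kfd_process_gpuidx_from_gpuid(p, gpuid_of(adev));

	if (gpuidx < 0)
		return NULL;
	return kfd_process_device_from_gpuidx(p, gpuidx);
}
```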
-
- 01 July 2021, 7 commits
-
-
By Alex Sierra

Each zone-device page holds a reference to the SVM BO that manages its backing storage. This is necessary to correctly hold on to the BO in case zone-device pages are shared with a child process.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
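A sketch of how a migrated VRAM page can pin its backing BO, assuming the svm_range_bo_ref helper and the zone_device_data convention described above:

```c
/* Sketch: each device page takes a BO reference when it migrates to
 * VRAM; the pagemap's page_free callback drops it again. */
static void svm_migrate_get_vram_page(struct svm_range *prange,
				      unsigned long pfn)
{
	struct page *page = pfn_to_page(pfn);

	svm_range_bo_ref(prange->svm_bo);      /* page keeps the BO alive */
	page->zone_device_data = prange->svm_bo;
	lock_page(page);
}
```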
-
By Alex Sierra

[Why] svm ranges can have mixed pages from device or system memory. A good example: after a prange has been allocated in VRAM, a copy-on-write is triggered by a fork. This invalidates some pages inside the prange, ending up with mixed pages.

[How] By classifying each page inside a prange based on its type, device or system memory, during the dma mapping call. If a page belongs to the VRAM domain, a flag is set in its dma_addr entry for each GPU. Then, during GPU page table mapping, each group of contiguous pages of the same type is mapped with the proper pte flags.

v2: Instead of using ttm_res to calculate vram pfns in the svm_range, this is now done by setting the vram real physical address into the dma_addr array. This makes VRAM management more flexible and removes the need to have a BO reference in the svm_range.
v3: Remove the mapping member from svm_range.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
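A sketch of the per-page classification, using the SVM_RANGE_VRAM_DOMAIN flag bit the driver keeps in the dma address array:

```c
/* Sketch: tag each dma_addr entry with its domain so the mapping code
 * can later walk runs of same-domain pages and pick pte flags per run. */
#define SVM_RANGE_VRAM_DOMAIN (1UL << 0)

static void svm_classify_page(uint64_t *dma_addr, int i,
			      uint64_t addr, bool is_vram)
{
	dma_addr[i] = addr;
	if (is_vram)
		dma_addr[i] |= SVM_RANGE_VRAM_DOMAIN;
}
```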
-
By Alex Sierra

Now that a prange can have mixed domains (VRAM or SYSRAM), neither actual_loc nor svm_bo can be used to check its current domain and eventually get its pfns to map them on the GPU. Instead, pfns from both domains are now obtained from hmm_range_fault through the amdgpu_hmm_range_get_pages call. This is done every time a GPU map occurs.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Alex Sierra

Get the proper owner reference for the amdgpu_hmm_range_get_pages function. This is useful for partial migrations, to avoid migrating back to system memory those VRAM pages that are accessible by all devices in the same memory domain, e.g. multiple devices in the same hive.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Alex Sierra

svm_range_prefault is called right before migration to VRAM, to make sure pages are resident in system memory before the migration. With partial migrations, this reference is used by hmm range get pages to avoid migrating pages that are already in the same VRAM domain.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Alex Sierra

The parameter is used as the dev_private_owner to decide whether device pages in the range need to be migrated back to system memory, based on whether or not they are in the same memory domain. In this case, the reference can come from the same memory domain when devices are connected to the same hive.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
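A sketch of the owner-token idea behind these commits; treating the XGMI hive id as the shared token is an assumption for illustration, not the driver's exact scheme:

```c
/* Sketch: GPUs in one XGMI hive hand hmm_range_fault the same
 * dev_private_owner, so each other's device-private pages are left in
 * VRAM instead of being migrated back to system memory. */
static void *hmm_owner_for(struct amdgpu_device *adev)
{
	if (adev->gmc.xgmi.hive_id)
		return (void *)(unsigned long)adev->gmc.xgmi.hive_id;

	return adev;	/* standalone GPU owns only its own pages */
}
```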
-
By Alex Sierra

During GPU page table invalidation with xnack off, new range splits may occur concurrently in the same prange, creating a new child per split. Each child should also increment its invalid counter, to ensure GPU page table updates in these ranges.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
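A sketch of the counter propagation, following the child_list and invalid fields of the SVM range structure:

```c
/* Sketch: invalidate the parent range and every child split off it,
 * so all of them are revalidated before the next GPU access. */
static void svm_range_mark_invalid(struct svm_range *prange)
{
	struct svm_range *pchild;

	atomic_inc(&prange->invalid);
	list_for_each_entry(pchild, &prange->child_list, child_list)
		atomic_inc(&pchild->invalid);
}
```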
-
- 30 June 2021, 1 commit
-
-
By Philip Yang

Add a helper function to get the process device data structure from adev, to update counters. Updates of the vm fault, page_in and page_out counters are not executed in parallel; use WRITE_ONCE to avoid any form of compiler optimization.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
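A sketch of the single-writer counter update pattern the commit describes; the field names mirror the per-device counters mentioned above:

```c
/* Sketch: the counters have one writer, so plain adds are safe, but
 * WRITE_ONCE keeps the stores intact for concurrent readers. */
static void svm_account_migration(struct kfd_process_device *pdd,
				  unsigned long npages, bool to_vram)
{
	if (to_vram)
		WRITE_ONCE(pdd->page_in, pdd->page_in + npages);
	else
		WRITE_ONCE(pdd->page_out, pdd->page_out + npages);
}
```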
-
- 16 June 2021, 2 commits
-
-
By Nirmoy Das

Page table entries are now embedded in the VM BO, so we do not need struct amdgpu_vm_pt. This patch replaces struct amdgpu_vm_pt with struct amdgpu_vm_bo_base.

Signed-off-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

When some GPUs don't support SVM, don't disable it for the entire process. That would be inconsistent with the information the process got from the topology, which indicates SVM support per GPU. Instead disable SVM support only for the unsupported GPUs. This is done by checking any per-device attributes against the bitmap of supported GPUs. Also use the supported-GPU bitmap to initialize access bitmaps for new SVM address ranges. Don't handle recoverable page faults from unsupported GPUs. (I don't think there will be unsupported GPUs that can generate recoverable page faults. But better safe than sorry.)

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Philip Yang <philip.yang@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
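A sketch of the attribute validation against the supported-GPU bitmap; the field name follows the SVM range list:

```c
/* Sketch: reject per-device attributes for GPUs without SVM support. */
static int svm_check_attr_gpu(struct svm_range_list *svms, uint32_t gpuidx)
{
	return test_bit(gpuidx, svms->bitmap_supported) ? 0 : -EINVAL;
}
```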
-
- 08 June 2021, 1 commit
-
-
By Philip Yang

prange->offset is for the VRAM range's mm_nodes: if multiple ranges share the same mm_nodes, migrating a range back to VRAM reuses the VRAM at that offset into the shared mm_nodes. For the system memory pages_addr array the offset must always be 0; otherwise the GPU mapping update will use the wrong system memory pages and cause system memory corruption.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 02 June 2021, 1 commit
-
-
By Christian König

When we want to decouple resource management from buffer management, we need to be able to handle resources separately. Add a resource pointer and rename bo->mem so that all code has to change to access the pointer instead. No functional change.

Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20210430092508.60710-4-christian.koenig@amd.com
-
- 20 May 2021, 7 commits
-
-
By Philip Yang

We need to do a heavy-weight TLB flush to make sure there is no more dirty data in the cache for the unmapped pages. Define enum TLB_FLUSH_TYPE and add a flush_type parameter to amdgpu_amdkfd_flush_gpu_tlb_pasid.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
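A sketch of the new flush-type plumbing; the enum values follow the TLB_FLUSH_TYPE naming above, and the exact function signature is assumed:

```c
/* Sketch: callers choose how strong a flush they need; the unmap path
 * uses the heavy-weight variant so no dirty cache lines survive. */
enum TLB_FLUSH_TYPE {
	TLB_FLUSH_LEGACY,
	TLB_FLUSH_LIGHTWEIGHT,
	TLB_FLUSH_HEAVYWEIGHT,
};

void flush_tlb_after_unmap(struct kgd_dev *kgd, uint16_t pasid)
{
	amdgpu_amdkfd_flush_gpu_tlb_pasid(kgd, pasid, TLB_FLUSH_HEAVYWEIGHT);
}
```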
-
By Philip Yang

This reverts commit 1704ac8e. After "drm/amdgpu: flush TLB if valid PDE turns into PTE" was merged, this workaround is no longer needed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Philip Yang

Mapping a huge page (a 2MB-aligned address with 2MB size) uses PDE0 as the PTE. If a previously valid PDE0 (PDE0.V=1 and PDE0.P=0) turns into a PTE, this requires a TLB flush; otherwise the page table walker will not read the updated PDE0. Change the page table mapping update to return a table_freed flag, indicating that a previously valid PDE may have turned into a PTE when a page table was freed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Christian König

Now that we have found the underlying problem, we can re-apply this patch. This reverts commit 6b44b667.

v2: rebase on KFD changes

Signed-off-by: Christian König <christian.koenig@amd.com>
Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
Tested-by: Nirmoy Das <nirmoy.das@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

MTYPE UC was used for a specific use case that ended up not being implemented. Use NC for better performance for coarse-grained memory, where cache coherence during shader execution is not required.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

MTYPE UC was used for a specific use case that ended up not being implemented. Use NC for better performance for coarse-grained memory, where cache coherence during shader execution is not required.

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: Oak Zeng <Oak.Zeng@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Philip Yang

If xnack is on, a new range is created to recover a retry vm fault or by SVM API calls; give all GPUs access to the range.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
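A sketch of the default-access setup at range creation, reusing the supported-GPU bitmap from the per-GPU SVM support change:

```c
/* Sketch: with xnack on, a freshly created range is accessible by
 * every SVM-capable GPU in the process. */
static void svm_range_set_default_access(struct kfd_process *p,
					 struct svm_range *prange)
{
	if (p->xnack_enabled)
		bitmap_copy(prange->bitmap_access,
			    p->svms.bitmap_supported, MAX_GPU_INSTANCE);
}
```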
-
- 11 May 2021, 1 commit
-
-
By Philip Yang

Work around the situation where vm retry faults keep coming after a page table update. We are investigating the root cause, but once this issue happens the application gets stuck, and sometimes a reboot is needed to recover.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 29 April 2021, 4 commits
-
-
By Philip Yang

After draining a stale retry fault, or after failing to validate the range to recover, the fault address has to be removed from the fault filter ring to be able to handle subsequent retry interrupts on the same address. Otherwise the retry fault will not be processed for recovery until the timeout has passed.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Philip Yang

Retry fault interrupts may still be pending in the IH ring after the GPU page table has been updated to recover a vm fault, because each page of the range generates its own retry fault interrupt. There is a race if the application unmaps the range, removing and freeing it, before the retry fault work restore_pages handles the interrupt: the range can no longer be found, so the vm fault cannot be recovered and an incorrect GPU vm fault is reported to the application. Before unmap removes and frees a range, drain retry fault interrupts from IH ring1 to ensure no retry fault arrives after the range is removed. The drain skips ranges that are on the deferred list for removal, and child ranges, which are split off by unmap, are not added to svms, and have no interval notifier.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Philip Yang

A GPU vm retry fault recovery range needs to retry validation if:
1. the range was split in parallel by unmap while recovering;
2. the range migrated to system memory and was updated in system memory while recovering.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Colin Ian King

There is a spelling mistake in a pr_debug message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Nirmoy Das <nirmoy.das@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 24 April 2021, 3 commits
-
-
By Colin Ian King

Currently the call to kfd_process_gpuidx_from_gpuid returns an int, which is assigned to the uint32_t variable gpuidx and then checked for a negative error return, a check that is always false. Fix this by making gpuidx an int32_t. This also makes gpuidx type-consistent with its use by the callers.

Addresses-Coverity: ("Unsigned compared against 0")
Fixes: cda0f85b ("drm/amdkfd: refine migration policy with xnack on")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
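The fix in miniature; with an unsigned index the error check is dead code:

```c
/* Sketch: the return value can be negative, so the index must be
 * signed for the error check to be reachable. */
static int check_gpuidx(struct kfd_process *p, uint32_t gpuid)
{
	int32_t gpuidx = kfd_process_gpuidx_from_gpuid(p, gpuid);

	if (gpuidx < 0)		/* always false if gpuidx were uint32_t */
		return -EINVAL;
	return gpuidx;
}
```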
-
By Alex Sierra

The access attribute value for default ranges is set based on the process xnack setting. With XNACK ON, unregistered ranges get the GPU access attribute through page faults, while with XNACK OFF unregistered ranges have no access attribute.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Alex Sierra

SVM ranges are created for unregistered memory, triggered by page faults. These ranges are migrated/mapped to GPU VRAM.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
- 21 April 2021, 6 commits
-
-
By Felix Kuehling

If a range is prefetched to a GPU while its actual location is another GPU, or a GPU retry fault restores pages to migrate a range whose actual location is another GPU, then we migrate from one GPU to the other. Use system memory as a bridge, because the sdma engine may not be able to access the other GPU's vram: use the source GPU's sdma to migrate to system memory, then the destination GPU's sdma to migrate from system memory to that GPU. Print out gpuid or gpuidx in debug messages.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

With xnack on, add a validate timestamp in order to handle GPU vm faults from multiple GPUs. If a GPU retry fault needs to migrate the range to the best restore location, use the range validate timestamp to record the system timestamp after the range is restored to update the GPU page table. Because multiple pages of the same range produce multiple retry faults, define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING as the period during which pending retry faults may still arrive after the page table update, to skip duplicate retry faults on the same range. If the difference between the system timestamp and the range's last validate timestamp is larger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING, the retry fault is from another GPU, so continue handling the retry fault recovery.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
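A sketch of the stale-fault filter, assuming a nanosecond timestamp; the window value here is illustrative, the driver defines its own AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING:

```c
#include <linux/ktime.h>

/* illustrative window; the driver defines its own value */
#define AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING	(2ULL * NSEC_PER_SEC)

/* Sketch: a fault arriving shortly after the range was validated is a
 * leftover from the same burst and can be skipped; a later one is a
 * genuine new fault (e.g. from another GPU) and is handled normally. */
static bool retry_fault_is_stale(struct svm_range *prange)
{
	uint64_t now = ktime_get_ns();

	return now - prange->validate_timestamp <
	       AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING;
}
```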
-
By Felix Kuehling

With xnack on, the GPU vm fault handler decides the best restore location, then migrates the range to that location and updates the GPU mapping to recover the GPU vm fault.

Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

The svm_bo eviction mechanism is different from that of regular BOs. Every SVM_BO created contains one eviction fence and one worker item for the eviction process. SVM_BOs can be attached to one or more pranges. For the SVM_BO eviction mechanism, TTM will call the enable_signal callback for every SVM_BO until VRAM space is available. All the ttm_evict calls are synchronous; this guarantees that each eviction has completed and the fence has signaled before it returns.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

Page table restore implementation in the SVM API. This is called from the fault handler at amdgpu_vm to update page tables through the page fault retry IH.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-
By Felix Kuehling

GPU page tables are invalidated by unmapping the prange directly at the mmu notifier when page fault retry is enabled through the amdgpu_noretry global parameter. The page table restore is performed in the page fault handler. If xnack is on, we update GPU mappings after migration to avoid unnecessary GPUVM faults.

Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
-