提交 · bdd2465730a3ad62aec8ad47a59a9d36ffdebf37 · openeuler / Kernel

11 5月, 2021 6 次提交

drm/amdkfd: fix no atomics settings in the kfd topology · bdd24657

由 Jonathan Kim 提交于 4月 30, 2021

To account for various PCIe and xGMI setups, check the no atomics settings
for a device in relation to every direct peer.
Signed-off-by: NJonathan Kim <jonathan.kim@amd.com>
Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

bdd24657

drm/amdkfd: Make svm_migrate_put_sys_page static · 2e4ec251

由 Felix Kuehling 提交于 4月 30, 2021

This function is only used in this source file.
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Reviewed-by: NRodrigo Siqueira <Rodrigo.Siqueira@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2e4ec251

drm/amdkfd: flush TLB after updating GPU page table · 1704ac8e

由 Philip Yang 提交于 4月 15, 2021

To workaround the situation that vm retry fault keep coming after page
table update. We are investigating the root cause, but once this issue
happens, application will stuck and sometimes have to reboot to recover.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

1704ac8e

drm/amdkfd: Add Aldebaran virtualization support · cecd91b4

由 Zhigang Luo 提交于 4月 29, 2021

update kfd_supported_devices to enable Aldebaran virtualization support
Signed-off-by: NZhigang Luo <zhigang.luo@amd.com>
Reviewed-by: NAlex Deucher <alexander.deucher@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

cecd91b4

drm/amdkfd: report the numa weight between host and device over xgmi · 559f418e

由 Jonathan Kim 提交于 4月 21, 2021

GPUs connected to CPUs over xGMI are bidirectional so set weight by a
single hop both ways.
Signed-off-by: NJonathan Kim <jonathan.kim@amd.com>
Tested-by: NRamesh Errabolu <ramesh.errabolu@amd.com>
Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

559f418e

drm/amdkfd: report atomics support in io_links over xgmi · deb68983

由 Jonathan Kim 提交于 4月 21, 2021

Link atomics support over xGMI should be reported independently of PCIe.
Do not set NO_ATOMICS flags on devices that support xGMI but that do not
have atomics support over PCIe.
Signed-off-by: NJonathan Kim <jonathan.kim@amd.com>
Tested-by: NRamesh Errabolu <ramesh.errabolu@amd.com>
Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

deb68983

29 4月, 2021 9 次提交

drm/amdkfd: Add Aldebaran gws support · 8baa6018

由 Harish Kasiviswanathan 提交于 4月 20, 2021

v2: updated MEC FW version after validating gws with debugger
Signed-off-by: NHarish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Reviewed-by: NJoseph Greathouse <Joseph.Greathouse@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8baa6018

drm/amdkfd: enable subsequent retry fault · b3dc91f9

由 Philip Yang 提交于 4月 20, 2021

After draining the stale retry fault, or failed to validate the range
to recover, have to remove the fault address from fault filter ring, to
be able to handle subsequent retry interrupt on same address. Otherwise
the retry fault will not be processed to recover until timeout passed.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b3dc91f9

drm/amdkfd: handle stale retry fault · 373e3ccd

由 Philip Yang 提交于 4月 19, 2021

Retry fault interrupt maybe pending in IH ring after GPU page table
is updated to recover the vm fault, because each page of the range
generate retry fault interrupt. There is race if application unmap
range to remove and free the range first and then retry fault work
restore_pages handle the retry fault interrupt, because range can not be
found, this vm fault can not be recovered and report incorrect GPU vm
fault to application.

Before unmap to remove and free range, drain retry fault interrupt
from IH ring1 to ensure no retry fault comes after the range is removed.

Drain retry fault interrupt skip the range which is on deferred list
to remove, or the range is child range, which is split by unmap, does
not add to svms and have interval notifier.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

373e3ccd

drm/amdkfd: retry validation to recover range · 4999e398

由 Philip Yang 提交于 4月 19, 2021

GPU vm retry fault recover range need retry validation if

1. range is split in parallel by unmap while recover
2. range migrate to system memory and range is updated in system
memory while recover
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4999e398

drm/amdkfd: fix spelling mistake in packet manager · c3c5cc9a

由 Jonathan Kim 提交于 4月 26, 2021

The plural of 'process' should be 'processes'.
Signed-off-by: NJonathan Kim <jonathan.kim@amd.com>
Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c3c5cc9a

drm/amdkfd: fix double free device pgmap resource · c0f76fc8

由 Philip Yang 提交于 4月 26, 2021

Use devm_memunmap_pages instead of memunmap_pages to release pgmap
and remove pgmap from device action, to avoid double free pgmap when
unloading driver module.

Release device memory region if failed to create device memory pages
structure.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c0f76fc8

drm/amdkfd: Fix spelling mistake "unregisterd" -> "unregistered" · dd57e65f

由 Colin Ian King 提交于 4月 26, 2021

There is a spelling mistake in a pr_debug message. Fix it.
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NNirmoy Das <nirmoy.das@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

dd57e65f

drm/amdgpu: remove unnecessary header include · be9064b7

由 Hawking Zhang 提交于 4月 25, 2021

amdgpu.h is included in kfd_priv.h
Signed-off-by: NHawking Zhang <Hawking.Zhang@amd.com>
Reviewed-by: NJohn Clements <John.Clements@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

be9064b7

drm/amdkfd: Fix kernel-doc syntax error · 71ff0b4d

由 Fabio M. De Francesco 提交于 4月 24, 2021

Fixed a kernel-doc error in the documentation of a function.
Signed-off-by: NFabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

71ff0b4d

24 4月, 2021 5 次提交

drm/amdkfd: remove redundant initialization to variable r · a40eb089

由 Colin Ian King 提交于 4月 22, 2021

The variable r is being initialized with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

a40eb089

drm/amdkfd: fix uint32 variable compared to less than zero · 65f8db81

由 Colin Ian King 提交于 4月 22, 2021

Currently the call to kfd_process_gpuidx_from_gpuid is returning an
int value and this is being assigned to a uint32_t variable gpuidx
and this is being checked for a negative error return which is always
going to be false. Fix this by making gpuidx an int32_t. This makes
gpuidx also type consistent with the use of gpuidx from the callers.

Addresses-Coverity: ("Unsigned compared against 0")
Fixes: cda0f85b ("drm/amdkfd: refine migration policy with xnack on")
Signed-off-by: NColin Ian King <colin.king@canonical.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

65f8db81

drm/amdkfd: set attribute access for default ranges · 63f1af83

由 Alex Sierra 提交于 4月 21, 2021

Attribute access value for default ranges is set, based on
process xnack on/off.
XNACK ON has GPU access attribute for unregistered ranges through page
fault. While XNACK OFF has no access attribute for unregistered ranges.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

63f1af83

drm/amdkfd: svm ranges creation for unregistered memory · b19dbb7a

由 Alex Sierra 提交于 4月 12, 2021

SVM ranges are created for unregistered memory, triggered
by page faults. These ranges are migrated/mapped to
GPU VRAM memory.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b19dbb7a

drm/amdkfd: add per-vmid-debug map_process_support · fd6a440e

由 Jonathan Kim 提交于 3月 02, 2021

In order to support multi-process debugging, HWS PM4 packet
MAP_PROCESS requires an extension of 5 DWORDS to support targeting of
per-vmid SPI debug control registers as well as watch points per process.
Signed-off-by: NJonathan Kim <jonathan.kim@amd.com>
Reviewed-by: NFelix Kuehling <felix.kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

fd6a440e

21 4月, 2021 20 次提交

drm/amdkfd: Add CONFIG_HSA_AMD_SVM · 4ab159d2

由 Felix Kuehling 提交于 3月 29, 2021

Control whether to build SVM support into amdgpu with a Kconfig option.
This makes it easier to disable it in production kernels if this new
feature causes problems in production environments.

Use "depends on" instead of "select" for DEVICE_PRIVATE, as is
recommended for visible options.
Reviewed-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4ab159d2

drm/amdkfd: Add SVM API support capability bits · 4c166eb9

由 Philip Yang 提交于 12月 09, 2020

SVMAPISupported property added to HSA_CAPABILITY, the value match
HSA_CAPABILITY defined in Thunk spec:

SVMAPISupported: it will not be supported on older kernels that don't
have HMM or on systems with GFXv8 or older GPUs without support for
48-bit virtual addresses.

CoherentHostAccess property added to HSA_MEMORYPROPERTY, the value match
HSA_MEMORYPROPERTY defined in Thunk spec:

CoherentHostAccess: whether or not device memory can be coherently
accessed by the host CPU.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4c166eb9

drm/amdkfd: multiple gpu migrate vram to vram · 1a3b2b5d

由 Felix Kuehling 提交于 2月 24, 2021

If prefetch range to gpu with acutal location is another gpu, or GPU
retry fault restore pages to migrate the range with acutal location is
gpu, then migrate from one gpu to another gpu.

Use system memory as bridge because sdma engine may not able to access
another gpu vram, use sdma of source gpu to migrate to system memory,
then use sdma of destination gpu to migrate from system memory to gpu.

Print out gpuid or gpuidx in debug messages.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

1a3b2b5d

drm/amdkfd: add svm range validate timestamp · 564d2b92

由 Felix Kuehling 提交于 2月 24, 2021

With xnack on, add validate timestamp in order to handle GPU vm fault
from multiple GPUs.

If GPU retry fault need migrate the range to the best restore location,
use range validate timestamp to record system timestamp after range is
restored to update GPU page table.

Because multiple pages of same range have multiple retry fault, define
AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING to the long time period that
pending retry fault may still comes after page table update, to skip
duplicate retry fault of same range.

If difference between system timestamp and range last validate timestamp
is bigger than AMDGPU_SVM_RANGE_RETRY_FAULT_PENDING, that means the
retry fault is from another GPU, then continue to handle retry fault
recover.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

564d2b92

drm/amdkfd: refine migration policy with xnack on · cda0f85b

由 Felix Kuehling 提交于 2月 24, 2021

With xnack on, GPU vm fault handler decide the best restore location,
then migrate range to the best restore location and update GPU mapping
to recover the GPU vm fault.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

cda0f85b

drm/amdkfd: add svm_bo eviction mechanism support · b41896e3

由 Felix Kuehling 提交于 2月 24, 2021

svm_bo eviction mechanism is different from regular BOs.
Every SVM_BO created contains one eviction fence and one
worker item for eviction process.
SVM_BOs can be attached to one or more pranges.
For SVM_BO eviction mechanism, TTM will start to call
enable_signal callback for every SVM_BO until VRAM space
is available.
Here, all the ttm_evict calls are synchronous, this guarantees
that each eviction has completed and the fence has signaled before
it returns.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b41896e3

drm/amdkfd: page table restore through svm API · 2383f56b

由 Felix Kuehling 提交于 2月 24, 2021

Page table restore implementation in SVM API. This is called from
the fault handler at amdgpu_vm. To update page tables through
the page fault retry IH.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

2383f56b

drm/amdkfd: invalidate tables on page retry fault · 90d7d3ed

由 Felix Kuehling 提交于 2月 24, 2021

GPU page tables are invalidated by unmapping prange directly at
the mmu notifier, when page fault retry is enabled through
amdgpu_noretry global parameter. The restore page table is
performed at the page fault handler.

If xnack is on, we update GPU mappings after migration to avoid
unnecessary GPUVM faults.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

90d7d3ed

drm/amdkfd: HMM migrate vram to ram · 48ff079b

由 Felix Kuehling 提交于 3月 17, 2021

If CPU page fault happens, HMM pgmap_ops callback migrate_to_ram start
migrate memory from vram to ram in steps:

1. migrate_vma_pages get vram pages, and notify HMM to invalidate the
pages, HMM interval notifier callback evict process queues
2. Allocate system memory pages
3. Use svm copy memory to migrate data from vram to ram
4. migrate_vma_pages copy pages structure from vram pages to ram pages
5. Return VM_FAULT_SIGBUS if migration failed, to notify application
6. migrate_vma_finalize put vram pages, page_free callback free vram
pages and vram nodes
7. Restore work wait for migration is finished, then update GPU page
table mapping to system memory, and resume process queues
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

48ff079b

drm/amdkfd: HMM migrate ram to vram · 0b0e518d

由 Felix Kuehling 提交于 2月 24, 2021

Register svm range with same address and size but perferred_location
is changed from CPU to GPU or from GPU to CPU, trigger migration the svm
range from ram to vram or from vram to ram.

If svm range prefetch location is GPU with flags
KFD_IOCTL_SVM_FLAG_HOST_ACCESS, validate the svm range on ram first,
then migrate it from ram to vram.

After migrating to vram is done, CPU access will have cpu page fault,
page fault handler migrate it back to ram and resume cpu access.

Migration steps:

1. migrate_vma_pages get svm range ram pages, notify the
interval is invalidated and unmap from CPU page table, HMM interval
notifier callback evict process queues
2. Allocate new pages in vram using TTM
3. Use svm copy memory to sdma copy data from ram to vram
4. migrate_vma_pages copy ram pages structure to vram pages structure
5. migrate_vma_finalize put ram pages to free ram pages and memory
6. Restore work wait for migration is finished, then update GPUs page
table mapping to new vram pages, resume process queues

If migrate_vma_setup failed to collect all ram pages of range, retry 3
times until success to start migration.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0b0e518d

drm/amdkfd: copy memory through gart table · 50ea50cf

由 Philip Yang 提交于 2月 07, 2020

Use sdma linear copy to migrate data between ram and vram. The sdma
linear copy command uses kernel buffer function queue to access system
memory through gart table.

Use reserved gart table window 0 to map system page address, and vram
page address is direct mapping. Use the same kernel buffer function to
fill in gart table mapping, so this is serialized with memory copy by
sdma job submit. We only need wait for the last memory copy sdma fence
for larger buffer migration.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

50ea50cf

drm/amdkfd: support xgmi same hive mapping · b53fa124

由 Philip Yang 提交于 6月 23, 2020

amdgpu_gmc_get_vm_pte use bo_va->is_xgmi same hive information to set
pte flags to update GPU mapping. Add local structure variable bo_va, and
update bo_va.is_xgmi, pass it to mapping->bo_va while mapping to GPU.

Assuming xgmi pstate is hi after boot.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

b53fa124

drm/amdkfd: validate vram svm range from TTM · e49fe404

由 Felix Kuehling 提交于 2月 24, 2021

If svm range perfetch location is not zero, use TTM to alloc
amdgpu_bo vram nodes to validate svm range, then map vram nodes to GPUs.

Use offset to sub allocate from the same amdgpu_bo to handle overlap
vram range while adding new range or unmapping range.

svm_bo has ref count to trace the shared ranges. If all ranges of shared
amdgpu_bo are migrated to ram, ref count becomes 0, then amdgpu_bo is
released, all ranges svm_bo is set to NULL.

To migrate range from ram back to vram, allocate the same amdgpu_bo
with previous offset if the range has svm_bo.

If prange migrate to VRAM, no CPU mapping exist, then process exit will
not have unmap callback for this prange to free prange and svm bo. Free
outstanding pranges from svms list before process is freed in
svm_range_list_fini.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

e49fe404

drm/amdkfd: set memory limit to avoid OOM with HMM enabled · c46ebb6a

由 Philip Yang 提交于 4月 15, 2021

HMM migration alloc sizeof(struct page) on system memory for each VRAM
page, it is 1GB system memory reserved for 64GB VRAM. To avoid
application OOM, increase system memory used size based on VRAM size of
all GPUs, then application alloc memory will fail if system memory usage
reach the limit.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NOak Zeng <Oak.Zeng@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

c46ebb6a

drm/amdkfd: register HMM device private zone · 814ab993

由 Philip Yang 提交于 2月 07, 2020

Register vram memory as MEMORY_DEVICE_PRIVATE type resource, to
allocate vram backing pages for page migration.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

814ab993

drm/amdkfd: add ioctl to configure and query xnack retries · 0f7b5c44

由 Alex Sierra 提交于 5月 28, 2020

Xnack retries are used for page fault recovery. Some AMD chip
families support continuously retry while page table entries are invalid.
The driver must handle the page fault interrupt and fill in a valid entry
for the GPU to continue.

This ioctl allows to enable/disable XNACK retries per KFD process.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

0f7b5c44

drm/amdkfd: add xnack enabled flag to kfd_process · 063e33c5

由 Alex Sierra 提交于 5月 28, 2020

XNACK mode controls the SQ RETRY_DISABLE setting that determines,
whether recoverable page faults can be supported on GFXv9 hardware.
Only on Aldebaran we can support different processes running with
different XNACK modes. On older chips all processes must use the same
RETRY_DISABLE setting. However, processes not relying on recoverable
page faults can work with RETRY enabled. This means XNACK off is always
available as a fallback so we can use the same mode on all GPUs in a
process.
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

063e33c5

drm/amdkfd: svm range eviction and restore · 8a7c184a

由 Felix Kuehling 提交于 2月 24, 2021

HMM interval notifier callback notify CPU page table will be updated,
stop process queues if the updated address belongs to svm range
registered in process svms objects tree. Scheduled restore work to
update GPU page table using new pages address in the updated svm range.

The restore worker flushes any deferred work to make sure it restores
an up-to-date svm_range_list.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

8a7c184a

drm/amdkfd: map svm range to GPUs · f80fe9d3

由 Felix Kuehling 提交于 2月 24, 2021

Use amdgpu_vm_bo_update_mapping to update GPU page table to map or unmap
svm range system memory pages address to GPUs.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

f80fe9d3

drm/amdkfd: deregister svm range · 4683cfec

由 Philip Yang 提交于 4月 08, 2020

When application explicitly call unmap or unmap from mmput when
application exit, driver will receive MMU_NOTIFY_UNMAP event to remove
svm range from process svms object tree and list first, unmap from GPUs
(in the following patch).

Split the svm ranges to handle partial unmapping of svm ranges. To
avoid deadlocks, updating MMU notifiers, range lists and interval trees
is done in a deferred worker. New child ranges are attached to their
parent range's child_list until the worker can update the
svm_range_list. svm_range_set_attr flushes deferred work and takes the
mmap_write_lock to guarantee that it has an up-to-date svm_range_list.
Signed-off-by: NPhilip Yang <Philip.Yang@amd.com>
Signed-off-by: NAlex Sierra <alex.sierra@amd.com>
Reviewed-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NFelix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: NAlex Deucher <alexander.deucher@amd.com>

4683cfec

openeuler / Kernel 接近 2 年 前同步成功

openeuler / Kernel
接近 2 年前同步成功