    powerpc/powernv/ioda: Allocate indirect TCE levels on demand · a68bd126
    Committed by Alexey Kardashevskiy
    At the moment we allocate the entire TCE table, twice (hardware part and
    userspace translation cache). This usually works because memory is
    normally contiguous and the guest maps all of its RAM for 64bit DMA.
    
    However, if we have sparse RAM (one example is a memory device), then
    we will allocate TCEs which will never be used, as the guest only maps
    actual memory for DMA. If it is a single level TCE table, there is nothing
    we can really do, but if it is a multilevel table, we can skip allocating
    TCEs we know we won't need.
    
    This adds the ability to allocate only the first level, saving memory.
    
    This changes iommu_table::free() to avoid allocating an extra level;
    iommu_table::set() will do this when needed.
    
    This adds an @alloc parameter to iommu_table::exchange() to tell the
    callback if it can allocate an extra level; the flag is set to "false"
    for the realmode KVM handlers of H_PUT_TCE hcalls, and the callback
    returns H_TOO_HARD when it would need to allocate a level.
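
The on-demand scheme described above can be sketched in userspace C. This is
only an illustration, not the kernel code: the names (sketch_table,
sketch_entry, sketch_exchange), the two-level layout, the 512-entry level
size and the -1 return value standing in for H_TOO_HARD are all assumptions
made for the example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define LEVEL_BITS 9                      /* hypothetical: 512 entries per level */
#define LEVEL_SIZE (1UL << LEVEL_BITS)

/* Hypothetical two-level table: only the first level exists up front. */
struct sketch_table {
	uint64_t *level1[LEVEL_SIZE];     /* NULL until a second level is needed */
};

/*
 * Return a pointer to the entry for idx, or NULL if the second level is
 * missing and the caller may not allocate (or the allocation failed).
 */
uint64_t *sketch_entry(struct sketch_table *t, unsigned long idx, bool alloc)
{
	unsigned long hi = (idx >> LEVEL_BITS) & (LEVEL_SIZE - 1);
	unsigned long lo = idx & (LEVEL_SIZE - 1);

	if (!t->level1[hi]) {
		if (!alloc)
			return NULL;
		t->level1[hi] = calloc(LEVEL_SIZE, sizeof(uint64_t));
		if (!t->level1[hi])
			return NULL;
	}
	return &t->level1[hi][lo];
}

/*
 * Swap *val with the stored entry. Returns 0 on success and -1 (standing in
 * for H_TOO_HARD) when the level is missing and alloc is false.
 */
int sketch_exchange(struct sketch_table *t, unsigned long idx,
		    uint64_t *val, bool alloc)
{
	uint64_t *entry = sketch_entry(t, idx, alloc);
	uint64_t old;

	if (!entry)
		return -1;
	old = *entry;
	*entry = *val;
	*val = old;
	return 0;
}
```

A caller that must not allocate passes alloc = false and retries from a
context where allocation is allowed if it gets the "too hard" result.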
    
    This still requires the entire table to be counted in mm::locked_vm.
    
    To be conservative, this only does on-demand allocation when
    the userspace cache table is requested, which is the case for VFIO.
    
    The example math for a system replicating a powernv setup with NVLink2
    in a guest:
    16GB RAM mapped at 0x0
    128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000
    
    A table covering all of that with 64K pages takes:
    (((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB
    
    If we allocate only necessary TCE levels, we will only need:
    (((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB (plus some for indirect
    levels).
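
The arithmetic above can be reproduced with a small helper. This is a sketch
for checking the numbers only; tce_table_mb is a hypothetical name, and the
8-byte entry size is taken from the "*8" in the formulas.

```c
#include <stdint.h>

/*
 * Size in MB of a flat TCE table covering `span` bytes of DMA address
 * space, with one 8-byte TCE per IOMMU page of (1 << page_shift) bytes.
 */
uint64_t tce_table_mb(uint64_t span, unsigned int page_shift)
{
	return ((span >> page_shift) * 8) >> 20;
}
```

Calling it with the full span (window start plus window size) and with only
the sum of the actually populated RAM, both with page_shift = 16, gives the
two figures being compared.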
    Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
    Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
pci-ioda-tce.c 9.6 KB