- 09 Oct 2018, 19 commits
-
-
Committed by Paul Mackerras
This starts the process of adding the code to support nested HV-style virtualization. It defines a new H_SET_PARTITION_TABLE hypercall which a nested hypervisor can use to set the base address and size of a partition table in its memory (analogous to the PTCR register). On the host (level 0 hypervisor) side, the H_SET_PARTITION_TABLE hypercall from the guest is handled by code that saves the virtual PTCR value for the guest. This also adds code for creating and destroying nested guests and for reading the partition table entry for a nested guest from L1 memory. Each nested guest has its own shadow LPID value, different in general from the LPID value used by the nested hypervisor to refer to it. The shadow LPID value is allocated at nested guest creation time. Nested hypervisor functionality is only available for a radix guest, which therefore means a radix host on a POWER9 (or later) processor.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
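
A minimal sketch of the L0 side of the new hypercall, assuming illustrative names (the real handler also validates the address and size against L1 memory; the l1_ptcr field name is an assumption):

    /* Sketch: L0 handling of H_SET_PARTITION_TABLE from an L1 guest. */
    static long kvmhv_set_partition_table(struct kvm_vcpu *vcpu)
    {
        unsigned long ptcr = kvmppc_get_gpr(vcpu, 4);   /* base | size */

        /* Save the nested hypervisor's virtual PTCR; it is consulted
         * later to read partition-table entries for nested guests
         * from L1 memory. */
        vcpu->kvm->arch.l1_ptcr = ptcr;                 /* assumed field */
        return H_SUCCESS;
    }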
-
Committed by Paul Mackerras
kvmppc_unmap_pte() does a sequence of operations that are open-coded in kvm_unmap_radix(). This extends kvmppc_unmap_pte() a little so that it can be used by kvm_unmap_radix(), and makes kvm_unmap_radix() call it.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Suraj Jitindar Singh
The radix page fault handler accounts for all cases, including just needing to insert a pte. This breaks it up into separate functions for the two main cases: setting rc and inserting a pte. This allows us to make the setting of rc and inserting of a pte generic for any pgtable, not specific to the one for this guest.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Suraj Jitindar Singh
kvmppc_mmu_radix_xlate() is used to translate an effective address through the process tables. The process table and partition table have an identical layout. Exploit this fact to make the kvmppc_mmu_radix_xlate() function able to translate either an effective address through the process tables or a guest real address through the partition tables.

[paulus@ozlabs.org - reduced diffs from previous code]

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Suraj Jitindar Singh
When destroying a VM we return the LPID to the pool; however, we never zero the partition table entry. This is instead done when we reallocate the LPID. Zero the partition table entry on VM teardown before returning the LPID to the pool. This means that if we were running as a nested hypervisor, the real hypervisor could use this to determine when it can free resources.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
When the 'regs' field was added to struct kvm_vcpu_arch, the code was changed to use several of the fields inside regs (e.g., gpr, lr, etc.) but not the ccr field, because the ccr field in struct pt_regs is 64 bits on 64-bit platforms, but the cr field in kvm_vcpu_arch is only 32 bits. This changes the code to use the regs.ccr field instead of cr, and changes the assembly code on 64-bit platforms to use 64-bit loads and stores instead of 32-bit ones.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This adds a file called 'radix' in the debugfs directory for the guest, which, when read, gives all of the valid leaf PTEs in the partition-scoped radix tree for a radix guest, in human-readable format. It is analogous to the existing 'htab' file which dumps the HPT entries for an HPT guest.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
Currently the code for handling hypervisor instruction page faults passes 0 for the flags indicating the type of fault, which is OK in the usual case that the page is not mapped in the partition-scoped page tables. However, there are other causes for hypervisor instruction page faults, such as not being able to update a reference (R) or change (C) bit. The cause is indicated in bits in HSRR1, including a bit which indicates that the fault is due to not being able to write to a page (for example to update an R or C bit). Not handling these other kinds of faults correctly can lead to a loop of continual faults without forward progress in the guest. In order to handle these faults better, this patch constructs a "DSISR-like" value from the bits which DSISR and SRR1 (for a HISI) have in common, and passes it to kvmppc_book3s_hv_page_fault() so that it knows what caused the fault.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
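
A sketch of how such a value might be constructed (DSISR_SRR1_MATCH_64S masks the fault bits that DSISR and SRR1 encode at the same positions; the HISI write-fault macro name below is an assumption for illustration):

    /* In the HV exit handler, for a hypervisor instruction storage
     * interrupt (HISI): build a DSISR-like fault code from SRR1. */
    case BOOK3S_INTERRUPT_H_INST_STORAGE:
        vcpu->arch.fault_dar = kvmppc_get_pc(vcpu);
        vcpu->arch.fault_dsisr = vcpu->arch.shregs.msr &
                DSISR_SRR1_MATCH_64S;
        if (vcpu->arch.shregs.msr & HSRR1_HISI_WRITE)   /* assumed name */
                vcpu->arch.fault_dsisr |= DSISR_ISSTORE;
        r = RESUME_PAGE_FAULT;
        break;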
-
Committed by Paul Mackerras
This creates an alternative guest entry/exit path which is used for radix guests on POWER9 systems when we have indep_threads_mode=Y. In these circumstances there is exactly one vcpu per vcore and there is no coordination required between vcpus or vcores; the vcpu can enter the guest without needing to synchronize with anything else. The new fast path is implemented almost entirely in C in book3s_hv.c and runs with the MMU on until the guest is entered. On guest exit we use the existing path until the point where we are committed to exiting the guest (as distinct from handling an interrupt in the low-level code and returning to the guest) and we have pulled the guest context from the XIVE. At that point we check a flag in the stack frame to see whether we came in via the old path or the new path; if we came in via the new path then we go back to C code to do the rest of the process of saving the guest context and restoring the host context. The C code is split into separate functions for handling the OS-accessible state and the hypervisor state, with the idea that the latter can be replaced by a hypercall when we implement nested virtualization.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix CONFIG_ALTIVEC=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
Currently kvmppc_handle_exit_hv() is called with the vcore lock held because it is called within a for_each_runnable_thread loop. However, we already unlock the vcore within kvmppc_handle_exit_hv() under certain circumstances, and this is safe because (a) any vcpus that become runnable and are added to the runnable set by kvmppc_run_vcpu() have their vcpu->arch.trap == 0 and can't actually run in the guest (because the vcore state is VCORE_EXITING), and (b) for_each_runnable_thread is safe against addition or removal of vcpus from the runnable set. Therefore, in order to simplify things for following patches, let's drop the vcore lock in the for_each_runnable_thread loop, so kvmppc_handle_exit_hv() gets called without the vcore lock held.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This adds a parameter to __kvmppc_save_tm and __kvmppc_restore_tm which allows the caller to indicate whether it wants the nonvolatile register state to be preserved across the call, as required by the C calling conventions. This parameter being non-zero also causes the MSR bits that enable TM, FP, VMX and VSX to be preserved. The condition register and DSCR are now always preserved. With this, kvmppc_save_tm_hv and kvmppc_restore_tm_hv can be called from C code provided the 3rd parameter is non-zero. So that these functions can be called from modules, they now include code to set the TOC pointer (r2) on entry, as they can call other built-in C functions which will assume the TOC to have been set. Also, the fake suspend code in kvmppc_save_tm_hv is modified here to assume that treclaim in fake-suspend state does not modify any registers, which is the case on POWER9. This enables the code to be simplified quite a bit. _kvmppc_save_tm_pr and _kvmppc_restore_tm_pr become much simpler with this change, since they now only need to save and restore TAR and pass 1 for the 3rd argument to __kvmppc_{save,restore}_tm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This streamlines the first part of the code that handles a hypervisor interrupt that occurred in the guest. With this, all of the real-mode handling that occurs is done before the "guest_exit_cont" label; once we get to that label we are committed to exiting to host virtual mode. Thus the machine check and HMI real-mode handling is moved before that label. Also, the code to handle external interrupts is moved out of line, as is the code that calls kvmppc_realmode_hmi_handler().

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This pulls out the assembler code that is responsible for saving and restoring the PMU state for the host and guest into separate functions so they can be used from an alternate entry path. The calling convention is made compatible with C.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Reviewed-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This is based on a patch by Suraj Jitindar Singh. This moves the code in book3s_hv_rmhandlers.S that generates an external, decrementer or privileged doorbell interrupt just before entering the guest to C code in book3s_hv_builtin.c. This is to make future maintenance and modification easier. The algorithm expressed in the C code is almost identical to the previous algorithm.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
This removes code that clears the external interrupt pending bit in the pending_exceptions bitmap. This is left over from an earlier iteration of the code where this bit was set when an escalation interrupt arrived in order to wake the vcpu from cede. Currently we set the vcpu->arch.irq_pending flag instead for this purpose. Therefore there is no need to do anything with the pending_exceptions bitmap.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Paul Mackerras
Currently we use two bits in the vcpu pending_exceptions bitmap to indicate that an external interrupt is pending for the guest: one for "one-shot" interrupts that are cleared when delivered, and one for interrupts that persist until cleared by an explicit action of the OS (e.g. an acknowledge to an interrupt controller). The BOOK3S_IRQPRIO_EXTERNAL bit is used for one-shot interrupt requests and BOOK3S_IRQPRIO_EXTERNAL_LEVEL is used for persisting interrupts. In practice BOOK3S_IRQPRIO_EXTERNAL never gets used, because our Book3S platforms generally, and pseries in particular, expect external interrupt requests to persist until they are acknowledged at the interrupt controller. That, combined with the confusion introduced by having two bits for what is essentially the same thing, makes it attractive to simplify things by only using one bit. This patch does that. With this patch there is only BOOK3S_IRQPRIO_EXTERNAL, and by default it has the semantics of a persisting interrupt. In order to avoid breaking the ABI, we introduce a new "external_oneshot" flag which preserves the behaviour of the KVM_INTERRUPT ioctl with the KVM_INTERRUPT_SET argument.

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
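
A sketch of how the ABI can be preserved with a single pending bit, using the flag named in the description (illustrative, not the exact mainline code):

    /* When userspace requests a one-shot external interrupt, remember
     * that the pending bit must be cleared again on delivery. */
    if (irq->irq == KVM_INTERRUPT_SET)
        vcpu->arch.external_oneshot = 1;
    kvmppc_book3s_queue_irqprio(vcpu, BOOK3S_INTERRUPT_EXTERNAL);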
-
Committed by Alexey Kardashevskiy
The kvmppc_gpa_to_ua() helper itself takes care of the permission bits in the TCE and yet every single caller removes them. This changes the semantics of kvmppc_gpa_to_ua() so it takes TCEs (which are GPAs + TCE permission bits) to make the callers simpler. This should cause no behavioural change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Alexey Kardashevskiy
At the moment, if the PUT_TCE{_INDIRECT} handlers fail to update the hardware tables, we print a warning once, clear the entry and continue. This was done because, at the time, the assumption was that if a VFIO device is hotplugged into the guest and the userspace replays virtual DMA mappings (i.e. TCEs) to the hardware tables, and this fails, then there is nothing useful we can do about it. However, the assumption is not valid, as these handlers are not called for TCE replay (the VFIO ioctl interface is used for that); these handlers are for new TCEs. This returns an error to the guest if there is a request which cannot be processed. By now, the only possible failure is H_TOO_HARD.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Alexey Kardashevskiy
The userspace can request an arbitrary supported page size for a DMA window, and this works fine as long as the mapped memory is backed with pages of the same or bigger size; if this is not the case, mm_iommu_ua_to_hpa{_rm}() fail and the tables do not get populated with dangerously incorrect TCEs. However, since it is quite easy to misconfigure KVM, and we do not revert all changes made to TCE tables if an error happens in the middle, we had better do the acceptable page size validation before we even touch the tables. This enhances kvmppc_tce_validate() to check the hardware IOMMU page sizes against the preregistered memory page sizes. Since the new check uses real/virtual mode helpers, this renames kvmppc_tce_validate() to kvmppc_rm_tce_validate() to handle the real mode case, and mirrors it for the virtual mode under the old name. The real mode handler is not used for the virtual mode because:

1. it uses _lockless() list traversing primitives instead of RCU;
2. realmode's mm_iommu_ua_to_hpa_rm() uses vmalloc_to_phys(), which virtual mode does not have to use, and since on POWER9+radix only virtual mode handlers actually work, we do not want to slow down that path even a bit.

This removes EXPORT_SYMBOL_GPL(kvmppc_tce_validate) as the validators are static now. From now on, attempts to map IOMMU pages bigger than allowed will result in a KVM exit.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
[mpe: Fix KVM_HV=n build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 02 Oct 2018, 2 commits
-
-
Committed by Alexey Kardashevskiy
We return H_TOO_HARD from TCE update handlers when we think that the next handler (realmode -> virtual mode -> user mode) has a chance to handle the request; H_HARDWARE/H_CLOSED otherwise. This changes the handlers to return H_TOO_HARD on every error, giving the userspace an opportunity to handle any request, or at least log them all.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Alexey Kardashevskiy
The KVM TCE handlers are written in a way such that they fail when either something went horribly wrong or the userspace made some obvious mistake, such as passing a misaligned address. We are going to enhance the TCE checker to fail on attempts to map a bigger IOMMU page than the underlying pinned memory, so let's validate the TCE beforehand. This should cause no behavioral change.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 12 Sep 2018, 2 commits
-
-
Committed by Nicholas Piggin
THP paths can defer splitting compound pages until after the actual remap and TLB flushes to split a huge PMD/PUD. This causes radix partition scope page table mappings to get out of sync with the host qemu page table mappings. This results in random memory corruption in the guest when running with THP. The easiest way to reproduce is to use the KVM balloon to free up a lot of memory in the guest and then shrink the balloon to give the memory back, while some work is being done in the guest.

Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: kvm-ppc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Committed by Alexey Kardashevskiy
At the moment the real mode handler of H_PUT_TCE calls iommu_tce_xchg_rm(), which in turn reads the old TCE and, if it was a valid entry, marks the physical page dirty if it was mapped for writing. Since it is in real mode, realmode_pfn_to_page() is used instead of pfn_to_page() to get the page struct. However, SetPageDirty() itself reads the compound page head and returns a virtual address for the head page struct, and setting the dirty bit for that kills the system. This adds additional dirty bit tracking into the MM/IOMMU API for use in real mode. Note that this does not change how VFIO and KVM (in virtual mode) set this bit. The KVM (real mode) changes include:

- use the lowest bit of the cached host phys address to carry the dirty bit;
- mark pages dirty when they are unpinned, which happens when the preregistered memory is released, which always happens in virtual mode;
- add a mm_iommu_ua_mark_dirty_rm() helper to set the delayed dirty bit;
- change iommu_tce_xchg_rm() to take the kvm struct for the mm to use in the new mm_iommu_ua_mark_dirty_rm() helper;
- move iommu_tce_xchg_rm() to book3s_64_vio_hv.c (which is the only caller anyway) to reduce the real mode KVM and IOMMU knowledge across different subsystems.

This removes realmode_pfn_to_page() as it is not used anymore. While we are at it, remove some EXPORT_SYMBOL_GPL() as that code is for real mode only and modules cannot call it anyway.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
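
A condensed sketch of the delayed dirty tracking described above (close to, but not necessarily identical to, the mainline helper): the cached host physical addresses are page aligned, so bit 0 is free to carry the flag.

    #define MM_IOMMU_TABLE_GROUP_PAGE_DIRTY 0x1UL

    void mm_iommu_ua_mark_dirty_rm(struct mm_struct *mm, unsigned long ua)
    {
        struct mm_iommu_table_group_mem_t *mem;

        /* real-mode safe lookup of the preregistered region */
        mem = mm_iommu_lookup_rm(mm, ua, PAGE_SIZE);
        if (mem)
            mem->hpas[(ua - mem->ua) >> PAGE_SHIFT] |=
                    MM_IOMMU_TABLE_GROUP_PAGE_DIRTY;
    }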
-
- 24 Aug 2018, 1 commit
-
-
Committed by Finn Thain
Also add these typos to spelling.txt so checkpatch.pl will look for them.

Link: http://lkml.kernel.org/r/88af06b9de34d870cb0afc46cfd24e0458be2575.1529471371.git.fthain@telegraphics.com.au
Signed-off-by: Finn Thain <fthain@telegraphics.com.au>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 21 Aug 2018, 1 commit
-
-
Committed by Luke Dashjr
this_cpu_disable_ftrace and this_cpu_enable_ftrace are inlines in ftrace.h. Without it included, the build fails.

Fixes: a4bc64d3 ("powerpc64/ftrace: Disable ftrace during kvm entry/exit")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Luke Dashjr <luke-jr+git@utopios.org>
Acked-by: Naveen N. Rao <naveen.n.rao at linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 20 Aug 2018, 1 commit
-
-
Committed by Paul Mackerras
This fixes a bug which causes guest virtual addresses to be translated to guest real addresses incorrectly when the guest is using the HPT MMU and has more than 256GB of RAM, or more specifically has an HPT larger than 2GB. This has showed up in testing as a failure of the host to emulate doorbell instructions correctly on POWER9 for HPT guests with more than 256GB of RAM. The bug is that the HPTE index in kvmppc_mmu_book3s_64_hv_xlate() is stored as an int, and in forming the HPTE address, the index gets shifted left 4 bits as an int before being sign-extended to 64 bits. The simple fix is to make the variable a long int, matching the return type of kvmppc_hv_find_lock_hpte(), which is what calculates the index.

Fixes: 697d3899 ("KVM: PPC: Implement MMIO emulation support for Book3S HV guests")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
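
A standalone illustration of the overflow (values chosen to show the failure mode; the usual two's-complement behaviour is assumed):

    int bad_index = 0x08000000;     /* first index past the 2GB mark */
    unsigned long bad = (unsigned long)(bad_index << 4);
                                    /* 0xffffffff80000000 - wrong */

    long index = 0x08000000;        /* the fix: use a long */
    unsigned long good = (unsigned long)(index << 4);
                                    /* 0x0000000080000000 - correct */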
-
- 18 Aug 2018, 1 commit
-
-
Committed by Marek Szyprowski
cma_alloc() doesn't really support gfp flags other than __GFP_NOWARN, so convert the gfp_mask parameter to a boolean no_warn parameter. This will help to avoid giving the false impression that this function supports standard gfp flags and that callers can pass __GFP_ZERO to get a zeroed buffer, which has already been an issue: see commit dd65a941 ("arm64: dma-mapping: clear buffers allocated with FORCE_CONTIGUOUS flag").

Link: http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqTPJKij2939229392eucas1p2j@eucas1p2.samsung.com
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michał Nazarewicz <mina86@mina86.com>
Acked-by: Laura Abbott <labbott@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Joonsoo Kim <js1304@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
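
For reference, the resulting signature (declared in include/linux/cma.h); callers that previously passed __GFP_NOWARN now pass no_warn = true:

    struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
                           bool no_warn);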
-
- 15 Aug 2018, 1 commit
-
-
Committed by Paul Mackerras
Since commit e641a317 ("KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix", 2017-10-26), kvm_unmap_radix() computes the number of PAGE_SIZEd pages being unmapped and passes it to kvmppc_update_dirty_map(), which expects to be passed the page size instead. Consequently it will only mark one system page dirty even when a large page (for example a THP page) is being unmapped. The consequence of this is that part of the THP page might not get copied during live migration, resulting in memory corruption for the guest. This fixes it by computing and passing the page size in kvm_unmap_radix().

Cc: stable@vger.kernel.org # v4.15+
Fixes: e641a317 (KVM: PPC: Book3S HV: Unify dirty page map between HPT and radix)
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
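
A sketch of the fix (the shift value is assumed to come from the PTE that kvm_unmap_radix() found):

    /* Report the mapping's size, not the number of small pages, so
     * every PAGE_SIZEd subpage of a large page is marked dirty. */
    unsigned long psize = 1UL << shift;

    kvmppc_update_dirty_map(memslot, gfn, psize);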
-
- 30 Jul 2018, 4 commits
-
-
Committed by Christophe Leroy
asm/tlbflush.h is only needed for:

- using functions xxx_flush_tlb_xxx()
- using MMU_NO_CONTEXT
- including asm-generic/pgtable.h

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Christophe Leroy
Files not using feature fixups don't need asm/feature-fixups.h; files using feature fixups do need asm/feature-fixups.h.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Christophe Leroy
Only include linux/stringify.h in files using __stringify().

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Christophe Leroy
This patch moves ASM_CONST() and stringify_in_c() into a dedicated asm-const.h, then cleans up all related inclusions.

Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
[mpe: asm-compat.h should include asm-const.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 26 Jul 2018, 2 commits
-
-
Committed by Paul Mackerras
Commit 1e175d2e ("KVM: PPC: Book3S HV: Pack VCORE IDs to access full VCPU ID space", 2018-07-25) added code that uses kvm->arch.emul_smt_mode before any VCPUs are created. However, userspace can change kvm->arch.emul_smt_mode at any time up until the first VCPU is created. Hence it is (theoretically) possible for the check in kvmppc_core_vcpu_create_hv() to race with another userspace thread changing kvm->arch.emul_smt_mode. This fixes it by moving the test that uses kvm->arch.emul_smt_mode into the block where kvm->lock is held.

Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Committed by Sam Bobroff
It is not currently possible to create the full number of possible VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses fewer threads per core than its core stride (or "VSMT mode"). This is because the VCORE ID and XIVE offsets grow beyond KVM_MAX_VCPUS even though the VCPU ID is less than KVM_MAX_VCPU_ID. To address this, "pack" the VCORE ID and XIVE offsets by using knowledge of the way the VCPU IDs will be used when there are fewer guest threads per core than the core stride. The primary thread of each core will always be used first. Then, if the guest uses more than one thread per core, these secondary threads will sequentially follow the primary in each core. So, the only way an ID above KVM_MAX_VCPUS can be seen is if the VCPUs are being spaced apart, so at least half of each core is empty, and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped into the second half of each core (4..7, in an 8-thread core). Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of each core is being left empty, and we can map down into the second and third quarters of each core (2, 3 and 5, 6 in an 8-thread core). Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary threads are being used and 7/8 of the core is empty, allowing use of the 1, 5, 3 and 7 thread slots. (Strides less than 8 are handled similarly.) This allows the VCORE ID or offset to be calculated quickly from the VCPU ID or XIVE server numbers, without access to the VCPU structure.

[paulus@ozlabs.org - tidied up comment a little, changed some WARN_ONCE to pr_devel, wrapped line, fixed id check.]

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
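
A simplified sketch of the packing described above (an 8-thread core is assumed and the helper name is illustrative): each multiple of KVM_MAX_VCPUS that the ID exceeds selects one of the thread slots that the guest's ID spacing is known to leave empty, in the order 0, 4, 2, 6, 1, 5, 3, 7.

    static u32 pack_vcpu_id(struct kvm *kvm, u32 id)
    {
        const int block_offsets[8] = { 0, 4, 2, 6, 1, 5, 3, 7 };
        int stride = kvm->arch.emul_smt_mode;  /* guest threads per core */
        int block = (id / KVM_MAX_VCPUS) * (8 / stride);

        return (id % KVM_MAX_VCPUS) + block_offsets[block];
    }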
-
- 18 Jul 2018, 5 commits
-
-
Committed by Alexey Kardashevskiy
A VM which has:

- a DMA capable device passed through to it (e.g. a network card);
- a malicious kernel running that ignores H_PUT_TCE failure;
- the capability of using IOMMU pages bigger than physical pages

can create an IOMMU mapping that exposes (for example) 16MB of the host physical memory to the device when only 64K was allocated to the VM. The remaining 16MB - 64K will be some other content of host memory, possibly including pages of the VM, but also pages of host kernel memory, host programs or other VMs. The attacking VM does not control the location of the page it can map, and is only allowed to map as many pages as it has pages of RAM. We already have a check in drivers/vfio/vfio_iommu_spapr_tce.c that an IOMMU page is contained in the physical page, so the PCI hardware won't get access to unassigned host memory; however, this check is missing in the KVM fastpath (H_PUT_TCE accelerated code). We have been lucky so far and have not hit this yet: the very first time the mapping happens we do not have tbl::it_userspace allocated yet and fall back to the userspace, which in turn calls the VFIO IOMMU driver; this fails and the guest does not retry. This stores the smallest preregistered page size in the preregistered region descriptor and changes the mm_iommu_xxx API to check this against the IOMMU page size. This calculates the maximum page size as a minimum of the natural region alignment and the compound page size. For the page shift this uses the shift returned by find_linux_pte(), which indicates how the page is mapped to the current userspace; if the page is huge and this is not zero, then it is a leaf pte and the page is mapped within the range.

Fixes: 121f80ba ("KVM: PPC: VFIO: Add in-kernel acceleration for VFIO")
Cc: stable@vger.kernel.org # v4.12+
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
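
A condensed sketch of the added containment check, following the post-change mm_iommu_ua_to_hpa() signature (the body is simplified):

    long mm_iommu_ua_to_hpa(struct mm_iommu_table_group_mem_t *mem,
                            unsigned long ua, unsigned int pageshift,
                            unsigned long *hpa)
    {
        const long entry = (ua - mem->ua) >> PAGE_SHIFT;

        if (entry >= mem->entries)
            return -EFAULT;
        if (pageshift > mem->pageshift)
            return -EFAULT; /* IOMMU page exceeds the backing page */

        *hpa = mem->hpas[entry] | (ua & ~PAGE_MASK);
        return 0;
    }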
-
Committed by Nicholas Mc Guire
The constants are 64-bit but not explicitly declared UL, resulting in sparse warnings. Fix this by declaring the constants UL.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Committed by Nicholas Mc Guire
The call to of_find_compatible_node() returns a pointer with an incremented refcount, so it must be explicitly decremented after the last use. As it is only being used here to check for node presence, and the result is not actually used in the success path, it can be dropped immediately.

Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
Fixes: f725758b ("KVM: PPC: Book3S HV: Use OPAL XICS emulation on POWER9")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Committed by Alexey Kardashevskiy
When attaching a hardware table to a LIOBN in KVM, we match table parameters such as the page size, table offset and table size. However, the tables are created via very different paths - VFIO and KVM - and the VFIO path goes through the platform code, which has a minimum TCE page size requirement (which is 4K, but since we allocate memory by pages and cannot avoid alignment anyway, we align to 64k pages for powernv_defconfig). So when we match the tables, one might be bigger than the other, which means the hardware table cannot get attached to the LIOBN and DMA mapping fails. This removes the table size alignment from the guest visible table. This does not affect the memory allocation, which is still aligned - kvmppc_tce_pages() takes care of this. This relaxes the check we do when attaching tables to allow the hardware table to be bigger than the guest visible table. Ideally we want the KVM table to cover the same space as the hardware table does, but since the hardware table may use multiple levels, and all levels must use the same table size (IODA2 design), the area it can actually cover might get very different from the window size which the guest requested, even though the guest won't map it all.

Fixes: ca1fc489 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
Committed by Simon Guo
Originally, PPC KVM MMIO emulation used only 0~31# (5 bits) for the VSR reg number, and used the mmio_vsx_tx_sx_enabled field together with it to cover the 0~63# VSR regs. Currently PPC KVM MMIO emulation is reimplemented with analyse_instr() assistance. analyse_instr() returns 0~63 for the VSR register number, so it is no longer necessary to use the additional mmio_vsx_tx_sx_enabled field. This patch extends the related reg bits (expands io_gpr to u16 from u8 and uses 6 bits for the VSR reg#), so that mmio_vsx_tx_sx_enabled can be removed.

Signed-off-by: Simon Guo <wei.guo.simon@gmail.com>
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
-
- 16 Jul 2018, 1 commit
-
-
Committed by Alexey Kardashevskiy
At the moment we allocate the entire TCE table, twice (the hardware part and the userspace translation cache). This normally works as we normally have contiguous memory and the guest will map the entire RAM for 64bit DMA. However, if we have sparse RAM (one example is a memory device), then we will allocate TCEs which will never be used, as the guest only maps actual memory for DMA. If it is a single level TCE table, there is nothing we can really do, but if it is a multilevel table, we can skip allocating TCEs we know we won't need. This adds the ability to allocate only the first level, saving memory. This changes iommu_table::free() to avoid allocating an extra level; iommu_table::set() will do this when needed. This adds an @alloc parameter to iommu_table::exchange() to tell the callback whether it can allocate an extra level; the flag is set to "false" for the realmode KVM handlers of H_PUT_TCE hcalls, and the callback returns H_TOO_HARD. This still requires the entire table to be counted in mm::locked_vm. To be conservative, this only does on-demand allocation when the userspace cache table is requested, which is the case for VFIO. The example math for a system replicating a powernv setup with NVLink2 in a guest:

16GB RAM mapped at 0x0
128GB GPU RAM window (16GB of actual RAM) mapped at 0x244000000000

The table to cover all of that with 64K pages takes:

(((0x244000000000 + 0x2000000000) >> 16)*8)>>20 = 4656MB

If we allocate only the necessary TCE levels, we will only need:

(((0x400000000 + 0x400000000) >> 16)*8)>>20 = 4MB

(plus some for indirect levels).

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
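
The example arithmetic above, wrapped as a helper for clarity (illustrative only): one 8-byte TCE per 64K IOMMU page, with the result in MB.

    static unsigned long tce_table_mb(unsigned long window_top)
    {
        return ((window_top >> 16) * 8) >> 20;
    }

    /* tce_table_mb(0x244000000000 + 0x2000000000) == 4656 */
    /* tce_table_mb(0x400000000 + 0x400000000) == 4 */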
-