- 02 4月, 2022 13 次提交
-
-
由 David Woodhouse 提交于
In order to intercept hypercalls such as VCPUOP_set_singleshot_timer, we need to be aware of the Xen CPU numbering. This looks a lot like the Hyper-V handling of vpidx, for obvious reasons. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-12-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Joao Martins 提交于
Userspace registers a sending @port to either deliver to an @eventfd or directly back to a local event channel port. After binding events the guest or host may wish to bind those events to a particular vcpu. This is usually done for unbound and and interdomain events. Update requests are handled via the KVM_XEN_EVTCHN_UPDATE flag. Unregistered ports are handled by the emulator. Co-developed-by: NAnkur Arora <ankur.a.arora@oracle.com> Co-developed-By: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Signed-off-by: NAnkur Arora <ankur.a.arora@oracle.com> Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-10-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
This switches the final pvclock to kvm_setup_pvclock_pfncache() and now the old kvm_setup_pvclock_page() can be removed. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-7-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
Currently, the fast path of kvm_xen_set_evtchn_fast() doesn't set the index bits in the target vCPU's evtchn_pending_sel, because it only has a userspace virtual address with which to do so. It just sets them in the kernel, and kvm_xen_has_interrupt() then completes the delivery to the actual vcpu_info structure when the vCPU runs. Using a gfn_to_pfn_cache allows kvm_xen_set_evtchn_fast() to do the full delivery in the common case. Clean up the fallback case too, by moving the deferred delivery out into a separate kvm_xen_inject_pending_events() function which isn't ever called in atomic contexts as __kvm_xen_has_interrupt() is. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-6-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
Add a new kvm_setup_guest_pvclock() which parallels the existing kvm_setup_pvclock_page(). The latter will be removed once we convert all users to the gfn_to_pfn_cache version. Using the new cache, we can potentially let kvm_set_guest_paused() set the PVCLOCK_GUEST_STOPPED bit directly rather than having to delegate to the vCPU via KVM_REQ_CLOCK_UPDATE. But not yet. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-5-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220303154127.202856-4-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Oliver Upton 提交于
KVM handles the VMCALL/VMMCALL instructions very strangely. Even though both of these instructions really should #UD when executed on the wrong vendor's hardware (i.e. VMCALL on SVM, VMMCALL on VMX), KVM replaces the guest's instruction with the appropriate instruction for the vendor. Nonetheless, older guest kernels without commit c1118b36 ("x86: kvm: use alternatives for VMCALL vs. VMMCALL if kernel text is read-only") do not patch in the appropriate instruction using alternatives, likely motivating KVM's intervention. Add a quirk allowing userspace to opt out of hypercall patching. If the quirk is disabled, KVM synthesizes a #UD in the guest. Signed-off-by: NOliver Upton <oupton@google.com> Message-Id: <20220316005538.2282772-2-oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Add set/clear wrappers for toggling APICv inhibits to make the call sites more readable, and opportunistically rename the inner helpers to align with the new wrappers and to make them more readable as well. Invert the flag from "activate" to "set"; activate is painfully ambiguous as it's not obvious if the inhibit is being activated, or if APICv is being activated, in which case the inhibit is being deactivated. For the functions that take @set, swap the order of the inhibit reason and @set so that the call sites are visually similar to those that bounce through the wrapper. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Use an enum for the APICv inhibit reasons, there is no meaning behind their values and they most definitely are not "unsigned longs". Rename the various params to "reason" for consistency and clarity (inhibit may be confused as a command, i.e. inhibit APICv, instead of the reason that is getting toggled/checked). No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220311043517.17027-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
The third nybble of AMD's event select overlaps with Intel's IN_TX and IN_TXCP bits. Therefore, we can't use AMD64_RAW_EVENT_MASK on Intel platforms that support TSX. Declare a raw_event_mask in the kvm_pmu structure, initialize it in the vendor-specific pmu_refresh() functions, and use that mask for PERF_TYPE_RAW configurations in reprogram_gp_counter(). Fixes: 710c4765 ("KVM: x86/pmu: Use AMD64_RAW_EVENT_MASK for PERF_TYPE_RAW") Signed-off-by: NJim Mattson <jmattson@google.com> Message-Id: <20220308012452.3468611-1-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 21 3月, 2022 1 次提交
-
-
由 Oliver Upton 提交于
KVM_CAP_DISABLE_QUIRKS is irrevocably broken. The capability does not advertise the set of quirks which may be disabled to userspace, so it is impossible to predict the behavior of KVM. Worse yet, KVM_CAP_DISABLE_QUIRKS will tolerate any value for cap->args[0], meaning it fails to reject attempts to set invalid quirk bits. The only valid workaround for the quirky quirks API is to add a new CAP. Actually advertise the set of quirks that can be disabled to userspace so it can predict KVM's behavior. Reject values for cap->args[0] that contain invalid bits. Finally, add documentation for the new capability and describe the existing quirks. Signed-off-by: NOliver Upton <oupton@google.com> Message-Id: <20220301060351.442881-5-oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 3月, 2022 1 次提交
-
-
由 Paolo Bonzini 提交于
Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 3月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 2月, 2022 4 次提交
-
-
由 Paolo Bonzini 提交于
These functions only operate on a given MMU, of which there is more than one in a vCPU (we care about two, because the third does not have any roots and is only used to walk guest page tables). They do need a struct kvm in order to lock the mmu_lock, but they do not needed anything else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Dunn 提交于
Add a new capability, KVM_CAP_PMU_CAPABILITY, that takes a bitmask of settings/features to allow userspace to configure PMU virtualization on a per-VM basis. For now, support a single flag, KVM_PMU_CAP_DISABLE, to allow disabling PMU virtualization for a VM even when KVM is configured with enable_pmu=true a module level. To keep KVM simple, disallow changing VM's PMU configuration after vCPUs have been created. Signed-off-by: NDavid Dunn <daviddunn@google.com> Message-Id: <20220223225743.2703915-2-daviddunn@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Cast kvm_x86_ops.func to 'void *' when updating KVM static calls that are conditionally patched to __static_call_return0(). clang complains about using mismatching pointers in the ternary operator, which breaks the build when compiling with CONFIG_KVM_WERROR=y. >> arch/x86/include/asm/kvm-x86-ops.h:82:1: warning: pointer type mismatch ('bool (*)(struct kvm_vcpu *)' and 'void *') [-Wpointer-type-mismatch] Fixes: 5be2226f ("KVM: x86: allow defining return-0 static calls") Reported-by: NLike Xu <like.xu.linux@gmail.com> Reported-by: Nkernel test robot <lkp@intel.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NDavid Dunn <daviddunn@google.com> Reviewed-by: NNathan Chancellor <nathan@kernel.org> Tested-by: NNathan Chancellor <nathan@kernel.org> Message-Id: <20220223162355.3174907-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 2月, 2022 5 次提交
-
-
由 Sean Christopherson 提交于
Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
A few vendor callbacks are only used by VMX, but they return an integer or bool value. Introduce KVM_X86_OP_OPTIONAL_RET0 for them: if a func is NULL in struct kvm_x86_ops, it will be changed to __static_call_return0 when updating static calls. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Use the newly corrected KVM_X86_OP annotations to warn about possible NULL pointer dereferences as soon as the vendor module is loaded. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The original use of KVM_X86_OP_NULL, which was to mark calls that do not follow a specific naming convention, is not in use anymore. Instead, let's mark calls that are optional because they are always invoked within conditionals or with static_call_cond. Those that are _not_, i.e. those that are defined with KVM_X86_OP, must be defined by both vendor modules or some kind of NULL pointer dereference is bound to happen at runtime. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The two ioctls used to implement userspace-accelerated TPR, KVM_TPR_ACCESS_REPORTING and KVM_SET_VAPIC_ADDR, are available even if hardware-accelerated TPR can be used. So there is no reason not to report KVM_CAP_VAPIC. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 17 2月, 2022 1 次提交
-
-
由 Leonardo Bras 提交于
kvm_vcpu_arch currently contains the guest supported features in both guest_supported_xcr0 and guest_fpu.fpstate->user_xfeatures field. Currently both fields are set to the same value in kvm_vcpu_after_set_cpuid() and are not changed anywhere else after that. Since it's not good to keep duplicated data, remove guest_supported_xcr0. To keep the code more readable, introduce kvm_guest_supported_xcr() and kvm_guest_supported_xfd() to replace the previous usages of guest_supported_xcr0. Signed-off-by: NLeonardo Bras <leobras@redhat.com> Message-Id: <20220217053028.96432-3-leobras@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 11 2月, 2022 6 次提交
-
-
由 David Matlack 提交于
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Use slightly more verbose names for the so called "memory encrypt", a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current super short kvm_x86_ops names and SVM's more verbose, but non-conforming names. This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP() to fill svm_x86_ops. Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better reflect its true nature, as it really is a full fledged ioctl() of its own. Ideally, the hook would be named confidential_vm_ioctl() or so, as the ioctl() is a gateway to more than just memory encryption, and because its underlying purpose to support Confidential VMs, which can be provided without memory encryption, e.g. if the TCB of the guest includes the host kernel but not host userspace, or by isolation in hardware without encrypting memory. But, diverging from KVM_MEMORY_ENCRYPT_OP even further is undeseriable, and short of creating alises for all related ioctl()s, which introduces a different flavor of divergence, KVM is stuck with the nomenclature. Defer renaming SVM's functions to a future commit as there are additional changes needed to make SVM fully conforming and to match reality (looking at you, svm_vm_copy_asid_from()). No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-20-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a superfluous export from KVM x86. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-16-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Rename a variety of kvm_x86_op function pointers so that preferred name for vendor implementations follows the pattern <vendor>_<function>, e.g. rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will allow vendor implementations to be wired up via the KVM_X86_OP macro. In many cases, VMX and SVM "disagree" on the preferred name, though in reality it's VMX and x86 that disagree as SVM blindly prepended _svm to the kvm_x86_ops name. Justification for using the VMX nomenclature: - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an event that has already been "set" in e.g. the vIRR. SVM's relevant VMCB field is even named event_inj, and KVM's stat is irq_injections. - prepare_guest_switch => prepare_switch_to_guest because the former is ambiguous, e.g. it could mean switching between multiple guests, switching from the guest to host, etc... - update_pi_irte => pi_update_irte to allow for matching match the rest of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>(). - start_assignment => pi_start_assignment to again follow VMX's posted interrupt naming scheme, and to provide context for what bit of code might care about an otherwise undescribed "assignment". The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave that change for another time as the Hyper-V hooks always start as NULL, i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all names requires an astounding amount of churn. VMX and SVM function names are intentionally left as is to minimize the diff. Both VMX and SVM will need to rename even more functions in order to fully utilize KVM_X86_OPS, i.e. an additional patch for each is inevitable. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jinrong Liang 提交于
The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used, so remove it. No functional change intended. Signed-off-by: NJinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-18-cloudliang@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 2月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Handle non-APICv interrupt delivery in vendor code, even though it means VMX and SVM will temporarily have duplicate code. SVM's AVIC has a race condition that requires KVM to fall back to legacy interrupt injection _after_ the interrupt has been logged in the vIRR, i.e. to fix the race, SVM will need to open code the full flow anyways[*]. Refactor the code so that the SVM bug without introducing other issues, e.g. SVM would return "success" and thus invoke trace_kvm_apicv_accept_irq() even when delivery through the AVIC failed, and to opportunistically prepare for using KVM_X86_OP to fill each vendor's kvm_x86_ops struct, which will rely on the vendor function matching the kvm_x86_op pointer name. No functional change intended. [*] https://lore.kernel.org/all/20211213104634.199141-4-mlevitsk@redhat.comSigned-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 27 1月, 2022 2 次提交
-
-
由 Sean Christopherson 提交于
Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: stable@vger.kernel.org Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220125220358.2091737-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that a future commit can harden KVM's SEV support to WARN on emulation scenarios that should never happen. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NLiam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-6-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 1月, 2022 1 次提交
-
-
由 Quanfa Fu 提交于
Make kvm_vcpu_reload_apic_access_page() static as it is no longer invoked directly by vmx and it is also no longer exported. No functional change intended. Signed-off-by: NQuanfa Fu <quanfafu@gmail.com> Message-Id: <20211219091446.174584-1-quanfafu@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 1月, 2022 2 次提交
-
-
由 Sean Christopherson 提交于
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-10-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Reject KVM_RUN if emulation is required (because VMX is running without unrestricted guest) and an exception is pending, as KVM doesn't support emulating exceptions except when emulating real mode via vm86. The vCPU is hosed either way, but letting KVM_RUN proceed triggers a WARN due to the impossible condition. Alternatively, the WARN could be removed, but then userspace and/or KVM bugs would result in the vCPU silently running in a bad state, which isn't very friendly to users. Originally, the bug was hit by syzkaller with a nested guest as that doesn't require kvm_intel.unrestricted_guest=0. That particular flavor is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to trigger the WARN with a non-nested guest, and userspace can likely force bad state via ioctls() for a nested guest as well. Checking for the impossible condition needs to be deferred until KVM_RUN because KVM can't force specific ordering between ioctls. E.g. clearing exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with emulation_required would prevent userspace from queuing an exception and then stuffing sregs. Note, if KVM were to try and detect/prevent the condition prior to KVM_RUN, handle_invalid_guest_state() and/or handle_emulation_failure() would need to be modified to clear the pending exception prior to exiting to userspace. ------------[ cut here ]------------ WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel] CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel] Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202 RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000 RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006 R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000 FS: 00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0 Call Trace: kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm] kvm_vcpu_ioctl+0x279/0x690 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211228232437.1875318-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 15 1月, 2022 1 次提交
-
-
由 Kevin Tian 提交于
Always intercepting IA32_XFD causes non-negligible overhead when this register is updated frequently in the guest. Disable r/w emulation after intercepting the first WRMSR(IA32_XFD) with a non-zero value. Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync with the software states in fpstate and the per-cpu xfd cache. This leads to two additional changes accordingly: - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring software states back in-sync with the MSR, before handle_exit_irqoff() is called. - Always trap #NM once write interception is disabled for IA32_XFD. The #NM exception is rare if the guest doesn't use dynamic features. Otherwise, there is at most one exception per guest task given a dynamic feature. p.s. We have confirmed that SDM is being revised to say that when setting IA32_XFD[18] the AMX register state is not guaranteed to be preserved. This clarification avoids adding mess for a creative guest which sets IA32_XFD[18]=1 before saving active AMX state to its own storage. Signed-off-by: NKevin Tian <kevin.tian@intel.com> Signed-off-by: NJing Liu <jing2.liu@intel.com> Signed-off-by: NYang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-22-yang.zhong@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 1月, 2022 1 次提交
-
-
由 Michael Roth 提交于
Normally guests will set up CR3 themselves, but some guests, such as kselftests, and potentially CONFIG_PVH guests, rely on being booted with paging enabled and CR3 initialized to a pre-allocated page table. Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this is too late, since it will have switched over to using the VMSA page prior to that point, with the VMSA CR3 copied from the VMCB initial CR3 value: 0. Address this by sync'ing the CR3 value into the VMCB save area immediately when KVM_SET_SREGS* is issued so it will find it's way into the initial VMSA. Suggested-by: NTom Lendacky <thomas.lendacky@amd.com> Signed-off-by: NMichael Roth <michael.roth@amd.com> Message-Id: <20211216171358.61140-10-michael.roth@amd.com> [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the new hook. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-