- 12 9月, 2020 2 次提交
-
-
由 Wanpeng Li 提交于
According to SDM 27.2.4, Event delivery causes an APIC-access VM exit. Don't report internal error and freeze guest when event delivery causes an APIC-access exit, it is handleable and the event will be re-injected during the next vmentry. Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Message-Id: <1597827327-25055-2-git-send-email-wanpengli@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Shier 提交于
When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call load_pdptrs to read the possibly updated PDPTEs from the guest physical address referenced by CR3. It loads them into vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. At the subsequent assumed reentry into L2, the mmu will call vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works if the L2 CRn write intercept always resumes L2. The resume path calls vmx_check_nested_events which checks for exceptions, MTF, and expired VMX preemption timers. If vmx_check_nested_events finds any of these conditions pending it will reflect the corresponding exit into L1. Live migration at this point would also cause a missed immediate reentry into L2. After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty. When L2 next resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale values stored at last L2 exit. A repro of this bug showed L2 entering triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values. When L2 is in PAE paging mode add a call to ept_load_pdptrs before leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in vcpu->arch.walk_mmu->pdptrs[]. Tested: kvm-unit-tests with new directed test: vmx_mtf_pdpte_test. Verified that test fails without the fix. Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix would repro readily. Ran 14 simultaneous L2s for 140 iterations with no failures. Signed-off-by: NPeter Shier <pshier@google.com> Reviewed-by: NJim Mattson <jmattson@google.com> Message-Id: <20200820230545.2411347-1-pshier@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 31 7月, 2020 5 次提交
-
-
由 Sean Christopherson 提交于
Capture the max TDP level during kvm_configure_mmu() instead of using a kvm_x86_ops hook to do it at every vCPU creation. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-10-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Calculate the desired TDP level on the fly using the max TDP level and MAXPHYADDR instead of doing the same when CPUID is updated. This avoids the hidden dependency on cpuid_maxphyaddr() in vmx_get_tdp_level() and also standardizes the "use 5-level paging iff MAXPHYADDR > 48" behavior across x86. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-8-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Remove the WARN in vmx_load_mmu_pgd() that was temporarily added to aid bisection/debug in the event the current MMU's shadow root level didn't match VMX's computed EPTP level. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-7-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Use the shadow_root_level from the current MMU as the root level for the PGD, i.e. for VMX's EPTP. This eliminates the weird dependency between VMX and the MMU where both must independently calculate the same root level for things to work correctly. Temporarily keep VMX's calculation of the level and use it to WARN if the incoming level diverges. Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell, and rename a 'cr3' param in the load_mmu_pgd prototype that managed to survive the cr3 purge. No functional change intended. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Make vmx_load_mmu_pgd() static as it is no longer invoked directly by nested VMX (or any code for that matter). No functional change intended. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200716034122.5998-5-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 24 7月, 2020 1 次提交
-
-
由 Thomas Gleixner 提交于
Use the generic infrastructure to check for and handle pending work before transitioning into guest mode. This now handles TIF_NOTIFY_RESUME as well which was ignored so far. Handling it is important as this covers task work and task work will be used to offload the heavy lifting of POSIX CPU timers to thread context. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200722220520.979724969@linutronix.de
-
- 11 7月, 2020 5 次提交
-
-
由 Mohammed Gamal 提交于
This patch adds a new capability KVM_CAP_SMALLER_MAXPHYADDR which allows userspace to query if the underlying architecture would support GUEST_MAXPHYADDR < HOST_MAXPHYADDR and hence act accordingly (e.g. qemu can decide if it should warn for -cpu ..,phys-bits=X) The complications in this patch are due to unexpected (but documented) behaviour we see with NPF vmexit handling in AMD processor. If SVM is modified to add guest physical address checks in the NPF and guest #PF paths, we see the followning error multiple times in the 'access' test in kvm-unit-tests: test pte.p pte.36 pde.p: FAIL: pte 2000021 expected 2000001 Dump mapping: address: 0x123400000000 ------L4: 24c3027 ------L3: 24c4027 ------L2: 24c5021 ------L1: 1002000021 This is because the PTE's accessed bit is set by the CPU hardware before the NPF vmexit. This is handled completely by hardware and cannot be fixed in software. Therefore, availability of the new capability depends on a boolean variable allow_smaller_maxphyaddr which is set individually by VMX and SVM init routines. On VMX it's always set to true, on SVM it's only set to true when NPT is not enabled. CC: Tom Lendacky <thomas.lendacky@amd.com> CC: Babu Moger <babu.moger@amd.com> Signed-off-by: NMohammed Gamal <mgamal@redhat.com> Message-Id: <20200710154811.418214-10-mgamal@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Ignore non-present page faults, since those cannot have reserved bits set. When running access.flat with "-cpu Haswell,phys-bits=36", the number of trapped page faults goes down from 8872644 to 3978948. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-9-mgamal@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Mohammed Gamal 提交于
Check guest physical address against its maximum, which depends on the guest MAXPHYADDR. If the guest's physical address exceeds the maximum (i.e. has reserved bits set), inject a guest page fault with PFERR_RSVD_MASK set. This has to be done both in the EPT violation and page fault paths, as there are complications in both cases with respect to the computation of the correct error code. For EPT violations, unfortunately the only possibility is to emulate, because the access type in the exit qualification might refer to an access to a paging structure, rather than to the access performed by the program. Trapping page faults instead is needed in order to correct the error code, but the access type can be obtained from the original error code and passed to gva_to_gpa. The corrections required in the error code are subtle. For example, imagine that a PTE for a supervisor page has a reserved bit set. On a supervisor-mode access, the EPT violation path would trigger. However, on a user-mode access, the processor will not notice the reserved bit and not include PFERR_RSVD_MASK in the error code. Co-developed-by: NMohammed Gamal <mgamal@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-8-mgamal@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200710154811.418214-7-mgamal@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
We would like to introduce a callback to update the #PF intercept when CPUID changes. Just reuse update_bp_intercept since VMX is already using update_exception_bitmap instead of a bespoke function. While at it, remove an unnecessary assignment in the SVM version, which is already done in the caller (kvm_arch_vcpu_ioctl_set_guest_debug) and has nothing to do with the exception bitmap. Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 09 7月, 2020 13 次提交
-
-
由 Thomas Gleixner 提交于
read/write_cr2() go throuh the paravirt XXL indirection, but nested VMX in a XEN_PV guest is not supported. Use the native variants. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Message-Id: <20200708195322.344731916@linutronix.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Thomas Gleixner 提交于
Move the functions which are inside the RCU off region into the non-instrumentable text section. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAlexandre Chartre <alexandre.chartre@oracle.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195322.037311579@linutronix.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Thomas Gleixner 提交于
Entering guest mode is more or less the same as returning to user space. From an instrumentation point of view both leave kernel mode and the transition to guest or user mode reenables interrupts on the host. In user mode an interrupt is served directly and in guest mode it causes a VM exit which then handles or reinjects the interrupt. The transition from guest mode or user mode to kernel mode disables interrupts, which needs to be recorded in instrumentation to set the correct state again. This is important for e.g. latency analysis because otherwise the execution time in guest or user mode would be wrongly accounted as interrupt disabled and could trigger false positives. Add hardirq tracing to guest enter/exit functions in the same way as it is done in the user mode enter/exit code, respecting the RCU requirements. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAlexandre Chartre <alexandre.chartre@oracle.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.822002354@linutronix.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Thomas Gleixner 提交于
Context tracking for KVM happens way too early in the vcpu_run() code. Anything after guest_enter_irqoff() and before guest_exit_irqoff() cannot use RCU and should also be not instrumented. The current way of doing this covers way too much code. Move it closer to the actual vmenter/exit code. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAlexandre Chartre <alexandre.chartre@oracle.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20200708195321.724574345@linutronix.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
To avoid complex and in some cases incorrect logic in kvm_spec_ctrl_test_value, just try the guest's given value on the host processor instead, and if it doesn't #GP, allow the guest to set it. One such case is when host CPU supports STIBP mitigation but doesn't support IBRS (as is the case with some Zen2 AMD cpus), and in this case we were giving guest #GP when it tried to use STIBP The reason why can can do the host test is that IA32_SPEC_CTRL msr is passed to the guest, after the guest sets it to a non zero value for the first time (due to performance reasons), and as as result of this, it is pointless to emulate #GP condition on this first access, in a different way than what the host CPU does. This is based on a patch from Sean Christopherson, who suggested this idea. Fixes: 6441fa61 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL") Cc: stable@vger.kernel.org Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20200708115731.180097-1-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
The name of callback cpuid_update() is misleading that it's not about updating CPUID settings of vcpu but updating the configurations of vcpu based on the CPUIDs. So rename it to vcpu_after_set_cpuid(). Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20200709043426.92712-5-xiaoyao.li@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
Since this field is now in kvm_vcpu_arch, clean things up a little by setting it in vendor-agnostic code: vcpu_enter_guest. Note that it must be set after the call to kvm_x86_ops.run(), since it can't be updated before pre_sev_run(). Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Message-Id: <20200603235623.245638-7-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
Both the vcpu_vmx structure and the vcpu_svm structure have a 'last_cpu' field. Move the common field into the kvm_vcpu_arch structure. For clarity, rename it to 'last_vmentry_cpu.' Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Message-Id: <20200603235623.245638-6-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
More often than not, a failed VM-entry in an x86 production environment is induced by a defective CPU. To help identify the bad hardware, include the id of the last logical CPU to run a vCPU in the information provided to userspace on a KVM exit for failed VM-entry or for KVM internal errors not associated with emulation. The presence of this additional information is indicated by a new capability, KVM_CAP_LAST_CPU. Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Message-Id: <20200603235623.245638-5-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
As we already do in svm, record the last logical processor on which a vCPU has run, so that it can be communicated to userspace for potential hardware errors. Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Message-Id: <20200603235623.245638-4-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Xu 提交于
Logically the ignore_msrs and report_ignored_msrs should also apply to feature MSRs. Add them in. Signed-off-by: NPeter Xu <peterx@redhat.com> Message-Id: <20200622220442.21998-3-peterx@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move .write_log_dirty() into kvm_x86_nested_ops to help differentiate it from the non-nested dirty log hooks. And because it's a nested-only operation. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-5-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
WARN if vmx_write_pml_buffer() is called outside of guest mode instead of silently ignoring the condition. The only caller is nested EPT's ept_update_accessed_dirty_bits(), which should only be reachable when L2 is active. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-4-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 04 7月, 2020 2 次提交
-
-
由 Sean Christopherson 提交于
Use the "common" KVM_POSSIBLE_CR*_GUEST_BITS defines to initialize the CR0/CR4 guest host masks instead of duplicating most of the CR4 mask and open coding the CR0 mask. SVM doesn't utilize the masks, i.e. the masks are effectively VMX specific even if they're not named as such. This avoids duplicate code, better documents the guest owned CR0 bit, and eliminates the need for a build-time assertion to keep VMX and x86 synchronized. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-3-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Mark CR4.TSD as being possibly owned by the guest as that is indeed the case on VMX. Without TSD being tagged as possibly owned by the guest, a targeted read of CR4 to get TSD could observe a stale value. This bug is benign in the current code base as the sole consumer of TSD is the emulator (for RDTSC) and the emulator always "reads" the entirety of CR4 when grabbing bits. Add a build-time assertion in to ensure VMX doesn't hand over more CR4 bits without also updating x86. Fixes: 52ce3c21 ("x86,kvm,vmx: Don't trap writes to CR4.TSD") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200703040422.31536-2-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 23 6月, 2020 2 次提交
-
-
由 Sean Christopherson 提交于
Remove support for context switching between the guest's and host's desired UMWAIT_CONTROL. Propagating the guest's value to hardware isn't required for correct functionality, e.g. KVM intercepts reads and writes to the MSR, and the latency effects of the settings controlled by the MSR are not architecturally visible. As a general rule, KVM should not allow the guest to control power management settings unless explicitly enabled by userspace, e.g. see KVM_CAP_X86_DISABLE_EXITS. E.g. Intel's SDM explicitly states that C0.2 can improve the performance of SMT siblings. A devious guest could disable C0.2 so as to improve the performance of their workloads at the detriment to workloads running in the host or on other VMs. Wholesale removal of UMWAIT_CONTROL context switching also fixes a race condition where updates from the host may cause KVM to enter the guest with the incorrect value. Because updates are are propagated to all CPUs via IPI (SMP function callback), the value in hardware may be stale with respect to the cached value and KVM could enter the guest with the wrong value in hardware. As above, the guest can't observe the bad value, but it's a weird and confusing wart in the implementation. Removal also fixes the unnecessary usage of VMX's atomic load/store MSR lists. Using the lists is only necessary for MSRs that are required for correct functionality immediately upon VM-Enter/VM-Exit, e.g. EFER on old hardware, or for MSRs that need to-the-uop precision, e.g. perf related MSRs. For UMWAIT_CONTROL, the effects are only visible in the kernel via TPAUSE/delay(), and KVM doesn't do any form of delay in vcpu_vmx_run(). Using the atomic lists is undesirable as they are more expensive than direct RDMSR/WRMSR. Furthermore, even if giving the guest control of the MSR is legitimate, e.g. in pass-through scenarios, it's not clear that the benefits would outweigh the overhead. E.g. saving and restoring an MSR across a VMX roundtrip costs ~250 cycles, and if the guest diverged from the host that cost would be paid on every run of the guest. In other words, if there is a legitimate use case then it should be enabled by a new per-VM capability. Note, KVM still needs to emulate MSR_IA32_UMWAIT_CONTROL so that it can correctly expose other WAITPKG features to the guest, e.g. TPAUSE, UMWAIT and UMONITOR. Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL") Cc: stable@vger.kernel.org Cc: Jingqi Liu <jingqi.liu@intel.com> Cc: Tao Xu <tao3.xu@intel.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200623005135.10414-1-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Explicitly pass the L2 GPA to kvm_arch_write_log_dirty(), which for all intents and purposes is vmx_write_pml_buffer(), instead of having the latter pull the GPA from vmcs.GUEST_PHYSICAL_ADDRESS. If the dirty bit update is the result of KVM emulation (rare for L2), then the GPA in the VMCS may be stale and/or hold a completely unrelated GPA. Fixes: c5f983f6 ("nVMX: Implement emulated Page Modification Logging") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200622215832.22090-2-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 6月, 2020 1 次提交
-
-
由 Vitaly Kuznetsov 提交于
Guest crashes are observed on a Cascade Lake system when 'perf top' is launched on the host, e.g. BUG: unable to handle kernel paging request at fffffe0000073038 PGD 7ffa7067 P4D 7ffa7067 PUD 7ffa6067 PMD 7ffa5067 PTE ffffffffff120 Oops: 0000 [#1] SMP PTI CPU: 1 PID: 1 Comm: systemd Not tainted 4.18.0+ #380 ... Call Trace: serial8250_console_write+0xfe/0x1f0 call_console_drivers.constprop.0+0x9d/0x120 console_unlock+0x1ea/0x460 Call traces are different but the crash is imminent. The problem was blindly bisected to the commit 041bc42c ("KVM: VMX: Micro-optimize vmexit time when not exposing PMU"). It was also confirmed that the issue goes away if PMU is exposed to the guest. With some instrumentation of the guest we can see what is being switched (when we do atomic_switch_perf_msrs()): vmx_vcpu_run: switching 2 msrs vmx_vcpu_run: switching MSR38f guest: 70000000d host: 70000000f vmx_vcpu_run: switching MSR3f1 guest: 0 host: 2 The current guess is that PEBS (MSR_IA32_PEBS_ENABLE, 0x3f1) is to blame. Regardless of whether PMU is exposed to the guest or not, PEBS needs to be disabled upon switch. This reverts commit 041bc42c. Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200619094046.654019-1-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 18 6月, 2020 1 次提交
-
-
由 Thomas Gleixner 提交于
save_fsgs_for_kvm() is invoked via vcpu_enter_guest() kvm_x86_ops.prepare_guest_switch(vcpu) vmx_prepare_switch_to_guest() save_fsgs_for_kvm() with preemption disabled, but interrupts enabled. The upcoming FSGSBASE based GS safe needs interrupts to be disabled. This could be done in the helper function, but that function is also called from switch_to() which has interrupts disabled already. Disable interrupts inside save_fsgs_for_kvm() and rename the function to current_save_fsgs() so it can be invoked from other places. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20200528201402.1708239-7-sashal@kernel.org
-
- 11 6月, 2020 1 次提交
-
-
由 Thomas Gleixner 提交于
Convert #MC to IDTENTRY_MCE: - Implement the C entry points with DEFINE_IDTENTRY_MCE - Emit the ASM stub with DECLARE_IDTENTRY_MCE - Remove the ASM idtentry in 64bit - Remove the open coded ASM entry code in 32bit - Fixup the XEN/PV code - Remove the old prototypes - Remove the error code from *machine_check_vector() as it is always 0 and not used by any of the functions it can point to. Fixup all the functions as well. No functional change. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NAlexandre Chartre <alexandre.chartre@oracle.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Acked-by: NAndy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/20200505135314.334980426@linutronix.de
-
- 08 6月, 2020 1 次提交
-
-
由 Vitaly Kuznetsov 提交于
Syzbot reports the following issue: WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618 kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... Call Trace: ... RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618 ... nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638 handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline] handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728 vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067 'exception' we're trying to inject with kvm_inject_emulated_page_fault() comes from: nested_vmx_get_vmptr() kvm_read_guest_virt() kvm_read_guest_virt_helper() vcpu->arch.walk_mmu->gva_to_gpa() but it is only set when GVA to GPA conversion fails. In case it doesn't but we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed 'exception'. This happen when the argument is MMIO. Paolo also noticed that nested_vmx_get_vmptr() is not the only place in KVM code where kvm_read/write_guest_virt*() return result is mishandled. VMX instructions along with INVPCID have the same issue. This was already noticed before, e.g. see commit 541ab2ae ("KVM: x86: work around leak of uninitialized stack contents") but was never fully fixed. KVM could've handled the request correctly by going to userspace and performing I/O but there doesn't seem to be a good need for such requests in the first place. Introduce vmx_handle_memory_failure() as an interim solution. Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF, KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem to have a good enum describing this tristate, just add "int *ret" to nested_vmx_get_vmptr() interface to pass the information. Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200605115906.532682-1-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 05 6月, 2020 1 次提交
-
-
由 Babu Moger 提交于
Both Intel and AMD support (MPK) Memory Protection Key feature. Move the feature detection from VMX to the common code. It should work for both the platforms now. Signed-off-by: NBabu Moger <babu.moger@amd.com> Message-Id: <158932795627.44260.15144185478040178638.stgit@naples-babu.amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 6月, 2020 2 次提交
-
-
由 Like Xu 提交于
Intel CPUs have a new alternative MSR range (starting from MSR_IA32_PMC0) for GP counters that allows writing the full counter width. Enable this range from a new capability bit (IA32_PERF_CAPABILITIES.FW_WRITE[bit 13]). The guest would query CPUID to get the counter width, and sign extends the counter values as needed. The traditional MSRs always limit to 32bit, even though the counter internally is larger (48 or 57 bits). When the new capability is set, use the alternative range which do not have these restrictions. This lowers the overhead of perf stat slightly because it has to do less interrupts to accumulate the counter value. Signed-off-by: NLike Xu <like.xu@linux.intel.com> Message-Id: <20200529074347.124619-3-like.xu@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
Currently, APF mechanism relies on the #PF abuse where the token is being passed through CR2. If we switch to using interrupts to deliver page-ready notifications we need a different way to pass the data. Extent the existing 'struct kvm_vcpu_pv_apf_data' with token information for page-ready notifications. While on it, rename 'reason' to 'flags'. This doesn't change the semantics as we only have reasons '1' and '2' and these can be treated as bit flags but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery making 'reason' name misleading. The newly introduced apf_put_user_ready() temporary puts both flags and token information, this will be changed to put token only when we switch to interrupt based notifications. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20200525144125.143875-3-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 28 5月, 2020 3 次提交
-
-
由 Paolo Bonzini 提交于
vmx_load_mmu_pgd is delaying the write of GUEST_CR3 to prepare_vmcs02 as an optimization, but this is only correct before the nested vmentry. If userspace is modifying CR3 with KVM_SET_SREGS after the VM has already been put in guest mode, the value of CR3 will not be updated. Remove the optimization, which almost never triggers anyway. Fixes: 04f11ef4 ("KVM: nVMX: Always write vmcs02.GUEST_CR3 during nested VM-Enter") Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
In case an interrupt arrives after nested.check_events but before the call to kvm_cpu_has_injectable_intr, we could end up enabling the interrupt window even if the interrupt is actually going to be a vmexit. This is useless rather than harmful, but it really complicates reasoning about SVM's handling of the VINTR intercept. We'd like to never bother with the VINTR intercept if V_INTR_MASKING=1 && INTERCEPT_INTR=1, because in that case there is no interrupt window and we can just exit the nested guest whenever we want. This patch moves the opening of the interrupt window inside inject_pending_event. This consolidates the check for pending interrupt/NMI/SMI in one place, and makes KVM's usage of immediate exits more consistent, extending it beyond just nested virtualization. There are two functional changes here. They only affect corner cases, but overall they simplify the inject_pending_event. - re-injection of still-pending events will also use req_immediate_exit instead of using interrupt-window intercepts. This should have no impact on performance on Intel since it simply replaces an interrupt-window or NMI-window exit for a preemption-timer exit. On AMD, which has no equivalent of the preemption time, it may incur some overhead but an actual effect on performance should only be visible in pathological cases. - kvm_arch_interrupt_allowed and kvm_vcpu_has_events will return true if an interrupt, NMI or SMI is blocked by nested_run_pending. This makes sense because entering the VM will allow it to make progress and deliver the event. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Miaohe Lin 提交于
The second "/* fall through */" in rmode_exception() makes code harder to read. Replace it with "return" to indicate they are different cases, only the #DB and #BP check vcpu->guest_debug, while others don't care. And this also improves the readability. Suggested-by: NVitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Message-Id: <1582080348-20827-1-git-send-email-linmiaohe@huawei.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-