- 14 Sep 2019, 1 commit
-
-
By Paolo Bonzini

The implementation of vmread to memory is still incomplete, as it lacks the ability to do vmread to I/O memory just like vmptrst.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
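A plausible sketch of the memory-operand path this entry refers to, assuming the handler has already decoded the destination into gva and computed the field value and access length (the two helpers are existing KVM functions; the surrounding locals are placeholders):

    /* Sketch: finish a vmread to memory, injecting #PF into the
     * guest if writing the destination faults. */
    struct x86_exception e;

    if (kvm_write_guest_virt_system(vcpu, gva, &value, len, &e)) {
            kvm_inject_page_fault(vcpu, &e);
            return 1;
    }
    return nested_vmx_succeed(vcpu);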
-
- 12 Sep 2019, 1 commit
-
-
By Liran Alon

Commit cd7764fe ("KVM: x86: latch INITs while in system management mode") changed the code to latch INIT while the vCPU is in SMM and to process the latched INIT when leaving SMM. It left a subtle remark in the commit message that similar treatment should also be done while the vCPU is in VMX non-root mode. However, INIT signals should actually be latched in various vCPU states:

(*) For both Intel and AMD, INIT signals should be latched while the vCPU is in SMM.
(*) For Intel, INIT should also be latched while the vCPU is in VMX operation, and later processed when the vCPU leaves VMX operation by executing VMXOFF.
(*) For AMD, INIT should also be latched while the vCPU runs with GIF=0, or in guest-mode with an intercept defined on the INIT signal.

To fix this:

1) Add kvm_x86_ops->apic_init_signal_blocked() so that each CPU vendor can define the various CPU states in which INIT signals should be blocked, and modify kvm_apic_accept_events() to use it.
2) Modify vmx_check_nested_events() to check for a pending INIT signal while the vCPU is in guest-mode. If so, emulate a vmexit on EXIT_REASON_INIT_SIGNAL.

nSVM should have similar behaviour, but this is currently left as a TODO comment because nSVM doesn't yet implement svm_check_nested_events().

Note: the current KVM nVMX implementation doesn't support the VMX wait-for-SIPI activity state as specified in MSR_IA32_VMX_MISC bits 6:8 exposed to the guest (see nested_vmx_setup_ctls_msrs()). If and when support for this activity state is implemented, kvm_check_nested_events() would need to avoid emulating a vmexit on an INIT signal when the activity state is wait-for-SIPI. In addition, kvm_apic_accept_events() would need to be modified to avoid discarding a SIPI when the VMX activity state is wait-for-SIPI, and instead delay SIPI processing to vmx_check_nested_events(), which would clear pending APIC events and emulate a vmexit on SIPI.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Co-developed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
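A minimal sketch of the hook, assuming the names this entry introduces (the VMX side is a one-liner; the latch in kvm_apic_accept_events() drops SIPIs while INIT is blocked, matching the SDM):

    /* Sketch: VMX blocks INIT for the entire time the vCPU is in VMX
     * operation; a pending INIT in guest-mode is instead handled by
     * vmx_check_nested_events() as an EXIT_REASON_INIT_SIGNAL vmexit. */
    static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
    {
            return to_vmx(vcpu)->nested.vmxon;
    }

    /* In kvm_apic_accept_events(): latch instead of delivering. */
    if (is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu)) {
            if (test_bit(KVM_APIC_SIPI, &apic->pending_events))
                    clear_bit(KVM_APIC_SIPI, &apic->pending_events);
            return;
    }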
-
- 11 Sep 2019, 3 commits
-
-
By Sean Christopherson

Use the recently added tracepoint for logging nested VM-Enter failures instead of spamming the kernel log when hardware detects a consistency check failure. Take the opportunity to print the name of the error code instead of dumping the raw hex number, but limit the symbol table to error codes that can reasonably be encountered by KVM. Add an equivalent tracepoint in nested_vmx_check_vmentry_hw(), e.g. so that tracing of "invalid control field" errors isn't suppressed when nested early checks are enabled.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Debugging a failed VM-Enter is often like searching for a needle in a haystack, e.g. there are over 80 consistency checks that funnel into the "invalid control field" error code. One way to expedite debug is to run the buggy code as an L1 guest under KVM (and pray that the failing check is detected by KVM). However, extracting useful debug information out of L0 KVM requires attaching a debugger to KVM and/or modifying the source, e.g. to log which check is failing.

Make life a little less painful for VMM developers and add a tracepoint for failed VM-Enter consistency checks. Ideally the tracepoint would capture both what check failed and precisely why it failed, but logging why a check failed is difficult to do in a generic tracepoint without resorting to invasive techniques, e.g. generating a custom string on failure. That being said, for the vast majority of VM-Enter failures the most difficult step is figuring out exactly what to look at, e.g. figuring out which bit was incorrectly set in a control field is usually not too painful once the guilty field has been identified.

To reach a happy medium between precision and ease of use, simply log the code that detected a failed check, using a macro to execute the check and log the trace event on failure. This approach enables tracing arbitrary code, e.g. it's not limited to function calls or specific formats of checks, and the changes to the existing code are minimally invasive. A macro with a two-character name is desirable as usage of the macro doesn't result in overly long lines or confusing alignment, while still retaining some amount of readability. I.e. a one-character name is a little too terse, and a three-character name results in the contents being passed to the macro aligning with an indented line when the macro is used in an if-statement, e.g.:

    if (VCC(nested_vmx_check_long_line_one(...) &&
            nested_vmx_check_long_line_two(...)))
            return -EINVAL;

And that is the story of how the CC(), a.k.a. Consistency Check, macro got its name.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
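The macro itself is tiny; a sketch of its shape, assuming the tracepoint name added by the companion entry above:

    /* Sketch: evaluate the consistency check, trace the stringified
     * expression on failure, and yield the result so CC() can be used
     * directly inside if-statements. */
    #define CC(consistency_check)                                           \
    ({                                                                      \
            bool failed = (consistency_check);                              \
            if (failed)                                                     \
                    trace_kvm_nested_vmenter_failed(#consistency_check, 0); \
            failed;                                                         \
    })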
-
By Sean Christopherson

Refactor the top-level MSR accessors to take/return the index and value directly instead of requiring the caller to dump them into a msr_data struct. No functional change intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
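Roughly, the refactor replaces the msr_data plumbing with plain parameters; a sketch of the before/after signatures (illustrative, not the full patch):

    /* Before: callers had to populate a struct msr_data. */
    int kvm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);

    /* After: the index and value are passed directly. */
    int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
    int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);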
-
- 22 Jul 2019, 2 commits
-
-
By Jan Kiszka

This should help find use-after-free bugs earlier.

Suggested-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Jan Kiszka

Letting this pend may cause nested_get_vmcs12_pages to run against an invalid state, corrupting the effective vmcs of L1. This was triggerable in QEMU after a guest corruption in L2, followed by an L1 reset.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 7f7f1ba3 ("KVM: x86: do not load vmcs12 pages while still in SMM")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 20 Jul 2019, 1 commit
-
-
By Paolo Bonzini

If a KVM guest is reset while running a nested guest, free_nested will disable the shadow VMCS execution control in the vmcs01. However, on the next KVM_RUN vmx_vcpu_run would nevertheless try to sync the VMCS12 to the shadow VMCS which has since been freed. This causes a vmptrld of a NULL pointer on my machine, but Jan reports the host to hang altogether. Let's see how much this trivial patch fixes.

Reported-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 16 Jul 2019, 1 commit
-
-
By Liran Alon

As reported by Maxime at https://bugzilla.kernel.org/show_bug.cgi?id=204175: in vmx/nested.c::get_vmx_mem_address(), when the guest runs in long mode, the base address of the memory operand is computed with a simple:

    *ret = s.base + off;

This is incorrect: the base applies only to FS and GS, not to the others. Because of that, if the guest uses a VMX instruction based on DS and has a non-zero DS.base, KVM wrongfully adds the base to the resulting address.

Reported-by: Maxime Villard <max@m00nbsd.net>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
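A sketch of the corrected computation (segment indices follow KVM's VCPU_SREG_* naming; in 64-bit mode only FS.base and GS.base participate in address calculation):

    if (is_long_mode(vcpu)) {
            /* All segment bases are ignored in 64-bit mode except
             * FS.base and GS.base. */
            if (seg_reg == VCPU_SREG_FS || seg_reg == VCPU_SREG_GS)
                    *ret = s.base + off;
            else
                    *ret = off;
    } else {
            /* Outside long mode the segment base always applies. */
            *ret = s.base + off;
    }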
-
- 05 Jul 2019, 2 commits
-
-
By Krish Sadhukhan

According to the section "Checks on Host Segment and Descriptor-Table Registers" in Intel SDM vol 3C, the following checks are performed on vmentry of nested guests:

- In the selector field for each of CS, SS, DS, ES, FS, GS and TR, the RPL (bits 1:0) and the TI flag (bit 2) must be 0.
- The selector fields for CS and TR cannot be 0000H.
- The selector field for SS cannot be 0000H if the "host address-space size" VM-exit control is 0.
- On processors that support Intel 64 architecture, the base-address fields for FS, GS and TR must contain canonical addresses.

Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
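Expressed with the CC() consistency-check macro described earlier in this log, the checks look roughly like the following (a sketch, not the patch as merged; ia32e stands for the "host address-space size" VM-exit control, and SEGMENT_RPL_MASK/SEGMENT_TI_MASK are the existing x86 selector masks):

    if (CC(vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_ss_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_ds_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_es_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_fs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_gs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_tr_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
        CC(vmcs12->host_cs_selector == 0) ||
        CC(vmcs12->host_tr_selector == 0) ||
        CC(vmcs12->host_ss_selector == 0 && !ia32e))
            return -EINVAL;

    if (CC(is_noncanonical_address(vmcs12->host_fs_base, vcpu)) ||
        CC(is_noncanonical_address(vmcs12->host_gs_base, vcpu)) ||
        CC(is_noncanonical_address(vmcs12->host_tr_base, vcpu)))
            return -EINVAL;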
-
By Sean Christopherson

KVM does not have 100% coverage of VMX consistency checks, i.e. some checks that cause VM-Fail may only be detected by hardware during a nested VM-Entry. In such a case, KVM must restore L1's state to the pre-VM-Enter state, as L2's state has already been loaded into KVM's software model.

L1's CR3 and PDPTRs in particular are loaded from vmcs01.GUEST_*. But when EPT is disabled, the associated fields hold KVM's shadow values, not L1's "real" values. Fortunately, when EPT is disabled the PDPTRs come from memory, i.e. are not cached in the VMCS. Which leaves CR3 as the sole anomaly.

A previously applied workaround to handle CR3 was to force nested early checks if EPT is disabled:

    commit 2b27924b ("KVM: nVMX: always use early vmcs check when EPT is disabled")

Forcing nested early checks is undesirable as doing so adds hundreds of cycles to every nested VM-Entry. Rather than take this performance hit, handle CR3 by overwriting vmcs01.GUEST_CR3 with L1's CR3 during nested VM-Entry when EPT is disabled *and* nested early checks are disabled. By stuffing vmcs01.GUEST_CR3, nested_vmx_restore_host_state() will naturally restore the correct vcpu->arch.cr3 from vmcs01.GUEST_CR3.

These shenanigans work because nested_vmx_restore_host_state() does a full kvm_mmu_reset_context(), i.e. unloads the current MMU, which guarantees vmcs01.GUEST_CR3 will be rewritten with a new shadow CR3 prior to re-entering L1. vcpu->arch.root_mmu.root_hpa is set to INVALID_PAGE via:

    nested_vmx_restore_host_state() ->
        kvm_mmu_reset_context() ->
            kvm_mmu_unload() ->
                kvm_mmu_free_roots()

kvm_mmu_unload() has WARN_ON(root_hpa != INVALID_PAGE), i.e. we can bank on 'root_hpa == INVALID_PAGE' unless the implementation of kvm_mmu_reset_context() is changed. On the way into L1, VMCS.GUEST_CR3 is guaranteed to be written (on a successful entry) via:

    vcpu_enter_guest() ->
        kvm_mmu_reload() ->
            kvm_mmu_load() ->
                kvm_mmu_load_cr3() ->
                    vmx_set_cr3()

Stuff vmcs01.GUEST_CR3 if and only if nested early checks are disabled, as a "late" VM-Fail should never happen in that case (KVM WARNs), and the conditional write avoids the need to restore the correct GUEST_CR3 when nested_vmx_check_vmentry_hw() fails.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20190607185534.24368-1-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
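The heart of the change is one conditional VMWRITE during nested VM-Entry; a sketch under the assumptions stated above (enable_ept and nested_early_check are the module-level flags):

    /* Sketch: keep vmcs01.GUEST_CR3 equal to L1's real CR3 so that a
     * late VM-Fail rolls back to the correct value.  Needed only when
     * EPT is off (GUEST_CR3 otherwise holds a shadow CR3); skipped
     * when early checks run, as they catch the failure first. */
    if (!enable_ept && !nested_early_check)
            vmcs_writel(GUEST_CR3, vcpu->arch.cr3);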
-
- 03 Jul 2019, 5 commits
-
-
By Liran Alon

Currently, KVM_STATE_NESTED_EVMCS is used to signal that the eVMCS capability is enabled on the vCPU, as indicated by vmx->nested.enlightened_vmcs_enabled. This is quite bizarre, as the userspace VMM should make sure to expose the vCPU with the same CPUID values in both source and destination. If the vCPU is exposed with eVMCS support in CPUID, it is also expected to enable the KVM_CAP_HYPERV_ENLIGHTENED_VMCS capability. Therefore, KVM_STATE_NESTED_EVMCS is redundant.

KVM_STATE_NESTED_EVMCS is currently used on the restore path (vmx_set_nested_state()) only to enable the eVMCS capability in KVM and to signal need_vmcs12_sync, such that on the next VMEntry to the guest nested_sync_from_vmcs12() will be called to sync vmcs12 content into the eVMCS in guest memory. However, because restoring nested state is rare enough, we could have just modified vmx_set_nested_state() to always signal need_vmcs12_sync.

From all the above, it seems that we could have just removed the usage of KVM_STATE_NESTED_EVMCS. However, in order to preserve backwards migration compatibility, we cannot do that (vmx_get_nested_state() needs to signal the flag when migrating from a new kernel to an old kernel).

Returning KVM_STATE_NESTED_EVMCS when the vCPU merely has eVMCS enabled has the bad side effect that the userspace VMM must send nested state from source to destination as part of the migration stream, even if the guest has never used eVMCS, e.g. because it doesn't even run a nested hypervisor workload. This requires the destination userspace VMM and KVM to support setting nested state, which makes it more difficult to migrate from a new host to an older host.

To avoid this, change KVM_STATE_NESTED_EVMCS to signal that eVMCS is not only enabled but also active, i.e. the guest has made some eVMCS active via an enlightened VMEntry: vmcs12 is copied from the eVMCS and therefore should be restored into the eVMCS resident in memory (by copy_vmcs12_to_enlightened()).

Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maran Wilson <maran.wilson@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Liran Alon

As the comment in the code specifies, SMM temporarily disables VMX, so we cannot be in guest mode, nor can VMLAUNCH/VMRESUME be pending. However, the code currently assumes that these are the only flags that can be set on kvm_state->flags. This is not true, as KVM_STATE_NESTED_EVMCS can also be set on this field to signal that eVMCS should be enabled. Therefore, fix the code to check for guest-mode and pending VMLAUNCH/VMRESUME explicitly.

Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Jim Mattson

When L0 is executing handle_invept(), the TDP MMU is active. Emulating an L1 INVEPT does require synchronizing the appropriate shadow EPT root(s), but a call to kvm_mmu_sync_roots in this context won't do that. Similarly, the hardware TLB and paging-structure-cache entries associated with the appropriate shadow EPT root(s) must be flushed, but requesting a TLB_FLUSH from this context won't do that either.

How did this ever work? KVM always does a sync_roots and TLB flush (in the correct context) when transitioning from L1 to L2. That isn't the best choice for nested VM performance, but it effectively papers over the mistakes here. Remove the unnecessary operations and leave a comment to try to do better in the future.

Reported-by: Junaid Shahid <junaids@google.com>
Fixes: bfd0a56b ("nEPT: Nested INVEPT")
Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Cc: Nadav Har'El <nyh@il.ibm.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: Xinhao Xu <xinhao.xu@intel.com>
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Vitaly Kuznetsov

When Enlightened VMCS is in use, it is valid to do VMCLEAR and, according to TLFS, this should "transition an enlightened VMCS from the active to the non-active state". It is, however, wrong to assume that it is only valid to do VMCLEAR for the eVMCS which is currently active on the vCPU performing the VMCLEAR.

Currently, the logic in handle_vmclear() is broken: in case there is no active eVMCS on the vCPU doing VMCLEAR, we treat the argument as a 'normal' VMCS and call kvm_vcpu_write_guest() on the 'launch_state' field, which irreversibly corrupts the memory area.

So, in case the VMCLEAR argument is not the currently active eVMCS on the vCPU, how can we know if the area it is pointing to is a normal or an enlightened VMCS? Thanks to the bug in Hyper-V (see commit 72aeb60c ("KVM: nVMX: Verify eVMCS revision id match supported eVMCS version on eVMCS VMPTRLD")) we cannot: the revision can't be used to distinguish between them. So let's assume it is always enlightened in case enlightened vmentry is enabled in the assist page. Also, check vmx->nested.enlightened_vmcs_enabled to minimize the impact on 'unenlightened' workloads.

Fixes: b8bbab92 ("KVM: nVMX: implement enlightened VMPTRLD and VMCLEAR")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
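A sketch of the resulting handle_vmclear() tail under the stated assumption (vmptr, zero and evmcs_gpa stand for locals computed earlier in the real handler):

    /* Sketch: only clear launch_state for a genuine VMCS.  If eVMCS is
     * enabled for the vCPU and the assist page enables enlightened
     * vmentry, assume the target is an eVMCS and skip the write that
     * would corrupt it. */
    if (likely(!vmx->nested.enlightened_vmcs_enabled ||
               !nested_enlightened_vmentry(vcpu, &evmcs_gpa))) {
            if (vmptr == vmx->nested.current_vmptr)
                    nested_release_vmcs12(vcpu);

            kvm_vcpu_write_guest(vcpu,
                                 vmptr + offsetof(struct vmcs12, launch_state),
                                 &zero, sizeof(zero));
    }

    return nested_vmx_succeed(vcpu);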
-
By Vitaly Kuznetsov

Apparently, Windows doesn't maintain clean-fields data after it does VMCLEAR for an enlightened VMCS, so we can only use it on VMRESUME. The issue went unnoticed because currently we do nested_release_evmcs() in handle_vmclear() and the consecutive enlightened VMPTRLD invalidates clean fields when a new eVMCS is mapped, but we're going to change the logic.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 02 Jul 2019, 2 commits
-
-
By Paolo Bonzini

Allow userspace to set a custom value for the VMFUNC controls MSR, as long as the capabilities it advertises do not exceed those of the host.

Fixes: 27c42a1b ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor", 2017-08-03)
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
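In practice this is a one-case addition to the nested VMX MSR setter; a sketch, assuming the vmx->nested.msrs.vmfunc_controls field name:

    case MSR_IA32_VMX_VMFUNC:
            /* Userspace may clear advertised capabilities but never
             * add ones the host does not support. */
            if (data & ~vmx->nested.msrs.vmfunc_controls)
                    return -EINVAL;
            vmx->nested.msrs.vmfunc_controls = data;
            return 0;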
-
By Paolo Bonzini

Some secondary controls are automatically enabled/disabled based on the CPUID values that are set for the guest. However, they are still available at a global level and therefore should be present when KVM_GET_MSRS is sent to /dev/kvm.

Fixes: 1389309c ("KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace", 2018-02-26)
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 21 Jun 2019, 1 commit
-
-
By Paolo Bonzini

Commit 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state", 2019-05-02) broke evmcs_test because the eVMCS setup must be performed even if there is no VMXON region defined, as long as the eVMCS bit is set in the assist page. While the simplest possible fix would be to add a check on kvm_state->flags & KVM_STATE_NESTED_EVMCS in the initial "if" that covers kvm_state->hdr.vmx.vmxon_pa == -1ull, that is quite ugly. Instead, this patch moves the checks earlier in the function and conditionalizes them on kvm_state->hdr.vmx.vmxon_pa, so that vmx_set_nested_state always goes through vmx_leave_nested and nested_enable_evmcs.

Fixes: 332d0797 ("KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state")
Cc: Aaron Lewis <aaronlewis@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 Jun 2019, 1 commit
-
-
By Liran Alon

Improve the KVM_{GET,SET}_NESTED_STATE structs by detailing the format of VMX nested state data in a struct. In order to avoid changing the ioctl values of KVM_{GET,SET}_NESTED_STATE, there is a need to preserve sizeof(struct kvm_nested_state). This is done by defining the data struct as "data.vmx[0]". It was the most elegant way I found to preserve struct size while still keeping the struct readable and easy to maintain. It does have the unfortunate side effect that it now has to be accessed as "data.vmx[0]" rather than just "data.vmx".

Because we are already modifying these structs, I also modified the following:
* Define the "format" field values as macros.
* Rename vmcs_pa to vmcs12_pa for better readability.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
[Remove SVM stubs, add KVM_STATE_NESTED_VMX_VMCS12_SIZE. - Paolo]
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
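The resulting UAPI layout, sketched from the description above (see include/uapi/linux/kvm.h for the authoritative definition; the SVM counterpart came later):

    struct kvm_vmx_nested_state_data {
            __u8 vmcs12[KVM_STATE_NESTED_VMX_VMCS12_SIZE];
            __u8 shadow_vmcs12[KVM_STATE_NESTED_VMX_VMCS12_SIZE];
    };

    struct kvm_nested_state {
            __u16 flags;
            __u16 format;   /* KVM_STATE_NESTED_FORMAT_VMX / _SVM */
            __u32 size;

            union {
                    struct kvm_vmx_nested_state_hdr vmx;
                    /* Pad the header to 128 bytes. */
                    __u8 pad[120];
            } hdr;

            /* Zero-sized so sizeof(struct kvm_nested_state), and hence
             * the ioctl values, stay unchanged; accessed as data.vmx[0]. */
            union {
                    struct kvm_vmx_nested_state_data vmx[0];
            } data;
    };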
-
- 18 Jun 2019, 20 commits
-
-
By Sean Christopherson

VMWRITEs to the major VMCS controls, pin controls included, are deceptively expensive. CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles.

As it pertains to KVM, toggling PIN_BASED_VMX_PREEMPTION_TIMER more than doubles the latency of the next VM-Entry (and again when/if the flag is toggled back). In a non-nested scenario, running a "standard" guest with the preemption timer enabled, toggling the timer flag is uncommon but not rare, e.g. roughly 1 in 10 entries. Disabling the preemption timer can change these numbers due to its use for "immediate exits", even when explicitly disabled by userspace. Nested virtualization is particularly painful, as the timer flag is set for the majority of VM-Enters, but prepare_vmcs02() initializes vmcs02's pin controls to *clear* the flag since the timer's final state isn't known until vmx_vcpu_run(). I.e. the majority of nested VM-Enters end up unnecessarily writing pin controls *twice*.

Rather than toggle the timer flag in pin controls, set the timer value itself to the largest allowed value to put it into a "soft disabled" state, and ignore any spurious preemption timer exits. Sadly, the timer is a 32-bit value and so theoretically it can fire before the heat death of the universe, i.e. spurious exits are possible. But because KVM does *not* save the timer value on VM-Exit and because the timer runs at a slower rate than the TSC, the maximum timer value is still sufficiently large for KVM's purposes. E.g. on a modern CPU with a timer that runs at 1/32 the frequency of a 2.4GHz constant-rate TSC, the timer will fire after ~55 seconds of *uninterrupted* guest execution. In other words, spurious VM-Exits are effectively only possible if the host is completely tickless on the logical CPU, the guest is not using the preemption timer, and the guest is not generating VM-Exits for any other reason.

To be safe from bad/weird hardware, disable the preemption timer if its maximum delay is less than ten seconds. Ten seconds is mostly arbitrary and was selected in no small part because it's a nice round number. For simplicity and paranoia, fall back to __kvm_request_immediate_exit() if the preemption timer is disabled by KVM or userspace. Previously KVM continued to use the preemption timer to force immediate exits even when the timer was disabled by userspace. Now that KVM leaves the timer running instead of truly disabling it, allow userspace to kill it entirely in the unlikely event the timer (or KVM) malfunctions.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
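A sketch of the "soft disabled" state (names follow this entry's description; the exact plumbing may differ): the timer stays enabled in pin controls but is armed with the maximum 32-bit count, and a spurious expiry is simply swallowed on VM-Exit:

    /* Sketch: arm the timer with the largest possible count instead of
     * toggling PIN_BASED_VMX_PREEMPTION_TIMER in pin controls. */
    vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, -1);

    /* Sketch: ignore the (theoretical) spurious expiry. */
    static int handle_preemption_timer(struct kvm_vcpu *vcpu)
    {
            struct vcpu_vmx *vmx = to_vmx(vcpu);

            if (!vmx->req_immediate_exit &&
                !unlikely(vmx->loaded_vmcs->hv_timer_soft_disabled))
                    kvm_lapic_expired_hv_timer(vcpu);

            return 1;
    }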
-
By Sean Christopherson

... now that it is fully redundant with the pin controls shadow.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

KVM dynamically toggles SECONDARY_EXEC_DESC to intercept (a subset of) instructions that are subject to User-Mode Instruction Prevention, i.e. VMCS.SECONDARY_EXEC_DESC == CR4.UMIP when emulating UMIP. Preset the VMCS control when preparing vmcs02 to avoid unnecessary VMWRITEs, e.g. KVM will clear VMCS.SECONDARY_EXEC_DESC in prepare_vmcs02_early() and then set it in vmx_set_cr4().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

KVM dynamically toggles the CPU_BASED_USE_MSR_BITMAPS execution control for nested guests based on whether or not both L0 and L1 want to pass through the same MSRs to L2. Preserve the last used value from vmcs02 so as to avoid multiple VMWRITEs to (re)set/(re)clear the bit on nested VM-Entry.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Or: Don't re-initialize vmcs02's controls on every nested VM-Entry. VMWRITEs to the major VMCS controls are deceptively expensive. Intel CPUs with VMCS caching (Westmere and later) also optimize away consistency checks on VM-Entry, i.e. skip consistency checks if the relevant fields have not changed since the last successful VM-Entry (of the cached VMCS). Because uops are a precious commodity, uCode's dirty VMCS field tracking isn't as precise as software would prefer. Notably, writing any of the major VMCS fields effectively marks the entire VMCS dirty, i.e. causes the next VM-Entry to perform all consistency checks, which consumes several hundred cycles.

Zero out the controls' shadow copies during VMCS allocation and use the optimized setter when "initializing" controls. While this technically affects both non-nested and nested virtualization, nested virtualization is the primary beneficiary, as avoiding VMWRITEs when preparing vmcs02 allows hardware to optimize away consistency checks.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

... now that the shadow copies are per-VMCS.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid VMREADs when switching between vmcs01 and vmcs02, and more importantly can eliminate costly VMWRITEs to controls when preparing vmcs02. Shadowing exec controls also saves a VMREAD when opening virtual INTR/NMI windows, yay...

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Prepare to shadow all major control fields on a per-VMCS basis, which allows KVM to avoid costly VMWRITEs when switching between vmcs01 and vmcs02. Shadowing pin controls also allows a future patch to remove the per-VMCS 'hv_timer_armed' flag, as the shadow copy is a superset of said flag.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
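The shadow accessors follow a single builder pattern per control field; a sketch of the macro (modeled on the kernel's BUILD_CONTROLS_SHADOW; the real version also generates setbit/clearbit helpers):

    /* Sketch: cache the last value written to a control field and skip
     * the VMWRITE when the value is unchanged. */
    #define BUILD_CONTROLS_SHADOW(lname, uname)                             \
    static inline void lname##_controls_set(struct vcpu_vmx *vmx, u32 val) \
    {                                                                       \
            if (vmx->loaded_vmcs->controls_shadow.lname != val) {           \
                    vmcs_write32(uname, val);                               \
                    vmx->loaded_vmcs->controls_shadow.lname = val;          \
            }                                                               \
    }                                                                       \
    static inline u32 lname##_controls_get(struct vcpu_vmx *vmx)           \
    {                                                                       \
            return vmx->loaded_vmcs->controls_shadow.lname;                 \
    }

    BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL)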
-
By Sean Christopherson

KVM provides a module parameter to allow disabling virtual NMI support to simplify testing (hardware *without* virtual NMI support is hard to come by but it does have users). When preparing vmcs02, use the accessor for pin controls to ensure that the module param is respected for nested guests. Opportunistically swap the order of applying L0's and L1's pin controls to better align with other controls and to prepare for a future patch that will ignore L1's, but not L0's, preemption timer flag.

Fixes: d02fcf50 ("kvm: vmx: Allow disabling virtual NMI support")
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

Per Intel's SDM:

    ... the logical processor uses PAE paging if CR0.PG=1, CR4.PAE=1
    and IA32_EFER.LME=0. A VM entry to a guest that uses PAE paging
    loads the PDPTEs into internal, non-architectural registers based
    on the setting of the "enable EPT" VM-execution control.

and:

    [GUEST_PDPTR] values are saved into the four PDPTE fields as
    follows:
    - If the "enable EPT" VM-execution control is 0 or the logical
      processor was not using PAE paging at the time of the VM exit,
      the values saved are undefined.

In other words, if EPT is disabled or the guest isn't using PAE paging, then the PDPTRs aren't consumed by hardware on VM-Entry and are loaded with junk on VM-Exit. From a nesting perspective, all of the above holds true, i.e. KVM can effectively ignore the VMCS PDPTRs. E.g. KVM already loads the PDPTRs from memory when nested EPT is disabled (see nested_vmx_load_cr3()).

Because KVM intercepts setting CR4.PAE, there is no danger of consuming a stale value or crushing L1's VMWRITEs regardless of whether L1 intercepts CR4.PAE. The vmcs12's values are unchanged up until the VM-Exit where L2 sets CR4.PAE, i.e. L0 will see the new PAE state on the subsequent VM-Entry and propagate the PDPTRs from vmcs12 to vmcs02.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Paolo Bonzini

Checking for 32-bit PAE is quite common around code that fiddles with the PDPTRs. Add a function to compress all checks into a single invocation.

Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
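The helper is a one-liner; a sketch matching the description (the constituent predicates are existing KVM helpers):

    /* Sketch: 32-bit PAE paging means paging on, PAE on, long mode off. */
    static inline bool is_pae_paging(struct kvm_vcpu *vcpu)
    {
            return !is_long_mode(vcpu) && is_pae(vcpu) && is_paging(vcpu);
    }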
-
By Sean Christopherson

L1 is responsible for dirtying GUEST_GRP1 if it writes GUEST_BNDCFGS.

Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

KVM unconditionally intercepts WRMSR to MSR_IA32_DEBUGCTLMSR. In the unlikely event that L1 allows L2 to write L1's MSR_IA32_DEBUGCTLMSR, but saves L2's value on VM-Exit, update vmcs12 during L2's WRMSR so as to eliminate the need to VMREAD the value from vmcs02 on nested VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

For L2, KVM always intercepts WRMSR to SYSENTER MSRs. Update vmcs12 in the WRMSR handler so that they don't need to be (re)read from vmcs02 on every nested VM-Exit.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

As alluded to by the TODO comment, KVM unconditionally intercepts writes to the PAT MSR. In the unlikely event that L1 allows L2 to write L1's PAT directly but saves L2's PAT on VM-Exit, update vmcs12 when L2 writes the PAT. This eliminates the need to VMREAD the value from vmcs02 on VM-Exit, as vmcs12 is already up to date in all situations.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
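A sketch of the WRMSR-side bookkeeping this describes, under the assumption that the PAT case in vmx_set_msr() gains a guest-mode check (field names follow vmcs12):

    case MSR_IA32_CR_PAT:
            /* Sketch: mirror L2's PAT write into vmcs12 so a nested
             * VM-Exit need not VMREAD it back out of vmcs02. */
            if (is_guest_mode(vcpu) &&
                (get_vmcs12(vcpu)->vm_exit_controls & VM_EXIT_SAVE_IA32_PAT))
                    get_vmcs12(vcpu)->guest_ia32_pat = data;

            /* The pre-existing PAT handling follows. */
            break;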
-
By Sean Christopherson

If nested_get_vmcs12_pages() fails to map L1's APIC_ACCESS_ADDR into L2, then it disables SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES in vmcs02. In other words, the APIC_ACCESS_ADDR in vmcs02 is guaranteed to be written with the correct value before being consumed by hardware; drop the unnecessary VMWRITE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

The VIRTUAL_APIC_PAGE_ADDR in vmcs02 is guaranteed to be updated before it is consumed by hardware, either in nested_vmx_enter_non_root_mode() or via the KVM_REQ_GET_VMCS12_PAGES callback. Avoid an extra VMWRITE and only stuff a bad value into vmcs02 when mapping vmcs12's address fails. This also eliminates the need for extra comments to connect the dots between prepare_vmcs02_early() and nested_get_vmcs12_pages().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

... as a malicious userspace can run a toy guest to generate invalid virtual-APIC page addresses in L1, i.e. flood the kernel log with error messages.

Fixes: 69090810 ("KVM: nVMX: allow tests to use bad virtual-APIC page address")
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson

When switching between vmcs01 and vmcs02, there is no need to update state tracking for values that aren't tied to any particular VMCS, as the per-vCPU values are already up to date (vmx_switch_vmcs() can only be called when the vCPU is loaded). Avoiding the update eliminates a RDMSR, and potentially a RDPKRU and posted-interrupt update (cmpxchg64() and more).

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-