Commits · 7e7126846c95a34f98a1524d5c473af1f0783735 · openanolis / cloud-kernel

04 Oct, 2018 3 commits

kvm: nVMX: fix entry with pending interrupt if APICv is enabled · 7e712684

Committed by Paolo Bonzini on Oct 03, 2018

Commit b5861e5c introduced a check on
the interrupt-window and NMI-window CPU execution controls in order to
inject an external interrupt vmexit before the first guest instruction
executes.  However, when APIC virtualization is enabled the host does not
need a vmexit in order to inject an interrupt at the next interrupt window;
instead, it just places the interrupt vector in RVI and the processor will
inject it as soon as possible.  Therefore, on machines with APICv it is
not enough to check the CPU execution controls: the same scenario can also
happen if RVI>vPPR.

Fixes: b5861e5c
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

7e712684

KVM: VMX: hide flexpriority from guest when disabled at the module level · 2cf7ea9f

Committed by Paolo Bonzini on Oct 03, 2018

As of commit 8d860bbe ("kvm: vmx: Basic APIC virtualization controls
have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when
a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0,
whereas previously KVM would allow a nested guest to enable
VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware.  That is,
KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't
(always) allow setting it when kvm-intel.flexpriority=0, and may even
initially allow the control and then clear it when the nested guest
writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause
functional issues.

Hide the control completely when the module parameter is cleared.
Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

2cf7ea9f

KVM: VMX: check for existence of secondary exec controls before accessing · fd6b6d9b

Committed by Sean Christopherson on Oct 01, 2018

Return early from vmx_set_virtual_apic_mode() if the processor doesn't
support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of
which reside in SECONDARY_VM_EXEC_CONTROL.  This eliminates warnings
due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing
on processors without secondary exec controls.

Remove the similar check for TPR shadowing as it is incorporated in the
flexpriority_enabled check and the APIC-related code in
vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE.
Reported-by: Gerhard Wiesinger <redhat@wiesinger.com>
Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

fd6b6d9b

01 Oct, 2018 3 commits

KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS · 62cf9bd8

Committed by Liran Alon on Sep 14, 2018

L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only
when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls.

Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which
is L1 IA32_BNDCFGS.
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

62cf9bd8
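The selection rule described in 62cf9bd8 can be sketched as follows. This is a minimal illustration, not the kernel code: `struct vmcs12_sketch` and `l2_bndcfgs()` are hypothetical names, and the constant assumes the "load IA32_BNDCFGS" VM-entry control is bit 16.

```c
#include <stdint.h>

/* "Load IA32_BNDCFGS" VM-entry control, assumed here to be bit 16. */
#define VM_ENTRY_LOAD_BNDCFGS (UINT32_C(1) << 16)

/* Hypothetical, trimmed-down stand-in for the relevant vmcs12 fields. */
struct vmcs12_sketch {
    uint32_t vm_entry_controls;
    uint64_t guest_bndcfgs;
};

/*
 * Pick the IA32_BNDCFGS value L2 should run with.
 * vmcs01_guest_bndcfgs holds L1's own IA32_BNDCFGS.
 */
uint64_t l2_bndcfgs(const struct vmcs12_sketch *vmcs12,
                    uint64_t vmcs01_guest_bndcfgs)
{
    /* Only honor vmcs12->guest_bndcfgs when L1 asked for it to be loaded. */
    if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)
        return vmcs12->guest_bndcfgs;

    /* Otherwise L2 inherits L1's value. */
    return vmcs01_guest_bndcfgs;
}
```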

KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly · 503234b3

Committed by Liran Alon on Sep 14, 2018

Commit a87036ad ("KVM: x86: disable MPX if host did not enable
MPX XSAVE features") introduced kvm_mpx_supported() to return true
iff MPX is enabled in the host.

However, that commit seems to have missed replacing some calls to
kvm_x86_ops->mpx_supported() with kvm_mpx_supported().

Complete the original commit by replacing the remaining calls with
kvm_mpx_supported().

Fixes: a87036ad ("KVM: x86: disable MPX if host did not enable
MPX XSAVE features")
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

503234b3

KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled · 5f76f6f5

Committed by Liran Alon on Sep 14, 2018

Before this commit, KVM exposed MPX VMX controls to the L1 guest based
only on whether KVM and the host processor support MPX virtualization.
However, these controls should be exposed to the guest only if the guest
vCPU supports MPX.

Without this change, an L1 guest running a kernel that lacks
commit 691bd434 ("kvm: vmx: allow host to access guest
MSR_IA32_BNDCFGS") asserts in QEMU on the following:
	qemu-kvm: error: failed to set MSR 0xd90 to 0x0
	qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs:
	Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed'
This is because L1 KVM's kvm_init_msr_list() will see that
vmx_mpx_supported() returns true (as it only checks for MPX VMX controls
support) and therefore the KVM_GET_MSR_INDEX_LIST IOCTL will include
MSR_IA32_BNDCFGS. However, when L1 later attempts to set this MSR via the
KVM_SET_MSRS IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu).

Therefore, fix the issue by exposing MPX VMX controls to L1 guest only
when vCPU supports MPX.

Fixes: 36be0b9d ("KVM: x86: Add nested virtualization support for MPX")
Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5f76f6f5

25 Sep, 2018 1 commit

KVM: x86: never trap MSR_KERNEL_GS_BASE · 4679b61f

Committed by Paolo Bonzini on Sep 24, 2018

KVM has an old optimization whereby accesses to the kernel GS base MSR
are trapped when the guest is in 32-bit and not when it is in 64-bit mode.
The idea is that swapgs is not available in 32-bit mode, thus the
guest has no reason to access the MSR unless in 64-bit mode and
32-bit applications need not pay the price of switching the kernel GS
base between the host and the guest values.

However, this optimization adds complexity to the code for little
benefit (these days most guests are going to be 64-bit anyway) and in fact
broke after commit 678e315e ("KVM: vmx: add dedicated utility to
access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base
can be corrupted across SMIs and UEFI Secure Boot is therefore broken
(a secure boot Linux guest, for example, fails to reach the login prompt
about half the time).  This patch just removes the optimization; the
kernel GS base MSR is now never trapped by KVM, similarly to the FS and
GS base MSRs.

Fixes: 678e315e
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4679b61f

20 Sep, 2018 7 commits

nVMX x86: Check VPID value on vmentry of L2 guests · ba8e23db

Committed by Krish Sadhukhan on Sep 04, 2018

According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check needs to be enforced on vmentry of L2 guests:

    If the 'enable VPID' VM-execution control is 1, the value of the
    VPID VM-execution control field must not be 0000H.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ba8e23db
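A minimal sketch of the VPID consistency check quoted in ba8e23db. The function name `vpid_controls_valid()` is hypothetical, and the constant assumes 'enable VPID' is bit 5 of the secondary processor-based VM-execution controls.

```c
#include <stdbool.h>
#include <stdint.h>

/* 'Enable VPID' secondary VM-execution control, assumed here to be bit 5. */
#define SECONDARY_EXEC_ENABLE_VPID (UINT32_C(1) << 5)

/*
 * Return false when the controls would fail the check described above:
 * 'enable VPID' is set but the VPID field is 0000H.
 */
bool vpid_controls_valid(uint32_t secondary_exec_control, uint16_t vpid)
{
    if ((secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID) && vpid == 0)
        return false;

    return true;
}
```

When the check fails, the emulated vmentry would be expected to fail rather than run L2 with VPID 0.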

nVMX x86: check posted-interrupt descriptor address on vmentry of L2 · 6de84e58

Committed by Krish Sadhukhan on Aug 23, 2018

According to section "Checks on VMX Controls" in Intel SDM vol 3C,
the following check needs to be enforced on vmentry of L2 guests:

   - Bits 5:0 of the posted-interrupt descriptor address are all 0.
   - The posted-interrupt descriptor address does not set any bits
     beyond the processor's physical-address width.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6de84e58
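The two address checks listed in 6de84e58 amount to an alignment test and a width test. The sketch below assumes the caller supplies the physical-address width; `pi_desc_addr_valid()` and its `maxphyaddr` parameter are hypothetical names used for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Validate a posted-interrupt descriptor address.
 * maxphyaddr is the processor's physical-address width in bits.
 */
bool pi_desc_addr_valid(uint64_t addr, unsigned int maxphyaddr)
{
    /* Bits 5:0 must all be 0, i.e. the descriptor is 64-byte aligned. */
    if (addr & 0x3f)
        return false;

    /* No bits may be set at or above the physical-address width. */
    if (maxphyaddr < 64 && (addr >> maxphyaddr) != 0)
        return false;

    return true;
}
```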

KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c

Committed by Liran Alon on Sep 04, 2018

If L1 does not intercept L2 HLT, or enters L2 in HLT activity-state,
it is possible for a vCPU to be blocked while it is in guest-mode.

According to Intel SDM 26.6.5 Interrupt-Window Exiting and
Virtual-Interrupt Delivery: "These events wake the logical processor
if it just entered the HLT state because of a VM entry".
Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
should be woken from the HLT state and injected with the interrupt.

In addition, if the vCPU receives a nested posted-interrupt while it is
blocked (while it is in guest-mode), then the vCPU should also be
woken and injected with the posted interrupt.

To handle these cases, this patch enhances kvm_vcpu_has_events() to also
check if there is a pending interrupt in L2 virtual APICv provided by
L1. That is, it evaluates if there is a pending virtual interrupt for L2
by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
Evaluation of Pending Interrupts.

Note that this also handles the case of nested posted-interrupt by the
fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
kvm_vcpu_running() -> vmx_check_nested_events() ->
vmx_complete_nested_posted_interrupt().
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

e6c67d8c
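The RVI[7:4] > VPPR[7:4] comparison mentioned in e6c67d8c compares only the priority class (upper nibble) of the two values. A minimal sketch, with hypothetical helper names, assuming RVI is the low byte of the 16-bit guest interrupt status field:

```c
#include <stdbool.h>
#include <stdint.h>

/* RVI is assumed to occupy the low byte of the guest interrupt status. */
uint8_t rvi_from_guest_intr_status(uint16_t guest_intr_status)
{
    return (uint8_t)(guest_intr_status & 0xff);
}

/*
 * A virtual interrupt is pending for delivery when the priority class
 * of RVI is strictly higher than the priority class of VPPR.
 */
bool virtual_interrupt_pending(uint8_t rvi, uint8_t vppr)
{
    return (rvi >> 4) > (vppr >> 4);
}
```

Two vectors in the same priority class (for example 0x4f against a VPPR of 0x40) do not count as pending, which is why the comparison ignores the low nibble.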

KVM: VMX: check nested state and CR4.VMXE against SMM · 5bea5123

Committed by Paolo Bonzini on Sep 18, 2018

VMX cannot be enabled under SMM; check this when CR4 is set and when nested
virtualization state is restored.

This should fix some WARNs reported by syzkaller, mostly around
alloc_shadow_vmcs.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5bea5123

KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c

Committed by Sean Christopherson on Aug 27, 2018

A VMX preemption timer value of '0' is guaranteed to cause a VMExit
prior to the CPU executing any instructions in the guest.  Use the
preemption timer (if it's supported) to trigger immediate VMExit
in place of the current method of sending a self-IPI.  This ensures
that pending VMExit injection to L1 occurs prior to executing any
instructions in the guest (regardless of nesting level).

When deferring VMExit injection, KVM generates an immediate VMExit
from the (possibly nested) guest by sending itself an IPI.  Because
hardware interrupts are blocked prior to VMEnter and are unblocked
(in hardware) after VMEnter, this results in taking a VMExit(INTR)
before any guest instruction is executed.  But, as this approach
relies on the IPI being received before VMEnter executes, it only
works as intended when KVM is running as L0.  Because there are no
architectural guarantees regarding when IPIs are delivered, when
running nested the INTR may "arrive" long after L2 is running e.g.
L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

For the most part, this unintended delay is not an issue since the
events being injected to L1 also do not have architectural guarantees
regarding their timing.  The notable exception is the VMX preemption
timer[1], which is architecturally guaranteed to cause a VMExit prior
to executing any instructions in the guest if the timer value is '0'
at VMEnter.  Specifically, the delay in injecting the VMExit causes
the preemption timer KVM unit test to fail when run in a nested guest.

Note: this approach is viable even on CPUs with a broken preemption
timer, as broken in this context only means the timer counts at the
wrong rate.  There are no known errata affecting timer value of '0'.

[1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Use a hook for SVM instead of leaving the default in x86.c - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d264ee0c

KVM: VMX: modify preemption timer bit only when arming timer · f459a707

Committed by Sean Christopherson on Aug 27, 2018

Provide a singular location where the VMX preemption timer bit is
set/cleared so that future usages of the preemption timer can ensure
the VMCS bit is up-to-date without having to modify unrelated code
paths.  For example, the preemption timer can be used to force an
immediate VMExit.  Cache the status of the timer to avoid redundant
VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
VMEnters/VMExits.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f459a707

KVM: VMX: immediately mark preemption timer expired only for zero value · 4c008127

Committed by Sean Christopherson on Aug 27, 2018

A VMX preemption timer value of '0' at the time of VMEnter is
architecturally guaranteed to cause a VMExit prior to the CPU
executing any instructions in the guest.  This architectural
definition is in place to ensure that a previously expired timer
is correctly recognized by the CPU as it is possible for the timer
to reach zero and not trigger a VMexit due to a higher priority
VMExit being signalled instead, e.g. a pending #DB that morphs into
a VMExit.

Whether by design or coincidence, commit f4124500 ("KVM: nVMX:
Fully emulate preemption timer") special cased timer values of '0'
and '1' to ensure prompt delivery of the VMExit.  Unlike '0', a
timer value of '1' has no architectural guarantees regarding
when it is delivered.

Modify the timer emulation to trigger immediate VMExit if and only
if the timer value is '0', and document precisely why '0' is special.
Do this even if calibration of the virtual TSC failed, i.e. VMExit
will occur immediately regardless of the frequency of the timer.
Making only '0' a special case gives KVM leeway to be more aggressive
in ensuring the VMExit is injected prior to executing instructions in
the nested guest, and also eliminates any ambiguity as to why '1' is
a special case, e.g. why wasn't the threshold for a "short timeout"
set to 10, 100, 1000, etc...
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

4c008127
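The "if and only if the timer value is '0'" rule from 4c008127 can be illustrated with a small decision helper. This is a sketch under stated assumptions, not the kernel's emulation code: the function name is hypothetical, and `rate_shift` stands for the power-of-two ratio between TSC cycles and preemption timer ticks, which the caller is assumed to supply.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide how to emulate a preemption timer programmed with 'value'.
 *
 * Returns true when the timer must be treated as already expired, which
 * happens only for a value of 0. Otherwise stores the timeout in TSC
 * cycles in *tsc_cycles and returns false, so the caller can arm a timer.
 * rate_shift must be less than 32 so the shifted value fits in 64 bits.
 */
bool preemption_timer_expires_at_entry(uint32_t value, unsigned int rate_shift,
                                       uint64_t *tsc_cycles)
{
    if (value == 0)
        return true;

    *tsc_cycles = (uint64_t)value << rate_shift;
    return false;
}
```

A value of 1 takes the ordinary path and simply yields a short timeout, with no special casing.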

08 Sep, 2018 1 commit

KVM: nVMX: Fix loss of pending IRQ/NMI before entering L2 · b5861e5c

Committed by Liran Alon on Sep 03, 2018

Consider the case where L1 had a pending IRQ/NMI event when it executed
VMLAUNCH/VMRESUME, and the event wasn't delivered because it was disallowed
(e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME,
L0 needs to evaluate whether this pending event should cause an exit from
L2 to L1 or be delivered directly to L2 (e.g. if L1 doesn't intercept
EXTERNAL_INTERRUPT).

Usually this would be handled by L0 requesting an IRQ/NMI window
by setting VMCS accordingly. However, this setting was done on
VMCS01 and now VMCS02 is active instead. Thus, when L1 executes
VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by
requesting a KVM_REQ_EVENT.

Note that the above scenario occurs when L1 KVM is about to enter L2 but
requests an "immediate-exit": in this case, L1 will disable interrupts
and then send a self-IPI before entering L2.
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

b5861e5c

30 Aug, 2018 3 commits

KVM: x86: Rename emulate_instruction() to kvm_emulate_instruction() · 0ce97a2b

Committed by Sean Christopherson on Aug 23, 2018

Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
specific function, and there's no conflict that prevents adding the
kvm_ prefix.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

0ce97a2b

KVM: VMX: Do not allow reexecute_instruction() when skipping MMIO instr · c4409905

Committed by Sean Christopherson on Aug 23, 2018

Re-execution after an emulation decode failure is only intended to
handle a case where two or more vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of MMIO emulation.
As handle_ept_misconfig() is only used for MMIO emulation, it should
pass EMULTYPE_NO_REEXECUTE when using the emulator to skip an instr
in the fast-MMIO case where VM_EXIT_INSTRUCTION_LEN is invalid.

And because the cr2 value passed to x86_emulate_instruction() is only
destined for use when retrying or reexecuting, we can simply call
emulate_instruction().

Fixes: d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length
                      for fast MMIO when running nested")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

c4409905

KVM: nVMX: avoid redundant double assignment of nested_run_pending · b871da4a

Committed by Vitaly Kuznetsov on Aug 23, 2018

nested_run_pending is set 20 lines above and check_vmentry_prereqs()/
check_vmentry_postreqs() don't seem to be resetting it (the latter, however,
checks it).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>

b871da4a

22 Aug, 2018 3 commits

KVM: VMX: fixes for vmentry_l1d_flush module parameter · 0027ff2a

Committed by Paolo Bonzini on Aug 22, 2018

Two bug fixes:

1) missing entries in the l1d_param array; this can cause a host crash
if an access attempts to reach the missing entry. Future-proof the get
function against any overflows as well.  However, the two entries
VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must
not be accepted by the parse function, so disable them there.

2) invalid values must be rejected even if the CPU does not have the
bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF)

... and a small refactoring, since the .cmd field is redundant with
the index in the array.
Reported-by: Bandan Das <bsd@redhat.com>
Cc: stable@vger.kernel.org
Fixes: a7b9020b
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0027ff2a
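The first fix in 0027ff2a describes a table that has an entry for every state, a parse function that refuses the two report-only states, and a getter that is guarded against out-of-range indices. The sketch below illustrates that shape only; every identifier and option string in it is hypothetical and not taken from the kernel source.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical mirror of the flush states; the index doubles as the value. */
enum l1d_state_sketch {
    SKETCH_L1D_AUTO,
    SKETCH_L1D_NEVER,
    SKETCH_L1D_COND,
    SKETCH_L1D_ALWAYS,
    SKETCH_L1D_EPT_DISABLED,  /* can be reported, must not be settable */
    SKETCH_L1D_NOT_REQUIRED,  /* can be reported, must not be settable */
    SKETCH_L1D_NUM_STATES
};

struct l1d_option {
    const char *name;
    bool settable;  /* accepted by the parse function */
};

/* One entry per state, so the getter never indexes a missing slot. */
static const struct l1d_option l1d_options[SKETCH_L1D_NUM_STATES] = {
    [SKETCH_L1D_AUTO]         = { "auto",         true  },
    [SKETCH_L1D_NEVER]        = { "never",        true  },
    [SKETCH_L1D_COND]         = { "cond",         true  },
    [SKETCH_L1D_ALWAYS]       = { "always",       true  },
    [SKETCH_L1D_EPT_DISABLED] = { "EPT disabled", false },
    [SKETCH_L1D_NOT_REQUIRED] = { "not required", false },
};

/* Return the state index for a settable option, or -1 if rejected. */
int l1d_parse(const char *s)
{
    for (int i = 0; i < SKETCH_L1D_NUM_STATES; i++) {
        if (l1d_options[i].settable && strcmp(s, l1d_options[i].name) == 0)
            return i;
    }
    return -1;
}

/* Getter that tolerates out-of-range states instead of overflowing. */
const char *l1d_name(unsigned int state)
{
    if (state >= SKETCH_L1D_NUM_STATES)
        return "invalid";
    return l1d_options[state].name;
}
```

Because the array index identifies the state, a separate command field per entry would be redundant, which matches the small refactoring the message mentions.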

KVM: vmx: Inject #UD for SGX ENCLS instruction in guest · 0b665d30

Committed by Sean Christopherson on Aug 14, 2018

Virtualization of Intel SGX depends on Enclave Page Cache (EPC)
management that is not yet available in the kernel, i.e. KVM support
for exposing SGX to a guest cannot be added until basic support
for SGX is upstreamed, which is a WIP[1].

Until SGX is properly supported in KVM, ensure a guest sees expected
behavior for ENCLS, i.e. all ENCLS #UD.  Because SGX does not have a
true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS
instruction can be executed[2] by the guest if SGX is supported by the
system.  Intercept all ENCLS leafs (via the ENCLS-exiting control and
field) and unconditionally inject #UD.

[1] https://www.spinics.net/lists/kvm/msg171333.html or
    https://lkml.org/lkml/2018/7/3/879

[2] A guest can execute ENCLS in the sense that ENCLS will not take
    an immediate #UD, but no ENCLS will ever succeed in a guest
    without explicit support from KVM (map EPC memory into the guest),
    unless KVM has a *very* egregious bug, e.g. accidentally mapped
    EPC memory into the guest SPTEs.  In other words this patch is
    needed only to prevent the guest from seeing inconsistent behavior,
    e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf
    operand(s) does not point at EPC memory) instead of #UD on ENCLS.
    Intercepting ENCLS is not required to prevent the guest from truly
    utilizing SGX.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20180814163334.25724-3-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

0b665d30

x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush() · d806afa4

Committed by Yi Wang on Aug 16, 2018

Substitute spaces with tab. No functional changes.
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Message-Id: <1534398159-48509-1-git-send-email-wang.yi59@zte.com.cn>
Cc: stable@vger.kernel.org # L1TF
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d806afa4

21 Aug, 2018 1 commit

x86/kvm/vmx: Remove duplicate l1d flush definitions · 94d7a86c

Committed by Josh Poimboeuf on Aug 14, 2018

These are already defined higher up in the file.

Fixes: 7db92e16 ("x86/kvm: Move l1tf setup function")
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/d7ca03ae210d07173452aeed85ffe344301219a5.1534253536.git.jpoimboe@redhat.com

94d7a86c

07 Aug, 2018 1 commit

KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c · fd8ca6da

Committed by Uros Bizjak on Aug 06, 2018

Remove open-coded uses of set instructions to use CC_SET()/CC_OUT() in
arch/x86/kvm/vmx.c.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
[Mark error paths as unlikely while touching this. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

fd8ca6da

06 Aug, 2018 17 commits

KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible · 5e079c7e

Committed by Sean Christopherson on Jul 23, 2018

The host's FS.base and GS.base rarely change, e.g. ~0.1% of host/guest
swaps on my system. Cache the last value written to the VMCS and skip
the VMWRITE to the associated VMCS fields when loading host state if
the value hasn't changed since the last VMWRITE.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

5e079c7e
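The caching idea in 5e079c7e (remember the last value written and skip the write when nothing changed) can be sketched without any VMX hardware. Everything below is illustrative: `fake_vmwrite()` only counts calls in place of a real VMWRITE, and the struct and function names are hypothetical.

```c
#include <stdint.h>

/* Counts simulated VMWRITEs so the saving is observable. */
unsigned long vmwrite_count;

/* Stand-in for a real VMWRITE; it only bumps the counter. */
static void fake_vmwrite(uint64_t value)
{
    (void)value;
    vmwrite_count++;
}

/* Last values written to the (simulated) HOST_FS_BASE/HOST_GS_BASE fields. */
struct host_base_cache {
    uint64_t fs_base;
    uint64_t gs_base;
};

/* Write each base only if it differs from the cached copy. */
void load_host_bases(struct host_base_cache *cache,
                     uint64_t fs_base, uint64_t gs_base)
{
    if (cache->fs_base != fs_base) {
        fake_vmwrite(fs_base);
        cache->fs_base = fs_base;
    }
    if (cache->gs_base != gs_base) {
        fake_vmwrite(gs_base);
        cache->gs_base = gs_base;
    }
}
```

The cache is only valid if it starts out equal to the field contents, so this sketch assumes both the cache and the fields begin at zero.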

KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible · 8f21a0bb

Committed by Sean Christopherson on Jul 23, 2018

On a 64-bit host, FS.sel and GS.sel are all but guaranteed to be 0,
which in turn means they'll rarely change.  Skip the VMWRITE for the
associated VMCS fields when loading host state if the selector hasn't
changed since the last VMWRITE.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

8f21a0bb

KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup · f3bbc0dc

Committed by Sean Christopherson on Jul 23, 2018

The HOST_{FS,GS}_BASE fields are guaranteed to be written prior to
VMENTER, by way of vmx_prepare_switch_to_guest().  Initialize the
fields to zero for 64-bit kernels instead of pulling the base values
from their respective MSRs.  In addition to eliminating two RDMSRs,
vmx_prepare_switch_to_guest() can safely assume the initial value of
the fields is zero in all cases.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

f3bbc0dc

KVM: vmx: move struct host_state usage to struct loaded_vmcs · d7ee039e

Committed by Sean Christopherson on Jul 23, 2018

Make host_state a property of a loaded_vmcs so that it can be
used as a cache of the VMCS fields, e.g. to lazily VMWRITE the
corresponding VMCS field.  Treating host_state as a cache does
not work if it's not VMCS specific as the cache would become
incoherent when switching between vmcs01 and vmcs02.

Move vmcs_host_cr3 and vmcs_host_cr4 into host_state.

Explicitly zero out host_state when allocating a new VMCS for a
loaded_vmcs.  Unlike the pre-existing vmcs_host_cr{3,4} usage,
the segment information is not guaranteed to be (re)initialized
when running a new nested VMCS, e.g. HOST_FS_BASE is not written
in vmx_set_constant_host_state().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

d7ee039e

KVM: vmx: compute need to reload FS/GS/LDT on demand · e920de85

Committed by Sean Christopherson on Jul 23, 2018

Remove fs_reload_needed and gs_ldt_reload_needed from host_state
and instead compute whether we need to reload various state at
the time we actually do the reload.  The state that is tracked
by the *_reload_needed variables is not any more volatile than
the trackers themselves.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

e920de85

KVM: nVMX: remove a misleading comment regarding vmcs02 fields · fd1ec772

Committed by Sean Christopherson on Jul 23, 2018

prepare_vmcs02() has an odd comment that says certain fields are
"not in vmcs02".  AFAICT the intent of the comment is to document
that various VMCS fields are not handled by prepare_vmcs02(),
e.g. HOST_{FS,GS}_{BASE,SELECTOR}.  While technically true, the
comment is misleading, e.g. it can lead the reader to think that
KVM never writes those fields to vmcs02.

Remove the comment altogether as the handling of FS and GS is
not specific to nested VMX, and GUEST_PML_INDEX has been written
by prepare_vmcs02() since commit 4e59516a ("kvm: vmx: ensure
VMCS is current while enabling PML").
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

fd1ec772

KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state() · 6d6095bd

Committed by Sean Christopherson on Jul 23, 2018

Now that the vmx_load_host_state() wrapper is gone, i.e. the only
time we call the core functions is when we're actually about to
switch between guest/host, rename the functions that handle lazy
state switching to vmx_prepare_switch_to_{guest,host}_state() to
better document the full extent of their functionality.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

6d6095bd

KVM: vmx: add dedicated utility to access guest's kernel_gs_base · 678e315e

Committed by Sean Christopherson on Jul 23, 2018

When lazy save/restore of MSR_KERNEL_GS_BASE was introduced[1], the
MSR was intercepted in all modes and was only restored for the host
when the guest is in 64-bit mode.  So at the time, going through the
full host restore prior to accessing MSR_KERNEL_GS_BASE was necessary
to load host state and was not a significant waste of cycles.

Later, MSR_KERNEL_GS_BASE interception was disabled for a 64-bit
guest[2], and then unconditionally saved/restored for the host[3].
As a result, loading full host state is overkill for accesses to
MSR_KERNEL_GS_BASE, and completely unnecessary when the guest is
not in 64-bit mode.

Add a dedicated utility to read/write the guest's MSR_KERNEL_GS_BASE
(outside of the save/restore flow) to minimize the overhead incurred
when accessing the MSR.  When setting EFER, only decache the MSR if
the new EFER will disable long mode.

Removing out-of-band usage of vmx_load_host_state() also eliminates,
or at least reduces, potential corner cases in its usage, which in
turn will (hopefully) make it easier to reason about future changes
to the save/restore flow, e.g. optimization of saving host state.

[1] commit 44ea2b17 ("KVM: VMX: Move MSR_KERNEL_GS_BASE out of the vmx
                                    autoload msr area")
[2] commit 5897297b ("KVM: VMX: Don't intercept MSR_KERNEL_GS_BASE")
[3] commit c8770e7b ("KVM: VMX: Fix host userspace gsbase corruption")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

678e315e

KVM: vmx: track host_state.loaded using a loaded_vmcs pointer · bd9966de

Committed by Sean Christopherson on Jul 23, 2018

Using 'struct loaded_vmcs*' to track whether the CPU registers
contain host or guest state kills two birds with one stone.

  1. The (effective) boolean host_state.loaded is poorly named.
     It does not track whether or not host state is loaded into
     the CPU registers (which most readers would expect), but
     rather tracks if host state has been saved AND guest state
     is loaded.

  2. Using a loaded_vmcs pointer provides a more robust framework
     for the optimized guest/host state switching, especially when
     considering per-VMCS enhancements.  To that end, WARN_ONCE
     if we try to switch to host state with a different VMCS than
     was last used to save host state.

Resolve an occurrence of the new WARN by setting loaded_vmcs after
the call to vmx_vcpu_put() in vmx_switch_vmcs().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

bd9966de

KVM: vmx: refactor segmentation code in vmx_save_host_state() · e368b875

Committed by Sean Christopherson on Jul 23, 2018

Use local variables in vmx_save_host_state() to temporarily track
the selector and base values for FS and GS, and reorganize the
code so that the 64-bit vs 32-bit portions are contained within
a single #ifdef.  This refactoring paves the way for future patches
to modify the updating of VMCS state with minimal changes to the
code, and (hopefully) simplifies resolving a likely conflict with
another in-flight patch[1] by being the whipping boy for future
patches.

[1] https://www.spinics.net/lists/kvm/msg171647.html
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

e368b875

kvm: nVMX: Fix fault priority for VMX operations · e49fcb8b

Committed by Jim Mattson on Jul 27, 2018

When checking emulated VMX instructions for faults, the #UD for "IF
(not in VMX operation)" should take precedence over the #GP for "ELSIF
CPL > 0."
Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

e49fcb8b
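The ordering described in e49fcb8b, together with the #GP vector from 36090bf4, can be written as a two-step check. The enum and function names are hypothetical; only the order of the tests is taken from the commit messages.

```c
#include <stdbool.h>

/* Hypothetical result type for the permission check. */
enum vmx_fault_sketch {
    VMX_FAULT_NONE,
    VMX_FAULT_UD,  /* invalid opcode */
    VMX_FAULT_GP   /* general protection */
};

/*
 * Fault to raise for an emulated VMX instruction. The "not in VMX
 * operation" test comes first, so its #UD wins over the #GP that a
 * non-zero CPL would otherwise produce.
 */
enum vmx_fault_sketch vmx_instruction_fault(bool in_vmx_operation,
                                            unsigned int cpl)
{
    if (!in_vmx_operation)
        return VMX_FAULT_UD;

    if (cpl > 0)
        return VMX_FAULT_GP;

    return VMX_FAULT_NONE;
}
```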

kvm: nVMX: Fix fault vector for VMX operation at CPL > 0 · 36090bf4

Committed by Jim Mattson on Jul 27, 2018

The fault that should be raised for a privilege level violation is #GP
rather than #UD.

Fixes: 727ba748 ("kvm: nVMX: Enforce cpl=0 for VMX instructions")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

36090bf4

KVM: vmx: Add tlb_remote_flush callback support · 877ad952

Committed by Tianyu Lan on Jul 19, 2018

Register the tlb_remote_flush callback for VMX when the Hyper-V capability
of nested guest mapping flush is detected. The interface can help
reduce overhead when flushing the EPT table among vCPUs for a nested VM. The
traditional way is to send IPIs to all affected vCPUs and execute
INVEPT on each vCPU, which triggers several vmexits for IPI
and INVEPT emulation. Hyper-V provides a hypercall to do the
flush for all vCPUs; call the hypercall when all EPT table
pointers of a single VM are the same.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

877ad952
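The condition in 877ad952 for using the hypercall is that every vCPU of the VM has the same EPT pointer. A minimal sketch of that comparison follows; the function name and the flat array of pointers are assumptions made for illustration, not the kernel's data layout.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return true when all vCPUs share one EPT pointer, which is the case
 * where a single flush request can cover the whole VM. With no vCPUs
 * there is nothing to flush, so report false.
 */
bool ept_pointers_match(const uint64_t *eptp, size_t nr_vcpus)
{
    if (nr_vcpus == 0)
        return false;

    for (size_t i = 1; i < nr_vcpus; i++) {
        if (eptp[i] != eptp[0])
            return false;
    }
    return true;
}
```

When the pointers differ, the caller would fall back to the IPI plus INVEPT path described in the message.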

kvm: x86: Add multi-entry LRU cache for previous CR3s · b94742c9

Committed by Junaid Shahid on Jun 27, 2018

Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

b94742c9
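One way to keep several previous CR3/root_hpa pairs in LRU order, as b94742c9 describes, is to rotate the current root through the array until a match is found. The sketch below is an illustration of that idea; the type names, the cache size of 3, and the `INVALID_HPA` marker are assumptions, and real root allocation and freeing are left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PREV_ROOTS 3          /* assumed cache size */
#define INVALID_HPA    UINT64_MAX /* marks an empty slot */

struct cached_root {
    uint64_t cr3;
    uint64_t hpa;
};

struct mmu_sketch {
    struct cached_root cur;                   /* root in use */
    struct cached_root prev[NUM_PREV_ROOTS];  /* most recent first */
};

/*
 * Try to switch to new_cr3 using a cached root. Each step pushes the
 * carried root into the slot and picks up the slot's old entry, so the
 * outgoing root lands in prev[0] and older entries shift down.
 * On a miss the oldest entry is dropped and cur.hpa is left invalid so
 * the caller knows it must build a new root.
 */
bool fast_cr3_switch(struct mmu_sketch *mmu, uint64_t new_cr3)
{
    struct cached_root root = mmu->cur;
    size_t i;

    for (i = 0; i < NUM_PREV_ROOTS; i++) {
        struct cached_root tmp = mmu->prev[i];

        mmu->prev[i] = root;
        root = tmp;
        if (root.hpa != INVALID_HPA && root.cr3 == new_cr3)
            break;
    }

    if (i == NUM_PREV_ROOTS) {
        mmu->cur.cr3 = new_cr3;
        mmu->cur.hpa = INVALID_HPA;
        return false;
    }

    mmu->cur = root;
    return true;
}
```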

kvm: x86: Flush only affected TLB entries in kvm_mmu_invlpg* · faff8758

Committed by Junaid Shahid on Jun 29, 2018

kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

faff8758

kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest · 956bf353

Committed by Junaid Shahid on Jun 27, 2018

When the guest indicates that the TLB doesn't need to be flushed in a
CR3 switch, we can also skip resyncing the shadow page tables since an
out-of-sync shadow page table is equivalent to an out-of-sync TLB.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

956bf353

kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest · ade61e28

Committed by Junaid Shahid on Jun 27, 2018

When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.

This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

ade61e28
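The MSb convention that ade61e28 relies on can be decoded as shown below. This is a sketch with hypothetical names; it assumes the no-flush hint is bit 63 of the MOV-to-CR3 source operand and that the hint is only meaningful when PCIDs are enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Most significant bit of the MOV-to-CR3 source operand. */
#define CR3_PCID_NOFLUSH (UINT64_C(1) << 63)

struct cr3_write_sketch {
    uint64_t cr3;         /* value to load, with the hint bit removed */
    bool skip_tlb_flush;  /* guest said the TLB need not be flushed */
};

/*
 * Split a MOV-to-CR3 operand into the CR3 value and the no-flush hint.
 * Without PCIDs the operand is passed through untouched and no hint
 * is reported.
 */
struct cr3_write_sketch decode_cr3_write(uint64_t operand, bool pcid_enabled)
{
    struct cr3_write_sketch w = { operand, false };

    if (pcid_enabled) {
        w.skip_tlb_flush = (operand & CR3_PCID_NOFLUSH) != 0;
        w.cr3 = operand & ~CR3_PCID_NOFLUSH;
    }
    return w;
}
```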

openanolis / cloud-kernel synced successfully 11 months ago
