- 03 May 2018, 2 commits
-
-
By Konrad Rzeszutek Wilk
Expose the CPUID.7.EDX[31] bit to the guest, and also guard against various combinations of SPEC_CTRL MSR values. The handling of the MSR (to take into account the host value of SPEC_CTRL Bit(2)) is taken care of in the patch: "KVM/SVM/VMX/x86/spectre_v2: Support the combination of guest and host IBRS". Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Ingo Molnar <mingo@kernel.org>
-
By Konrad Rzeszutek Wilk
A guest may modify the SPEC_CTRL MSR from the value used by the kernel. Since the kernel doesn't use IBRS, this means a value of zero is what is needed in the host. But 336996-Speculative-Execution-Side-Channel-Mitigations.pdf refers to the other bits as reserved, so the kernel should respect the boot-time SPEC_CTRL value and use that. This allows us to deal with future extensions to the SPEC_CTRL interface, if any. Note: This uses wrmsrl() instead of native_wrmsrl(). It does not make any difference, as paravirt will overwrite the callq *0xfff.. with the wrmsrl assembler code. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Borislav Petkov <bp@suse.de> Reviewed-by: Ingo Molnar <mingo@kernel.org>
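As a rough illustration of the approach (a minimal sketch, not the actual patch; host_spec_ctrl_base and the helper name are placeholders for wherever the kernel stashes the SPEC_CTRL value read at boot), restoring the host's boot-time value on VMEXIT rather than writing zero looks roughly like this:

    /* Sketch only: host_spec_ctrl_base stands in for the saved boot-time value. */
    static u64 host_spec_ctrl_base;

    static void restore_host_spec_ctrl(u64 guest_spec_ctrl)
    {
        /* Skip the MSR write when the guest never diverged from the host value. */
        if (guest_spec_ctrl != host_spec_ctrl_base)
            wrmsrl(MSR_IA32_SPEC_CTRL, host_spec_ctrl_base);
    }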
-
- 28 April 2018, 1 commit
-
-
By KarimAllah Ahmed
Move the DISABLE_EXITS KVM capability bits to the UAPI just like the rest of the capabilities. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
- 27 April 2018, 1 commit
-
-
By Junaid Shahid
Currently, KVM flushes the TLB after a change to the APIC access page address or the APIC mode when EPT mode is enabled. However, even in shadow paging mode, a TLB flush is needed if VPIDs are being used, as specified in Intel SDM Section 29.4.5. So replace vmx_flush_tlb_ept_only() with vmx_flush_tlb(), which will flush if either EPT or VPIDs are in use. Signed-off-by: Junaid Shahid <junaids@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
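A minimal sketch of the intended behaviour (the invalidation helpers below are illustrative placeholders, not the actual VMX functions):

    static void flush_tlb_sketch(struct kvm_vcpu *vcpu)
    {
        if (enable_ept) {
            /* invalidate guest-physical mappings for this vCPU's EPT pointer */
            invalidate_ept_mappings(vcpu);
        } else if (enable_vpid) {
            /* invalidate linear mappings tagged with this vCPU's VPID */
            invalidate_vpid_mappings(vcpu);
        }
        /* with neither EPT nor VPID, VM entry/exit already flushes the TLB */
    }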
-
- 16 April 2018, 2 commits
-
-
By Paolo Bonzini
This is not specific to Intel/AMD anymore. The TSC offset is available in vcpu->arch.tsc_offset. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By KarimAllah Ahmed
Update 'tsc_offset' on vmentry/vmexit of L2 guests to ensure that it always captures the TSC_OFFSET of the running guest, whether it is the L1 or the L2 guest. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> [AMD changes, fix update_ia32_tsc_adjust_msr. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 13 April 2018, 1 commit
-
-
By Krish Sadhukhan
According to the sub-section titled 'VM-Execution Control Fields' in the section titled 'Basic VM-Entry Checks' in Intel SDM vol. 3C, the following vmentry check must be enforced: If the 'virtualize APIC-accesses' VM-execution control is 1, the APIC-access address must satisfy the following checks: - Bits 11:0 of the address must be 0. - The address should not set any bits beyond the processor's physical-address width. This patch adds the necessary check to conform to this rule. If the check fails, we cause the L2 VMENTRY to fail, which is what the associated unit test (following patch) expects. Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
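A self-contained model of the two checks described above (in KVM the physical-address width would come from CPUID, e.g. via cpuid_maxphyaddr(); this helper is purely illustrative):

    #include <stdbool.h>
    #include <stdint.h>

    static bool apic_access_addr_valid(uint64_t addr, unsigned int maxphyaddr)
    {
        if (addr & 0xfffULL)        /* bits 11:0 must be zero (4K aligned) */
            return false;
        if (addr >> maxphyaddr)     /* no bits beyond the physical-address width */
            return false;
        return true;
    }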
-
- 11 April 2018, 2 commits
-
-
By hu huajun
In arch/x86/kvm/trace.h, this function is declared with host_irq as the first argument and vcpu_id as the second, not the other way around. Signed-off-by: hu huajun <huhuajun@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By KarimAllah Ahmed
If the processor does not have an "Always Running APIC Timer" (aka ARAT), we should not give guests direct access to MWAIT. The LAPIC timer would stop ticking in deep C-states, so any host deadlines would not wake up the host kernel. The host kernel intel_idle driver handles this by switching to broadcast mode when ARAT is not available and MWAIT is issued with a deep C-state that would stop the LAPIC timer. When MWAIT is passed through, we cannot tell when MWAIT is issued. So just disable this capability when LAPIC ARAT is not available. I am not even sure whether there are any CPUs with VMX support but no LAPIC ARAT. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Reported-by: Wanpeng Li <kernellwp@gmail.com> Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
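A sketch of the gating logic (X86_FEATURE_ARAT is the real feature flag; the helper name and the exact set of conditions here are illustrative, not the actual KVM predicate):

    static bool kvm_can_disable_mwait_exits(void)
    {
        /*
         * Without an Always Running APIC Timer, MWAIT executed by the guest
         * in a deep C-state would stop the host's LAPIC timer, so only offer
         * MWAIT pass-through when ARAT is present.
         */
        return boot_cpu_has(X86_FEATURE_ARAT);
    }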
-
- 10 April 2018, 1 commit
-
-
By KarimAllah Ahmed
The VMX-preemption timer is used by KVM as a way to set deadlines for the guest (i.e. timer emulation). That was safe until very recently, when the capability KVM_X86_DISABLE_EXITS_MWAIT to disable intercepting MWAIT was introduced. According to Intel SDM 25.5.1: """ The VMX-preemption timer operates in the C-states C0, C1, and C2; it also operates in the shutdown and wait-for-SIPI states. If the timer counts down to zero in any state other than the wait-for SIPI state, the logical processor transitions to the C0 C-state and causes a VM exit; the timer does not cause a VM exit if it counts down to zero in the wait-for-SIPI state. The timer is not decremented in C-states deeper than C2. """ Now, once the guest issues MWAIT with a C-state deeper than C2, the preemption timer will never wake it up again since it has stopped ticking! Usually this is compensated for by other activities in the system that would wake the core from the deep C-state (and cause a VMExit), for example if the host itself is ticking or it receives interrupts, etc. So disable the VMX-preemption timer if MWAIT is exposed to the guest! Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Fixes: 4d5422ce Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
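A minimal sketch of the resulting decision (assuming a kvm_mwait_in_guest()-style predicate for "MWAIT exits are disabled for this VM"; the wrapper itself is illustrative):

    static bool vmx_can_use_preemption_timer(struct kvm *kvm)
    {
        /*
         * If MWAIT is passed through, the guest may park in a C-state deeper
         * than C2 where the preemption timer stops counting, so fall back to
         * hrtimer-based deadline emulation instead.
         */
        return !kvm_mwait_in_guest(kvm);
    }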
-
- 07 April 2018, 1 commit
-
-
By Peng Hao
Make the function static to avoid a warning: no previous prototype for ‘vmx_enable_tdp’. Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 05 April 2018, 7 commits
-
-
By Peng Hao
Fix a "warning: no previous prototype". Cc: stable@vger.kernel.org Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Wanpeng Li
There is no easy way to force KVM to run an instruction through the emulator (by design, as that would expose the x86 emulator as a significant attack surface). However, we do wish to expose the x86 emulator in case we are testing it (e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix" that is designed to raise #UD, which KVM will trap; its #UD exit handler will match the "force emulation prefix" and run the instruction after the prefix through the x86 emulator. To avoid exposing the x86 emulator by default, we add a module parameter that is off by default. A simple testcase:

    #include <stdio.h>
    #include <string.h>

    #define HYPERVISOR_INFO 0x40000000

    #define CPUID(idx, eax, ebx, ecx, edx) \
        asm volatile (\
        "ud2a; .ascii \"kvm\"; cpuid" \
        :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
        :"0"(idx) );

    void main()
    {
        unsigned int eax, ebx, ecx, edx;
        char string[13];

        CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
        *(unsigned int *)(string + 0) = ebx;
        *(unsigned int *)(string + 4) = ecx;
        *(unsigned int *)(string + 8) = edx;
        string[12] = 0;

        if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
            printf("kvm guest\n");
        else
            printf("bare hardware\n");
    }

Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> [Correctly handle usermode exits. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Wanpeng Li
Introduce handle_ud() to handle invalid opcode; this function will be used by later patches. Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
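A rough sketch of what such a #UD handler looks like once the force-emulation prefix from the previous entry is wired in (illustrative only; the guest-read, RIP-skip and emulator-entry helpers below are placeholders, not the real KVM functions):

    static int handle_ud_sketch(struct kvm_vcpu *vcpu)
    {
        /* "ud2a; .ascii \"kvm\"" == opcode 0x0f 0x0b followed by "kvm" */
        static const u8 sig[] = { 0x0f, 0x0b, 'k', 'v', 'm' };
        u8 buf[sizeof(sig)];

        if (force_emulation_prefix &&
            read_guest_at_rip(vcpu, buf, sizeof(buf)) == 0 &&
            memcmp(buf, sig, sizeof(sig)) == 0) {
            skip_guest_bytes(vcpu, sizeof(sig));  /* step over the prefix */
            return run_through_emulator(vcpu);    /* emulate the next instruction */
        }

        queue_ud_exception(vcpu);  /* default: re-inject #UD into the guest */
        return 1;
    }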
-
By Paolo Bonzini
vmx_save_host_state has multiple #ifdefs for CONFIG_X86_64 that have no other code between them. Simplify by reducing them to a single conditional. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Arnd Bergmann
The local variable was newly introduced but is only accessed in one place on x86_64, and not at all on 32-bit:
 arch/x86/kvm/vmx.c: In function 'vmx_save_host_state':
 arch/x86/kvm/vmx.c:2175:6: error: unused variable 'cpu' [-Werror=unused-variable]
This puts it into another #ifdef. Fixes: 35060ed6 ("x86/kvm/vmx: avoid expensive rdmsr for MSR_GS_BASE") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Remove the WARN_ON in handle_ept_misconfig() as it is unnecessary and causes false positives. Return the unmodified result of kvm_mmu_page_fault() instead of converting a system error code to KVM_EXIT_UNKNOWN, so that userspace sees the error code of the actual failure, not a generic "we don't know what went wrong".
 * kvm_mmu_page_fault() will WARN if reserved bits are set in the SPTEs, i.e. it covers the case where an EPT misconfig occurred because of a KVM bug.
 * The WARN_ON will fire on any system error code that is hit while handling the fault, e.g. -ENOMEM from mmu_topup_memory_caches() while handling a legitimate MMIO EPT misconfig, or -EFAULT from kvm_handle_bad_page() if the corresponding HVA is invalid.
In either case, userspace should receive the original error code, and firing a warning is incorrect behavior as KVM is operating as designed. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
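A sketch of the simplified flow (the exit-qualification helper is a placeholder; the point is that the return value of kvm_mmu_page_fault(), including negative error codes, is passed straight back):

    static int handle_ept_misconfig_sketch(struct kvm_vcpu *vcpu)
    {
        gpa_t gpa = read_exit_gpa(vcpu);  /* placeholder for reading the faulting GPA */

        /* no WARN_ON, no translation to KVM_EXIT_UNKNOWN */
        return kvm_mmu_page_fault(vcpu, gpa, PFERR_RSVD_MASK, NULL, 0);
    }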
-
By Sean Christopherson
The bug that led to commit 95e057e2 was a benign warning (no adverse effects other than the warning itself) that was detected by syzkaller. Further inspection shows that the WARN_ON in question, in handle_ept_misconfig(), is unnecessary and flawed (this was also briefly discussed in the original patch: https://patchwork.kernel.org/patch/10204649).
 * The WARN_ON is unnecessary as kvm_mmu_page_fault() will WARN if reserved bits are set in the SPTEs, i.e. it covers the case where an EPT misconfig occurred because of a KVM bug.
 * The WARN_ON is flawed because it will fire on any system error code that is hit while handling the fault, e.g. -ENOMEM can be returned by mmu_topup_memory_caches() while handling a legitimate MMIO EPT misconfig.
The original behavior of returning -EFAULT when userspace munmaps an HVA without first removing the memslot is correct and desirable, i.e. KVM is letting userspace know it has generated a bad address. Returning RET_PF_EMULATE masks the WARN_ON in the EPT misconfig path, but does not fix the underlying bug, i.e. the WARN_ON is bogus. Furthermore, returning RET_PF_EMULATE has the unwanted side effect of causing KVM to attempt to emulate an instruction on any page fault with an invalid HVA translation, e.g. a not-present EPT violation on a VM_PFNMAP VMA whose fault handler failed to insert a PFN.
 * There is no guarantee that the fault is directly related to the instruction, i.e. the fault could have been triggered by a side-effect memory access in the guest, e.g. while vectoring a #DB or writing a tracing record. This could cause KVM to effectively mask the fault if KVM doesn't model the behavior leading to the fault, i.e. emulation could succeed and resume the guest.
 * If emulation does fail, KVM will return EMULATION_FAILED instead of -EFAULT, which is a red herring as the user will either debug a bogus emulation attempt or scratch their head wondering why we were attempting emulation in the first place.
TL;DR: revert to returning -EFAULT and remove the bogus WARN_ON in handle_ept_misconfig in a future patch. This reverts commit 95e057e2. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 04 April 2018, 2 commits
-
-
By Stefan Fritsch
This is very similar to the aligned versions movaps/movapd. We have seen the corresponding emulation failures with OpenBSD as a guest and with Windows 10 with Intel HD graphics pass-through. Signed-off-by: Christian Ehrhardt <christian_ehrhardt@genua.de> Signed-off-by: Stefan Fritsch <sf@sfritsch.de> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Exit to userspace with KVM_INTERNAL_ERROR_EMULATION if we encounter an exception in Protected Mode while emulating the guest due to invalid guest state. Unlike Big RM, KVM doesn't support emulating exceptions in PM, i.e. PM exceptions are always injected via the VMCS. Because we will never do VMRESUME due to emulation_required, the exception is never realized and we'll keep emulating the faulting instruction over and over until we receive a signal. Exit to userspace iff there is a pending exception, i.e. don't exit simply on a requested event. The purpose of this check and exit is to aid in debugging a guest that is in all likelihood already doomed. Invalid guest state in PM is extremely limited in normal operation, e.g. it generally only occurs for a few instructions early in BIOS, and any exception at this time is all but guaranteed to be fatal. Non-vectored interrupts, e.g. INIT, SIPI and SMI, can be cleanly handled/emulated, while checking for vectored interrupts, e.g. INTR and NMI, without hitting false positives would add a fair amount of complexity for almost no benefit (getting hit by lightning seems more likely than encountering this specific scenario). Add a WARN_ON_ONCE to vmx_queue_exception() if we try to inject an exception via the VMCS while emulation_required is true. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
- 29 March 2018, 12 commits
-
-
By Liran Alon
When a vCPU runs L2 and there is a pending event that requires an exit from L2 to L1 and nested_run_pending=1, vcpu_enter_guest() will request an immediate exit from L2 (see req_immediate_exit). Since handling of req_immediate_exit now also makes sure to set KVM_REQ_EVENT, there is no need to also set it in vmx_vcpu_run() when nested_run_pending=1. This optimizes cases where VMRESUME was executed by L1 to enter L2 and there are no pending events that require an exit from L2 to L1. Previously, this would have set KVM_REQ_EVENT unnecessarily. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Liran Alon
In case of an L2 VMExit to L0 during event-delivery, VMCS02 is filled with IDT-vectoring-info, which vmx_complete_interrupts() makes sure to reinject before the next resume of L2. While handling the VMExit in L0, an IPI could be sent by another L1 vCPU to the L1 vCPU which currently runs L2 and exited to L0. When L0 reaches vcpu_enter_guest() and calls inject_pending_event(), it will note that a previous event was re-injected to L2 (by IDT-vectoring-info) and therefore won't check if there are pending L1 events which require an exit from L2 to L1. Thus, L0 enters L2 without an immediate VMExit even though there are pending L1 events! This commit fixes the issue by making sure to check for L1 pending events even if a previous event was reinjected to L2, and by bailing out from inject_pending_event() before evaluating a new pending event in case an event was already reinjected. The bug was observed with the following setup:
 * L0 is a 64CPU machine which runs KVM.
 * L1 is a 16CPU machine which runs KVM.
 * L0 & L1 run with APICv disabled. (Also reproduced with APICv enabled, but it is easier to analyze the info below with APICv disabled.)
 * L1 runs a 16CPU L2 Windows Server 2012 R2 guest.
During L2 boot, L1 hangs completely, and analyzing the hang reveals that one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI that it has sent to another L1 vCPU, while all other L1 vCPUs are attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever (as L1 runs with kernel preemption disabled). Observing /sys/kernel/debug/tracing/trace_pipe reveals the following series of events:
 (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182 ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
 (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f vec 252 (Fixed|edge)
 (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
 (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
 (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION rip 0xffffe00069202690 info 83 0
 (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip: 0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
 (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason: EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
 (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
Which can be analyzed as follows:
 (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2. Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject a pending interrupt of 0xd2.
 (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination vCPU 15. This sets the relevant bit in the LAPIC's IRR and sets KVM_REQ_EVENT.
 (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event(), which notes that interrupt 0xd2 was reinjected and therefore calls vmx_inject_irq() and returns, without checking for pending L1 events! Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest() before calling inject_pending_event().
 (4) L0 resumes L2 without an immediate exit even though there is a pending L1 event (the IPI pending in the LAPIC's IRR). We have already reached the buggy scenario, but the events can be analyzed further:
 (5+6) L2 VMExit to L0 on EPT_VIOLATION, this time not during event-delivery.
 (7) L0 decides to forward the VMExit to L1 for further handling.
 (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the LAPIC's IRR is not examined and therefore the IPI is still not delivered into L1!
Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Liran Alon
The reason that exception.pending should block re-injection of an NMI/interrupt is not described correctly in the comment in the code. Instead, it describes why a pending exception should be injected before a pending NMI/interrupt. Therefore, move the currently present comment to the code block evaluating a new pending event, which explains why exception.pending is evaluated first. In addition, create a new comment describing that exception.pending blocks re-injection of an NMI/interrupt because the exception was queued by handling a vmexit which was due to NMI/interrupt delivery. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com> [Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Liran Alon
For exception and NMI events, KVM code uses the following coding convention:
 *) "pending" represents an event that should be injected to the guest at some point but whose side-effects have not yet occurred.
 *) "injected" represents an event whose side-effects have already occurred.
However, interrupts don't conform to this coding convention. All current code flows mark interrupt.pending when its side-effects have already taken place (for example, a bit moved from LAPIC IRR to ISR). Therefore, it makes sense to just rename interrupt.pending to interrupt.injected. This change follows the logic of previous commit 664f8e26 ("KVM: X86: Fix loss of exception which has not yet been injected") which changed exceptions to follow this coding convention as well. It is important to note that in the case of !lapic_in_kernel(vcpu), interrupt.pending usage was and still is incorrect. In this case, interrupt.pending can only be set using one of the following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that QEMU uses them either to re-set an "interrupt.pending" state it has received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or via KVM_GET_SREGS interrupt_bitmap) or to dispatch a new interrupt from QEMU's emulated LAPIC, which resets the bit in IRR and sets the bit in ISR before sending the ioctl to KVM. So it seems that indeed "interrupt.pending" in this case is also supposed to represent "interrupt.injected". However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr() misuse the (now named) interrupt.injected in order to return whether there is a pending interrupt. This leads to nVMX/nSVM not being able to distinguish whether it should exit from L2 to L1 on EXTERNAL_INTERRUPT because of a pending interrupt or should re-inject an injected interrupt. Therefore, add a FIXME at these functions for handling this issue. This patch introduces no semantic change. Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Liran Alon
kvm_inject_realmode_interrupt() is called from the injection functions which write event-injection to the VMCS: vmx_queue_exception(), vmx_inject_irq() and vmx_inject_nmi(). All these functions are called just to cause an event-injection to the guest; they are not responsible for manipulating the event-pending flag. The only purpose of kvm_inject_realmode_interrupt() should be to emulate real-mode interrupt injection. This was also incorrect when called from vmx_queue_exception(). Signed-off-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Vitaly Kuznetsov
Enlightened VMCS is just a structure in memory; the main benefit besides avoiding somewhat slower VMREAD/VMWRITE is using the clean field mask: we tell the underlying hypervisor which fields were modified since the VMEXIT, so there's no need to inspect them all. A tight CPUID loop test shows a significant speedup:
 Before: 18890 cycles
 After: 8304 cycles
A static key is used to avoid a performance penalty for non-Hyper-V deployments. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Ladi Prosek
The assist page has been used only for the paravirtual EOI so far, hence the "APIC" in the MSR name. Rename it to match the Hyper-V TLFS, where it's called "Virtual VP Assist MSR". Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Babu Moger
Bring the PLE (pause loop exit) logic to the AMD SVM driver. While testing, we found this helping in situations where numerous pauses are generated. Without these patches we could see continuous VMEXITs due to pause interceptions. Tested on an AMD EPYC server with boot parameter idle=poll on a VM with 32 vcpus to simulate extensive pause behaviour. Here are the VMEXIT counts over a 10-second interval (without / with the patches):
 Pauses  810199  504
 Total   882184  325415
Signed-off-by: Babu Moger <babu.moger@amd.com> [Prevented the window from dropping below the initial value. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Babu Moger
This patch adds support for the pause filtering threshold. Support for this feature is indicated by CPUID Fn8000_000A_EDX; see AMD APM Vol 2, Section 15.14.4 "Pause Intercept Filtering" for more details. In this mode, a 16-bit pause filter threshold field is added to the VMCB. The threshold value is a cycle count that is used to reset the pause counter. As with simple pause filtering, VMRUN loads the pause count value from the VMCB into an internal counter. Then, on each pause instruction the hardware checks the elapsed number of cycles since the most recent pause instruction against the pause filter threshold. If the elapsed cycle count is greater than the pause filter threshold, then the internal pause count is reloaded from the VMCB and execution continues. If the elapsed cycle count is less than the pause filter threshold, then the internal pause count is decremented. If the count value is less than zero and pause intercept is enabled, a #VMEXIT is triggered. If advanced pause filtering is supported and the pause filter threshold field is set to zero, the filter operates in the simpler, count-only mode. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
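A stand-alone model of the behaviour described above (an illustration of the algorithm as worded in the commit, not a hardware-exact description):

    struct pause_filter {
        unsigned int count;        /* internal counter, loaded from the VMCB */
        unsigned int count_reload; /* VMCB pause filter count */
        unsigned int threshold;    /* VMCB pause filter threshold, in cycles */
    };

    /* Returns true if this PAUSE should trigger a #VMEXIT (intercept enabled). */
    static bool pause_filter_tick(struct pause_filter *pf,
                                  unsigned int cycles_since_last_pause)
    {
        if (pf->threshold && cycles_since_last_pause > pf->threshold) {
            pf->count = pf->count_reload;  /* spaced-out pauses: reset and continue */
            return false;
        }
        /* threshold == 0 (count-only mode) or a tight pause loop: count down */
        if (pf->count == 0)
            return true;                   /* counter exhausted: #VMEXIT */
        pf->count--;
        return false;
    }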
-
By Babu Moger
This patch brings some of the code from vmx to the x86.h header file, so that we can share this code between vmx and svm. Modified a couple of functions to make them common. Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Babu Moger
Get rid of ple_window_actual_max, because its benefits are really minuscule and the logic is complicated. The overflows (and underflow) are controlled in __ple_window_grow and _ple_window_shrink respectively. Suggested-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Babu Moger <babu.moger@amd.com> [Fixed potential wraparound and changed the max to UINT_MAX. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
By Babu Moger
The vmx module parameters are supposed to be unsigned variants. Also fixed checkpatch errors like the one below:
 WARNING: Symbolic permissions 'S_IRUGO' are not preferred. Consider using octal permissions '0444'.
 +module_param(ple_gap, uint, S_IRUGO);
Signed-off-by: Babu Moger <babu.moger@amd.com> [Expanded uint to unsigned int in code. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
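For reference, the checkpatch-preferred form of the flagged line is simply the octal equivalent of S_IRUGO:

    module_param(ple_gap, uint, 0444);   /* 0444 == S_IRUGO, world-readable */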
-
- 28 March 2018, 1 commit
-
-
By Andi Kleen
KVM and perf have a special backdoor mechanism to report the IP for interrupts re-executed after vm exit. This works for the NMIs that perf normally uses. However, when perf is in timer mode it doesn't work because the timer interrupt doesn't get this special treatment. This is common when KVM is running nested in another hypervisor which may not implement the PMU, so only timer mode is available. Call the functions to set up the backdoor IP also for non-NMI interrupts. I renamed the functions that set up the backdoor IP reporting to be more appropriate for their new use. The SVM change is only compile tested. v2: Moved the functions inline. For the normal interrupt case the before/after functions are now called from x86.c, not arch specific code. For the NMI case we still need to call them in the architecture specific code, because they are already needed in the low level *_run functions. Signed-off-by: Andi Kleen <ak@linux.intel.com> [Removed unnecessary calls from arch handle_external_intr. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
-
- 24 March 2018, 6 commits
-
-
By Dan Carpenter
"rep_done" is always zero, so the "(((u64)rep_done & 0xfff) << 32)" expression is just zero. We can remove the "res" temporary variable as well and just use "ret" directly. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Add struct kvm_svm, which is analogous to struct vcpu_svm, along with a helper to_kvm_svm() to retrieve kvm_svm from a struct kvm *. Move the SVM specific variables and struct definitions out of kvm_arch and into kvm_svm. Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Add struct kvm_vmx, which wraps struct kvm, and a helper to_kvm_vmx() that retrieves 'struct kvm_vmx *' from 'struct kvm *'. Move the VMX specific variables out of kvm_arch and into kvm_vmx. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
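This and the SVM patch above follow the usual container_of() wrapper pattern; a minimal sketch (the member shown is just one plausible example of per-VM VMX state, echoing the ept_identity_map_addr mentioned in the next entry):

    struct kvm_vmx {
        struct kvm kvm;                 /* the generic VM structure is embedded here */
        gpa_t ept_identity_map_addr;    /* example of VMX-only per-VM state */
        /* ... other VMX specific fields move here from kvm_arch ... */
    };

    static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
    {
        return container_of(kvm, struct kvm_vmx, kvm);
    }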
-
By Sean Christopherson
Add kvm_x86_ops->set_identity_map_addr and set ept_identity_map_addr in VMX specific code, so that ept_identity_map_addr can be moved out of 'struct kvm_arch' in a future patch. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Define kvm_arch_[alloc|free]_vm in x86 as pass-through functions to the new kvm_x86_ops vm_alloc and vm_free, and move the current allocation logic as-is to SVM and VMX. Vendor specific alloc/free functions set the stage for SVM/VMX wrappers of 'struct kvm', which will allow us to move the growing number of SVM/VMX specific member variables out of 'struct kvm_arch'. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
By Sean Christopherson
Segment registers must be synchronized prior to any code that may trigger a call to emulation_required()/guest_state_valid(), e.g. vmx_set_cr0(). Because preparing vmcs02 writes segmentation fields directly, i.e. doesn't use vmx_set_segment(), emulation_required will not be re-evaluated when synchronizing the segment registers, which can result in L0 incorrectly starting emulation of L2. Fixes: 8665c3f9 ("KVM: nVMX: initialize descriptor cache fields in prepare_vmcs02_full") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Move all of prepare_vmcs02_full earlier, not just segment registers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 21 March 2018, 1 commit
-
-
By Paolo Bonzini
Commit 2bb8cafe ("KVM: vVMX: signal failure for nested VMEntry if emulation_required", 2018-03-12) introduces a new error path which does not set *entry_failure_code. Fix that to avoid a leak of the L0 stack to L1. Reported-by: Radim Krčmář <rkrcmar@redhat.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-