提交 · 04140b4144cd888c080cddbb2be2ec603f00d081 · OpenHarmony / kernel_linux

29 3月, 2018 3 次提交

KVM: x86: Rename interrupt.pending to interrupt.injected · 04140b41

由 Liran Alon 提交于 3月 23, 2018

For exceptions & NMIs events, KVM code use the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but it's side-effects have not yet occurred.
*) "injected" represents an event that it's side-effects have already
occurred.

However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when it's side-effects
have already taken place (For example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.

This change follows logic of previous commit 664f8e26 ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.

It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still incorrect.
In this case, interrrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also suppose to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
is misusing (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leads to nVMX/nSVM not be able to distinguish if it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.

This patch introduce no semantics change.
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

04140b41

KVM: VMX: No need to clear pending NMI/interrupt on inject realmode interrupt · 7c5a6a59

由 Liran Alon 提交于 3月 23, 2018

kvm_inject_realmode_interrupt() is called from one of the injection
functions which writes event-injection to VMCS: vmx_queue_exception(),
vmx_inject_irq() and vmx_inject_nmi().

All these functions are called just to cause an event-injection to
guest. They are not responsible of manipulating the event-pending
flag. The only purpose of kvm_inject_realmode_interrupt() should be
to emulate real-mode interrupt-injection.

This was also incorrect when called from vmx_queue_exception().
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

7c5a6a59

x86/kvm: rename HV_X64_MSR_APIC_ASSIST_PAGE to HV_X64_MSR_VP_ASSIST_PAGE · d4abc577

由 Ladi Prosek 提交于 3月 20, 2018

The assist page has been used only for the paravirtual EOI so far, hence
the "APIC" in the MSR name. Renaming to match the Hyper-V TLFS where it's
called "Virtual VP Assist MSR".
Signed-off-by: NLadi Prosek <lprosek@redhat.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: NThomas Gleixner <tglx@linutronix.de>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

d4abc577

28 3月, 2018 1 次提交

KVM: x86: Fix perf timer mode IP reporting · dd60d217

由 Andi Kleen 提交于 7月 25, 2017

KVM and perf have a special backdoor mechanism to report the IP for interrupts
re-executed after vm exit. This works for the NMIs that perf normally uses.

However when perf is in timer mode it doesn't work because the timer interrupt
doesn't get this special treatment. This is common when KVM is running
nested in another hypervisor which may not implement the PMU, so only
timer mode is available.

Call the functions to set up the backdoor IP also for non NMI interrupts.

I renamed the functions to set up the backdoor IP reporting to be more
appropiate for their new use.  The SVM change is only compile tested.

v2: Moved the functions inline.
For the normal interrupt case the before/after functions are now
called from x86.c, not arch specific code.
For the NMI case we still need to call it in the architecture
specific code, because it's already needed in the low level *_run
functions.
Signed-off-by: NAndi Kleen <ak@linux.intel.com>
[Removed unnecessary calls from arch handle_external_intr. - Radim]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

dd60d217

24 3月, 2018 1 次提交

KVM: x86: move setting of ept_identity_map_addr to vmx.c · 2ac52ab8

由 Sean Christopherson 提交于 3月 20, 2018

Add kvm_x86_ops->set_identity_map_addr and set ept_identity_map_addr
in VMX specific code so that ept_identity_map_addr can be moved out
of 'struct kvm_arch' in a future patch.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2ac52ab8

21 3月, 2018 1 次提交

KVM: nVMX: Do not load EOI-exitmap while running L2 · e40ff1d6

由 Liran Alon 提交于 3月 21, 2018

When L1 IOAPIC redirection-table is written, a request of
KVM_REQ_SCAN_IOAPIC is set on all vCPUs. This is done such that
all vCPUs will now recalc their IOAPIC handled vectors and load
it to their EOI-exitmap.

However, it could be that one of the vCPUs is currently running
L2. In this case, load_eoi_exitmap() will be called which would
write to vmcs02->eoi_exit_bitmap, which is wrong because
vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, at this point
KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will
never update vmcs01->eoi_exit_bitmap. This could lead to remote_irr
of some IOAPIC level-triggered entry to remain set forever.

Fix this issue by delaying the load of EOI-exitmap to when vCPU
is running L1.

One may wonder why not just delay entire KVM_REQ_SCAN_IOAPIC
processing to when vCPU is running L1. This is done in order to handle
correctly the case where LAPIC & IO-APIC of L1 is pass-throughed into
L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In
current nVMX implementation, that results in
vmcs02->virtual_interrupt_delivery to also be 0. Thus,
vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI cause
a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or
APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page).
In order for such L2 EOI to be broadcasted, if needed, from LAPIC
to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
while L2 is running. Therefore, patch makes sure to delay only the
loading of EOI-exitmap but not the update of
vcpu->arch.ioapic_handled_vectors.
Reviewed-by: NArbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e40ff1d6

17 3月, 2018 10 次提交

KVM: X86: Provide a capability to disable PAUSE intercepts · b31c114b

由 Wanpeng Li 提交于 3月 12, 2018

Allow to disable pause loop exit/pause filtering on a per VM basis.

If some VMs have dedicated host CPUs, they won't be negatively affected
due to needlessly intercepted PAUSE instructions.

Thanks to Jan H. Schönherr's initial patch.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b31c114b

KVM: X86: Provide a capability to disable HLT intercepts · caa057a2

由 Wanpeng Li 提交于 3月 12, 2018

If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
This patch adds the per-VM capability to disable them.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

caa057a2

KVM: X86: Provide a capability to disable MWAIT intercepts · 4d5422ce

由 Wanpeng Li 提交于 3月 12, 2018

Allowing a guest to execute MWAIT without interception enables a guest
to put a (physical) CPU into a power saving state, where it takes
longer to return from than what may be desired by the host.

Don't give a guest that power over a host by default. (Especially,
since nothing prevents a guest from using MWAIT even when it is not
advertised via CPUID.)

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d5422ce

KVM: x86: Add support for VMware backdoor Pseudo-PMCs · 2d7921c4

由 Arbel Moshe 提交于 3月 12, 2018

VMware exposes the following Pseudo PMCs:
0x10000: Physical host TSC
0x10001: Elapsed real time in ns
0x10002: Elapsed apparent time in ns

For more info refer to:
https://www.vmware.com/files/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf

VMware allows access to these Pseduo-PMCs even when read via RDPMC
in Ring3 and CR4.PCE=0. Therefore, commit modifies x86 emulator
to allow access to these PMCs in this situation. In addition,
emulation of these PMCs were added to kvm_pmu_rdpmc().
Signed-off-by: NArbel Moshe <arbel.moshe@oracle.com>
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2d7921c4

KVM: x86: Emulate only IN/OUT instructions when accessing VMware backdoor · 04789b66

由 Liran Alon 提交于 3月 12, 2018

Access to VMware backdoor ports is done by one of the IN/OUT/INS/OUTS
instructions. These ports must be allowed access even if TSS I/O
permission bitmap don't allow it.

To handle this, VMX/SVM will be changed in future commits
to intercept #GP which was raised by such access and
handle it by calling x86 emulator to emulate instruction.
If it was one of these instructions, the x86 emulator already handles
it correctly (Since commit "KVM: x86: Always allow access to VMware
backdoor I/O ports") by not checking these ports against TSS I/O
permission bitmap.

One may wonder why checking for specific instructions is necessary
as we can just forward all #GPs to the x86 emulator.
There are multiple reasons for doing so:

1. We don't want the x86 emulator to be reached easily
by guest by just executing an instruction that raises #GP as that
exposes the x86 emulator as a bigger attack surface.

2. The x86 emulator is incomplete and therefore certain instructions
that can cause #GP cannot be emulated. Such an example is "INT x"
(opcode 0xcd) which reaches emulate_int() which can only emulate
the instruction if vCPU is in real-mode.
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

04789b66

KVM: x86: Add emulation_type to not raise #UD on emulation failure · e2366171

由 Liran Alon 提交于 3月 12, 2018

Next commits are going introduce support for accessing VMware backdoor
ports even though guest's TSS I/O permissions bitmap doesn't allow
access. This mimic VMware hypervisor behavior.

In order to support this, next commits will change VMX/SVM to
intercept #GP which was raised by such access and handle it by calling
the x86 emulator to emulate instruction. Since commit "KVM: x86:
Always allow access to VMware backdoor I/O ports", the x86 emulator
handles access to these I/O ports by not checking these ports against
the TSS I/O permission bitmap.

However, there could be cases that CPU rasies a #GP on instruction
that fails to be disassembled by the x86 emulator (Because of
incomplete implementation for example).

In those cases, we would like the #GP intercept to just forward #GP
as-is to guest as if there was no intercept to begin with.
However, current emulator code always queues #UD exception in case
emulator fails (including disassembly failures) which is not what is
wanted in this flow.

This commit addresses this issue by adding a new emulation_type flag
that will allow the #GP intercept handler to specify that it wishes
to be aware when instruction emulation fails and doesn't want #UD
exception to be queued.
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e2366171

KVM: x86: Add module parameter for supporting VMware backdoor · c4ae60e4

由 Liran Alon 提交于 3月 12, 2018

Support access to VMware backdoor requires KVM to intercept #GP
exceptions from guest which introduce slight performance hit.
Therefore, control this support by module parameter.

Note that module parameter is exported as it should be consumed by
kvm_intel & kvm_amd to determine if they should intercept #GP or not.

This commit doesn't change semantics.
It is done as a preparation for future commits.
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c4ae60e4

KVM: x86: add kvm_fast_pio() to consolidate fast PIO code · dca7f128

由 Sean Christopherson 提交于 3月 08, 2018

Add kvm_fast_pio() to consolidate duplicate code in VMX and SVM.
Unexport kvm_fast_pio_in() and kvm_fast_pio_out().
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

dca7f128

x86/kvm/hyper-v: add reenlightenment MSRs support · a2e164e7

由 Vitaly Kuznetsov 提交于 3月 01, 2018

Nested Hyper-V/Windows guest running on top of KVM will use TSC page
clocksource in two cases:
- L0 exposes invariant TSC (CPUID.80000007H:EDX[8]).
- L0 provides Hyper-V Reenlightenment support (CPUID.40000003H:EAX[13]).

Exposing invariant TSC effectively blocks migration to hosts with different
TSC frequencies, providing reenlightenment support will be needed when we
start migrating nested workloads.

Implement rudimentary support for reenlightenment MSRs. For now, these are
just read/write MSRs with no effect.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NRoman Kagan <rkagan@virtuozzo.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

a2e164e7

KVM: x86: Add support for AMD Core Perf Extension in guest · c51eb52b

由 Janakarajan Natarajan 提交于 2月 05, 2018

Add support for AMD Core Performance counters in the guest. The base
event select and counter MSRs are changed. In addition, with the core
extension, there are 2 extra counters available for performance
measurements for a total of 6.

With the new MSRs, the logic to map them to the gp_counters[] is changed.
New functions are added to check the validity of the get/set MSRs.

If the guest has the X86_FEATURE_PERFCTR_CORE cpuid flag set, the number
of counters available to the vcpu is set to 6. It the flag is not set
then it is 4.
Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com>
[Squashed "Expose AMD Core Perf Extension flag to guests" - Radim.]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

c51eb52b

07 3月, 2018 4 次提交

KVM: nVMX: expose VMX capabilities for nested hypervisors to userspace · 1389309c

由 Paolo Bonzini 提交于 2月 26, 2018

Use the new MSR feature framework to tell userspace which VMX capabilities
are available for nested hypervisors. Before, these were only accessible
with the KVM_GET_MSR VCPU ioctl, after VCPUs had been created.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

1389309c

KVM: x86: KVM_CAP_SYNC_REGS · 01643c51

由 Ken Hofsass 提交于 1月 31, 2018

This commit implements an enhanced x86 version of S390
KVM_CAP_SYNC_REGS functionality. KVM_CAP_SYNC_REGS "allow[s]
userspace to access certain guest registers without having
to call SET/GET_*REGS”. This reduces ioctl overhead which
is particularly important when userspace is making synchronous
guest state modifications (e.g. when emulating and/or intercepting
instructions).

Originally implemented upstream for the S390, the x86 differences
follow:
- userspace can select the register sets to be synchronized with kvm_run
using bit-flags in the kvm_valid_registers and kvm_dirty_registers
fields.
- vcpu_events is available in addition to the regs and sregs register
sets.
Signed-off-by: NKen Hofsass <hofsass@google.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
[Removed wrapper around check for reserved kvm_valid_regs. - Radim]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

01643c51

kvm: x86: hyperv: guest->host event signaling via eventfd · faeb7833

由 Roman Kagan 提交于 2月 01, 2018

In Hyper-V, the fast guest->host notification mechanism is the
SIGNAL_EVENT hypercall, with a single parameter of the connection ID to
signal.

Currently this hypercall incurs a user exit and requires the userspace
to decode the parameters and trigger the notification of the potentially
different I/O context.

To avoid the costly user exit, process this hypercall and signal the
corresponding eventfd in KVM, similar to ioeventfd.  The association
between the connection id and the eventfd is established via the newly
introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an
(srcu-protected) IDR.
Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
[asm/hyperv.h changes approved by KY Srinivasan. - Radim]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

faeb7833

kvm: x86: factor out kvm.arch.hyperv (de)init · cbc0236a

由 Roman Kagan 提交于 2月 01, 2018

Move kvm.arch.hyperv initialization and cleanup to separate functions.

For now only a mutex is inited in the former, and the latter is empty;
more stuff will go in there in a followup patch.
Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

cbc0236a

02 3月, 2018 5 次提交

KVM: x86: fix vcpu initialization with userspace lapic · b7e31be3

由 Radim Krčmář 提交于 3月 01, 2018

Moving the code around broke this rare configuration.
Use this opportunity to finally call lapic reset from vcpu reset.

Reported-by: syzbot+fb7a33a4b6c35007a72b@syzkaller.appspotmail.com
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Fixes: 0b2e9904 ("KVM: x86: move LAPIC initialization after VMCS creation")
Cc: stable@vger.kernel.org
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

b7e31be3

KVM: X86: Allow userspace to define the microcode version · 518e7b94

由 Wanpeng Li 提交于 2月 28, 2018

Linux (among the others) has checks to make sure that certain features
aren't enabled on a certain family/model/stepping if the microcode version
isn't greater than or equal to a known good version.

By exposing the real microcode version, we're preventing buggy guests that
don't check that they are running virtualized (i.e., they should trust the
hypervisor) from disabling features that are effectively not buggy.
Suggested-by: NFilippo Sironi <sironi@amazon.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

518e7b94

KVM: X86: Introduce kvm_get_msr_feature() · 66421c1e

由 Wanpeng Li 提交于 2月 28, 2018

Introduce kvm_get_msr_feature() to handle the msrs which are supported
by different vendors and sharing the same emulation logic.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

66421c1e

KVM: SVM: Add MSR-based feature support for serializing LFENCE · d1d93fa9

由 Tom Lendacky 提交于 2月 24, 2018

In order to determine if LFENCE is a serializing instruction on AMD
processors, MSR 0xc0011029 (MSR_F10H_DECFG) must be read and the state
of bit 1 checked.  This patch will add support to allow a guest to
properly make this determination.

Add the MSR feature callback operation to svm.c and add MSR 0xc0011029
to the list of MSR-based features.  If LFENCE is serializing, then the
feature is supported, allowing the hypervisor to set the value of the
MSR that guest will see.  Support is also added to write (hypervisor only)
and read the MSR value for the guest.  A write by the guest will result in
a #GP.  A read by the guest will return the value as set by the host.  In
this way, the support to expose the feature to the guest is controlled by
the hypervisor.
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

d1d93fa9

KVM: x86: Add a framework for supporting MSR-based features · 801e459a

由 Tom Lendacky 提交于 2月 21, 2018

Provide a new KVM capability that allows bits within MSRs to be recognized
as features.  Two new ioctls are added to the /dev/kvm ioctl routine to
retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
callback is used to determine support for the listed MSR-based features.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
[Tweaked documentation. - Radim]
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

801e459a

24 2月, 2018 2 次提交

KVM/x86: remove WARN_ON() for when vm_munmap() fails · 103c763c

由 Eric Biggers 提交于 1月 31, 2018

On x86, special KVM memslots such as the TSS region have anonymous
memory mappings created on behalf of userspace, and these mappings are
removed when the VM is destroyed.

It is however possible for removing these mappings via vm_munmap() to
fail.  This can most easily happen if the thread receives SIGKILL while
it's waiting to acquire ->mmap_sem.   This triggers the 'WARN_ON(r < 0)'
in __x86_set_memory_region().  syzkaller was able to hit this, using
'exit()' to send the SIGKILL.  Note that while the vm_munmap() failure
results in the mapping not being removed immediately, it is not leaked
forever but rather will be freed when the process exits.

It's not really possible to handle this failure properly, so almost
every other caller of vm_munmap() doesn't check the return value.  It's
a limitation of having the kernel manage these mappings rather than
userspace.

So just remove the WARN_ON() so that users can't spam the kernel log
with this warning.

Fixes: f0d648bd ("KVM: x86: map/unmap private slots in __x86_set_memory_region")
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

103c763c

KVM: x86: move LAPIC initialization after VMCS creation · 0b2e9904

由 Paolo Bonzini 提交于 2月 23, 2018

The initial reset of the local APIC is performed before the VMCS has been
created, but it tries to do a vmwrite:

 vmwrite error: reg 810 value 4a00 (err 18944)
 CPU: 54 PID: 38652 Comm: qemu-kvm Tainted: G        W I      4.16.0-0.rc2.git0.1.fc28.x86_64 #1
 Hardware name: Intel Corporation S2600CW/S2600CW, BIOS SE5C610.86B.01.01.0003.090520141303 09/05/2014
 Call Trace:
  vmx_set_rvi [kvm_intel]
  vmx_hwapic_irr_update [kvm_intel]
  kvm_lapic_reset [kvm]
  kvm_create_lapic [kvm]
  kvm_arch_vcpu_init [kvm]
  kvm_vcpu_init [kvm]
  vmx_create_vcpu [kvm_intel]
  kvm_vm_ioctl [kvm]

Move it later, after the VMCS has been created.

Fixes: 4191db26 ("KVM: x86: Update APICv on APIC reset")
Cc: stable@vger.kernel.org
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0b2e9904

04 2月, 2018 2 次提交

KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL · d28b387f

由 KarimAllah Ahmed 提交于 2月 01, 2018

[ Based on a patch from Ashok Raj <ashok.raj@intel.com> ]

Add direct access to MSR_IA32_SPEC_CTRL for guests. This is needed for
guests that will only mitigate Spectre V2 through IBRS+IBPB and will not
be using a retpoline+IBPB based approach.

To avoid the overhead of saving and restoring the MSR_IA32_SPEC_CTRL for
guests that do not actually use the MSR, only start saving and restoring
when a non-zero is written to it.

No attempt is made to handle STIBP here, intentionally. Filtering STIBP
may be added in a future patch, which may require trapping all writes
if we don't want to pass it through directly to the guest.

[dwmw2: Clean up CPUID bits, save/restore manually, handle reset]
Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NDarren Kenny <darren.kenny@oracle.com>
Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-5-git-send-email-karahmed@amazon.de

d28b387f

KVM/VMX: Emulate MSR_IA32_ARCH_CAPABILITIES · 28c1c9fa

由 KarimAllah Ahmed 提交于 2月 01, 2018

Intel processors use MSR_IA32_ARCH_CAPABILITIES MSR to indicate RDCL_NO
(bit 0) and IBRS_ALL (bit 1). This is a read-only MSR. By default the
contents will come directly from the hardware, but user-space can still
override it.

[dwmw2: The bit in kvm_cpuid_7_0_edx_x86_features can be unconditional]
Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NDarren Kenny <darren.kenny@oracle.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Jun Nakajima <jun.nakajima@intel.com>
Cc: kvm@vger.kernel.org
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Asit Mallick <asit.k.mallick@intel.com>
Cc: Arjan Van De Ven <arjan.van.de.ven@intel.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Ashok Raj <ashok.raj@intel.com>
Link: https://lkml.kernel.org/r/1517522386-18410-4-git-send-email-karahmed@amazon.de

28c1c9fa

03 2月, 2018 1 次提交

KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs() · 8dbfb2bf

由 Eric Biggers 提交于 12月 20, 2017

Due to a bad merge resolution between commit f2981033 ("KVM/x86:
Check input paging mode when cs.l is set") and commit b4ef9d4e
("KVM: Move vcpu_load to arch-specific kvm_arch_vcpu_ioctl_set_sregs"),
there is a case in kvm_arch_vcpu_ioctl_set_sregs() where vcpu_put() is
not called after vcpu_get().  Fix it.
Reported-by: Nsyzbot <syzkaller@googlegroups.com>
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

8dbfb2bf

01 2月, 2018 2 次提交

kvm: x86: remove efer_reload entry in kvm_vcpu_stat · 87cedc6b

由 Longpeng(Mike) 提交于 1月 26, 2018

The efer_reload is never used since
commit 26bb0981 ("KVM: VMX: Use shared msr infrastructure"),
so remove it.
Signed-off-by: NLongpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

87cedc6b

x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested · d391f120

由 Vitaly Kuznetsov 提交于 1月 25, 2018

I was investigating an issue with seabios >= 1.10 which stopped working
for nested KVM on Hyper-V. The problem appears to be in
handle_ept_violation() function: when we do fast mmio we need to skip
the instruction so we do kvm_skip_emulated_instruction(). This, however,
depends on VM_EXIT_INSTRUCTION_LEN field being set correctly in VMCS.
However, this is not the case.

Intel's manual doesn't mandate VM_EXIT_INSTRUCTION_LEN to be set when
EPT MISCONFIG occurs. While on real hardware it was observed to be set,
some hypervisors follow the spec and don't set it; we end up advancing
IP with some random value.

I checked with Microsoft and they confirmed they don't fill
VM_EXIT_INSTRUCTION_LEN on EPT MISCONFIG.

Fix the issue by doing instruction skip through emulator when running
nested.

Fixes: 68c3b4d1Suggested-by: NRadim Krčmář <rkrcmar@redhat.com>
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

d391f120

31 1月, 2018 3 次提交

x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n · 5fa4ec9c

由 Thomas Gleixner 提交于 1月 31, 2018

The reenlightment support for hyperv slapped a direct reference to
x86_hyper_type into the kvm code which results in the following build
failure when CONFIG_HYPERVISOR_GUEST=n:

arch/x86/kvm/x86.c:6259:6: error: ‘x86_hyper_type’ undeclared (first use in this function)
arch/x86/kvm/x86.c:6259:6: note: each undeclared identifier is reported only once for each function it appears in

Use the proper helper function to cure that.

The 32bit compile fails because of:

arch/x86/kvm/x86.c:5936:13: warning: ‘kvm_hyperv_tsc_notifier’ defined but not used [-Wunused-function]

which is a real trainwreck engineering artwork. The callsite is wrapped
into #ifdef CONFIG_X86_64, but the function itself has the #ifdef inside
the function body. Make the function itself wrapped into the ifdef to cure
that.

Qualiteee....

Fixes: 0092e434 ("x86/kvm: Support Hyper-V reenlightenment")
Reported-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>

5fa4ec9c

x86/kvm: Support Hyper-V reenlightenment · 0092e434

由 Vitaly Kuznetsov 提交于 1月 24, 2018

When running nested KVM on Hyper-V guests its required to update
masterclocks for all guests when L1 migrates to a host with different TSC
frequency.

Implement the procedure in the following way:
  - Pause all guests.
  - Tell the host (Hyper-V) to stop emulating TSC accesses.
  - Update the gtod copy, recompute clocks.
  - Unpause all guests.

This is somewhat similar to cpufreq but there are two important differences:
 - TSC emulation can only be disabled globally (on all CPUs)
 - The new TSC frequency is not known until emulation is turned off so
   there is no way to 'prepare' for the event upfront.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-8-vkuznets@redhat.com

0092e434

x86/kvm: Pass stable clocksource to guests when running nested on Hyper-V · b0c39dc6

由 Vitaly Kuznetsov 提交于 1月 24, 2018

Currently, KVM is able to work in 'masterclock' mode passing
PVCLOCK_TSC_STABLE_BIT to guests when the clocksource which is used on the
host is TSC.

When running nested on Hyper-V the guest normally uses a different one: TSC
page which is resistant to TSC frequency changes on events like L1
migration. Add support for it in KVM.

The only non-trivial change is in vgettsc(): when updating the gtod copy
both the clock readout and tsc value have to be updated now.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: kvm@vger.kernel.org
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Cathy Avery <cavery@redhat.com>
Cc: Mohammed Gamal <mmorsy@redhat.com>
Link: https://lkml.kernel.org/r/20180124132337.30138-7-vkuznets@redhat.com

b0c39dc6

17 1月, 2018 1 次提交

KVM/x86: Fix wrong macro references of X86_CR0_PG_BIT and X86_CR4_PAE_BIT in kvm_valid_sregs() · 37b95951

由 Tianyu Lan 提交于 1月 16, 2018

kvm_valid_sregs() should use X86_CR0_PG and X86_CR4_PAE to check bit
status rather than X86_CR0_PG_BIT and X86_CR4_PAE_BIT. This patch is
to fix it.

Fixes: f2981033(KVM/x86: Check input paging mode when cs.l is set)
Reported-by: NJeremi Piotrowski <jeremi.piotrowski@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NTianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

37b95951

16 1月, 2018 4 次提交

KVM: x86: Optimization: Create SVM stubs for sync_pir_to_irr() · fa59cc00

由 Liran Alon 提交于 12月 24, 2017

sync_pir_to_irr() is only called if vcpu->arch.apicv_active()==true.
In case it is false, VMX code make sure to set sync_pir_to_irr
to NULL.

Therefore, having SVM stubs allows to remove check for if
sync_pir_to_irr != NULL from all calling sites.
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NLiam Merwick <liam.merwick@oracle.com>
[Return highest IRR in the SVM case. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

fa59cc00

KVM: x86: avoid unnecessary XSETBV on guest entry · 476b7ada

由 Paolo Bonzini 提交于 12月 13, 2017

xsetbv can be expensive when running on nested virtualization, try to
avoid it.
Reviewed-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NWanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: NQuan Xu <quan.xu0@gmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

476b7ada

KVM: x86: fix escape of guest dr6 to the host · efdab992

由 Wanpeng Li 提交于 12月 13, 2017

syzkaller reported:

   WARNING: CPU: 0 PID: 12927 at arch/x86/kernel/traps.c:780 do_debug+0x222/0x250
   CPU: 0 PID: 12927 Comm: syz-executor Tainted: G           OE    4.15.0-rc2+ #16
   RIP: 0010:do_debug+0x222/0x250
   Call Trace:
    <#DB>
    debug+0x3e/0x70
   RIP: 0010:copy_user_enhanced_fast_string+0x10/0x20
    </#DB>
    _copy_from_user+0x5b/0x90
    SyS_timer_create+0x33/0x80
    entry_SYSCALL_64_fastpath+0x23/0x9a

The testcase sets a watchpoint (with perf_event_open) on a buffer that is
passed to timer_create() as the struct sigevent argument.  In timer_create(),
copy_from_user()'s rep movsb triggers the BP.  The testcase also sets
the debug registers for the guest.

However, KVM only restores host debug registers when the host has active
watchpoints, which triggers a race condition when running the testcase with
multiple threads.  The guest's DR6.BS bit can escape to the host before
another thread invokes timer_create(), and do_debug() complains.

The fix is to respect do_debug()'s dr6 invariant when leaving KVM.
Reported-by: NDmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

efdab992

KVM: X86: support paravirtualized help for TLB shootdowns · f38a7b75

由 Wanpeng Li 提交于 12月 12, 2017

When running on a virtual machine, IPIs are expensive when the target
CPU is sleeping.  Thus, it is nice to be able to avoid them for TLB
shootdowns.  KVM can just do the flush via INVVPID on the guest's behalf
the next time the CPU is scheduled.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
[Use "&" to test the bit instead of "==". - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>

f38a7b75

OpenHarmony / kernel_linux 上一次同步 4 年多

OpenHarmony / kernel_linux
上一次同步 4 年多