提交 · e751732486eb3f159089a64d1901992b1357e7cc · openeuler / Kernel

22 7月, 2019 2 次提交

KVM: X86: Fix fpu state crash in kvm guest · e7517324

由 Wanpeng Li 提交于 7月 22, 2019

The idea before commit 240c35a3 (which has just been reverted)
was that we have the following FPU states:

               userspace (QEMU)             guest
---------------------------------------------------------------------------
               processor                    vcpu->arch.guest_fpu
>>> KVM_RUN: kvm_load_guest_fpu
               vcpu->arch.user_fpu          processor
>>> preempt out
               vcpu->arch.user_fpu          current->thread.fpu
>>> preempt in
               vcpu->arch.user_fpu          processor
>>> back to userspace
>>> kvm_put_guest_fpu
               processor                    vcpu->arch.guest_fpu
---------------------------------------------------------------------------

With the new lazy model we want to get the state back to the processor
when schedule in from current->thread.fpu.
Reported-by: NThomas Lambertz <mail@thomaslambertz.de>
Reported-by: Nanthony <antdev66@gmail.com>
Tested-by: Nanthony <antdev66@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lambertz <mail@thomaslambertz.de>
Cc: anthony <antdev66@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 5f409e20 (x86/fpu: Defer FPU state load until return to userspace)
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
[Add a comment in front of the warning. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e7517324

Revert "kvm: x86: Use task structs fpu field for user" · ec269475

由 Paolo Bonzini 提交于 7月 22, 2019

This reverts commit 240c35a3
("kvm: x86: Use task structs fpu field for user", 2018-11-06).
The commit is broken and causes QEMU's FPU state to be destroyed
when KVM_RUN is preempted.

Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec269475

20 7月, 2019 1 次提交

KVM: LAPIC: Inject timer interrupt via posted interrupt · 0c5f81da

由 Wanpeng Li 提交于 7月 06, 2019

Dedicated instances are currently disturbed by unnecessary jitter due
to the emulated lapic timers firing on the same pCPUs where the
vCPUs reside. There is no hardware virtual timer on Intel for guest
like ARM, so both programming timer in guest and the emulated timer fires
incur vmexits. This patch tries to avoid vmexit when the emulated timer
fires, at least in dedicated instance scenario when nohz_full is enabled.

In that case, the emulated timers can be offload to the nearest busy
housekeeping cpus since APICv has been found for several years in server
processors. The guest timer interrupt can then be injected via posted interrupts,
which are delivered by the housekeeping cpu once the emulated timer fires.

The host should tuned so that vCPUs are placed on isolated physical
processors, and with several pCPUs surplus for busy housekeeping.
If disabled mwait/hlt/pause vmexits keep the vCPUs in non-root mode,
~3% redis performance benefit can be observed on Skylake server, and the
number of external interrupt vmexits drops substantially. Without patch

VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 42916 49.43% 39.30% 0.47us 106.09us 0.71us ( +- 1.09% )

While with patch:

VM-EXIT Samples Samples% Time% Min Time Max Time Avg time
EXTERNAL_INTERRUPT 6871 9.29% 2.96% 0.44us 57.88us 0.72us ( +- 4.02% )

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0c5f81da

18 7月, 2019 1 次提交

KVM: LAPIC: Make lapic timer unpinned · 4d151bf3

由 Wanpeng Li 提交于 7月 06, 2019

Commit 61abdbe0 ("kvm: x86: make lapic hrtimer pinned") pinned the
lapic timer to avoid to wait until the next kvm exit for the guest to
see KVM_REQ_PENDING_TIMER set. There is another solution to give a kick
after setting the KVM_REQ_PENDING_TIMER bit, make lapic timer unpinned
will be used in follow up patches.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d151bf3

15 7月, 2019 1 次提交

kvm: x86: some tsc debug cleanup · 9a5611af

由 Yi Wang 提交于 7月 06, 2019

There are some pr_debug in TSC code, which may have
been no use, so remove them as Paolo suggested.
Signed-off-by: NYi Wang <wang.yi59@zte.com.cn>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9a5611af

11 7月, 2019 2 次提交

KVM: x86: Unconditionally enable irqs in guest context · d7a08882

由 Sean Christopherson 提交于 7月 10, 2019

On VMX, KVM currently does not re-enable irqs until after it has exited
the guest context. As a result, a tick that fires in the window between
VM-Exit and guest_exit_irqoff() will be accounted as system time. While
said window is relatively small, it's large enough to be problematic in
some configurations, e.g. if VM-Exits are consistently occurring a hair
earlier than the tick irq.

Intentionally toggle irqs back off so that guest_exit_irqoff() can be
used in lieu of guest_exit() in order to avoid the save/restore of flags
in guest_exit(). On my Haswell system, "nop; cli; sti" is ~6 cycles,
versus ~28 cycles for "pushf; pop <reg>; cli; push <reg>; popf".

Fixes: f2485b3e ("KVM: x86: use guest_exit_irqoff")
Reported-by: NWei Yang <w90p710@gmail.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d7a08882

KVM: x86: PMU Event Filter · 66bb8a06

由 Eric Hankland 提交于 7月 10, 2019

Some events can provide a guest with information about other guests or the
host (e.g. L3 cache stats); providing the capability to restrict access
to a "safe" set of events would limit the potential for the PMU to be used
in any side channel attacks. This change introduces a new VM ioctl that
sets an event filter. If the guest attempts to program a counter for
any blacklisted or non-whitelisted event, the kernel counter won't be
created, so any RDPMC/RDMSR will show 0 instances of that event.
Signed-off-by: NEric Hankland <ehankland@google.com>
[Lots of changes. All remaining bugs are probably mine. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66bb8a06

03 7月, 2019 3 次提交

clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic · dd2cb348

由 Michael Kelley 提交于 7月 01, 2019

Continue consolidating Hyper-V clock and timer code into an ISA
independent Hyper-V clocksource driver.

Move the existing clocksource code under drivers/hv and arch/x86 to the new
clocksource driver while separating out the ISA dependencies. Update
Hyper-V initialization to call initialization and cleanup routines since
the Hyper-V synthetic clock is not independently enumerated in ACPI.

Update Hyper-V clocksource users in KVM and VDSO to get definitions from
the new include file.

No behavior is changed and no new functionality is added.
Suggested-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NMichael Kelley <mikelley@microsoft.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Cc: "bp@alien8.de" <bp@alien8.de>
Cc: "will.deacon@arm.com" <will.deacon@arm.com>
Cc: "catalin.marinas@arm.com" <catalin.marinas@arm.com>
Cc: "mark.rutland@arm.com" <mark.rutland@arm.com>
Cc: "linux-arm-kernel@lists.infradead.org" <linux-arm-kernel@lists.infradead.org>
Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>
Cc: "olaf@aepfle.de" <olaf@aepfle.de>
Cc: "apw@canonical.com" <apw@canonical.com>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>
Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
Cc: Sunil Muthuswamy <sunilmut@microsoft.com>
Cc: KY Srinivasan <kys@microsoft.com>
Cc: "sashal@kernel.org" <sashal@kernel.org>
Cc: "vincenzo.frascino@arm.com" <vincenzo.frascino@arm.com>
Cc: "linux-arch@vger.kernel.org" <linux-arch@vger.kernel.org>
Cc: "linux-mips@vger.kernel.org" <linux-mips@vger.kernel.org>
Cc: "linux-kselftest@vger.kernel.org" <linux-kselftest@vger.kernel.org>
Cc: "arnd@arndb.de" <arnd@arndb.de>
Cc: "linux@armlinux.org.uk" <linux@armlinux.org.uk>
Cc: "ralf@linux-mips.org" <ralf@linux-mips.org>
Cc: "paul.burton@mips.com" <paul.burton@mips.com>
Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
Cc: "salyzyn@android.com" <salyzyn@android.com>
Cc: "pcc@google.com" <pcc@google.com>
Cc: "shuah@kernel.org" <shuah@kernel.org>
Cc: "0x7f454c46@gmail.com" <0x7f454c46@gmail.com>
Cc: "linux@rasmusvillemoes.dk" <linux@rasmusvillemoes.dk>
Cc: "huw@codeweavers.com" <huw@codeweavers.com>
Cc: "sfr@canb.auug.org.au" <sfr@canb.auug.org.au>
Cc: "pbonzini@redhat.com" <pbonzini@redhat.com>
Cc: "rkrcmar@redhat.com" <rkrcmar@redhat.com>
Cc: "kvm@vger.kernel.org" <kvm@vger.kernel.org>
Link: https://lkml.kernel.org/r/1561955054-1838-3-git-send-email-mikelley@microsoft.com

dd2cb348

KVM: x86: degrade WARN to pr_warn_ratelimited · 3f16a5c3

由 Paolo Bonzini 提交于 6月 26, 2019

This warning can be triggered easily by userspace, so it should certainly not
cause a panic if panic_on_warn is set.

Reported-by: syzbot+c03f30b4f4c46bdf8575@syzkaller.appspotmail.com
Suggested-by: NAlexander Potapenko <glider@google.com>
Acked-by: NAlexander Potapenko <glider@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3f16a5c3

KVM: X86: Implement PV sched yield hypercall · 71506297

由 Wanpeng Li 提交于 6月 11, 2019

The target vCPUs are in runnable state after vcpu_kick and suitable
as a yield target. This patch implements the sched yield hypercall.

17% performance increasement of ebizzy benchmark can be observed in an
over-subscribe environment. (w/ kvm-pv-tlb disabled, testing TLB flush
call-function IPI-many since call-function is not easy to be trigged
by userspace workload).

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

71506297

02 7月, 2019 1 次提交

KVM: nVMX: list VMX MSRs in KVM_GET_MSR_INDEX_LIST · 95c5c7c7

由 Paolo Bonzini 提交于 7月 02, 2019

This allows userspace to know which MSRs are supported by the hypervisor.
Unfortunately userspace must resort to tricks for everything except
MSR_IA32_VMX_VMFUNC (which was just added in the previous patch).
One possibility is to use the feature control MSR, which is tied to nested
VMX as well and is present on all KVM versions that support feature MSRs.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

95c5c7c7

22 6月, 2019 1 次提交

timekeeping: Use proper clock specifier names in functions · 9285ec4c

由 Jason A. Donenfeld 提交于 6月 21, 2019

This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.
Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NArnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com

9285ec4c

19 6月, 2019 1 次提交

treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 · 20c8ccb1

由 Thomas Gleixner 提交于 6月 04, 2019

Based on 1 normalized pattern(s):

  this work is licensed under the terms of the gnu gpl version 2 see
  the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s).
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: NEnrico Weigelt <info@metux.net>
Reviewed-by: NAllison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

20c8ccb1

18 6月, 2019 6 次提交

KVM: x86: introduce is_pae_paging · bf03d4f9

由 Paolo Bonzini 提交于 6月 06, 2019

Checking for 32-bit PAE is quite common around code that fiddles with
the PDPTRs.  Add a function to compress all checks into a single
invocation.
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bf03d4f9

KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn · 95b5a48c

由 Sean Christopherson 提交于 4月 19, 2019

Per commit 1b6269db ("KVM: VMX: Handle NMIs before enabling
interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run()
to "make sure we handle NMI on the current cpu, and that we don't
service maskable interrupts before non-maskable ones".  The other
exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC,
have similar requirements, and are located there to avoid extra VMREADs
since VMX bins hardware exceptions and NMIs into a single exit reason.

Clean up the code and eliminate the vaguely named complete_atomic_exit()
by moving the interrupts-disabled exception and NMI handling into the
existing handle_external_intrs() callback, and rename the callback to
a more appropriate name.  Rename VMexit handlers throughout so that the
atomic and non-atomic counterparts have similar names.

In addition to improving code readability, this also ensures the NMI
handler is run with the host's debug registers loaded in the unlikely
event that the user is debugging NMIs.  Accuracy of the last_guest_tsc
field is also improved when handling NMIs (and #MCs) as the handler
will run after updating said field.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
[Naming cleanups. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

95b5a48c

KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code · 165072b0

由 Sean Christopherson 提交于 4月 19, 2019

VMX can conditionally call kvm_{before,after}_interrupt() since KVM
always uses "ack interrupt on exit" and therefore explicitly handles
interrupts as opposed to blindly enabling irqs.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

165072b0

KVM: x86: move MSR_IA32_POWER_CTL handling to common code · 73f624f4

由 Paolo Bonzini 提交于 6月 06, 2019

Make it available to AMD hosts as well, just in case someone is trying
to use an Intel processor's CPUID setup.
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

73f624f4

kvm: x86: add host poll control msrs · 2d5ba19b

由 Marcelo Tosatti 提交于 6月 03, 2019

Add an MSRs which allows the guest to disable
host polling (specifically the cpuidle-haltpoll,
when performing polling in the guest, disables
host side polling).
Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2d5ba19b

KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 · 1fc5d194

由 Liran Alon 提交于 6月 06, 2019

Make all code consistent with kvm_deliver_exception_payload() by using
appropriate symbolic constant instead of hard-coded number.
Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1fc5d194

14 6月, 2019 1 次提交

KVM: x86: clean up conditions for asynchronous page fault handling · 1dfdb45e

由 Paolo Bonzini 提交于 6月 05, 2019

Even when asynchronous page fault is disabled, KVM does not want to pause
the host if a guest triggers a page fault; instead it will put it into
an artificial HLT state that allows running other host processes while
allowing interrupt delivery into the guest.

However, the way this feature is triggered is a bit confusing.
First, it is not used for page faults while a nested guest is
running: but this is not an issue since the artificial halt
is completely invisible to the guest, either L1 or L2.  Second,
it is used even if kvm_halt_in_guest() returns true; in this case,
the guest probably should not pay the additional latency cost of the
artificial halt, and thus we should handle the page fault in a
completely synchronous way.

By introducing a new function kvm_can_deliver_async_pf, this patch
commonizes the code that chooses whether to deliver an async page fault
(kvm_arch_async_page_not_present) and the code that chooses whether a
page fault should be handled synchronously (kvm_can_do_async_pf).
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1dfdb45e

05 6月, 2019 8 次提交

kvm: Convert kvm_lock to a mutex · 0d9ce162

由 Junaid Shahid 提交于 1月 03, 2019

It doesn't seem as if there is any particular need for kvm_lock to be a
spinlock, so convert the lock to a mutex so that sleepable functions (in
particular cond_resched()) can be called while holding it.
Signed-off-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0d9ce162

KVM: X86: Emulate MSR_IA32_MISC_ENABLE MWAIT bit · 511a8556

由 Wanpeng Li 提交于 5月 21, 2019

MSR IA32_MISC_ENABLE bit 18, according to SDM:

| When this bit is set to 0, the MONITOR feature flag is not set (CPUID.01H:ECX[bit 3] = 0).
| This indicates that MONITOR/MWAIT are not supported.
|
| Software attempts to execute MONITOR/MWAIT will cause #UD when this bit is 0.
|
| When this bit is set to 1 (default), MONITOR/MWAIT are supported (CPUID.01H:ECX[bit 3] = 1).

The CPUID.01H:ECX[bit 3] ought to mirror the value of the MSR bit,
CPUID.01H:ECX[bit 3] is a better guard than kvm_mwait_in_guest().
kvm_mwait_in_guest() affects the behavior of MONITOR/MWAIT, not its
guest visibility.

This patch implements toggling of the CPUID bit based on guest writes
to the MSR.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
[Fixes for backwards compatibility - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

511a8556

KVM: X86: Provide a capability to disable cstate msr read intercepts · b5170063

由 Wanpeng Li 提交于 5月 21, 2019

Allow guest reads CORE cstate when exposing host CPU power management capabilities
to the guest. PKG cstate is restricted to avoid a guest to get the whole package
information in multi-tenant scenario.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b5170063

kvm: x86: refine kvm_get_arch_capabilities() · 4d22c17c

由 Xiaoyao Li 提交于 4月 19, 2019

1. Using X86_FEATURE_ARCH_CAPABILITIES to enumerate the existence of
MSR_IA32_ARCH_CAPABILITIES to avoid using rdmsrl_safe().

2. Since kvm_get_arch_capabilities() is only used in this file, making
it static.
Signed-off-by: NXiaoyao Li <xiaoyao.li@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4d22c17c

KVM: Directly return result from kvm_arch_check_processor_compat() · f257d6dc

由 Sean Christopherson 提交于 4月 19, 2019

Add a wrapper to invoke kvm_arch_check_processor_compat() so that the
boilerplate ugliness of checking virtualization support on all CPUs is
hidden from the arch specific code.  x86's implementation in particular
is quite heinous, as it unnecessarily propagates the out-param pattern
into kvm_x86_ops.

While the x86 specific issue could be resolved solely by changing
kvm_x86_ops, make the change for all architectures as returning a value
directly is prettier and technically more robust, e.g. s390 doesn't set
the out param, which could lead to subtle breakage in the (highly
unlikely) scenario where the out-param was not pre-initialized by the
caller.

Opportunistically annotate svm_check_processor_compat() with __init.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f257d6dc

KVM: LAPIC: Optimize timer latency further · b6c4bc65

由 Wanpeng Li 提交于 5月 20, 2019

Advance lapic timer tries to hidden the hypervisor overhead between the
host emulated timer fires and the guest awares the timer is fired. However,
it just hidden the time between apic_timer_fn/handle_preemption_timer ->
wait_lapic_expire, instead of the real position of vmentry which is
mentioned in the orignial commit d0659d94 ("KVM: x86: add option to
advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between
the end of wait_lapic_expire and before world switch on my haswell desktop.

This patch tries to narrow the last gap(wait_lapic_expire -> world switch),
it takes the real overhead time between apic_timer_fn/handle_preemption_timer
and before world switch into consideration when adaptively tuning timer
advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles
on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing
busy waits.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b6c4bc65

KVM: LAPIC: Delay trace_kvm_wait_lapic_expire tracepoint to after vmexit · ec0671d5

由 Wanpeng Li 提交于 5月 20, 2019

wait_lapic_expire() call was moved above guest_enter_irqoff() because of
its tracepoint, which violated the RCU extended quiescent state invoked
by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint
below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before
VM-Enter, but trace it after VM-Exit. This can help us to move
wait_lapic_expire() just before vmentry in the later patch.

[1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state")
[2] https://patchwork.kernel.org/patch/7821111/

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
[Track whether wait_lapic_expire was called, and do not invoke the tracepoint
 if not. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec0671d5

kvm: x86: Move kvm_set_mmio_spte_mask() from x86.c to mmu.c · 7b6f8a06

由 Kai Huang 提交于 5月 03, 2019

As a prerequisite to fix several SPTE reserved bits related calculation
errors caused by MKTME, which requires kvm_set_mmio_spte_mask() to use
local static variable defined in mmu.c.

Also move call site of kvm_set_mmio_spte_mask() from kvm_arch_init() to
kvm_mmu_module_init() so that kvm_set_mmio_spte_mask() can be static.
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NKai Huang <kai.huang@linux.intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7b6f8a06

28 5月, 2019 1 次提交

KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID · a86cb413

由 Thomas Huth 提交于 5月 23, 2019

KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all
architectures. However, on s390x, the amount of usable CPUs is determined
during runtime - it is depending on the features of the machine the code
is running on. Since we are using the vcpu_id as an index into the SCA
structures that are defined by the hardware (see e.g. the sca_add_vcpu()
function), it is not only the amount of CPUs that is limited by the hard-
ware, but also the range of IDs that we can use.
Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too.
So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common
code into the architecture specific code, and on s390x we have to return
the same value here as for KVM_CAP_MAX_VCPUS.
This problem has been discovered with the kvm_create_max_vcpus selftest.
With this change applied, the selftest now passes on s390x, too.
Reviewed-by: NAndrew Jones <drjones@redhat.com>
Reviewed-by: NCornelia Huck <cohuck@redhat.com>
Reviewed-by: NDavid Hildenbrand <david@redhat.com>
Signed-off-by: NThomas Huth <thuth@redhat.com>
Message-Id: <20190523164309.13345-9-thuth@redhat.com>
Cc: stable@vger.kernel.org
Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>

a86cb413

25 5月, 2019 2 次提交

KVM: x86: fix return value for reserved EFER · 66f61c92

由 Paolo Bonzini 提交于 5月 24, 2019

Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for
host-initiated writes", 2019-04-02) introduced a "return false" in a
function returning int, and anyway set_efer has a "nonzero on error"
conventon so it should be returning 1.
Reported-by: NPavel Machek <pavel@denx.de>
Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66f61c92

KVM: LAPIC: Fix lapic_timer_advance_ns parameter overflow · 0e6edceb

由 Wanpeng Li 提交于 5月 20, 2019

After commit c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of
timer advancement), '-1' enables adaptive tuning starting from default
advancment of 1000ns. However, we should expose an int instead of an overflow
uint module parameter.

Before patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295

After patch:

/sys/module/kvm/parameters/lapic_timer_advance_ns:-1

Fixes: c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of timer advancement)
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0e6edceb

10 5月, 2019 1 次提交

cpufreq: Call transition notifier only once for each policy · df24014a

由 Viresh Kumar 提交于 4月 29, 2019

Currently, the notifiers are called once for each CPU of the policy->cpus
cpumask. It would be more optimal if the notifier can be called only
once and all the relevant information be provided to it. Out of the 23
drivers that register for the transition notifiers today, only 4 of them
do per-cpu updates and the callback for the rest can be called only once
for the policy without any impact.

This would also avoid multiple function calls to the notifier callbacks
and reduce multiple iterations of notifier core's code (which does
locking as well).

This patch adds pointer to the cpufreq policy to the struct
cpufreq_freqs, so the notifier callback has all the information
available to it with a single call. The five drivers which perform
per-cpu updates are updated to use the cpufreq policy. The freqs->cpu
field is redundant now and is removed.

Acked-by: David S. Miller <davem@davemloft.net> (sparc)
Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>

df24014a

01 5月, 2019 8 次提交

KVM: x86: use direct accessors for RIP and RSP · e9c16c78

由 Paolo Bonzini 提交于 4月 30, 2019

Use specific inline functions for RIP and RSP instead of
going through kvm_register_read and kvm_register_write,
which are quite a mouthful.  kvm_rsp_read and kvm_rsp_write
did not exist, so add them.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e9c16c78

KVM: x86: Omit caching logic for always-available GPRs · de3cd117

由 Sean Christopherson 提交于 4月 30, 2019

Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always
treated "available and dirtly" on both VMX and SVM, i.e. are
unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit.

Eliminating the unnecessary caching code reduces the size of KVM by a
non-trivial amount, much of which comes from the most common code paths.
E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and
kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM
dropping by ~1000 bytes. With CONFIG_RETPOLINE=y, the numbers are even
more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

de3cd117

KVM/X86: Use kvm_vcpu_map in emulator_cmpxchg_emulated · 42e35f80

由 KarimAllah Ahmed 提交于 1月 31, 2019

Use kvm_vcpu_map in emulator_cmpxchg_emulated since using
kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has
a "struct page".
Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de>
Reviewed-by: NKonrad Rzeszutek Wilk <kjonrad.wilk@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

42e35f80

x86/kvm: Implement HWCR support · 191c8137

由 Borislav Petkov 提交于 4月 18, 2019

The hardware configuration register has some useful bits which can be
used by guests. Implement McStatusWrEn which can be used by guests when
injecting MCEs with the in-kernel mce-inject module.

For that, we need to set bit 18 - McStatusWrEn - first, before writing
the MCi_STATUS registers (otherwise we #GP).

Add the required machinery to do so.
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Jim Mattson <jmattson@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: KVM <kvm@vger.kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yazen Ghannam <Yazen.Ghannam@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

191c8137

KVM: VMX: Nop emulation of MSR_IA32_POWER_CTL · 6c6a2ab9

由 Liran Alon 提交于 4月 15, 2019

Since commits 668fffa3 ("kvm: better MWAIT emulation for guestsâ€)
and 4d5422ce ("KVM: X86: Provide a capability to disable MWAIT interceptsâ€),
KVM was modified to allow an admin to configure certain guests to execute
MONITOR/MWAIT inside guest without being intercepted by host.

This is useful in case admin wishes to allocate a dedicated logical
processor for each vCPU thread. Thus, making it safe for guest to
completely control the power-state of the logical processor.

The ability to use this new KVM capability was introduced to QEMU by
commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|offâ€) and
2266d4431132 ("i386/cpu: make -cpu host support monitor/mwaitâ€).

However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle
kernel module to execute c1e_promotion_disable() which will attempt to
RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable"
bit. This behaviour was introduced by commit
32e95180 ("intel_idle: export both C1 and C1Eâ€).

Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0
will cause the above guest behaviour to raise a #GP which will cause
guest to kernel panic.

Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to
avoid #GP in guest in this scenario.

Future commits can optimise emulation further by reflecting guest
MSR changes to host MSR to provide guest with the ability to
fine-tune the dedicated logical processor power-state.
Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6c6a2ab9

KVM: x86: Inject PMI for KVM guest · 8479e04e

由 Luwei Kang 提交于 2月 18, 2019

Inject a PMI for KVM guest when Intel PT working
in Host-Guest mode and Guest ToPA entry memory buffer
was completely filled.
Signed-off-by: NLuwei Kang <luwei.kang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8479e04e

KVM: lapic: Check for in-kernel LAPIC before deferencing apic pointer · b904cb8d

由 Sean Christopherson 提交于 4月 25, 2019

...to avoid dereferencing a null pointer when querying the per-vCPU
timer advance.

Fixes: 39497d76 ("KVM: lapic: Track lapic timer advance per vCPU")
Reported-by: syzbot+f7e65445a40d3e0e4ebf@syzkaller.appspotmail.com
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b904cb8d

KVM: x86: Whitelist port 0x7e for pre-incrementing %rip · 8764ed55

由 Sean Christopherson 提交于 4月 29, 2019

KVM's recent bug fix to update %rip after emulating I/O broke userspace
that relied on the previous behavior of incrementing %rip prior to
exiting to userspace.  When running a Windows XP guest on AMD hardware,
Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself.
Because KVM's old behavior was to increment %rip before exiting to
userspace to handle the I/O, Qemu manually adjusted %rip to account for
the OUT instruction.

Arguably this is a userspace bug as KVM requires userspace to re-enter
the kernel to complete instruction emulation before taking any other
actions.  That being said, this is a bit of a grey area and breaking
userspace that has worked for many years is bad.

Pre-increment %rip on OUT to port 0x7e before exiting to userspace to
hack around the issue.

Fixes: 45def77e ("KVM: x86: update %rip after emulating IO")
Reported-by: NSimon Becherer <simon@becherer.de>
Reported-and-tested-by: NIakov Karpov <srid@rkmail.ru>
Reported-by: NGabriele Balducci <balducci@units.it>
Reported-by: NAntti Antinoja <reader@fennosys.fi>
Cc: stable@vger.kernel.org
Cc: Takashi Iwai <tiwai@suse.com>
Cc: Jiri Slaby <jslaby@suse.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8764ed55

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功