1. 26 Apr 2021, 2 commits
    • KVM: X86: Fix failure to boost kernel lock holder candidate in SEV-ES guests · b86bb11e
      Authored by Wanpeng Li
      Commit f1c6366e ("KVM: SVM: Add required changes to support intercepts under
      SEV-ES") prevents the hypervisor from accessing guest register state when
      the guest is running under SEV-ES. The initial value of
      vcpu->arch.guest_state_protected is false, and after that commit it is no
      longer updated in the preemption notifiers, which means the kernel spinlock
      holder is always skipped when choosing a candidate to boost. Fix this by
      always treating a preempted vCPU as being in guest kernel mode; a false
      positive is better than skipping the boost entirely.
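      A minimal sketch of the shape of the fix; the assumption that the check
      lands in kvm_arch_vcpu_in_kernel(), the helper the boost path consults,
      is mine:

         bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
         {
                 /*
                  * Guest register state is unreadable under SEV-ES, so
                  * conservatively report kernel mode; a false positive only
                  * costs an unnecessary boost.
                  */
                 if (vcpu->arch.guest_state_protected)
                         return true;

                 return vcpu->arch.preempted_in_kernel;
         }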
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1619080459-30032-1-git-send-email-wanpengli@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Properly handle APF vs disabled LAPIC situation · 2f15d027
      Authored by Vitaly Kuznetsov
      An async PF 'page ready' event may arrive while the LAPIC is (temporarily)
      disabled. In particular, Sebastien reports that when a Linux kernel is
      booted directly by Cloud Hypervisor, the LAPIC is 'software disabled' when
      the APF mechanism is initialized. On initialization, KVM tries to inject a
      'wakeup all' event and puts the corresponding token into the slot. It
      fails, however, to inject an interrupt (kvm_apic_set_irq() ->
      __apic_accept_irq() -> !apic_enabled()), so the guest is never notified
      and the whole APF mechanism gets stuck. The same issue is likely to occur
      if the guest temporarily disables the LAPIC and a previously unavailable
      page becomes available.
      
      Do two things to resolve the issue:
      - Avoid dequeuing 'page ready' events from the APF queue when the LAPIC
        is disabled (a sketch follows the list).
      - Trigger an attempt to deliver pending 'page ready' events when the
        LAPIC becomes enabled (SPIV or MSR_IA32_APICBASE).
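      A sketch of the first item; the helper names kvm_pv_async_pf_enabled(),
      kvm_apic_enabled() and apf_pageready_slot_free() are assumptions for
      illustration, not verified against the tree:

         bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
         {
                 if (!kvm_pv_async_pf_enabled(vcpu))
                         return true;

                 /*
                  * Keep 'page ready' events queued while the LAPIC cannot
                  * deliver the notification interrupt.
                  */
                 return kvm_apic_enabled(vcpu) && apf_pageready_slot_free(vcpu);
         }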
      Reported-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210422092948.568327-1-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 22 Apr 2021, 2 commits
    • KVM: Boost vCPU candidate in user mode which is delivering interrupt · 52acd22f
      Authored by Wanpeng Li
      Both a lock-holder vCPU and a halted IPI receiver are candidates for a
      boost. However, the PLE handler was originally designed to deal with the
      lock holder preemption problem: Intel PLE occurs when the spinlock waiter
      is in kernel mode. That assumption doesn't hold for IPI receivers, which
      can be in either kernel or user mode, so a vCPU candidate in user mode is
      not boosted even when it should respond to an IPI. Some benchmarks such
      as pbzip2 and swaptions do the TLB shootdown in kernel mode while running
      in user mode most of the time. This can lead to a long run of continuous
      PLE events, because the IPI sender causes PLE events repeatedly until the
      receiver is scheduled, while the receiver is never a candidate for a
      boost.
      
      This patch boosts a vCPU candidate in user mode to which an interrupt is
      being delivered. We observe the speed of pbzip2 improving by 10% in a
      96-vCPU VM in an over-subscription scenario (the host machine is a
      2-socket, 48-core, 96-HT Intel CLX box). There is no performance
      regression for other benchmarks such as Unixbench spawn (which mostly
      contends a read/write lock in kernel mode) or ebizzy (which mostly
      contends a read/write semaphore and does TLB shootdowns in kernel mode).
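      One plausible shape of the change, sketched against the candidate loop
      in kvm_vcpu_on_spin(); the hook name kvm_arch_dy_has_pending_interrupt()
      is an assumption:

         /*
          * Skip a preempted user-mode vCPU only if it also has no pending
          * interrupt to handle; an IPI receiver in user mode is now a valid
          * boost candidate.
          */
         if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
             !kvm_arch_dy_has_pending_interrupt(vcpu) &&
             !kvm_arch_vcpu_in_kernel(vcpu))
                 continue;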
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1618542490-14756-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Support KVM VMs sharing SEV context · 54526d1f
      Authored by Nathan Tempelman
      Add a capability for userspace to mirror SEV encryption context from
      one VM to another. On our side, this is intended to support a
      Migration Helper vCPU, but it can also be used generically to support
      other in-guest workloads scheduled by the host. The intention is for
      the primary guest and the mirror to have nearly identical memslots.
      
      The primary benefits of this are that:
      1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
      can't accidentally clobber each other.
      2) The VMs can have different memory views, which is necessary for post-copy
      migration (the migration vCPUs on the target need to read and write to
      pages, when the primary guest would VMEXIT).
      
      This does not change the threat model for AMD SEV. Any memory involved
      is still owned by the primary guest and its initial state is still
      attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
      to circumvent SEV, it could achieve the same effect by simply attaching
      a vCPU to the primary VM.
      
      This patch deliberately leaves userspace in charge of the memslots for the
      mirror, as it already has the power to mess with them in the primary guest.
      
      This patch does not support SEV-ES (much less SNP), as it does not
      handle handing off attested VMSAs to the mirror.
      
      For additional context, we need a Migration Helper because SEV PSP
      migration is far too slow for our live migration on its own. Using
      an in-guest migrator lets us speed this up significantly.
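      From userspace, the flow would look roughly like this. A hedged sketch:
      KVM_CAP_VM_COPY_ENC_CONTEXT_FROM is the capability name used by this
      series, and the two VM file descriptors are illustrative:

         #include <sys/ioctl.h>
         #include <linux/kvm.h>

         /*
          * Share the SEV encryption context of an existing primary VM with a
          * freshly created mirror VM.  Memslots for the mirror stay entirely
          * under userspace control.
          */
         static int mirror_sev_context(int mirror_vm_fd, int primary_vm_fd)
         {
                 struct kvm_enable_cap cap = {
                         .cap = KVM_CAP_VM_COPY_ENC_CONTEXT_FROM,
                         .args = { (__u64)primary_vm_fd },
                 };

                 return ioctl(mirror_vm_fd, KVM_ENABLE_CAP, &cap);
         }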
      Signed-off-by: Nathan Tempelman <natet@google.com>
      Message-Id: <20210408223214.2582277-1-natet@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 20 Apr 2021, 4 commits
  4. 17 Apr 2021, 2 commits
  5. 01 Apr 2021, 3 commits
    • KVM: x86: Prevent 'hv_clock->system_time' from going negative in kvm_guest_time_update() · 77fcbe82
      Authored by Vitaly Kuznetsov
      When guest time is reset with KVM_SET_CLOCK(0), it is possible for
      'hv_clock->system_time' to become a small negative number. This happens
      because the KVM_SET_CLOCK handler sets 'kvm->arch.kvmclock_offset' based
      on get_kvmclock_ns(kvm), but when KVM_REQ_CLOCK_UPDATE is handled,
      kvm_guest_time_update() does (in the masterclock-in-use case):
      
      hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
      
      'master_kernel_ns' represents the last time the masterclock was updated,
      which can precede the KVM_SET_CLOCK() call. Normally this is not a
      problem, as the difference is very small, e.g. I'm observing
      hv_clock.system_time = -70 ns. The issue comes from the fact that
      'hv_clock.system_time' is stored as unsigned, so 'system_time / 100' in
      compute_tsc_page_parameters() becomes a very large number.
      
      Use 'master_kernel_ns' instead of get_kvmclock_ns() when the masterclock
      is in use, and get_kvmclock_base_ns() when it is not, to prevent
      'system_time' from going negative.
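      A sketch of the resulting KVM_SET_CLOCK logic (simplified, locking
      elided; 'ka' is the kvm->arch shorthand used in the formula above):

         u64 now_raw_ns;

         /*
          * Use the same time base that kvm_guest_time_update() will use,
          * so the computed offset can never push system_time negative.
          */
         if (kvm->arch.use_master_clock)
                 now_raw_ns = ka->master_kernel_ns;
         else
                 now_raw_ns = get_kvmclock_base_ns();
         ka->kvmclock_offset = user_ns.clock - now_raw_ns;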
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken · a83829f5
      Authored by Paolo Bonzini
      pvclock_gtod_sync_lock can be taken with interrupts disabled if the
      preempt notifier calls get_kvmclock_ns to update the Xen
      runstate information:
      
         spin_lock include/linux/spinlock.h:354 [inline]
         get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
         kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
         kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
         kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
         kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062
      
      So change the users of the spinlock to spin_lock_irqsave and
      spin_unlock_irqrestore.
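      The resulting pattern at each user of the lock, sketched:

         unsigned long flags;

         /*
          * Safe both from process context and from the preempt notifier,
          * which runs with interrupts already disabled.
          */
         spin_lock_irqsave(&kvm->arch.pvclock_gtod_sync_lock, flags);
         /* ... read or update the kvmclock/masterclock state ... */
         spin_unlock_irqrestore(&kvm->arch.pvclock_gtod_sync_lock, flags);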
      
      Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
      Fixes: 30b5c851 ("KVM: x86/xen: Add support for vCPU runstate information")
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: reduce pvclock_gtod_sync_lock critical sections · c2c647f9
      Authored by Paolo Bonzini
      There is no need to include changes to vcpu->requests into
      the pvclock_gtod_sync_lock critical section.  The changes to
      the shared data structures (in pvclock_update_vm_gtod_copy)
      already occur under the lock.
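      In outline (a heavily simplified sketch of the reshaped code):

         spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
         pvclock_update_vm_gtod_copy(kvm);       /* shared data: needs the lock */
         spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);

         /* Setting vcpu->requests bits is safe outside the lock. */
         kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);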
      
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 31 Mar 2021, 1 commit
  7. 19 Mar 2021, 2 commits
    • KVM: X86: Fix missing local pCPU when executing wbinvd on all dirty pCPUs · c2162e13
      Authored by Wanpeng Li
      In order to deal with noncoherent DMA, we should execute wbinvd on
      all dirty pCPUs when a guest wbinvd exits, to maintain data consistency.
      smp_call_function_many() does not execute the provided function on the
      local core, so replace it with on_each_cpu_mask().
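      The change itself is essentially a one-liner, sketched in context
      (wbinvd_ipi() being the pre-existing callback that executes wbinvd):

         /*
          * Unlike smp_call_function_many(), on_each_cpu_mask() also invokes
          * the callback on the local CPU when it is part of the mask.
          */
         on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1);
         cpumask_clear(vcpu->arch.wbinvd_dirty_mask);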
      Reported-by: Nadav Amit <namit@vmware.com>
      Cc: Nadav Amit <namit@vmware.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Protect userspace MSR filter with SRCU, and set atomically-ish · b318e8de
      Authored by Sean Christopherson
      Fix a plethora of issues with MSR filtering by installing the resulting
      filter as an atomic bundle instead of updating the live filter one range
      at a time.  The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
      the hardware MSR bitmaps won't be updated until the next VM-Enter, but
      the relevant software struct is atomically updated, which is what KVM
      really needs.
      
      Similar to the approach used for modifying memslots, make arch.msr_filter
      an SRCU-protected pointer, do all the work configuring the new filter
      outside of kvm->lock, and then acquire kvm->lock only when the new filter
      has been vetted and created.  That way vCPU readers either see the old
      filter or the new filter in their entirety, not some half-baked state.
      
      Yuan Yao pointed out[*] a use-after-free in kvm_msr_allowed() due to a
      TOCTOU bug, but that's just the tip of the iceberg...
      
        - Nothing is __rcu annotated, making it nigh impossible to audit the
          code for correctness.
        - kvm_add_msr_filter() has an unpaired smp_wmb().  Violation of kernel
          coding style aside, the lack of an smp_rmb() anywhere casts all code
          into doubt.
        - kvm_clear_msr_filter() has a double-free TOCTOU bug, as it grabs
          count before taking the lock.
        - kvm_clear_msr_filter() also has a memory leak due to the same TOCTOU bug.
      
      The entire approach of updating the live filter is also flawed.  While
      installing a new filter is inherently racy if vCPUs are running, fixing
      the above issues also makes it trivial to ensure certain behavior is
      deterministic, e.g. KVM can provide deterministic behavior for MSRs with
      identical settings in the old and new filters.  An atomic update of the
      filter also prevents KVM from getting into a half-baked state, e.g. if
      installing a filter fails, the existing approach would leave the filter
      in a half-baked state, having already committed whatever bits of the
      filter were already processed.
      
      [*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com
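      The atomic-publish pattern described above, as a hedged sketch; the
      kvm_free_msr_filter() helper name is illustrative:

         static void install_msr_filter(struct kvm *kvm,
                                        struct kvm_x86_msr_filter *new)
         {
                 struct kvm_x86_msr_filter *old;

                 /* 'new' was fully built and vetted outside kvm->lock. */
                 mutex_lock(&kvm->lock);
                 old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                                           mutex_is_locked(&kvm->lock));
                 mutex_unlock(&kvm->lock);

                 /*
                  * Readers walk the filter under kvm->srcu: wait them out
                  * before freeing the old copy.
                  */
                 synchronize_srcu(&kvm->srcu);
                 kvm_free_msr_filter(old);
         }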
      
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210316184436.2544875-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 17 Mar 2021, 1 commit
    • KVM: x86: hyper-v: Prevent using not-yet-updated TSC page by secondary CPUs · e880c6ea
      Authored by Vitaly Kuznetsov
      When a KVM_REQ_MASTERCLOCK_UPDATE request is issued (e.g. after migration)
      we need to make sure no vCPU sees stale values in PV clock structures, and
      thus all vCPUs are kicked with KVM_REQ_CLOCK_UPDATE. The Hyper-V TSC page
      clocksource, however, is global, and kvm_guest_time_update() only updates
      it on vCPU0. This is not entirely correct: nothing blocks another vCPU
      from entering the guest before the update finishes on vCPU0, and it can
      read stale values from the page.
      
      Invalidate TSC page in kvm_gen_update_masterclock() to switch all vCPUs
      to using MSR based clocksource (HV_X64_MSR_TIME_REF_COUNT).
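      A sketch of the invalidation, simplified and with field/helper names
      assumed from this series; a zero sequence number is what marks the TSC
      reference page invalid for the guest:

         void kvm_hv_invalidate_tsc_page(struct kvm *kvm)
         {
                 struct kvm_hv *hv = to_kvm_hv(kvm);
                 u64 gfn = hv->hv_tsc_page >> HV_X64_MSR_TSC_REFERENCE_ADDRESS_SHIFT;

                 /*
                  * Sequence 0 invalidates the page; guests fall back to
                  * HV_X64_MSR_TIME_REF_COUNT until it is re-populated.
                  */
                 hv->tsc_ref.tsc_sequence = 0;
                 kvm_write_guest(kvm, gfn_to_gpa(gfn), &hv->tsc_ref,
                                 sizeof(hv->tsc_ref.tsc_sequence));
         }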
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210316143736.964151-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 15 Mar 2021, 8 commits
  10. 06 Mar 2021, 1 commit
  11. 03 Mar 2021, 2 commits
  12. 26 Feb 2021, 1 commit
  13. 19 Feb 2021, 5 commits
  14. 17 Feb 2021, 1 commit
  15. 09 Feb 2021, 5 commits
    • KVM: x86: hyper-v: Allocate Hyper-V context lazily · fc08b628
      Authored by Vitaly Kuznetsov
      Hyper-V context is only needed for guests which use Hyper-V emulation in
      KVM (e.g. Windows/Hyper-V guests), so we don't actually need to allocate
      it in kvm_arch_vcpu_create(); we can postpone the allocation until
      Hyper-V specific MSRs are accessed or SynIC is enabled.
      
      Once allocated, let's keep the context alive for the lifetime of the vCPU
      as an attempt to free it would require additional synchronization with
      other vCPUs and normally it is not supposed to happen.
      
      Note, Hyper-V style hypercall enablement is done by writing to
      HV_X64_MSR_GUEST_OS_ID so we don't need to worry about allocating Hyper-V
      context from kvm_hv_hypercall().
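      A sketch of the lazy path; the helper name kvm_hv_vcpu_init() is an
      assumption:

         static int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
         {
                 struct kvm_vcpu_hv *hv_vcpu;

                 /* Already allocated: keep it for the vCPU's lifetime. */
                 if (vcpu->arch.hyperv)
                         return 0;

                 hv_vcpu = kzalloc(sizeof(*hv_vcpu), GFP_KERNEL_ACCOUNT);
                 if (!hv_vcpu)
                         return -ENOMEM;

                 vcpu->arch.hyperv = hv_vcpu;
                 return 0;
         }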
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-15-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Make Hyper-V emulation enablement conditional · 8f014550
      Authored by Vitaly Kuznetsov
      Hyper-V emulation is enabled in KVM unconditionally. This is bad at least
      from a security standpoint, as it is extra attack surface. Ideally, there
      should be a per-VM capability explicitly enabled by the VMM, but currently
      that is not the case, and we can't mandate one without breaking backwards
      compatibility. We can, however, check guest-visible CPUIDs and only enable
      Hyper-V emulation when the "Hv#1" interface has been exposed in
      HYPERV_CPUID_INTERFACE.
      
      Note, VMMs are free to act in any sequence they like, e.g. they can try
      to set MSRs first and CPUIDs later so we still need to allow the host
      to read/write Hyper-V specific MSRs unconditionally.
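      A sketch of the CPUID gate; 0x31237648 is the little-endian ASCII of
      "Hv#1", and the exact helper/field names are assumptions:

         static void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu)
         {
                 struct kvm_cpuid_entry2 *entry;

                 entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_INTERFACE, 0);
                 /* Enable Hyper-V emulation only if the VMM exposed "Hv#1". */
                 vcpu->arch.hyperv_enabled = entry &&
                                             entry->eax == 0x31237648 /* "Hv#1" */;
         }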
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-14-vkuznets@redhat.com>
      [Add selftest vcpu_set_hv_cpuid API to avoid breaking xen_vmcall_test. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Allocate 'struct kvm_vcpu_hv' dynamically · 4592b7ea
      Authored by Vitaly Kuznetsov
      Hyper-V context is only needed for guests which use Hyper-V emulation in
      KVM (e.g. Windows/Hyper-V guests). 'struct kvm_vcpu_hv' is, however, quite
      big: it accounts for more than 1/4 of the total 'struct kvm_vcpu_arch',
      which is itself already quite big. This all looks like a waste.
      
      Allocate 'struct kvm_vcpu_hv' dynamically. This patch does not bring any
      (intentional) functional change as we still allocate the context
      unconditionally but it paves the way to doing that only when needed.
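      The layout change, sketched:

         struct kvm_vcpu_hv;     /* big: more than 1/4 of kvm_vcpu_arch */

         struct kvm_vcpu_arch {
                 /* ... */
                 /* was: struct kvm_vcpu_hv hyperv; (always embedded) */
                 struct kvm_vcpu_hv *hyperv;     /* allocated on demand */
                 /* ... */
         };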
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-13-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Prepare to meet unallocated Hyper-V context · f2bc14b6
      Authored by Vitaly Kuznetsov
      Currently, Hyper-V context is part of 'struct kvm_vcpu_arch' and is always
      available. As a preparation to allocating it dynamically, check that it is
      not NULL at call sites which can normally proceed without it, i.e. where
      the behavior is identical to the situation when Hyper-V emulation is not
      being used by the guest.
      
      When Hyper-V context for a particular vCPU is not allocated, we may still
      need to get 'vp_index' from there. E.g. in a hypothetical situation when
      Hyper-V emulation was enabled on one CPU and wasn't on another, Hyper-V
      style send-IPI hypercall may still be used. Luckily, vp_index is always
      initialized to kvm_vcpu_get_idx() and can only be changed when Hyper-V
      context is present. Introduce kvm_hv_get_vpindex() helper for
      simplification.
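      The helper, roughly (per the description above):

         static inline u32 kvm_hv_get_vpindex(struct kvm_vcpu *vcpu)
         {
                 struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

                 /*
                  * vp_index defaults to the vCPU index and can only change
                  * once a Hyper-V context exists.
                  */
                 return hv_vcpu ? hv_vcpu->vp_index : kvm_vcpu_get_idx(vcpu);
         }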
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-12-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyper-v: Always use to_hv_vcpu() accessor to get to 'struct kvm_vcpu_hv' · 9ff5e030
      Authored by Vitaly Kuznetsov
      As a preparation to allocating Hyper-V context dynamically, make it clear
      who the user of said context is.
      
      No functional change intended.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210126134816.1880136-11-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>