提交 · 1edce0a9eb239b7a2768d10cf7d21dd83e269d5b · openeuler / Kernel

11 9月, 2019 6 次提交

KVM: x86: Add kvm_emulate_{rd,wr}msr() to consolidate VXM/SVM code · 1edce0a9

由 Sean Christopherson 提交于 9月 05, 2019

Move RDMSR and WRMSR emulation into common x86 code to consolidate
nearly identical SVM and VMX code.

Note, consolidating RDMSR introduces an extra indirect call, i.e.
retpoline, due to reaching {svm,vmx}_get_msr() via kvm_x86_ops, but a
guest kernel likely has bigger problems if increasing the latency of
RDMSR VM-Exits by ~70 cycles has a measurable impact on overall VM
performance.  E.g. the only recurring RDMSR VM-Exits (after booting) on
my system running Linux 5.2 in the guest are for MSR_IA32_TSC_ADJUST via
arch_cpu_idle_enter().

No functional change intended.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1edce0a9

KVM: x86: Refactor up kvm_{g,s}et_msr() to simplify callers · f20935d8

由 Sean Christopherson 提交于 9月 05, 2019

Refactor the top-level MSR accessors to take/return the index and value
directly instead of requiring the caller to dump them into a msr_data
struct.

No functional change intended.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f20935d8

KVM: X86: Tune PLE Window tracepoint · 4f75bcc3

由 Peter Xu 提交于 9月 06, 2019

The PLE window tracepoint triggers even if the window is not changed,
and the wording can be a bit confusing too.  One example line:

  kvm_ple_window: vcpu 0: ple_window 4096 (shrink 4096)

It easily let people think of "the window now is 4096 which is
shrinked", but the truth is the value actually didn't change (4096).

Let's only dump this message if the value really changed, and we make
the message even simpler like:

  kvm_ple_window: vcpu 4 old 4096 new 8192 (growed)
Signed-off-by: NPeter Xu <peterx@redhat.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4f75bcc3

KVM: VMX: Change ple_window type to unsigned int · c5c5d6fa

由 Peter Xu 提交于 9月 06, 2019

The VMX ple_window is 32 bits wide, so logically it can overflow with
an int.  The module parameter is declared as unsigned int which is
good, however the dynamic variable is not.  Switching all the
ple_window references to use unsigned int.

The tracepoint changes will also affect SVM, but SVM is using an even
smaller width (16 bits) so it's always fine.
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c5c5d6fa

KVM: X86: Remove tailing newline for tracepoints · 13a7e370

由 Peter Xu 提交于 9月 06, 2019

It's done by TP_printk() already.
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

13a7e370

KVM: X86: Trace vcpu_id for vmexit · d94fdcd7

由 Peter Xu 提交于 9月 06, 2019

Tracing the ID helps to pair vmenters and vmexits for guests with
multiple vCPUs.
Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d94fdcd7

10 9月, 2019 2 次提交

KVM: x86: Manually calculate reserved bits when loading PDPTRS · 16cfacc8

由 Sean Christopherson 提交于 9月 03, 2019

Manually generate the PDPTR reserved bit mask when explicitly loading
PDPTRs.  The reserved bits that are being tracked by the MMU reflect the
current paging mode, which is unlikely to be PAE paging in the vast
majority of flows that use load_pdptrs(), e.g. CR0 and CR4 emulation,
__set_sregs(), etc...  This can cause KVM to incorrectly signal a bad
PDPTR, or more likely, miss a reserved bit check and subsequently fail
a VM-Enter due to a bad VMCS.GUEST_PDPTR.

Add a one off helper to generate the reserved bits instead of sharing
code across the MMU's calculations and the PDPTR emulation.  The PDPTR
reserved bits are basically set in stone, and pushing a helper into
the MMU's calculation adds unnecessary complexity without improving
readability.

Oppurtunistically fix/update the comment for load_pdptrs().

Note, the buggy commit also introduced a deliberate functional change,
"Also remove bit 5-6 from rsvd_bits_mask per latest SDM.", which was
effectively (and correctly) reverted by commit cd9ae5fe ("KVM: x86:
Fix page-tables reserved bits").  A bit of SDM archaeology shows that
the SDM from late 2008 had a bug (likely a copy+paste error) where it
listed bits 6:5 as AVL and A for PDPTEs used for 4k entries but reserved
for 2mb entries.  I.e. the SDM contradicted itself, and bits 6:5 are and
always have been reserved.

Fixes: 20c466b5 ("KVM: Use rsvd_bits_mask in load_pdptrs()")
Cc: stable@vger.kernel.org
Cc: Nadav Amit <nadav.amit@gmail.com>
Reported-by: NDoug Reiland <doug.reiland@intel.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

16cfacc8

KVM: x86: Disable posted interrupts for non-standard IRQs delivery modes · fdcf7562

由 Alexander Graf 提交于 9月 05, 2019

We can easily route hardware interrupts directly into VM context when
they target the "Fixed" or "LowPriority" delivery modes.

However, on modes such as "SMI" or "Init", we need to go via KVM code
to actually put the vCPU into a different mode of operation, so we can
not post the interrupt

Add code in the VMX and SVM PI logic to explicitly refuse to establish
posted mappings for advanced IRQ deliver modes. This reflects the logic
in __apic_accept_irq() which also only ever passes Fixed and LowPriority
interrupts as posted interrupts into the guest.

This fixes a bug I have with code which configures real hardware to
inject virtual SMIs into my guest.
Signed-off-by: NAlexander Graf <graf@amazon.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NWanpeng Li <wanpengli@tencent.com>
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fdcf7562

22 8月, 2019 18 次提交

KVM: VMX: Fix and tweak the comments for VM-Enter · b6852ae7

由 Sean Christopherson 提交于 8月 15, 2019

Fix an incorrect/stale comment regarding the vmx_vcpu pointer, as guest
registers are now loaded using a direct pointer to the start of the
register array.

Opportunistically add a comment to document why the vmx_vcpu pointer is
needed, its consumption via 'call vmx_update_host_rsp' is rather subtle.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b6852ae7

KVM: Assert that struct kvm_vcpu is always as offset zero · 12b58f4e

由 Sean Christopherson 提交于 8月 15, 2019

KVM implementations that wrap struct kvm_vcpu with a vendor specific
struct, e.g. struct vcpu_vmx, must place the vcpu member at offset 0,
otherwise the usercopy region intended to encompass struct kvm_vcpu_arch
will instead overlap random chunks of the vendor specific struct.
E.g. padding a large number of bytes before struct kvm_vcpu triggers
a usercopy warn when running with CONFIG_HARDENED_USERCOPY=y.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

12b58f4e

KVM: X86: Add pv tlb shootdown tracepoint · b382f44e

由 Wanpeng Li 提交于 8月 05, 2019

Add pv tlb shootdown tracepoint.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b382f44e

KVM: x86: Unconditionally call x86 ops that are always implemented · 92735b1b

由 Sean Christopherson 提交于 8月 02, 2019

Remove a few stale checks for non-NULL ops now that the ops in question
are implemented by both VMX and SVM.

Note, this is **not** stable material, the Fixes tags are there purely
to show when a particular op was first supported by both VMX and SVM.

Fixes: 74f16909 ("kvm/svm: Setup MCG_CAP on AMD properly")
Fixes: b31c114b ("KVM: X86: Provide a capability to disable PAUSE intercepts")
Fixes: 411b44ba ("svm: Implements update_pi_irte hook to setup posted interrupt")
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

92735b1b

KVM: x86/mmu: Consolidate "is MMIO SPTE" code · 26c44a63

由 Sean Christopherson 提交于 8月 01, 2019

Replace the open-coded "is MMIO SPTE" checks in the MMU warnings
related to software-based access/dirty tracking to make the code
slightly more self-documenting.

No functional change intended.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

26c44a63

KVM: x86/mmu: Add explicit access mask for MMIO SPTEs · 4af77151

由 Sean Christopherson 提交于 8月 01, 2019

When shadow paging is enabled, KVM tracks the allowed access type for
MMIO SPTEs so that it can do a permission check on a MMIO GVA cache hit
without having to walk the guest's page tables.  The tracking is done
by retaining the WRITE and USER bits of the access when inserting the
MMIO SPTE (read access is implicitly allowed), which allows the MMIO
page fault handler to retrieve and cache the WRITE/USER bits from the
SPTE.

Unfortunately for EPT, the mask used to retain the WRITE/USER bits is
hardcoded using the x86 paging versions of the bits.  This funkiness
happens to work because KVM uses a completely different mask/value for
MMIO SPTEs when EPT is enabled, and the EPT mask/value just happens to
overlap exactly with the x86 WRITE/USER bits[*].

Explicitly define the access mask for MMIO SPTEs to accurately reflect
that EPT does not want to incorporate any access bits into the SPTE, and
so that KVM isn't subtly relying on EPT's WX bits always being set in
MMIO SPTEs, e.g. attempting to use other bits for experimentation breaks
horribly.

Note, vcpu_match_mmio_gva() explicits prevents matching GVA==0, and all
TDP flows explicit set mmio_gva to 0, i.e. zeroing vcpu->arch.access for
EPT has no (known) functional impact.

[*] Using WX to generate EPT misconfigurations (equivalent to reserved
    bit page fault) ensures KVM can employ its MMIO page fault tricks
    even platforms without reserved address bits.

Fixes: ce88decf ("KVM: MMU: mmio page fault support")
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4af77151

KVM: x86: Rename access permissions cache member in struct kvm_vcpu_arch · 871bd034

由 Sean Christopherson 提交于 8月 01, 2019

Rename "access" to "mmio_access" to match the other MMIO cache members
and to make it more obvious that it's tracking the access permissions
for the MMIO cache.
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

871bd034

x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception() · c8e16b78

由 Vitaly Kuznetsov 提交于 8月 13, 2019

Just like we do with other intercepts, in vmrun_interception() we should be
doing kvm_skip_emulated_instruction() and not just RIP += 3. Also, it is
wrong to increment RIP before nested_svm_vmrun() as it can result in
kvm_inject_gp().

We can't call kvm_skip_emulated_instruction() after nested_svm_vmrun() so
move it inside.
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c8e16b78

x86: KVM: svm: eliminate weird goto from vmrun_interception() · e7134c1b

由 Vitaly Kuznetsov 提交于 8月 13, 2019

Regardless of whether or not nested_svm_vmrun_msrpm() fails, we return 1
from vmrun_interception() so there's no point in doing goto. Also,
nested_svm_vmrun_msrpm() call can be made from nested_svm_vmrun() where
other nested launch issues are handled.

nested_svm_vmrun() returns a bool, however, its result is ignored in
vmrun_interception() as we always return '1'. As a preparatory change
to putting kvm_skip_emulated_instruction() inside nested_svm_vmrun()
make nested_svm_vmrun() return an int (always '1' for now).
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e7134c1b

x86: KVM: svm: remove hardcoded instruction length from intercepts · c4762fda

由 Vitaly Kuznetsov 提交于 8月 13, 2019

Various intercepts hard-code the respective instruction lengths to optimize
skip_emulated_instruction(): when next_rip is pre-set we skip
kvm_emulate_instruction(vcpu, EMULTYPE_SKIP). The optimization is, however,
incorrect: different (redundant) prefixes could be used to enlarge the
instruction. We can't really avoid decoding.

svm->next_rip is not used when CPU supports 'nrips' (X86_FEATURE_NRIPS)
feature: next RIP is provided in VMCB. The feature is not really new
(Opteron G3s had it already) and the change should have zero affect.

Remove manual svm->next_rip setting with hard-coded instruction lengths.
The only case where we now use svm->next_rip is EXIT_IOIO: the instruction
length is provided to us by hardware.

Hardcoded RIP advancement remains in vmrun_interception(), this is going to
be taken care of separately.
Reported-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c4762fda

x86: KVM: add xsetbv to the emulator · 02d4160f

由 Vitaly Kuznetsov 提交于 8月 13, 2019

To avoid hardcoding xsetbv length to '3' we need to support decoding it in
the emulator.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

02d4160f

x86: KVM: clear interrupt shadow on EMULTYPE_SKIP · 97413d29

由 Vitaly Kuznetsov 提交于 8月 13, 2019

When doing x86_emulate_instruction(EMULTYPE_SKIP) interrupt shadow has to
be cleared if and only if the skipping is successful.

There are two immediate issues:
- In SVM skip_emulated_instruction() we are not zapping interrupt shadow
  in case kvm_emulate_instruction(EMULTYPE_SKIP) is used to advance RIP
  (!nrpip_save).
- In VMX handle_ept_misconfig() when running as a nested hypervisor we
  (static_cpu_has(X86_FEATURE_HYPERVISOR) case) forget to clear interrupt
  shadow.

Note that we intentionally don't handle the case when the skipped
instruction is supposed to prolong the interrupt shadow ("MOV/POP SS") as
skip-emulation of those instructions should not happen under normal
circumstances.
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

97413d29

x86: kvm: svm: propagate errors from skip_emulated_instruction() · f8ea7c60

由 Vitaly Kuznetsov 提交于 8月 13, 2019

On AMD, kvm_x86_ops->skip_emulated_instruction(vcpu) can, in theory,
fail: in !nrips case we call kvm_emulate_instruction(EMULTYPE_SKIP).
Currently, we only do printk(KERN_DEBUG) when this happens and this
is not ideal. Propagate the error up the stack.

On VMX, skip_emulated_instruction() doesn't fail, we have two call
sites calling it explicitly: handle_exception_nmi() and
handle_task_switch(), we can just ignore the result.

On SVM, we also have two explicit call sites:
svm_queue_exception() and it seems we don't need to do anything there as
we check if RIP was advanced or not. In task_switch_interception(),
however, we are better off not proceeding to kvm_task_switch() in case
skip_emulated_instruction() failed.
Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f8ea7c60

x86: KVM: svm: don't pretend to advance RIP in case wrmsr_interception() results in #GP · 05402f64

由 Vitaly Kuznetsov 提交于 8月 13, 2019

svm->next_rip is only used by skip_emulated_instruction() and in case
kvm_set_msr() fails we rightfully don't do that. Move svm->next_rip
advancement to 'else' branch to avoid creating false impression that
it's always advanced (and make it look like rdmsr_interception()).

This is a preparatory change to removing hardcoded RIP advancement
from instruction intercepts, no functional change.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

05402f64

KVM: x86: Fix x86_decode_insn() return when fetching insn bytes fails · 016cd759

由 Sean Christopherson 提交于 8月 15, 2019

Jump to the common error handling in x86_decode_insn() if
__do_insn_fetch_bytes() fails so that its error code is converted to the
appropriate return type.  Although the various helpers used by
x86_decode_insn() return X86EMUL_* values, x86_decode_insn() itself
returns EMULATION_FAILED or EMULATION_OK.

This doesn't cause a functional issue as the sole caller,
x86_emulate_instruction(), currently only cares about success vs.
failure, and success is indicated by '0' for both types
(X86EMUL_CONTINUE and EMULATION_OK).

Fixes: 285ca9e9 ("KVM: emulate: speed up do_insn_fetch")
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

016cd759

KVM: x86: use Intel speculation bugs and features as derived in generic x86 code · 0c54914d

由 Paolo Bonzini 提交于 8月 19, 2019

Similar to AMD bits, set the Intel bits from the vendor-independent
feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care
about the vendor and they should be set on AMD processors as well.
Suggested-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0c54914d

KVM: x86: always expose VIRT_SSBD to guests · 50896de4

由 Paolo Bonzini 提交于 8月 15, 2019

Even though it is preferrable to use SPEC_CTRL (represented by
X86_FEATURE_AMD_SSBD) instead of VIRT_SPEC, VIRT_SPEC is always
supported anyway because otherwise it would be impossible to
migrate from old to new CPUs.  Make this apparent in the
result of KVM_GET_SUPPORTED_CPUID as well.

However, we need to hide the bit on Intel processors, so move
the setting to svm_set_supported_cpuid.

Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Reported-by: NEduardo Habkost <ehabkost@redhat.com>
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

50896de4

KVM: x86: fix reporting of AMD speculation bug CPUID leaf · 4c6903a0

由 Paolo Bonzini 提交于 8月 14, 2019

The AMD_* bits have to be set from the vendor-independent
feature and bug flags, because KVM_GET_SUPPORTED_CPUID does not care
about the vendor and they should be set on Intel processors as well.
On top of this, SSBD, STIBP and AMD_SSB_NO bit were not set, and
VIRT_SSBD does not have to be added manually because it is a
cpufeature that comes directly from the host's CPUID bit.
Reviewed-by: NJim Mattson <jmattson@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4c6903a0

21 8月, 2019 1 次提交

Revert "KVM: x86/mmu: Zap only the relevant pages when removing a memslot" · d012a06a

由 Paolo Bonzini 提交于 8月 15, 2019

This reverts commit 4e103134.
Alex Williamson reported regressions with device assignment with
this patch.  Even though the bug is probably elsewhere and still
latent, this is needed to fix the regression.

Fixes: 4e103134 ("KVM: x86/mmu: Zap only the relevant pages when removing a memslot", 2019-02-05)
Reported-by: NAlex Willamson <alex.williamson@redhat.com>
Cc: stable@vger.kernel.org
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d012a06a

14 8月, 2019 2 次提交

KVM: x86: svm: remove redundant assignment of var new_entry · c8e174b3

由 Miaohe Lin 提交于 8月 12, 2019

new_entry is reassigned a new value next line. So
it's redundant and remove it.
Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c8e174b3

kvm: x86: skip populating logical dest map if apic is not sw enabled · b14c876b

由 Radim Krcmar 提交于 8月 13, 2019

recalculate_apic_map does not santize ldr and it's possible that
multiple bits are set. In that case, a previous valid entry
can potentially be overwritten by an invalid one.

This condition is hit when booting a 32 bit, >8 CPU, RHEL6 guest and then
triggering a crash to boot a kdump kernel. This is the sequence of
events:
1. Linux boots in bigsmp mode and enables PhysFlat, however, it still
writes to the LDR which probably will never be used.
2. However, when booting into kdump, the stale LDR values remain as
they are not cleared by the guest and there isn't a apic reset.
3. kdump boots with 1 cpu, and uses Logical Destination Mode but the
logical map has been overwritten and points to an inactive vcpu.
Signed-off-by: NRadim Krcmar <rkrcmar@redhat.com>
Signed-off-by: NBandan Das <bsd@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b14c876b

05 8月, 2019 4 次提交

KVM: no need to check return value of debugfs_create functions · 3e7093d0

由 Greg KH 提交于 7月 31, 2019

When calling debugfs functions, there is no need to ever check the
return value.  The function can work or not, but the code logic should
never do something different based on this.

Also, when doing this, change kvm_arch_create_vcpu_debugfs() to return
void instead of an integer, as we should not care at all about if this
function actually does anything or not.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <x86@kernel.org>
Cc: <kvm@vger.kernel.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3e7093d0

KVM: remove kvm_arch_has_vcpu_debugfs() · 741cbbae

由 Paolo Bonzini 提交于 8月 03, 2019

There is no need for this function as all arches have to implement
kvm_arch_create_vcpu_debugfs() no matter what.  A #define symbol
let us actually simplify the code.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

741cbbae

KVM: Fix leak vCPU's VMCS value into other pCPU · 17e433b5

由 Wanpeng Li 提交于 8月 05, 2019

After commit d73eb57b (KVM: Boost vCPUs that are delivering interrupts), a
five years old bug is exposed. Running ebizzy benchmark in three 80 vCPUs VMs
on one 80 pCPUs Skylake server, a lot of rcu_sched stall warning splatting
in the VMs after stress testing:

 INFO: rcu_sched detected stalls on CPUs/tasks: { 4 41 57 62 77} (detected by 15, t=60004 jiffies, g=899, c=898, q=15073)
 Call Trace:
   flush_tlb_mm_range+0x68/0x140
   tlb_flush_mmu.part.75+0x37/0xe0
   tlb_finish_mmu+0x55/0x60
   zap_page_range+0x142/0x190
   SyS_madvise+0x3cd/0x9c0
   system_call_fastpath+0x1c/0x21

swait_active() sustains to be true before finish_swait() is called in
kvm_vcpu_block(), voluntarily preempted vCPUs are taken into account
by kvm_vcpu_on_spin() loop greatly increases the probability condition
kvm_arch_vcpu_runnable(vcpu) is checked and can be true, when APICv
is enabled the yield-candidate vCPU's VMCS RVI field leaks(by
vmx_sync_pir_to_irr()) into spinning-on-a-taken-lock vCPU's current
VMCS.

This patch fixes it by checking conservatively a subset of events.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Marc Zyngier <Marc.Zyngier@arm.com>
Cc: stable@vger.kernel.org
Fixes: 98f4a146 (KVM: add kvm_arch_vcpu_runnable() test to kvm_vcpu_on_spin() loop)
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

17e433b5

KVM: LAPIC: Don't need to wakeup vCPU twice afer timer fire · a48d06f9

由 Wanpeng Li 提交于 8月 01, 2019

kvm_set_pending_timer() will take care to wake up the sleeping vCPU which
has pending timer, don't need to check this in apic_timer_expired() again.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a48d06f9

24 7月, 2019 2 次提交

KVM: X86: Boost queue head vCPU to mitigate lock waiter preemption · 266e85a5

由 Wanpeng Li 提交于 7月 24, 2019

Commit 11752adb (locking/pvqspinlock: Implement hybrid PV queued/unfair locks)
introduces hybrid PV queued/unfair locks
 - queued mode (no starvation)
 - unfair mode (good performance on not heavily contended lock)
The lock waiter goes into the unfair mode especially in VMs with over-commit
vCPUs since increaing over-commitment increase the likehood that the queue
head vCPU may have been preempted and not actively spinning.

However, reschedule queue head vCPU timely to acquire the lock still can get
better performance than just depending on lock stealing in over-subscribe
scenario.

Testing on 80 HT 2 socket Xeon Skylake server, with 80 vCPUs VM 80GB RAM:
ebizzy -M
             vanilla     boosting    improved
 1VM          23520        25040         6%
 2VM           8000        13600        70%
 3VM           3100         5400        74%

The lock holder vCPU yields to the queue head vCPU when unlock, to boost queue
head vCPU which is involuntary preemption or the one which is voluntary halt
due to fail to acquire the lock after a short spin in the guest.

Cc: Waiman Long <longman@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

266e85a5

Documentation: move Documentation/virtual to Documentation/virt · 2f5947df

由 Christoph Hellwig 提交于 7月 24, 2019

Renaming docs seems to be en vogue at the moment, so fix on of the
grossly misnamed directories.  We usually never use "virtual" as
a shortcut for virtualization in the kernel, but always virt,
as seen in the virt/ top-level directory.  Fix up the documentation
to match that.

Fixes: ed16648e ("Move kvm, uml, and lguest subdirectories under a common "virtual" directory, I.E:")
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2f5947df

22 7月, 2019 5 次提交

KVM: nVMX: Set cached_vmcs12 and cached_shadow_vmcs12 NULL after free · c6bf2ae9

由 Jan Kiszka 提交于 7月 21, 2019

Shall help finding use-after-free bugs earlier.
Suggested-by: NLiran Alon <liran.alon@oracle.com>
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c6bf2ae9

KVM: X86: Dynamically allocate user_fpu · d9a710e5

由 Wanpeng Li 提交于 7月 22, 2019

After reverting commit 240c35a3 (kvm: x86: Use task structs fpu field
for user), struct kvm_vcpu is 19456 bytes on my server, PAGE_ALLOC_COSTLY_ORDER(3)
is the order at which allocations are deemed costly to service. In serveless
scenario, one host can service hundreds/thoudands firecracker/kata-container
instances, howerver, new instance will fail to launch after memory is too
fragmented to allocate kvm_vcpu struct on host, this was observed in some
cloud provider product environments.

This patch dynamically allocates user_fpu, kvm_vcpu is 15168 bytes now on my
Skylake server.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d9a710e5

KVM: X86: Fix fpu state crash in kvm guest · e7517324

由 Wanpeng Li 提交于 7月 22, 2019

The idea before commit 240c35a3 (which has just been reverted)
was that we have the following FPU states:

               userspace (QEMU)             guest
---------------------------------------------------------------------------
               processor                    vcpu->arch.guest_fpu
>>> KVM_RUN: kvm_load_guest_fpu
               vcpu->arch.user_fpu          processor
>>> preempt out
               vcpu->arch.user_fpu          current->thread.fpu
>>> preempt in
               vcpu->arch.user_fpu          processor
>>> back to userspace
>>> kvm_put_guest_fpu
               processor                    vcpu->arch.guest_fpu
---------------------------------------------------------------------------

With the new lazy model we want to get the state back to the processor
when schedule in from current->thread.fpu.
Reported-by: NThomas Lambertz <mail@thomaslambertz.de>
Reported-by: Nanthony <antdev66@gmail.com>
Tested-by: Nanthony <antdev66@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Lambertz <mail@thomaslambertz.de>
Cc: anthony <antdev66@gmail.com>
Cc: stable@vger.kernel.org
Fixes: 5f409e20 (x86/fpu: Defer FPU state load until return to userspace)
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
[Add a comment in front of the warning. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e7517324

Revert "kvm: x86: Use task structs fpu field for user" · ec269475

由 Paolo Bonzini 提交于 7月 22, 2019

This reverts commit 240c35a3
("kvm: x86: Use task structs fpu field for user", 2018-11-06).
The commit is broken and causes QEMU's FPU state to be destroyed
when KVM_RUN is preempted.

Fixes: 240c35a3 ("kvm: x86: Use task structs fpu field for user")
Cc: stable@vger.kernel.org
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ec269475

KVM: nVMX: Clear pending KVM_REQ_GET_VMCS12_PAGES when leaving nested · cf64527b

由 Jan Kiszka 提交于 7月 21, 2019

Letting this pend may cause nested_get_vmcs12_pages to run against an
invalid state, corrupting the effective vmcs of L1.

This was triggerable in QEMU after a guest corruption in L2, followed by
a L1 reset.
Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
Reviewed-by: NLiran Alon <liran.alon@oracle.com>
Cc: stable@vger.kernel.org
Fixes: 7f7f1ba3 ("KVM: x86: do not load vmcs12 pages while still in SMM")
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

cf64527b

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功