提交 · aebc3ca19063d68b76bcaaca81558d4f180c61b0 · openeuler / Kernel

24 6月, 2022 6 次提交

KVM: x86: Enable CMCI capability by default and handle injected UCNA errors · aebc3ca1

由 Jue Wang 提交于 6月 10, 2022

This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It
reuses ioctl KVM_X86_SET_MCE to implement injection of UnCorrectable
No Action required (UCNA) errors, signaled via Corrected Machine
Check Interrupt (CMCI).

Neither of the CMCI and UCNA emulations depends on hardware.
Signed-off-by: NJue Wang <juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-8-juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

aebc3ca1

KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs. · 281b5278

由 Jue Wang 提交于 6月 10, 2022

This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
separate mci_ctl2_banks array is used to keep the existing mce_banks
register layout intact.

In Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
Check error reporting is enabled.
Signed-off-by: NJue Wang <juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-7-juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

281b5278

KVM: x86: Use kcalloc to allocate the mce_banks array. · 087acc4e

由 Jue Wang 提交于 6月 10, 2022

This patch updates the allocation of mce_banks with the array allocation
API (kcalloc) as a precedent for the later mci_ctl2_banks to implement
per-bank control of Corrected Machine Check Interrupt (CMCI).
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NJue Wang <juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-6-juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

087acc4e

KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic. · 4b903561

由 Jue Wang 提交于 6月 10, 2022

This patch calculates the number of lvt entries as part of
KVM_X86_MCE_SETUP conditioned on the presence of MCG_CMCI_P bit in
MCG_CAP and stores result in kvm_lapic. It translats from APIC_LVTx
register to index in lapic_lvt_entry enum. It extends the APIC_LVTx
macro as well as other lapic write/reset handling etc to support
Corrected Machine Check Interrupt.
Signed-off-by: NJue Wang <juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20220610171134.772566-5-juew@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

4b903561

KVM: x86/MMU: Allow NX huge pages to be disabled on a per-vm basis · 084cc29f

由 Ben Gardon 提交于 6月 13, 2022

In some cases, the NX hugepage mitigation for iTLB multihit is not
needed for all guests on a host. Allow disabling the mitigation on a
per-VM basis to avoid the performance hit of NX hugepages on trusted
workloads.

In order to disable NX hugepages on a VM, ensure that the userspace
actor has permission to reboot the system. Since disabling NX hugepages
would allow a guest to crash the system, it is similar to reboot
permissions.

Ideally, KVM would require userspace to prove it has access to KVM's
nx_huge_pages module param, e.g. so that userspace can opt out without
needing full reboot permissions.  But getting access to the module param
file info is difficult because it is buried in layers of sysfs and module
glue. Requiring CAP_SYS_BOOT is sufficient for all known use cases.
Suggested-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-9-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

084cc29f

KVM: x86: Fix errant brace in KVM capability handling · 1c4dc573

由 Ben Gardon 提交于 6月 13, 2022

The braces around the KVM_CAP_XSAVE2 block also surround the
KVM_CAP_PMU_CAPABILITY block, likely the result of a merge issue. Simply
move the curly brace back to where it belongs.

Fixes: ba7bb663 ("KVM: x86: Provide per VM capability for disabling PMU virtualization")
Reviewed-by: NDavid Matlack <dmatlack@google.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20220613212523.3436117-8-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1c4dc573

20 6月, 2022 7 次提交

KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior · bfbcc81b

由 Sean Christopherson 提交于 6月 08, 2022

Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
instructions a NOPs regardless of whether or not they are supported in
guest CPUID.  KVM's current behavior was likely motiviated by a certain
fruity operating system that expects MONITOR/MWAIT to be supported
unconditionally and blindly executes MONITOR/MWAIT without first checking
CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
that's effectively the default setup for any VMM that regurgitates
KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.

Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
behavior is actually desirable, as userspace VMMs that want to
unconditionally hide MONITOR/MWAIT from the guest can leave the
MISC_ENABLE quirk enabled.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220608224516.3788274-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bfbcc81b

KVM: x86: Ignore benign host writes to "unsupported" F15H_PERF_CTL MSRs · ff81a90f

由 Sean Christopherson 提交于 6月 11, 2022

Ignore host userspace writes of '0' to F15H_PERF_CTL MSRs KVM reports
in the MSR-to-save list, but the MSRs are ultimately unsupported. All
MSRs in said list must be writable by userspace, e.g. if userspace sends
the list back at KVM without filtering out the MSRs it doesn't need.

Note, reads of said MSRs already have the desired behavior.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-8-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ff81a90f

KVM: x86: Ignore benign host accesses to "unsupported" PEBS and BTS MSRs · 157fc497

由 Sean Christopherson 提交于 6月 11, 2022

Ignore host userspace reads and writes of '0' to PEBS and BTS MSRs that
KVM reports in the MSR-to-save list, but the MSRs are ultimately
unsupported. All MSRs in said list must be writable by userspace, e.g.
if userspace sends the list back at KVM without filtering out the MSRs it
doesn't need.

Fixes: 8183a538 ("KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS")
Fixes: 902caeb6 ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS")
Fixes: c59a1f10 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

157fc497

Revert "KVM: x86: always allow host-initiated writes to PMU MSRs" · 545feb96

由 Sean Christopherson 提交于 6月 11, 2022

Revert the hack to allow host-initiated accesses to all "PMU" MSRs,
as intel_is_valid_msr() returns true for _all_ MSRs, regardless of whether
or not it has a snowball's chance in hell of actually being a PMU MSR.

That mostly gets papered over by the actual get/set helpers only handling
MSRs that they knows about, except there's the minor detail that
kvm_pmu_{g,s}et_msr() eat reads and writes when the PMU is disabled.
I.e. KVM will happy allow reads and writes to _any_ MSR if the PMU is
disabled, either via module param or capability.

This reverts commit d1c88a40.

Fixes: d1c88a40 ("KVM: x86: always allow host-initiated writes to PMU MSRs")
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220611005755.753273-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

545feb96

KVM: x86: Give host userspace full control of MSR_IA32_MISC_ENABLES · 9fc22296

由 Sean Christopherson 提交于 6月 11, 2022

Give userspace full control of the read-only bits in MISC_ENABLES, i.e.
do not modify bits on PMU refresh and do not preserve existing bits when
userspace writes MISC_ENABLES. With a few exceptions where KVM doesn't
expose the necessary controls to userspace _and_ there is a clear cut
association with CPUID, e.g. reserved CR4 bits, KVM does not own the vCPU
and should not manipulate the vCPU model on behalf of "dummy user space".

The argument that KVM is doing userspace a favor because "the order of
setting vPMU capabilities and MSR_IA32_MISC_ENABLE is not strictly
guaranteed" is specious, as attempting to configure MSRs on behalf of
userspace inevitably leads to edge cases precisely because KVM does not
prescribe a specific order of initialization.

Example #1: intel_pmu_refresh() consumes and modifies the vCPU's
MSR_IA32_PERF_CAPABILITIES, and so assumes userspace initializes config
MSRs before setting the guest CPUID model. If userspace sets CPUID
first, then KVM will mark PEBS as available when arch.perf_capabilities
is initialized with a non-zero PEBS format, thus creating a bad vCPU
model if userspace later disables PEBS by writing PERF_CAPABILITIES.

Example #2: intel_pmu_refresh() does not clear PERF_CAP_PEBS_MASK in
MSR_IA32_PERF_CAPABILITIES if there is no vPMU, making KVM inconsistent
in its desire to be consistent.

Example #3: intel_pmu_refresh() does not clear MSR_IA32_MISC_ENABLE_EMON
if KVM_SET_CPUID2 is called multiple times, first with a vPMU, then
without a vPMU. While slightly contrived, it's plausible a VMM could
reflect KVM's default vCPU and then operate on KVM's copy of CPUID to
later clear the vPMU settings, e.g. see KVM's selftests.

Example #4: Enumerating an Intel vCPU on an AMD host will not call into
intel_pmu_refresh() at any point, and so the BTS and PEBS "unavailable"
bits will be left clear, without any way for userspace to set them.

Keep the "R" behavior of the bit 7, "EMON available", for the guest.
Unlike the BTS and PEBS bits, which are fully "RO", the EMON bit can be
written with a different value, but that new value is ignored.

Cc: Like Xu <likexu@tencent.com>
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reported-by: Nkernel test robot <oliver.sang@intel.com>
Message-Id: <20220611005755.753273-2-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

9fc22296

KVM: x86: Move "apicv_active" into "struct kvm_lapic" · ce0a58f4

由 Sean Christopherson 提交于 6月 14, 2022

Move the per-vCPU apicv_active flag into KVM's local APIC instance.
APICv is fully dependent on an in-kernel local APIC, but that's not at
all clear when reading the current code due to the flag being stored in
the generic kvm_vcpu_arch struct.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-5-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ce0a58f4

KVM: x86: Check for in-kernel xAPIC when querying APICv for directed yield · ae801e13

由 Sean Christopherson 提交于 6月 14, 2022

Use kvm_vcpu_apicv_active() to check if APICv is active when seeing if a
vCPU is a candidate for directed yield due to a pending ACPIv interrupt.
This will allow moving apicv_active into kvm_lapic without introducing a
potential NULL pointer deref (kvm_vcpu_apicv_active() effectively adds a
pre-check on the vCPU having an in-kernel APIC).

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220614230548.3852141-4-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ae801e13

10 6月, 2022 1 次提交

KVM: x86: Bug the VM if the emulator accesses a non-existent GPR · 1cca2f8c

由 Sean Christopherson 提交于 5月 26, 2022

Bug the VM, i.e. kill it, if the emulator accesses a non-existent GPR,
i.e. generates an out-of-bounds GPR index.  Continuing on all but
gaurantees some form of data corruption in the guest, e.g. even if KVM
were to redirect to a dummy register, KVM would be incorrectly read zeros
and drop writes.

Note, bugging the VM doesn't completely prevent data corruption, e.g. the
current round of emulation will complete before the vCPU bails out to
userspace.  But, the very act of killing the guest can also cause data
corruption, e.g. due to lack of file writeback before termination, so
taking on additional complexity to cleanly bail out of the emulator isn't
justified, the goal is purely to stem the bleeding and alert userspace
that something has gone horribly wrong, i.e. to avoid _silent_ data
corruption.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20220526210817.3428868-7-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1cca2f8c

09 6月, 2022 2 次提交

KVM: x86: disable preemption while updating apicv inhibition · 66c768d3

由 Maxim Levitsky 提交于 6月 06, 2022

Currently nothing prevents preemption in kvm_vcpu_update_apicv.

On SVM, If the preemption happens after we update the
vcpu->arch.apicv_active, the preemption itself will
'update' the inhibition since the AVIC will be first disabled
on vCPU unload and then enabled, when the current task
is loaded again.

Then we will try to update it again, which will lead to a warning
in __avic_vcpu_load, that the AVIC is already enabled.

Fix this by disabling preemption in this code.
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66c768d3

KVM: x86/pmu: Avoid exposing Intel BTS feature · b9181c8e

由 Like Xu 提交于 6月 01, 2022

The BTS feature (including the ability to set the BTS and BTINT
bits in the DEBUGCTL MSR) is currently unsupported on KVM.

But we may try using the BTS facility on a PEBS enabled guest like this:
    perf record -e branches:u -c 1 -d ls
and then we would encounter the following call trace:

 [] unchecked MSR access error: WRMSR to 0x1d9 (tried to write 0x00000000000003c0)
        at rIP: 0xffffffff810745e4 (native_write_msr+0x4/0x20)
 [] Call Trace:
 []  intel_pmu_enable_bts+0x5d/0x70
 []  bts_event_add+0x54/0x70
 []  event_sched_in+0xee/0x290

As it lacks any CPUID indicator or perf_capabilities valid bit
fields to prompt for this information, the platform would hint
the Intel BTS feature unavailable to guest by setting the
BTS_UNAVAIL bit in the IA32_MISC_ENABLE.
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220601031925.59693-3-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

b9181c8e

08 6月, 2022 17 次提交

KVM: VMX: Enable Notify VM exit · 2f4073e0

由 Tao Xu 提交于 5月 24, 2022

There are cases that malicious virtual machines can cause CPU stuck (due
to event windows don't open up), e.g., infinite loop in microcode when
nested #AC (CVE-2015-5307). No event window means no event (NMI, SMI and
IRQ) can be delivered. It leads the CPU to be unavailable to host or
other VMs.

VMM can enable notify VM exit that a VM exit generated if no event
window occurs in VM non-root mode for a specified amount of time (notify
window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
  enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
  the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
  can query and enable this feature in per-VM scope. The argument is a
  64bit value: bits 63:32 are used for notify window, and bits 31:0 are
  for flags. Current supported flags:
  - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
    window provided.
  - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
- It's safe to even set notify window to zero since an internal hardware
  threshold is added to vmcs.notify_window.

VM exit handling:
- Introduce a vcpu state notify_window_exits to records the count of
  notify VM exits and expose it through the debugfs.
- Notify VM exit can happen incident to delivery of a vector event.
  Allow it in KVM.
- Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
  bit is set.

Nested handling
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as vmcs01, so that L1 can't escape the
  restriction of notify VM exits through launching L2 VM.

Notify VM exit is defined in latest Intel Architecture Instruction Set
Extensions Programming Reference, chapter 9.2.
Co-developed-by: NXiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: NTao Xu <tao3.xu@intel.com>
Co-developed-by: NChenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: NChenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2f4073e0

KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745

由 Sean Christopherson 提交于 5月 24, 2022

Add kvm_caps to hold a variety of capabilites and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature.  The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

938c8745

KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault · ed235117

由 Chenyi Qiang 提交于 5月 24, 2022

For the triple fault sythesized by KVM, e.g. the RSM path or
nested_vmx_abort(), if KVM exits to userspace before the request is
serviced, userspace could migrate the VM and lose the triple fault.

Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault with a
new event KVM_VCPUEVENT_VALID_FAULT_FAULT so that userspace can save and
restore the triple fault event. This extension is guarded by a new KVM
capability KVM_CAP_TRIPLE_FAULT_EVENT.

Note that in the set_vcpu_events path, userspace is able to set/clear
the triple fault request through triple_fault.pending field.
Signed-off-by: NChenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20220524135624.22988-2-chenyi.qiang@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ed235117

KVM: x86: always allow host-initiated writes to PMU MSRs · d1c88a40

由 Paolo Bonzini 提交于 5月 25, 2022

Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always
retrievable and settable with KVM_GET_MSR and KVM_SET_MSR.  Accept
the PMU MSRs unconditionally in intel_is_valid_msr, if the access was
host-initiated.
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d1c88a40

KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability · 968635ab

由 Like Xu 提交于 4月 11, 2022

The information obtained from the interface perf_get_x86_pmu_capability()
doesn't change, so an exported "struct x86_pmu_capability" is introduced
for all guests in the KVM, and it's initialized before hardware_setup().
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-16-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

968635ab

KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled · d1055173

由 Like Xu 提交于 4月 11, 2022

The bit 12 represents "Processor Event Based Sampling Unavailable (RO)" :
	1 = PEBS is not supported.
	0 = PEBS is supported.

A write to this PEBS_UNAVL available bit will bring #GP(0) when guest PEBS
is enabled. Some PEBS drivers in guest may care about this bit.
Signed-off-by: NLike Xu <like.xu@linux.intel.com>
Message-Id: <20220411101946.20262-13-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d1055173

KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS · 902caeb6

由 Like Xu 提交于 4月 11, 2022

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
FCx_Adaptive_Record) are also supported.

Adaptive PEBS provides software the capability to configure the PEBS
records to capture only the data of interest, keeping the record size
compact. An overflow of PMCx results in generation of an adaptive PEBS
record with state information based on the selections specified in
MSR_PEBS_DATA_CFG.By default, the record only contain the Basic group.

When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
be added to the perf_guest_switch_msr() and switched during the VMX
transitions just like CORE_PERF_GLOBAL_CTRL MSR.

According to Intel SDM, software is recommended to  PEBS Baseline
when the following is true. IA32_PERF_CAPABILITIES.PEBS_BASELINE[14]
&& IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4.
Co-developed-by: NLuwei Kang <luwei.kang@intel.com>
Signed-off-by: NLuwei Kang <luwei.kang@intel.com>
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-12-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

902caeb6

KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS · 8183a538

由 Like Xu 提交于 4月 11, 2022

When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
to the linear address of the first byte of the DS buffer management area,
which is used to manage the PEBS records.

When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
perf_guest_switch_msr() and switched during the VMX transitions just like
CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0)
if the source register contains a non-canonical address.
Originally-by: NAndi Kleen <ak@linux.intel.com>
Co-developed-by: NKan Liang <kan.liang@linux.intel.com>
Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
Signed-off-by: NLike Xu <like.xu@linux.intel.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-11-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8183a538

KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS · c59a1f10

由 Like Xu 提交于 4月 11, 2022

If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
that enable generation of PEBS records. The general-purpose counter bits
start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
bit IA32_PEBS_ENABLE[32].

When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
added to the perf_guest_switch_msr() and atomically switched during
the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.

Based on whether the platform supports x86_pmu.pebs_ept, it has also
refactored the way to add more msrs to arr[] in intel_guest_get_msrs()
for extensibility.
Originally-by: NAndi Kleen <ak@linux.intel.com>
Co-developed-by: NKan Liang <kan.liang@linux.intel.com>
Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
Co-developed-by: NLuwei Kang <luwei.kang@intel.com>
Signed-off-by: NLuwei Kang <luwei.kang@intel.com>
Signed-off-by: NLike Xu <like.xu@linux.intel.com>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Message-Id: <20220411101946.20262-8-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c59a1f10

KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled · bef6ecca

由 Like Xu 提交于 4月 11, 2022

On Intel platforms, the software can use the IA32_MISC_ENABLE[7] bit to
detect whether the processor supports performance monitoring facility.

It depends on the PMU is enabled for the guest, and a software write
operation to this available bit will be ignored. The proposal to ignore
the toggle in KVM is the way to go and that behavior matches bare metal.
Signed-off-by: NLike Xu <likexu@tencent.com>
Message-Id: <20220411101946.20262-5-likexu@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bef6ecca

KVM: VMX: enable IPI virtualization · d588bb9b

由 Chao Gao 提交于 4月 19, 2022

With IPI virtualization enabled, the processor emulates writes to
APIC registers that would send IPIs. The processor sets the bit
corresponding to the vector in target vCPU's PIR and may send a
notification (IPI) specified by NDST and NV fields in target vCPU's
Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
engine does when dealing with posted interrupt from devices.

A PID-pointer table is used by the processor to locate the PID of a
vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
ID assigned for current VM session from userspace. Allocating memory
for PID-pointer table is deferred to vCPU creation, because irqchip
mode and VM-scope maximum APIC ID is settled at that point. KVM can
skip PID-pointer table allocation if !irqchip_in_kernel().

Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
notification vector to wakeup vector. This can ensure that when an IPI
for blocked vCPUs arrives, VMM can get control and wake up blocked
vCPUs. And if a VCPU is preempted, its posted interrupt notification
is suppressed.

Note that IPI virtualization can only virualize physical-addressing,
flat mode, unicast IPIs. Sending other IPIs would still cause a
trap-like APIC-write VM-exit and need to be handled by VMM.
Signed-off-by: NChao Gao <chao.gao@intel.com>
Signed-off-by: NZeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154510.11938-1-guang.zeng@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d588bb9b

KVM: x86: Allow userspace to set maximum VCPU id for VM · 35875316

由 Zeng Guang 提交于 4月 19, 2022

Introduce new max_vcpu_ids in KVM for x86 architecture. Userspace
can assign maximum possible vcpu id for current VM session using
KVM_CAP_MAX_VCPU_ID of KVM_ENABLE_CAP ioctl().

This is done for x86 only because the sole use case is to guide
memory allocation for PID-pointer table, a structure needed to
enable VMX IPI.

By default, max_vcpu_ids set as KVM_MAX_VCPU_IDS.
Suggested-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NZeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154444.11888-1-guang.zeng@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

35875316

KVM: Move kvm_arch_vcpu_precreate() under kvm->lock · 1d5e740d

由 Zeng Guang 提交于 4月 19, 2022

kvm_arch_vcpu_precreate() targets to handle arch specific VM resource
to be prepared prior to the actual creation of vCPU. For example, x86
platform may need do per-VM allocation based on max_vcpu_ids at the
first vCPU creation. It probably leads to concurrency control on this
allocation as multiple vCPU creation could happen simultaneously. From
the architectual point of view, it's necessary to execute
kvm_arch_vcpu_precreate() under protect of kvm->lock.

Currently only arm64, x86 and s390 have non-nop implementations at the
stage of vCPU pre-creation. Remove the lock acquiring in s390's design
and make sure all architecture can run kvm_arch_vcpu_precreate() safely
under kvm->lock without recrusive lock issue.
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NZeng Guang <guang.zeng@intel.com>
Message-Id: <20220419154409.11842-1-guang.zeng@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1d5e740d

KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint · 2d613912

由 Sean Christopherson 提交于 5月 02, 2022

In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
"IRQs", i.e. interrupts that are reinjected after incomplete delivery of
a software interrupt from an INTn instruction.  Tag reinjected interrupts
as such, even though the information is usually redundant since soft
interrupts are only ever reinjected by KVM.  Though rare in practice, a
hard IRQ can be reinjected.
Signed-off-by: NSean Christopherson <seanjc@google.com>
[MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2d613912

KVM: x86: Trace re-injected exceptions · a61d7c54

由 Sean Christopherson 提交于 5月 02, 2022

Trace exceptions that are re-injected, not just those that KVM is
injecting for the first time.  Debugging re-injection bugs is painful
enough as is, not having visibility into what KVM is doing only makes
things worse.

Delay propagating pending=>injected in the non-reinjection path so that
the tracing can properly identify reinjected exceptions.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <25470690a38b4d2b32b6204875dd35676c65c9f2.1651440202.git.maciej.szmigiero@oracle.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

a61d7c54

KVM: x86: do not report a vCPU as preempted outside instruction boundaries · 6cd88243

由 Paolo Bonzini 提交于 6月 07, 2022

If a vCPU is outside guest mode and is scheduled out, it might be in the
process of making a memory access. A problem occurs if another vCPU uses
the PV TLB flush feature during the period when the vCPU is scheduled
out, and a virtual address has already been translated but has not yet
been accessed, because this is equivalent to using a stale TLB entry.

To avoid this, only report a vCPU as preempted if sure that the guest
is at an instruction boundary. A rescheduling request will be delivered
to the host physical CPU as an external interrupt, so for simplicity
consider any vmexit *not* instruction boundary except for external
interrupts.

It would in principle be okay to report the vCPU as preempted also
if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
vmentry/vmexit overhead unnecessarily, and optimistic spinning is
also unlikely to succeed. However, leave it for later because right
now kvm_vcpu_check_block() is doing memory accesses. Even
though the TLB flush issue only applies to virtual memory address,
it's very much preferrable to be conservative.
Reported-by: NJann Horn <jannh@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

6cd88243

KVM: x86: do not set st->preempted when going back to user space · 54aa83c9

由 Paolo Bonzini 提交于 6月 07, 2022

Similar to the Xen path, only change the vCPU's reported state if the vCPU
was actually preempted. The reason for KVM's behavior is that for example
optimistic spinning might not be a good idea if the guest is doing repeated
exits to userspace; however, it is confusing and unlikely to make a difference,
because well-tuned guests will hardly ever exit KVM_RUN in the first place.
Suggested-by: NSean Christopherson <seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

54aa83c9

27 5月, 2022 1 次提交

x86: Fix all occurences of the "the the" typo · f7081834

由 Bo Liu 提交于 5月 27, 2022

Rather than waiting for the bots to fix these one-by-one,
fix all occurences of "the the" throughout arch/x86.
Signed-off-by: NBo Liu <liubo03@inspur.com>
Signed-off-by: NIngo Molnar <mingo@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Link: https://lore.kernel.org/r/20220527061400.5694-1-liubo03@inspur.com

f7081834

25 5月, 2022 3 次提交

KVM: set_msr_mce: Permit guests to ignore single-bit ECC errors · 0471a7bd

由 Lev Kujawski 提交于 5月 21, 2022

Certain guest operating systems (e.g., UNIXWARE) clear bit 0 of
MC1_CTL to ignore single-bit ECC data errors.  Single-bit ECC data
errors are always correctable and thus are safe to ignore because they
are informational in nature rather than signaling a loss of data
integrity.

Prior to this patch, these guests would crash upon writing MC1_CTL,
with resultant error messages like the following:

error: kvm run failed Operation not permitted
EAX=fffffffe EBX=fffffffe ECX=00000404 EDX=ffffffff
ESI=ffffffff EDI=00000001 EBP=fffdaba4 ESP=fffdab20
EIP=c01333a5 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
CS =0100 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
SS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
FS =0000 00000000 ffffffff 00c00000
GS =0000 00000000 ffffffff 00c00000
LDT=0118 c1026390 00000047 00008200 DPL=0 LDT
TR =0110 ffff5af0 00000067 00008b00 DPL=0 TSS32-busy
GDT=     ffff5020 000002cf
IDT=     ffff52f0 000007ff
CR0=8001003b CR2=00000000 CR3=0100a000 CR4=00000230
DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
DR6=ffff0ff0 DR7=00000400
EFER=0000000000000000
Code=08 89 01 89 51 04 c3 8b 4c 24 08 8b 01 8b 51 04 8b 4c 24 04 <0f>
30 c3 f7 05 a4 6d ff ff 10 00 00 00 74 03 0f 31 c3 33 c0 33 d2 c3 8d
74 26 00 0f 31 c3
Signed-off-by: NLev Kujawski <lkujaw@member.fsf.org>
Message-Id: <20220521081511.187388-1-lkujaw@member.fsf.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0471a7bd

KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd

由 Sean Christopherson 提交于 3月 11, 2022

Whenever x86_decode_emulated_instruction() detects a breakpoint, it
returns the value that kvm_vcpu_check_breakpoint() writes into its
pass-by-reference second argument.  Unfortunately this is completely
bogus because the expected outcome of x86_decode_emulated_instruction
is an EMULATION_* value.

Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
and x86_emulate_instruction() is called without having decoded the
instruction.  This causes various havoc from running with a stale
emulation context.

The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
emulation with decoding") introduced x86_decode_emulated_instruction().
The other caller of the function does not need breakpoint checks,
because it is invoked as part of a vmexit and the processor has already
checked those before executing the instruction that #GP'd.

This fixes CVE-2022-1852.
Reported-by: NQiuhao Li <qiuhao@sysec.org>
Reported-by: NGaoning Pan <pgn@zju.edu.cn>
Reported-by: NYongkang Jia <kangel@zju.edu.cn>
Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
Cc: stable@vger.kernel.org
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220311032801.3467418-2-seanjc@google.com>
[Rewrote commit message according to Qiuhao's report, since a patch
 already existed to fix the bug. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fee060cd

KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351

由 Wanpeng Li 提交于 4月 26, 2022

In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
was moved after guest_exit_irqoff() because invoking tracepoints within
kvm_guest_enter/kvm_guest_exit caused a lockdep splat.

These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
Move context tracking where it belongs", 2020-07-09) restricted
the RCU extended quiescent state to be closer to vmentry/vmexit.
Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
because it will be reported even if vcpu_enter_guest causes multiple
vmentries via the IPI/Timer fast paths, and it allows the removal of
advance_expire_delta.
Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e0ac5351

12 5月, 2022 3 次提交

P
KVM: x86: a vCPU with a pending triple fault is runnable · c9f3d9fb
由 Paolo Bonzini 提交于 4月 27, 2022
```
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
```
c9f3d9fb

KVM: x86/mmu: Expand and clean up page fault stats · 1075d41e

由 Sean Christopherson 提交于 4月 23, 2022

Expand and clean up the page fault stats. The current stats are at best
incomplete, and at worst misleading. Differentiate between faults that
are actually fixed vs those that result in an MMIO SPTE being created,
track faults that are spurious, faults that trigger emulation, faults
that that are fixed in the fast path, and last but not least, track the
number of faults that are taken.

Note, the number of faults that require emulation for write-protected
shadow pages can roughly be calculated by subtracting the number of MMIO
SPTEs created from the overall number of faults that trigger emulation.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-10-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

1075d41e

KVM: x86/mmu: Make all page fault handlers internal to the MMU · 8a009d5b

由 Sean Christopherson 提交于 4月 23, 2022

Move kvm_arch_async_page_ready() to mmu.c where it belongs, and move all
of the page fault handling collateral that was in mmu.h purely for the
async #PF handler into mmu_internal.h, where it belongs.  This will allow
kvm_mmu_do_page_fault() to act on the RET_PF_* return without having to
expose those enums outside of the MMU.

No functional change intended.
Signed-off-by: NSean Christopherson <seanjc@google.com>
Message-Id: <20220423034752.1161007-8-seanjc@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

8a009d5b

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功