1. 20 Jun, 2022 7 commits
    • S
      KVM: x86: Add a quirk for KVM's "MONITOR/MWAIT are NOPs!" behavior · bfbcc81b
      Sean Christopherson committed
      Add a quirk for KVM's behavior of emulating intercepted MONITOR/MWAIT
      instructions as NOPs regardless of whether or not they are supported in
      guest CPUID.  KVM's current behavior was likely motivated by a certain
      fruity operating system that expects MONITOR/MWAIT to be supported
      unconditionally and blindly executes MONITOR/MWAIT without first checking
      CPUID.  And because KVM does NOT advertise MONITOR/MWAIT to userspace,
      that's effectively the default setup for any VMM that regurgitates
      KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2.
      
      Note, this quirk interacts with KVM_X86_QUIRK_MISC_ENABLE_NO_MWAIT.  The
      behavior is actually desirable, as userspace VMMs that want to
      unconditionally hide MONITOR/MWAIT from the guest can leave the
      MISC_ENABLE quirk enabled.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220608224516.3788274-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bfbcc81b
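The decision this quirk controls can be sketched as a small predicate. This is an illustration only, not KVM's code: the enum, the function name, and both parameters are hypothetical, and the assumption that an unsupported MONITOR/MWAIT results in #UD once the quirk is disabled is not stated in the commit message above.

```c
#include <stdbool.h>

/* Hypothetical outcome of emulating an intercepted MONITOR/MWAIT. */
enum demo_mwait_action {
	DEMO_MWAIT_NOP,       /* emulate the instruction as a NOP */
	DEMO_MWAIT_INJECT_UD, /* assumed: treat it as an invalid opcode */
};

/*
 * quirk_enabled:         the "MONITOR/MWAIT are NOPs" quirk is left enabled
 * guest_cpuid_has_mwait: guest CPUID advertises MONITOR/MWAIT
 */
enum demo_mwait_action demo_handle_monitor_mwait(bool quirk_enabled,
						 bool guest_cpuid_has_mwait)
{
	/* With the quirk enabled, guest CPUID is ignored entirely. */
	if (quirk_enabled || guest_cpuid_has_mwait)
		return DEMO_MWAIT_NOP;

	return DEMO_MWAIT_INJECT_UD;
}
```

With the quirk left enabled, a guest that executes MONITOR/MWAIT without checking CPUID keeps working, which is the legacy behavior the commit message describes.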
    • S
      KVM: x86: Ignore benign host writes to "unsupported" F15H_PERF_CTL MSRs · ff81a90f
      Sean Christopherson committed
      Ignore host userspace writes of '0' to F15H_PERF_CTL MSRs KVM reports
      in the MSR-to-save list, but the MSRs are ultimately unsupported.  All
      MSRs in said list must be writable by userspace, e.g. if userspace sends
      the list back at KVM without filtering out the MSRs it doesn't need.
      
      Note, reads of said MSRs already have the desired behavior.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220611005755.753273-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ff81a90f
    • S
      KVM: x86: Ignore benign host accesses to "unsupported" PEBS and BTS MSRs · 157fc497
      Sean Christopherson committed
      Ignore host userspace reads and writes of '0' to PEBS and BTS MSRs that
      KVM reports in the MSR-to-save list, but the MSRs are ultimately
      unsupported.  All MSRs in said list must be writable by userspace, e.g.
      if userspace sends the list back at KVM without filtering out the MSRs it
      doesn't need.
      
      Fixes: 8183a538 ("KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS")
      Fixes: 902caeb6 ("KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS")
      Fixes: c59a1f10 ("KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220611005755.753273-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      157fc497
    • S
      Revert "KVM: x86: always allow host-initiated writes to PMU MSRs" · 545feb96
      Sean Christopherson committed
      Revert the hack to allow host-initiated accesses to all "PMU" MSRs,
      as intel_is_valid_msr() returns true for _all_ MSRs, regardless of whether
      or not it has a snowball's chance in hell of actually being a PMU MSR.
      
      That mostly gets papered over by the actual get/set helpers only handling
      MSRs that they know about, except there's the minor detail that
      kvm_pmu_{g,s}et_msr() eat reads and writes when the PMU is disabled.
      I.e. KVM will happily allow reads and writes to _any_ MSR if the PMU is
      disabled, either via module param or capability.
      
      This reverts commit d1c88a40.
      
      Fixes: d1c88a40 ("KVM: x86: always allow host-initiated writes to PMU MSRs")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220611005755.753273-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      545feb96
    • S
      KVM: x86: Give host userspace full control of MSR_IA32_MISC_ENABLES · 9fc22296
      Sean Christopherson committed
      Give userspace full control of the read-only bits in MISC_ENABLES, i.e.
      do not modify bits on PMU refresh and do not preserve existing bits when
      userspace writes MISC_ENABLES.  With a few exceptions where KVM doesn't
      expose the necessary controls to userspace _and_ there is a clear cut
      association with CPUID, e.g. reserved CR4 bits, KVM does not own the vCPU
      and should not manipulate the vCPU model on behalf of "dummy user space".
      
      The argument that KVM is doing userspace a favor because "the order of
      setting vPMU capabilities and MSR_IA32_MISC_ENABLE is not strictly
      guaranteed" is specious, as attempting to configure MSRs on behalf of
      userspace inevitably leads to edge cases precisely because KVM does not
      prescribe a specific order of initialization.
      
      Example #1: intel_pmu_refresh() consumes and modifies the vCPU's
      MSR_IA32_PERF_CAPABILITIES, and so assumes userspace initializes config
      MSRs before setting the guest CPUID model.  If userspace sets CPUID
      first, then KVM will mark PEBS as available when arch.perf_capabilities
      is initialized with a non-zero PEBS format, thus creating a bad vCPU
      model if userspace later disables PEBS by writing PERF_CAPABILITIES.
      
      Example #2: intel_pmu_refresh() does not clear PERF_CAP_PEBS_MASK in
      MSR_IA32_PERF_CAPABILITIES if there is no vPMU, making KVM inconsistent
      in its desire to be consistent.
      
      Example #3: intel_pmu_refresh() does not clear MSR_IA32_MISC_ENABLE_EMON
      if KVM_SET_CPUID2 is called multiple times, first with a vPMU, then
      without a vPMU.  While slightly contrived, it's plausible a VMM could
      reflect KVM's default vCPU and then operate on KVM's copy of CPUID to
      later clear the vPMU settings, e.g. see KVM's selftests.
      
      Example #4: Enumerating an Intel vCPU on an AMD host will not call into
      intel_pmu_refresh() at any point, and so the BTS and PEBS "unavailable"
      bits will be left clear, without any way for userspace to set them.
      
      Keep the "R" behavior of bit 7, "EMON available", for the guest.
      Unlike the BTS and PEBS bits, which are fully "RO", the EMON bit can be
      written with a different value, but that new value is ignored.
      
      Cc: Like Xu <likexu@tencent.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Message-Id: <20220611005755.753273-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      9fc22296
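The write semantics described in this commit message can be sketched as follows. This is a minimal model, not KVM's implementation: all `demo_`/`DEMO_` names are hypothetical, bit 7 (EMON) and bit 12 (PEBS unavailable) come from the commit messages on this page, and bit 11 for "BTS unavailable" is an assumption taken from the Intel SDM.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MISC_ENABLE_EMON         (1ULL << 7)  /* "EMON available" */
#define DEMO_MISC_ENABLE_BTS_UNAVAIL  (1ULL << 11) /* assumed bit position */
#define DEMO_MISC_ENABLE_PEBS_UNAVAIL (1ULL << 12)
#define DEMO_MISC_ENABLE_RO_MASK \
	(DEMO_MISC_ENABLE_BTS_UNAVAIL | DEMO_MISC_ENABLE_PEBS_UNAVAIL)

/* A guest write faults if it tries to change a fully read-only bit. */
bool demo_guest_write_faults(uint64_t old, uint64_t data)
{
	return ((old ^ data) & DEMO_MISC_ENABLE_RO_MASK) != 0;
}

/* Value the MSR holds after a guest write (unchanged if the write faults). */
uint64_t demo_guest_write_result(uint64_t old, uint64_t data)
{
	if (demo_guest_write_faults(old, data))
		return old;

	/* EMON may be written with a new value, but that value is ignored. */
	return (data & ~DEMO_MISC_ENABLE_EMON) | (old & DEMO_MISC_ENABLE_EMON);
}

/* Host userspace has full control: the written value is taken as-is. */
uint64_t demo_host_write_result(uint64_t old, uint64_t data)
{
	(void)old;
	return data;
}
```

Keeping the host path free of any masking reflects the commit's argument that userspace, not KVM, owns the vCPU model.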
    • S
      KVM: x86: Move "apicv_active" into "struct kvm_lapic" · ce0a58f4
      Sean Christopherson committed
      Move the per-vCPU apicv_active flag into KVM's local APIC instance.
      APICv is fully dependent on an in-kernel local APIC, but that's not at
      all clear when reading the current code due to the flag being stored in
      the generic kvm_vcpu_arch struct.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce0a58f4
    • S
      KVM: x86: Check for in-kernel xAPIC when querying APICv for directed yield · ae801e13
      Sean Christopherson committed
      Use kvm_vcpu_apicv_active() to check if APICv is active when seeing if a
      vCPU is a candidate for directed yield due to a pending APICv interrupt.
      This will allow moving apicv_active into kvm_lapic without introducing a
      potential NULL pointer deref (kvm_vcpu_apicv_active() effectively adds a
      pre-check on the vCPU having an in-kernel APIC).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220614230548.3852141-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ae801e13
  2. 10 Jun, 2022 1 commit
    • S
      KVM: x86: Bug the VM if the emulator accesses a non-existent GPR · 1cca2f8c
      Sean Christopherson committed
      Bug the VM, i.e. kill it, if the emulator accesses a non-existent GPR,
      i.e. generates an out-of-bounds GPR index.  Continuing on all but
      guarantees some form of data corruption in the guest, e.g. even if KVM
      were to redirect to a dummy register, KVM would incorrectly read zeros
      and drop writes.
      
      Note, bugging the VM doesn't completely prevent data corruption, e.g. the
      current round of emulation will complete before the vCPU bails out to
      userspace.  But, the very act of killing the guest can also cause data
      corruption, e.g. due to lack of file writeback before termination, so
      taking on additional complexity to cleanly bail out of the emulator isn't
      justified; the goal is purely to stem the bleeding and alert userspace
      that something has gone horribly wrong, i.e. to avoid _silent_ data
      corruption.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220526210817.3428868-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1cca2f8c
  3. 09 Jun, 2022 2 commits
    • M
      KVM: x86: disable preemption while updating apicv inhibition · 66c768d3
      Maxim Levitsky committed
      Currently nothing prevents preemption in kvm_vcpu_update_apicv().
      
      On SVM, if preemption happens after vcpu->arch.apicv_active is
      updated, the preemption itself will 'update' the inhibition, since
      the AVIC is first disabled on vCPU unload and then enabled again
      when the current task is loaded.
      
      Then we will try to update it again, which will lead to a warning
      in __avic_vcpu_load, that the AVIC is already enabled.
      
      Fix this by disabling preemption in this code.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606180829.102503-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      66c768d3
    • L
      KVM: x86/pmu: Avoid exposing Intel BTS feature · b9181c8e
      Like Xu committed
      The BTS feature (including the ability to set the BTS and BTINT
      bits in the DEBUGCTL MSR) is currently unsupported on KVM.
      
      But we may try using the BTS facility on a PEBS enabled guest like this:
          perf record -e branches:u -c 1 -d ls
      and then we would encounter the following call trace:
      
       [] unchecked MSR access error: WRMSR to 0x1d9 (tried to write 0x00000000000003c0)
              at rIP: 0xffffffff810745e4 (native_write_msr+0x4/0x20)
       [] Call Trace:
       []  intel_pmu_enable_bts+0x5d/0x70
       []  bts_event_add+0x54/0x70
       []  event_sched_in+0xee/0x290
      
      Because there is no CPUID indicator or valid perf_capabilities bit
      field that enumerates BTS, the platform signals that the Intel BTS
      feature is unavailable to the guest by setting the BTS_UNAVAIL bit
      in IA32_MISC_ENABLE.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220601031925.59693-3-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b9181c8e
  4. 08 Jun, 2022 17 commits
    • T
      KVM: VMX: Enable Notify VM exit · 2f4073e0
      Tao Xu committed
      A malicious virtual machine can cause a CPU to get stuck because no
      event window ever opens up, e.g. an infinite loop in microcode when
      handling a nested #AC (CVE-2015-5307). No event window means no event
      (NMI, SMI or IRQ) can be delivered, which makes the CPU unavailable to
      the host and to other VMs.
      
      The VMM can enable notify VM exit, which generates a VM exit if no
      event window occurs in VMX non-root mode for a specified amount of
      time (the notify window).
      
      Feature enabling:
      - The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
        enable this feature. VMM can set NOTIFY_WINDOW vmcs field to adjust
        the expected notify window.
      - Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
        can query and enable this feature in per-VM scope. The argument is a
        64bit value: bits 63:32 are used for notify window, and bits 31:0 are
        for flags. Current supported flags:
        - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
          window provided.
        - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace once the exits happen.
      - It's safe to even set notify window to zero since an internal hardware
        threshold is added to vmcs.notify_window.
      
      VM exit handling:
      - Introduce a vCPU statistic, notify_window_exits, to record the count
        of notify VM exits and expose it through debugfs.
      - Notify VM exit can happen incident to delivery of a vectored event.
        Allow it in KVM.
      - Exit to userspace unconditionally for handling when VM_CONTEXT_INVALID
        bit is set.
      
      Nested handling:
      - Nested notify VM exits are not supported yet. Keep the same notify
        window control in vmcs02 as vmcs01, so that L1 can't escape the
        restriction of notify VM exits through launching L2 VM.
      
      Notify VM exit is defined in latest Intel Architecture Instruction Set
      Extensions Programming Reference, chapter 9.2.
      Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20220524135624.22988-5-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f4073e0
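The 64-bit capability argument described above (bits 63:32 for the notify window, bits 31:0 for flags) can be packed and unpacked as in this sketch. The helper names are hypothetical, and the numeric flag values are assumptions made for illustration; the real KVM_X86_NOTIFY_VMEXIT_* definitions live in the kernel's UAPI headers.

```c
#include <stdint.h>

/* Assumed flag values, for illustration only. */
#define DEMO_NOTIFY_VMEXIT_ENABLED (1ULL << 0)
#define DEMO_NOTIFY_VMEXIT_USER    (1ULL << 1)

/* Pack the notify window into bits 63:32 and the flags into bits 31:0. */
uint64_t demo_notify_vmexit_arg(uint32_t window, uint32_t flags)
{
	return ((uint64_t)window << 32) | flags;
}

/* Extract the notify window (bits 63:32). */
uint32_t demo_notify_window(uint64_t arg)
{
	return (uint32_t)(arg >> 32);
}

/* Extract the flags (bits 31:0). */
uint32_t demo_notify_flags(uint64_t arg)
{
	return (uint32_t)arg;
}
```

A VMM would presumably pass the packed value as the first argument when enabling KVM_CAP_X86_NOTIFY_VMEXIT on the VM; per the commit message, a window of zero is still safe because hardware adds an internal threshold.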
    • S
      KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745
      Sean Christopherson committed
      Add kvm_caps to hold a variety of capabilities and defaults that aren't
      handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
      the amount of boilerplate code required to add a new feature.  The vast
      majority (all?) of the caps interact with vendor code and are written
      only during initialization, i.e. should be tagged __read_mostly, declared
      extern in x86.h, and exported.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      938c8745
    • C
      KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault · ed235117
      Chenyi Qiang committed
      For a triple fault synthesized by KVM, e.g. the RSM path or
      nested_vmx_abort(), if KVM exits to userspace before the request is
      serviced, userspace could migrate the VM and lose the triple fault.
      
      Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault with a
      new event KVM_VCPUEVENT_VALID_TRIPLE_FAULT so that userspace can save and
      restore the triple fault event. This extension is guarded by a new KVM
      capability KVM_CAP_TRIPLE_FAULT_EVENT.
      
      Note that in the set_vcpu_events path, userspace is able to set/clear
      the triple fault request through triple_fault.pending field.
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20220524135624.22988-2-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ed235117
    • P
      KVM: x86: always allow host-initiated writes to PMU MSRs · d1c88a40
      Paolo Bonzini committed
      Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it must always be
      retrievable and settable with KVM_GET_MSRS and KVM_SET_MSRS.  Accept
      the PMU MSRs unconditionally in intel_is_valid_msr() if the access was
      host-initiated.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1c88a40
    • L
      KVM: x86/pmu: Add kvm_pmu_cap to optimize perf_get_x86_pmu_capability · 968635ab
      Like Xu committed
      The information obtained from the interface perf_get_x86_pmu_capability()
      doesn't change, so an exported "struct x86_pmu_capability" is introduced
      for all guests in KVM, and it's initialized before hardware_setup().
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-16-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      968635ab
    • L
      KVM: x86: Set PEBS_UNAVAIL in IA32_MISC_ENABLE when PEBS is enabled · d1055173
      Like Xu committed
      Bit 12 represents "Processor Event Based Sampling Unavailable (RO)":
      	1 = PEBS is not supported.
      	0 = PEBS is supported.
      
      A write to the PEBS_UNAVAIL bit triggers #GP(0) when guest PEBS is
      enabled. Some PEBS drivers in the guest may care about this bit.
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Message-Id: <20220411101946.20262-13-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d1055173
    • L
      KVM: x86/pmu: Add PEBS_DATA_CFG MSR emulation to support adaptive PEBS · 902caeb6
      Like Xu committed
      If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the adaptive
      PEBS is supported. The PEBS_DATA_CFG MSR and adaptive record enable
      bits (IA32_PERFEVTSELx.Adaptive_Record and IA32_FIXED_CTR_CTRL.
      FCx_Adaptive_Record) are also supported.
      
      Adaptive PEBS provides software the capability to configure the PEBS
      records to capture only the data of interest, keeping the record size
      compact. An overflow of PMCx results in generation of an adaptive PEBS
      record with state information based on the selections specified in
      MSR_PEBS_DATA_CFG. By default, the record only contains the Basic group.
      
      When guest adaptive PEBS is enabled, the IA32_PEBS_ENABLE MSR will
      be added to the perf_guest_switch_msr() and switched during the VMX
      transitions just like CORE_PERF_GLOBAL_CTRL MSR.
      
      According to the Intel SDM, software is recommended to use PEBS Baseline
      when the following is true: IA32_PERF_CAPABILITIES.PEBS_BASELINE[14]
      && IA32_PERF_CAPABILITIES.PEBS_FMT[11:8] ≥ 4.
      Co-developed-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-12-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      902caeb6
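The PEBS Baseline condition quoted from the SDM above amounts to a simple bit test on IA32_PERF_CAPABILITIES. A minimal sketch, with hypothetical macro and function names, where the caller is assumed to have already read the MSR value:

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PERF_CAP_PEBS_BASELINE  (1ULL << 14) /* PEBS_BASELINE, bit 14 */
#define DEMO_PERF_CAP_PEBS_FMT_SHIFT 8            /* PEBS_FMT, bits 11:8 */
#define DEMO_PERF_CAP_PEBS_FMT_MASK  0xfULL

/* True when PEBS_BASELINE[14] is set and PEBS_FMT[11:8] >= 4. */
bool demo_pebs_baseline_usable(uint64_t perf_capabilities)
{
	uint64_t fmt = (perf_capabilities >> DEMO_PERF_CAP_PEBS_FMT_SHIFT) &
		       DEMO_PERF_CAP_PEBS_FMT_MASK;

	return (perf_capabilities & DEMO_PERF_CAP_PEBS_BASELINE) != 0 && fmt >= 4;
}
```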
    • L
      KVM: x86/pmu: Add IA32_DS_AREA MSR emulation to support guest DS · 8183a538
      Like Xu committed
      When CPUID.01H:EDX.DS[21] is set, the IA32_DS_AREA MSR exists and points
      to the linear address of the first byte of the DS buffer management area,
      which is used to manage the PEBS records.
      
      When guest PEBS is enabled, the MSR_IA32_DS_AREA MSR will be added to the
      perf_guest_switch_msr() and switched during the VMX transitions just like
      CORE_PERF_GLOBAL_CTRL MSR. The WRMSR to IA32_DS_AREA MSR brings a #GP(0)
      if the source register contains a non-canonical address.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Message-Id: <20220411101946.20262-11-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8183a538
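The #GP(0) condition for IA32_DS_AREA writes depends on whether the written address is canonical. The sketch below shows one way to express that check. The function name is hypothetical, and the virtual-address width is a parameter because the commit message does not state it (48 bits is assumed in the usage note below; it would be 57 with 5-level paging).

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * An address is canonical when bits 63:(vaddr_bits - 1) are all zeros or
 * all ones. vaddr_bits must be in the range 1..64.
 */
bool demo_is_canonical(uint64_t addr, unsigned int vaddr_bits)
{
	uint64_t upper = addr >> (vaddr_bits - 1);
	uint64_t all_ones = UINT64_MAX >> (vaddr_bits - 1);

	return upper == 0 || upper == all_ones;
}
```

Under the 48-bit assumption, `demo_is_canonical(0x0000800000000000ULL, 48)` is false, so a WRMSR of that value to IA32_DS_AREA would be expected to fault.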
    • L
      KVM: x86/pmu: Add IA32_PEBS_ENABLE MSR emulation for extended PEBS · c59a1f10
      Like Xu committed
      If IA32_PERF_CAPABILITIES.PEBS_BASELINE [bit 14] is set, the
      IA32_PEBS_ENABLE MSR exists and all architecturally enumerated fixed
      and general-purpose counters have corresponding bits in IA32_PEBS_ENABLE
      that enable generation of PEBS records. The general-purpose counter bits
      start at bit IA32_PEBS_ENABLE[0], and the fixed counter bits start at
      bit IA32_PEBS_ENABLE[32].
      
      When guest PEBS is enabled, the IA32_PEBS_ENABLE MSR will be
      added to the perf_guest_switch_msr() and atomically switched during
      the VMX transitions just like CORE_PERF_GLOBAL_CTRL MSR.
      
      Also refactor how MSRs are added to arr[] in intel_guest_get_msrs(),
      based on whether the platform supports x86_pmu.pebs_ept, for
      extensibility.
      Originally-by: Andi Kleen <ak@linux.intel.com>
      Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Co-developed-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Like Xu <like.xu@linux.intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Message-Id: <20220411101946.20262-8-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c59a1f10
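The IA32_PEBS_ENABLE layout described above (general-purpose counter bits starting at bit 0, fixed counter bits starting at bit 32) can be illustrated by computing the set of counter enable bits for a given counter configuration. This is a sketch with a hypothetical function name; it covers only the counter enable bits and ignores any other bits the MSR may define.

```c
#include <stdint.h>

/* Low `n` bits set, saturating at 32 bits. */
static uint64_t demo_low_bits(unsigned int n)
{
	return n >= 32 ? 0xffffffffULL : (1ULL << n) - 1;
}

/*
 * Counter enable bits in IA32_PEBS_ENABLE: general-purpose counter i maps to
 * bit i, fixed counter j maps to bit 32 + j.
 */
uint64_t demo_pebs_enable_counter_mask(unsigned int nr_gp, unsigned int nr_fixed)
{
	return demo_low_bits(nr_gp) | (demo_low_bits(nr_fixed) << 32);
}
```

For example, 8 general-purpose and 4 fixed counters give `0x0000000f000000ff`.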
    • L
      KVM: x86/pmu: Set MSR_IA32_MISC_ENABLE_EMON bit when vPMU is enabled · bef6ecca
      Like Xu committed
      On Intel platforms, software can use the IA32_MISC_ENABLE[7] bit to
      detect whether the processor supports the performance monitoring facility.
      
      The bit's value depends on whether the PMU is enabled for the guest, and
      a software write to this bit is ignored. Ignoring the toggle in KVM is
      the right approach because it matches bare-metal behavior.
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220411101946.20262-5-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bef6ecca
    • C
      KVM: VMX: enable IPI virtualization · d588bb9b
      Chao Gao committed
      With IPI virtualization enabled, the processor emulates writes to
      APIC registers that would send IPIs. The processor sets the bit
      corresponding to the vector in target vCPU's PIR and may send a
      notification (IPI) specified by NDST and NV fields in target vCPU's
      Posted-Interrupt Descriptor (PID). It is similar to what IOMMU
      engine does when dealing with posted interrupt from devices.
      
      A PID-pointer table is used by the processor to locate the PID of a
      vCPU with the vCPU's APIC ID. The table size depends on maximum APIC
      ID assigned for current VM session from userspace. Allocating memory
      for PID-pointer table is deferred to vCPU creation, because irqchip
      mode and VM-scope maximum APIC ID is settled at that point. KVM can
      skip PID-pointer table allocation if !irqchip_in_kernel().
      
      Like VT-d PI, if a vCPU goes to blocked state, VMM needs to switch its
      notification vector to wakeup vector. This can ensure that when an IPI
      for blocked vCPUs arrives, VMM can get control and wake up blocked
      vCPUs. And if a VCPU is preempted, its posted interrupt notification
      is suppressed.
      
      Note that IPI virtualization can only virtualize physical-addressing,
      flat mode, unicast IPIs. Sending other IPIs would still cause a
      trap-like APIC-write VM-exit and need to be handled by VMM.
      Signed-off-by: Chao Gao <chao.gao@intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154510.11938-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d588bb9b
    • Z
      KVM: x86: Allow userspace to set maximum VCPU id for VM · 35875316
      Zeng Guang committed
      Introduce a new max_vcpu_ids field in KVM for the x86 architecture.
      Userspace can set the maximum possible vCPU ID for the current VM
      session using KVM_CAP_MAX_VCPU_ID with the KVM_ENABLE_CAP ioctl().
      
      This is done for x86 only because the sole use case is to guide
      memory allocation for PID-pointer table, a structure needed to
      enable VMX IPI.
      
      By default, max_vcpu_ids is set to KVM_MAX_VCPU_IDS.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154444.11888-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      35875316
    • Z
      KVM: Move kvm_arch_vcpu_precreate() under kvm->lock · 1d5e740d
      Zeng Guang committed
      kvm_arch_vcpu_precreate() prepares arch-specific VM resources prior to
      the actual creation of a vCPU. For example, x86 may need to do a per-VM
      allocation based on max_vcpu_ids at the first vCPU creation. Because
      multiple vCPUs can be created simultaneously, that allocation needs
      concurrency control. From an architectural point of view, it's
      necessary to execute kvm_arch_vcpu_precreate() under the protection of
      kvm->lock.
      
      Currently only arm64, x86 and s390 have non-nop implementations at the
      vCPU pre-creation stage. Remove the lock acquisition in s390's
      implementation and make sure all architectures can run
      kvm_arch_vcpu_precreate() safely under kvm->lock without recursive
      locking issues.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220419154409.11842-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1d5e740d
    • S
      KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint · 2d613912
      Sean Christopherson committed
      In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
      "IRQs", i.e. interrupts that are reinjected after incomplete delivery of
      a software interrupt from an INTn instruction.  Tag reinjected interrupts
      as such, even though the information is usually redundant since soft
      interrupts are only ever reinjected by KVM.  Though rare in practice, a
      hard IRQ can be reinjected.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      [MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2d613912
    • S
      KVM: x86: Trace re-injected exceptions · a61d7c54
      Sean Christopherson committed
      Trace exceptions that are re-injected, not just those that KVM is
      injecting for the first time.  Debugging re-injection bugs is painful
      enough as is; not having visibility into what KVM is doing only makes
      things worse.
      
      Delay propagating pending=>injected in the non-reinjection path so that
      the tracing can properly identify reinjected exceptions.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <25470690a38b4d2b32b6204875dd35676c65c9f2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a61d7c54
    • P
      KVM: x86: do not report a vCPU as preempted outside instruction boundaries · 6cd88243
      Paolo Bonzini committed
      If a vCPU is outside guest mode and is scheduled out, it might be in the
      process of making a memory access.  A problem occurs if another vCPU uses
      the PV TLB flush feature during the period when the vCPU is scheduled
      out, and a virtual address has already been translated but has not yet
      been accessed, because this is equivalent to using a stale TLB entry.
      
      To avoid this, only report a vCPU as preempted if it is certain that the
      guest is at an instruction boundary.  A rescheduling request will be
      delivered to the host physical CPU as an external interrupt, so for
      simplicity consider any vmexit *not* an instruction boundary except for
      external interrupts.
      
      It would in principle be okay to report the vCPU as preempted also
      if it is sleeping in kvm_vcpu_block(): a TLB flush IPI will incur the
      vmentry/vmexit overhead unnecessarily, and optimistic spinning is
      also unlikely to succeed.  However, leave it for later because right
      now kvm_vcpu_check_block() is doing memory accesses.  Even
      though the TLB flush issue only applies to virtual memory addresses,
      it's very much preferable to be conservative.
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6cd88243
    • P
      KVM: x86: do not set st->preempted when going back to user space · 54aa83c9
      Paolo Bonzini committed
      Similar to the Xen path, only change the vCPU's reported state if the vCPU
      was actually preempted.  The reason for KVM's behavior is that for example
      optimistic spinning might not be a good idea if the guest is doing repeated
      exits to userspace; however, it is confusing and unlikely to make a difference,
      because well-tuned guests will hardly ever exit KVM_RUN in the first place.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      54aa83c9
  5. 27 May, 2022 1 commit
  6. 25 May, 2022 3 commits
    • L
      KVM: set_msr_mce: Permit guests to ignore single-bit ECC errors · 0471a7bd
      Lev Kujawski committed
      Certain guest operating systems (e.g., UNIXWARE) clear bit 0 of
      MC1_CTL to ignore single-bit ECC data errors.  Single-bit ECC data
      errors are always correctable and thus are safe to ignore because they
      are informational in nature rather than signaling a loss of data
      integrity.
      
      Prior to this patch, these guests would crash upon writing MC1_CTL,
      with resultant error messages like the following:
      
      error: kvm run failed Operation not permitted
      EAX=fffffffe EBX=fffffffe ECX=00000404 EDX=ffffffff
      ESI=ffffffff EDI=00000001 EBP=fffdaba4 ESP=fffdab20
      EIP=c01333a5 EFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
      ES =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      CS =0100 00000000 ffffffff 00c09b00 DPL=0 CS32 [-RA]
      SS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      DS =0108 00000000 ffffffff 00c09300 DPL=0 DS   [-WA]
      FS =0000 00000000 ffffffff 00c00000
      GS =0000 00000000 ffffffff 00c00000
      LDT=0118 c1026390 00000047 00008200 DPL=0 LDT
      TR =0110 ffff5af0 00000067 00008b00 DPL=0 TSS32-busy
      GDT=     ffff5020 000002cf
      IDT=     ffff52f0 000007ff
      CR0=8001003b CR2=00000000 CR3=0100a000 CR4=00000230
      DR0=00000000 DR1=00000000 DR2=00000000 DR3=00000000
      DR6=ffff0ff0 DR7=00000400
      EFER=0000000000000000
      Code=08 89 01 89 51 04 c3 8b 4c 24 08 8b 01 8b 51 04 8b 4c 24 04 <0f>
      30 c3 f7 05 a4 6d ff ff 10 00 00 00 74 03 0f 31 c3 33 c0 33 d2 c3 8d
      74 26 00 0f 31 c3
      Signed-off-by: Lev Kujawski <lkujaw@member.fsf.org>
      Message-Id: <20220521081511.187388-1-lkujaw@member.fsf.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0471a7bd
    • S
      KVM: x86: avoid calling x86 emulator without a decoded instruction · fee060cd
      Sean Christopherson committed
      Whenever x86_decode_emulated_instruction() detects a breakpoint, it
      returns the value that kvm_vcpu_check_breakpoint() writes into its
      pass-by-reference second argument.  Unfortunately this is completely
      bogus because the expected outcome of x86_decode_emulated_instruction
      is an EMULATION_* value.
      
      Then, if kvm_vcpu_check_breakpoint() does "*r = 0" (corresponding to
      a KVM_EXIT_DEBUG userspace exit), it is misunderstood as EMULATION_OK
      and x86_emulate_instruction() is called without having decoded the
      instruction.  This causes various havoc from running with a stale
      emulation context.
      
      The fix is to move the call to kvm_vcpu_check_breakpoint() where it was
      before commit 4aa2691d ("KVM: x86: Factor out x86 instruction
      emulation with decoding") introduced x86_decode_emulated_instruction().
      The other caller of the function does not need breakpoint checks,
      because it is invoked as part of a vmexit and the processor has already
      checked those before executing the instruction that #GP'd.
      
      This fixes CVE-2022-1852.
      Reported-by: Qiuhao Li <qiuhao@sysec.org>
      Reported-by: Gaoning Pan <pgn@zju.edu.cn>
      Reported-by: Yongkang Jia <kangel@zju.edu.cn>
      Fixes: 4aa2691d ("KVM: x86: Factor out x86 instruction emulation with decoding")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220311032801.3467418-2-seanjc@google.com>
      [Rewrote commit message according to Qiuhao's report, since a patch
       already existed to fix the bug. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fee060cd
    • W
      KVM: LAPIC: Trace LAPIC timer expiration on every vmentry · e0ac5351
      Wanpeng Li committed
      In commit ec0671d5 ("KVM: LAPIC: Delay trace_kvm_wait_lapic_expire
      tracepoint to after vmexit", 2019-06-04), trace_kvm_wait_lapic_expire
      was moved after guest_exit_irqoff() because invoking tracepoints within
      kvm_guest_enter/kvm_guest_exit caused a lockdep splat.
      
      These days this is not necessary, because commit 87fa7f3e ("x86/kvm:
      Move context tracking where it belongs", 2020-07-09) restricted
      the RCU extended quiescent state to be closer to vmentry/vmexit.
      Moving the tracepoint back to __kvm_wait_lapic_expire is more accurate,
      because it will be reported even if vcpu_enter_guest causes multiple
      vmentries via the IPI/Timer fast paths, and it allows the removal of
      advance_expire_delta.
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Message-Id: <1650961551-38390-1-git-send-email-wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0ac5351
  7. 12 May, 2022 4 commits
  8. 02 May, 2022 1 commit
  9. 30 Apr, 2022 4 commits