1. 08 Jun 2022, 13 commits
    • KVM: x86: Introduce "struct kvm_caps" to track misc caps/settings · 938c8745
      Authored by Sean Christopherson
      Add kvm_caps to hold a variety of capabilities and defaults that aren't
      handled by kvm_cpu_caps because they aren't CPUID bits, in order to reduce
      the amount of boilerplate code required to add a new feature.  The vast
      majority (all?) of the caps interact with vendor code and are written
      only during initialization, i.e. should be tagged __read_mostly, declared
      extern in x86.h, and exported.
      
      No functional change intended.
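
      A minimal sketch of the pattern this introduces (the fields shown here
      are illustrative, not necessarily the exact set the patch moves):

        /* x86.h: one extern struct replaces a pile of per-cap externs. */
        struct kvm_caps {
                bool has_tsc_control;           /* hardware TSC scaling */
                u64  max_tsc_scaling_ratio;
                u64  supported_mce_cap;
                u64  supported_xcr0;
        };
        extern struct kvm_caps kvm_caps;

        /* x86.c: written once during init, read everywhere afterwards. */
        struct kvm_caps kvm_caps __read_mostly;
        EXPORT_SYMBOL_GPL(kvm_caps);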
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Drop amd_event_mapping[] in the KVM context · 7aadaa98
      Authored by Like Xu
      All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW,
      which means that the table that maps perf_hw_id to event select values is
      no longer useful, at least for AMD.

      For Intel, the logic that checks whether a PMU event reported by Intel
      CPUID is unavailable is still required; for that case, pmc_perf_hw_id()
      can be renamed to hw_event_is_unavail() and return a bool, replacing the
      "PERF_COUNT_HW_MAX+1" semantics.
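
      A sketch of the reworked predicate on the AMD side (names follow the
      description above; the body is an illustrative simplification):

        /* Before: pmc_perf_hw_id() returned PERF_COUNT_HW_MAX+1 to mean
         * "unsupported".  After: a plain predicate; AMD needs no table. */
        static bool amd_hw_event_is_unavail(struct kvm_pmc *pmc)
        {
                return false;   /* everything is programmed via PERF_TYPE_RAW */
        }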
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220518132512.37864-12-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Use only the uniform interface reprogram_counter() · e99fae6e
      Authored by Paolo Bonzini
      Since reprogram_counter() and reprogram_{gp, fixed}_counter() currently
      take the same parameter, "struct kvm_pmc *pmc", callers can simplify
      the code by uniformly using the exported interface, which makes
      reprogram_{gp, fixed}_counter() static and eliminates their
      EXPORT_SYMBOL_GPL.
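
      A sketch of the resulting call shape (simplified, not the verbatim
      kernel code):

        void reprogram_counter(struct kvm_pmc *pmc)
        {
                if (pmc_is_gp(pmc))
                        reprogram_gp_counter(pmc);      /* now static */
                else
                        reprogram_fixed_counter(pmc);   /* now static */
        }
        EXPORT_SYMBOL_GPL(reprogram_counter);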
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220518132512.37864-8-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Drop "u64 eventsel" for reprogram_gp_counter() · fb121aaf
      Authored by Like Xu
      Because reprogram_gp_counter() is bound to assign the requested eventsel
      to pmc->eventsel anyway, this assignment step can be moved forward into
      the caller, simplifying the parameter list to "struct kvm_pmc *pmc" only.
      
      No functional change intended.
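
      A before/after sketch of the call site (illustrative):

        /* Before: the helper stored eventsel itself. */
        reprogram_gp_counter(pmc, eventsel);

        /* After: the caller stores eventsel first, then reprograms. */
        pmc->eventsel = eventsel;
        reprogram_gp_counter(pmc);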
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220518132512.37864-6-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: always allow host-initiated writes to PMU MSRs · d1c88a40
      Authored by Paolo Bonzini
      Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it must always be
      retrievable and settable with KVM_GET_MSR and KVM_SET_MSR.  Accept
      the PMU MSRs unconditionally in intel_is_valid_msr, if the access was
      host-initiated.
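
      A sketch of the host-initiated bypass (the signature and the helper
      names here are illustrative assumptions, not the exact kernel code):

        static bool intel_is_valid_msr(struct kvm_vcpu *vcpu, u32 msr,
                                       bool host_initiated)
        {
                /* Host-initiated accesses bypass guest-visibility checks. */
                if (host_initiated && msr_is_pmu_msr(msr))      /* illustrative */
                        return true;

                return guest_can_access_msr(vcpu, msr);         /* illustrative */
        }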
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Transparently handle L1 -> L2 NMI re-injection · 159fc6fa
      Authored by Maciej S. Szmigiero
      An NMI that L1 wants to inject into its L2 should be directly re-injected,
      without causing L0 side effects like engaging NMI blocking for L1.

      It's also worth noting that in this case it is L1's responsibility
      to track the NMI window status for its L2 guest.
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Differentiate Soft vs. Hard IRQs vs. reinjected in tracepoint · 2d613912
      Authored by Sean Christopherson
      In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft
      "IRQs", i.e. interrupts that are reinjected after incomplete delivery of
      a software interrupt from an INTn instruction.  Tag reinjected interrupts
      as such, even though the information is usually redundant since soft
      interrupts are only ever reinjected by KVM.  Though rare in practice, a
      hard IRQ can be reinjected.
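
      A sketch of the extended tracepoint output (format string illustrative):

        /* trace_kvm_inj_virq() now records the soft and reinjected flags. */
        TP_printk("irq %u%s%s", __entry->vector,
                  __entry->soft ? " (soft)" : "",
                  __entry->reinjected ? " (reinjected)" : "")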
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      [MSS: change "kvm_inj_virq" event "reinjected" field type to bool]
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Re-inject INTn instead of retrying the insn on "failure" · 7e5b5ef8
      Authored by Sean Christopherson
      Re-inject INTn software interrupts instead of retrying the instruction if
      the CPU encountered an intercepted exception while vectoring the INTn,
      e.g. if KVM intercepted a #PF when utilizing shadow paging.  Retrying the
      instruction is architecturally wrong, e.g. it will result in a spurious
      #DB if there's a code breakpoint on the INTn, and lack of re-injection
      also breaks nested virtualization, e.g. if L1 injects a software
      interrupt and vectoring the injected interrupt encounters an exception
      that is intercepted by L0 but not L1.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Re-inject INT3/INTO instead of retrying the instruction · 6ef88d6e
      Authored by Sean Christopherson
      Re-inject INT3/INTO instead of retrying the instruction if the CPU
      encountered an intercepted exception while vectoring the software
      exception, e.g. if vectoring INT3 encounters a #PF and KVM is using
      shadow paging.  Retrying the instruction is architecturally wrong, e.g.
      it will result in a spurious #DB if there's a code breakpoint on the
      INT3/INTO, and lack of re-injection also breaks nested virtualization,
      e.g. if L1 injects a software exception and vectoring the injected
      exception encounters an exception that is intercepted by L0 but not L1.
      
      Due to, ahem, deficiencies in the SVM architecture, acquiring the next
      RIP may require flowing through the emulator even if NRIPS is supported,
      as the CPU clears next_rip if the VM-Exit is due to an exception other
      than "exceptions caused by the INT3, INTO, and BOUND instructions".  To
      deal with this, "skip" the instruction to calculate next_rip (if it's
      not already known), and then unwind the RIP write and any side effects
      (RFLAGS updates).
      
      Save the computed next_rip and use it to re-stuff next_rip if injection
      doesn't complete.  This allows KVM to do the right thing if next_rip was
      known prior to injection, e.g. if L1 injects a soft event into L2, and
      there is no backing INTn instruction, e.g. if L1 is injecting an
      arbitrary event.
      
      Note, it's impossible to guarantee architectural correctness given SVM's
      architectural flaws.  E.g. if the guest executes INTn (no KVM injection),
      an exit occurs while vectoring the INTn, and the guest modifies the code
      stream while the exit is being handled, KVM will compute the incorrect
      next_rip due to "skipping" the wrong instruction.  A future enhancement
      to make this less awful would be for KVM to detect that the decoded
      instruction is not the correct INTn and drop the to-be-injected soft
      event (retrying is a lesser evil compared to shoving the wrong RIP on the
      exception stack).
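
      A control-flow sketch of the "skip, record, unwind" step described above
      (simplified; the field and helper names follow the description, not
      necessarily the exact kernel code):

        old_rip    = kvm_rip_read(vcpu);
        old_rflags = svm->vmcb->save.rflags;

        /* "Skip" the INT3/INTO to learn next_rip, possibly via the emulator. */
        if (!svm_skip_emulated_instruction(vcpu))
                return false;
        svm->soft_int_next_rip = kvm_rip_read(vcpu);

        /* Unwind the RIP write and the RFLAGS side effects. */
        svm->vmcb->save.rflags = old_rflags;
        kvm_rip_write(vcpu, old_rip);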
      Reported-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Stuff next_rip on emulated INT3 injection if NRIPS is supported · 3741aec4
      Authored by Sean Christopherson
      If NRIPS is supported in hardware but disabled in KVM, set next_rip to
      the next RIP when advancing RIP as part of emulating INT3 injection.
      There is no flag to tell the CPU that KVM isn't using next_rip, so
      leaving next_rip as-is will result in the CPU pushing garbage onto the
      stack when vectoring the injected event.
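
      A sketch of the stuffing (simplified):

        /* KVM advanced RIP itself, but the CPU still consumes next_rip. */
        rip = kvm_rip_read(vcpu);       /* RIP after the emulated skip */
        if (boot_cpu_has(X86_FEATURE_NRIPS))
                svm->vmcb->control.next_rip = rip;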
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Unwind "speculative" RIP advancement if INTn injection "fails" · cd9e6da8
      Authored by Sean Christopherson
      Unwind the RIP advancement done by svm_queue_exception() when injecting
      an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while
      vectoring the injected event, even if the exception reported by the CPU
      isn't the same event that was injected.  If vectoring INT3 encounters an
      exception, e.g. #NP, and vectoring the #NP encounters an intercepted
      exception, e.g. #PF when KVM is using shadow paging, then the #NP will
      be reported as the event that was in-progress.
      
      Note, this is still imperfect, as it will get a false positive if the
      INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3
      handler in the guest, the instruction following the INT3 generates an
      exception (directly or indirectly), _and_ vectoring that exception
      encounters an exception that is intercepted by KVM.  The false positives
      could theoretically be solved by further analyzing the vectoring event,
      e.g. by comparing the error code against the expected error code were an
      exception to occur when vectoring the original injected exception, but
      SVM without NRIPS is a complete disaster, so trying to make it 100%
      correct is a waste of time.
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Don't BUG if userspace injects an interrupt with GIF=0 · f17c31c4
      Authored by Maciej S. Szmigiero
      Don't BUG/WARN on interrupt injection due to GIF being cleared,
      since it's trivial for userspace to force the situation via
      KVM_SET_VCPU_EVENTS (even though having at least a WARN there would be
      correct for KVM-internally generated injections).
      
        kernel BUG at arch/x86/kvm/svm/svm.c:3386!
        invalid opcode: 0000 [#1] SMP
        CPU: 15 PID: 926 Comm: smm_test Not tainted 5.17.0-rc3+ #264
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:svm_inject_irq+0xab/0xb0 [kvm_amd]
        Code: <0f> 0b 0f 1f 00 0f 1f 44 00 00 80 3d ac b3 01 00 00 55 48 89 f5 53
        RSP: 0018:ffffc90000b37d88 EFLAGS: 00010246
        RAX: 0000000000000000 RBX: ffff88810a234ac0 RCX: 0000000000000006
        RDX: 0000000000000000 RSI: ffffc90000b37df7 RDI: ffff88810a234ac0
        RBP: ffffc90000b37df7 R08: ffff88810a1fa410 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
        R13: ffff888109571000 R14: ffff88810a234ac0 R15: 0000000000000000
        FS:  0000000001821380(0000) GS:ffff88846fdc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f74fc550008 CR3: 000000010a6fe000 CR4: 0000000000350ea0
        Call Trace:
         <TASK>
         inject_pending_event+0x2f7/0x4c0 [kvm]
         kvm_arch_vcpu_ioctl_run+0x791/0x17a0 [kvm]
         kvm_vcpu_ioctl+0x26d/0x650 [kvm]
         __x64_sys_ioctl+0x82/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Fixes: 219b65dc ("KVM: SVM: Improve nested interrupt injection")
      Cc: stable@vger.kernel.org
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <35426af6e123cbe91ec7ce5132ce72521f02b1b5.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nSVM: Sync next_rip field from vmcb12 to vmcb02 · 00f08d99
      Authored by Maciej S. Szmigiero
      The next_rip field of a VMCB is *not* an output-only field for a VMRUN.
      This field value (instead of the saved guest RIP) is used by the CPU for
      the return address pushed on the stack when injecting a software
      interrupt or an INT3 or INTO exception.

      Make sure this field gets synced from vmcb12 to vmcb02 when entering L2
      or loading a nested state and NRIPS is exposed to L1.  If NRIPS is
      supported in hardware but not exposed to L1 (nrips=0 or hidden by
      userspace), stuff vmcb02's next_rip from the new L2 RIP to emulate a
      !NRIPS CPU (which saves RIP on the stack as-is).
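
      A sketch of the sync described above (simplified from the nested VMRUN
      setup path):

        if (svm->nrips_enabled)                         /* NRIPS exposed to L1 */
                vmcb02->control.next_rip = vmcb12->control.next_rip;
        else if (boot_cpu_has(X86_FEATURE_NRIPS))       /* emulate a !NRIPS CPU */
                vmcb02->control.next_rip = vmcb12->save.rip;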
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
      Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 07 Jun 2022, 1 commit
    • KVM: SVM: fix tsc scaling cache logic · 11d39e8c
      Authored by Maxim Levitsky
      SVM uses a per-cpu variable to cache the current value of the
      TSC scaling multiplier MSR on each CPU.

      Commit 1ab9287a
      ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      broke this caching logic.

      Refactor the code so that all TSC scaling multiplier writes go through
      a single function which checks and updates the cache.

      This fixes the following scenario:

      1. A CPU runs a guest with some TSC scaling ratio.

      2. A new guest with a different TSC scaling ratio starts on this CPU
         and terminates almost immediately.

         The short-running guest set its TSC scaling ratio just once, when it
         was configured via KVM_SET_TSC_KHZ.  Due to the bug, the per-cpu
         cache is not updated.

      3. The original guest continues to run; it doesn't restore the MSR
         value back to its own value because the (stale) cache matches, and
         thus continues to run with a wrong TSC scaling ratio.
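
      A sketch of the single write path with the per-cpu cache (close to the
      description above; simplified):

        static void __svm_write_tsc_multiplier(u64 multiplier)
        {
                preempt_disable();

                /* Only hit the MSR when the cached value is actually stale. */
                if (multiplier != __this_cpu_read(current_tsc_ratio)) {
                        wrmsrl(MSR_AMD64_TSC_RATIO, multiplier);
                        __this_cpu_write(current_tsc_ratio, multiplier);
                }

                preempt_enable();
        }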
      
      Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 25 May 2022, 1 commit
  4. 12 May 2022, 1 commit
    • KVM: x86/mmu: Add shadow_me_value and repurpose shadow_me_mask · e54f1ff2
      Authored by Kai Huang
      Intel Multi-Key Total Memory Encryption (MKTME) repurposes a couple of
      high physical address bits as 'KeyID' bits.  Intel Trust Domain
      Extensions (TDX) further steals part of the MKTME KeyID bits as TDX
      private KeyID bits.  TDX private KeyID bits cannot be set in any mapping
      in the host kernel since they can only be accessed by software running
      inside a new CPU isolated mode.  And unlike AMD's SME, the host kernel
      doesn't set any legacy MKTME KeyID bits in any mapping either.
      Therefore, it's not legitimate for KVM to set any KeyID bits in an SPTE
      which maps guest memory.
      
      KVM maintains shadow_zero_check bits to represent which bits must be
      zero for an SPTE which maps guest memory.  MKTME KeyID bits should be
      set in shadow_zero_check.  Currently, shadow_me_mask is used by AMD to
      set the sme_me_mask in the SPTE, and shadow_me_mask is excluded from
      shadow_zero_check.  So initializing shadow_me_mask to represent all
      MKTME KeyID bits doesn't work for VMX (where, oppositely, they must be
      set in shadow_zero_check).
      
      Introduce a new 'shadow_me_value' to replace existing shadow_me_mask,
      and repurpose shadow_me_mask as 'all possible memory encryption bits'.
      The new schematic of them will be:
      
       - shadow_me_value: the memory encryption bit(s) that will be set to the
         SPTE (the original shadow_me_mask).
       - shadow_me_mask: all possible memory encryption bits (which is a super
         set of shadow_me_value).
       - For now, shadow_me_value is supposed to be set by SVM and VMX
         respectively, and it is a constant during KVM's life time.  This
         perhaps doesn't fit MKTME but for now host kernel doesn't support it
         (and perhaps will never do).
       - Bits in shadow_me_mask are set to shadow_zero_check, except the bits
         in shadow_me_value.
      
      Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them.
      Replace shadow_me_mask with shadow_me_value in almost all code paths,
      except the one in PT64_PERM_MASK, which is used by need_remote_flush()
      to determine whether remote TLB flush is needed.  This should still use
      shadow_me_mask as any encryption bit change should need a TLB flush.
      And for AMD, move initializing shadow_me_value/shadow_me_mask from
      kvm_mmu_reset_all_pte_masks() to svm_hardware_setup().
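
      A sketch of the new helper and its effect on the zero-check (simplified;
      the zero-check line is illustrative):

        void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
        {
                /* me_value must be a subset of all possible encryption bits. */
                WARN_ON(me_value & ~me_mask);
                shadow_me_value = me_value;
                shadow_me_mask  = me_mask;
        }

        /* Possible-but-unused encryption bits must be zero in SPTEs: */
        reserved_bits |= shadow_me_mask & ~shadow_me_value;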
      Signed-off-by: Kai Huang <kai.huang@intel.com>
      Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 07 May 2022, 1 commit
  6. 03 May 2022, 1 commit
    • KVM: x86/svm: Account for family 17h event renumberings in amd_pmc_perf_hw_id · 5eb84932
      Authored by Kyle Huey
      Zen renumbered some of the performance counters that correspond to the
      well-known events in perf_hw_id.  This code in KVM was never updated for
      that, so guests that attempt to use counters on Zen that correspond to
      the pre-Zen perf_hw_id values will silently receive the wrong values.

      This has been observed in the wild with rr [0] when running in Zen 3
      guests.  rr uses the retired conditional branch counter 00d1, which is
      incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND.
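
      A sketch of the fix's shape (the Zen table name is illustrative):

        /* Select the event table based on the *guest's* CPU family. */
        if (guest_cpuid_family(vcpu) >= 0x17)           /* Zen and later */
                map = amd_zen_event_mapping;            /* illustrative name */
        else
                map = amd_event_mapping;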
      
      [0] https://rr-project.org/

      Signed-off-by: Kyle Huey <me@kylehuey.com>
      Message-Id: <20220503050136.86298-1-khuey@kylehuey.com>
      Cc: stable@vger.kernel.org
      [Check guest family, not host. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 30 Apr 2022, 5 commits
  8. 22 Apr 2022, 4 commits
    • KVM: SEV: add cache flush to solve SEV cache incoherency issues · 683412cc
      Authored by Mingwei Zhang
      Flush the CPU caches when memory is reclaimed from an SEV guest (where
      reclaim also includes it being unmapped from KVM's memslots).  Due to lack
      of coherency for SEV encrypted memory, failure to flush results in silent
      data corruption if userspace is malicious/broken and doesn't ensure SEV
      guest memory is properly pinned and unpinned.
      
      Cache coherency is not enforced across the VM boundary in SEV (AMD APM
      vol.2 Section 15.34.7).  Confidential cachelines, generated by
      confidential VM guests, have to be explicitly flushed on the host side.
      If a memory page containing dirty confidential cachelines is released by
      the VM and reallocated to another user, the cachelines may corrupt the
      new user at a later time.

      KVM takes a shortcut by assuming all confidential memory remains pinned
      until the end of the VM's lifetime, and therefore does not flush caches
      on mmu_notifier invalidation events.  Because of this incorrect
      assumption and the lack of cache flushing, malicious userspace can crash
      the host kernel by creating a malicious VM that continuously allocates
      and releases unpinned confidential memory pages while the VM is running.

      Add cache flush operations to the mmu_notifier operations to ensure that
      any physical memory leaving the guest VM gets flushed.  In particular,
      hook the mmu_notifier_invalidate_range_start and mmu_notifier_release
      events and flush caches accordingly.  The hooks run after releasing the
      mmu lock to avoid contention with other vCPUs.
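
      A sketch of the reclaim-side hook (matching the description; the blunt
      flush-everywhere approach is a simplification):

        void sev_guest_memory_reclaimed(struct kvm *kvm)
        {
                if (!sev_guest(kvm))
                        return;

                /* Flush possibly-dirty confidential cachelines host-wide. */
                wbinvd_on_all_cpus();
        }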
      
      Cc: stable@vger.kernel.org
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Flush when freeing encrypted pages even on SME_COHERENT CPUs · d45829b3
      Authored by Mingwei Zhang
      Use clflush_cache_range() to flush the confidential memory when
      SME_COHERENT is supported by the AMD CPU.  A cache flush is still needed
      since SME_COHERENT only supports cache invalidation on the CPU side; all
      confidential cache lines are still incoherent with DMA devices.
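
      A sketch of the per-page flush path (simplified):

        /* Even with SME_COHERENT, flush before a page leaves the guest;
         * the coherency guarantee covers CPU caches, not DMA. */
        if (boot_cpu_has(X86_FEATURE_SME_COHERENT)) {
                clflush_cache_range(va, PAGE_SIZE);
                return;
        }
        /* ...otherwise fall through to the WBINVD / VM_PAGE_FLUSH path. */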
      
      Cc: stable@vger.kernel.org
      
      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-3-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Simplify and harden helper to flush SEV guest page(s) · 4bbef7e8
      Authored by Sean Christopherson
      Rework sev_flush_guest_memory() to explicitly handle only a single page,
      and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails.  Per-page
      flushing is currently used only to flush the VMSA, and in its current
      form, the helper is completely broken with respect to flushing actual
      guest memory, i.e. won't work correctly for an arbitrary memory range.
      
      VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page
      walks, i.e. will fault if the address is not present in the host page
      tables or does not have the correct permissions.  Current AMD CPUs also
      do not honor SMAP overrides (undocumented in kernel versions of the APM),
      so passing in a userspace address is completely out of the question.  In
      other words, KVM would need to manually walk the host page tables to get
      the pfn, ensure the pfn is stable, and then use the direct map to invoke
      VM_PAGE_FLUSH.  And the latter might not even work, e.g. if userspace is
      particularly evil/clever and backs the guest with Secret Memory (which
      unmaps memory from the direct map).
      Signed-off-by: Sean Christopherson <seanjc@google.com>

      Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA")
      Reported-by: Mingwei Zhang <mizhang@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220421031407.2516575-2-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/pmu: Update AMD PMC sample period to fix guest NMI-watchdog · 75189d1d
      Authored by Like Xu
      The NMI watchdog is one of kernel developers' favorite features, but it
      does not work in AMD guests even with vPMU enabled; worse, the system
      misrepresents this capability via /proc.

      This is a PMC emulation error.  KVM does not pass the latest valid
      value to perf_event in time when the guest NMI watchdog is running, thus
      the perf_event corresponding to the watchdog counter will enter the
      old state at some point after the first guest NMI injection, forcing
      the hardware register PMC0 to be constantly written as 0x800000000001.

      Meanwhile, the running counter should accurately reflect its new value
      based on the latest coordinated pmc->counter (from the vPMC's point of
      view) rather than the value written directly by the guest.
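
      A sketch of the helper this suggests (close to the description;
      simplified):

        /* Re-sync the perf_event sample period with the up-to-date counter. */
        static void pmc_update_sample_period(struct kvm_pmc *pmc)
        {
                if (!pmc->perf_event || pmc->is_paused)
                        return;

                perf_event_period(pmc->perf_event,
                                  get_sample_period(pmc, pmc->counter));
        }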
      
      Fixes: 168d918f ("KVM: x86: Adjust counter sample period after a wrmsr")
      Reported-by: Dongli Cao <caodongli@kingsoft.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Yanan Wang <wangyanan55@huawei.com>
      Tested-by: Yanan Wang <wangyanan55@huawei.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20220409015226.38619-1-likexu@tencent.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 14 Apr 2022, 3 commits
    • KVM, SEV: Add KVM_EXIT_SHUTDOWN metadata for SEV-ES · c24a950e
      Authored by Peter Gonda
      If an SEV-ES guest requests termination, exit to userspace with
      KVM_EXIT_SYSTEM_EVENT and a dedicated SEV_TERM type instead of -EINVAL
      so that userspace can take appropriate action.
      
      See AMD's GHCB spec section '4.1.13 Termination Request' for more details.
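
      A sketch of the userspace-visible change (simplified):

        /* Report the guest's termination request instead of -EINVAL. */
        vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
        vcpu->run->system_event.type = KVM_SYSTEM_EVENT_SEV_TERM;
        return 0;       /* exit to userspace */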
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Message-Id: <20220407210233.782250-1-pgonda@google.com>
      [Add documentation. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Drop WARNs that assert a triple fault never "escapes" from L2 · 45846661
      Authored by Sean Christopherson
      Remove WARNs that sanity check that KVM never lets a triple fault for L2
      escape and incorrectly end up in L1.  In normal operation, the sanity
      check is perfectly valid, but it incorrectly assumes that it's impossible
      for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through
      KVM_RUN (which guarantees kvm_check_nested_state() will see and handle
      the triple fault).
      
      The WARN can currently be triggered if userspace injects a machine check
      while L2 is active and CR4.MCE=0.  And a future fix to allow save/restore
      of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't
      lost on migration, will make it trivially easy for userspace to trigger
      the WARN.
      
      Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is
      tempting, but wrong, especially if/when the request is saved/restored,
      e.g. if userspace restores events (including a triple fault) and then
      restores nested state (which may forcibly leave guest mode).  Ignoring
      the fact that KVM doesn't currently provide the necessary APIs, it's
      userspace's responsibility to manage pending events during save/restore.
      
        ------------[ cut here ]------------
        WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel]
        Call Trace:
         <TASK>
         vmx_leave_nested+0x30/0x40 [kvm_intel]
         vmx_set_nested_state+0xca/0x3e0 [kvm_intel]
         kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm]
         kvm_vcpu_ioctl+0x4b9/0x660 [kvm]
         __x64_sys_ioctl+0x83/0xb0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        ---[ end trace 0000000000000000 ]---
      
      Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1")
      Cc: stable@vger.kernel.org
      Cc: Chenyi Qiang <chenyi.qiang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220407002315.78092-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move .pmu_ops to kvm_x86_init_ops and tag as __initdata · 34886e79
      Authored by Like Xu
      The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata.
      That'll save those precious few bytes, and more importantly make
      the original ops unreachable, i.e. make it harder to sneak in post-init
      modification bugs.
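
      A sketch of the end state (structure illustrative):

        /* pmu_ops now lives in the init-only ops and is freed after init. */
        struct kvm_x86_init_ops svm_init_ops __initdata = {
                .hardware_setup = svm_hardware_setup,
                .runtime_ops    = &svm_x86_ops,
                .pmu_ops        = &amd_pmu_ops,
        };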
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220329235054.3534728-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 12 Apr 2022, 1 commit
  11. 06 Apr 2022, 2 commits
  12. 05 Apr 2022, 1 commit
  13. 02 Apr 2022, 6 commits