1. 15 July 2021, 2 commits
    • KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() · f85d4016
      Lai Jiangshan authored
      When the host is using debug registers but the guest is not using them
      nor is the guest in guest-debug state, the kvm code does not reset
      the host debug registers before kvm_x86->run().  Rather, it relies on
      the hardware vmentry instruction to automatically reset the dr7 registers
      which ensures that the host breakpoints do not affect the guest.
      
      This however violates the non-instrumentable nature around VM entry
      and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
      
      Another issue is consistency.  When the guest debug registers are active,
      the host breakpoints are reset before kvm_x86->run(). But when the
      guest debug registers are inactive, the host breakpoints are delayed to
      be disabled.  The host tracing tools may see different results depending
      on what the guest is doing.
      
      To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
      if the host has set any breakpoints, no matter if the guest is using
      them or not.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      [Only clear %db7 instead of reloading all debug registers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f85d4016
    • Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled" · f0414b07
      Sean Christopherson authored
      Let KVM load if EFER.NX=0 even if NX is supported; the analysis and
      testing (or lack thereof) for the non-PAE host case was garbage.
      
      If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
      skips over the entire EFER sequence.  Hopefully that can be changed in
      the future to allow KVM to require EFER.NX, but the motivation behind
      KVM's requirement isn't yet merged.  Reverting and revisiting the mess
      at a later date is by far the safest approach.
      
      This reverts commit 8bbed95d.
      
      Fixes: 8bbed95d ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210625001853.318148-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f0414b07
  2. 25 June 2021, 8 commits
    • kvm: x86: Allow userspace to handle emulation errors · 19238e75
      Aaron Lewis authored
      Add a fallback mechanism to the in-kernel instruction emulator that
      allows userspace the opportunity to process an instruction the emulator
      was unable to.  When the in-kernel instruction emulator fails to process
      an instruction it will either inject a #UD into the guest or exit to
      userspace with exit reason KVM_INTERNAL_ERROR.  This is because it does
      not know how to proceed in an appropriate manner.  This feature lets
      userspace get involved to see if it can figure out a better path
      forward.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19238e75
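      As an illustration of how a userspace VMM might consume this fallback, here is a hedged C sketch: the exit-reason fields (KVM_EXIT_INTERNAL_ERROR, KVM_INTERNAL_ERROR_EMULATION, run->internal) are long-standing KVM UAPI, while the capability name is an assumption taken from this series and should be checked against your kernel headers.

```c
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Opt the VM in to the userspace-emulation fallback.  The capability name
 * below is assumed from this series; verify it against <linux/kvm.h>. */
static int enable_emulation_fallback(int vm_fd)
{
    struct kvm_enable_cap cap = {
        .cap  = KVM_CAP_EXIT_ON_EMULATION_FAILURE,  /* assumed name */
        .args = { 1 },
    };

    return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

/* Called after ioctl(vcpu_fd, KVM_RUN, 0) returns. */
static void handle_emulation_failure(struct kvm_run *run)
{
    if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
        run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
        /* run->internal.data[] may carry context such as the failed
         * instruction bytes; the VMM can try to handle the instruction
         * itself or report a descriptive error to the user. */
        fprintf(stderr, "in-kernel emulation failed, ndata=%u\n",
                run->internal.ndata);
    }
}
```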
    • KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper · 20f632bd
      Sean Christopherson authored
      Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
      a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
      as the CR0/CR4 update masks must exactly match the mmu_role bits, with
      one exception (see below).  The "full" CR0/CR4 will be used by future
      commits to initialize the MMU and its role, as opposed to the current
      approach of pulling everything from vCPU, which is incorrect for certain
      flows, e.g. nested NPT.
      
      CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
      but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
      needs to be in the mmu_role, but technically doesn't need to be checked
      by kvm_post_set_cr4().  However, the extra check is completely benign as
      the hardware restrictions simply mean LA57 will never be _the_ cause of
      a MMU reset during MOV CR4.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20f632bd
    • KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER · dbc4739b
      Sean Christopherson authored
      When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
      as a u64 in various flows (mostly MMU).  Passing the params as u32s is
      functionally ok since all of the affected registers reserve bits 63:32 to
      zero (enforced by KVM), but it's technically wrong.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dbc4739b
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Sean Christopherson authored
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      63f5a190
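      A hedged sketch of the ordering this warning implies for userspace: fully configure the vCPU model (CPUID in particular) before the first KVM_RUN, and avoid changing paging-related leaves afterwards. Error handling is minimal for brevity.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int configure_then_run(int vcpu_fd, struct kvm_cpuid2 *cpuid)
{
    /* 1. Set the CPUID model while the vCPU has never entered the guest. */
    if (ioctl(vcpu_fd, KVM_SET_CPUID2, cpuid) < 0)
        return -1;

    /* 2. Only now start running; a later KVM_SET_CPUID2 that changes
     *    MAXPHYADDR, GBPAGES, etc. "may" destabilize the guest. */
    return ioctl(vcpu_fd, KVM_RUN, 0);
}
```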
    • KVM: x86: Properly reset MMU context at vCPU RESET/INIT · 0aa18375
      Sean Christopherson authored
      Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
      was set prior to INIT.  Simply re-initializing the current MMU is not
      sufficient as the current root HPA may not be usable in the new context.
      E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
      KVM will fail to switch to the 32-bit pae_root and bomb on the next
      VM-Enter due to running with a 64-bit CR3 in 32-bit mode.
      
      This bug was papered over in both VMX and SVM, but still managed to rear
      its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
      kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
      checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
      an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
      generate the wrong MMU role.
      
      In VMX, the INIT issue is specific to running without unrestricted guest
      since unrestricted guest is available if and only if EPT is enabled.
      Commit 8668a3c4 ("KVM: VMX: Reset mmu context when entering real
      mode") resolved the issue by forcing a reset when entering emulated real
      mode.
      
      In SVM, commit ebae871a ("kvm: svm: reset mmu on VCPU reset") forced
      a MMU reset on every INIT to work around the flaw in common x86.  Note, at
      the time the bug was fixed, the SVM problem was exacerbated by a complete
      lack of a CR4 update.
      
      The vendor resets will be reverted in future patches, primarily to aid
      bisection in case there are non-INIT flows that rely on the existing VMX
      logic.
      
      Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
      all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
      CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
      context reset.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0aa18375
    • KVM: debugfs: Reuse binary stats descriptors · bc9e9e67
      Jing Zhang authored
      To remove code duplication, use the binary stats descriptors in the
      implementation of the debugfs interface for statistics. This unifies
      the definition of statistics for the binary and debugfs interfaces.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bc9e9e67
    • KVM: stats: Support binary stats retrieval for a VCPU · ce55c049
      Jing Zhang authored
      Add a VCPU ioctl to get a statistics file descriptor by which a read
      functionality is provided for userspace to read out VCPU stats header,
      descriptors and data.
      Define VCPU statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce55c049
    • KVM: stats: Support binary stats retrieval for a VM · fcfe1bae
      Jing Zhang authored
      Add a VM ioctl to get a statistics file descriptor by which a read
      functionality is provided for userspace to read out VM stats header,
      descriptors and data.
      Define VM statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fcfe1bae
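      Both this VM-level ioctl and the vCPU-level one in the previous entry share the same file-descriptor read protocol: a header, then an array of descriptors, then the data block. A hedged C sketch follows; the kvm_stats_header/kvm_stats_desc field names are taken from this series and should be verified against your <linux/kvm.h>, and error handling is kept minimal.

```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int dump_binary_stats(int kvm_obj_fd)   /* a VM fd or a vCPU fd */
{
    int stats_fd = ioctl(kvm_obj_fd, KVM_GET_STATS_FD, NULL);
    struct kvm_stats_header hdr;
    size_t desc_sz;
    char *descs;

    if (stats_fd < 0)
        return -1;
    if (pread(stats_fd, &hdr, sizeof(hdr), 0) != sizeof(hdr))
        return -1;

    /* Each descriptor is immediately followed by its name (name_size bytes). */
    desc_sz = sizeof(struct kvm_stats_desc) + hdr.name_size;
    descs = malloc(desc_sz * hdr.num_desc);
    if (!descs)
        return -1;
    pread(stats_fd, descs, desc_sz * hdr.num_desc, hdr.desc_offset);

    for (__u32 i = 0; i < hdr.num_desc; i++) {
        struct kvm_stats_desc *d = (void *)(descs + i * desc_sz);
        __u64 val = 0;

        /* d->offset is relative to the data block; plain counters have
         * size 1, i.e. a single __u64 value. */
        pread(stats_fd, &val, sizeof(val), hdr.data_offset + d->offset);
        printf("%s = %llu\n", d->name, (unsigned long long)val);
    }

    free(descs);
    close(stats_fd);
    return 0;
}
```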
  3. 24 June 2021, 5 commits
  4. 23 June 2021, 1 commit
  5. 18 June 2021, 24 commits
    • KVM: x86: WARN and reject loading KVM if NX is supported but not enabled · 8bbed95d
      Sean Christopherson authored
      WARN if NX is reported as supported but not enabled in EFER.  All flavors
      of the kernel, including non-PAE 32-bit kernels, set EFER.NX=1 if NX is
      supported, even if NX usage is disable via kernel command line.  KVM relies
      on NX being enabled if it's supported, e.g. KVM will generate illegal NPT
      entries if nx_huge_pages is enabled and NX is supported but not enabled.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20210615164535.2146172-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8bbed95d
    • KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall · 0dbb1123
      Ashish Kalra authored
      This hypercall is used by the SEV guest to notify a change in the page
      encryption status to the hypervisor. The hypercall should be invoked
      only when the encryption attribute is changed from encrypted -> decrypted
      and vice versa. By default all guest pages are considered encrypted.
      
      The hypercall exits to userspace to manage the guest shared regions and
      integrate with the userspace VMM's migration code.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Co-developed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Co-developed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0dbb1123
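      A hedged guest-side sketch of invoking the hypercall is shown below. The register convention (number in RAX, arguments in RBX/RCX/RDX) is KVM's standard hypercall ABI; the KVM_MAP_GPA_RANGE_* attribute names in the usage comment are assumptions taken from this patch series.

```c
#include <stdint.h>
#include <linux/kvm_para.h>   /* provides KVM_HC_MAP_GPA_RANGE after this series */

static inline long kvm_map_gpa_range(uint64_t gpa, uint64_t npages,
                                     uint64_t attrs)
{
    long ret;

    /* "vmmcall" on AMD hardware; Intel guests would use "vmcall". */
    asm volatile("vmmcall"
                 : "=a"(ret)
                 : "a"(KVM_HC_MAP_GPA_RANGE), "b"(gpa), "c"(npages), "d"(attrs)
                 : "memory");
    return ret;
}

/*
 * Example: mark one 4K page at 'gpa' as shared (decrypted) before handing it
 * to the host, e.g. for unencrypted DMA (assumed flag names):
 *
 *   kvm_map_gpa_range(gpa, 1,
 *                     KVM_MAP_GPA_RANGE_DECRYPTED | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
 */
```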
    • KVM: x86: Check for pending interrupts when APICv is getting disabled · bca66dbc
      Vitaly Kuznetsov authored
      When APICv is active, interrupt injection doesn't raise KVM_REQ_EVENT
      request (see __apic_accept_irq()) as the required work is done by hardware.
      In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt
      may never get delivered.
      
      Currently, the described situation is hardly possible: all
      kvm_request_apicv_update() calls normally happen upon VM creation when
      no interrupts are pending. We are, however, going to move unconditional
      kvm_request_apicv_update() call from kvm_hv_activate_synic() to
      synic_update_vector() and without this fix 'hyperv_connections' test from
      kvm-unit-tests gets stuck on IPI delivery attempt right after configuring
      a SynIC route which triggers APICv disablement.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210609150911.1471882c-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bca66dbc
    • KVM: x86: Drop pointless @reset_roots from kvm_init_mmu() · c9060662
      Sean Christopherson authored
      Remove the @reset_roots param from kvm_init_mmu(); the one user,
      kvm_mmu_reset_context(), has already unloaded the MMU and thus freed and
      invalidated all roots.  This also happens to be why the reset_roots=true
      path doesn't leak roots; they're already invalid.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-14-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c9060662
    • KVM: x86: Defer MMU sync on PCID invalidation · e62f1aa8
      Sean Christopherson authored
      Defer the MMU sync on PCID invalidation so that multiple sync requests in
      a single VM-Exit are batched.  This is a very minor optimization as
      checking for unsync'd children is quite cheap.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-13-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e62f1aa8
    • KVM: x86: Use KVM_REQ_TLB_FLUSH_GUEST to handle INVPCID(ALL) emulation · 28f28d45
      Sean Christopherson authored
      Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating
      INVPCID of all contexts.  In the current code, this is a glorified nop as
      TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP
      is disabled, which is the only time INVPCID is intercepted and emulated.
      In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths
      that emulate a guest TLB flush, e.g. by synchronizing as needed instead
      of completely unloading all MMUs.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-11-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      28f28d45
    • KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers · b5129100
      Sean Christopherson authored
      Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that
      all call sites unconditionally skip both the sync and flush.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5129100
    • KVM: x86: Unconditionally skip MMU sync/TLB flush in MOV CR3's PGD switch · 415b1a01
      Sean Christopherson authored
      Stop leveraging the MMU sync and TLB flush requested by the fast PGD
      switch helper now that kvm_set_cr3() manually handles the necessary sync,
      frees, and TLB flush.  This will allow dropping the params from the fast
      PGD helpers since nested SVM is now the odd blob out.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      415b1a01
    • KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush · 21823fbd
      Sean Christopherson authored
      Flush and sync all PGDs for the current/target PCID on MOV CR3 with a
      TLB flush, i.e. without PCID_NOFLUSH set.  Paraphrasing Intel's SDM
      regarding the behavior of MOV to CR3:
      
        - If CR4.PCIDE = 0, invalidates all TLB entries associated with PCID
          000H and all entries in all paging-structure caches associated with
          PCID 000H.
      
        - If CR4.PCIDE = 1 and NOFLUSH=0, invalidates all TLB entries
          associated with the PCID specified in bits 11:0, and all entries in
          all paging-structure caches associated with that PCID. It is not
          required to invalidate entries in the TLBs and paging-structure
          caches that are associated with other PCIDs.
      
        - If CR4.PCIDE=1 and NOFLUSH=1, is not required to invalidate any TLB
          entries or entries in paging-structure caches.
      
      Extract and reuse the logic for INVPCID(single) which is effectively the
      same flow and works even if CR4.PCIDE=0, as the current PCID will be '0'
      in that case, thus honoring the requirement of flushing PCID=0.
      
      Continue passing skip_tlb_flush to kvm_mmu_new_pgd() even though it
      _should_ be redundant; the clean up will be done in a future patch.  The
      overhead of an unnecessary nop sync is minimal (especially compared to
      the actual sync), and the TLB flush is handled via request.  Avoiding the
      negligible overhead is not worth the risk of breaking kernels that
      backport the fix.
      
      Fixes: 956bf353 ("kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest")
      Cc: Junaid Shahid <junaids@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      21823fbd
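      The cited rules can be condensed into a small decision function. The following is an illustrative model of the architectural requirement only, not KVM's implementation:

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns the PCID whose TLB and paging-structure-cache entries must be
 * invalidated on MOV to CR3, or -1 if no invalidation is required. */
static int mov_cr3_pcid_to_flush(bool cr4_pcide, uint64_t new_cr3)
{
    const uint64_t CR3_PCID_NOFLUSH = 1ULL << 63;  /* bit 63 of the CR3 value */
    const uint64_t CR3_PCID_MASK    = 0xfffULL;    /* bits 11:0 */

    if (!cr4_pcide)
        return 0;                                  /* always PCID 000H */
    if (new_cr3 & CR3_PCID_NOFLUSH)
        return -1;                                 /* NOFLUSH=1: nothing required */
    return (int)(new_cr3 & CR3_PCID_MASK);         /* flush the target PCID */
}
```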
    • KVM: nVMX: Sync all PGDs on nested transition with shadow paging · 07ffaf34
      Sean Christopherson authored
      Trigger a full TLB flush on behalf of the guest on nested VM-Enter and
      VM-Exit when VPID is disabled for L2.  kvm_mmu_new_pgd() syncs only the
      current PGD, which can theoretically leave stale, unsync'd entries in a
      previous guest PGD, which could be consumed if L2 is allowed to load CR3
      with PCID_NOFLUSH=1.
      
      Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can
      be utilized for its obvious purpose of emulating a guest TLB flush.
      
      Note, there is no change to the actual TLB flush executed by KVM, even
      though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT.  When VPID is
      disabled for L2, vpid02 is guaranteed to be '0', and thus
      nested_get_vpid02() will return the VPID that is shared by L1 and L2.
      
      Generate the request outside of kvm_mmu_new_pgd(), as getting the common
      helper to correctly identify which request is needed is quite painful.
      E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as
      a TLB flush from the L1 kernel's perspective does not invalidate EPT
      mappings.  And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future
      simplification by moving the logic into nested_vmx_transition_tlb_flush().
      
      Fixes: 41fab65e ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609234235.1244004-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      07ffaf34
    • KVM: x86: avoid loading PDPTRs after migration when possible · 158a48ec
      Maxim Levitsky authored
      If the new KVM_*_SREGS2 ioctls are used, the PDPTRs are part of the
      migration state and are correctly restored by those ioctls.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      158a48ec
    • KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 · 6dba9403
      Maxim Levitsky authored
      This is a new version of KVM_GET_SREGS / KVM_SET_SREGS.
      
      It has the following changes:
         * Has flags for future extensions
         * Has the vCPU's PDPTRs, allowing them to be saved/restored on migration.
         * Lacks the obsolete interrupt bitmap (now done via KVM_SET_VCPU_EVENTS)

      A new capability, KVM_CAP_SREGS2, is added to signal the availability of
      this ioctl to userspace.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210607090203.133058-8-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6dba9403
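      A hedged sketch of how a VMM might use the new ioctls during migration. KVM_CAP_SREGS2, KVM_GET_SREGS2/KVM_SET_SREGS2 and struct kvm_sregs2 come from this series; the PDPTRS_VALID flag name is an assumption to verify against your headers, and error handling is minimal.

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int migrate_sregs2(int sys_fd /* /dev/kvm */, int src_vcpu, int dst_vcpu)
{
    struct kvm_sregs2 sregs2;

    if (ioctl(sys_fd, KVM_CHECK_EXTENSION, KVM_CAP_SREGS2) <= 0)
        return -1;                 /* fall back to KVM_GET/SET_SREGS */

    if (ioctl(src_vcpu, KVM_GET_SREGS2, &sregs2) < 0)
        return -1;

    /* If the source vCPU was using PAE paging, sregs2.pdptrs[0..3] hold the
     * loaded PDPTRs and sregs2.flags marks them valid (assumed flag:
     * KVM_SREGS2_FLAGS_PDPTRS_VALID), so the destination does not have to
     * re-read them from guest memory. */
    return ioctl(dst_vcpu, KVM_SET_SREGS2, &sregs2);
}
```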
    • KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE guest · c7313155
      Sean Christopherson authored
      Kill off pdptrs_changed() and instead go through the full kvm_set_cr3()
      for PAE guests, even if the new CR3 is the same as the current CR3.  For
      VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as
      unavailable after VM-Exit, i.e. the optimization is dead code except for
      SVM without NPT.
      
      In the unlikely scenario that anyone cares about SVM without NPT _and_ a
      PAE guest, they've got bigger problems if their guest is loading the same
      CR3 so frequently that the performance of kvm_set_cr3() is notable,
      especially since KVM's fast PGD switching means reloading the same CR3
      does not require a full rebuild.  Given that PAE and PCID are mutually
      exclusive, i.e. a sync and flush are guaranteed in any case, the actual
      benefits of the pdptrs_changed() optimization are marginal at best.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c7313155
    • KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_ENFORCE_CPUID · 644f7067
      Vitaly Kuznetsov authored
      Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows
      for limiting Hyper-V features to those exposed to the guest in Hyper-V
      CPUIDs (0x40000003, 0x40000004, ...).
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      644f7067
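      A hedged sketch of enabling the new capability, assuming it is a per-vCPU capability enabled via KVM_ENABLE_CAP (confirm the scope against the KVM API documentation):

```c
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Opt the vCPU in to strict Hyper-V CPUID enforcement, so Hyper-V features
 * not exposed via the 0x4000000x leaves are refused for the guest. */
static int enforce_hyperv_cpuid(int vcpu_fd)
{
    struct kvm_enable_cap cap = {
        .cap  = KVM_CAP_HYPERV_ENFORCE_CPUID,
        .args = { 1 },
    };

    return ioctl(vcpu_fd, KVM_ENABLE_CAP, &cap);
}
```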
    • KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx · 3c86c0d3
      Vineeth Pillai authored
      Currently the remote TLB flush logic is specific to VMX.
      Move it to a common place so that SVM can use it as well.
      Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com>
      Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3c86c0d3
    • KVM: nVMX: nSVM: Add a new VCPU statistic to show if VCPU is in guest mode · d5a0483f
      Krish Sadhukhan authored
      Add the following per-VCPU statistic to KVM debugfs to show if a given
      VCPU is in guest mode:
      
      	guest_mode
      
      Also add this as a per-VM statistic to KVM debugfs to show the total number
      of VCPUs that are in guest mode in a given VM.
      Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com>
      Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d5a0483f
    • KVM: x86: Drop "pre_" from enter/leave_smm() helpers · ecc513e5
      Sean Christopherson authored
      Now that .post_leave_smm() is gone, drop "pre_" from the remaining
      helpers.  The helpers aren't invoked purely before SMI/RSM processing,
      e.g. both helpers are invoked after state is snapshotted (from regs or
      SMRAM), and the RSM helper is invoked after some amount of register state
      has been stuffed.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ecc513e5
    • KVM: x86: Drop .post_leave_smm(), i.e. the manual post-RSM MMU reset · 01281165
      Sean Christopherson authored
      Drop the .post_leave_smm() emulator callback, which at this point is just
      a wrapper to kvm_mmu_reset_context().  The manual context reset is
      unnecessary, because unlike enter_smm() which calls vendor MSR/CR helpers
      directly, em_rsm() bounces through the KVM helpers, e.g. kvm_set_cr4(),
      which are responsible for processing side effects.  em_rsm() is already
      subtly relying on this behavior as it doesn't manually do
      kvm_update_cpuid_runtime(), e.g. to recognize CR4.OSXSAVE changes.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-9-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      01281165
    • KVM: x86: Rename SMM tracepoint to make it reflect reality · 1270e647
      Sean Christopherson authored
      Rename the SMM tracepoint, which handles both entering and exiting SMM,
      from kvm_enter_smm to kvm_smm_transition.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1270e647
    • KVM: x86: Move "entering SMM" tracepoint into kvm_smm_changed() · 0d7ee6f4
      Sean Christopherson authored
      Invoke the "entering SMM" tracepoint from kvm_smm_changed() instead of
      enter_smm(), effectively moving it from before reading vCPU state to
      after reading state (but still before writing it to SMRAM!).  The primary
      motivation is to consolidate code, but calling the tracepoint from
      kvm_smm_changed() also makes its invocation consistent with respect to
      SMI and RSM, and with respect to KVM_SET_VCPU_EVENTS (which previously
      only invoked the tracepoint when forcing the vCPU out of SMM).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-7-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0d7ee6f4
    • KVM: x86: Move (most) SMM hflags modifications into kvm_smm_changed() · dc87275f
      Sean Christopherson authored
      Move the core of SMM hflags modifications into kvm_smm_changed() and use
      kvm_smm_changed() in enter_smm().  Clear HF_SMM_INSIDE_NMI_MASK for
      leaving SMM but do not set it for entering SMM.  If the vCPU is executing
      outside of SMM, the flag should unequivocally be cleared, e.g. this
      technically fixes a benign bug where the flag could be left set after
      KVM_SET_VCPU_EVENTS, but the reverse is not true as NMI blocking depends
      on pre-SMM state or userspace input.
      
      Note, this adds an extra kvm_mmu_reset_context() to enter_smm().  The
      extra/early reset isn't strictly necessary, and in a way can never be
      necessary since the vCPU/MMU context is in a half-baked state until the
      final context reset at the end of the function.  But, enter_smm() is not
      a hot path, and exploding on an invalid root_hpa is probably better than
      having a stale SMM flag in the MMU role; it's at least no worse.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dc87275f
    • KVM: x86: Invoke kvm_smm_changed() immediately after clearing SMM flag · fa75e08b
      Sean Christopherson authored
      Move RSM emulation's call to kvm_smm_changed() from .post_leave_smm() to
      .exiting_smm(), leaving behind the MMU context reset.  The primary
      motivation is to allow for future cleanup, but this also fixes a bug of
      sorts by queueing KVM_REQ_EVENT even if RSM causes shutdown, e.g. to let
      an INIT wake the vCPU from shutdown.  Of course, KVM doesn't properly
      emulate a shutdown state, e.g. KVM doesn't block SMIs after shutdown, and
      immediately exits to userspace, so the event request is a moot point in
      practice.
      
      Moving kvm_smm_changed() also moves the RSM tracepoint.  This isn't
      strictly necessary, but will allow consolidating the SMI and RSM
      tracepoints in a future commit (by also moving the SMI tracepoint).
      Invoking the tracepoint before loading SMRAM state also means the SMBASE
      reported in the tracepoint will point at the state that will be used for
      RSM, as opposed to the SMBASE _after_ RSM completes, which is arguably a
      good thing if the tracepoint is being used to debug an RSM/SMM issue.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa75e08b
    • KVM: x86: Replace .set_hflags() with dedicated .exiting_smm() helper · edce4654
      Sean Christopherson authored
      Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(),
      moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in
      the process.  This is a step towards consolidating much of the logic in
      kvm_smm_changed(), including the SMM hflags updates.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      edce4654
    • KVM: x86: Emulate triple fault shutdown if RSM emulation fails · 25b17226
      Sean Christopherson authored
      Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate
      shutdown if RSM from SMM fails.
      
      Note, entering shutdown after clearing the SMM flag and restoring NMI
      blocking is architecturally correct with respect to AMD's APM, which KVM
      also uses for SMRAM layout and RSM NMI blocking behavior.  The APM says:
      
        An RSM causes a processor shutdown if an invalid-state condition is
        found in the SMRAM state-save area. Only an external reset, external
        processor-initialization, or non-maskable external interrupt (NMI) can
        cause the processor to leave the shutdown state.
      
      Of note is processor-initialization (INIT) as a valid shutdown wake
      event, as INIT is blocked by SMM, implying that entering shutdown also
      forces the CPU out of SMM.
      
      For recent Intel CPUs, restoring NMI blocking is technically wrong, but
      so is restoring NMI blocking in the first place, and Intel's RSM
      "architecture" is such a mess that just about anything is allowed and can
      be justified as micro-architectural behavior.
      
      Per the SDM:
      
        On Pentium 4 and later processors, shutdown will inhibit INTR and A20M
        but will not change any of the other inhibits. On these processors,
        NMIs will be inhibited if no action is taken in the SMI handler to
        uninhibit them (see Section 34.8).
      
      where Section 34.8 says:
      
        When the processor enters SMM while executing an NMI handler, the
        processor saves the SMRAM state save map but does not save the
        attribute to keep NMI interrupts disabled. Potentially, an NMI could be
        latched (while in SMM or upon exit) and serviced upon exit of SMM even
        though the previous NMI handler has still not completed.
      
      I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not,
      which is in direct contradiction of KVM's behavior.  But, as mentioned
      above, KVM follows AMD architecture and restores NMI blocking on RSM, so
      that micro-architectural detail is already lost.
      
      And for Pentium era CPUs, SMI# can break shutdown, meaning that at least
      some Intel CPUs fully leave SMM when entering shutdown:
      
        In the shutdown state, Intel processors stop executing instructions
        until a RESET#, INIT# or NMI# is asserted.  While Pentium family
        processors recognize the SMI# signal in shutdown state, P6 family and
        Intel486 processors do not.
      
      In other words, the fact that Intel CPUs have implemented the two
      extremes gives KVM carte blanche when it comes to honoring Intel's
      architecture for handling shutdown during RSM.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210609185619.992058-3-seanjc@google.com>
      [Return X86EMUL_CONTINUE after triple fault. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      25b17226