1. 22 Sep 2021, 6 commits
    • kvm: x86: Add AMD PMU MSRs to msrs_to_save_all[] · e1fc1553
      Fares Mehanna authored
      Intel PMU MSRs are in msrs_to_save_all[], so add the AMD PMU MSRs as well
      to get consistent behavior between Intel and AMD when using KVM_GET_MSRS,
      KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST.
      
      We have to add legacy and new MSRs to handle guests running without
      X86_FEATURE_PERFCTR_CORE.
      Signed-off-by: Fares Mehanna <faresx@amazon.de>
      Message-Id: <20210915133951.22389-1-faresx@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
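      For context, a minimal userspace sketch (not part of the patch) that lists the
      saveable MSRs via KVM_GET_MSR_INDEX_LIST and looks for the legacy AMD PMU MSRs;
      the two-call E2BIG probe pattern is the documented way to size the list:

        #include <fcntl.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        int main(void)
        {
            int kvm = open("/dev/kvm", O_RDWR);
            struct kvm_msr_list probe = { .nmsrs = 0 };

            /* First call fails with E2BIG but fills in the required count. */
            ioctl(kvm, KVM_GET_MSR_INDEX_LIST, &probe);

            struct kvm_msr_list *list =
                malloc(sizeof(*list) + probe.nmsrs * sizeof(__u32));
            list->nmsrs = probe.nmsrs;
            if (ioctl(kvm, KVM_GET_MSR_INDEX_LIST, list) < 0)
                return 1;

            for (__u32 i = 0; i < list->nmsrs; i++)
                if (list->indices[i] == 0xc0010000 /* MSR_K7_EVNTSEL0 */ ||
                    list->indices[i] == 0xc0010004 /* MSR_K7_PERFCTR0 */)
                    printf("found AMD PMU MSR 0x%x\n", list->indices[i]);
            return 0;
        }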
    • KVM: x86: reset pdptrs_from_userspace when exiting smm · 37687c40
      Maxim Levitsky authored
      When exiting SMM, the PDPTEs are loaded again from guest memory.
      
      This fixes a theoretical bug: an exit from SMM can trigger entry to the
      nested guest, which reuses some of the migration code that relies on
      this flag as a workaround for legacy userspace.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Identify vCPU0 by its vcpu_idx instead of its vCPUs array entry · 94c245a2
      Sean Christopherson authored
      Use vcpu_idx to identify vCPU0 when updating HyperV's TSC page, which is
      shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU
      that's guaranteed to exist).  Using kvm_get_vcpu() to find vCPU0 works,
      but it's a rather odd and suboptimal way to check the index of a given
      vCPU.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210910183220.2397812-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Handle SRCU initialization failure during page track init · eb7511bf
      Haimin Zhang authored
      Check the return of init_srcu_struct(), which can fail due to OOM, when
      initializing the page track mechanism.  Lack of checking leads to a NULL
      pointer deref found by a modified syzkaller.
      Reported-by: TCS Robot <tcs_robot@tencent.com>
      Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com>
      Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com>
      [Move the call towards the beginning of kvm_arch_init_vm. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
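      The fix amounts to propagating the error; a condensed sketch of the
      resulting init path (close to, but not necessarily verbatim, the patch):

        int kvm_page_track_init(struct kvm *kvm)
        {
            struct kvm_page_track_notifier_head *head =
                &kvm->arch.track_notifier_head;

            INIT_HLIST_HEAD(&head->track_notifier_list);
            /* init_srcu_struct() can fail under OOM; report the error
             * instead of ignoring it and derefing a NULL pointer later. */
            return init_srcu_struct(&head->track_srcu);
        }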
    • KVM: x86: Clear KVM's cached guest CR3 at RESET/INIT · 03a6e840
      Sean Christopherson authored
      Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT.
      Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT.  For
      RESET, this is a nop as the vCPU is zero-allocated.  For INIT, the bug has
      likely escaped notice because no firmware/kernel puts its page-table root
      at PA=0, let alone relies on INIT to get the desired CR3 for such page
      tables.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Mark all registers as avail/dirty at vCPU creation · 7117003f
      Sean Christopherson authored
      Mark all registers as available and dirty at vCPU creation, as the vCPU has
      obviously not been loaded into hardware, let alone been given the chance to
      be modified in hardware.  On SVM, reading from "uninitialized" hardware is
      a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and
      hardware does not allow for arbitrary field encoding schemes.
      
      On VMX, backing memory for VMCSes is also zero allocated, but true
      initialization of the VMCS _technically_ requires VMWRITEs, as the VMX
      architectural specification technically allows CPU implementations to
      encode fields with arbitrary schemes.  E.g. a CPU could theoretically store
      the inverted value of every field, in which case a VMREAD of a
      zero-allocated field would return all ones.
      
      In practice, only the AR_BYTES fields are known to be manipulated by
      hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX)
      does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...).  In
      other words, this is technically a bug fix, but practically speaking it's
      a glorified nop.
      
      Failure to mark registers as available has been a lurking bug for quite
      some time.  The original register caching supported only GPRs (+RIP, which
      is kinda sorta a GPR), with the masks initialized at ->vcpu_reset().  That
      worked because the two cacheable registers, RIP and RSP, are generally
      speaking not read as side effects in other flows.
      
      Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on
      demand") was the first instance of failure to mark regs available.  While
      _just_ marking CR3 available during vCPU creation wouldn't have fixed the
      VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0()
      unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not
      reading GUEST_CR3 when it's available would have avoided VMREAD to a
      technically-uninitialized VMCS.
      
      Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand")
      Fixes: 6de4f3ad ("KVM: Cache pdptrs")
      Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()")
      Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields")
      Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0")
      Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4")
      Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags")
      Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210921000303.400537-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
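      The gist of the change, as a minimal sketch (assuming the ~0 bitmask
      convention of KVM's register cache; not the verbatim patch):

        static void mark_all_regs_cached(struct kvm_vcpu *vcpu)
        {
            /* Every cacheable register is valid in memory... */
            vcpu->arch.regs_avail = ~0;
            /* ...and must be written to hardware before the first VM-Enter,
             * so nothing is ever VMREAD from an uninitialized VMCS field. */
            vcpu->arch.regs_dirty = ~0;
        }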
  2. 06 Sep 2021, 1 commit
  3. 21 Aug 2021, 7 commits
    • KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ · 61e5f69e
      Maxim Levitsky authored
      KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts
      while running.
      
      This change is mostly intended for more robust single stepping
      of the guest and it has the following benefits when enabled:
      
      * Resuming from a breakpoint is much more reliable.
        When resuming execution from a breakpoint, with interrupts enabled,
        more often than not, KVM would inject an interrupt and make the CPU
        jump immediately to the interrupt handler and eventually return to
        the breakpoint, to trigger it again.
      
        From the user's point of view it looks like the CPU never executed a
        single instruction, and in some cases that can even prevent forward
        progress, for example when the breakpoint is placed by an automated
        script (e.g. lx-symbols), which does something in response to the
        breakpoint and then continues the guest automatically.
        If the script execution takes enough time for another interrupt to
        arrive, the guest will be stuck on the same breakpoint RIP forever.
      
      * Normal single stepping is much more predictable, since it won't
        land the debugger into an interrupt handler.
      
      * RFLAGS.TF has less chance to be leaked to the guest:
      
        We set that flag behind the guest's back to do single stepping
        but if single step lands us into an interrupt/exception handler
        it will be leaked to the guest in the form of being pushed
        to the stack.
        This doesn't completely eliminate this problem as exceptions
        can still happen, but at least this reduces the chances
        of this happening.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
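      A hypothetical userspace sketch of how a debugger could use the new flag,
      assuming vcpu_fd is an open vCPU fd and the kernel advertises the flag
      through KVM_CAP_SET_GUEST_DEBUG2:

        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        int single_step_without_irqs(int vcpu_fd)
        {
            struct kvm_guest_debug dbg = {
                .control = KVM_GUESTDBG_ENABLE |
                           KVM_GUESTDBG_SINGLESTEP |
                           KVM_GUESTDBG_BLOCKIRQ,
            };

            /* The next KVM_RUN executes one instruction with IRQ
             * injection held back, per the commit message above. */
            return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
        }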
    • KVM: x86/mmu: Add detailed page size stats · 71f51d2c
      Mingwei Zhang authored
      Existing KVM code tracks the number of large pages regardless of their
      sizes. Therefore, when 1GB (or larger) pages are used, the information
      becomes less useful because lpages counts a mix of 1G and 2M pages.
      
      So remove lpages, since it is easy for userspace to aggregate the info.
      Instead, provide comprehensive page stats for all sizes from 4K to 512G.
      Suggested-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Jing Zhang <jingzhangos@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210803044607.599629-4-mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
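      A sketch of the counting scheme (type and field names assumed for
      illustration, not taken from the patch): one counter per page size,
      indexed by mapping level, so 4K through 512G mappings are tracked
      separately.

        enum { PG_4K, PG_2M, PG_1G, PG_512G, NR_PAGE_SIZES };

        struct mmu_page_stats {
            long long pages[NR_PAGE_SIZES];
        };

        /* level is 1 for 4K ... 4 for 512G; count may be negative on unmap. */
        static void update_page_stats(struct mmu_page_stats *stats, int level,
                                      int count)
        {
            stats->pages[level - 1] += count;
        }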
    • KVM: stats: Support linear and logarithmic histogram statistics · f95937cc
      Jing Zhang authored
      Add new types of KVM stats: linear and logarithmic histograms.
      Histograms are very useful for observing the value distribution
      of time- or size-related stats.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210802165633.1866976-2-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
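      A sketch of bucket selection for the two flavors, under the usual
      conventions (assumed, not quoted from the patch): linear histograms divide
      the value by a fixed bucket size, logarithmic ones bucket by bit width so
      bucket n holds values in [2^(n-1), 2^n); the last bucket absorbs overflow.

        #include <stddef.h>
        #include <stdint.h>

        static size_t linear_bucket(uint64_t val, uint64_t bucket_size,
                                    size_t nbuckets)
        {
            size_t b = val / bucket_size;
            return b < nbuckets ? b : nbuckets - 1;
        }

        static size_t log_bucket(uint64_t val, size_t nbuckets)
        {
            size_t b = val ? 64 - __builtin_clzll(val) : 0;  /* fls64(val) */
            return b < nbuckets ? b : nbuckets - 1;
        }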
    • KVM: SVM: avoid refreshing avic if its state didn't change · 06ef8134
      Maxim Levitsky authored
      Since AVIC can be inhibited and uninhibited rapidly, it is possible that
      we have nothing to do by the time svm_refresh_apicv_exec_ctrl
      is called.
      
      Detect and avoid this, which will be useful when we will start calling
      avic_vcpu_load/avic_vcpu_put when the avic inhibition state changes.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM · b0a1637f
      Maxim Levitsky authored
      Currently on SVM, the kvm_request_apicv_update toggles the APICv
      memslot without doing any synchronization.
      
      If there is a mismatch between that memslot state and the AVIC state on
      one of the vCPUs, an APIC MMIO access can be lost:
      
      For example:
      
      VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
      VCPU1: access an APIC mmio register.
      
      Since AVIC is still disabled on VCPU1, the access will not be intercepted
      by it, and neither will it cause an MMIO fault; rather, it will just be
      read/written from/to the dummy page mapped into the
      APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.
      
      Fix that by adding a lock guarding the AVIC state changes, and carefully
      order the operations of kvm_request_apicv_update to avoid this race:
      
      1. Take the lock
      2. Send KVM_REQ_APICV_UPDATE
      3. Update the apic inhibit reason
      4. Release the lock
      
      This ensures that at (2) all vCPUs are kicked out of guest mode
      but don't yet see the new AVIC state.
      Only after (4) can the other vCPUs update their AVIC state and resume.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
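      A sketch of that ordering (the lock and field names are assumed for
      illustration, not quoted from the patch):

        static void request_apicv_update(struct kvm *kvm, bool activate,
                                         ulong bit)
        {
            mutex_lock(&kvm->arch.apicv_update_lock);             /* (1) */

            /* (2) kick all vCPUs out of guest mode before the state changes */
            kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);

            if (activate)                                         /* (3) */
                clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
            else
                set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

            mutex_unlock(&kvm->arch.apicv_update_lock);           /* (4) */
        }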
    • KVM: x86: don't disable APICv memslot when inhibited · 36222b11
      Maxim Levitsky authored
      Thanks to the previous patches, it is now possible to keep the APICv
      memslot always enabled; it will simply be invisible to the guest
      when APICv is inhibited.
      
      This code is based on a suggestion from Sean Christopherson:
      https://lkml.org/lkml/2021/7/19/2970
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Introduce kvm_mmu_slot_lpages() helpers · 4139b197
      Peter Xu authored
      Introduce kvm_mmu_slot_lpages() to calculate the lpage_info and rmap array
      sizes.  The other variant, __kvm_mmu_slot_lpages(), takes an extra npages
      parameter rather than fetching it from the memslot pointer.  Start using
      the latter in kvm_alloc_memslot_metadata().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210730220455.26054-4-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
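      A sketch of the underlying index math (constant and function names assumed;
      x86-64 uses 9 bits of gfn per paging level):

        #include <stdint.h>

        #define LEVEL_BITS 9  /* 512 entries per page-table level */

        /* Index of gfn at the given level, relative to the slot's base gfn. */
        static uint64_t lpage_index(uint64_t gfn, uint64_t base_gfn, int level)
        {
            return (gfn >> ((level - 1) * LEVEL_BITS)) -
                   (base_gfn >> ((level - 1) * LEVEL_BITS));
        }

        /* Number of large-page slots a memslot spans at the given level. */
        static uint64_t slot_lpages(uint64_t base_gfn, uint64_t npages, int level)
        {
            return lpage_index(base_gfn + npages - 1, base_gfn, level) + 1;
        }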
  4. 13 Aug 2021, 4 commits
    • KVM: x86: Kill off __ex() and __kvm_handle_fault_on_reboot() · ad0577c3
      Sean Christopherson authored
      Remove the __kvm_handle_fault_on_reboot() and __ex() macros now that all
      VMX and SVM instructions use asm goto to handle the fault (or in the
      case of VMREAD, completely custom logic).  Drop kvm_spurious_fault()'s
      asmlinkage annotation as __kvm_handle_fault_on_reboot() was the only
      flow that invoked it from assembly code.
      
      Cc: Uros Bizjak <ubizjak@gmail.com>
      Cc: Like Xu <like.xu.linux@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210809173955.1710866-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Reset DR6 only when KVM_DEBUGREG_WONT_EXIT · 1ccb6f98
      Paolo Bonzini authored
      The commit efdab992 ("KVM: x86: fix escape of guest dr6 to the host")
      fixed a bug by resetting DR6 unconditionally when the vcpu is scheduled out.
      
      But writing to debug registers is slow, and it can be visible in perf results
      sometimes, even if neither the host nor the guest activate breakpoints.
      
      Since KVM_DEBUGREG_WONT_EXIT on Intel processors is the only case
      where DR6 gets the guest value, and it never happens at all on SVM,
      the register can be cleared in vmx.c right after reading it.
      Reported-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Set host DR6 only on VMX and for KVM_DEBUGREG_WONT_EXIT · 375e28ff
      Paolo Bonzini authored
      Commit c77fb5fe ("KVM: x86: Allow the guest to run with dirty debug
      registers") allows the guest to access the DRs without exiting when
      KVM_DEBUGREG_WONT_EXIT is set, and we need to ensure that they are
      synchronized on entry to the guest---including DR6, which was not synced
      before that commit.
      
      But the commit sets the hardware DR6 not only when KVM_DEBUGREG_WONT_EXIT,
      but also when KVM_DEBUGREG_BP_ENABLED.  The second case is unnecessary
      and just adds one more way for stale DR6 to leak to the host, which then
      has to be resolved by unconditionally resetting DR6 in kvm_arch_vcpu_put().
      
      Even if KVM_DEBUGREG_WONT_EXIT, however, setting the host DR6 only matters
      on VMX because SVM always uses the DR6 value from the VMCB.  So move this
      line to vmx.c and make it conditional on KVM_DEBUGREG_WONT_EXIT.
      Reported-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
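      Roughly, the resulting VMX-side logic looks like this (a simplified
      sketch, not the verbatim patch):

        static void vmx_sync_dr6_before_entry(struct kvm_vcpu *vcpu)
        {
            /* Only when DR accesses don't exit does hardware DR6 need the
             * guest's value; SVM takes DR6 from the VMCB instead. */
            if (vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)
                set_debugreg(vcpu->arch.dr6, 6);
        }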
    • KVM: X86: Remove unneeded KVM_DEBUGREG_RELOAD · 34e9f860
      Lai Jiangshan authored
      Commit ae561ede ("KVM: x86: DR0-DR3 are not clear on reset") added code to
      ensure eff_db is updated when the debug registers are modified through
      non-standard paths.
      
      But there is no reason to also update the hardware DRs unless hardware
      breakpoints are active or DR exiting is disabled, and in those cases
      updating hardware is already handled by KVM_DEBUGREG_WONT_EXIT and
      KVM_DEBUGREG_BP_ENABLED.
      
      KVM_DEBUGREG_RELOAD just causes unnecessary loads of the hardware DRs, so
      remove it.
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210809174307.145263-1-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 05 Aug 2021, 1 commit
    • KVM: xen: do not use struct gfn_to_hva_cache · 319afe68
      Paolo Bonzini authored
      gfn_to_hva_cache is not thread-safe, so it is usually used only within
      a vCPU (whose code is protected by vcpu->mutex).  The Xen interface
      implementation has such a cache in kvm->arch, but it is not really
      used except to store the location of the shared info page.  Replace
      shinfo_set and shinfo_cache with just the value that is passed via
      KVM_XEN_ATTR_TYPE_SHARED_INFO; the only complication is that the
      initialization value is not zero anymore and therefore kvm_xen_init_vm
      needs to be introduced.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 03 Aug 2021, 1 commit
  7. 02 Aug 2021, 10 commits
    • KVM: x86: Preserve guest's CR0.CD/NW on INIT · 4c72ab5a
      Sean Christopherson authored
      Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
      defined by both Intel's SDM and AMD's APM.
      
      Note, current versions of Intel's SDM are very poorly written with
      respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
      States Following Power-up, Reset, or INIT" quite clearly lists power-up,
      RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
      then attempts to qualify CD/NW behavior in a footnote:
      
        2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
           are cleared.
      
      Presumably that footnote is only meant for INIT, as the RESET case and
      especially the power-up case are rather nonsensical.  Another footnote
      all but confirms that:
      
        6. Internal caches are invalid after power-up and RESET, but left
           unchanged with an INIT.
      
      Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
      else can hack their BIOS to check RESET and power-up :-D).
      Reported-by: Reiji Watanabe <reijiw@google.com>
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-47-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Emulate #INIT in response to triple fault shutdown · 265e4353
      Sean Christopherson authored
      Emulate a full #INIT instead of simply initializing the VMCB if the
      guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
      much of which is mirrored by the VMCB, results in incoherent and broken
      vCPU state.
      
      Ideally, KVM would not automatically init anything on shutdown, and
      instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
      userspace to explicitly INIT or RESET the vCPU.  Even better would be to
      add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
      (and SMI on Intel CPUs).
      
      But, that ship has sailed, and emulating #INIT is the next best thing as
      that has at least some connection with reality since there exist bare
      metal platforms that automatically INIT the CPU if it hits shutdown.
      
      Fixes: 46fe4ddd ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-45-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86 · f39e805e
      Sean Christopherson authored
      Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
      common x86.  VMX and SVM now have near-identical sequences, the only
      difference being that VMX updates the exception bitmap.  Updating the
      bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
      behind in VMX due to the need to update exception intercepts after the
      control registers are set.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-37-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0 · 908b7d43
      Sean Christopherson authored
      Skip the MMU permission_fault() check if paging is disabled when
      verifying the cached MMIO GVA is usable.  The check is unnecessary and
      can theoretically get a false positive since the MMU doesn't zero out
      "permissions" or "pkru_mask" when guest paging is disabled.
      
      The obvious alternative is to zero out all the bitmasks when configuring
      nonpaging MMUs, but that's unnecessary work and doesn't align with the
      MMU's general approach of doing as little as possible for flows that are
      supposed to be unreachable.
      
      This is nearly a nop as the false positive is nothing more than an
      insignificant performance blip, and more or less limited to string MMIO
      when L1 is running with paging disabled.  KVM doesn't cache MMIO if L2 is
      active with nested TDP since the "GVA" is really an L2 GPA.  If L2 is
      active without nested TDP, then paging can't be disabled as neither VMX
      nor SVM allows entering the guest without paging of some form.
      
      Jumping back to L1 with paging disabled, in that case direct_map is true
      and so KVM will use CR2 as a GPA; the only time it doesn't is if the
      fault from the emulator doesn't match or emulator_can_use_gpa(), and that
      fails only on string MMIO and other instructions with multiple memory
      operands.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-27-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Move EDX initialization at vCPU RESET to common code · 49d8665c
      Sean Christopherson authored
      Move the EDX initialization at vCPU RESET, which is now identical between
      VMX and SVM, into common code.
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Flush the guest's TLB on INIT · df37ed38
      Sean Christopherson authored
      Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
      AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
      that's correct, as the APM also states that the TLB is flushed on "External
      initialization of the processor."  Regardless, relying on the guest to be
      paranoid is unnecessarily risky, while an unnecessary flush is benign
      from a functional perspective and likely has no measurable impact on
      guest performance.
      
      Note, as of the April 2021 version of Intel's SDM, it also contradicts
      itself with respect to TLB flushing.  The overview of INIT explicitly
      calls out the TLBs as being invalidated, while a table later in the same
      section says they are unchanged.
      
        9.1 INITIALIZATION OVERVIEW:
          The major difference is that during an INIT, the internal caches, MSRs,
          MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
          are invalidated as with a hardware reset)
      
        Table 9-1:
      
        Register                    Power up    Reset      INIT
        Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged
      
      Given Core2's erratum[*] about global TLB entries not being flushed on INIT,
      it's safe to assume that the table is simply wrong.
      
        AZ28. INIT Does Not Clear Global Entries in the TLB
        Problem: INIT may not flush a TLB entry when:
          • The processor is in protected mode with paging enabled and the page global enable
            flag is set (PGE bit of CR4 register)
          • G bit for the page table entry is set
          • TLB entry is present in TLB when INIT occurs
          • Software may encounter unexpected page fault or incorrect address translation due
            to a TLB entry erroneously left in TLB after INIT.
      
        Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
                    bits PG or PE) registers before writing to memory early in BIOS
                    code to clear all the global entries from TLB.
      
        Status: For the steppings affected, see the Summary Tables of Changes.
      
      [*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
      
      Fixes: 6aa8b732 ("[PATCH] kvm: userspace interface")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
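      The shape of the fix, as a condensed sketch (KVM_REQ_TLB_FLUSH_GUEST is
      the real request; the surrounding reset code is simplified):

        static void vcpu_reset_tlb(struct kvm_vcpu *vcpu, bool init_event)
        {
            /* Flush the guest TLB on INIT rather than trusting the guest
             * (or the SDM's self-contradictory tables) to get it right. */
            if (init_event)
                kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
        }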
    • KVM: x86: APICv: drop immediate APICv disablement on current vCPU · df63202f
      Maxim Levitsky authored
      The special case of disabling APICv on the current vCPU right away in
      kvm_request_apicv_update doesn't bring much benefit vs. raising
      KVM_REQ_APICV_UPDATE on it instead, since this request will be processed
      on the next entry to the guest.
      (The comment about having another #VMEXIT is wrong.)
      
      It also hides various assumptions that the APICv enable state matches
      the APICv inhibit state, as this special case only makes those states
      match on the current vCPU.
      
      Previous patches fixed a few such assumptions, so now it should be safe
      to drop this special case.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Add per-vm stat for max rmap list size · ec1cf69c
      Peter Xu authored
      Add a new statistic, max_mmu_rmap_size, which stores the maximum size of
      any rmap list in the VM.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210625153214.43106-2-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Hoist kvm_dirty_regs check out of sync_regs() · e489a4a6
      Sean Christopherson authored
      Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
      sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run().  This
      allows a future patch to allow synchronizing select state for protected
      VMs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM · 67369273
      Sean Christopherson authored
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
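      For context, a simplified sketch of what KVM_BUG_ON() does (condensed
      from the real macro; details may differ): warn once and mark the whole
      VM dead, so subsequent ioctls fail fast instead of running on corrupted
      state.

        #define KVM_BUG_ON(cond, kvm)                                   \
        ({                                                              \
            bool __ret = !!(cond);                                      \
                                                                        \
            if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))               \
                kvm_vm_bugged(kvm);  /* kill the VM, kick all vCPUs */  \
            unlikely(__ret);                                            \
        })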
  8. 30 Jul 2021, 1 commit
    • KVM: x86: accept userspace interrupt only if no event is injected · fa7a549d
      Paolo Bonzini authored
      Once an exception has been injected, any side effects related to
      the exception (such as setting CR2 or DR6) have taken place.
      Therefore, once KVM sets the VM-entry interruption-information
      field or the AMD EVENTINJ field, the next VM entry must deliver that
      exception.
      
      Pending interrupts are processed after injected exceptions, so
      in theory it would not be a problem to use KVM_INTERRUPT when
      an injected exception is present.  However, DOSEMU is using
      run->ready_for_interrupt_injection to detect interrupt windows
      and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
      interrupt manually.  For this to work, the interrupt window
      must be delayed until after the completion of the previous event
      injection.
      
      Cc: stable@vger.kernel.org
      Reported-by: Stas Sergeev <stsp2@yandex.ru>
      Tested-by: Stas Sergeev <stsp2@yandex.ru>
      Fixes: 71cc849b ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
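      The resulting readiness check is roughly the following (a simplified
      sketch; the helper names are from the surrounding KVM code, but the exact
      composition may differ from the patch):

        static bool ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
        {
            /* The window reported to userspace opens only once nothing is
             * already queued for injection on the next VM entry. */
            return kvm_arch_interrupt_allowed(vcpu) &&
                   !kvm_event_needs_reinjection(vcpu) &&
                   !vcpu->arch.exception.pending;
        }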
  9. 26 Jul 2021, 1 commit
  10. 15 Jul 2021, 2 commits
    • KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() · f85d4016
      Lai Jiangshan authored
      When the host is using debug registers but the guest is not using them,
      nor is the guest in guest-debug state, the KVM code does not reset
      the host debug registers before kvm_x86->run().  Rather, it relies on
      the hardware vmentry to automatically reset DR7, which ensures that
      the host breakpoints do not affect the guest.
      
      This however violates the non-instrumentable nature around VM entry
      and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
      
      Another issue is consistency.  When the guest debug registers are active,
      the host breakpoints are reset before kvm_x86->run().  But when the
      guest debug registers are inactive, disabling the host breakpoints is
      delayed.  Host tracing tools may therefore see different results depending
      on what the guest is doing.
      
      To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
      if the host has set any breakpoints, whether or not the guest is using
      them.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      [Only clear %db7 instead of reloading all debug registers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
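      The shape of the fix, as a simplified sketch (hw_breakpoint_active() and
      set_debugreg() are the real kernel helpers; the surrounding flow is
      condensed):

        static void clear_host_breakpoints_before_run(void)
        {
            /* Zap DR7 whenever the host has hardware breakpoints armed,
             * regardless of the guest's own debug-register state. */
            if (hw_breakpoint_active())
                set_debugreg(0, 7);
        }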
    • Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled" · f0414b07
      Sean Christopherson authored
      Let KVM load if EFER.NX=0 even if NX is supported; the analysis and
      testing (or lack thereof) for the non-PAE host case was garbage.
      
      If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
      skips over the entire EFER sequence.  Hopefully that can be changed in
      the future to allow KVM to require EFER.NX, but the motivation behind
      KVM's requirement isn't yet merged.  Reverting and revisiting the mess
      at a later date is by far the safest approach.
      
      This reverts commit 8bbed95d.
      
      Fixes: 8bbed95d ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210625001853.318148-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  11. 25 Jun 2021, 6 commits
    • kvm: x86: Allow userspace to handle emulation errors · 19238e75
      Aaron Lewis authored
      Add a fallback mechanism to the in-kernel instruction emulator that
      gives userspace the opportunity to process an instruction the emulator
      was unable to.  When the in-kernel instruction emulator fails to process
      an instruction, it will either inject a #UD into the guest or exit to
      userspace with exit reason KVM_INTERNAL_ERROR, because it does not
      know how to proceed in an appropriate manner.  This feature lets
      userspace get involved to see if it can figure out a better path
      forward.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
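      A hypothetical userspace sketch of opting in and inspecting the failed
      instruction, assuming the KVM_CAP_EXIT_ON_EMULATION_FAILURE capability
      and the kvm_run emulation_failure fields introduced by this series:

        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <linux/kvm.h>

        void enable_emulation_fallback(int vm_fd)
        {
            struct kvm_enable_cap cap = {
                .cap = KVM_CAP_EXIT_ON_EMULATION_FAILURE,
                .args = { 1 },
            };
            ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
        }

        void handle_exit(struct kvm_run *run)
        {
            if (run->exit_reason != KVM_EXIT_INTERNAL_ERROR ||
                run->emulation_failure.suberror != KVM_INTERNAL_ERROR_EMULATION)
                return;

            /* Instruction bytes are only valid when the flag says so. */
            if (run->emulation_failure.flags &
                KVM_INTERNAL_ERROR_EMULATION_FLAG_INSTRUCTION_BYTES)
                for (int i = 0; i < run->emulation_failure.insn_size; i++)
                    printf("%02x ", run->emulation_failure.insn_bytes[i]);
        }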
    • KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper · 20f632bd
      Sean Christopherson authored
      Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
      a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
      as the CR0/CR4 update masks must exactly match the mmu_role bits, with
      one exception (see below).  The "full" CR0/CR4 will be used by future
      commits to initialize the MMU and its role, as opposed to the current
      approach of pulling everything from vCPU, which is incorrect for certain
      flows, e.g. nested NPT.
      
      CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
      but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
      needs to be in the mmu_role, but technically doesn't need to be checked
      by kvm_post_set_cr4().  However, the extra check is completely benign as
      the hardware restrictions simply mean LA57 will never be _the_ cause of
      a MMU reset during MOV CR4.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER · dbc4739b
      Sean Christopherson authored
      When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
      as a u64 in various flows (mostly MMU).  Passing the params as u32s is
      functionally ok since all of the affected registers reserve bits 63:32 to
      zero (enforced by KVM), but it's technically wrong.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Sean Christopherson authored
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
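      A sketch of the detection (simplified): last_vmentry_cpu starts at -1 and
      records the pCPU of the first VM-entry, so a non-negative value means the
      vCPU has run and a later CPUID change deserves a warning.

        static void warn_if_cpuid_set_after_run(struct kvm_vcpu *vcpu)
        {
            if (vcpu->arch.last_vmentry_cpu != -1)
                pr_warn_ratelimited("KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
        }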
    • KVM: x86: Properly reset MMU context at vCPU RESET/INIT · 0aa18375
      Sean Christopherson authored
      Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
      was set prior to INIT.  Simply re-initializing the current MMU is not
      sufficient as the current root HPA may not be usable in the new context.
      E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
      KVM will fail to switch to the 32-bit pae_root and bomb on the next
      VM-Enter due to running with a 64-bit CR3 in 32-bit mode.
      
      This bug was papered over in both VMX and SVM, but still managed to rear
      its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
      kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
      checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
      an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
      generate the wrong MMU role.
      
      In VMX, the INIT issue is specific to running without unrestricted guest
      since unrestricted guest is available if and only if EPT is enabled.
      Commit 8668a3c4 ("KVM: VMX: Reset mmu context when entering real
      mode") resolved the issue by forcing a reset when entering emulated real
      mode.
      
      In SVM, commit ebae871a ("kvm: svm: reset mmu on VCPU reset") forced
      a MMU reset on every INIT to workaround the flaw in common x86.  Note, at
      the time the bug was fixed, the SVM problem was exacerbated by a complete
      lack of a CR4 update.
      
      The vendor resets will be reverted in future patches, primarily to aid
      bisection in case there are non-INIT flows that rely on the existing VMX
      logic.
      
      Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
      all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
      CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
      context reset.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: debugfs: Reuse binary stats descriptors · bc9e9e67
      Jing Zhang authored
      To remove code duplication, use the binary stats descriptors in the
      implementation of the debugfs interface for statistics. This unifies
      the definition of statistics for the binary and debugfs interfaces.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>