1. 16 Aug 2021, 1 commit
  2. 13 Aug 2021, 7 commits
    • KVM: x86/mmu: Protect marking SPs unsync when using TDP MMU with spinlock · ce25681d
      Sean Christopherson authored
      Add yet another spinlock for the TDP MMU and take it when marking indirect
      shadow pages unsync.  When using the TDP MMU and L1 is running L2(s) with
      nested TDP, KVM may encounter shadow pages for the TDP entries managed by
      L1 (controlling L2) when handling a TDP MMU page fault.  The unsync logic
      is not thread safe, e.g. the kvm_mmu_page fields are not atomic, and
      misbehaves when a shadow page is marked unsync via a TDP MMU page fault,
      which runs with mmu_lock held for read, not write.
      
      Lack of a critical section manifests most visibly as an underflow of
      unsync_children in clear_unsync_child_bit() due to unsync_children being
      corrupted when multiple CPUs write it without a critical section and
      without atomic operations.  But underflow is the best case scenario.  The
      worst case scenario is that unsync_children prematurely hits '0' and
      leads to guest memory corruption due to KVM neglecting to properly sync
      shadow pages.
      
      Use an entirely new spinlock even though piggybacking tdp_mmu_pages_lock
      would functionally be ok.  Usurping the lock could degrade performance when
      building upper level page tables on different vCPUs, especially since the
      unsync flow could hold the lock for a comparatively long time depending on
      the number of indirect shadow pages and the depth of the paging tree.
      
      For simplicity, take the lock for all MMUs, even though KVM could fairly
      easily know that mmu_lock is held for write.  If mmu_lock is held for
      write, there cannot be contention for the inner spinlock, and marking
      shadow pages unsync across multiple vCPUs will be slow enough that
      bouncing the kvm_arch cacheline should be in the noise.
      
      Note, even though L2 could theoretically be given access to its own EPT
      entries, a nested MMU must hold mmu_lock for write and thus cannot race
      against a TDP MMU page fault.  I.e. the additional spinlock only _needs_ to
      be taken by the TDP MMU, as opposed to being taken by any MMU for a VM
      that is running with the TDP MMU enabled.  Holding mmu_lock for read also
      prevents the indirect shadow page from being freed.  But as above, keep
      it simple and always take the lock.
      
      Alternative #1, the TDP MMU could simply pass "false" for can_unsync and
      effectively disable unsync behavior for nested TDP.  Write protecting leaf
      shadow pages is unlikely to noticeably impact traditional L1 VMMs, as such
      VMMs typically don't modify TDP entries, but the same may not hold true for
      non-standard use cases and/or VMMs that are migrating physical pages (from
      L1's perspective).
      
      Alternative #2, the unsync logic could be made thread safe.  In theory,
      simply converting all relevant kvm_mmu_page fields to atomics and using
      atomic bitops for the bitmap would suffice.  However, (a) an in-depth audit
      would be required, (b) the code churn would be substantial, and (c) legacy
      shadow paging would incur additional atomic operations in performance
      sensitive paths for no benefit (to legacy shadow paging).
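
      A minimal sketch of the chosen approach (identifiers are illustrative and
      may differ slightly from the final patch): the loop that marks a gfn's
      indirect shadow pages unsync takes a dedicated spinlock that nests inside
      mmu_lock, so a TDP MMU fault holding mmu_lock for read cannot race
      another vCPU updating the non-atomic unsync fields.

        bool locked = false;

        for_each_gfn_indirect_valid_sp(vcpu->kvm, sp, gfn) {
                /*
                 * TDP MMU page faults hold mmu_lock only for read, so take a
                 * dedicated spinlock (name is illustrative) before touching
                 * unsync_children and friends.
                 */
                if (!locked) {
                        locked = true;
                        spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
                }
                kvm_unsync_page(vcpu, sp);
        }

        if (locked)
                spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);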
      
      Fixes: a2855afc ("KVM: x86/mmu: Allow parallel page faults for the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181815.3378104-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce25681d
    • KVM: x86/mmu: Don't step down in the TDP iterator when zapping all SPTEs · 0103098f
      Sean Christopherson authored
      Set the min_level for the TDP iterator at the root level when zapping all
      SPTEs to optimize the iterator's try_step_down().  Zapping a non-leaf
      SPTE will recursively zap all its children, thus there is no need for the
      iterator to attempt to step down.  This avoids rereading the top-level
      SPTEs after they are zapped by causing try_step_down() to short-circuit.
      
      In most cases, optimizing try_step_down() will be in the noise as the cost
      of zapping SPTEs completely dominates the overall time.  The optimization
      is however helpful if the zap occurs with relatively few SPTEs, e.g. if KVM
      is zapping in response to multiple memslot updates when userspace is adding
      and removing read-only memslots for option ROMs.  In that case, the task
      doing the zapping likely isn't a vCPU thread, but it still holds mmu_lock
      for read and thus can be a noisy neighbor of sorts.
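
      Roughly, the change pins the iterator's minimum level to the root level
      for the zap-all case (sketch, not the verbatim diff):

        /*
         * Zapping a non-leaf SPTE recursively zaps all of its children, so
         * there is no need to step down below the root when zapping
         * everything; try_step_down() then short-circuits immediately.
         */
        int min_level = zap_all ? root->role.level : PG_LEVEL_4K;
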
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0103098f
    • KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs · 524a1e4e
      Sean Christopherson authored
      Pass "all ones" as the end GFN to signal "zap all" for the TDP MMU and
      really zap all SPTEs in this case.  As is, zap_gfn_range() skips non-leaf
      SPTEs whose range exceeds the range to be zapped.  If shadow_phys_bits is
      not aligned to the range size of top-level SPTEs, e.g. 512gb with 4-level
      paging, the "zap all" flows will skip top-level SPTEs whose range extends
      beyond shadow_phys_bits and leak their SPs when the VM is destroyed.
      
      Use the current upper bound (based on host.MAXPHYADDR) to detect that the
      caller wants to zap all SPTEs, e.g. instead of using the max theoretical
      gfn, 1 << (52 - 12).  The more precise upper bound allows the TDP iterator
      to terminate its walk earlier when running on hosts with MAXPHYADDR < 52.
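
      The shape of the change, roughly (identifiers may not match the final
      patch exactly):

        /* Host-reachable GFN limit, derived from shadow_phys_bits. */
        gfn_t max_gfn_host = 1ULL << (shadow_phys_bits - PAGE_SHIFT);
        bool zap_all = (start == 0 && end >= max_gfn_host);

        /* Clamp the walk; non-leaf SPTEs are no longer skipped when zap_all. */
        end = min(end, max_gfn_host);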
      
      Add a WARN on kvm->arch.tdp_mmu_pages when the TDP MMU is destroyed to
      help future debuggers should KVM decide to leak SPTEs again.
      
      The bug is most easily reproduced by running (and unloading!) KVM in a
      VM whose host.MAXPHYADDR < 39, as the SPTE for gfn=0 will be skipped.
      
        =============================================================================
        BUG kvm_mmu_page_header (Not tainted): Objects remaining in kvm_mmu_page_header on __kmem_cache_shutdown()
        -----------------------------------------------------------------------------
        Slab 0x000000004d8f7af1 objects=22 used=2 fp=0x00000000624d29ac flags=0x4000000000000200(slab|zone=1)
        CPU: 0 PID: 1582 Comm: rmmod Not tainted 5.14.0-rc2+ #420
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack_lvl+0x45/0x59
         slab_err+0x95/0xc9
         __kmem_cache_shutdown.cold+0x3c/0x158
         kmem_cache_destroy+0x3d/0xf0
         kvm_mmu_module_exit+0xa/0x30 [kvm]
         kvm_arch_exit+0x5d/0x90 [kvm]
         kvm_exit+0x78/0x90 [kvm]
         vmx_exit+0x1a/0x50 [kvm_intel]
         __x64_sys_delete_module+0x13f/0x220
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812181414.3376143-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      524a1e4e
    • KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF · 18712c13
      Sean Christopherson authored
      Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
      in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
      to account for the case where #PF is intercepted to handle
      guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
      L1.  At best, L1 will complain and inject the #PF back into L2.  At
      worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
      page faults.
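
      Conceptually, the EXCEPTION_NMI case of the "does L0 want this exit?"
      check becomes (sketch):

        else if (is_page_fault(intr_info))
                /*
                 * L0 wants the #PF if async PF is in flight or if KVM is
                 * intercepting #PF, e.g. to handle guest.MAXPHYADDR <
                 * host.MAXPHYADDR.
                 */
                return vcpu->arch.apf.host_apf_flags ||
                       vmx_need_pf_intercept(vcpu);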
      
      Note, while the bug was technically introduced by the commit that added
      support for the MAXPHYADDR madness, the shame is all on commit
      a0c13434 ("KVM: VMX: introduce vmx_need_pf_intercept").
      
      Fixes: 1dbf5d68 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
      Cc: stable@vger.kernel.org
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812045615.3167686-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      18712c13
    • kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault · 85aa8889
      Junaid Shahid authored
      When a nested EPT violation/misconfig is injected into the guest,
      the shadow EPT PTEs associated with that address need to be synced.
      This is done by kvm_inject_emulated_page_fault() before it calls
      nested_ept_inject_page_fault(). However, that will only sync the
      shadow EPT PTE associated with the current L1 EPTP. Since the ASID
      is based on the EP4TA rather than the full EPTP, syncing only the
      current EPTP is not enough: the SPTEs associated with any other L1
      EPTPs in the prev_roots cache that share the same EP4TA also need
      to be synced.
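
      A hedged sketch of the shape of the fix (the helper name is illustrative,
      not necessarily the upstream one): walk the cached previous roots and
      sync/invalidate every root whose EP4TA matches the faulting EPTP.

        static void nested_ept_sync_matching_roots(struct kvm_vcpu *vcpu,
                                                   u64 eptp, gpa_t addr)
        {
                struct kvm_mmu_root_info *root;
                int i;

                for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++) {
                        root = &vcpu->arch.mmu->prev_roots[i];

                        /*
                         * Bits 11:0 of the EPTP are attributes; compare only
                         * the EP4TA, i.e. the address of the EPT PML4 table.
                         */
                        if (VALID_PAGE(root->hpa) &&
                            (root->pgd & PAGE_MASK) == (eptp & PAGE_MASK))
                                vcpu->arch.mmu->invlpg(vcpu, addr, root->hpa);
                }
        }
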
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Message-Id: <20210806222229.1645356-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85aa8889
    • KVM: x86: remove dead initialization · ffbe17ca
      Paolo Bonzini authored
      hv_vcpu is initialized again a dozen lines below, and at this
      point vcpu->arch.hyperv is not valid.  Remove the initializer.
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ffbe17ca
    • KVM: x86: Allow guest to set EFER.NX=1 on non-PAE 32-bit kernels · 1383279c
      Sean Christopherson authored
      Remove an ancient restriction that disallowed exposing EFER.NX to the
      guest if EFER.NX=0 on the host, even if NX is fully supported by the CPU.
      The motivation of the check, added by commit 2cc51560 ("KVM: VMX:
      Avoid saving and restoring msr_efer on lightweight vmexit"), was to rule
      out the case of host.EFER.NX=0 and guest.EFER.NX=1 so that KVM could run
      the guest with the host's EFER.NX and thus avoid context switching EFER
      if the only divergence was the NX bit.
      
      Fast forward to today, and KVM has long since stopped running the guest
      with the host's EFER.NX.  Not only does KVM context switch EFER if
      host.EFER.NX=1 && guest.EFER.NX=0, KVM also forces host.EFER.NX=0 &&
      guest.EFER.NX=1 when using shadow paging (to emulate SMEP).  Furthermore,
      the entire motivation for the restriction was made obsolete over a decade
      ago when Intel added dedicated host and guest EFER fields in the VMCS
      (Nehalem timeframe), which reduced the overhead of context switching EFER
      from 400+ cycles (2 * WRMSR + 1 * RDMSR) to a mere ~2 cycles.
      
      In practice, the removed restriction only affects non-PAE 32-bit kernels,
      as EFER.NX is set during boot if NX is supported and the kernel will use
      PAE paging (32-bit or 64-bit), regardless of whether or not the kernel
      will actually use NX itself (mark PTEs non-executable).
      
      Alternatively and/or complementarily, startup_32_smp() in head_32.S could
      be modified to set EFER.NX=1 regardless of paging mode, thus eliminating
      the scenario where NX is supported but not enabled.  However, that runs
      the risk of breaking non-KVM non-PAE kernels (though the risk is very,
      very low as there are no known EFER.NX errata), and also eliminates an
      easy-to-use mechanism for stressing KVM's handling of guest vs. host EFER
      across nested virtualization transitions.
      Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210805183804.1221554-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1383279c
  3. 11 Aug 2021, 1 commit
    • KVM: VMX: Use current VMCS to query WAITPKG support for MSR emulation · 7b9cae02
      Sean Christopherson authored
      Use the secondary_exec_controls_get() accessor in vmx_has_waitpkg() to
      effectively get the controls for the current VMCS, as opposed to using
      vmx->secondary_exec_controls, which is the cached value of KVM's desired
      controls for vmcs01 and truly not reflective of any particular VMCS.
      
      While the waitpkg control is not dynamic, i.e. vmcs01 will always hold
      the same waitpkg configuration as vmx->secondary_exec_controls, the same
      does not hold true for vmcs02 if the L1 VMM hides the feature from L2.
      If L1 hides the feature _and_ does not intercept MSR_IA32_UMWAIT_CONTROL,
      L2 could incorrectly read/write L1's virtual MSR instead of taking a #GP.
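
      The resulting helper, roughly:

        static inline bool vmx_has_waitpkg(struct vcpu_vmx *vmx)
        {
                /*
                 * Query the currently loaded VMCS (vmcs01 or vmcs02) rather
                 * than the cached vmcs01 value.
                 */
                return secondary_exec_controls_get(vmx) &
                       SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE;
        }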
      
      Fixes: 6e3ba4ab ("KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7b9cae02
  4. 05 Aug 2021, 1 commit
    • KVM: x86/mmu: Fix per-cpu counter corruption on 32-bit builds · d5aaad6f
      Sean Christopherson authored
      Take a signed 'long' instead of an 'unsigned long' for the number of
      pages to add/subtract to the total number of pages used by the MMU.  This
      fixes a zero-extension bug on 32-bit kernels that effectively corrupts
      the per-cpu counter used by the shrinker.
      
      Per-cpu counters take a signed 64-bit value on both 32-bit and 64-bit
      kernels, whereas kvm_mod_used_mmu_pages() takes an unsigned long and thus
      an unsigned 32-bit value on 32-bit kernels.  As a result, the value used
      to adjust the per-cpu counter is zero-extended (unsigned -> signed), not
      sign-extended (signed -> signed), and so KVM's intended -1 gets morphed to
      4294967295 and effectively corrupts the counter.
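
      The conversion pitfall in miniature (a self-contained illustration, not
      KVM code): on a 32-bit kernel 'unsigned long' is 32 bits, so -1 becomes
      0xffffffff before being widened to the 64-bit per-cpu counter type.

        unsigned long nr_u = -1UL;  /* 0xffffffff on a 32-bit kernel */
        s64 bad  = nr_u;            /* zero-extended: 4294967295 */

        long nr_s = -1L;            /* still 32 bits wide ... */
        s64 good = nr_s;            /* ... but sign-extended: -1 */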
      
      This was found by a staggering amount of sheer dumb luck when running
      kvm-unit-tests on a 32-bit KVM build.  The shrinker just happened to kick
      in while running tests and do_shrink_slab() logged an error about trying
      to free a negative number of objects.  The truly lucky part is that the
      kernel just happened to be a slightly stale build, as the shrinker no
      longer yells about negative objects as of commit 18bb473e ("mm:
      vmscan: shrink deferred objects proportional to priority").
      
       vmscan: shrink_slab: mmu_shrink_scan+0x0/0x210 [kvm] negative objects to delete nr=-858993460
      
      Fixes: bc8a3d89 ("kvm: mmu: Fix overflow on kvm mmu page limit calculation")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210804214609.1096003-1-seanjc@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d5aaad6f
  5. 04 Aug 2021, 2 commits
    • KVM: SVM: improve the code readability for ASID management · bb2baeb2
      Mingwei Zhang authored
      The KVM SEV code uses bitmaps to manage ASID states. ASID 0 was always
      skipped because it is never used by a VM. Thus, in the existing code, an
      ASID value and its bitmap position always have an "offset-by-1"
      relationship.
      
      SEV and SEV-ES share the ASID space, so KVM uses a dynamic range
      [min_asid, max_asid] to handle SEV and SEV-ES ASIDs separately.
      
      The existing code mixes ASID values and bitmap positions by using the
      same variable, 'min_asid', for both.
      
      Fix the min_asid usage so that it is consistent with its name, and
      allocate extra space for ASID 0 so that each ASID value matches its
      bitmap position. Add comments on the ASID bitmap allocation to clarify
      the size change.
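
      Sketch of the resulting allocation (variable names are illustrative and
      follow the existing SEV code):

        /*
         * Allocate a bit for every ASID value, including the never-used
         * ASID 0, so that bitmap position == ASID.
         */
        nr_asids = max_sev_asid + 1;
        sev_asid_bitmap = bitmap_zalloc(nr_asids, GFP_KERNEL);
        sev_reclaim_asid_bitmap = bitmap_zalloc(nr_asids, GFP_KERNEL);

        /* SEV-ES ASIDs precede min_sev_asid; ASID 0 is simply never allocated. */
        min_asid = sev->es_active ? 1 : min_sev_asid;
        max_asid = sev->es_active ? min_sev_asid - 1 : max_sev_asid;
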
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Alper Gun <alpergun@google.com>
      Cc: Dionna Glaze <dionnaglaze@google.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vipin Sharma <vipinsh@google.com>
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Message-Id: <20210802180903.159381-1-mizhang@google.com>
      [Fix up sev_asid_free to also index by ASID, as suggested by Sean
       Christopherson, and use nr_asids in sev_cpu_init. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb2baeb2
    • KVM: SVM: Fix off-by-one indexing when nullifying last used SEV VMCB · 179c6c27
      Sean Christopherson authored
      Use the raw ASID, not ASID-1, when nullifying the last used VMCB when
      freeing an SEV ASID.  The consumer, pre_sev_run(), indexes the array by
      the raw ASID, thus KVM could get a false negative when checking for a
      different VMCB if KVM manages to reallocate the same ASID+VMCB combo for
      a new VM.
      
      Note, this cannot cause a functional issue _in the current code_, as
      pre_sev_run() also checks which pCPU last did VMRUN for the vCPU, and
      last_vmentry_cpu is initialized to -1 during vCPU creation, i.e. is
      guaranteed to mismatch on the first VMRUN.  However, prior to commit
      8a14fe4f ("kvm: x86: Move last_cpu into kvm_vcpu_arch as
      last_vmentry_cpu"), SVM tracked pCPU on its own and zero-initialized the
      last_cpu variable.  Thus it's theoretically possible that older versions
      of KVM could miss a TLB flush if the first VMRUN is on pCPU0 and the ASID
      and VMCB exactly match those of a prior VM.
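
      The gist of the fix in sev_asid_free(), sketched (the per-CPU structure
      access is shown as it looked around this time):

        for_each_possible_cpu(cpu) {
                sd = per_cpu(svm_data, cpu);
                /* Index by the raw ASID, matching pre_sev_run(). */
                sd->sev_vmcbs[sev->asid] = NULL;
        }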
      
      Fixes: 70cd94e6 ("KVM: SVM: VMRUN should use associated ASID when SEV is enabled")
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      179c6c27
  6. 03 Aug 2021, 3 commits
  7. 30 Jul 2021, 1 commit
    • KVM: x86: accept userspace interrupt only if no event is injected · fa7a549d
      Paolo Bonzini authored
      Once an exception has been injected, any side effects related to
      the exception (such as setting CR2 or DR6) have already taken place.
      Therefore, once KVM sets the VM-entry interruption information
      field or the AMD EVENTINJ field, the next VM-entry must deliver that
      exception.
      
      Pending interrupts are processed after injected exceptions, so
      in theory it would not be a problem to use KVM_INTERRUPT when
      an injected exception is present.  However, DOSEMU is using
      run->ready_for_interrupt_injection to detect interrupt windows
      and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
      interrupt manually.  For this to work, the interrupt window
      must be delayed until the previous event injection has
      completed.
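
      In other words, the interrupt window reported to userspace should only
      open once nothing is already queued for injection.  A hedged sketch of
      the condition (not the verbatim upstream diff):

        static bool ready_for_interrupt_window(struct kvm_vcpu *vcpu)
        {
                /*
                 * Require an instruction boundary with no event half-injected
                 * before letting userspace inject an interrupt manually.
                 */
                return kvm_arch_interrupt_allowed(vcpu) &&
                       !kvm_event_needs_reinjection(vcpu) &&
                       !vcpu->arch.exception.pending;
        }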
      
      Cc: stable@vger.kernel.org
      Reported-by: Stas Sergeev <stsp2@yandex.ru>
      Tested-by: Stas Sergeev <stsp2@yandex.ru>
      Fixes: 71cc849b ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa7a549d
  8. 28 Jul 2021, 5 commits
  9. 26 Jul 2021, 3 commits
  10. 15 Jul 2021, 16 commits
    • KVM: nSVM: Restore nested control upon leaving SMM · bb00bd9c
      Vitaly Kuznetsov authored
      If the VM was migrated while in SMM, no nested state was saved/restored,
      and therefore svm_leave_smm has to load both the save and control areas
      of vmcb12. The save area is already loaded from the HSAVE area,
      so now load the control area from vmcb12 as well.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210628104425.391276-6-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bb00bd9c
    • KVM: nSVM: Fix L1 state corruption upon return from SMM · 37be407b
      Vitaly Kuznetsov authored
      VMCB split commit 4995a368 ("KVM: SVM: Use a separate vmcb for the
      nested L2 guest") broke return from SMM when SMM was entered from guest
      (L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem
      manifests itself like this:
      
        kvm_exit:             reason EXIT_RSM rip 0x7ffbb280 info 0 0
        kvm_emulate_insn:     0:7ffbb280: 0f aa
        kvm_smm_transition:   vcpu 0: leaving SMM, smbase 0x7ffb3000
        kvm_nested_vmrun:     rip: 0x000000007ffbb280 vmcb: 0x0000000008224000
          nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000
          npt: on
        kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002
          intercepts: fd44bfeb 0000217f 00000000
        kvm_entry:            vcpu 0, rip 0xffffffffffbbe119
        kvm_exit:             reason EXIT_NPF rip 0xffffffffffbbe119 info
          200000006 1ab000
        kvm_nested_vmexit:    vcpu 0 reason npf rip 0xffffffffffbbe119 info1
          0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000
          error_code 0x00000000
        kvm_page_fault:       address 1ab000 error_code 6
        kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000
          int_info 0 int_info_err 0
        kvm_entry:            vcpu 0, rip 0x7ffbb280
        kvm_exit:             reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0
        kvm_emulate_insn:     0:7ffbb280: 0f aa
        kvm_inj_exception:    #GP (0x0)
      
      Note: the return to L2 succeeded, but upon the first exit to L1 its RIP
      points to the 'RSM' instruction even though we are no longer in SMM.
      
      The problem appears to be that VMCB01 gets irreversibly destroyed during
      SMM execution. Previously, we used to have 'hsave' VMCB where regular
      (pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just
      switch to VMCB01 from VMCB02.
      
      Pre-split (working) flow looked like:
      - SMM is triggered during L2's execution
      - L2's state is pushed to SMRAM
      - nested_svm_vmexit() restores L1's state from 'hsave'
      - SMM -> RSM
      - enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have
        pre-SMM (and pre L2 VMRUN) L1's state there
      - L2's state is restored from SMRAM
      - upon first exit L1's state is restored from 'hsave'.
      
      This was always broken with regard to svm_get_nested_state()/
      svm_set_nested_state(): 'hsave' was never a part of what is being
      saved and restored, so migration happening during SMM triggered from L2
      would never restore L1's state correctly.
      
      Post-split flow (broken) looks like:
      - SMM is triggered during L2's execution
      - L2's state is pushed to SMRAM
      - nested_svm_vmexit() switches to VMCB01 from VMCB02
      - SMM -> RSM
      - enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01
        is already lost.
      - L2's state is restored from SMRAM
      - upon first exit L1's state is restored from VMCB01 but it is corrupted
       (reflects the state during 'RSM' execution).
      
      VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest
      and host state so when we switch back to VMCS02 L1's state is intact there.
      
      To resolve the issue we need to save L1's state somewhere. We could have
      created a third VMCB for SMM, but that would require us to modify the
      saved state format. L1's architectural HSAVE area (pointed to by
      MSR_VM_HSAVE_PA) seems appropriate: L0 is free to save any (or none) of
      L1's state there. Currently, KVM saves none.
      
      Note, for nested state migration to succeed, both source and destination
      hypervisors must have the fix. We, however, don't need to create a new
      flag indicating the fact that HSAVE area is now populated as migration
      during SMM triggered from L2 was always broken.
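
      A hedged sketch of the SMM-entry side of the fix (argument order and
      exact helper names may differ from the final patch):

        struct kvm_host_map map_save;

        if (is_guest_mode(vcpu)) {
                /* ... leave guest mode, as before ... */

                /*
                 * Stash pre-SMM L1 state in L1's architectural HSAVE area so
                 * svm_leave_smm() can restore it after RSM.  The save area
                 * lives at offset 0x400 within the VMCB-sized HSAVE page.
                 */
                if (kvm_vcpu_map(vcpu, gpa_to_gfn(svm->nested.hsave_msr),
                                 &map_save))
                        return 1;

                svm_copy_vmrun_state(map_save.hva + 0x400,
                                     &svm->vmcb01.ptr->save);

                kvm_vcpu_unmap(vcpu, &map_save, true);
        }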
      
      Fixes: 4995a368 ("KVM: SVM: Use a separate vmcb for the nested L2 guest")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      37be407b
    • KVM: nSVM: Introduce svm_copy_vmrun_state() · 0a758290
      Vitaly Kuznetsov authored
      Separate the code setting non-VMLOAD-VMSAVE state from
      svm_set_nested_state() into its own function. This is going to be
      re-used from svm_enter_smm()/svm_leave_smm().
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210628104425.391276-4-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0a758290
    • KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN · fb79f566
      Vitaly Kuznetsov authored
      APM states that "The address written to the VM_HSAVE_PA MSR, which holds
      the address of the page used to save the host state on a VMRUN, must point
      to a hypervisor-owned page. If this check fails, the WRMSR will fail with
      a #GP(0) exception. Note that a value of 0 is not considered valid for the
      VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will
      fail with a #GP(0) exception."
      
      svm_set_msr() already checks that the supplied address is valid, so only
      the check for '0' is missing. Add it to nested_svm_vmrun().
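
      Sketched addition to nested_svm_vmrun() (per the APM quote above):

        /* VMRUN must fail with #GP(0) if VM_HSAVE_PA has never been set. */
        if (!svm->nested.hsave_msr) {
                kvm_inject_gp(vcpu, 0);
                return 1;
        }
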
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210628104425.391276-3-vkuznets@redhat.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fb79f566
    • KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA · fce7e152
      Vitaly Kuznetsov authored
      APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when
      the supplied address is not page-aligned or is outside of "maximum
      supported physical address for this implementation".
      The page_address_valid() check seems suitable. Also, forcefully
      page-align the address when it is written by the VMM.
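
      Roughly, in svm_set_msr() (sketch, not the verbatim diff):

        case MSR_VM_HSAVE_PA:
                /*
                 * Reject non-page-aligned or out-of-range values coming from
                 * the guest; values provided by the host (e.g. on migration)
                 * are only forced to page alignment.
                 */
                if (!msr->host_initiated && !page_address_valid(vcpu, data))
                        return 1;

                svm->nested.hsave_msr = data & PAGE_MASK;
                break;
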
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20210628104425.391276-2-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      [Add comment about behavior for host-provided values. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fce7e152
    • KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities · c7a1b2b6
      Sean Christopherson authored
      Use IS_ERR() instead of checking for a NULL pointer when querying for
      sev_pin_memory() failures.  sev_pin_memory() always returns an error code
      cast to a pointer, or a valid pointer; it never returns NULL.
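
      The corrected pattern (sketch; argument list abbreviated):

        guest_page = sev_pin_memory(kvm, params.guest_uaddr & PAGE_MASK,
                                    params.guest_len, &n, 0);
        if (IS_ERR(guest_page))
                return PTR_ERR(guest_page);   /* never NULL on failure */
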
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Steve Rutherford <srutherford@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Ashish Kalra <ashish.kalra@amd.com>
      Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
      Fixes: 15fb7de1 ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210506175826.2166383-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c7a1b2b6
    • KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails · b4a69392
      Sean Christopherson authored
      Return -EFAULT if copy_to_user() fails; if accessing user memory faults,
      copy_to_user() returns the number of bytes remaining, not an error code.
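
      I.e. the failure has to be translated explicitly (sketch; field names
      abbreviated):

        /* copy_to_user() returns the number of bytes NOT copied. */
        if (copy_to_user((void __user *)params.hdr_uaddr, hdr, params.hdr_len))
                ret = -EFAULT;
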
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Steve Rutherford <srutherford@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Ashish Kalra <ashish.kalra@amd.com>
      Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210506175826.2166383-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b4a69392
    • KVM: SVM: add module param to control the #SMI interception · 4b639a9f
      Maxim Levitsky authored
      In theory there are no side effects of not intercepting #SMI,
      because then #SMI becomes transparent to the OS and to KVM.
      
      Plus, an observation on recent Zen2 CPUs reveals that these
      CPUs ignore #SMI interception and never deliver #SMI VMexits.
      
      This is also useful for testing nested KVM, to verify that L1
      handles #SMIs correctly when L1 does not intercept #SMI.
      
      Finally, the default remains the same: SMIs are intercepted by
      default, so this patch has no effect unless a non-default module
      parameter value is used.
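
      Sketch of the module parameter and its use (placement is illustrative):

        static bool intercept_smi = true;
        module_param(intercept_smi, bool, 0444);

        /* ... during VMCB initialization ... */
        if (intercept_smi)
                svm_set_intercept(svm, INTERCEPT_SMI);
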
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210707125100.677203-4-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b639a9f
    • KVM: SVM: remove INIT intercept handler · 896707c2
      Maxim Levitsky authored
      The kernel never sends a real INIT event to CPUs, other than at boot.
      
      Thus an INIT interception is an error that should be caught by the
      check for an unknown VMexit reason.
      
      On top of that, the current INIT VM exit handler skips
      the current instruction, which is wrong.
      That behavior was added in commit 5ff3a351 ("KVM: x86: Move trivial
      instruction-based exit handlers to common code").
      
      Fixes: 5ff3a351 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210707125100.677203-3-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      896707c2
    • KVM: SVM: #SMI interception must not skip the instruction · 991afbbe
      Maxim Levitsky authored
      Commit 5ff3a351 ("KVM: x86: Move trivial instruction-based
      exit handlers to common code") unfortunately made the mistake of
      treating nop_on_interception and nop_interception in the same way.
      
      The former does truly nothing, while the latter skips the instruction.
      
      The SMI VM exit handler should do nothing
      (the SMI itself is handled by the host when we execute STGI).
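
      Sketch of the fix: give SVM_EXIT_SMI a handler that resumes the guest
      without skipping anything (handler name follows the patch's intent and
      may differ upstream):

        static int smi_interception(struct kvm_vcpu *vcpu)
        {
                /* The SMI itself was already handled by the host; just resume. */
                return 1;
        }

        /* ... in the SVM exit handler table ... */
        [SVM_EXIT_SMI] = smi_interception,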
      
      Fixes: 5ff3a351 ("KVM: x86: Move trivial instruction-based exit handlers to common code")
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210707125100.677203-2-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      991afbbe
    • KVM: VMX: Remove vmx_msr_index from vmx.h · c0e1303e
      Yu Zhang authored
      vmx_msr_index was used to record the list of MSRs which can be lazily
      restored when kvm returns to userspace. It is now reimplemented as
      kvm_uret_msrs_list, a common x86 list which is only used inside x86.c.
      So just remove the obsolete declaration in vmx.h.
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c0e1303e
    • KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() · f85d4016
      Lai Jiangshan authored
      When the host is using debug registers but the guest is not using them
      nor is the guest in guest-debug state, the kvm code does not reset
      the host debug registers before kvm_x86->run().  Rather, it relies on
      the hardware vmentry instruction to automatically reset the dr7 registers
      which ensures that the host breakpoints do not affect the guest.
      
      This however violates the non-instrumentable nature around VM entry
      and exit; for example, a host breakpoint set on vcpu->arch.cr2 can fire
      inside that non-instrumentable region.
      
      Another issue is consistency.  When the guest debug registers are active,
      the host breakpoints are reset before kvm_x86->run(). But when the
      guest debug registers are inactive, disabling the host breakpoints is
      delayed.  The host tracing tools may therefore see different results
      depending on what the guest is doing.
      
      To fix these problems, clear %db7 unconditionally before kvm_x86->run()
      if the host has set any breakpoints, whether or not the guest is using
      them.
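
      Sketch of the resulting logic in vcpu_enter_guest() (the guest-debug
      branch already existed; the new part is the else-if):

        if (unlikely(vcpu->arch.switch_db_regs)) {
                /* Guest debug registers are active: load the guest values. */
                set_debugreg(0, 7);
                set_debugreg(vcpu->arch.eff_db[0], 0);
                /* ... eff_db[1..3], dr6 ... */
        } else if (unlikely(hw_breakpoint_active())) {
                /* Host breakpoints are armed: disarm DR7 before entry. */
                set_debugreg(0, 7);
        }
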
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      [Only clear %db7 instead of reloading all debug registers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f85d4016
    • KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM · 7234c362
      Like Xu authored
      The AMD platform does not support the function 0Ah CPUID leaf. The
      returned results for this entry should all remain zero, just as they do
      on native hardware:
      
      AMD host:
         0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000
      (uncanny) AMD guest:
         0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000
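
      Sketch of the fix in the 0xa leaf handling of __do_cpuid_func() (the
      surrounding Intel PMU enumeration is elided):

        case 0xa: /* Architectural Performance Monitoring */
                if (!static_cpu_has(X86_FEATURE_ARCH_PERFMON)) {
                        /* e.g. AMD: report the whole leaf as zero. */
                        entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
                        break;
                }
                /* ... existing Intel PMU enumeration ... */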
      
      Fixes: cadbaa03 ("perf/x86/intel: Make anythread filter support conditional")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20210628074354.33848-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7234c362
    • KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler · 76ff371b
      Sean Christopherson authored
      Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for
      non-SEV guests, and for SEV guests the C-bit is dropped before the GPA
      hits the NPT in hardware.  Clearing the bit for non-SEV guests causes
      KVM to mishandle #NPFs whose GPA collides with the host's C-bit.
      
      Although the APM doesn't explicitly state that the C-bit is not reserved
      for non-SEV, Tom Lendacky confirmed that the following snippet about the
      effective reduction due to the C-bit does indeed apply only to SEV guests.
      
        Note that because guest physical addresses are always translated
        through the nested page tables, the size of the guest physical address
        space is not impacted by any physical address space reduction indicated
        in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however,
        the guest physical address space is effectively reduced by 1 bit.
      
      And for SEV guests, the APM clearly states that the bit is dropped before
      walking the nested page tables.
      
        If the C-bit is an address bit, this bit is masked from the guest
        physical address when it is translated through the nested page tables.
        Consequently, the hypervisor does not need to be aware of which pages
        the guest has chosen to mark private.
      
      Note, the bogus C-bit clearing was removed from legacy #PF handler in
      commit 6d1b867d ("KVM: SVM: Don't strip the C-bit from CR2 on #PF
      interception").
      
      Fixes: 0ede79e1 ("KVM: SVM: Clear C-bit from the page fault address")
      Cc: Peter Gonda <pgonda@google.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210625020354.431829-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      76ff371b
    • KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs · fc9bf2e0
      Sean Christopherson authored
      Ignore "dynamic" host adjustments to the physical address mask when
      generating the masks for guest PTEs, i.e. the guest PA masks.  The host
      physical address space and guest physical address space are two different
      beasts, e.g. even though SEV's C-bit is the same bit location for both
      host and guest, disabling SME in the host (which clears shadow_me_mask)
      does not affect the guest PTE->GPA "translation".
      
      For non-SEV guests, not dropping bits is the correct behavior.  Assuming
      KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits
      that are lost as collateral damage from memory encryption are treated as
      reserved bits, i.e. KVM will never get to the point where it attempts to
      generate a gfn using the affected bits.  And if userspace wants to create
      a bogus vCPU, then userspace gets to deal with the fallout of hardware
      doing odd things with bad GPAs.
      
      For SEV guests, not dropping the C-bit is technically wrong, but it's a
      moot point because KVM can't read SEV guest's page tables in any case
      since they're always encrypted.  Not to mention that the current KVM code
      is also broken since sme_me_mask does not have to be non-zero for SEV to
      be supported by KVM.  The proper fix would be to teach all of KVM to
      correctly handle guest private memory, but that's a task for the future.
      
      Fixes: d0ec49d4 ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
      Cc: stable@vger.kernel.org
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210623230552.4027702-5-seanjc@google.com>
      [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fc9bf2e0
    • KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR · e39f00f6
      Sean Christopherson authored
      Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to
      enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest
      version is only relevant to NPT/TDP).
      
      When using shadow paging, any reductions to the host's MAXPHYADDR apply
      to KVM and its guests as well, i.e. using the raw CPUID info will cause
      KVM to misreport the number of PA bits available to the guest.
      
      Unconditionally zero out the "Physical Address bit reduction" entry.
      For !TDP, the adjustment is already done, and for TDP enumerating the
      host's reduction is wrong as the reduction does not apply to GPAs.
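
      A hedged sketch of the CPUID 0x80000008 handling after the change (the
      exact structure of the final patch may differ):

        case 0x80000008: {
                unsigned int g_phys_as = (entry->eax >> 16) & 0xff;
                unsigned int virt_as = max((entry->eax >> 8) & 0xff, 48U);
                unsigned int phys_as = entry->eax & 0xff;

                /*
                 * With shadow paging (!tdp_enabled), GPAs go through host
                 * page tables, so host MAXPHYADDR reductions (e.g. for
                 * memory encryption) apply to the guest too; use the
                 * kernel's adjusted x86_phys_bits.  With TDP, reductions to
                 * HPAs do not affect GPAs, so use the raw MAXPHYADDR.
                 */
                if (!tdp_enabled)
                        g_phys_as = boot_cpu_data.x86_phys_bits;
                else if (!g_phys_as)
                        g_phys_as = phys_as;

                entry->eax = g_phys_as | (virt_as << 8);
                break;
        }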
      
      Fixes: 9af9b940 ("x86/cpu/AMD: Handle SME reduction in physical address size")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210623230552.4027702-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e39f00f6