1. 28 Sep 2020, 10 commits
    • KVM: nVMX: Move free_nested() below vmx_switch_vmcs() · c61ca2fc
      Sean Christopherson authored
      Move free_nested() down below vmx_switch_vmcs() so that a future patch
      can do an "emergency" invocation of vmx_switch_vmcs() if vmcs01 is not
      the loaded VMCS when freeing nested resources.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923184452.980-5-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Explicitly check for valid guest state for !unrestricted guest · 2ba4493a
      Sean Christopherson authored
      Call guest_state_valid() directly instead of querying emulation_required
      when checking if L1 is attempting VM-Enter with invalid guest state.
      If emulate_invalid_guest_state is false, KVM will fixup segment regs to
      avoid emulation and will never set emulation_required, i.e. KVM will
      incorrectly miss the associated consistency checks because the nested
      path stuffs segments directly into vmcs02.
      
      Opportunistically add Consistency Check tracing to make future debug
      suck a little less.
      
      Fixes: 2bb8cafe ("KVM: vVMX: signal failure for nested VMEntry if emulation_required")
      Fixes: 3184a995 ("KVM: nVMX: fix vmentry failure code when L2 state would require emulation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923184452.980-4-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Reload vmcs01 if getting vmcs12's pages fails · b89d5ad0
      Sean Christopherson authored
      Reload vmcs01 when bailing from nested_vmx_enter_non_root_mode() as KVM
      expects vmcs01 to be loaded when is_guest_mode() is false.
      
      Fixes: 671ddc70 ("KVM: nVMX: Don't leak L1 MMIO regions to L2")
      Cc: stable@vger.kernel.org
      Cc: Dan Cross <dcross@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Peter Shier <pshier@google.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923184452.980-3-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
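A minimal user-space sketch of the invariant this commit restores. All names here (the struct, `switch_vmcs()`, the failure helper) are illustrative, not KVM's actual code: the point is only that the error path switches back to vmcs01 before reporting failure, so the loaded VMCS agrees with is_guest_mode() == false.

```c
#include <assert.h>

/* Toy model: which VMCS is currently loaded on the (virtual) CPU. */
enum vmcs { VMCS01, VMCS02 };

struct vcpu_state {
    enum vmcs loaded;
    int guest_mode;           /* models is_guest_mode() */
};

static void switch_vmcs(struct vcpu_state *v, enum vmcs target)
{
    v->loaded = target;
}

/* Models nested_vmx_enter_non_root_mode() bailing after the VMCS switch
 * but before entering guest mode (e.g. getting vmcs12's pages fails). */
static int enter_non_root_mode_fails(struct vcpu_state *v)
{
    switch_vmcs(v, VMCS02);
    /* ... mapping vmcs12's pages fails here ... */
    switch_vmcs(v, VMCS01);   /* the fix: reload vmcs01 before bailing */
    return -1;                /* failure reported with guest_mode still 0 */
}
```

With the fix, a caller that sees the failure also sees vmcs01 loaded, matching its expectation for a vCPU that is not in guest mode.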
    • KVM: nVMX: Reset the segment cache when stuffing guest segs · fc387d8d
      Sean Christopherson authored
      Explicitly reset the segment cache after stuffing guest segment regs in
      prepare_vmcs02_rare().  Although the cache is reset when switching to
      vmcs02, there is nothing that prevents KVM from re-populating the cache
      prior to writing vmcs02 with vmcs12's values.  E.g. if the vCPU is
      preempted after switching to vmcs02 but before prepare_vmcs02_rare(),
      kvm_arch_vcpu_put() will dereference GUEST_SS_AR_BYTES via .get_cpl()
      and cache the stale vmcs02 value.  While the current code base only
      caches stale data in the preemption case, it's theoretically possible
      future code could read a segment register during the nested flow itself,
      i.e. this isn't technically illegal behavior in kvm_arch_vcpu_put(),
      although it did introduce the bug.
      
      This manifests as an unexpected nested VM-Enter failure when running
      with unrestricted guest disabled if the above preemption case coincides
      with L1 switching L2's CPL, e.g. when switching from an L2 vCPU at CPL3
      to an L2 vCPU at CPL0.  stack_segment_valid() will see the new SS_SEL
      but the old SS_AR_BYTES and incorrectly mark the guest state as invalid
      due to SS.dpl != SS.rpl.
      
      Don't bother updating the cache even though prepare_vmcs02_rare() writes
      every segment.  With unrestricted guest, guest segments are almost never
      read, let alone L2 guest segments.  On the other hand, populating the
      cache requires a large number of memory writes, i.e. it's unlikely to be
      a net win.  Updating the cache would be a win when unrestricted guest is
      not supported, as guest_state_valid() will immediately cache all segment
      registers.  But, nested virtualization without unrestricted guest is
      dirt slow, saving some VMREADs won't change that, and every CPU
      manufactured in the last decade supports unrestricted guest.  In other
      words, the extra (minor) complexity isn't worth the trouble.
      
      Note, kvm_arch_vcpu_put() may see stale data when querying guest CPL
      depending on when preemption occurs.  This is "ok" in that the usage is
      imperfect by nature, i.e. it's used heuristically to improve performance
      but doesn't affect functionality.  kvm_arch_vcpu_put() could be "fixed"
      by also disabling preemption while loading segments, but that's
      pointless and misleading as reading state from kvm_sched_{in,out}() is
      guaranteed to see stale data in one form or another.  E.g. even if all
      the usage of regs_avail is fixed to call kvm_register_mark_available()
      after the associated state is set, the individual state might still be
      stale with respect to the overall vCPU state.  I.e. making functional
      decisions in an asynchronous hook is doomed from the get go.  Thankfully
      KVM doesn't do that.
      
      Fixes: de63ad4c ("KVM: X86: implement the logic for spinlock optimization")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923184452.980-2-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
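The race above can be sketched in a few lines. This is a simplified, hypothetical model (the struct, `read_ss_ar()`, and `segment_cache_clear()` are illustrative names, not KVM's): a read that slips in between the VMCS switch and the segment stuffing populates the cache with the old value, and the fix is to invalidate the cache again after the new values are written.

```c
#include <assert.h>
#include <stdint.h>

struct vcpu {
    uint32_t vmcs02_ss_ar;    /* SS access rights in the loaded VMCS */
    uint32_t cached_ss_ar;    /* segment-cache copy */
    int      cache_valid;     /* nonzero => cached_ss_ar is trusted */
};

static void segment_cache_clear(struct vcpu *v) { v->cache_valid = 0; }

/* A read through the cache, as .get_cpl() does via GUEST_SS_AR_BYTES. */
static uint32_t read_ss_ar(struct vcpu *v)
{
    if (!v->cache_valid) {
        v->cached_ss_ar = v->vmcs02_ss_ar;   /* VMREAD in real KVM */
        v->cache_valid = 1;
    }
    return v->cached_ss_ar;
}

/* Models prepare_vmcs02_rare(): write vmcs12's value into vmcs02, then
 * clear the cache (the fix) so a read that slipped in beforehand can't
 * leave a stale entry behind. */
static void stuff_guest_segs(struct vcpu *v, uint32_t vmcs12_ss_ar)
{
    v->vmcs02_ss_ar = vmcs12_ss_ar;
    segment_cache_clear(v);
}
```

Without the `segment_cache_clear()` call in `stuff_guest_segs()`, the second read would return the value cached at the preemption point rather than vmcs12's value, which is exactly the SS.dpl != SS.rpl failure described above.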
    • KVM: VMX: Rename RDTSCP secondary exec control name to insert "ENABLE" · 7f3603b6
      Sean Christopherson authored
      Rename SECONDARY_EXEC_RDTSCP to SECONDARY_EXEC_ENABLE_RDTSCP in
      preparation for consolidating the logic for adjusting secondary exec
      controls based on the guest CPUID model.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200923165048.20486-4-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Add VM-Enter failed tracepoints for super early checks · fc595f35
      Sean Christopherson authored
      Add tracepoints for the early consistency checks in nested_vmx_run().
      The "VMLAUNCH vs. VMRESUME" check in particular is useful to trace, as
      there is no architectural way to check VMCS.LAUNCH_STATE, and subtle
      bugs such as VMCLEAR on the wrong HPA can lead to confusing errors in
      the L1 VMM.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200812180615.22372-1-sean.j.christopherson@intel.com>
      Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Morph notification vector IRQ on nested VM-Enter to pending PI · 25bb2cf9
      Sean Christopherson authored
      On successful nested VM-Enter, check for pending interrupts and convert
      the highest priority interrupt to a pending posted interrupt if it
      matches L2's notification vector.  If the vCPU receives a notification
      interrupt before nested VM-Enter (assuming L1 disables IRQs before doing
      VM-Enter), the pending interrupt (for L1) should be recognized and
      processed as a posted interrupt when interrupts become unblocked after
      VM-Enter to L2.
      
      This fixes a bug where L1/L2 will get stuck in an infinite loop if L1 is
      trying to inject an interrupt into L2 by setting the appropriate bit in
      L2's PIR and sending a self-IPI prior to VM-Enter (as opposed to KVM's
      method of manually moving the vector from PIR->vIRR/RVI).  KVM will
      observe the IPI while the vCPU is in L1 context and so won't immediately
      morph it to a posted interrupt for L2.  The pending interrupt will be
      seen by vmx_check_nested_events(), cause KVM to force an immediate exit
      after nested VM-Enter, and eventually be reflected to L1 as a VM-Exit.
      After handling the VM-Exit, L1 will see that L2 has a pending interrupt
      in PIR, send another IPI, and repeat until L2 is killed.
      
      Note, posted interrupts require virtual interrupt delivery, and virtual
      interrupt delivery requires exit-on-interrupt, ergo interrupts will be
      unconditionally unmasked on VM-Enter if posted interrupts are enabled.
      
      Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing")
      Cc: stable@vger.kernel.org
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200812175129.12172-1-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
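The "morph" step can be sketched as follows. This is a heavily simplified model with illustrative names (the struct and helper are hypothetical, not KVM's): after a successful nested VM-Enter, an interrupt that matches vmcs12's posted-interrupt notification vector is consumed and recorded as a pending posted interrupt for L2, instead of being left pending and later reflected to L1 as a spurious VM-Exit.

```c
#include <assert.h>
#include <stdint.h>

struct nested_state {
    uint8_t pending_vec;       /* highest-priority pending IRQ vector, 0 => none */
    uint8_t posted_notif_vec;  /* vmcs12's posted-interrupt notification vector */
    int     pi_pending;        /* posted interrupt to process for L2 */
};

/* Run after a successful nested VM-Enter: if the pending IRQ is L2's
 * notification vector, treat it as a posted interrupt rather than an
 * interrupt destined for L1. */
static void morph_notification_irq(struct nested_state *n)
{
    if (n->pending_vec && n->pending_vec == n->posted_notif_vec) {
        n->pending_vec = 0;    /* don't reflect this IRQ to L1... */
        n->pi_pending = 1;     /* ...process it as a posted interrupt for L2 */
    }
}
```

An interrupt on any other vector is left pending, so the normal "reflect to L1" path is unchanged; only the self-IPI-before-VM-Enter case described above is rerouted.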
    • KVM: nVMX: KVM needs to unset "unrestricted guest" VM-execution control in vmcs02 if vmcs12 doesn't set it · bddd82d1
      Krish Sadhukhan authored
      
      Currently, prepare_vmcs02_early() does not check whether the "unrestricted
      guest" VM-execution control in vmcs12 is turned off, and leaves the
      corresponding bit on in vmcs02. Because of this, VM-entry checks that are
      supposed to render the nested guest state invalid when this VM-execution
      control is not set pass in hardware.
      
      This patch turns off the "unrestricted guest" VM-execution control in vmcs02
      if vmcs12 has turned it off.
      Suggested-by: Jim Mattson <jmattson@google.com>
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Message-Id: <20200921081027.23047-2-krish.sadhukhan@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
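The logic of the fix is a single bit adjustment. A minimal sketch, assuming the SECONDARY_EXEC_UNRESTRICTED_GUEST bit position from the Intel SDM; the helper name is illustrative, not the actual prepare_vmcs02_early() code:

```c
#include <assert.h>
#include <stdint.h>

/* "Unrestricted guest" secondary processor-based VM-execution control,
 * bit 7 per the Intel SDM. */
#define SECONDARY_EXEC_UNRESTRICTED_GUEST (1u << 7)

/* The control may stay enabled in vmcs02 only if vmcs12 requested it:
 * starting from L0's setting, drop the bit when L1 left it clear. */
static uint32_t adjust_unrestricted_guest(uint32_t vmcs02_ctls,
                                          uint32_t vmcs12_ctls)
{
    if (!(vmcs12_ctls & SECONDARY_EXEC_UNRESTRICTED_GUEST))
        vmcs02_ctls &= ~SECONDARY_EXEC_UNRESTRICTED_GUEST;
    return vmcs02_ctls;
}
```

With the bit cleared, hardware performs the stricter guest-state consistency checks on nested VM-Entry, matching the architectural behavior L1 asked for.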
    • KVM: X86: Rename and move the function vmx_handle_memory_failure to x86.c · 3f3393b3
      Babu Moger authored
      Handling of kvm_read/write_guest_virt*() errors can be moved to common
      code. The same code can be used by both VMX and SVM.
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <159985254493.11252.6603092560732507607.stgit@bmoger-ubuntu>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Fix VMX controls MSRs setup when nested VMX enabled · efc83133
      Chenyi Qiang authored
      KVM supports the nested VM_{EXIT, ENTRY}_LOAD_IA32_PERF_GLOBAL_CTRL and
      VM_{ENTRY_LOAD, EXIT_CLEAR}_BNDCFGS, but they are not exposed by the
      system ioctl KVM_GET_MSR.  Add them to the setup of nested VMX controls MSR.
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Message-Id: <20200828085622.8365-2-chenyi.qiang@intel.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 12 Sep 2020, 2 commits
    • KVM: nVMX: Fix the update value of nested load IA32_PERF_GLOBAL_CTRL control · c6b177a3
      Chenyi Qiang authored
      A minor fix for the update of VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL field
      in exit_ctls_high.
      
      Fixes: 03a8871a ("KVM: nVMX: Expose load IA32_PERF_GLOBAL_CTRL VM-{Entry,Exit} control")
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Message-Id: <20200828085622.8365-5-chenyi.qiang@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Update VMCS02 when L2 PAE PDPTE updates detected · 43fea4e4
      Peter Shier authored
      When L2 uses PAE, L0 intercepts of L2 writes to CR0/CR3/CR4 call
      load_pdptrs to read the possibly updated PDPTEs from the guest
      physical address referenced by CR3.  It loads them into
      vcpu->arch.walk_mmu->pdptrs and sets VCPU_EXREG_PDPTR in
      vcpu->arch.regs_dirty.
      
      At the subsequent assumed reentry into L2, the mmu will call
      vmx_load_mmu_pgd which calls ept_load_pdptrs. ept_load_pdptrs sees
      VCPU_EXREG_PDPTR set in vcpu->arch.regs_dirty and loads
      VMCS02.GUEST_PDPTRn from vcpu->arch.walk_mmu->pdptrs[]. This all works
      if the L2 CRn write intercept always resumes L2.
      
      The resume path calls vmx_check_nested_events which checks for
      exceptions, MTF, and expired VMX preemption timers. If
      vmx_check_nested_events finds any of these conditions pending it will
      reflect the corresponding exit into L1. Live migration at this point
      would also cause a missed immediate reentry into L2.
      
      After L1 exits, vmx_vcpu_run calls vmx_register_cache_reset which
      clears VCPU_EXREG_PDPTR in vcpu->arch.regs_dirty.  When L2 next
      resumes, ept_load_pdptrs finds VCPU_EXREG_PDPTR clear in
      vcpu->arch.regs_dirty and does not load VMCS02.GUEST_PDPTRn from
      vcpu->arch.walk_mmu->pdptrs[]. prepare_vmcs02 will then load
      VMCS02.GUEST_PDPTRn from vmcs12->pdptr0/1/2/3 which contain the stale
      values stored at last L2 exit. A repro of this bug showed L2 entering
      triple fault immediately due to the bad VMCS02.GUEST_PDPTRn values.
      
      When L2 is in PAE paging mode add a call to ept_load_pdptrs before
      leaving L2. This will update VMCS02.GUEST_PDPTRn if they are dirty in
      vcpu->arch.walk_mmu->pdptrs[].
      
      Tested:
      kvm-unit-tests with new directed test: vmx_mtf_pdpte_test.
      Verified that test fails without the fix.
      
      Also ran Google internal VMM with an Ubuntu 16.04 4.4.0-83 guest running a
      custom hypervisor with a 32-bit Windows XP L2 guest using PAE. Prior to fix
      would repro readily. Ran 14 simultaneous L2s for 140 iterations with no
      failures.
      Signed-off-by: Peter Shier <pshier@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20200820230545.2411347-1-pshier@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
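The essence of the fix is a flush of dirty PDPTEs into vmcs02 before leaving L2. A simplified sketch with illustrative names (the struct and helper are hypothetical; real KVM calls ept_load_pdptrs() and checks VCPU_EXREG_PDPTR in regs_dirty):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct pdptr_model {
    uint64_t walk_mmu_pdptrs[4];     /* vcpu->arch.walk_mmu->pdptrs[] */
    uint64_t vmcs02_guest_pdptr[4];  /* VMCS02.GUEST_PDPTR0..3 */
    int      pdptr_dirty;            /* models VCPU_EXREG_PDPTR in regs_dirty */
    int      pae_paging;             /* L2 uses PAE paging */
};

/* Run before leaving L2: if the PDPTEs were reloaded (CR0/CR3/CR4 write
 * intercept) and are still marked dirty, propagate them to vmcs02 so the
 * values captured into vmcs12 on exit are fresh, not stale. */
static void flush_pdptrs_before_l2_exit(struct pdptr_model *v)
{
    if (v->pae_paging && v->pdptr_dirty) {
        memcpy(v->vmcs02_guest_pdptr, v->walk_mmu_pdptrs,
               sizeof(v->vmcs02_guest_pdptr));
        v->pdptr_dirty = 0;
    }
}
```

Without this step, the dirty bit is cleared on the next vmx_vcpu_run() and the updated PDPTEs never reach vmcs02, producing the stale GUEST_PDPTRn values and the L2 triple fault described above.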
  3. 31 Jul 2020, 1 commit
    • KVM: x86: Pull the PGD's level from the MMU instead of recalculating it · 2a40b900
      Sean Christopherson authored
      Use the shadow_root_level from the current MMU as the root level for the
      PGD, i.e. for VMX's EPTP.  This eliminates the weird dependency between
      VMX and the MMU where both must independently calculate the same root
      level for things to work correctly.  Temporarily keep VMX's calculation
      of the level and use it to WARN if the incoming level diverges.
      
      Opportunistically refactor kvm_mmu_load_pgd() to avoid indentation hell,
      and rename a 'cr3' param in the load_mmu_pgd prototype that managed to
      survive the cr3 purge.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Message-Id: <20200716034122.5998-6-sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
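To see why a single source of truth matters here, consider how the EPTP encodes the walk level. A sketch of an EPTP constructor that simply takes the MMU's level as a parameter, following the Intel SDM encoding (EPTP bits 5:3 hold level - 1, bits 2:0 the memory type, WB = 6); the function name mirrors KVM's construct_eptp() but the sketch is simplified:

```c
#include <assert.h>
#include <stdint.h>

/* Build an EPTP value from a root table address and a walk level taken
 * directly from the MMU (e.g. shadow_root_level), rather than having VMX
 * recompute the level independently. */
static uint64_t construct_eptp(uint64_t root_hpa, int level)
{
    uint64_t eptp = root_hpa & ~0xfffull;   /* 4 KiB-aligned root table */
    eptp |= (uint64_t)(level - 1) << 3;     /* page-walk length, bits 5:3 */
    eptp |= 6;                              /* write-back memory type, bits 2:0 */
    return eptp;
}
```

If VMX computed `level` on its own and the MMU disagreed, the EPT walk length would not match the page tables the MMU built; passing the MMU's level through removes that failure mode, which is exactly the dependency the commit eliminates.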
  4. 27 Jul 2020, 2 commits
  5. 11 Jul 2020, 1 commit
  6. 10 Jul 2020, 1 commit
  7. 09 Jul 2020, 3 commits
  8. 04 Jul 2020, 1 commit
  9. 11 Jun 2020, 2 commits
  10. 08 Jun 2020, 1 commit
    • KVM: VMX: Properly handle kvm_read/write_guest_virt*() result · 7a35e515
      Vitaly Kuznetsov authored
      Syzbot reports the following issue:
      
      WARNING: CPU: 0 PID: 6819 at arch/x86/kvm/x86.c:618
       kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
      Call Trace:
      ...
      RIP: 0010:kvm_inject_emulated_page_fault+0x210/0x290 arch/x86/kvm/x86.c:618
      ...
       nested_vmx_get_vmptr+0x1f9/0x2a0 arch/x86/kvm/vmx/nested.c:4638
       handle_vmon arch/x86/kvm/vmx/nested.c:4767 [inline]
       handle_vmon+0x168/0x3a0 arch/x86/kvm/vmx/nested.c:4728
       vmx_handle_exit+0x29c/0x1260 arch/x86/kvm/vmx/vmx.c:6067
      
      'exception' we're trying to inject with kvm_inject_emulated_page_fault()
      comes from:
      
        nested_vmx_get_vmptr()
         kvm_read_guest_virt()
           kvm_read_guest_virt_helper()
             vcpu->arch.walk_mmu->gva_to_gpa()
      
      but it is only set when GVA to GPA conversion fails. In case it doesn't but
      we still fail kvm_vcpu_read_guest_page(), X86EMUL_IO_NEEDED is returned and
      nested_vmx_get_vmptr() calls kvm_inject_emulated_page_fault() with zeroed
      'exception'. This happens when the argument is MMIO.
      
      Paolo also noticed that nested_vmx_get_vmptr() is not the only place in
      KVM code where kvm_read/write_guest_virt*() return result is mishandled.
      VMX instructions along with INVPCID have the same issue. This was already
      noticed before, e.g. see commit 541ab2ae ("KVM: x86: work around
      leak of uninitialized stack contents") but was never fully fixed.
      
      KVM could've handled the request correctly by going to userspace and
      performing I/O but there doesn't seem to be a good need for such requests
      in the first place.
      
      Introduce vmx_handle_memory_failure() as an interim solution.
      
      Note, nested_vmx_get_vmptr() now has three possible outcomes: OK, PF,
      KVM_EXIT_INTERNAL_ERROR and callers need to know if userspace exit is
      needed (for KVM_EXIT_INTERNAL_ERROR) in case of failure. We don't seem
      to have a good enum describing this tristate, just add "int *ret" to
      nested_vmx_get_vmptr() interface to pass the information.
      
      Reported-by: syzbot+2a7156e11dc199bdbd8a@syzkaller.appspotmail.com
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20200605115906.532682-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
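The tristate outcome described above can be sketched as a small dispatcher. The enum constants and names here are illustrative stand-ins (real KVM uses X86EMUL_* result codes and KVM_EXIT_INTERNAL_ERROR); the point is that only the page-fault case carries valid exception info, so anything else must bail to userspace instead of injecting a zeroed #PF:

```c
#include <assert.h>

/* Illustrative stand-ins for the X86EMUL_* result codes. */
enum emul_result { EMUL_OK, EMUL_PF, EMUL_IO_NEEDED };

/* The three outcomes a caller of the helper must distinguish. */
enum handled { RESUME_GUEST, INJECT_PF, EXIT_TO_USERSPACE };

static enum handled handle_memory_failure(enum emul_result r)
{
    switch (r) {
    case EMUL_OK:
        return RESUME_GUEST;
    case EMUL_PF:
        return INJECT_PF;         /* exception info is valid: inject the #PF */
    default:
        /* e.g. IO_NEEDED for an MMIO argument: there is no exception to
         * inject, so report an internal error and exit to userspace. */
        return EXIT_TO_USERSPACE;
    }
}
```

This mirrors the "int *ret" plumbing added to nested_vmx_get_vmptr(): the caller needs to know whether to resume, inject, or exit, and conflating the last two is exactly the bug syzbot hit.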
  11. 01 Jun 2020, 4 commits
  12. 16 May 2020, 5 commits
  13. 14 May 2020, 7 commits