1. 28 Jan, 2022 1 commit
  2. 27 Jan, 2022 2 commits
    • S
      KVM: nVMX: WARN on any attempt to allocate shadow VMCS for vmcs02 · d6e656cd
      Sean Christopherson committed
      WARN if KVM attempts to allocate a shadow VMCS for vmcs02.  KVM emulates
      VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate
      a "real" shadow VMCS for L2.
      
      The previous code WARNed but continued with the allocation anyway,
      presumably in an attempt to avoid a NULL pointer dereference.
      However, alloc_vmcs() (and hence alloc_shadow_vmcs()) can fail, and
      indeed the sole caller already handles failure:
      
      	if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu))
      		goto out_shadow_vmcs;
      
      which makes continuing with the allocation a pointless precaution.

      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220527.2093146-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      d6e656cd
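The behavior described above can be sketched in a toy user-space model. This is an illustration under stated assumptions, not the kernel code: `struct toy_vcpu`, `toy_alloc_vmcs()` and `toy_alloc_shadow_vmcs()` are hypothetical stand-ins, and the warning is modeled with a plain `fprintf()`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-ins for KVM's structures; not the kernel definitions. */
struct toy_loaded_vmcs {
	void *shadow_vmcs;
};

struct toy_vcpu {
	struct toy_loaded_vmcs vmcs01;
	struct toy_loaded_vmcs vmcs02;
	struct toy_loaded_vmcs *loaded;	/* VMCS currently in use */
	bool fail_alloc;		/* force an allocation failure */
};

/* Models an allocator that can fail, like alloc_vmcs() in the text. */
static void *toy_alloc_vmcs(const struct toy_vcpu *v)
{
	return v->fail_alloc ? NULL : calloc(1, 4096);
}

/*
 * Warn and bail if the current VMCS is not vmcs01. The caller must cope
 * with a NULL return either way, because the allocation itself can fail.
 */
static void *toy_alloc_shadow_vmcs(struct toy_vcpu *v)
{
	struct toy_loaded_vmcs *l = v->loaded;

	if (l != &v->vmcs01) {
		fprintf(stderr, "WARN: shadow VMCS requested for vmcs02\n");
		return l->shadow_vmcs;	/* no allocation is attempted */
	}
	if (!l->shadow_vmcs)
		l->shadow_vmcs = toy_alloc_vmcs(v);
	return l->shadow_vmcs;
}
```

Because the caller already has to handle a NULL return from a failed allocation, returning early on the unexpected vmcs02 case adds no new failure mode in this model.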
    • S
      KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson committed
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7e57078
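A minimal sketch of the "forcibly leave nested operation when SMM is toggled" rule, assuming a toy vCPU with only three flags. `struct toy_smm_vcpu`, `toy_leave_nested()` and `toy_set_smm()` are hypothetical names for illustration and do not mirror KVM's real data structures.

```c
#include <stdbool.h>

/* Hypothetical miniature of the vCPU state involved; not KVM's structures. */
struct toy_smm_vcpu {
	bool in_smm;		/* vCPU is currently in SMM */
	bool guest_mode;	/* true while L2 is active */
	bool vmxon;		/* models vmx->nested.vmxon */
};

/* Drop all nested state, as a forced "leave nested" would. */
static void toy_leave_nested(struct toy_smm_vcpu *v)
{
	v->guest_mode = false;
	v->vmxon = false;
}

/* Models userspace setting the SMM flag through an events ioctl. */
static void toy_set_smm(struct toy_smm_vcpu *v, bool smm)
{
	if (v->in_smm == smm)
		return;		/* no toggle, nothing to tear down */

	/* Any toggle discards nested state instead of trying to fix it up. */
	toy_leave_nested(v);
	v->in_smm = smm;
}
```

The design choice mirrored here is that no attempt is made to preserve nested state across the toggle; the model simply tears it down.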
  3. 07 Jan, 2022 2 commits
    • E
      KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff
      Eric Hankland committed
      When KVM retires a guest branch instruction through emulation,
      increment any vPMCs that are configured to monitor "branch
      instructions retired," and update the sample period of those counters
      so that they will overflow at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
          emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
          that code.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20211130074221.93635-7-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      018d70ff
    • L
      KVM: VMX: Save HOST_CR3 in vmx_set_host_fs_gs() · a9f2705e
      Lai Jiangshan committed
      The host CR3 in the vcpu thread can only be changed when scheduling,
      so commit 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      changed vmx.c to only save it in vmx_prepare_switch_to_guest().
      
      However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS.
      vmx_set_host_fs_gs() is called in both places, so rename it to
      vmx_set_vmcs_host_state() and make it update HOST_CR3.
      
      Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()")
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a9f2705e
  4. 08 Dec, 2021 8 commits
  5. 02 Dec, 2021 1 commit
  6. 26 Nov, 2021 3 commits
    • S
      KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12 · 712494de
      Sean Christopherson committed
      Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
      i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
      TLB entries for vpid02.  From L1's perspective, changing L2's VPID is
      effectively a TLB flush unless "hardware" has previously cached entries
      for the new vpid12.  Because KVM tracks only a single vpid12, KVM doesn't
      know if the new vpid12 has been used in the past and so must treat it as
      a brand new, never been used VPID, i.e. must assume that the new vpid12
      represents a TLB flush from L1's perspective.
      
      For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
      VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
      and thus can't have cached entries in the TLB for vpid12.
      Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
      Fixes: 5c614b35 ("KVM: nVMX: nested VPID emulation")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211125014944.536398-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      712494de
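The "a new vpid12 must be treated as a guest TLB flush" rule can be sketched as follows. This is a toy model under assumptions: `struct toy_nested` and `toy_nested_vmenter()` are hypothetical, the flush is modeled as a boolean rather than a real KVM request, and the case of L2 running without a VPID is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU nested state; KVM tracks only a single vpid12. */
struct toy_nested {
	uint16_t last_vpid;	/* last vpid12 seen on nested VM-Enter */
	bool guest_flush_req;	/* models a pending guest TLB flush request */
};

/*
 * On nested VM-Enter, a vpid12 that differs from the last one seen is
 * assumed to be brand new, so a full guest TLB flush is requested.
 */
static void toy_nested_vmenter(struct toy_nested *n, uint16_t vpid12)
{
	if (vpid12 != n->last_vpid) {
		n->last_vpid = vpid12;
		n->guest_flush_req = true;
	}
}
```

Since only one vpid12 is remembered, switching back to a previously used VPID also triggers a flush in this model, which is the conservative behavior the commit message describes.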
    • S
      KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit · 40e5f908
      Sean Christopherson committed
      Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
      nested transitions, as KVM doesn't track requests for L1 vs L2.  E.g. if
      there's a pending flush when a nested VM-Exit occurs, then the flush was
      requested in the context of L2 and needs to be handled before switching
      to L1, otherwise the flush for L2 would effectively be lost.
      
      Opportunistically add a helper to handle CURRENT and GUEST as a pair: the
      logic for when they need to be serviced is identical, as both requests are
      tied to L1 vs. L2, and the only difference is the scope of the flush.
      Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
      Fixes: 07ffaf34 ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211125014944.536398-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      40e5f908
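A rough sketch of why pending flush requests have to be serviced before a nested transition. Everything here is a simplified assumption: requests are a bitmask, a "flush" is just a per-level counter, and `toy_service_local_flushes()` and `toy_nested_vmexit()` are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy request bits standing in for the CURRENT and GUEST flush requests. */
enum {
	TOY_REQ_FLUSH_CURRENT = 1u << 0,
	TOY_REQ_FLUSH_GUEST   = 1u << 1,
};

struct toy_flush_vcpu {
	uint32_t requests;	/* pending requests, not tagged L1 vs. L2 */
	bool in_l2;		/* which level is currently running */
	unsigned int l1_flushes;
	unsigned int l2_flushes;
};

/* Service both requests as a pair, against the currently active level. */
static void toy_service_local_flushes(struct toy_flush_vcpu *v)
{
	uint32_t mask = TOY_REQ_FLUSH_CURRENT | TOY_REQ_FLUSH_GUEST;

	if (!(v->requests & mask))
		return;
	if (v->in_l2)
		v->l2_flushes++;
	else
		v->l1_flushes++;
	v->requests &= ~mask;
}

/* A nested VM-Exit must flush for L2 before switching context to L1. */
static void toy_nested_vmexit(struct toy_flush_vcpu *v)
{
	toy_service_local_flushes(v);
	v->in_l2 = false;
}
```

If the service step were skipped, the pending request would later be handled while L1 is active, and the flush intended for L2 would never happen.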
    • P
      KVM: VMX: do not use uninitialized gfn_to_hva_cache · 8503fea6
      Paolo Bonzini committed
      An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
      the accessors to croak very loudly.  While a BUG_ON is definitely
      _too_ loud and a bug on its own, there is indeed an issue of using
      the caches in such a way that they could not have been initialized,
      because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
      would not be called.
      
      For the vmcs12_cache, the solution is simply to invoke
      kvm_gfn_to_hva_cache_init unconditionally: we already know
      that the cache does not match the current VMCS pointer.
      For the shadow_vmcs12_cache, there is no similar condition
      that checks the VMCS link pointer, so invalidate the cache
      on VMXON.
      
      Fixes: cee66664 ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
      Acked-by: David Woodhouse <dwmw@amazon.co.uk>
      Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8503fea6
  7. 18 Nov, 2021 4 commits
  8. 11 Nov, 2021 3 commits
    • V
      KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT · 329bd56c
      Vipin Sharma committed
      handle_invept(), handle_invvpid(), and handle_invpcid() read the same reg2
      field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that
      holds the invalidation type. Add a helper to retrieve reg2 from VMX
      instruction info to consolidate and document the shift+mask magic.
      Signed-off-by: Vipin Sharma <vipinsh@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109174426.2350547-2-vipinsh@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      329bd56c
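The "shift+mask magic" mentioned above can be illustrated with a small helper. This sketch assumes that reg2 occupies bits 31:28 of the 32-bit VMX instruction-information field for INVEPT, INVVPID, and INVPCID; that bit position and the name `toy_instr_info_reg2()` are assumptions for illustration, so check the field layout against the SDM before relying on it.

```c
#include <stdint.h>

/* Assumed layout: reg2 lives in bits 31:28 of the instruction info field. */
#define TOY_INSTR_INFO_REG2_SHIFT	28
#define TOY_INSTR_INFO_REG2_MASK	0xfu

/* Return the index (0-15) of the GPR that holds the invalidation type. */
static inline int toy_instr_info_reg2(uint32_t vmx_instr_info)
{
	return (vmx_instr_info >> TOY_INSTR_INFO_REG2_SHIFT) &
	       TOY_INSTR_INFO_REG2_MASK;
}
```

Naming the shift and mask in one place means the three handlers no longer each carry their own copy of the bit arithmetic.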
    • S
      KVM: nVMX: Clean up x2APIC MSR handling for L2 · a5e0c252
      Sean Christopherson committed
      Clean up the x2APIC MSR bitmap interception code for L2, which is the last
      holdout of open coded bitmap manipulations.  Freshen up the SDM/PRM
      comment, rename the function to make it abundantly clear the funky
      behavior is x2APIC specific, and explain _why_ vmcs01's bitmap is ignored
      (the previous comment was flat out wrong for x2APIC behavior).
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a5e0c252
    • S
      KVM: nVMX: Handle dynamic MSR intercept toggling · 67f4b996
      Sean Christopherson committed
      Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2,
      and always update the relevant bits in vmcs02.  This fixes two distinct,
      but intertwined bugs related to dynamic MSR bitmap modifications.
      
      The first issue is that KVM fails to enable MSR interception in vmcs02
      for the FS/GS base MSRs if L1 first runs L2 with interception disabled,
      and later enables interception.
      
      The second issue is that KVM fails to honor userspace MSR filtering when
      preparing vmcs02.
      
      Fix both issues simultaneously as fixing only one of the issues (doesn't
      matter which) would create a mess that no one should have to bisect.
      Fixing only the first bug would exacerbate the MSR filtering issue as
      userspace would see inconsistent behavior depending on the whims of L1.
      Fixing only the second bug (MSR filtering) effectively requires fixing
      the first, as the nVMX code only knows how to transition vmcs02's
      bitmap from 1->0.
      
      Move the various accessor/mutators that are currently buried in vmx.c
      into vmx.h so that they can be shared by the nested code.
      
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      67f4b996
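The merge rule described above, sketched on a toy bitmap. The assumptions: one bit per MSR in a single 64-bit word, a set bit means "intercept", and `toy_merge_msr_intercept()` is a hypothetical helper. Real MSR bitmaps are laid out differently; only the always-check, always-update logic is the point.

```c
#include <stdint.h>

/*
 * Recompute one MSR's intercept bit for vmcs02. The bit is set if either
 * L0 (vmcs01's bitmap) or L1 (vmcs12's bitmap) wants the intercept, and
 * it is written in both directions, so 0->1 transitions work as well as
 * 1->0. idx must be less than 64 in this toy layout.
 */
static uint64_t toy_merge_msr_intercept(uint64_t bitmap02, uint64_t bitmap01,
					uint64_t bitmap12, unsigned int idx)
{
	uint64_t bit = UINT64_C(1) << idx;

	if ((bitmap01 | bitmap12) & bit)
		return bitmap02 | bit;	/* someone wants the intercept */
	return bitmap02 & ~bit;		/* nobody does, pass through */
}
```

A merge that only ever cleared bits would miss the case where L1 first runs L2 with interception disabled and later enables it, which is the first bug the commit message describes.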
  9. 25 Oct, 2021 1 commit
  10. 30 Sep, 2021 2 commits
  11. 22 Sep, 2021 4 commits
  12. 13 Aug, 2021 4 commits
    • S
      KVM: nVMX: Unconditionally clear nested.pi_pending on nested VM-Enter · f7782bb8
      Sean Christopherson committed
      Clear nested.pi_pending on nested VM-Enter even if L2 will run without
      posted interrupts enabled.  If nested.pi_pending is left set from a
      previous L2, vmx_complete_nested_posted_interrupt() will pick up the
      stale flag and exit to userspace with an "internal emulation error" due to
      the new L2 not having a valid nested.pi_desc.
      
      Arguably, vmx_complete_nested_posted_interrupt() should first check for
      posted interrupts being enabled, but it's also completely reasonable that
      KVM wouldn't screw up a fundamental flag.  Not to mention that the mere
      existence of nested.pi_pending is a long-standing bug as KVM shouldn't
      move the posted interrupt out of the IRR until it's actually processed,
      e.g. KVM effectively drops an interrupt when it performs a nested VM-Exit
      with a "pending" posted interrupt.  Fixing the mess is a future problem.
      
      Prior to vmx_complete_nested_posted_interrupt() interpreting a null PI
      descriptor as an error, this was a benign bug as the null PI descriptor
      effectively served as a check on PI not being enabled.  Even then, the
      new flow did not become problematic until KVM started checking the result
      of kvm_check_nested_events().
      
      Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing")
      Fixes: 966eefb8 ("KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable")
      Fixes: 47d3530f86c0 ("KVM: x86: Exit to userspace when kvm_check_nested_events fails")
      Cc: stable@vger.kernel.org
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210810144526.2662272-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7782bb8
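A toy reproduction of the stale-flag problem and its fix, under simplifying assumptions: the descriptor is an opaque pointer, the "internal emulation error" is modeled as a `-1` return, and `toy_nested_vmenter_pi()` and `toy_complete_nested_pi()` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

struct toy_nested_pi {
	bool pi_pending;	/* models nested.pi_pending */
	void *pi_desc;		/* models nested.pi_desc, NULL if PI is off */
};

/* Nested VM-Enter: the flag is cleared whether or not L2 uses PI. */
static void toy_nested_vmenter_pi(struct toy_nested_pi *n,
				  bool posted_intr_enabled, void *desc)
{
	n->pi_pending = false;	/* unconditional, so no stale state survives */
	n->pi_desc = posted_intr_enabled ? desc : NULL;
}

/* Returns 0 on success, -1 to model an exit to userspace with an error. */
static int toy_complete_nested_pi(struct toy_nested_pi *n)
{
	if (!n->pi_pending)
		return 0;
	if (!n->pi_desc)
		return -1;	/* pending flag set, but no descriptor */
	n->pi_pending = false;
	return 0;
}
```

Without the unconditional clear, a flag left over from a previous L2 would reach the second check with a NULL descriptor and produce the spurious error.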
    • S
      KVM: nVMX: Pull KVM L0's desired controls directly from vmcs01 · 389ab252
      Sean Christopherson committed
      When preparing controls for vmcs02, grab KVM's desired controls from
      vmcs01's shadow state instead of recalculating the controls from scratch
      or, in the case of the secondary execution controls, using the dedicated
      cache.  Calculating secondary exec controls is eye-poppingly expensive
      due to the guest CPUID checks, hence the dedicated cache, but the other
      calculations aren't exactly free either.
      
      Explicitly clear several bits (x2APIC, DESC exiting, and load EFER on
      exit) as appropriate as they may be set in vmcs01, whereas the previous
      implementation relied on dynamic bits being cleared in the calculator.
      
      Intentionally propagate VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL from
      vmcs01 to vmcs02.  Whether or not PERF_GLOBAL_CTRL is loaded depends on
      whether or not perf itself is active, so unless perf stops between the
      exit from L1 and entry to L2, vmcs01 will hold the desired value.  This
      is purely an optimization as atomic_switch_perf_msrs() will set/clear
      the control as needed at VM-Enter, i.e. it avoids two extra VMWRITEs in
      the case where perf is active (versus starting with the bits clear in
      vmcs02, which was the previous behavior).
      
      Cc: Zeng Guang <guang.zeng@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210810171952.2758100-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      389ab252
    • S
      KVM: nVMX: Use vmx_need_pf_intercept() when deciding if L0 wants a #PF · 18712c13
      Sean Christopherson committed
      Use vmx_need_pf_intercept() when determining if L0 wants to handle a #PF
      in L2 or if the VM-Exit should be forwarded to L1.  The current logic fails
      to account for the case where #PF is intercepted to handle
      guest.MAXPHYADDR < host.MAXPHYADDR and ends up reflecting all #PFs into
      L1.  At best, L1 will complain and inject the #PF back into L2.  At
      worst, L1 will eat the unexpected fault and cause L2 to hang on infinite
      page faults.
      
      Note, while the bug was technically introduced by the commit that added
      support for the MAXPHYADDR madness, the shame is all on commit
      a0c13434 ("KVM: VMX: introduce vmx_need_pf_intercept").
      
      Fixes: 1dbf5d68 ("KVM: VMX: Add guest physical address check in EPT violation and misconfig")
      Cc: stable@vger.kernel.org
      Cc: Peter Shier <pshier@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210812045615.3167686-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      18712c13
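The decision described above can be sketched as a pure function. This is an assumption-laden model rather than the kernel's implementation: the parameters (`ept_enabled`, `allow_smaller_maxphyaddr`, and the two MAXPHYADDR values) are hypothetical inputs, and `toy_old_l0_wants_pf()` is a simplified stand-in for the earlier check, not a copy of it.

```c
#include <stdbool.h>

/*
 * L0 needs to see #PF when it is shadowing page tables (no EPT), or when
 * it intercepts #PF to emulate a guest MAXPHYADDR smaller than the host's.
 */
static bool toy_need_pf_intercept(bool ept_enabled,
				  bool allow_smaller_maxphyaddr,
				  unsigned int guest_maxphyaddr,
				  unsigned int host_maxphyaddr)
{
	if (!ept_enabled)
		return true;
	return allow_smaller_maxphyaddr && guest_maxphyaddr < host_maxphyaddr;
}

/* Simplified stand-in for the earlier logic, which only looked at EPT. */
static bool toy_old_l0_wants_pf(bool ept_enabled)
{
	return !ept_enabled;
}
```

With EPT enabled and a smaller guest MAXPHYADDR, the two functions disagree, and that disagreement corresponds to the case where #PFs were wrongly reflected into L1.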
    • J
      kvm: vmx: Sync all matching EPTPs when injecting nested EPT fault · 85aa8889
      Junaid Shahid committed
      When a nested EPT violation/misconfig is injected into the guest,
      the shadow EPT PTEs associated with that address need to be synced.
      This is done by kvm_inject_emulated_page_fault() before it calls
      nested_ept_inject_page_fault(). However, that will only sync the
      shadow EPT PTE associated with the current L1 EPTP. Since the ASID
      is based on EP4TA rather than the full EPTP, syncing the current
      EPTP is not enough. The SPTEs associated with any other L1 EPTPs
      in the prev_roots cache with the same EP4TA also need to be synced.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Message-Id: <20210806222229.1645356-1-junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      85aa8889
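A small sketch of matching EPTPs by EP4TA rather than by the full value. It assumes the low 12 bits of an EPTP carry attributes (memory type, walk length, and similar) and everything above them is the table address; reserved high bits are ignored for simplicity. `toy_same_ep4ta()` and `toy_count_roots_to_sync()` are hypothetical names, and the array stands in for the prev_roots cache.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed split: bits 11:0 are attributes, the rest is the EP4TA. */
#define TOY_EP4TA_MASK	(~UINT64_C(0xfff))

/* Two EPTPs share an ASID in this model if their EP4TA bits are equal. */
static bool toy_same_ep4ta(uint64_t eptp_a, uint64_t eptp_b)
{
	return (eptp_a & TOY_EP4TA_MASK) == (eptp_b & TOY_EP4TA_MASK);
}

/* Count cached roots whose SPTEs would also need syncing for this EPTP. */
static size_t toy_count_roots_to_sync(const uint64_t *prev_eptps, size_t n,
				      uint64_t eptp)
{
	size_t i, hits = 0;

	for (i = 0; i < n; i++) {
		if (toy_same_ep4ta(prev_eptps[i], eptp))
			hits++;
	}
	return hits;
}
```

Comparing full EPTP values instead would miss cached roots that differ only in their attribute bits yet still share the same EP4TA.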
  13. 02 Aug, 2021 3 commits
  14. 25 Jun, 2021 1 commit
  15. 24 Jun, 2021 1 commit