1. 27 Jan, 2022 5 commits
    • KVM: x86: Sync the states size with the XCR0/IA32_XSS at any time · 05a9e065
      Like Xu authored
      XCR0 is reset to 1 by RESET but not INIT and IA32_XSS is zeroed by
      both RESET and INIT. The kvm_set_msr_common()'s handling of MSR_IA32_XSS
      also needs to call kvm_update_cpuid_runtime(). In the above cases, the
      size in bytes of the XSAVE area containing all states enabled by XCR0 or
      (XCR0 | IA32_XSS) needs to be updated.
      
      For simplicity and consistency, existing helpers are used to write values
      and call kvm_update_cpuid_runtime(), and it's not exactly a fast path.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Update vCPU's runtime CPUID on write to MSR_IA32_XSS · 4c282e51
      Like Xu authored
      Do a runtime CPUID update for a vCPU if MSR_IA32_XSS is written, as the
      size in bytes of the XSAVE area is affected by the states enabled in XSS.
      
      Fixes: 20300099 ("kvm: vmx: add MSR logic for XSAVES")
      Cc: stable@vger.kernel.org
      Signed-off-by: Like Xu <likexu@tencent.com>
      [sean: split out as a separate patch, adjust Fixes tag]
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Keep MSR_IA32_XSS unchanged for INIT · be4f3b3f
      Xiaoyao Li authored
      As corrected in SDM version 075, MSR_IA32_XSS is reset to zero on
      power-up and RESET but is left unchanged on INIT.
      
      Fixes: a554d207 ("KVM: X86: Processor States following Reset or INIT")
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220126172226.2298529-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
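
The three XSAVE-related commits above can be pictured with a small model. This is an illustrative Python sketch, not KVM code: `VcpuModel`, `COMPONENTS`, and the component offsets and sizes are assumptions made for the example, and the reset behaviour follows the corrected SDM wording from be4f3b3f (IA32_XSS preserved across INIT). It shows why every write to XCR0 or IA32_XSS, including the writes done during reset, has to refresh the cached XSAVE area sizes.

```python
LEGACY_AND_HEADER = 512 + 64  # legacy FXSAVE region plus XSAVE header

# bit -> (standard-format offset or None, size in bytes).
# Illustrative values only; real values come from CPUID leaf 0xD.
COMPONENTS = {
    2: (576, 256),   # a user state component enabled through XCR0
    8: (None, 128),  # a supervisor state component enabled through IA32_XSS
}


def standard_size(xcr0):
    """Size of the standard-format area for the features enabled in XCR0."""
    end = LEGACY_AND_HEADER
    for bit, (offset, size) in COMPONENTS.items():
        if (xcr0 >> bit) & 1 and offset is not None:
            end = max(end, offset + size)
    return end


def compacted_size(mask):
    """Size of the compacted area for the features in (XCR0 | IA32_XSS)."""
    extra = sum(size for bit, (_, size) in COMPONENTS.items() if (mask >> bit) & 1)
    return LEGACY_AND_HEADER + extra


class VcpuModel:
    def __init__(self):
        self.xcr0 = 1
        self.xss = 0
        self.update_cpuid_runtime()

    def update_cpuid_runtime(self):
        # Stand-in for the size fields reported through CPUID leaf 0xD.
        self.size_xcr0 = standard_size(self.xcr0)
        self.size_xcr0_xss = compacted_size(self.xcr0 | self.xss)

    def write_xcr0(self, value):
        self.xcr0 = value
        self.update_cpuid_runtime()

    def write_xss(self, value):
        self.xss = value
        self.update_cpuid_runtime()

    def reset(self, init_event):
        # RESET rewrites both registers; INIT leaves them untouched.
        if not init_event:
            self.write_xcr0(1)
            self.write_xss(0)
```

Routing the reset path through the same write helpers keeps the cached sizes consistent without a separate update call, which mirrors the "existing helpers are used" remark in 05a9e065.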
    • KVM: x86: Forcibly leave nested virt when SMM state is toggled · f7e57078
      Sean Christopherson authored
      Forcibly leave nested virtualization operation if userspace toggles SMM
      state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS.  If userspace
      forces the vCPU out of SMM while it's post-VMXON and then injects an SMI,
      vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both
      vmxon=false and smm.vmxon=false, but all other nVMX state allocated.
      
      Don't attempt to gracefully handle the transition as (a) most transitions
      are nonsensical, e.g. forcing SMM while L2 is running, (b) there isn't
      sufficient information to handle all transitions, e.g. SVM wants access
      to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede
      KVM_SET_NESTED_STATE during state restore as the latter disallows putting
      the vCPU into L2 if SMM is active, and disallows tagging the vCPU as
      being post-VMXON in SMM if SMM is not active.
      
      Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX
      due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond
      just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU
      in an architecturally impossible state.
      
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Modules linked in:
        CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline]
        RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656
        Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00
        Call Trace:
         <TASK>
         kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123
         kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]
         kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460
         kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline]
         kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline]
         kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250
         kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273
         __fput+0x286/0x9f0 fs/file_table.c:311
         task_work_run+0xdd/0x1a0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0xb29/0x2a30 kernel/exit.c:806
         do_group_exit+0xd2/0x2f0 kernel/exit.c:935
         get_signal+0x4b0/0x28c0 kernel/signal.c:2862
         arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
         do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      Cc: stable@vger.kernel.org
      Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220125220358.2091737-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
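
The policy described in f7e57078 can be sketched as a toy state machine. This is a simplified Python model under stated assumptions: `NestedVcpuModel` and its fields are hypothetical names, and "leaving nested" is reduced to clearing three flags rather than the real nVMX or nSVM teardown.

```python
class NestedVcpuModel:
    """Toy vCPU tracking only the flags discussed in the commit message."""

    def __init__(self):
        self.in_smm = False
        self.vmxon = False                    # vCPU is post-VMXON
        self.guest_mode = False               # vCPU is running L2
        self.nested_state_allocated = False

    def vmxon_enter(self):
        self.vmxon = True
        self.nested_state_allocated = True

    def leave_nested(self):
        # Drop out of L2 and release all nested state in one step.
        self.guest_mode = False
        self.vmxon = False
        self.nested_state_allocated = False

    def set_vcpu_events_smm(self, smm):
        # Forcibly leave nested operation whenever userspace flips SMM,
        # instead of trying to handle each transition gracefully.
        if smm != self.in_smm:
            self.leave_nested()
        self.in_smm = smm
```

Because the teardown happens on every toggle, the model can never reach the combination the commit warns about, where the post-VMXON flags are clear but nested state is still allocated.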
    • KVM: x86: Pass emulation type to can_emulate_instruction() · 4d31d9ef
      Sean Christopherson authored
      Pass the emulation type to kvm_x86_ops.can_emulate_instruction() so that
      a future commit can harden KVM's SEV support to WARN on emulation
      scenarios that should never happen.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
      Message-Id: <20220120010719.711476-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 25 Jan, 2022 1 commit
  3. 20 Jan, 2022 4 commits
    • KVM: VMX: Don't do full kick when triggering posted interrupt "fails" · 0f65a9d3
      Sean Christopherson authored
      Replace the full "kick" with just the "wake" in the fallback path when
      triggering a virtual interrupt via a posted interrupt fails because the
      guest is not IN_GUEST_MODE.  If the guest transitions into guest mode
      between the check and the kick, then it's guaranteed to see the pending
      interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting
      IN_GUEST_MODE.  Kicking the guest in this case is nothing more than an
      unnecessary VM-Exit (and host IRQ).
      
      Opportunistically update comments to explain the various ordering rules
      and barriers at play.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211208015236.1616697-17-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Sean Christopherson authored
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 · 98c25ead
      Sean Christopherson authored
      Handle the switch to/from the hypervisor/software timer when a vCPU is
      blocking in common x86 instead of in VMX.  Even though VMX is the only
      user of a hypervisor timer, the logic and all functions involved are
      generic x86 (unless future CPUs do something completely different and
      implement a hypervisor timer that runs regardless of mode).
      
      Handling the switch in common x86 will allow for the elimination of the
      pre/post_block() hooks, and also lets KVM switch back to the hypervisor
      timer if and only if it was in use (without additional params).  Add a
      comment explaining why the switch cannot be deferred to kvm_sched_out()
      or kvm_vcpu_block().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Sean Christopherson authored
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
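
The ordering argument in fc4fad79 (the check has to wait until KVM_RUN because userspace may issue the state-setting ioctls in either order) can be illustrated with a minimal Python model. The class and method names are hypothetical stand-ins for the ioctls, and the string results are placeholders; how the real code reports the rejection to userspace is not modelled here.

```python
class RunCheckModel:
    def __init__(self):
        self.emulation_required = False
        self.exception_pending = False

    def set_sregs(self, invalid_guest_state):
        # Stand-in for KVM_SET_SREGS: may leave the vCPU needing emulation.
        self.emulation_required = invalid_guest_state

    def set_vcpu_events(self, exception_pending):
        # Stand-in for KVM_SET_VCPU_EVENTS: may queue an exception.
        self.exception_pending = exception_pending

    def run(self):
        # The only point where both pieces of state are final.
        if self.emulation_required and self.exception_pending:
            return "rejected"
        return "entered guest"
```

Neither setter can reject the combination on its own, since each sees only half of the state and the other half may still change before the vCPU runs.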
  4. 18 Jan, 2022 2 commits
    • KVM: x86: Making the module parameter of vPMU more common · 4732f244
      Like Xu authored
      The new module parameter to control PMU virtualization should apply
      to Intel as well as AMD, for situations where userspace is not trusted.
      If the module parameter allows PMU virtualization, there could be a
      new KVM_CAP or guest CPUID bits whereby userspace can enable/disable
      PMU virtualization on a per-VM basis.
      
      If the module parameter does not allow PMU virtualization, there
      should be no userspace override, since we have no precedent for
      authorizing that kind of override. If it's false, other counter-based
      profiling features (such as LBR including the associated CPUID bits
      if any) will not be exposed.
      
      Change its name from "pmu" to "enable_pmu" as we have temporary
      variables with the same name in our code like "struct kvm_pmu *pmu".
      
      Fixes: b1d66dad ("KVM: x86/svm: Add module param to control PMU virtualization")
      Suggested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20220111073823.21885-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Partially allow KVM_SET_CPUID{,2} after KVM_RUN · c6617c61
      Vitaly Kuznetsov authored
      Commit feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      forbade changing CPUID altogether but unfortunately this is not fully
      compatible with existing VMMs. In particular, QEMU reuses vCPU fds for
      CPU hotplug after unplug and it calls KVM_SET_CPUID2. Instead of a full ban,
      check whether the supplied CPUID data is equal to what was previously set.
      Reported-by: Igor Mammedov <imammedo@redhat.com>
      Fixes: feb627e8 ("KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN")
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220117150542.2176196-3-vkuznets@redhat.com>
      Cc: stable@vger.kernel.org
      [Do not call kvm_find_cpuid_entry repeatedly. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
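
The relaxed rule in c6617c61 amounts to "after the first KVM_RUN, accept only an identical table". A minimal Python sketch follows; the dictionary layout, the `set_cpuid2` name, and the choice of `-EINVAL` as the error value are assumptions made for illustration.

```python
import errno


def set_cpuid2(vcpu, entries):
    """Model of KVM_SET_CPUID2 for a vCPU dict with 'has_run' and 'cpuid'."""
    entries = list(entries)
    if vcpu["has_run"]:
        # After KVM_RUN, only an exact repeat of the current table is allowed.
        return 0 if entries == vcpu["cpuid"] else -errno.EINVAL
    vcpu["cpuid"] = entries
    return 0
```

A VMM that reuses a vCPU fd for hotplug and replays the same CPUID data therefore keeps working, while a real change after the vCPU has run is still refused.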
  5. 15 Jan, 2022 6 commits
    • kvm: x86: Disable interception for IA32_XFD on demand · b5274b1b
      Kevin Tian authored
      Always intercepting IA32_XFD causes non-negligible overhead when this
      register is updated frequently in the guest.
      
      Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
      with a non-zero value.
      
      Disabling WRMSR emulation implies that IA32_XFD becomes out-of-sync
      with the software states in fpstate and the per-cpu xfd cache. This
      leads to two additional changes accordingly:
      
        - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
          software states back in-sync with the MSR, before handle_exit_irqoff()
          is called.
      
        - Always trap #NM once write interception is disabled for IA32_XFD.
          The #NM exception is rare if the guest doesn't use dynamic
          features. Otherwise, there is at most one exception per guest
          task given a dynamic feature.
      
      p.s. We have confirmed that SDM is being revised to say that
      when setting IA32_XFD[18] the AMX register state is not guaranteed
      to be preserved. This clarification avoids adding mess for a creative
      guest which sets IA32_XFD[18]=1 before saving active AMX state to
      its own storage.
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
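
The on-demand interception scheme in b5274b1b can be sketched as a small Python model. All names here (`XfdInterceptModel`, `hw_xfd`, `cached_xfd`, `sync_after_vmexit`) are hypothetical; the model only tracks which accesses exit to the hypervisor and when the software copy of the MSR goes stale.

```python
class XfdInterceptModel:
    def __init__(self):
        self.intercept_xfd_write = True   # WRMSR(IA32_XFD) exits to KVM
        self.intercept_nm = False         # #NM exits to KVM
        self.hw_xfd = 0                   # value held by the MSR
        self.cached_xfd = 0               # software copy kept by the host

    def guest_wrmsr_xfd(self, value):
        """Return True if the write caused a VM-exit in this model."""
        self.hw_xfd = value
        if not self.intercept_xfd_write:
            return False                  # passthrough: software copy is stale
        self.cached_xfd = value
        if value != 0:
            # First non-zero write: stop intercepting, always trap #NM.
            self.intercept_xfd_write = False
            self.intercept_nm = True
        return True

    def sync_after_vmexit(self):
        # With passthrough enabled the MSR is the source of truth.
        if not self.intercept_xfd_write:
            self.cached_xfd = self.hw_xfd
```

The resync step after each exit is the cost of passthrough: later guest writes no longer exit, so the host has to read the value back before relying on its cached copy.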
    • kvm: x86: Add support for getting/setting expanded xstate buffer · be50b206
      Guang Zeng authored
      With KVM_CAP_XSAVE, userspace uses a hardcoded 4KB buffer to get/set
      xstate data from/to KVM. This doesn't work when dynamic xfeatures
      (e.g. AMX) are exposed to the guest as they require a larger buffer
      size.
      
      Introduce a new capability (KVM_CAP_XSAVE2). Userspace VMM gets the
      required xstate buffer size via KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
      KVM_SET_XSAVE is extended to work with both legacy and new capabilities
      by doing properly-sized memdup_user() based on the guest fpu container.
      KVM_GET_XSAVE is kept for backward compatibility. Instead,
      KVM_GET_XSAVE2 is introduced under KVM_CAP_XSAVE2 as the preferred
      interface for getting xstate buffer (4KB or larger size) from KVM
      (Link: https://lkml.org/lkml/2021/12/15/510).
      
      Also, update the api doc with the new KVM_GET_XSAVE2 ioctl.
      Signed-off-by: Guang Zeng <guang.zeng@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-19-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add XCR0 support for Intel AMX · 86aff7a4
      Jing Liu authored
      Two XCR0 bits are defined for AMX to support XSAVE mechanism. Bit 17
      is for tilecfg and bit 18 is for tiledata.
      
      The value of XCR0[18:17] is always either 00b or 11b. Also, the SDM
      recommends that only 64-bit operating systems enable Intel AMX by
      setting XCR0[18:17]. A 32-bit host kernel never sets the tile bits in
      vcpu->arch.guest_supported_xcr0.
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-16-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
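
The "00b or 11b" constraint on the two AMX bits in 86aff7a4 is easy to express as a mask check. A minimal Python sketch follows; the constant names follow the kernel's XFEATURE_MASK_* naming, while `xtile_bits_valid` is a hypothetical helper and not a KVM function.

```python
XFEATURE_MASK_XTILE_CFG = 1 << 17    # tilecfg
XFEATURE_MASK_XTILE_DATA = 1 << 18   # tiledata
XFEATURE_MASK_XTILE = XFEATURE_MASK_XTILE_CFG | XFEATURE_MASK_XTILE_DATA


def xtile_bits_valid(xcr0):
    """True if XCR0[18:17] is either both clear or both set."""
    tile_bits = xcr0 & XFEATURE_MASK_XTILE
    return tile_bits == 0 or tile_bits == XFEATURE_MASK_XTILE
```

For example, `xtile_bits_valid(0b111 | XFEATURE_MASK_XTILE)` holds, while a value that sets only bit 18 fails the check.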
    • kvm: x86: Emulate IA32_XFD_ERR for guest · 548e8365
      Jing Liu authored
      Emulate read/write to IA32_XFD_ERR MSR.
      
      Only the saved value in the guest_fpu container is touched in the
      emulation handler. Actual MSR update is handled right before entering
      the guest (with preemption disabled).
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-14-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Intercept #NM for saving IA32_XFD_ERR · ec5be88a
      Jing Liu authored
      Guest IA32_XFD_ERR is generally modified in two places:
      
        - Set by CPU when #NM is triggered;
        - Cleared by guest in its #NM handler;
      
      Intercept #NM for the first case when a nonzero value is written
      to IA32_XFD. Nonzero indicates that the guest is willing to do
      dynamic fpstate expansion for certain xfeatures, thus KVM needs to
      manage and virtualize guest XFD_ERR properly. The vcpu exception
      bitmap is updated in XFD write emulation according to guest_fpu::xfd.
      
      Save the current XFD_ERR value to the guest_fpu container in the #NM
      VM-exit handler. This must be done with interrupts disabled, otherwise
      the unsaved MSR value may be clobbered by host activity.

      The saving operation is conducted conditionally only when guest_fpu::xfd
      includes a non-zero value. Doing so also avoids a misread on a platform
      which doesn't support XFD but #NM is triggered due to L1 interception.
      
      Queueing #NM to the guest is postponed to handle_exception_nmi(). This
      goes through the nested_vmx check so a virtual vmexit is queued instead
      when #NM is triggered in L2 but L1 wants to intercept it.
      
      Restore the host value (always ZERO outside of the host #NM
      handler) before enabling interrupts.

      Restore the guest value from the guest_fpu container right before
      entering the guest (with interrupts disabled).
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Kevin Tian <kevin.tian@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-13-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add emulation for IA32_XFD · 820a6ee9
      Jing Liu authored
      Intel's eXtended Feature Disable (XFD) feature allows the software
      to dynamically adjust fpstate buffer size for XSAVE features which
      have large state.
      
      Because guest fpstate has been expanded for all possible dynamic
      xstates at KVM_SET_CPUID2, emulation of the IA32_XFD MSR is
      straightforward. For write just call fpu_update_guest_xfd() to
      update the guest fpu container once all the sanity checks are passed.
      For read simply return the cached value in the container.
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Signed-off-by: Wei Wang <wei.w.wang@intel.com>
      Signed-off-by: Jing Liu <jing2.liu@intel.com>
      Signed-off-by: Yang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-11-yang.zhong@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 07 Jan, 2022 7 commits
    • KVM: SVM: include CR3 in initial VMSA state for SEV-ES guests · 405329fc
      Michael Roth authored
      Normally guests will set up CR3 themselves, but some guests, such as
      kselftests, and potentially CONFIG_PVH guests, rely on being booted
      with paging enabled and CR3 initialized to a pre-allocated page table.
      
      Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest
      VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this
      is too late, since it will have switched over to using the VMSA page
      prior to that point, with the VMSA CR3 copied from the VMCB initial
      CR3 value: 0.
      
      Address this by sync'ing the CR3 value into the VMCB save area
      immediately when KVM_SET_SREGS* is issued so it will find its way into
      the initial VMSA.
      Suggested-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Michael Roth <michael.roth@amd.com>
      Message-Id: <20211216171358.61140-10-michael.roth@amd.com>
      [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the
       new hook. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Fix wall clock writes in Xen shared_info not to mark page dirty · 55749769
      David Woodhouse authored
      When dirty ring logging is enabled, any dirty logging without an active
      vCPU context will cause a kernel oops. But we've already declared that
      the shared_info page doesn't get dirty tracking anyway, since it would
      be kind of insane to mark it dirty every time we deliver an event channel
      interrupt. Userspace is supposed to just assume it's always dirty any
      time a vCPU can run or event channels are routed.
      
      So stop using the generic kvm_write_wall_clock() and just write directly
      through the gfn_to_pfn_cache that we already have set up.
      
      We can make kvm_write_wall_clock() static in x86.c again now, but let's
      not remove the 'sec_hi_ofs' argument even though it's not used yet. At
      some point we *will* want to use that for KVM guests too.
      
      Fixes: 629b5348 ("KVM: x86/xen: update wallclock region")
      Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-6-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/xen: Add KVM_IRQ_ROUTING_XEN_EVTCHN and event channel delivery · 14243b38
      David Woodhouse authored
      This adds basic support for delivering 2-level event channels to a guest.
      
      Initially, it only supports delivery via the IRQ routing table, triggered
      by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast()
      function which will use the pre-mapped shared_info page if it already
      exists and is still valid, while the slow path through the irqfd_inject
      workqueue will remap the shared_info page if necessary.
      
      It sets the bits in the shared_info page but not the vcpu_info; that is
      deferred to __kvm_xen_has_interrupt() which raises the vector to the
      appropriate vCPU.
      
      Add a 'verbose' mode to xen_shinfo_test while adding test cases for this.
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <20211210163625.2886-5-dwmw2@infradead.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Update vPMCs when retiring branch instructions · 018d70ff
      Eric Hankland authored
      When KVM retires a guest branch instruction through emulation,
      increment any vPMCs that are configured to monitor "branch
      instructions retired," and update the sample period of those counters
      so that they will overflow at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Moved/consolidated the calls to kvm_pmu_trigger_event() in the
          emulation of VMLAUNCH/VMRESUME to accommodate the evolution of
          that code.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Message-Id: <20211130074221.93635-7-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Update vPMCs when retiring instructions · 9cd803d4
      Eric Hankland authored
      When KVM retires a guest instruction through emulation, increment any
      vPMCs that are configured to monitor "instructions retired," and
      update the sample period of those counters so that they will overflow
      at the right time.
      Signed-off-by: Eric Hankland <ehankland@google.com>
      [jmattson:
        - Split the code to increment "branch instructions retired" into a
          separate commit.
        - Added 'static' to kvm_pmu_incr_counter() definition.
        - Modified kvm_pmu_incr_counter() to check pmc->perf_event->state ==
          PERF_EVENT_STATE_ACTIVE.
      ]
      Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [likexu:
        - Drop checks for pmc->perf_event or event state or event type
        - Increase a counter once its umask bits and the first 8 select bits are matched
        - Rewrite kvm_pmu_incr_counter() with a less invasive approach to the host perf;
        - Rename kvm_pmu_record_event to kvm_pmu_trigger_event;
        - Add counter enable and CPL check for kvm_pmu_trigger_event();
      ]
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211130074221.93635-6-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
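
The behaviour added by the two vPMC commits above (018d70ff and 9cd803d4) can be sketched as follows. This is a simplified Python model: `VirtualCounter`, `trigger_event`, and `emulate_instruction` are hypothetical names, the event codes 0xC0 and 0xC4 are assumed to be the architectural "instructions retired" and "branch instructions retired" selects, and the unit mask, CPL filtering, and sample-period handling mentioned in the commit notes are left out.

```python
INSTRUCTIONS_RETIRED = 0xC0          # assumed event select
BRANCH_INSTRUCTIONS_RETIRED = 0xC4   # assumed event select


class VirtualCounter:
    def __init__(self, event_select, enabled=True, width=48):
        self.event_select = event_select
        self.enabled = enabled
        self.mask = (1 << width) - 1
        self.value = 0
        self.overflowed = False


def trigger_event(counters, event_select):
    """Bump every enabled counter that is programmed for this event."""
    for pmc in counters:
        if pmc.enabled and pmc.event_select == event_select:
            pmc.value = (pmc.value + 1) & pmc.mask
            if pmc.value == 0:
                pmc.overflowed = True    # counter wrapped around


def emulate_instruction(counters, is_branch):
    """Account for one instruction retired through emulation."""
    trigger_event(counters, INSTRUCTIONS_RETIRED)
    if is_branch:
        trigger_event(counters, BRANCH_INSTRUCTIONS_RETIRED)
```

Without this accounting, instructions that the hypervisor emulates on the guest's behalf would be invisible to guest counters, so counts would drift from what the guest would observe on bare metal.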
    • KVM: x86/mmu: Reconstruct shadow page root if the guest PDPTEs are changed · 6b123c3a
      Lai Jiangshan authored
      For shadow paging, the page table needs to be reconstructed before the
      coming VMENTER if the guest PDPTEs are changed.
      
      But not all paths that call load_pdptrs() will cause the page tables to be
      reconstructed. Normally, kvm_mmu_reset_context() and kvm_mmu_free_roots()
      are used to launch later reconstruction.
      
      The commit d81135a5 ("KVM: x86: do not reset mmu if CR0.CD and
      CR0.NW are changed") skips kvm_mmu_reset_context() after load_pdptrs()
      when changing CR0.CD and CR0.NW.
      
      The commit 21823fbd ("KVM: x86: Invalidate all PGDs for the current
      PCID on MOV CR3 w/ flush") skips kvm_mmu_free_roots() after
      load_pdptrs() when rewriting the CR3 with the same value.
      
      The commit a91a7c70 ("KVM: X86: Don't reset mmu context when
      toggling X86_CR4_PGE") skips kvm_mmu_reset_context() after
      load_pdptrs() when changing CR4.PGE.
      
      Guests like Linux keep the PDPTEs unchanged for every instance of
      pagetable, so this missing reconstruction causes no problem for Linux
      guests.

      Fixes: d81135a5 ("KVM: x86: do not reset mmu if CR0.CD and CR0.NW are changed")
      Fixes: 21823fbd ("KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush")
      Fixes: a91a7c70 ("KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE")
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20211216021938.11752-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
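
The idea behind 6b123c3a, detecting a PDPTE change at load time so that no caller can skip the rebuild, can be sketched in Python. `ShadowMmuModel` and its fields are hypothetical; marking `root_valid = False` stands in for freeing the current shadow root so it is rebuilt before the next VM entry.

```python
class ShadowMmuModel:
    def __init__(self, tdp_enabled=False):
        self.tdp_enabled = tdp_enabled   # True when hardware paging is used
        self.pdptrs = [0, 0, 0, 0]
        self.root_valid = True

    def load_pdptrs(self, new_pdptrs):
        new_pdptrs = list(new_pdptrs)
        # Only shadow paging depends on the cached PDPTEs for its root.
        if not self.tdp_enabled and new_pdptrs != self.pdptrs:
            self.root_valid = False      # force reconstruction later
        self.pdptrs = new_pdptrs
```

Placing the comparison inside the load path covers the CR0.CD/NW, same-value CR3, and CR4.PGE cases listed in the commit message without touching each caller.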
    • Revert "KVM: X86: Update mmu->pdptrs only when it is changed" · 46cbc040
      Paolo Bonzini authored
      This reverts commit 24cd19a2.
      Sean Christopherson reports:
      
      "Commit 24cd19a2 ('KVM: X86: Update mmu->pdptrs only when it is
      changed') breaks nested VMs with EPT in L0 and PAE shadow paging in L2.
      Reproducing is trivial, just disable EPT in L1 and run a VM.  I haven't
      investigating how it breaks things."
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  7. 20 Dec, 2021 3 commits
  8. 10 Dec, 2021 2 commits
  9. 09 Dec, 2021 1 commit
  10. 08 Dec, 2021 9 commits