1. 20 1月, 2022 5 次提交
    • S
      KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks · c3e8abf0
      Sean Christopherson 提交于
      Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-10-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c3e8abf0
    • S
      KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 · 98c25ead
      Sean Christopherson 提交于
      Handle the switch to/from the hypervisor/software timer when a vCPU is
      blocking in common x86 instead of in VMX.  Even though VMX is the only
      user of a hypervisor timer, the logic and all functions involved are
      generic x86 (unless future CPUs do something completely different and
      implement a hypervisor timer that runs regardless of mode).
      
      Handling the switch in common x86 will allow for the elimination of the
      pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
      timer if and only if it was in use (without additional params).  Add a
      comment explaining why the switch cannot be deferred to kvm_sched_out()
      or kvm_vcpu_block().
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-8-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      98c25ead
    • S
      KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx · 12a8eee5
      Sean Christopherson 提交于
      Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and
      rename the list and all associated variables to clarify that it tracks
      the set of vCPU that need to be poked on a posted interrupt to the wakeup
      vector.  The list is not used to track _all_ vCPUs that are blocking, and
      the term "blocked" can be misleading as it may refer to a blocking
      condition in the host or the guest, where as the PI wakeup case is
      specifically for the vCPUs that are actively blocking from within the
      guest.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-7-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      12a8eee5
    • S
      KVM: VMX: Handle PI descriptor updates during vcpu_put/load · d76fb406
      Sean Christopherson 提交于
      Move the posted interrupt pre/post_block logic into vcpu_put/load
      respectively, using the kvm_vcpu_is_blocking() to determining whether or
      not the wakeup handler needs to be set (and unset).  This avoids updating
      the PI descriptor if halt-polling is successful, reduces the number of
      touchpoints for updating the descriptor, and eliminates the confusing
      behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
      is scheduled back in after preemption.
      
      The downside is that KVM will do the PID update twice if the vCPU is
      preempted after prepare_to_rcuwait() but before schedule(), but that's a
      rare case (and non-existent on !PREEMPT kernels).
      
      The notable wart is the need to send a self-IPI on the wakeup vector if
      an outstanding notification is pending after configuring the wakeup
      vector.  Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
      but the scheduler doesn't support waking a task from its preemption
      notifier callback, i.e. while the task is right in the middle of
      being scheduled out.
      
      Note, setting the wakeup vector before halt-polling is not necessary:
      once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events()
      will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(),
      apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and
      terminate the polling.
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-5-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d76fb406
    • S
      KVM: VMX: Reject KVM_RUN if emulation is required with pending exception · fc4fad79
      Sean Christopherson 提交于
      Reject KVM_RUN if emulation is required (because VMX is running without
      unrestricted guest) and an exception is pending, as KVM doesn't support
      emulating exceptions except when emulating real mode via vm86.  The vCPU
      is hosed either way, but letting KVM_RUN proceed triggers a WARN due to
      the impossible condition.  Alternatively, the WARN could be removed, but
      then userspace and/or KVM bugs would result in the vCPU silently running
      in a bad state, which isn't very friendly to users.
      
      Originally, the bug was hit by syzkaller with a nested guest as that
      doesn't require kvm_intel.unrestricted_guest=0.  That particular flavor
      is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize
      TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to
      trigger the WARN with a non-nested guest, and userspace can likely force
      bad state via ioctls() for a nested guest as well.
      
      Checking for the impossible condition needs to be deferred until KVM_RUN
      because KVM can't force specific ordering between ioctls.  E.g. clearing
      exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting
      it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with
      emulation_required would prevent userspace from queuing an exception and
      then stuffing sregs.  Note, if KVM were to try and detect/prevent the
      condition prior to KVM_RUN, handle_invalid_guest_state() and/or
      handle_emulation_failure() would need to be modified to clear the pending
      exception prior to exiting to userspace.
      
       ------------[ cut here ]------------
       WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel]
       CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279
       Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014
       RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel]
       Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00
       RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202
       RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006
       RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000
       RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000
       R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006
       R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000
       FS:  00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0
       Call Trace:
        kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm]
        kvm_vcpu_ioctl+0x279/0x690 [kvm]
        __x64_sys_ioctl+0x83/0xb0
        do_syscall_64+0x3b/0xc0
        entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211228232437.1875318-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fc4fad79
  2. 15 1月, 2022 3 次提交
    • K
      kvm: x86: Disable interception for IA32_XFD on demand · b5274b1b
      Kevin Tian 提交于
      Always intercepting IA32_XFD causes non-negligible overhead when this
      register is updated frequently in the guest.
      
      Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
      with a non-zero value.
      
      Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync
      with the software states in fpstate and the per-cpu xfd cache. This
      leads to two additional changes accordingly:
      
        - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring
          software states back in-sync with the MSR, before handle_exit_irqoff()
          is called.
      
        - Always trap #NM once write interception is disabled for IA32_XFD.
          The #NM exception is rare if the guest doesn't use dynamic
          features. Otherwise, there is at most one exception per guest
          task given a dynamic feature.
      
      p.s. We have confirmed that SDM is being revised to say that
      when setting IA32_XFD[18] the AMX register state is not guaranteed
      to be preserved. This clarification avoids adding mess for a creative
      guest which sets IA32_XFD[18]=1 before saving active AMX state to
      its own storage.
      Signed-off-by: NKevin Tian <kevin.tian@intel.com>
      Signed-off-by: NJing Liu <jing2.liu@intel.com>
      Signed-off-by: NYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-22-yang.zhong@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b5274b1b
    • J
      kvm: x86: Disable RDMSR interception of IA32_XFD_ERR · 61f20813
      Jing Liu 提交于
      This saves one unnecessary VM-exit in guest #NM handler, given that the
      MSR is already restored with the guest value before the guest is resumed.
      Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NJing Liu <jing2.liu@intel.com>
      Signed-off-by: NYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-15-yang.zhong@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      61f20813
    • J
      kvm: x86: Intercept #NM for saving IA32_XFD_ERR · ec5be88a
      Jing Liu 提交于
      Guest IA32_XFD_ERR is generally modified in two places:
      
        - Set by CPU when #NM is triggered;
        - Cleared by guest in its #NM handler;
      
      Intercept #NM for the first case when a nonzero value is written
      to IA32_XFD. Nonzero indicates that the guest is willing to do
      dynamic fpstate expansion for certain xfeatures, thus KVM needs to
      manage and virtualize guest XFD_ERR properly. The vcpu exception
      bitmap is updated in XFD write emulation according to guest_fpu::xfd.
      
      Save the current XFD_ERR value to the guest_fpu container in the #NM
      VM-exit handler. This must be done with interrupt disabled, otherwise
      the unsaved MSR value may be clobbered by host activity.
      
      The saving operation is conducted conditionally only when guest_fpu:xfd
      includes a non-zero value. Doing so also avoids misread on a platform
      which doesn't support XFD but #NM is triggered due to L1 interception.
      
      Queueing #NM to the guest is postponed to handle_exception_nmi(). This
      goes through the nested_vmx check so a virtual vmexit is queued instead
      when #NM is triggered in L2 but L1 wants to intercept it.
      
      Restore the host value (always ZERO outside of the host #NM
      handler) before enabling interrupt.
      
      Restore the guest value from the guest_fpu container right before
      entering the guest (with interrupt disabled).
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NJing Liu <jing2.liu@intel.com>
      Signed-off-by: NKevin Tian <kevin.tian@intel.com>
      Signed-off-by: NYang Zhong <yang.zhong@intel.com>
      Message-Id: <20220105123532.12586-13-yang.zhong@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ec5be88a
  3. 07 1月, 2022 3 次提交
  4. 22 12月, 2021 1 次提交
    • S
      KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU · fdba608f
      Sean Christopherson 提交于
      Drop a check that guards triggering a posted interrupt on the currently
      running vCPU, and more importantly guards waking the target vCPU if
      triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
      If a vIRQ is delivered from asynchronous context, the target vCPU can be
      the currently running vCPU and can also be blocking, in which case
      skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to
      be a wake event for the vCPU.
      
      The "do nothing" logic when "vcpu == running_vcpu" mostly works only
      because the majority of calls to ->deliver_posted_interrupt(), especially
      when using posted interrupts, come from synchronous KVM context.  But if
      a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ
      and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to
      use posted interrupts, wake events from the device will be delivered to
      KVM from IRQ context, e.g.
      
        vfio_msihandler()
        |
        |-> eventfd_signal()
            |
            |-> ...
                |
                |->  irqfd_wakeup()
                     |
                     |->kvm_arch_set_irq_inatomic()
                        |
                        |-> kvm_irq_delivery_to_apic_fast()
                            |
                            |-> kvm_apic_set_irq()
      
      This also aligns the non-nested and nested usage of triggering posted
      interrupts, and will allow for additional cleanups.
      
      Fixes: 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath")
      Cc: stable@vger.kernel.org
      Reported-by: NLongpeng (Mike) <longpeng2@huawei.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211208015236.1616697-18-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fdba608f
  5. 20 12月, 2021 3 次提交
    • S
      KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required · cd0e615c
      Sean Christopherson 提交于
      Synthesize a triple fault if L2 guest state is invalid at the time of
      VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs
      guest state via ioctls(), e.g. KVM_SET_SREGS.  KVM should never emulate
      invalid guest state, since from L1's perspective, it's architecturally
      impossible for L2 to have invalid state while L2 is running in hardware.
      E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit
      or #GP.
      
      Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that
      can trigger this scenario, as nested VM-Enter correctly rejects any
      attempt to enter L2 with invalid state.
      
      RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and
      behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits
      via RSM results in shutdown, i.e. there is precedent for KVM's behavior.
      Following AMD's SMRAM layout is important as AMD's layout saves/restores
      the descriptor cache information, including CS.RPL and SS.RPL, and also
      defines all the fields relevant to invalid guest state as read-only, i.e.
      so long as the vCPU had valid state before the SMI, which is guaranteed
      for L2, RSM will generate valid state unless SMRAM was modified.  Intel's
      layout saves/restores only the selector, which means that scenarios where
      the selector and cached RPL don't match, e.g. conforming code segments,
      would yield invalid guest state.  Intel CPUs fudge around this issued by
      stuffing SS.RPL and CS.RPL on RSM.  Per Intel's SDM on the "Default
      Treatment of RSM", paraphrasing for brevity:
      
        IF internal storage indicates that the [CPU was post-VMXON]
        THEN
           enter VMX operation (root or non-root);
           restore VMX-critical state as defined in Section 34.14.1;
           set to their fixed values any bits in CR0 and CR4 whose values must
           be fixed in VMX operation [unless coming from an unrestricted guest];
           IF RFLAGS.VM = 0 AND (in VMX root operation OR the
              “unrestricted guest” VM-execution control is 0)
           THEN
             CS.RPL := SS.DPL;
             SS.RPL := SS.DPL;
           FI;
           restore current VMCS pointer;
        FI;
      
      Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM
      will sythesize TRIPLE_FAULT in this scenario.  KVM's behavior is allowed
      as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the
      only way for CR0 and/or CR4 to have illegal values is if they were
      modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map"
      section states "modifying these registers will result in unpredictable
      behavior".
      
      KVM's ioctl() behavior is less straightforward.  Because KVM allows
      ioctls() to be executed in any order, rejecting an ioctl() if it would
      result in invalid L2 guest state is not an option as KVM cannot know if
      a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or
      drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE.  Ideally, KVM would
      reject KVM_RUN if L2 contained invalid guest state, but that carries the
      risk of a false positive, e.g. if RSM loaded invalid guest state and KVM
      exited to userspace.  Setting a flag/request to detect such a scenario is
      undesirable because (a) it's extremely unlikely to add value to KVM as a
      whole, and (b) KVM would need to consider ioctl() interactions with such
      a flag, e.g. if userspace migrated the vCPU while the flag were set.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-3-seanjc@google.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cd0e615c
    • S
      KVM: VMX: Always clear vmx->fail on emulation_required · a80dfc02
      Sean Christopherson 提交于
      Revert a relatively recent change that set vmx->fail if the vCPU is in L2
      and emulation_required is true, as that behavior is completely bogus.
      Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong:
      
        (a) it's impossible to have both a VM-Fail and VM-Exit
        (b) vmcs.EXIT_REASON is not modified on VM-Fail
        (c) emulation_required refers to guest state and guest state checks are
            always VM-Exits, not VM-Fails.
      
      For KVM specifically, emulation_required is handled before nested exits
      in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect,
      i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored.
      Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit()
      firing when tearing down the VM as KVM never expects vmx->fail to be set
      when L2 is active, KVM always reflects those errors into L1.
      
        ------------[ cut here ]------------
        WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548
                                      nested_vmx_vmexit+0x16bd/0x17e0
                                      arch/x86/kvm/vmx/nested.c:4547
        Modules linked in:
        CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547
        Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80
        Call Trace:
         vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline]
         nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330
         vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799
         kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989
         kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441
         kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline]
         kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545
         kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline]
         kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220
         kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489
         __fput+0x3fc/0x870 fs/file_table.c:280
         task_work_run+0x146/0x1c0 kernel/task_work.c:164
         exit_task_work include/linux/task_work.h:32 [inline]
         do_exit+0x705/0x24f0 kernel/exit.c:832
         do_group_exit+0x168/0x2d0 kernel/exit.c:929
         get_signal+0x1740/0x2120 kernel/signal.c:2852
         arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868
         handle_signal_work kernel/entry/common.c:148 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
         exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207
         __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
         syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300
         do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fixes: c8607e4a ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry")
      Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211207193006.120997-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a80dfc02
    • M
      KVM: x86: Always set kvm_run->if_flag · c5063551
      Marc Orr 提交于
      The kvm_run struct's if_flag is a part of the userspace/kernel API. The
      SEV-ES patches failed to set this flag because it's no longer needed by
      QEMU (according to the comment in the source code). However, other
      hypervisors may make use of this flag. Therefore, set the flag for
      guests with encrypted registers (i.e., with guest_state_protected set).
      
      Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES")
      Signed-off-by: NMarc Orr <marcorr@google.com>
      Message-Id: <20211209155257.128747-1-marcorr@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
      c5063551
  6. 09 12月, 2021 1 次提交
  7. 08 12月, 2021 16 次提交
  8. 02 12月, 2021 1 次提交
  9. 30 11月, 2021 2 次提交
  10. 26 11月, 2021 1 次提交
  11. 11 11月, 2021 4 次提交
    • V
      KVM: Move INVPCID type check from vmx and svm to the common kvm_handle_invpcid() · 796c83c5
      Vipin Sharma 提交于
      Handle #GP on INVPCID due to an invalid type in the common switch
      statement instead of relying on the callers (VMX and SVM) to manually
      validate the type.
      
      Unlike INVVPID and INVEPT, INVPCID is not explicitly documented to check
      the type before reading the operand from memory, so deferring the
      type validity check until after that point is architecturally allowed.
      Signed-off-by: NVipin Sharma <vipinsh@google.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211109174426.2350547-3-vipinsh@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      796c83c5
    • V
      KVM: VMX: Add a helper function to retrieve the GPR index for INVPCID, INVVPID, and INVEPT · 329bd56c
      Vipin Sharma 提交于
      handle_invept(), handle_invvpid(), handle_invpcid() read the same reg2
      field in vmcs.VMX_INSTRUCTION_INFO to get the index of the GPR that
      holds the invalidation type. Add a helper to retrieve reg2 from VMX
      instruction info to consolidate and document the shift+mask magic.
      Signed-off-by: NVipin Sharma <vipinsh@google.com>
      Reviewed-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211109174426.2350547-2-vipinsh@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      329bd56c
    • S
      KVM: nVMX: Handle dynamic MSR intercept toggling · 67f4b996
      Sean Christopherson 提交于
      Always check vmcs01's MSR bitmap when merging L0 and L1 bitmaps for L2,
      and always update the relevant bits in vmcs02.  This fixes two distinct,
      but intertwined bugs related to dynamic MSR bitmap modifications.
      
      The first issue is that KVM fails to enable MSR interception in vmcs02
      for the FS/GS base MSRs if L1 first runs L2 with interception disabled,
      and later enables interception.
      
      The second issue is that KVM fails to honor userspace MSR filtering when
      preparing vmcs02.
      
      Fix both issues simultaneous as fixing only one of the issues (doesn't
      matter which) would create a mess that no one should have to bisect.
      Fixing only the first bug would exacerbate the MSR filtering issue as
      userspace would see inconsistent behavior depending on the whims of L1.
      Fixing only the second bug (MSR filtering) effectively requires fixing
      the first, as the nVMX code only knows how to transition vmcs02's
      bitmap from 1->0.
      
      Move the various accessor/mutators that are currently buried in vmx.c
      into vmx.h so that they can be shared by the nested code.
      
      Fixes: 1a155254 ("KVM: x86: Introduce MSR filtering")
      Fixes: d69129b4 ("KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible")
      Cc: stable@vger.kernel.org
      Cc: Alexander Graf <graf@amazon.com>
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-3-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      67f4b996
    • S
      KVM: nVMX: Query current VMCS when determining if MSR bitmaps are in use · 7dfbc624
      Sean Christopherson 提交于
      Check the current VMCS controls to determine if an MSR write will be
      intercepted due to MSR bitmaps being disabled.  In the nested VMX case,
      KVM will disable MSR bitmaps in vmcs02 if they're disabled in vmcs12 or
      if KVM can't map L1's bitmaps for whatever reason.
      
      Note, the bad behavior is relatively benign in the current code base as
      KVM sets all bits in vmcs02's MSR bitmap by default, clears bits if and
      only if L0 KVM also disables interception of an MSR, and only uses the
      buggy helper for MSR_IA32_SPEC_CTRL.  Because KVM explicitly tests WRMSR
      before disabling interception of MSR_IA32_SPEC_CTRL, the flawed check
      will only result in KVM reading MSR_IA32_SPEC_CTRL from hardware when it
      isn't strictly necessary.
      
      Tag the fix for stable in case a future fix wants to use
      msr_write_intercepted(), in which case a buggy implementation in older
      kernels could prove subtly problematic.
      
      Fixes: d28b387f ("KVM/VMX: Allow direct access to MSR_IA32_SPEC_CTRL")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <seanjc@google.com>
      Message-Id: <20211109013047.2041518-2-seanjc@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7dfbc624