1. 23 Feb 2020, 1 commit
    • KVM: nVMX: Emulate MTF when performing instruction emulation · 5ef8acbd
      Authored by Oliver Upton
      Since commit 5f3d45e7 ("kvm/x86: add support for
      MONITOR_TRAP_FLAG"), KVM has allowed an L1 guest to use the monitor trap
      flag processor-based execution control for its L2 guest. KVM simply
      forwards any MTF VM-exits to the L1 guest, which works for normal
      instruction execution.
      
      However, when KVM needs to emulate an instruction on behalf of an L2
      guest, the monitor trap flag is not emulated. Add the necessary logic
      to kvm_skip_emulated_instruction() to synthesize an MTF VM-exit to L1
      upon instruction emulation for L2 (see the sketch after this entry).
      
      Fixes: 5f3d45e7 ("kvm/x86: add support for MONITOR_TRAP_FLAG")
      Signed-off-by: Oliver Upton <oupton@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
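
      [A minimal sketch of the idea; the helper name, the mtf_pending field,
      and nested_cpu_has_mtf() are illustrative assumptions rather than the
      exact code of the patch:]

        /*
         * After emulating (and skipping) an instruction on behalf of L2,
         * record that an MTF VM-exit must be synthesized for L1 if vmcs12
         * enables the monitor trap flag execution control.
         */
        static void sketch_update_emulated_instruction(struct kvm_vcpu *vcpu)
        {
                struct vcpu_vmx *vmx = to_vmx(vcpu);

                vmx->nested.mtf_pending = is_guest_mode(vcpu) &&
                                          nested_cpu_has_mtf(get_vmcs12(vcpu));
        }

      The pending flag would then be consumed by the nested event-injection
      path to deliver an EXIT_REASON_MONITOR_TRAP_FLAG exit to L1.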
  2. 14 Jan 2020, 1 commit
    • x86/msr-index: Clean up bit defines for IA32_FEATURE_CONTROL MSR · 32ad73db
      Authored by Sean Christopherson
      As pointed out by Boris, the defines for bits in IA32_FEATURE_CONTROL
      are quite a mouthful, especially the VMX bits which must differentiate
      between enabling VMX inside and outside SMX (TXT) operation.  Rename the
      MSR and its bit defines to abbreviate FEATURE_CONTROL as FEAT_CTL to
      make them a little friendlier on the eyes.
      
      Arguably, the MSR itself should keep the full IA32_FEATURE_CONTROL name
      to match Intel's SDM, but a future patch will add a dedicated Kconfig,
      file and functions for the MSR. Using the full name for those assets is
      rather unwieldy, so bite the bullet and use IA32_FEAT_CTL so that its
      nomenclature is consistent throughout the kernel.
      
      Opportunistically, fix a few other annoyances with the defines (the
      renamed defines are sketched after this entry):
      
        - Relocate the bit defines so that they immediately follow the MSR
          define, e.g. aren't mistaken as belonging to MISC_FEATURE_CONTROL.
        - Add whitespace around the block of feature control defines to make
          it clear they're all related.
        - Use BIT() instead of manually encoding the bit shift.
        - Use "VMX" instead of "VMXON" to match the SDM.
        - Append "_ENABLED" to the LMCE (Local Machine Check Exception) bit to
          be consistent with the kernel's verbiage used for all other feature
          control bits.  Note, the SDM refers to the LMCE bit as LMCE_ON,
          likely to differentiate it from IA32_MCG_EXT_CTL.LMCE_EN.  Ignore
          the (literal) one-off usage of _ON, the SDM is simply "wrong".
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20191221044513.21680-2-sean.j.christopherson@intel.com
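
      [For reference, a sketch of the shape the renamed defines take; bit
      positions follow the SDM's IA32_FEATURE_CONTROL layout, and the list
      is illustrative rather than exhaustive:]

        #define MSR_IA32_FEAT_CTL                        0x0000003a

        #define FEAT_CTL_LOCKED                          BIT(0)
        #define FEAT_CTL_VMX_ENABLED_INSIDE_SMX          BIT(1)
        #define FEAT_CTL_VMX_ENABLED_OUTSIDE_SMX         BIT(2)
        #define FEAT_CTL_LMCE_ENABLED                    BIT(20)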
  3. 04 Dec 2019, 1 commit
  4. 15 Nov 2019, 3 commits
    • KVM: nVMX: Add support for capturing highest observable L2 TSC · 662f1d1d
      Authored by Aaron Lewis
      The L1 hypervisor may include the IA32_TIME_STAMP_COUNTER MSR in the
      vmcs12 MSR VM-exit MSR-store area as a way of determining the highest
      TSC value that might have been observed by L2 prior to VM-exit. The
      current implementation does not capture a very tight bound on this
      value.  To tighten the bound, add the IA32_TIME_STAMP_COUNTER MSR to the
      vmcs02 VM-exit MSR-store area whenever it appears in the vmcs12 VM-exit
      MSR-store area.  When L0 processes the vmcs12 VM-exit MSR-store area
      during the emulation of an L2->L1 VM-exit, special-case the
      IA32_TIME_STAMP_COUNTER MSR, using the value stored in the vmcs02
      VM-exit MSR-store area to derive the value to be stored in the vmcs12
      VM-exit MSR-store area (see the sketch after this entry).
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
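
      [A sketch of the idea with invented helper names; the real plumbing
      goes through the vmcs02 MSR-autostore machinery:]

        /* When preparing vmcs02: mirror a vmcs12 TSC store-list entry. */
        if (vmcs12_exit_msr_store_has(vcpu, MSR_IA32_TSC))
                vmcs02_exit_msr_store_add(vcpu, MSR_IA32_TSC);

        /*
         * When emulating the L2->L1 VM-exit: the vmcs02 slot holds the TSC
         * value captured by hardware at L2 VM-exit, and is used to derive
         * the value written into the corresponding vmcs12 slot.
         */
        u64 l2_tsc = vmcs02_exit_msr_store_read(vcpu, MSR_IA32_TSC);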
    • kvm: vmx: Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS · 7cfe0526
      Authored by Aaron Lewis
      Rename NR_AUTOLOAD_MSRS to NR_LOADSTORE_MSRS.  This is needed because
      a future patch will add an MSR-autostore area, after which the name
      AUTOLOAD would no longer make sense.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Update vmcs01 TPR_THRESHOLD if L2 changed L1 TPR · 02d496cf
      Authored by Liran Alon
      When L1 does not use TPR-Shadow to run L2, L0 configures vmcs02
      without TPR-Shadow and installs intercepts on CR8 accesses (load and
      store).
      
      If L1 does not intercept L2 CR8 accesses, L0's intercepts on those
      accesses will emulate loads/stores on L1's LAPIC TPR. If, in this
      case, L2 lowers the TPR such that there is now an injectable interrupt
      to L1, apic_update_ppr() will request a KVM_REQ_EVENT, which will
      trigger a call to update_cr8_intercept() to update TPR-Threshold to
      the highest pending IRR priority.
      
      However, this update to TPR-Threshold is done while the active vmcs is
      vmcs02 instead of vmcs01. Thus, when L0 later emulates an exit from
      L2 to L1, L1 will still run with a high TPR-Threshold. This causes
      every VM-entry to L1 to immediately exit on TPR_BELOW_THRESHOLD, and
      it continues to do so until some condition causes KVM_REQ_EVENT to be
      set. (Note that the TPR_BELOW_THRESHOLD exit handler does not set
      KVM_REQ_EVENT until apic_update_ppr() notices a new injectable
      interrupt for the PPR.)
      
      To fix this issue, change update_cr8_intercept() such that, if L2
      lowers L1's TPR in a way that requires lowering L1's TPR-Threshold,
      the update to TPR-Threshold is saved and applied to vmcs01 when L0
      emulates an exit from L2 to L1 (see the sketch after this entry).
      Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
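
      [A simplified sketch of the fix; the name of the cached field is
      illustrative:]

        /* update_cr8_intercept(): defer the write while vmcs02 is active. */
        if (is_guest_mode(vcpu))
                to_vmx(vcpu)->nested.l1_tpr_threshold = tpr_threshold;
        else
                vmcs_write32(TPR_THRESHOLD, tpr_threshold);

        /* Later, while emulating the L2->L1 VM-exit (vmcs01 now loaded): */
        if (vmx->nested.l1_tpr_threshold != -1)
                vmcs_write32(TPR_THRESHOLD, vmx->nested.l1_tpr_threshold);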
  5. 12 Nov 2019, 2 commits
    • KVM: VMX: Introduce pi_is_pir_empty() helper · 29881b6e
      Authored by Joao Martins
      Streamline the PID.PIR check and change its call sites to use the
      newly added helper (see the sketch after this entry).
      Suggested-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
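
      [A plausible shape for the helper, a thin wrapper around
      bitmap_empty(), shown with an illustrative call site:]

        static bool pi_is_pir_empty(struct pi_desc *pi_desc)
        {
                return bitmap_empty((unsigned long *)pi_desc->pir, NR_VECTORS);
        }

        /* Call site: the vCPU has work if ON is set or PIR is non-empty. */
        return pi_test_on(pi_desc) || !pi_is_pir_empty(pi_desc);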
    • KVM: VMX: Do not change PID.NDST when loading a blocked vCPU · 132194ff
      Authored by Joao Martins
      When a vCPU enters the block phase, pi_pre_block() inserts the vCPU
      into a per-pCPU linked list of all vCPUs that are blocked on this
      pCPU. Afterwards, it changes PID.NV to POSTED_INTR_WAKEUP_VECTOR,
      whose handler (wakeup_handler()) is responsible for kicking
      (unblocking) any vCPU on that linked list that now has pending posted
      interrupts.
      
      While the vCPU is blocked (in kvm_vcpu_block()), it may be preempted,
      which will cause vmx_vcpu_pi_put() to set PID.SN.  If the vCPU is
      later scheduled to run on a different pCPU, vmx_vcpu_pi_load() will
      clear PID.SN but will also *overwrite PID.NDST to point at this
      different pCPU*, instead of keeping it pointed at the original pCPU on
      which the vCPU entered the block phase.
      
      This causes an issue because, when a posted interrupt is delivered,
      wakeup_handler() will be executed but will fail to find the blocked
      vCPU on its per-pCPU linked list of blocked vCPUs, since the vCPU was
      placed on a *different* per-pCPU linked list, i.e. that of the
      original pCPU on which it entered the block phase.
      
      The regression was introduced by commit c112b5f5 ("KVM: x86:
      Recompute PID.ON when clearing PID.SN"). Therefore, partially revert
      it and reintroduce the condition in vmx_vcpu_pi_load() that avoids
      changing PID.NDST when loading a blocked vCPU (see the sketch after
      this entry).
      
      Fixes: c112b5f5 ("KVM: x86: Recompute PID.ON when clearing PID.SN")
      Tested-by: Nathan Ni <nathan.ni@oracle.com>
      Co-developed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
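
      [A sketch of the intent of the reintroduced guard; the predicate is
      illustrative, not the exact code:]

        /*
         * While PID.NV is the wakeup vector, the vCPU is on the blocked-vCPU
         * list of the pCPU it blocked on, and PID.NDST must keep pointing at
         * that pCPU so wakeup_handler() can find it there.
         */
        static bool sketch_may_rewrite_ndst(const struct pi_desc *pi_desc)
        {
                return pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR;
        }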
  6. 24 Sep 2019, 1 commit
  7. 11 Sep 2019, 1 commit
  8. 18 Jun 2019, 15 commits
  9. 25 May 2019, 1 commit
  10. 01 May 2019, 5 commits
  11. 16 Apr 2019, 1 commit
  12. 29 Mar 2019, 1 commit
  13. 21 Feb 2019, 3 commits
    • kvm: vmx: Add memcg accounting to KVM allocations · 41836839
      Authored by Ben Gardon
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't tied to the process, the OOM killer will not know
      that killing the process will free the associated kernel memory.
      Add the __GFP_ACCOUNT flag to many of the allocations that are not yet
      being charged to the VM process's cgroup (see the sketch after this
      entry).
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
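
      [An illustration of the pattern; the allocation site and size are made
      up, only the flag change matters:]

        /* Unaccounted: the OOM killer cannot attribute this memory to the VM. */
        buf = kzalloc(size, GFP_KERNEL);

        /* Accounted: GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT. */
        buf = kzalloc(size, GFP_KERNEL_ACCOUNT);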
    • kvm: vmx: Fix typos in vmentry/vmexit control setting · d9293597
      Authored by Yu Zhang
      Previously, commit f99e3daf ("KVM: x86: Add Intel PT virtualization
      work mode") offered the framework to support Intel PT virtualization.
      However, that patch has some typos in vmx_vmentry_ctrl() and
      vmx_vmexit_ctrl(), e.g. it used the wrong flags and the wrong
      variable, which will cause VM-entry failures later (a sketch of the
      corrected shape follows this entry).
      
      Fixes: f99e3daf ("KVM: x86: Add Intel PT virtualization work mode")
      Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
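
      [A simplified sketch of the corrected shape of vmx_vmentry_ctrl(),
      using the VM_ENTRY_* PT flag names from the Intel PT series; the
      pre-fix mistakes are not reproduced, and vmx_vmexit_ctrl() is fixed
      analogously with the corresponding VM_EXIT_* flags:]

        static u32 vmx_vmentry_ctrl(void)
        {
                u32 vmentry_ctrl = vmcs_config.vmentry_ctrl;

                /* VM-entry controls must use the VM_ENTRY_* PT flags. */
                if (pt_mode == PT_MODE_SYSTEM)
                        vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
                                          VM_ENTRY_LOAD_IA32_RTIT_CTL);
                return vmentry_ctrl;
        }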
    • KVM: x86: Sync the pending Posted-Interrupts · 81b01667
      Authored by Luwei Kang
      Some posted interrupts from passthrough devices may be lost or
      overwritten while the vCPU is in the runnable state.
      
      The SN (Suppress Notification) bit of the PID (Posted Interrupt
      Descriptor) is set when the vCPU is preempted (vCPU in
      KVM_MP_STATE_RUNNABLE state but not running on a physical CPU). If a
      posted interrupt arrives at this time, the IRQ remapping facility will
      set the corresponding bit in PIR (Posted Interrupt Requests) without
      setting ON (Outstanding Notification), so the interrupt cannot be
      synced to the APIC virtualization registers and will not be handled by
      the guest because ON is zero.
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      [Eliminate the pi_clear_sn fast path. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 14 Feb 2019, 1 commit
    • KVM: x86: Recompute PID.ON when clearing PID.SN · c112b5f5
      Authored by Luwei Kang
      Some posted interrupts from passthrough devices may be lost or
      overwritten while the vCPU is in the runnable state.
      
      The SN (Suppress Notification) bit of the PID (Posted Interrupt
      Descriptor) is set when the vCPU is preempted (vCPU in
      KVM_MP_STATE_RUNNABLE state but not running on a physical CPU). If a
      posted interrupt comes at this time, the IRQ remapping facility will
      set the corresponding bit in PIR (Posted Interrupt Requests) but not
      ON (Outstanding Notification).  The interrupt will then not be seen by
      KVM, which always expects PID.ON=1 if PID.PIR=1, as documented in the
      Intel processor SDM (but not in the VT-d specification).
      To fix this, restore the invariant after PID.SN is cleared (see the
      sketch after this entry).
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
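
      [A sketch of restoring the invariant; simplified, with pi_is_pir_empty()
      from commit 29881b6e above used purely for readability:]

        pi_clear_sn(pi_desc);
        smp_mb__after_atomic();
        if (!pi_is_pir_empty(pi_desc))
                pi_set_on(pi_desc);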
  15. 12 Feb 2019, 2 commits
    • KVM: VMX: Pass "launched" directly to the vCPU-run asm blob · c9afc58c
      Authored by Sean Christopherson
      ...and remove struct vcpu_vmx's temporary __launched variable.
      
      Eliminating __launched is a bonus; the real motivation is to get to
      the point where the only reference to struct vcpu_vmx in the asm code
      is to vcpu.arch.regs, which will simplify moving the blob to a proper
      asm file.  Note that this also means the approach is deliberately
      different from the one used in nested_vmx_check_vmentry_hw().
      
      Use BL as it is a callee-save register in both 32-bit and 64-bit ABIs,
      i.e. it can't be modified by vmx_update_host_rsp(), to avoid having to
      temporarily save/restore the launched flag.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Cache host_rsp on a per-VMCS basis · 5a878160
      Authored by Sean Christopherson
      Currently, host_rsp is cached on a per-vCPU basis, i.e. it's stored in
      struct vcpu_vmx.  In non-nested usage the caching is for all intents
      and purposes 100% effective, e.g. only the first VMLAUNCH needs to
      synchronize VMCS.HOST_RSP since the call stack to vmx_vcpu_run() is
      identical each and every time.  But when running a nested guest, KVM
      must invalidate the cache when switching the current VMCS as it can't
      guarantee the new VMCS has the same HOST_RSP as the previous VMCS.  In
      other words, the cache loses almost all of its efficacy when running a
      nested VM.
      
      Move host_rsp to struct vmcs_host_state, which is per-VMCS, so that it
      is cached on a per-VMCS basis and restores its 100% hit rate when
      nested VMs are in play (see the sketch after this entry).
      
      Note that the host_rsp cache for vmcs02 essentially "breaks" when
      nested early checks are enabled as nested_vmx_check_vmentry_hw() will
      see a different RSP at the time of its VM-Enter.  While it's possible
      to avoid even that VMCS.HOST_RSP synchronization, e.g. by employing a
      dedicated VM-Exit stack, there is little motivation for doing so as
      the overhead of two VMWRITEs (~55 cycles) is dwarfed by the overhead
      of the extra VMX transition (600+ cycles) and is a proverbial drop in
      the ocean relative to the total cost of a nested transition (10s of
      thousands of cycles).
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
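
      [A sketch of the resulting shape; simplified, with vmx_update_host_rsp()
      as referenced by the follow-up patch above:]

        struct vmcs_host_state {
                unsigned long rsp;      /* last value written to VMCS.HOST_RSP */
                /* ... other cached host fields ... */
        };

        void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp)
        {
                if (unlikely(host_rsp != vmx->loaded_vmcs->host_state.rsp)) {
                        vmx->loaded_vmcs->host_state.rsp = host_rsp;
                        vmcs_writel(HOST_RSP, host_rsp);
                }
        }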
  16. 21 Dec 2018, 1 commit
    • KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines · 453eafbe
      Authored by Sean Christopherson
      Transitioning to/from a VMX guest requires KVM to manually save/load
      the bulk of CPU state that the guest is allowed to directly access,
      e.g. XSAVE state, CR2, GPRs, etc...  For obvious reasons, loading the
      guest's GPR snapshot prior to VM-Enter and saving the snapshot after
      VM-Exit is done via handcoded assembly.  The assembly blob is written
      as inline asm so that it can easily access KVM-defined structs that
      are used to hold guest state, e.g. moving the blob to a standalone
      assembly file would require generating defines for struct offsets.
      
      The other relevant aspect of VMX transitions in KVM is the handling of
      VM-Exits.  KVM doesn't employ a separate VM-Exit handler per se, but
      rather treats the VMX transition as a mega instruction (with many side
      effects), i.e. sets the VMCS.HOST_RIP to a label immediately following
      VMLAUNCH/VMRESUME.  The label is then exposed to C code via a global
      variable definition in the inline assembly.
      
      Because of the global variable, KVM takes steps to (attempt to) ensure
      only a single instance of the owning C function, e.g. vmx_vcpu_run, is
      generated by the compiler.  The earliest approach placed the inline
      assembly in a separate noinline function[1].  Later, the assembly was
      folded back into vmx_vcpu_run() and tagged with __noclone[2][3], which
      is still used today.
      
      After moving to __noclone, an edge case was encountered where GCC's
      -ftracer optimization resulted in the inline assembly blob being
      duplicated.  This was "fixed" by explicitly disabling -ftracer in the
      __noclone definition[4].
      
      Recently, it was found that disabling -ftracer causes build warnings
      for unsuspecting users of __noclone[5], and more importantly for KVM,
      prevents the compiler from properly optimizing vmx_vcpu_run()[6].  And
      perhaps most importantly of all, it was pointed out that there is no
      way to prevent duplication of a function with 100% reliability[7],
      i.e. more edge cases may be encountered in the future.
      
      So to summarize, the only way to prevent the compiler from duplicating
      the global variable definition is to move the variable out of inline
      assembly, which has been suggested several times over[1][7][8].
      
      Resolve the aforementioned issues by moving the VMLAUNCH+VMRESUME and
      VM-Exit "handler" to standalone assembly sub-routines.  Moving only
      the core VMX transition code allows the struct indexing to remain as
      inline assembly and also allows the sub-routines to be used by
      nested_vmx_check_vmentry_hw().  Reusing the sub-routines has a happy
      side-effect of eliminating two VMWRITEs in the nested_early_check path
      as there is no longer a need to dynamically change VMCS.HOST_RIP.
      
      Note that callers to vmx_vmenter() must account for the CALL modifying
      RSP, e.g. must subtract op-size from RSP when synchronizing RSP with
      VMCS.HOST_RSP and "restore" RSP prior to the CALL.  There are no great
      alternatives to fudging RSP.  Saving RSP in vmx_vmenter() is difficult
      because doing so requires a second register (VMWRITE does not provide
      an immediate encoding for the VMCS field and KVM supports Hyper-V's
      memory-based eVMCS ABI).  The other, more drastic, alternative would
      be to eschew VMCS.HOST_RSP and manually save/load RSP using a per-cpu
      variable (which can be encoded as e.g. gs:[imm]).  But because a valid
      stack is needed at the time of VM-Exit (NMIs aren't blocked and a user
      could theoretically insert INT3/INT1ICEBRK at the VM-Exit handler), a
      dedicated per-cpu VM-Exit stack would be required.  A dedicated stack
      isn't difficult to implement, but it would require at least one page
      per CPU and knowledge of the stack in the dumpstack routines.  And in
      most cases there is essentially zero overhead in dynamically updating
      VMCS.HOST_RSP, e.g. the VMWRITE can be avoided for all but the first
      VMLAUNCH unless nested_early_check=1, which is not a fast path.  In
      other words, avoiding the VMCS.HOST_RSP by using a dedicated stack
      would only make the code marginally less ugly while requiring at least
      one page per CPU and forcing the kernel to be aware (and approve) of
      the VM-Exit stack shenanigans.
      
      [1] cea15c24ca39 ("KVM: Move KVM context switch into own function")
      [2] a3b5ba49 ("KVM: VMX: add the __noclone attribute to vmx_vcpu_run")
      [3] 104f226b ("KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()")
      [4] 95272c29 ("compiler-gcc: disable -ftracer for __noclone functions")
      [5] https://lkml.kernel.org/r/20181218140105.ajuiglkpvstt3qxs@treble
      [6] https://patchwork.kernel.org/patch/8707981/#21817015
      [7] https://lkml.kernel.org/r/ri6y38lo23g.fsf@suse.cz
      [8] https://lkml.kernel.org/r/20181218212042.GE25620@tassilo.jf.intel.com
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Suggested-by: Martin Jambor <mjambor@suse.cz>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>