1. 16 Apr 2019 (4 commits)
  2. 06 Apr 2019 (2 commits)
    • KVM: x86: nVMX: fix x2APIC VTPR read intercept · c73f4c99
      Authored by Marc Orr
      Referring to the "VIRTUALIZING MSR-BASED APIC ACCESSES" chapter of the
      SDM, when "virtualize x2APIC mode" is 1 and "APIC-register
      virtualization" is 0, a RDMSR of 808H should return the VTPR from the
      virtual APIC page.
      
      However, for nested, KVM currently fails to disable the read intercept
      for this MSR. This means that a RDMSR exit takes precedence over
      "virtualize x2APIC mode", and KVM passes through L1's TPR to L2,
      instead of sourcing the value from L2's virtual APIC page.
      
      This patch fixes the issue by disabling the read intercept, in VMCS02,
      for the VTPR when "APIC-register virtualization" is 0.
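
      As a rough illustration (this is not the actual KVM helper, which works
      on the merged vmcs02 bitmap), the SDM defines the MSR bitmap as a 4 KiB
      page whose first 1024 bytes hold one read-intercept bit per MSR in the
      range 0x0-0x1fff; "disabling the read intercept" for 808H amounts to
      clearing one bit:

          #include <stdint.h>

          #define X2APIC_MSR_TPR 0x808u  /* x2APIC TPR, the source of the VTPR */

          /* Clear the read-intercept bit for a low MSR (0x0-0x1fff) so that a
           * RDMSR of it no longer causes a VM exit. */
          static void msr_bitmap_allow_read(uint8_t bitmap[4096], uint32_t msr)
          {
                  if (msr <= 0x1fff)
                          bitmap[msr / 8] &= (uint8_t)~(1u << (msr % 8));
          }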
      
      The issue described above and the fix prescribed here were verified with
      a related patch in kvm-unit-tests titled "Test VMX's virtualize x2APIC
      mode w/ nested".
      Signed-off-by: Marc Orr <marcorr@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Fixes: c992384b ("KVM: vmx: speed up MSR bitmap merge")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      c73f4c99
    • KVM: x86: nVMX: close leak of L0's x2APIC MSRs (CVE-2019-3887) · acff7847
      Authored by Marc Orr
      The nested_vmx_prepare_msr_bitmap() function doesn't directly guard the
      x2APIC MSR intercepts with the "virtualize x2APIC mode" control. As a
      result, we discovered the potential for a buggy or malicious L1 to get
      access to L0's x2APIC MSRs, via an L2, as follows.
      
      1. L1 executes WRMSR(IA32_SPEC_CTRL, 1). This causes the spec_ctrl
      variable in nested_vmx_prepare_msr_bitmap() to become true.
      2. L1 disables "virtualize x2APIC mode" in VMCS12.
      3. L1 enables "APIC-register virtualization" in VMCS12.
      
      Now, KVM will set VMCS02's x2APIC MSR intercepts from VMCS12, and then
      set "virtualize x2APIC mode" to 0 in VMCS02. Oops.
      
      This patch closes the leak by explicitly guarding VMCS02's x2APIC MSR
      intercepts with VMCS12's "virtualize x2APIC mode" control.
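
      A minimal sketch of such a guard, assuming only the SDM bit layout (the
      helper name below is made up for illustration; bit 4 of the secondary
      processor-based controls is "virtualize x2APIC mode"):

          #include <stdbool.h>
          #include <stdint.h>

          #define SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE (1u << 4)

          /* Honor L1's request to pass x2APIC MSRs through to L2 only when
           * vmcs12 enables "virtualize x2APIC mode"; otherwise every x2APIC
           * MSR intercept stays set in vmcs02, so L2 never sees L0's MSRs. */
          static bool may_pass_through_x2apic_msrs(uint32_t vmcs12_secondary_exec)
          {
                  return vmcs12_secondary_exec & SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE;
          }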
      
      The scenario outlined above and the fix prescribed here were verified
      with a related patch in kvm-unit-tests titled "Add leak scenario to
      virt_x2apic_mode_test".
      
      Note, it looks like this issue may have been introduced inadvertently
      during a merge---see 15303ba5.
      Signed-off-by: Marc Orr <marcorr@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      acff7847
  3. 29 Mar 2019 (1 commit)
  4. 21 Feb 2019 (7 commits)
    • kvm: vmx: Add memcg accounting to KVM allocations · 41836839
      Authored by Ben Gardon
      There are many KVM kernel memory allocations which are tied to the life of
      the VM process and should be charged to the VM process's cgroup. If the
      allocations aren't charged to the process's cgroup, the OOM killer will
      not know that killing the process will free the associated kernel memory.
      Add the __GFP_ACCOUNT flag to many of the allocations which are not yet
      being charged to the VM process's cgroup.
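
      A representative before/after for one such allocation (illustrative
      only, not a specific hunk from the patch):

          #include <linux/slab.h>

          static void *alloc_vm_scoped(size_t size)
          {
                  /* Old behavior: kzalloc(size, GFP_KERNEL), not charged. */

                  /* New: GFP_KERNEL_ACCOUNT is GFP_KERNEL | __GFP_ACCOUNT, so
                   * the memory is charged to the caller's (the VM's) cgroup. */
                  return kzalloc(size, GFP_KERNEL_ACCOUNT);
          }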
      
      Tested:
      	Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch
      	introduced no new failures.
      	Ran a kernel memory accounting test which creates a VM to touch
      	memory and then checks that the kernel memory allocated for the
      	process is within certain bounds.
      	With this patch we account for much more of the vmalloc and slab memory
      	allocated for the VM.
      Signed-off-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      41836839
    • KVM: nVMX: do not start the preemption timer hrtimer unnecessarily · 359a6c3d
      Authored by Paolo Bonzini
      The preemption timer can be started even if there is a vmentry
      failure during or after loading guest state.  That is pointless, so
      move the call to after all conditions have been checked.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      359a6c3d
    • KVM: x86: cleanup freeing of nested state · b4b65b56
      Authored by Paolo Bonzini
      Ensure that the VCPU free path goes through vmx_leave_nested and
      thus nested_vmx_vmexit, so that the cancellation of the timer does
      not have to be in free_nested.  In addition, because some paths through
      nested_vmx_vmexit do not go through sync_vmcs12, the cancellation of
      the timer is moved to nested_vmx_vmexit itself.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b4b65b56
    • KVM: nVMX: remove useless is_protmode check · e0dfacbf
      Authored by Paolo Bonzini
      VMX is only accessible in protected mode; remove a confusing check
      that causes the conditional to lack a final "else" branch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e0dfacbf
    • KVM: nVMX: Ignore limit checks on VMX instructions using flat segments · 34333cc6
      Authored by Sean Christopherson
      Regarding segments with a limit==0xffffffff, the SDM officially states:
      
          When the effective limit is FFFFFFFFH (4 GBytes), these accesses may
          or may not cause the indicated exceptions.  Behavior is
          implementation-specific and may vary from one execution to another.
      
      In practice, all CPUs that support VMX ignore limit checks for "flat
      segments", i.e. an expand-up data or code segment with base=0 and
      limit=0xffffffff.  This is subtly different than wrapping the effective
      address calculation based on the address size, as the flat segment
      behavior also applies to accesses that would wrap the 4g boundary, e.g.
      a 4-byte access starting at 0xffffffff will access linear addresses
      0xffffffff, 0x0, 0x1 and 0x2.
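
      In code terms, the condition that makes a segment "flat" for the purpose
      of this change is small; a sketch (plain parameters instead of KVM's
      segment structures):

          #include <stdbool.h>
          #include <stdint.h>

          /* An expand-up code or data segment with base 0 and limit 0xffffffff:
           * real CPUs skip limit checks on such segments, so the VMX-instruction
           * emulation does too. */
          static bool segment_is_flat(uint64_t base, uint32_t limit, bool expand_down)
          {
                  return base == 0 && limit == 0xffffffffu && !expand_down;
          }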
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      34333cc6
    • KVM: nVMX: Apply addr size mask to effective address for VMX instructions · 8570f9e8
      Authored by Sean Christopherson
      The address size of an instruction affects the effective address, not
      the virtual/linear address.  The final address may still be truncated,
      e.g. to 32-bits outside of long mode, but that happens irrespective of
      the address size, e.g. a 32-bit address size can yield a 64-bit virtual
      address when using FS/GS with a non-zero base.
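
      A worked sketch of the distinction, assuming a 32-bit address size and a
      non-zero FS base (variable names are illustrative):

          #include <stdint.h>

          /* The address-size mask applies to the effective address; the segment
           * base is added afterwards, so the linear address can still be 64-bit. */
          static uint64_t fs_linear_address(uint64_t fs_base, uint64_t base_reg,
                                            uint64_t index, uint64_t disp)
          {
                  uint32_t effective = (uint32_t)(base_reg + index + disp);
                  return fs_base + effective;   /* may well exceed 4 GiB */
          }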
      
      Fixes: 064aea77 ("KVM: nVMX: Decoding memory operands of VMX instructions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      8570f9e8
    • KVM: nVMX: Sign extend displacements of VMX instr's mem operands · 946c522b
      Authored by Sean Christopherson
      The VMCS.EXIT_QUALIFICATION field reports the displacements of memory
      operands for various instructions, including VMX instructions, as a
      naturally sized unsigned value, but masks the value to the address size,
      e.g. given a ModRM encoded as -0x28(%ebp), the -0x28 displacement is
      reported as 0xffffffd8 for a 32-bit address size.  Despite some weird
      wording regarding sign extension, the SDM explicitly states that bits
      beyond the instruction's address size are undefined:
      
          In all cases, bits of this field beyond the instruction’s address
          size are undefined.
      
      Failure to sign extend the displacement results in KVM incorrectly
      treating a negative displacement as a large positive displacement when
      the address size of the VMX instruction is smaller than KVM's native
      size, e.g. a 32-bit address size on a 64-bit KVM.
      
      The very original decoding, added by commit 064aea77 ("KVM: nVMX:
      Decoding memory operands of VMX instructions"), sort of modeled sign
      extension by truncating the final virtual/linear address for a 32-bit
      address size.  I.e. it messed up the effective address but made it work
      by adjusting the final address.
      
      When segmentation checks were added, the truncation logic was kept
      as-is and no sign extension logic was introduced.  In other words, it
      kept calculating the wrong effective address while mostly generating
      the correct virtual/linear address.  As the effective address is what's
      used in the segment limit checks, this results in KVM incorrectly
      injecting #GP/#SS faults due to non-existent segment violations when
      a nested VMM uses negative displacements with an address size smaller
      than KVM's native address size.
      
      Using the -0x28(%ebp) example, an EBP value of 0x1000 will result in
      KVM using 0x100000fd8 as the effective address when checking for a
      segment limit violation.  This causes a 100% failure rate when running
      a 32-bit KVM build as L1 on top of a 64-bit KVM L0.
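
      The arithmetic from the example, written out (a sketch, not the patch):

          #include <stdint.h>

          int main(void)
          {
                  uint64_t disp_raw = 0xffffffd8;          /* -0x28 as reported  */
                  uint64_t ebp      = 0x1000;

                  uint64_t ea_wrong = ebp + disp_raw;      /* 0x0000000100000fd8 */
                  uint64_t ea_right = ebp +
                          (uint64_t)(int64_t)(int32_t)disp_raw;        /* 0xfd8  */

                  return ea_wrong != ea_right;   /* they differ -> spurious #GP/#SS */
          }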
      
      Fixes: f9eb4af6 ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      946c522b
  5. 14 Feb 2019 (1 commit)
  6. 12 Feb 2019 (10 commits)
  7. 08 Feb 2019 (1 commit)
  8. 26 Jan 2019 (2 commits)
  9. 12 Jan 2019 (1 commit)
  10. 21 Dec 2018 (4 commits)
    • KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines · 453eafbe
      Authored by Sean Christopherson
      Transitioning to/from a VMX guest requires KVM to manually save/load
      the bulk of CPU state that the guest is allowed to directly access,
      e.g. XSAVE state, CR2, GPRs, etc...  For obvious reasons, loading the
      guest's GPR snapshot prior to VM-Enter and saving the snapshot after
      VM-Exit is done via handcoded assembly.  The assembly blob is written
      as inline asm so that it can easily access KVM-defined structs that
      are used to hold guest state, e.g. moving the blob to a standalone
      assembly file would require generating defines for struct offsets.
      
      The other relevant aspect of VMX transitions in KVM is the handling of
      VM-Exits.  KVM doesn't employ a separate VM-Exit handler per se, but
      rather treats the VMX transition as a mega instruction (with many side
      effects), i.e. sets the VMCS.HOST_RIP to a label immediately following
      VMLAUNCH/VMRESUME.  The label is then exposed to C code via a global
      variable definition in the inline assembly.
      
      Because of the global variable, KVM takes steps to (attempt to) ensure
      only a single instance of the owning C function, e.g. vmx_vcpu_run, is
      generated by the compiler.  The earliest approach placed the inline
      assembly in a separate noinline function[1].  Later, the assembly was
      folded back into vmx_vcpu_run() and tagged with __noclone[2][3], which
      is still used today.
      
      After moving to __noclone, an edge case was encountered where GCC's
      -ftracer optimization resulted in the inline assembly blob being
      duplicated.  This was "fixed" by explicitly disabling -ftracer in the
      __noclone definition[4].
      
      Recently, it was found that disabling -ftracer causes build warnings
      for unsuspecting users of __noclone[5], and more importantly for KVM,
      prevents the compiler from properly optimizing vmx_vcpu_run()[6].  And
      perhaps most importantly of all, it was pointed out that there is no
      way to prevent duplication of a function with 100% reliability[7],
      i.e. more edge cases may be encountered in the future.
      
      So to summarize, the only way to prevent the compiler from duplicating
      the global variable definition is to move the variable out of inline
      assembly, which has been suggested several times over[1][7][8].
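
      A toy reproduction of the underlying problem (not KVM's actual blob):
      the moment the compiler emits a second copy of the function, the
      assembler sees the label defined twice and the build breaks, which is
      why the definition has to leave inline assembly entirely.

          /* If the compiler clones or duplicates this function (e.g. due to
           * -ftracer), "fake_vmexit_label" is emitted twice into .text and
           * assembly fails with a duplicate-symbol error. */
          static void run_guest_once(void)
          {
                  asm volatile("fake_vmexit_label:\n\t"
                               "nop");
          }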
      
      Resolve the aforementioned issues by moving the VMLAUNCH+VMRESUME and
      VM-Exit "handler" to standalone assembly sub-routines.  Moving only
      the core VMX transition code allows the struct indexing to remain as
      inline assembly and also allows the sub-routines to be used by
      nested_vmx_check_vmentry_hw().  Reusing the sub-routines has a happy
      side-effect of eliminating two VMWRITEs in the nested_early_check path
      as there is no longer a need to dynamically change VMCS.HOST_RIP.
      
      Note that callers to vmx_vmenter() must account for the CALL modifying
      RSP, e.g. must subtract op-size from RSP when synchronizing RSP with
      VMCS.HOST_RSP and "restore" RSP prior to the CALL.  There are no great
      alternatives to fudging RSP.  Saving RSP in vmx_vmenter() is difficult
      because doing so requires a second register (VMWRITE does not provide
      an immediate encoding for the VMCS field and KVM supports Hyper-V's
      memory-based eVMCS ABI).  The other more drastic alternative would be
      to eschew VMCS.HOST_RSP and manually save/load RSP using a per-cpu
      variable (which can be encoded as e.g. gs:[imm]).  But because a valid
      stack is needed at the time of VM-Exit (NMIs aren't blocked and a user
      could theoretically insert INT3/INT1ICEBRK at the VM-Exit handler), a
      dedicated per-cpu VM-Exit stack would be required.  A dedicated stack
      isn't difficult to implement, but it would require at least one page
      per CPU and knowledge of the stack in the dumpstack routines.  And in
      most cases there is essentially zero overhead in dynamically updating
      VMCS.HOST_RSP, e.g. the VMWRITE can be avoided for all but the first
      VMLAUNCH unless nested_early_check=1, which is not a fast path.  In
      other words, avoiding the VMCS.HOST_RSP by using a dedicated stack
      would only make the code marginally less ugly while requiring at least
      one page per CPU and forcing the kernel to be aware (and approve) of
      the VM-Exit stack shenanigans.
      
      [1] cea15c24ca39 ("KVM: Move KVM context switch into own function")
      [2] a3b5ba49 ("KVM: VMX: add the __noclone attribute to vmx_vcpu_run")
      [3] 104f226b ("KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()")
      [4] 95272c29 ("compiler-gcc: disable -ftracer for __noclone functions")
      [5] https://lkml.kernel.org/r/20181218140105.ajuiglkpvstt3qxs@treble
      [6] https://patchwork.kernel.org/patch/8707981/#21817015
      [7] https://lkml.kernel.org/r/ri6y38lo23g.fsf@suse.cz
      [8] https://lkml.kernel.org/r/20181218212042.GE25620@tassilo.jf.intel.com
      Suggested-by: Andi Kleen <ak@linux.intel.com>
      Suggested-by: Martin Jambor <mjambor@suse.cz>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Martin Jambor <mjambor@suse.cz>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      453eafbe
    • KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs · 051a2d3e
      Authored by Sean Christopherson
      Use '%% " _ASM_CX"' instead of '%0' to dereference RCX, i.e. the
      'struct vcpu_vmx' pointer, in the VM-Enter asm blobs of vmx_vcpu_run()
      and nested_vmx_check_vmentry_hw().  Referencing RCX via the positional
      '%0' means that adding or removing an output parameter requires
      renumbering, and thus "rewriting", almost all of the asm blob, which
      makes it nearly impossible to understand what's being changed in even
      the most minor patches.
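
      A minimal contrast of the two styles outside of KVM (the load itself is
      meaningless, it only shows how the operand is referenced; x86-64 only):

          /* The positional '%1' has to be renumbered whenever outputs/inputs
           * are added or removed; the explicit register reference does not. */
          static unsigned long load_via_rcx(void *vmx)
          {
                  unsigned long val;

                  asm volatile("mov (%1), %0" : "=r"(val) : "c"(vmx) : "memory");
                  asm volatile("mov (%%rcx), %0" : "=r"(val) : "c"(vmx) : "memory");

                  return val;
          }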
      
      Opportunistically improve the code comments.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Reviewed-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      051a2d3e
    • KVM: x86: Disable Intel PT when VMXON in L1 guest · ee85dec2
      Authored by Luwei Kang
      Currently, Intel Processor Trace does not support tracing in L1 guest
      VMX operation (IA32_VMX_MISC[bit 14] is 0). As mentioned in the SDM,
      on these types of processors, execution of the VMXON instruction
      clears IA32_RTIT_CTL.TraceEn, and any attempt to write IA32_RTIT_CTL
      causes a general-protection exception (#GP).
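
      A sketch of the capability check this is keyed on, assuming only the
      SDM-documented MSR and bit position:

          #include <stdbool.h>
          #include <stdint.h>

          #define MSR_IA32_VMX_MISC      0x485
          #define VMX_MISC_PT_IN_VMX_BIT 14   /* "Intel PT usable in VMX operation" */

          /* When this bit is clear, VMXON clears IA32_RTIT_CTL.TraceEn and any
           * write to IA32_RTIT_CTL in VMX operation raises #GP, so L1 must see
           * Intel PT as unusable once it executes VMXON. */
          static bool pt_usable_in_vmx(uint64_t ia32_vmx_misc)
          {
                  return ia32_vmx_misc & (1ull << VMX_MISC_PT_IN_VMX_BIT);
          }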
      Signed-off-by: Luwei Kang <luwei.kang@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ee85dec2
    • kvm: nVMX: NMI-window and interrupt-window exiting should wake L2 from HLT · 9ebdfe52
      Authored by Jim Mattson
      According to the SDM, "NMI-window exiting" VM-exits wake a logical
      processor from the same inactive states as would an NMI and
      "interrupt-window exiting" VM-exits wake a logical processor from the
      same inactive states as would an external interrupt. Specifically, they
      wake a logical processor from the shutdown state and from the states
      entered using the HLT and MWAIT instructions.
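
      A hedged sketch of the resulting check (the bit positions are the SDM's
      primary processor-based VM-execution controls; the helper name is made
      up):

          #include <stdbool.h>
          #include <stdint.h>

          #define CPU_BASED_INTR_WINDOW_EXITING (1u << 2)
          #define CPU_BASED_NMI_WINDOW_EXITING  (1u << 22)

          /* An L2 vCPU sitting in HLT has a pending wakeup if L1 requested
           * either window exit, mirroring the SDM language quoted above. */
          static bool halted_l2_should_wake(uint32_t vmcs12_cpu_based_exec_ctrl)
          {
                  return vmcs12_cpu_based_exec_ctrl &
                         (CPU_BASED_INTR_WINDOW_EXITING |
                          CPU_BASED_NMI_WINDOW_EXITING);
          }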
      
      Fixes: 6dfacadd ("KVM: nVMX: Add support for activity state HLT")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Squashed the comments of two of Jim's patches and used the simplified
       code hunk provided by Sean. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      9ebdfe52
  11. 15 Dec 2018 (7 commits)