1. 01 Aug 2016, 2 commits
    • KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD · b80c76ec
      Authored by Jim Mattson
      Kexec needs to know the addresses of all VMCSs that are active on
      each CPU, so that it can flush them from the VMCS caches. It is
      safe to record superfluous addresses that are not associated with
      an active VMCS, but it is not safe to omit an address associated
      with an active VMCS.
      
      After a call to vmcs_load, the VMCS that was loaded is active on
      the CPU. The VMCS should be added to the CPU's list of active
      VMCSs before it is loaded.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: x86: nVMX: maintain internal copy of current VMCS · 4f2777bc
      Authored by David Matlack
      KVM maintains L1's current VMCS in guest memory, at the guest physical
      page identified by the argument to VMPTRLD. This makes hairy
      time-of-check to time-of-use bugs possible, as VCPUs can be writing
      the VMCS page in memory while KVM is emulating VMLAUNCH and
      VMRESUME.
      
      The spec documents that writing to the VMCS page while it is loaded is
      "undefined". Therefore it is reasonable to load the entire VMCS into
      an internal cache during VMPTRLD and ignore writes to the VMCS page
      -- the guest should be using VMREAD and VMWRITE to access the current
      VMCS.
      
      To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
      and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
      Since this implementation of VMCS caching only maintains the current
      VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
      current VMCS pointer.
      
      KVM will also flush during VMXOFF, which is not mandated by the spec,
      but also not in conflict with the spec.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 14 Jul 2016, 5 commits
  3. 11 Jul 2016, 4 commits
    • KVM: VMX: introduce vm_{entry,exit}_control_reset_shadow · 8391ce44
      Authored by Paolo Bonzini
      There is no reason to read the entry/exit control fields of the
      VMCS and immediately write back the same value.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: keep preemption timer enabled during L2 execution · 9314006d
      Authored by Paolo Bonzini
      Because the vmcs12 preemption timer is emulated through a separate hrtimer,
      we can keep on using the preemption timer in the vmcs02 to emulate L1's
      TSC deadline timer.
      
      However, the corresponding bit in the pin-based execution control field
      must be kept consistent between vmcs01 and vmcs02.  On vmentry we copy
      it into the vmcs02; on vmexit the preemption timer must be disabled in
      the vmcs01 if a preemption timer vmexit happened while in guest mode.
      
      The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it
      need not be considered in prepare_vmcs02.
      
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Tested-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: avoid incorrect preemption timer vmexit in nested guest · 55123e3c
      Authored by Wanpeng Li
      The preemption timer for nested VMX is emulated by an hrtimer which is
      started on L2 entry, stopped on L2 exit, and evaluated via the
      check_nested_events hook. However, nested_vmx_exit_handled always
      returns true for a preemption timer vmexit. The L1 preemption timer
      vmexit is then captured and treated as an L2 preemption timer vmexit,
      causing NULL pointer dereferences or worse in the L1 guest's vmexit
      handler:
      
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          PGD 0
          Oops: 0010 [#1] SMP
          Call Trace:
           ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
           handle_preemption_timer+0xe/0x20 [kvm_intel]
           vmx_handle_exit+0x169/0x15a0 [kvm_intel]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           ? vcpu_load+0x1c/0x60 [kvm]
           ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
           kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
           do_vfs_ioctl+0x96/0x6a0
           ? __fget_light+0x2a/0x90
           SyS_ioctl+0x79/0x90
           do_syscall_64+0x68/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          Code:  Bad RIP value.
          RIP  [<          (null)>]           (null)
           RSP <ffff8800b5263c48>
          CR2: 0000000000000000
          ---[ end trace 9c70c48b1a2bc66e ]---
      
      This can be readily reproduced with the preemption timer enabled on L0
      and disabled on L1.
      
      Return false since preemption timer vmexits must never be reflected to L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: reflect broken preemption timer in vmcs_config · 1c17c3e6
      Authored by Paolo Bonzini
      Simplify cpu_has_vmx_preemption_timer.  This is consistent with the
      rest of setup_vmcs_config and preparatory for the next patch.
      Tested-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 05 Jul 2016, 1 commit
  5. 01 Jul 2016, 3 commits
  6. 24 Jun 2016, 3 commits
  7. 16 Jun 2016, 2 commits
  8. 14 Jun 2016, 1 commit
  9. 25 May 2016, 1 commit
    • kvm:vmx: more complete state update on APICv on/off · 3ce424e4
      Authored by Roman Kagan
      The function to update APICv on/off state (in particular, to deactivate
      it when enabling Hyper-V SynIC) is incomplete: it doesn't adjust
      APICv-related fields among secondary processor-based VM-execution
      controls.  As a result, Windows 2012 guests get stuck when SynIC-based
      auto-EOI interrupt intersected with e.g. an IPI in the guest.
      
      In addition, the MSR intercept bitmap isn't updated every time "virtualize
      x2APIC mode" is toggled.  This path can only be triggered by a malicious
      guest, because Windows doesn't use x2APIC but rather its own synthetic
      APIC access MSRs; however a guest running in a SynIC-enabled VM could
      switch to x2APIC and thus obtain direct access to host APIC MSRs
      (CVE-2016-4440).
      
      The patch fixes those omissions.
      Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
      Reported-by: Steve Rutherford <srutherford@google.com>
      Reported-by: Yang Zhang <yang.zhang.wz@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  10. 19 May 2016, 1 commit
  11. 29 Apr 2016, 1 commit
    • KVM: x86: fix ordering of cr0 initialization code in vmx_cpu_reset · f2463247
      Authored by Bruce Rogers
      Commit d28bc9dd reversed the order of two lines which initialize cr0,
      allowing the current (old) cr0 value to mess up vcpu initialization.
      This was observed in the checks for cr0 X86_CR0_WP bit in the context of
      kvm_mmu_reset_context(). Besides, setting vcpu->arch.cr0 after vmx_set_cr0()
      is completely redundant. Change the order back to ensure proper vcpu
      initialization.
      
      The combination of booting with ovmf firmware when guest vcpus > 1 and kvm's
      ept=N option being set results in a VM-entry failure. This patch fixes that.
      
      Fixes: d28bc9dd ("KVM: x86: INIT and reset sequences are different")
      Cc: stable@vger.kernel.org
      Signed-off-by: Bruce Rogers <brogers@suse.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  12. 28 Apr 2016, 1 commit
  13. 13 Apr 2016, 1 commit
  14. 22 Mar 2016, 6 commits
  15. 10 Mar 2016, 1 commit
    • KVM: MMU: fix ept=0/pte.u=1/pte.w=0/CR0.WP=0/CR4.SMEP=1/EFER.NX=0 combo · 844a5fe2
      Authored by Paolo Bonzini
      Yes, all of these are needed. :) This is admittedly a bit odd, but
      kvm-unit-tests access.flat tests this if you run it with "-cpu host"
      and of course ept=0.
      
      KVM runs the guest with CR0.WP=1, so it must handle supervisor writes
      specially when pte.u=1/pte.w=0/CR0.WP=0.  Such writes cause a fault
      when U=1 and W=0 in the SPTE, but they must succeed because CR0.WP=0.
      When KVM gets the fault, it sets U=0 and W=1 in the shadow PTE and
      restarts execution.  This will still cause a user write to fault, while
      supervisor writes will succeed.  User reads will fault spuriously now,
      and KVM will then flip U and W again in the SPTE (U=1, W=0).  User reads
      will be enabled and supervisor writes disabled, going back to the
      original situation where supervisor writes fault spuriously.
      
      When SMEP is in effect, however, U=0 will enable kernel execution of
      this page.  To avoid this, KVM also sets NX=1 in the shadow PTE together
      with U=0.  If the guest has not enabled NX, the result is a continuous
      stream of page faults due to the NX bit being reserved.
      
      The fix is to force EFER.NX=1 even if the CPU is taking care of the EFER
      switch.  (All machines with SMEP have the CPU_LOAD_IA32_EFER vm-entry
      control, so they do not use user-return notifiers for EFER---if they did,
      EFER.NX would be forced to the same value as the host).
      
      There is another bug in the reserved bit check, which I've split to a
      separate patch for easier application to stable kernels.
      
      Cc: stable@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
      Fixes: f6577a5f
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  16. 09 Mar 2016, 1 commit
    • KVM: x86: disable MPX if host did not enable MPX XSAVE features · a87036ad
      Authored by Paolo Bonzini
      When eager FPU is disabled, KVM will still see the MPX bit in CPUID and
      presumably the MPX vmentry and vmexit controls.  However, it will not
      be able to expose the MPX XSAVE features to the guest, because the guest's
      accessible XSAVE features are always a subset of host_xcr0.
      
      In this case, we should disable the MPX CPUID bit, the BNDCFGS MSR,
      and the MPX vmentry and vmexit controls for nested virtualization.
      It is then unnecessary to enable guest eager FPU if the guest has the
      MPX CPUID bit set.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  17. 08 Mar 2016, 1 commit
    • KVM: VMX: disable PEBS before a guest entry · 7099e2e1
      Authored by Radim Krčmář
      Linux guests on Haswell (and also SandyBridge and Broadwell, at least)
      would crash if you decided to run a host command that uses PEBS, like
        perf record -e 'cpu/mem-stores/pp' -a
      
      This happens because KVM is using VMX MSR switching to disable PEBS, but
      SDM [2015-12] 18.4.4.4 Re-configuring PEBS Facilities explains why it
      isn't safe:
        When software needs to reconfigure PEBS facilities, it should allow a
        quiescent period between stopping the prior event counting and setting
        up a new PEBS event. The quiescent period is to allow any latent
        residual PEBS records to complete its capture at their previously
        specified buffer address (provided by IA32_DS_AREA).
      
      There might not be a quiescent period after the MSR switch, so a CPU
      ends up using host's MSR_IA32_DS_AREA to access an area in guest's
      memory.  (Or MSR switching is just buggy on some models.)
      
      The guest can learn something about the host this way:
      If the guest doesn't map address pointed by MSR_IA32_DS_AREA, it results
      in #PF where we leak host's MSR_IA32_DS_AREA through CR2.
      
      After that, a malicious guest can map and configure memory where
      MSR_IA32_DS_AREA is pointing and can therefore get an output from
      host's tracing.
      
      This is not a critical leak as the host must initiate PEBS tracing
      and I have not been able to get a record from more than one instruction
      before vmentry in vmx_vcpu_run() (that place has most registers already
      overwritten with guest's).
      
      We could disable PEBS just a few instructions before vmentry, but
      disabling it earlier shouldn't affect host tracing too much.
      We also don't need to switch MSR_IA32_PEBS_ENABLE on VMENTRY, but that
      optimization isn't worth its code, IMO.
      
      (If you are implementing PEBS for guests, be sure to handle the case
       where both host and guest enable PEBS, because this patch doesn't.)
      
      Fixes: 26a4f3c0 ("perf/x86: disable PEBS on a guest entry.")
      Cc: <stable@vger.kernel.org>
      Reported-by: Jiří Olša <jolsa@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  18. 04 Mar 2016, 1 commit
  19. 02 Mar 2016, 1 commit
  20. 24 Feb 2016, 2 commits
  21. 23 Feb 2016, 1 commit