1. 13 11月, 2014 1 次提交
    • C
      kvm: svm: move WARN_ON in svm_adjust_tsc_offset · d913b904
      Chris J Arges 提交于
      When running the tsc_adjust kvm-unit-test on an AMD processor with the
      IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
      triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
      is called; however it may trigger unnecessary warnings.
      
      This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
      called from svm_adjust_tsc_offset. In addition make adj in kvm_set_msr_common
      s64 since this can have signed values.
      Signed-off-by: NChris J Arges <chris.j.arges@canonical.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d913b904
  2. 03 11月, 2014 1 次提交
    • N
      KVM: vmx: Unavailable DR4/5 is checked before CPL · 16f8a6f9
      Nadav Amit 提交于
      If DR4/5 is accessed when it is unavailable (since CR4.DE is set), then #UD
      should be generated even if CPL>0. This is according to Intel SDM Table 6-2:
      "Priority Among Simultaneous Exceptions and Interrupts".
      
      Note, that this may happen on the first DR access, even if the host does not
      sets debug breakpoints. Obviously, it occurs when the host debugs the guest.
      
      This patch moves the DR4/5 checks from __kvm_set_dr/_kvm_get_dr to handle_dr.
      The emulator already checks DR4/5 availability in check_dr_read. Nested
      virutalization related calls to kvm_set_dr/kvm_get_dr would not like to inject
      exceptions to the guest.
      
      As for SVM, the patch follows the previous logic as much as possible. Anyhow,
      it appears the DR interception code might be buggy - even if the DR access
      may cause an exception, the instruction is skipped.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      16f8a6f9
  3. 24 10月, 2014 2 次提交
    • M
      kvm: x86: don't kill guest on unknown exit reason · 2bc19dc3
      Michael S. Tsirkin 提交于
      KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was
      triggered by a priveledged application.  Let's not kill the guest: WARN
      and inject #UD instead.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2bc19dc3
    • N
      KVM: x86: Check non-canonical addresses upon WRMSR · 854e8bb1
      Nadav Amit 提交于
      Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
      written to certain MSRs. The behavior is "almost" identical for AMD and Intel
      (ignoring MSRs that are not implemented in either architecture since they would
      anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
      non-canonical address is written on Intel but not on AMD (which ignores the top
      32-bits).
      
      Accordingly, this patch injects a #GP on the MSRs which behave identically on
      Intel and AMD.  To eliminate the differences between the architecutres, the
      value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
      canonical value before writing instead of injecting a #GP.
      
      Some references from Intel and AMD manuals:
      
      According to Intel SDM description of WRMSR instruction #GP is expected on
      WRMSR "If the source register contains a non-canonical address and ECX
      specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
      IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
      
      According to AMD manual instruction manual:
      LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
      LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
      form, a general-protection exception (#GP) occurs."
      IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
      base field must be in canonical form or a #GP fault will occur."
      IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
      be in canonical form."
      
      This patch fixes CVE-2014-3610.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      854e8bb1
  4. 11 9月, 2014 1 次提交
  5. 03 9月, 2014 1 次提交
  6. 29 8月, 2014 2 次提交
  7. 27 8月, 2014 1 次提交
    • C
      x86: Replace __get_cpu_var uses · 89cbc767
      Christoph Lameter 提交于
      __get_cpu_var() is used for multiple purposes in the kernel source. One of
      them is address calculation via the form &__get_cpu_var(x).  This calculates
      the address for the instance of the percpu variable of the current processor
      based on an offset.
      
      Other use cases are for storing and retrieving data from the current
      processors percpu area.  __get_cpu_var() can be used as an lvalue when
      writing data or on the right side of an assignment.
      
      __get_cpu_var() is defined as :
      
      #define __get_cpu_var(var) (*this_cpu_ptr(&(var)))
      
      __get_cpu_var() always only does an address determination. However, store
      and retrieve operations could use a segment prefix (or global register on
      other platforms) to avoid the address calculation.
      
      this_cpu_write() and this_cpu_read() can directly take an offset into a
      percpu area and use optimized assembly code to read and write per cpu
      variables.
      
      This patch converts __get_cpu_var into either an explicit address
      calculation using this_cpu_ptr() or into a use of this_cpu operations that
      use the offset.  Thereby address calculations are avoided and less registers
      are used when code is generated.
      
      Transformations done to __get_cpu_var()
      
      1. Determine the address of the percpu instance of the current processor.
      
      	DEFINE_PER_CPU(int, y);
      	int *x = &__get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(&y);
      
      2. Same as #1 but this time an array structure is involved.
      
      	DEFINE_PER_CPU(int, y[20]);
      	int *x = __get_cpu_var(y);
      
          Converts to
      
      	int *x = this_cpu_ptr(y);
      
      3. Retrieve the content of the current processors instance of a per cpu
      variable.
      
      	DEFINE_PER_CPU(int, y);
      	int x = __get_cpu_var(y)
      
         Converts to
      
      	int x = __this_cpu_read(y);
      
      4. Retrieve the content of a percpu struct
      
      	DEFINE_PER_CPU(struct mystruct, y);
      	struct mystruct x = __get_cpu_var(y);
      
         Converts to
      
      	memcpy(&x, this_cpu_ptr(&y), sizeof(x));
      
      5. Assignment to a per cpu variable
      
      	DEFINE_PER_CPU(int, y)
      	__get_cpu_var(y) = x;
      
         Converts to
      
      	__this_cpu_write(y, x);
      
      6. Increment/Decrement etc of a per cpu variable
      
      	DEFINE_PER_CPU(int, y);
      	__get_cpu_var(y)++
      
         Converts to
      
      	__this_cpu_inc(y)
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Acked-by: NH. Peter Anvin <hpa@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      89cbc767
  8. 22 8月, 2014 1 次提交
  9. 19 8月, 2014 1 次提交
  10. 11 7月, 2014 2 次提交
    • P
      KVM: x86: return all bits from get_interrupt_shadow · 37ccdcbe
      Paolo Bonzini 提交于
      For the next patch we will need to know the full state of the
      interrupt shadow; we will then set KVM_REQ_EVENT when one bit
      is cleared.
      
      However, right now get_interrupt_shadow only returns the one
      corresponding to the emulated instruction, or an unconditional
      0 if the emulated instruction does not have an interrupt shadow.
      This is confusing and does not allow us to check for cleared
      bits as mentioned above.
      
      Clean the callback up, and modify toggle_interruptibility to
      match the comment above the call.  As a small result, the
      call to set_interrupt_shadow will be skipped in the common
      case where int_shadow == 0 && mask == 0.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      37ccdcbe
    • J
      KVM: Synthesize G bit for all segments. · 80112c89
      Jim Mattson 提交于
      We have noticed that qemu-kvm hangs early in the BIOS when runnning nested
      under some versions of VMware ESXi.
      
      The problem we believe is because KVM assumes that the platform preserves
      the 'G' but for any segment register. The SVM specification itemizes the
      segment attribute bits that are observed by the CPU, but the (G)ranularity bit
      is not one of the bits itemized, for any segment. Though current AMD CPUs keep
      track of the (G)ranularity bit for all segment registers other than CS, the
      specification does not require it. VMware's virtual CPU may not track the
      (G)ranularity bit for any segment register.
      
      Since kvm already synthesizes the (G)ranularity bit for the CS segment. It
      should do so for all segments. The patch below does that, and helps get rid of
      the hangs. Patch applies on top of Linus' tree.
      Signed-off-by: NJim Mattson <jmattson@vmware.com>
      Signed-off-by: NAlok N Kataria <akataria@vmware.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      80112c89
  11. 10 7月, 2014 4 次提交
  12. 30 6月, 2014 1 次提交
  13. 22 5月, 2014 1 次提交
    • P
      KVM: x86: get CPL from SS.DPL · ae9fedc7
      Paolo Bonzini 提交于
      CS.RPL is not equal to the CPL in the few instructions between
      setting CR0.PE and reloading CS.  And CS.DPL is also not equal
      to the CPL for conforming code segments.
      
      However, SS.DPL *is* always equal to the CPL except for the weird
      case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
      value in the STAR MSR, but force CPL=3 (Intel instead forces
      SS.DPL=SS.RPL=CPL=3).
      
      So this patch:
      
      - modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
      the above case with SYSRET is not broken further, and the way
      to fix it would be to pass the CPL to userspace and back
      
      - modifies VMX to always return the CPL from SS.DPL (except
      forcing it to 0 if we are emulating real mode via vm86 mode;
      in vm86 mode all DPLs have to be 3, but real mode does allow
      privileged instructions).  It also removes the CPL cache,
      which becomes a duplicate of the SS access rights cache.
      
      This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
      CR0.PE=1 but before CS has been reloaded.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ae9fedc7
  14. 08 5月, 2014 1 次提交
    • G
      kvm: x86: emulate monitor and mwait instructions as nop · 87c00572
      Gabriel L. Somlo 提交于
      Treat monitor and mwait instructions as nop, which is architecturally
      correct (but inefficient) behavior. We do this to prevent misbehaving
      guests (e.g. OS X <= 10.7) from crashing after they fail to check for
      monitor/mwait availability via cpuid.
      
      Since mwait-based idle loops relying on these nop-emulated instructions
      would keep the host CPU pegged at 100%, do NOT advertise their presence
      via cpuid, to prevent compliant guests from using them inadvertently.
      Signed-off-by: NGabriel L. Somlo <somlo@cmu.edu>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      87c00572
  15. 17 3月, 2014 1 次提交
    • P
      KVM: x86: handle missing MPX in nested virtualization · 93c4adc7
      Paolo Bonzini 提交于
      When doing nested virtualization, we may be able to read BNDCFGS but
      still not be allowed to write to GUEST_BNDCFGS in the VMCS.  Guard
      writes to the field with vmx_mpx_supported(), and similarly hide the
      MSR from userspace if the processor does not support the field.
      
      We could work around this with the generic MSR save/load machinery,
      but there is only a limited number of MSR save/load slots and it is
      not really worthwhile to waste one for a scenario that should not
      happen except in the nested virtualization case.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      93c4adc7
  16. 13 3月, 2014 1 次提交
  17. 11 3月, 2014 3 次提交
    • P
      KVM: svm: Allow the guest to run with dirty debug registers · facb0139
      Paolo Bonzini 提交于
      When not running in guest-debug mode (i.e. the guest controls the debug
      registers, having to take an exit for each DR access is a waste of time.
      If the guest gets into a state where each context switch causes DR to be
      saved and restored, this can take away as much as 40% of the execution
      time from the guest.
      
      If the guest is running with vcpu->arch.db == vcpu->arch.eff_db, we
      can let it write freely to the debug registers and reload them on the
      next exit.  We still need to exit on the first access, so that the
      KVM_DEBUGREG_WONT_EXIT flag is set in switch_db_regs; after that, further
      accesses to the debug registers will not cause a vmexit.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      facb0139
    • P
      KVM: svm: set/clear all DR intercepts in one swoop · 5315c716
      Paolo Bonzini 提交于
      Unlike other intercepts, debug register intercepts will be modified
      in hot paths if the guest OS is bad or otherwise gets tricked into
      doing so.
      
      Avoid calling recalc_intercepts 16 times for debug registers.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5315c716
    • J
      KVM: x86: Remove return code from enable_irq/nmi_window · c9a7953f
      Jan Kiszka 提交于
      It's no longer possible to enter enable_irq_window in guest mode when
      L1 intercepts external interrupts and we are entering L2. This is now
      caught in vcpu_enter_guest. So we can remove the check from the VMX
      version of enable_irq_window, thus the need to return an error code from
      both enable_irq_window and enable_nmi_window.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c9a7953f
  18. 18 2月, 2014 1 次提交
  19. 17 1月, 2014 1 次提交
  20. 03 10月, 2013 1 次提交
  21. 27 6月, 2013 1 次提交
    • Y
      kvm: Add a tracepoint write_tsc_offset · 489223ed
      Yoshihiro YUNOMAE 提交于
      Add a tracepoint write_tsc_offset for tracing TSC offset change.
      We want to merge ftrace's trace data of guest OSs and the host OS using
      TSC for timestamp in chronological order. We need "TSC offset" values for
      each guest when merge those because the TSC value on a guest is always the
      host TSC plus guest's TSC offset. If we get the TSC offset values, we can
      calculate the host TSC value for each guest events from the TSC offset and
      the event TSC value. The host TSC values of the guest events are used when we
      want to merge trace data of guests and the host in chronological order.
      (Note: the trace_clock of both the host and the guest must be set x86-tsc in
      this case)
      
      This tracepoint also records vcpu_id which can be used to merge trace data for
      SMP guests. A merge tool will read TSC offset for each vcpu, then the tool
      converts guest TSC values to host TSC values for each vcpu.
      
      TSC offset is stored in the VMCS by vmx_write_tsc_offset() or
      vmx_adjust_tsc_offset(). KVM executes the former function when a guest boots.
      The latter function is executed when kvm clock is updated. Only host can read
      TSC offset value from VMCS, so a host needs to output TSC offset value
      when TSC offset is changed.
      
      Since the TSC offset is not often changed, it could be overwritten by other
      frequent events while tracing. To avoid that, I recommend to use a special
      instance for getting this event:
      
      1. set a instance before booting a guest
       # cd /sys/kernel/debug/tracing/instances
       # mkdir tsc_offset
       # cd tsc_offset
       # echo x86-tsc > trace_clock
       # echo 1 > events/kvm/kvm_write_tsc_offset/enable
      
      2. boot a guest
      Signed-off-by: NYoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Acked-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      489223ed
  22. 03 5月, 2013 1 次提交
  23. 28 4月, 2013 2 次提交
  24. 17 4月, 2013 2 次提交
    • Y
      KVM: VMX: Add the deliver posted interrupt algorithm · a20ed54d
      Yang Zhang 提交于
      Only deliver the posted interrupt when target vcpu is running
      and there is no previous interrupt pending in pir.
      Signed-off-by: NYang Zhang <yang.z.zhang@Intel.com>
      Reviewed-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      a20ed54d
    • Y
      KVM: VMX: Enable acknowledge interupt on vmexit · a547c6db
      Yang Zhang 提交于
      The "acknowledge interrupt on exit" feature controls processor behavior
      for external interrupt acknowledgement. When this control is set, the
      processor acknowledges the interrupt controller to acquire the
      interrupt vector on VM exit.
      
      After enabling this feature, an interrupt which arrived when target cpu is
      running in vmx non-root mode will be handled by vmx handler instead of handler
      in idt. Currently, vmx handler only fakes an interrupt stack and jump to idt
      table to let real handler to handle it. Further, we will recognize the interrupt
      and only delivery the interrupt which not belong to current vcpu through idt table.
      The interrupt which belonged to current vcpu will be handled inside vmx handler.
      This will reduce the interrupt handle cost of KVM.
      
      Also, interrupt enable logic is changed if this feature is turnning on:
      Before this patch, hypervior call local_irq_enable() to enable it directly.
      Now IF bit is set on interrupt stack frame, and will be enabled on a return from
      interrupt handler if exterrupt interrupt exists. If no external interrupt, still
      call local_irq_enable() to enable it.
      
      Refer to Intel SDM volum 3, chapter 33.2.
      Signed-off-by: NYang Zhang <yang.z.zhang@Intel.com>
      Reviewed-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      a547c6db
  25. 03 4月, 2013 1 次提交
  26. 21 3月, 2013 1 次提交
  27. 13 3月, 2013 1 次提交
    • J
      KVM: x86: Rework INIT and SIPI handling · 66450a21
      Jan Kiszka 提交于
      A VCPU sending INIT or SIPI to some other VCPU races for setting the
      remote VCPU's mp_state. When we were unlucky, KVM_MP_STATE_INIT_RECEIVED
      was overwritten by kvm_emulate_halt and, thus, got lost.
      
      This introduces APIC events for those two signals, keeping them in
      kvm_apic until kvm_apic_accept_events is run over the target vcpu
      context. kvm_apic_has_events reports to kvm_arch_vcpu_runnable if there
      are pending events, thus if vcpu blocking should end.
      
      The patch comes with the side effect of effectively obsoleting
      KVM_MP_STATE_SIPI_RECEIVED. We still accept it from user space, but
      immediately translate it to KVM_MP_STATE_INIT_RECEIVED + KVM_APIC_SIPI.
      The vcpu itself will no longer enter the KVM_MP_STATE_SIPI_RECEIVED
      state. That also means we no longer exit to user space after receiving a
      SIPI event.
      
      Furthermore, we already reset the VCPU on INIT, only fixing up the code
      segment later on when SIPI arrives. Moreover, we fix INIT handling for
      the BSP: it never enter wait-for-SIPI but directly starts over on INIT.
      Tested-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      66450a21
  28. 12 3月, 2013 1 次提交
  29. 29 1月, 2013 2 次提交