1. 24 10月, 2014 2 次提交
    • A
      KVM: x86: Prevent host from panicking on shared MSR writes. · 8b3c3104
      Andy Honig 提交于
      The previous patch blocked invalid writes directly when the MSR
      is written.  As a precaution, prevent future similar mistakes by
      gracefulling handle GPs caused by writes to shared MSRs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NAndrew Honig <ahonig@google.com>
      [Remove parts obsoleted by Nadav's patch. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8b3c3104
    • N
      KVM: x86: Check non-canonical addresses upon WRMSR · 854e8bb1
      Nadav Amit 提交于
      Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
      written to certain MSRs. The behavior is "almost" identical for AMD and Intel
      (ignoring MSRs that are not implemented in either architecture since they would
      anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
      non-canonical address is written on Intel but not on AMD (which ignores the top
      32-bits).
      
      Accordingly, this patch injects a #GP on the MSRs which behave identically on
      Intel and AMD.  To eliminate the differences between the architecutres, the
      value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
      canonical value before writing instead of injecting a #GP.
      
      Some references from Intel and AMD manuals:
      
      According to Intel SDM description of WRMSR instruction #GP is expected on
      WRMSR "If the source register contains a non-canonical address and ECX
      specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
      IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
      
      According to AMD manual instruction manual:
      LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
      LSTAR and CSTAR registers.  If an RIP written by WRMSR is not in canonical
      form, a general-protection exception (#GP) occurs."
      IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
      base field must be in canonical form or a #GP fault will occur."
      IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
      be in canonical form."
      
      This patch fixes CVE-2014-3610.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      854e8bb1
  2. 24 9月, 2014 5 次提交
  3. 17 9月, 2014 1 次提交
    • T
      kvm: Remove ept_identity_pagetable from struct kvm_arch. · a255d479
      Tang Chen 提交于
      kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But
      it is never used to refer to the page at all.
      
      In vcpu initialization, it indicates two things:
      1. indicates if ept page is allocated
      2. indicates if a memory slot for identity page is initialized
      
      Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept
      identity pagetable is initialized. So we can remove ept_identity_pagetable.
      
      NOTE: In the original code, ept identity pagetable page is pinned in memroy.
            As a result, it cannot be migrated/hot-removed. After this patch, since
            kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page
            is no longer pinned in memory. And it can be migrated/hot-removed.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NGleb Natapov <gleb@kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a255d479
  4. 05 9月, 2014 2 次提交
  5. 03 9月, 2014 1 次提交
    • D
      kvm: x86: fix stale mmio cache bug · 56f17dd3
      David Matlack 提交于
      The following events can lead to an incorrect KVM_EXIT_MMIO bubbling
      up to userspace:
      
      (1) Guest accesses gpa X without a memory slot. The gfn is cached in
      struct kvm_vcpu_arch (mmio_gfn). On Intel EPT-enabled hosts, KVM sets
      the SPTE write-execute-noread so that future accesses cause
      EPT_MISCONFIGs.
      
      (2) Host userspace creates a memory slot via KVM_SET_USER_MEMORY_REGION
      covering the page just accessed.
      
      (3) Guest attempts to read or write to gpa X again. On Intel, this
      generates an EPT_MISCONFIG. The memory slot generation number that
      was incremented in (2) would normally take care of this but we fast
      path mmio faults through quickly_check_mmio_pf(), which only checks
      the per-vcpu mmio cache. Since we hit the cache, KVM passes a
      KVM_EXIT_MMIO up to userspace.
      
      This patch fixes the issue by using the memslot generation number
      to validate the mmio cache.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      [xiaoguangrong: adjust the code to make it simpler for stable-tree fix.]
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Tested-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      56f17dd3
  6. 29 8月, 2014 2 次提交
  7. 22 8月, 2014 1 次提交
  8. 19 8月, 2014 2 次提交
  9. 21 7月, 2014 1 次提交
    • N
      KVM: x86: DR6/7.RTM cannot be written · 6f43ed01
      Nadav Amit 提交于
      Haswell and newer Intel CPUs have support for RTM, and in that case DR6.RTM is
      not fixed to 1 and DR7.RTM is not fixed to zero. That is not the case in the
      current KVM implementation. This bug is apparent only if the MOV-DR instruction
      is emulated or the host also debugs the guest.
      
      This patch is a partial fix which enables DR6.RTM and DR7.RTM to be cleared and
      set respectively. It also sets DR6.RTM upon every debug exception. Obviously,
      it is not a complete fix, as debugging of RTM is still unsupported.
      Signed-off-by: NNadav Amit <namit@cs.technion.ac.il>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6f43ed01
  10. 11 7月, 2014 1 次提交
    • P
      KVM: x86: return all bits from get_interrupt_shadow · 37ccdcbe
      Paolo Bonzini 提交于
      For the next patch we will need to know the full state of the
      interrupt shadow; we will then set KVM_REQ_EVENT when one bit
      is cleared.
      
      However, right now get_interrupt_shadow only returns the one
      corresponding to the emulated instruction, or an unconditional
      0 if the emulated instruction does not have an interrupt shadow.
      This is confusing and does not allow us to check for cleared
      bits as mentioned above.
      
      Clean the callback up, and modify toggle_interruptibility to
      match the comment above the call.  As a small result, the
      call to set_interrupt_shadow will be skipped in the common
      case where int_shadow == 0 && mask == 0.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      37ccdcbe
  11. 10 7月, 2014 1 次提交
    • T
      KVM: x86: fix TSC matching · 0d3da0d2
      Tomasz Grabiec 提交于
      I've observed kvmclock being marked as unstable on a modern
      single-socket system with a stable TSC and qemu-1.6.2 or qemu-2.0.0.
      
      The culprit was failure in TSC matching because of overflow of
      kvm_arch::nr_vcpus_matched_tsc in case there were multiple TSC writes
      in a single synchronization cycle.
      
      Turns out that qemu does multiple TSC writes during init, below is the
      evidence of that (qemu-2.0.0):
      
      The first one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa04cfd6b : kvm_arch_vcpu_postcreate+0x4b/0x80 [kvm]
       0xffffffffa04b8188 : kvm_vm_ioctl+0x418/0x750 [kvm]
      
      The second one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
       0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
       0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
       0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
       0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
      
       #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
       #1  kvm_put_msrs at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
       #2  kvm_arch_put_registers at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
       #3  kvm_cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1641
       #4  cpu_synchronize_post_init at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:330
       #5  cpu_synchronize_all_post_init () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:521
       #6  main at /build/buildd/qemu-2.0.0+dfsg/vl.c:4390
      
      The third one:
      
       0xffffffffa08ff2b4 : vmx_write_tsc_offset+0xa4/0xb0 [kvm_intel]
       0xffffffffa04c9c05 : kvm_write_tsc+0x1a5/0x360 [kvm]
       0xffffffffa090610d : vmx_set_msr+0x29d/0x350 [kvm_intel]
       0xffffffffa04be83b : do_set_msr+0x3b/0x60 [kvm]
       0xffffffffa04c10a8 : msr_io+0xc8/0x160 [kvm]
       0xffffffffa04caeb6 : kvm_arch_vcpu_ioctl+0xc86/0x1060 [kvm]
       0xffffffffa04b6797 : kvm_vcpu_ioctl+0xc7/0x5a0 [kvm]
      
       #0  kvm_vcpu_ioctl at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1780
       #1  kvm_put_msrs  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1270
       #2  kvm_arch_put_registers  at /build/buildd/qemu-2.0.0+dfsg/target-i386/kvm.c:1909
       #3  kvm_cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/kvm-all.c:1635
       #4  cpu_synchronize_post_reset  at /build/buildd/qemu-2.0.0+dfsg/include/sysemu/kvm.h:323
       #5  cpu_synchronize_all_post_reset () at /build/buildd/qemu-2.0.0+dfsg/cpus.c:512
       #6  main  at /build/buildd/qemu-2.0.0+dfsg/vl.c:4482
      
      The fix is to count each vCPU only once when matched, so that
      nr_vcpus_matched_tsc holds the size of the matched set. This is
      achieved by reusing generation counters. Every vCPU with
      this_tsc_generation == cur_tsc_generation is in the matched set. The
      match set is cleared by setting cur_tsc_generation to a value which no
      other vCPU is set to (by incrementing it).
      
      I needed to bump up the counter size form u8 to u64 to ensure it never
      overflows. Otherwise in cases TSC is not written the same number of
      times on each vCPU the counter could overflow and incorrectly indicate
      some vCPUs as being in the matched set. This scenario seems unlikely
      but I'm not sure if it can be disregarded.
      Signed-off-by: NTomasz Grabiec <tgrabiec@cloudius-systems.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0d3da0d2
  12. 19 6月, 2014 2 次提交
  13. 18 6月, 2014 1 次提交
  14. 22 5月, 2014 1 次提交
    • P
      KVM: x86: get CPL from SS.DPL · ae9fedc7
      Paolo Bonzini 提交于
      CS.RPL is not equal to the CPL in the few instructions between
      setting CR0.PE and reloading CS.  And CS.DPL is also not equal
      to the CPL for conforming code segments.
      
      However, SS.DPL *is* always equal to the CPL except for the weird
      case of SYSRET on AMD processors, which sets SS.DPL=SS.RPL from the
      value in the STAR MSR, but force CPL=3 (Intel instead forces
      SS.DPL=SS.RPL=CPL=3).
      
      So this patch:
      
      - modifies SVM to update the CPL from SS.DPL rather than CS.RPL;
      the above case with SYSRET is not broken further, and the way
      to fix it would be to pass the CPL to userspace and back
      
      - modifies VMX to always return the CPL from SS.DPL (except
      forcing it to 0 if we are emulating real mode via vm86 mode;
      in vm86 mode all DPLs have to be 3, but real mode does allow
      privileged instructions).  It also removes the CPL cache,
      which becomes a duplicate of the SS access rights cache.
      
      This fixes doing KVM_IOCTL_SET_SREGS exactly after setting
      CR0.PE=1 but before CS has been reloaded.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ae9fedc7
  15. 24 4月, 2014 1 次提交
  16. 15 4月, 2014 1 次提交
  17. 11 3月, 2014 4 次提交
    • P
      KVM: x86: Allow the guest to run with dirty debug registers · c77fb5fe
      Paolo Bonzini 提交于
      When not running in guest-debug mode, the guest controls the debug
      registers and having to take an exit for each DR access is a waste
      of time.  If the guest gets into a state where each context switch
      causes DR to be saved and restored, this can take away as much as 40%
      of the execution time from the guest.
      
      After this patch, VMX- and SVM-specific code can set a flag in
      switch_db_regs, telling vcpu_enter_guest that on the next exit the debug
      registers might be dirty and need to be reloaded (syncing will be taken
      care of by a new callback in kvm_x86_ops).  This flag can be set on the
      first access to a debug registers, so that multiple accesses to the
      debug registers only cause one vmexit.
      
      Note that since the guest will be able to read debug registers and
      enable breakpoints in DR7, we need to ensure that they are synchronized
      on entry to the guest---including DR6 that was not synced before.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c77fb5fe
    • P
      KVM: x86: change vcpu->arch.switch_db_regs to a bit mask · 360b948d
      Paolo Bonzini 提交于
      The next patch will add another bit that we can test with the
      same "if".
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      360b948d
    • J
      KVM: x86: Remove return code from enable_irq/nmi_window · c9a7953f
      Jan Kiszka 提交于
      It's no longer possible to enter enable_irq_window in guest mode when
      L1 intercepts external interrupts and we are entering L2. This is now
      caught in vcpu_enter_guest. So we can remove the check from the VMX
      version of enable_irq_window, thus the need to return an error code from
      both enable_irq_window and enable_nmi_window.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c9a7953f
    • J
      KVM: nVMX: Rework interception of IRQs and NMIs · b6b8a145
      Jan Kiszka 提交于
      Move the check for leaving L2 on pending and intercepted IRQs or NMIs
      from the *_allowed handler into a dedicated callback. Invoke this
      callback at the relevant points before KVM checks if IRQs/NMIs can be
      injected. The callback has the task to switch from L2 to L1 if needed
      and inject the proper vmexit events.
      
      The rework fixes L2 wakeups from HLT and provides the foundation for
      preemption timer emulation.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b6b8a145
  18. 04 3月, 2014 2 次提交
    • A
      x86: kvm: introduce periodic global clock updates · 332967a3
      Andrew Jones 提交于
      commit 0061d53d introduced a mechanism to execute a global clock
      update for a vm. We can apply this periodically in order to propagate
      host NTP corrections. Also, if all vcpus of a vm are pinned, then
      without an additional trigger, no guest NTP corrections can propagate
      either, as the current trigger is only vcpu cpu migration.
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      332967a3
    • A
      x86: kvm: rate-limit global clock updates · 7e44e449
      Andrew Jones 提交于
      When we update a vcpu's local clock it may pick up an NTP correction.
      We can't wait an indeterminate amount of time for other vcpus to pick
      up that correction, so commit 0061d53d introduced a global clock
      update. However, we can't request a global clock update on every vcpu
      load either (which is what happens if the tsc is marked as unstable).
      The solution is to rate-limit the global clock updates. Marcelo
      calculated that we should delay the global clock updates no more
      than 0.1s as follows:
      
      Assume an NTP correction c is applied to one vcpu, but not the other,
      then in n seconds the delta of the vcpu system_timestamps will be
      c * n. If we assume a correction of 500ppm (worst-case), then the two
      vcpus will diverge 50us in 0.1s, which is a considerable amount.
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7e44e449
  19. 24 2月, 2014 1 次提交
  20. 04 2月, 2014 1 次提交
  21. 17 1月, 2014 2 次提交
    • J
      KVM: SVM: Fix reading of DR6 · 73aaf249
      Jan Kiszka 提交于
      In contrast to VMX, SVM dose not automatically transfer DR6 into the
      VCPU's arch.dr6. So if we face a DR6 read, we must consult a new vendor
      hook to obtain the current value. And as SVM now picks the DR6 state
      from its VMCB, we also need a set callback in order to write updates of
      DR6 back.
      
      Fixes a regression of 020df079.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      73aaf249
    • V
      add support for Hyper-V reference time counter · e984097b
      Vadim Rozenfeld 提交于
      Signed-off: Peter Lieven <pl@kamp.de>
      Signed-off: Gleb Natapov
      Signed-off: Vadim Rozenfeld <vrozenfe@redhat.com>
      
      After some consideration I decided to submit only Hyper-V reference
      counters support this time. I will submit iTSC support as a separate
      patch as soon as it is ready.
      
      v1 -> v2
      1. mark TSC page dirty as suggested by
          Eric Northup <digitaleric@google.com> and Gleb
      2. disable local irq when calling get_kernel_ns,
          as it was done by Peter Lieven <pl@amp.de>
      3. move check for TSC page enable from second patch
          to this one.
      
      v3 -> v4
          Get rid of ref counter offset.
      
      v4 -> v5
          replace __copy_to_user with kvm_write_guest
          when updateing iTSC page.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e984097b
  22. 06 11月, 2013 2 次提交
  23. 31 10月, 2013 2 次提交
  24. 14 10月, 2013 1 次提交