1. 01 8月, 2016 2 次提交
    • J
      KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD · b80c76ec
      Jim Mattson 提交于
      Kexec needs to know the addresses of all VMCSs that are active on
      each CPU, so that it can flush them from the VMCS caches. It is
      safe to record superfluous addresses that are not associated with
      an active VMCS, but it is not safe to omit an address associated
      with an active VMCS.
      
      After a call to vmcs_load, the VMCS that was loaded is active on
      the CPU. The VMCS should be added to the CPU's list of active
      VMCSs before it is loaded.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      b80c76ec
    • D
      kvm: x86: nVMX: maintain internal copy of current VMCS · 4f2777bc
      David Matlack 提交于
      KVM maintains L1's current VMCS in guest memory, at the guest physical
      page identified by the argument to VMPTRLD. This makes hairy
      time-of-check to time-of-use bugs possible,as VCPUs can be writing
      the the VMCS page in memory while KVM is emulating VMLAUNCH and
      VMRESUME.
      
      The spec documents that writing to the VMCS page while it is loaded is
      "undefined". Therefore it is reasonable to load the entire VMCS into
      an internal cache during VMPTRLD and ignore writes to the VMCS page
      -- the guest should be using VMREAD and VMWRITE to access the current
      VMCS.
      
      To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
      and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
      Since this implementation of VMCS caching only maintains the the current
      VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
      current VMCS pointer.
      
      KVM will also flush during VMXOFF, which is not mandated by the spec,
      but also not in conflict with the spec.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4f2777bc
  2. 14 7月, 2016 19 次提交
  3. 11 7月, 2016 4 次提交
    • P
      KVM: VMX: introduce vm_{entry,exit}_control_reset_shadow · 8391ce44
      Paolo Bonzini 提交于
      There is no reason to read the entry/exit control fields of the
      VMCS and immediately write back the same value.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8391ce44
    • P
      KVM: nVMX: keep preemption timer enabled during L2 execution · 9314006d
      Paolo Bonzini 提交于
      Because the vmcs12 preemption timer is emulated through a separate hrtimer,
      we can keep on using the preemption timer in the vmcs02 to emulare L1's
      TSC deadline timer.
      
      However, the corresponding bit in the pin-based execution control field
      must be kept consistent between vmcs01 and vmcs02.  On vmentry we copy
      it into the vmcs02; on vmexit the preemption timer must be disabled in
      the vmcs01 if a preemption timer vmexit happened while in guest mode.
      
      The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it
      need not be considered in prepare_vmcs02.
      
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Tested-by: NWanpeng Li <kernellwp@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9314006d
    • W
      KVM: nVMX: avoid incorrect preemption timer vmexit in nested guest · 55123e3c
      Wanpeng Li 提交于
      The preemption timer for nested VMX is emulated by hrtimer which is started on L2
      entry, stopped on L2 exit and evaluated via the check_nested_events hook. However,
      nested_vmx_exit_handled is always returning true for preemption timer vmexit.  Then,
      the L1 preemption timer vmexit is captured and be treated as a L2 preemption
      timer vmexit, causing NULL pointer dereferences or worse in the L1 guest's
      vmexit handler:
      
          BUG: unable to handle kernel NULL pointer dereference at           (null)
          IP: [<          (null)>]           (null)
          PGD 0
          Oops: 0010 [#1] SMP
          Call Trace:
           ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
           handle_preemption_timer+0xe/0x20 [kvm_intel]
           vmx_handle_exit+0x169/0x15a0 [kvm_intel]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
           ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
           ? vcpu_load+0x1c/0x60 [kvm]
           ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
           kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
           do_vfs_ioctl+0x96/0x6a0
           ? __fget_light+0x2a/0x90
           SyS_ioctl+0x79/0x90
           do_syscall_64+0x68/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          Code:  Bad RIP value.
          RIP  [<          (null)>]           (null)
           RSP <ffff8800b5263c48>
          CR2: 0000000000000000
          ---[ end trace 9c70c48b1a2bc66e ]---
      
      This can be reproduced readily by preemption timer enabled on L0 and disabled
      on L1.
      
      Return false since preemption timer vmexits must never be reflected to L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Haozhong Zhang <haozhong.zhang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      55123e3c
    • P
      KVM: VMX: reflect broken preemption timer in vmcs_config · 1c17c3e6
      Paolo Bonzini 提交于
      Simplify cpu_has_vmx_preemption_timer.  This is consistent with the
      rest of setup_vmcs_config and preparatory for the next patch.
      Tested-by: NWanpeng Li <kernellwp@gmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1c17c3e6
  4. 05 7月, 2016 1 次提交
  5. 01 7月, 2016 6 次提交
    • W
      KVM: vmx: fix missed cancellation of TSC deadline timer · 196f20ca
      Wanpeng Li 提交于
      INFO: rcu_sched detected stalls on CPUs/tasks:
       1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
       (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
      Task dump for CPU 1:
      qemu-system-x86 R  running task        0  3529   3525 0x00080808
       ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
       ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
       0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
      Call Trace:
      ? kvm_write_guest_cached+0xb9/0x160 [kvm]
      ? __delay+0xf/0x20
      ? wait_lapic_expire+0x14a/0x200 [kvm]
      ? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
      ? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
      ? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
      ? __fget+0x5/0x210
      ? do_vfs_ioctl+0x96/0x6a0
      ? __fget_light+0x2a/0x90
      ? SyS_ioctl+0x79/0x90
      ? do_syscall_64+0x7c/0x1e0
      ? entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced readily by running a full dynticks guest(since hrtimer
      in guest is heavily used) w/ lapic_timer_advance disabled.
      
      If fail to program hardware preemption timer, we will fallback to hrtimer based
      method, however, a previous programmed preemption timer miss to cancel in this
      scenario which results in one hardware preemption timer and one hrtimer emulated
      tsc deadline timer run simultaneously. So sometimes the target guest deadline
      tsc is earlier than guest tsc, which leads to the computation in vmx_set_hv_timer
      can underflow and cause delta_tsc to be set a huge value, then host soft lockup
      as above.
      
      This patch fix it by cancelling the previous programmed preemption timer if there
      is once we failed to program the new preemption timer and fallback to hrtimer
      based method.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      196f20ca
    • W
      KVM: x86: introduce cancel_hv_tscdeadline · bd97ad0e
      Wanpeng Li 提交于
      Introduce cancel_hv_tscdeadline() to encapsulate preemption
      timer cancel stuff.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bd97ad0e
    • P
      KVM: vmx: fix underflow in TSC deadline calculation · 9175d2e9
      Paolo Bonzini 提交于
      If the TSC deadline timer is programmed really close to the deadline or
      even in the past, the computation in vmx_set_hv_timer can underflow and
      cause delta_tsc to be set to a huge value.  This generally results
      in vmx_set_hv_timer returning -ERANGE, but we can fix it by limiting
      delta_tsc to be positive or zero.
      Reported-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9175d2e9
    • P
      KVM: x86: use guest_exit_irqoff · f2485b3e
      Paolo Bonzini 提交于
      This gains a few clock cycles per vmexit.  On Intel there is no need
      anymore to enable the interrupts in vmx_handle_external_intr, since
      we are using the "acknowledge interrupt on exit" feature.  AMD
      needs to do that, and must be careful to avoid the interrupt shadow.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f2485b3e
    • P
      KVM: x86: always use "acknowledge interrupt on exit" · 91fa0f8e
      Paolo Bonzini 提交于
      This is necessary to simplify handle_external_intr in the next patch.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      91fa0f8e
    • P
      KVM: remove kvm_guest_enter/exit wrappers · 6edaa530
      Paolo Bonzini 提交于
      Use the functions from context_tracking.h directly.
      
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6edaa530
  6. 24 6月, 2016 4 次提交
  7. 16 6月, 2016 4 次提交
    • Y
      kvm: vmx: hook preemption timer support · 64672c95
      Yunhong Jiang 提交于
      Hook the VMX preemption timer to the "hv timer" functionality added
      by the previous patch.  This includes: checking if the feature is
      supported, if the feature is broken on the CPU, the hooks to
      setup/clean the VMX preemption timer, arming the timer on vmentry
      and handling the vmexit.
      
      A module parameter states if the VMX preemption timer should be
      utilized.
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      [Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
       Put all VMX bits here.  Enable it by default #yolo. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      64672c95
    • Y
      kvm: vmx: rename vmx_pre/post_block to pi_pre/post_block · bc22512b
      Yunhong Jiang 提交于
      Prepare to switch from preemption timer to hrtimer in the
      vmx_pre/post_block. Current functions are only for posted interrupt,
      rename them accordingly.
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bc22512b
    • Y
      KVM: x86: support using the vmx preemption timer for tsc deadline timer · ce7a058a
      Yunhong Jiang 提交于
      The VMX preemption timer can be used to virtualize the TSC deadline timer.
      The VMX preemption timer is armed when the vCPU is running, and a VMExit
      will happen if the virtual TSC deadline timer expires.
      
      When the vCPU thread is blocked because of HLT, KVM will switch to use
      an hrtimer, and then go back to the VMX preemption timer when the vCPU
      thread is unblocked.
      
      This solution avoids the complex OS's hrtimer system, and the host
      timer interrupt handling cost, replacing them with a little math
      (for guest->host TSC and host TSC->preemption timer conversion)
      and a cheaper VMexit.  This benefits latency for isolated pCPUs.
      
      [A word about performance... Yunhong reported a 30% reduction in average
       latency from cyclictest.  I made a similar test with tscdeadline_latency
       from kvm-unit-tests, and measured
      
       - ~20 clock cycles loss (out of ~3200, so less than 1% but still
         statistically significant) in the worst case where the test halts
         just after programming the TSC deadline timer
      
       - ~800 clock cycles gain (25% reduction in latency) in the best case
         where the test busy waits.
      
       I removed the VMX bits from Yunhong's patch, to concentrate them in the
       next patch - Paolo]
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce7a058a
    • Y
      kvm: lapic: separate start_sw_tscdeadline from start_apic_timer · 53f9eedf
      Yunhong Jiang 提交于
      The function to start the tsc deadline timer virtualization will be used
      also by the pre_block hook when we use the preemption timer; change it
      to a separate function. No logic changes.
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      53f9eedf