1. 08 Mar, 2012 · 2 commits
    • KVM: Improve TSC offset matching · 5d3cb0f6
      Zachary Amsden committed
      There are a few improvements that can be made to the TSC offset
      matching code.  First, we don't need the 128-bit multiply
      (especially on a constant number); the code is much cleaner when
      the computation is done in nanosecond units.
      
      Second, the way everything is set up with software TSC rate scaling,
      we currently have per-cpu rates.  Obviously this isn't too desirable
      to use in practice, but if for some reason we do change the rate of
      all VCPUs at runtime, then reset the TSCs, we will only want to
      match offsets for VCPUs running at the same rate.
      
      Finally, for the case where we have an unstable host TSC, but
      rate scaling is being done in hardware, we should call the platform
      code to compute the TSC offset, so the math is reorganized to recompute
      the base instead, then transform the base into an offset using the
      existing API.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      
      KVM: Fix 64-bit division in kvm_write_tsc()
      
      Breaks i386 build.
      Signed-off-by: Avi Kivity <avi@redhat.com>
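The nanosecond-based matching described in this commit can be illustrated roughly as follows. This is a minimal user-space sketch, not the kernel code: the helper names `cycles_to_ns` and `tsc_write_matches`, and the one-second matching window, are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL  /* assumed matching window: one second */

/* Convert a cycle count to nanoseconds for a TSC running at khz kHz.
 * Splitting into quotient and remainder avoids a 128-bit multiply. */
static uint64_t cycles_to_ns(uint64_t cycles, uint32_t khz)
{
    assert(khz != 0);
    return (cycles / khz) * 1000000ULL + (cycles % khz) * 1000000ULL / khz;
}

/* Hypothetical check: does a TSC write on this VCPU "match" the previous
 * write, so that both VCPUs can share one offset?  Only writes made at the
 * same virtual rate are compared, as the commit message requires. */
static bool tsc_write_matches(uint64_t new_tsc, uint64_t last_tsc,
                              uint64_t elapsed_ns,
                              uint32_t khz, uint32_t last_khz)
{
    int64_t written_ns, skew_ns;

    if (khz != last_khz)
        return false;

    /* Signed distance between the two written values, in nanoseconds. */
    if (new_tsc >= last_tsc)
        written_ns = (int64_t)cycles_to_ns(new_tsc - last_tsc, khz);
    else
        written_ns = -(int64_t)cycles_to_ns(last_tsc - new_tsc, khz);

    /* Compare against the wall-clock time that actually elapsed. */
    skew_ns = written_ns - (int64_t)elapsed_ns;
    if (skew_ns < 0)
        skew_ns = -skew_ns;
    return (uint64_t)skew_ns < NSEC_PER_SEC;
}
```

Working in nanoseconds keeps every intermediate value within 64 bits for realistic uptimes, which is the simplification the commit message points at.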
    • KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287
      Zachary Amsden committed
      This requires some restructuring; rather than use 'virtual_tsc_khz'
      to indicate whether hardware rate scaling is in effect, we consider
      each VCPU to always have a virtual TSC rate.  Instead, there is new
      logic above the vendor-specific hardware scaling that decides whether
      it is even necessary to use and updates all rate variables used by
      common code.  This means we can simply query the virtual rate at
      any point, which is needed for software rate scaling.
      
      There is also now a threshold added to the TSC rate scaling; minor
      differences and variations of measured TSC rate can accidentally
      provoke rate scaling to be used when it is not needed.  Instead,
      we have a tolerance variable called tsc_tolerance_ppm, which is
      the maximum variation from user requested rate at which scaling
      will be used.  The default is 250ppm, which is half the
      threshold for NTP adjustment, allowing for some hardware variation.
      
      In the event that hardware rate scaling is not available, we can
      kludge a bit by forcing TSC catchup on when a faster-than-hardware
      rate has been requested, but there is nothing available yet for the
      reverse case; that requires a trap-and-emulate software
      implementation of RDTSC, which is still forthcoming.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
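The tolerance check and the fallback described in this commit can be sketched as below. This is an illustrative user-space approximation under stated assumptions: `enum tsc_mode`, `adjust_khz`, and `choose_tsc_mode` are hypothetical names, and only the 250ppm default and the catchup fallback come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical outcome of the rate decision. */
enum tsc_mode {
    TSC_NATIVE,       /* requested rate is within tolerance of the host */
    TSC_HW_SCALING,   /* outside tolerance, hardware can scale */
    TSC_SW_CATCHUP,   /* faster than host, no hardware scaling: use catchup */
    TSC_UNSUPPORTED,  /* slower than host, no hardware scaling */
};

/* Scale a rate by (1 + ppm / 1e6) using 64-bit integer math. */
static uint32_t adjust_khz(uint32_t khz, int32_t ppm)
{
    assert(ppm > -1000000 && ppm < 1000000);
    return (uint32_t)((uint64_t)khz * (uint64_t)(1000000 + ppm) / 1000000);
}

static enum tsc_mode choose_tsc_mode(uint32_t user_khz, uint32_t host_khz,
                                     int32_t tolerance_ppm,
                                     bool has_hw_scaling)
{
    uint32_t lo = adjust_khz(host_khz, -tolerance_ppm);
    uint32_t hi = adjust_khz(host_khz, tolerance_ppm);

    /* Small measurement differences must not trigger scaling. */
    if (user_khz >= lo && user_khz <= hi)
        return TSC_NATIVE;
    if (has_hw_scaling)
        return TSC_HW_SCALING;
    /* Without hardware support only the "faster" case has a fallback. */
    return user_khz > host_khz ? TSC_SW_CATCHUP : TSC_UNSUPPORTED;
}
```

With a 2 GHz host and the default 250ppm tolerance, requested rates from 1999500 kHz to 2000500 kHz would run unscaled in this sketch.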
  2. 05 Mar, 2012 · 3 commits
  3. 27 Dec, 2011 · 10 commits
  4. 05 Oct, 2011 · 1 commit
  5. 26 Sep, 2011 · 5 commits
    • KVM: Fix simultaneous NMIs · 7460fb4a
      Avi Kivity committed
      If simultaneous NMIs happen, we're supposed to queue the second
      and next (collapsing them), but currently we sometimes collapse
      the second into the first.
      
      Fix by using a counter for pending NMIs instead of a bool; since
      the counter limit depends on whether the processor is currently
      in an NMI handler, which can only be checked in vcpu context
      (via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
      of the counter.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
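The counter-based approach from this commit might look roughly like the sketch below. It is a simplified, single-threaded illustration: `struct vcpu_nmi`, `nmi_inject`, and `nmi_process` are hypothetical names, the limits of 1 (inside an NMI handler) and 2 (otherwise) are an assumption about how the limit depends on the handler state, and the atomic handling a real hypervisor needs is omitted.

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu_nmi {
    unsigned int queued;   /* NMIs raised since the last recalculation */
    unsigned int pending;  /* NMIs waiting to be delivered to the guest */
};

/* Called from any context: only counts the NMI.  A real implementation
 * would also raise a request (KVM_REQ_NMI in the commit) so that the
 * VCPU recalculates the pending count later. */
static void nmi_inject(struct vcpu_nmi *v)
{
    assert(v);
    v->queued++;
}

/* Called in vcpu context, where the NMI mask can be inspected. */
static void nmi_process(struct vcpu_nmi *v, bool in_nmi_handler)
{
    /* Assumed limits: one NMI can be latched while a handler runs;
     * otherwise one can be delivered now and one more latched. */
    unsigned int limit = in_nmi_handler ? 1 : 2;

    assert(v);
    v->pending += v->queued;
    v->queued = 0;
    if (v->pending > limit)
        v->pending = limit;   /* collapse the excess */
}
```

A bool can only represent "zero or one pending", which is why the second NMI could previously be merged into the first.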
    • KVM: L1 TSC handling · d5c1785d
      Nadav Har'El committed
      KVM assumed in several places that reading the TSC MSR returns the value for
      L1. This is incorrect, because when L2 is running, the correct TSC read exit
      emulation is to return L2's value.
      
      We therefore add a new x86_ops function, read_l1_tsc, to use in places that
      specifically need to read the L1 TSC, NOT the TSC of the current level of
      guest.
      
      Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
      by a different patch sent by Zachary Amsden (and not yet applied):
      kvm_arch_vcpu_load() should not read the guest TSC at all, and if it did not,
      there would be no need to change the call of kvm_get_msr() to read_l1_tsc().
      
      [avi: moved callback to kvm_x86_ops tsc block]
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Acked-by: Zachary Amsdem <zamsden@gmail.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
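The distinction between "the TSC of the current guest level" and "the L1 TSC" can be illustrated with a small model. This is only a sketch under assumptions: `struct vcpu_tsc` and its fields are hypothetical, `read_guest_tsc` is an invented helper, and `read_l1_tsc` here merely borrows the name of the new x86_ops callback without reproducing its real signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-VCPU state; offsets wrap modulo 2^64, so a negative
 * offset is stored as its two's-complement value. */
struct vcpu_tsc {
    uint64_t l1_tsc_offset;  /* offset the host applies for L1 */
    uint64_t l2_tsc_offset;  /* additional offset L1 applies for L2 */
    bool     in_l2;          /* true while the nested guest is running */
};

/* Always returns L1's view of the TSC, whatever level is running. */
static uint64_t read_l1_tsc(const struct vcpu_tsc *v, uint64_t host_tsc)
{
    assert(v);
    return host_tsc + v->l1_tsc_offset;
}

/* Returns the TSC as seen by the currently running guest level. */
static uint64_t read_guest_tsc(const struct vcpu_tsc *v, uint64_t host_tsc)
{
    uint64_t tsc = read_l1_tsc(v, host_tsc);

    if (v->in_l2)
        tsc += v->l2_tsc_offset;
    return tsc;
}
```

Code that maintains L1 timekeeping would call the first helper in this model, since the second one changes its answer whenever L2 is running.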
    • KVM: MMU: Do not unconditionally read PDPTE from guest memory · e4e517b4
      Avi Kivity committed
      Architecturally, PDPTEs are cached in the PDPTRs when CR3 is reloaded.
      On SVM, it is not possible to implement this, but on VMX this is possible
      and was indeed implemented until nested SVM changed this to unconditionally
      read PDPTEs dynamically.  This has a noticeable impact when running PAE guests.
      
      Fix by changing the MMU to read PDPTRs from the cache, falling back to
      reading from memory for the nested MMU.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Tested-by: Joerg Roedel <joerg.roedel@amd.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: Use __print_symbolic() for vmexit tracepoints · 0d460ffc
      Stefan Hajnoczi committed
      The vmexit tracepoints format the exit_reason to make it human-readable.
      Since the exit_reason depends on the instruction set (vmx or svm),
      formatting is handled with ftrace_print_symbols_seq() by referring to
      the appropriate exit reason table.
      
      However, the ftrace_print_symbols_seq() function is not meant to be used
      directly in tracepoints since it does not export the formatting table
      which userspace tools like trace-cmd and perf use to format traces.
      
      In practice perf dies when formatting vmexit-related events and
      trace-cmd falls back to printing the numeric value (with extra
      formatting code in the kvm plugin to paper over this limitation).  Other
      userspace consumers of vmexit-related tracepoints would be in similar
      trouble.
      
      To avoid significant changes to the kvm_exit tracepoint, this patch
      moves the vmx and svm exit reason tables into arch/x86/kvm/trace.h and
      selects the right table with __print_symbolic() depending on the
      instruction set.  Note that __print_symbolic() is designed for exporting
      the formatting table to userspace and allows trace-cmd and perf to work.
      Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
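The idea of selecting a value-to-name table by instruction set can be modelled in plain C as below. This is a user-space illustration, not the tracepoint macro itself: `enum isa`, `struct sym`, and `exit_reason_name` are hypothetical, and each table lists only two exit reasons for brevity.

```c
#include <assert.h>
#include <stddef.h>

enum isa { ISA_VMX, ISA_SVM };  /* hypothetical selector */

struct sym {
    unsigned long value;
    const char   *name;   /* NULL terminates the table */
};

/* Abbreviated tables; the real ones cover every exit reason. */
static const struct sym vmx_exit_reasons[] = {
    { 10, "CPUID" },
    { 12, "HLT" },
    { 0, NULL },
};

static const struct sym svm_exit_reasons[] = {
    { 0x072, "CPUID" },
    { 0x078, "HLT" },
    { 0, NULL },
};

/* Returns the symbolic name, or NULL so the caller can print the number. */
static const char *exit_reason_name(enum isa isa, unsigned long reason)
{
    const struct sym *t;

    assert(isa == ISA_VMX || isa == ISA_SVM);
    t = (isa == ISA_VMX) ? vmx_exit_reasons : svm_exit_reasons;
    for (; t->name != NULL; t++)
        if (t->value == reason)
            return t->name;
    return NULL;
}
```

Because the tables are plain data, a tool that has a copy of them can decode a raw trace without calling back into the kernel, which is the property the commit relies on.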
    • KVM: x86: Raise the hard VCPU count limit · 8c3ba334
      Sasha Levin committed
      The patch raises the hard limit of VCPU count to 254.
      
      This will allow developers to easily work on scalability
      and will allow users to test high VCPU setups easily without
      patching the kernel.
      
      To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
      now returns the recommended VCPU limit (which is still 64) - this
      should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
      returns the hard limit which is now 254.
      
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Suggested-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
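A userspace VMM could combine the two capabilities along these lines. The sketch assumes the two integers were already obtained with `ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS)` and `ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS)`. The types and helper names are hypothetical, and the fallback values (4 when the recommended capability is absent, the recommended value when the hard limit is absent) follow the convention in the KVM API documentation as I understand it.

```c
#include <assert.h>

struct vcpu_limits {
    int recommended;  /* from KVM_CAP_NR_VCPUS */
    int hard_max;     /* from KVM_CAP_MAX_VCPUS */
};

enum vcpu_check { VCPU_OK, VCPU_ABOVE_RECOMMENDED, VCPU_REJECTED };

/* A result of 0 from KVM_CHECK_EXTENSION means "capability not present". */
static struct vcpu_limits resolve_vcpu_limits(int nr_vcpus_cap,
                                              int max_vcpus_cap)
{
    struct vcpu_limits l;

    l.recommended = nr_vcpus_cap > 0 ? nr_vcpus_cap : 4;
    l.hard_max = max_vcpus_cap > 0 ? max_vcpus_cap : l.recommended;
    return l;
}

static enum vcpu_check check_vcpu_count(int requested, struct vcpu_limits l)
{
    assert(l.hard_max >= l.recommended);
    if (requested < 1 || requested > l.hard_max)
        return VCPU_REJECTED;
    if (requested > l.recommended)
        return VCPU_ABOVE_RECOMMENDED;  /* allowed, but worth a warning */
    return VCPU_OK;
}
```

With the values from this commit (64 recommended, 254 hard limit), a request for 100 VCPUs would be accepted with a warning in this sketch, while 255 would be refused.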
  6. 24 Jul, 2011 · 3 commits
  7. 14 Jul, 2011 · 1 commit
    • KVM: Steal time implementation · c9aaa895
      Glauber Costa committed
      To implement steal time, we need the hypervisor to pass the guest
      information about how much time was spent running other processes
      outside the VM, while the vcpu had meaningful work to do - halt
      time does not count.
      
      This information is acquired through the run_delay field of
      delayacct/schedstats infrastructure, that counts time spent in a
      runqueue but not running.
      
      Steal time is per-cpu information, so the traditional MSR-based
      infrastructure is used. A new MSR, KVM_MSR_STEAL_TIME, holds the
      address of the memory area containing the steal time information.
      
      This patch contains the hypervisor part of the steal time infrastructure,
      and can be backported independently of the guest portion.
      
      [avi, yongjie: export delayacct_on, to avoid build failures in some configs]
      Signed-off-by: Glauber Costa <glommer@redhat.com>
      Tested-by: Eric B Munson <emunson@mgebm.net>
      CC: Rik van Riel <riel@redhat.com>
      CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: Yongjie Ren <yongjie.ren@intel.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
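The accounting step described in this commit, turning a monotonically growing run_delay value into accumulated steal time, can be sketched as follows. This is a simplified model: `struct steal_state` and both helpers are hypothetical names, and the shared-memory record that the guest reads through the MSR is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-VCPU bookkeeping on the hypervisor side. */
struct steal_state {
    uint64_t last_run_delay_ns;  /* run_delay seen at the previous update */
    uint64_t steal_ns;           /* total steal time reported to the guest */
};

/* Start accounting from the current run_delay so that time spent on the
 * runqueue before steal time was enabled is not reported. */
static void steal_time_init(struct steal_state *st, uint64_t run_delay_ns)
{
    st->last_run_delay_ns = run_delay_ns;
    st->steal_ns = 0;
}

/* run_delay counts time spent runnable but not running, so its growth
 * since the last update is exactly the newly stolen time. */
static void steal_time_update(struct steal_state *st, uint64_t run_delay_ns)
{
    assert(run_delay_ns >= st->last_run_delay_ns);
    st->steal_ns += run_delay_ns - st->last_run_delay_ns;
    st->last_run_delay_ns = run_delay_ns;
}
```

Halt time stays out of the total automatically in this model, because a halted VCPU thread is not waiting on a runqueue and run_delay does not advance.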
  8. 12 Jul, 2011 · 8 commits
  9. 22 May, 2011 · 6 commits
  10. 11 May, 2011 · 1 commit