1. 25 6月, 2012 1 次提交
  2. 19 6月, 2012 1 次提交
  3. 06 6月, 2012 1 次提交
  4. 05 6月, 2012 1 次提交
  5. 17 5月, 2012 1 次提交
  6. 06 5月, 2012 1 次提交
    • G
      KVM: ensure async PF event wakes up vcpu from halt · a4fa1635
      Gleb Natapov 提交于
      If vcpu executes hlt instruction while async PF is waiting to be delivered
      vcpu can block and deliver async PF only after another even wakes it
      up. This happens because kvm_check_async_pf_completion() will remove
      completion event from vcpu->async_pf.done before entering kvm_vcpu_block()
      and this will make kvm_arch_vcpu_runnable() return false. The solution
      is to make vcpu runnable when processing completion.
      Signed-off-by: NGleb Natapov <gleb@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      a4fa1635
  7. 21 4月, 2012 3 次提交
  8. 20 4月, 2012 1 次提交
    • A
      KVM: Fix page-crossing MMIO · f78146b0
      Avi Kivity 提交于
      MMIO that are split across a page boundary are currently broken - the
      code does not expect to be aborted by the exit to userspace for the
      first MMIO fragment.
      
      This patch fixes the problem by generalizing the current code for handling
      16-byte MMIOs to handle a number of "fragments", and changes the MMIO
      code to create those fragments.
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      f78146b0
  9. 08 4月, 2012 4 次提交
    • T
      KVM: Switch to srcu-less get_dirty_log() · 60c34612
      Takuya Yoshikawa 提交于
      We have seen some problems of the current implementation of
      get_dirty_log() which uses synchronize_srcu_expedited() for updating
      dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms
      order of latency when we use VGA displays.
      
      Furthermore the recent discussion on the following thread
          "srcu: Implement call_srcu()"
          http://lkml.org/lkml/2012/1/31/211
      also motivated us to implement get_dirty_log() without SRCU.
      
      This patch achieves this goal without sacrificing the performance of
      both VGA and live migration: in practice the new code is much faster
      than the old one unless we have too many dirty pages.
      
      Implementation:
      
      The key part of the implementation is the use of xchg() operation for
      clearing dirty bits atomically.  Since this allows us to update only
      BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap
      until every dirty bit is cleared again for the next call.
      
      Although some people may worry about the problem of using the atomic
      memory instruction many times to the concurrently accessible bitmap,
      it is usually accessed with mmu_lock held and we rarely see concurrent
      accesses: so what we need to care about is the pure xchg() overheads.
      
      Another point to note is that we do not use for_each_set_bit() to check
      which ones in each BITS_PER_LONG pages are actually dirty.  Instead we
      simply use __ffs() in a loop.  This is much faster than repeatedly call
      find_next_bit().
      
      Performance:
      
      The dirty-log-perf unit test showed nice improvements, some times faster
      than before, except for some extreme cases; for such cases the speed of
      getting dirty page information is much faster than we process it in the
      userspace.
      
      For real workloads, both VGA and live migration, we have observed pure
      improvements: when the guest was reading a file during live migration,
      we originally saw a few ms of latency, but with the new method the
      latency was less than 200us.
      Signed-off-by: NTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      60c34612
    • T
      KVM: Avoid checking huge page mappings in get_dirty_log() · 5dc99b23
      Takuya Yoshikawa 提交于
      Dropped such mappings when we enabled dirty logging and we will never
      create new ones until we stop the logging.
      
      For this we introduce a new function which can be used to write protect
      a range of PT level pages: although we do not need to care about a range
      of pages at this point, the following patch will need this feature to
      optimize the write protection of many pages.
      Signed-off-by: NTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      5dc99b23
    • E
      KVM: x86: Add ioctl for KVM_KVMCLOCK_CTRL · 1c0b28c2
      Eric B Munson 提交于
      Now that we have a flag that will tell the guest it was suspended, create an
      interface for that communication using a KVM ioctl.
      Signed-off-by: NEric B Munson <emunson@mgebm.net>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      1c0b28c2
    • C
      KVM: Factor out kvm_vcpu_kick to arch-generic code · b6d33834
      Christoffer Dall 提交于
      The kvm_vcpu_kick function performs roughly the same funcitonality on
      most all architectures, so we shouldn't have separate copies.
      
      PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch
      structure and to accomodate this special need a
      __KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function
      kvm_arch_vcpu_wq have been defined. For all other architectures this
      is a generic inline that just returns &vcpu->wq;
      Acked-by: NScott Wood <scottwood@freescale.com>
      Signed-off-by: NChristoffer Dall <c.dall@virtualopensystems.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b6d33834
  10. 20 3月, 2012 2 次提交
  11. 08 3月, 2012 15 次提交
    • N
      KVM: Ignore the writes to MSR_K7_HWCR(3) · a223c313
      Nicolae Mogoreanu 提交于
      When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest
      tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource
      and fails. This patch will ignore this step and allow things to move
      on without having to fake CPUID value.
      Signed-off-by: NNicolae Mogoreanu <mogoreanu@gmail.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      a223c313
    • J
      KVM: Allow host IRQ sharing for assigned PCI 2.3 devices · 07700a94
      Jan Kiszka 提交于
      PCI 2.3 allows to generically disable IRQ sources at device level. This
      enables us to share legacy IRQs of such devices with other host devices
      when passing them to a guest.
      
      The new IRQ sharing feature introduced here is optional, user space has
      to request it explicitly. Moreover, user space can inform us about its
      view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the
      interrupt and signaling it if the guest masked it via the virtualized
      PCI config space.
      Signed-off-by: NJan Kiszka <jan.kiszka@siemens.com>
      Acked-by: NAlex Williamson <alex.williamson@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      07700a94
    • A
      KVM: Ensure all vcpus are consistent with in-kernel irqchip settings · 3e515705
      Avi Kivity 提交于
      If some vcpus are created before KVM_CREATE_IRQCHIP, then
      irqchip_in_kernel() and vcpu->arch.apic will be inconsistent, leading
      to potential NULL pointer dereferences.
      
      Fix by:
      - ensuring that no vcpus are installed when KVM_CREATE_IRQCHIP is called
      - ensuring that a vcpu has an apic if it is installed after KVM_CREATE_IRQCHIP
      
      This is somewhat long winded because vcpu->arch.apic is created without
      kvm->lock held.
      
      Based on earlier patch by Michael Ellerman.
      Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      3e515705
    • K
      KVM: x86 emulator: Allow PM/VM86 switch during task switch · 4cee4798
      Kevin Wolf 提交于
      Task switches can switch between Protected Mode and VM86. The current
      mode must be updated during the task switch emulation so that the new
      segment selectors are interpreted correctly.
      
      In order to let privilege checks succeed, rflags needs to be updated in
      the vcpu struct as this causes a CPL update.
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4cee4798
    • K
      KVM: x86 emulator: Fix task switch privilege checks · 7f3d35fd
      Kevin Wolf 提交于
      Currently, all task switches check privileges against the DPL of the
      TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
      the DPL of this take gate is used for the check instead. Exceptions,
      external interrupts and iret shouldn't perform any check.
      
      [avi: kill kvm-kmod remnants]
      Signed-off-by: NKevin Wolf <kwolf@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      7f3d35fd
    • T
      KVM: Introduce kvm_memory_slot::arch and move lpage_info into it · db3fe4eb
      Takuya Yoshikawa 提交于
      Some members of kvm_memory_slot are not used by every architecture.
      
      This patch is the first step to make this difference clear by
      introducing kvm_memory_slot::arch;  lpage_info is moved into it.
      Signed-off-by: NTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      db3fe4eb
    • T
      KVM: Fix write protection race during dirty logging · 6dbf79e7
      Takuya Yoshikawa 提交于
      This patch fixes a race introduced by:
      
        commit 95d4c16c
        KVM: Optimize dirty logging by rmap_write_protect()
      
      During protecting pages for dirty logging, other threads may also try
      to protect a page in mmu_sync_children() or kvm_mmu_get_page().
      
      In such a case, because get_dirty_log releases mmu_lock before flushing
      TLB's, the following race condition can happen:
      
        A (get_dirty_log)     B (another thread)
      
        lock(mmu_lock)
        clear pte.w
        unlock(mmu_lock)
                              lock(mmu_lock)
                              pte.w is already cleared
                              unlock(mmu_lock)
                              skip TLB flush
                              return
        ...
        TLB flush
      
      Though thread B assumes the page has already been protected when it
      returns, the remaining TLB entry will break that assumption.
      
      This patch fixes this problem by making get_dirty_log hold the mmu_lock
      until it flushes the TLB's.
      Signed-off-by: NTakuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      6dbf79e7
    • Z
      KVM: Track TSC synchronization in generations · e26101b1
      Zachary Amsden 提交于
      This allows us to track the original nanosecond and counter values
      at each phase of TSC writing by the guest.  This gets us perfect
      offset matching for stable TSC systems, and perfect software
      computed TSC matching for machines with unstable TSC.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      e26101b1
    • Z
      KVM: Dont mark TSC unstable due to S4 suspend · 0dd6a6ed
      Zachary Amsden 提交于
      During a host suspend, TSC may go backwards, which KVM interprets
      as an unstable TSC.  Technically, KVM should not be marking the
      TSC unstable, which causes the TSC clocksource to go bad, but we
      need to be adjusting the TSC offsets in such a case.
      
      Dealing with this issue is a little tricky as the only place we
      can reliably do it is before much of the timekeeping infrastructure
      is up and running.  On top of this, we are not in a KVM thread
      context, so we may not be able to safely access VCPU fields.
      Instead, we compute our best known hardware offset at power-up and
      stash it to be applied to all VCPUs when they actually start running.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      0dd6a6ed
    • M
      KVM: Allow adjust_tsc_offset to be in host or guest cycles · f1e2b260
      Marcelo Tosatti 提交于
      Redefine the API to take a parameter indicating whether an
      adjustment is in host or guest cycles.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      f1e2b260
    • Z
      KVM: Add last_host_tsc tracking back to KVM · 6f526ec5
      Zachary Amsden 提交于
      The variable last_host_tsc was removed from upstream code.  I am adding
      it back for two reasons.  First, it is unnecessary to use guest TSC
      computation to conclude information about the host TSC.  The guest may
      set the TSC backwards (this case handled by the previous patch), but
      the computation of guest TSC (and fetching an MSR) is significanlty more
      work and complexity than simply reading the hardware counter.  In addition,
      we don't actually need the guest TSC for any part of the computation,
      by always recomputing the offset, we can eliminate the need to deal with
      the current offset and any scaling factors that may apply.
      
      The second reason is that later on, we are going to be using the host
      TSC value to restore TSC offsets after a host S4 suspend, so we need to
      be reading the host values, not the guest values here.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      6f526ec5
    • Z
      KVM: Fix last_guest_tsc / tsc_offset semantics · b183aa58
      Zachary Amsden 提交于
      The variable last_guest_tsc was being used as an ad-hoc indicator
      that guest TSC has been initialized and recorded correctly.  However,
      it may not have been, it could be that guest TSC has been set to some
      large value, the back to a small value (by, say, a software reboot).
      
      This defeats the logic and causes KVM to falsely assume that the
      guest TSC has gone backwards, marking the host TSC unstable, which
      is undesirable behavior.
      
      In addition, rather than try to compute an offset adjustment for the
      TSC on unstable platforms, just recompute the whole offset.  This
      allows us to get rid of one callsite for adjust_tsc_offset, which
      is problematic because the units it takes are in guest units, but
      here, the computation was originally being done in host units.
      
      Doing this, and also recording last_guest_tsc when the TSC is written
      allow us to remove the tricky logic which depended on last_guest_tsc
      being zero to indicate a reset of uninitialized value.
      
      Instead, we now have the guarantee that the guest TSC offset is
      always at least something which will get us last_guest_tsc.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      b183aa58
    • Z
      KVM: Leave TSC synchronization window open with each new sync · 4dd7980b
      Zachary Amsden 提交于
      Currently, when the TSC is written by the guest, the variable
      ns is updated to force the current write to appear to have taken
      place at the time of the first write in this sync phase.  This
      leaves a cliff at the end of the match window where updates will
      fall of the end.  There are two scenarios where this can be a
      problem in practe - first, on a system with a large number of
      VCPUs, the sync period may last for an extended period of time.
      
      The second way this can happen is if the VM reboots very rapidly
      and we catch a VCPU TSC synchronization just around the edge.
      We may be unaware of the reboot, and thus the first VCPU might
      synchronize with an old set of the timer (at, say 0.97 seconds
      ago, when first powered on).  The second VCPU can come in 0.04
      seconds later to try to synchronize, but it misses the window
      because it is just over the threshold.
      
      Instead, stop doing this artificial setback of the ns variable
      and just update it with every write of the TSC.
      
      It may be observed that doing so causes values computed by
      compute_guest_tsc to diverge slightly across CPUs - note that
      the last_tsc_ns and last_tsc_write variable are used here, and
      now they last_tsc_ns will be different for each VCPU, reflecting
      the actual time of the update.
      
      However, compute_guest_tsc is used only for guests which already
      have TSC stability issues, and further, note that the previous
      patch has caused last_tsc_write to be incremented by the difference
      in nanoseconds, converted back into guest cycles.  As such, only
      boundary rounding errors should be visible, which given the
      resolution in nanoseconds, is going to only be a few cycles and
      only visible in cross-CPU consistency tests.  The problem can be
      fixed by adding a new set of variables to track the start offset
      and start write value for the current sync cycle.
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      4dd7980b
    • Z
      KVM: Improve TSC offset matching · 5d3cb0f6
      Zachary Amsden 提交于
      There are a few improvements that can be made to the TSC offset
      matching code.  First, we don't need to call the 128-bit multiply
      (especially on a constant number), the code works much nicer to
      do computation in nanosecond units.
      
      Second, the way everything is setup with software TSC rate scaling,
      we currently have per-cpu rates.  Obviously this isn't too desirable
      to use in practice, but if for some reason we do change the rate of
      all VCPUs at runtime, then reset the TSCs, we will only want to
      match offsets for VCPUs running at the same rate.
      
      Finally, for the case where we have an unstable host TSC, but
      rate scaling is being done in hardware, we should call the platform
      code to compute the TSC offset, so the math is reorganized to recompute
      the base instead, then transform the base into an offset using the
      existing API.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      
      KVM: Fix 64-bit division in kvm_write_tsc()
      
      Breaks i386 build.
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      5d3cb0f6
    • Z
      KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287
      Zachary Amsden 提交于
      This requires some restructuring; rather than use 'virtual_tsc_khz'
      to indicate whether hardware rate scaling is in effect, we consider
      each VCPU to always have a virtual TSC rate.  Instead, there is new
      logic above the vendor-specific hardware scaling that decides whether
      it is even necessary to use and updates all rate variables used by
      common code.  This means we can simply query the virtual rate at
      any point, which is needed for software rate scaling.
      
      There is also now a threshold added to the TSC rate scaling; minor
      differences and variations of measured TSC rate can accidentally
      provoke rate scaling to be used when it is not needed.  Instead,
      we have a tolerance variable called tsc_tolerance_ppm, which is
      the maximum variation from user requested rate at which scaling
      will be used.  The default is 250ppm, which is the half the
      threshold for NTP adjustment, allowing for some hardware variation.
      
      In the event that hardware rate scaling is not available, we can
      kludge a bit by forcing TSC catchup to turn on when a faster than
      hardware speed has been requested, but there is nothing available
      yet for the reverse case; this requires a trap and emulate software
      implementation for RDTSC, which is still forthcoming.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: NZachary Amsden <zamsden@gmail.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      cc578287
  12. 05 3月, 2012 3 次提交
  13. 22 2月, 2012 1 次提交
    • L
      i387: Split up <asm/i387.h> into exported and internal interfaces · 1361b83a
      Linus Torvalds 提交于
      While various modules include <asm/i387.h> to get access to things we
      actually *intend* for them to use, most of that header file was really
      pretty low-level internal stuff that we really don't want to expose to
      others.
      
      So split the header file into two: the small exported interfaces remain
      in <asm/i387.h>, while the internal definitions that are only used by
      core architecture code are now in <asm/fpu-internal.h>.
      
      The guiding principle for this was to expose functions that we export to
      modules, and leave them in <asm/i387.h>, while stuff that is used by
      task switching or was marked GPL-only is in <asm/fpu-internal.h>.
      
      The fpu-internal.h file could be further split up too, especially since
      arch/x86/kvm/ uses some of the remaining stuff for its module.  But that
      kvm usage should probably be abstracted out a bit, and at least now the
      internal FPU accessor functions are much more contained.  Even if it
      isn't perhaps as contained as it _could_ be.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1202211340330.5354@i5.linux-foundation.orgSigned-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      1361b83a
  14. 01 2月, 2012 2 次提交
  15. 13 1月, 2012 1 次提交
  16. 27 12月, 2011 2 次提交