1. 08 Mar 2012: 9 commits
    • KVM: Fix write protection race during dirty logging · 6dbf79e7
      Takuya Yoshikawa committed
      This patch fixes a race introduced by:
      
        commit 95d4c16c
        KVM: Optimize dirty logging by rmap_write_protect()
      
      During protecting pages for dirty logging, other threads may also try
      to protect a page in mmu_sync_children() or kvm_mmu_get_page().
      
      In such a case, because get_dirty_log releases mmu_lock before flushing
      the TLBs, the following race condition can happen:
      
        A (get_dirty_log)     B (another thread)
      
        lock(mmu_lock)
        clear pte.w
        unlock(mmu_lock)
                              lock(mmu_lock)
                              pte.w is already cleared
                              unlock(mmu_lock)
                              skip TLB flush
                              return
        ...
        TLB flush
      
      Though thread B assumes the page has already been protected when it
      returns, the remaining TLB entry will break that assumption.
      
      This patch fixes this problem by making get_dirty_log hold the mmu_lock
      until it flushes the TLBs.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Track TSC synchronization in generations · e26101b1
      Zachary Amsden committed
      This allows us to track the original nanosecond and counter values
      at each phase of TSC writing by the guest.  This gets us perfect
      offset matching for stable TSC systems, and perfect software
      computed TSC matching for machines with unstable TSC.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Dont mark TSC unstable due to S4 suspend · 0dd6a6ed
      Zachary Amsden committed
      During a host suspend, TSC may go backwards, which KVM interprets
      as an unstable TSC.  Technically, KVM should not be marking the
      TSC unstable, which causes the TSC clocksource to go bad, but we
      need to be adjusting the TSC offsets in such a case.
      
      Dealing with this issue is a little tricky as the only place we
      can reliably do it is before much of the timekeeping infrastructure
      is up and running.  On top of this, we are not in a KVM thread
      context, so we may not be able to safely access VCPU fields.
      Instead, we compute our best known hardware offset at power-up and
      stash it to be applied to all VCPUs when they actually start running.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Allow adjust_tsc_offset to be in host or guest cycles · f1e2b260
      Marcelo Tosatti committed
      Redefine the API to take a parameter indicating whether an
      adjustment is in host or guest cycles.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Add last_host_tsc tracking back to KVM · 6f526ec5
      Zachary Amsden committed
      The variable last_host_tsc was removed from upstream code.  I am adding
      it back for two reasons.  First, it is unnecessary to use guest TSC
      computation to infer information about the host TSC.  The guest may
      set the TSC backwards (this case is handled by the previous patch), but
      computing the guest TSC (and fetching an MSR) is significantly more
      work and complexity than simply reading the hardware counter.  In addition,
      we don't actually need the guest TSC for any part of the computation;
      by always recomputing the offset, we can eliminate the need to deal with
      the current offset and any scaling factors that may apply.
      
      The second reason is that later on, we are going to be using the host
      TSC value to restore TSC offsets after a host S4 suspend, so we need to
      be reading the host values, not the guest values here.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Fix last_guest_tsc / tsc_offset semantics · b183aa58
      Zachary Amsden committed
      The variable last_guest_tsc was being used as an ad-hoc indicator
      that guest TSC has been initialized and recorded correctly.  However,
      it may not have been: the guest TSC could have been set to some
      large value, then back to a small value (by, say, a software reboot).
      
      This defeats the logic and causes KVM to falsely assume that the
      guest TSC has gone backwards, marking the host TSC unstable, which
      is undesirable behavior.
      
      In addition, rather than try to compute an offset adjustment for the
      TSC on unstable platforms, just recompute the whole offset.  This
      allows us to get rid of one callsite for adjust_tsc_offset, which
      is problematic because the units it takes are in guest units, but
      here, the computation was originally being done in host units.
      
      Doing this, and also recording last_guest_tsc when the TSC is written,
      allows us to remove the tricky logic which depended on last_guest_tsc
      being zero to indicate a reset or uninitialized value.
      
      Instead, we now have the guarantee that the guest TSC offset is
      always at least something which will get us last_guest_tsc.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Leave TSC synchronization window open with each new sync · 4dd7980b
      Zachary Amsden committed
      Currently, when the TSC is written by the guest, the variable
      ns is updated to force the current write to appear to have taken
      place at the time of the first write in this sync phase.  This
      leaves a cliff at the end of the match window where updates will
      fall off the end.  There are two scenarios where this can be a
      problem in practice.  First, on a system with a large number of
      VCPUs, the sync period may last for an extended period of time.
      
      The second way this can happen is if the VM reboots very rapidly
      and we catch a VCPU TSC synchronization just around the edge.
      We may be unaware of the reboot, and thus the first VCPU might
      synchronize with an old setting of the timer (at, say, 0.97 seconds
      ago, when first powered on).  The second VCPU can come in 0.04
      seconds later to try to synchronize, but it misses the window
      because it is just over the threshold.
      
      Instead, stop doing this artificial setback of the ns variable
      and just update it with every write of the TSC.
      
      It may be observed that doing so causes values computed by
      compute_guest_tsc to diverge slightly across CPUs: note that
      the last_tsc_ns and last_tsc_write variables are used here, and
      now last_tsc_ns will be different for each VCPU, reflecting
      the actual time of the update.
      
      However, compute_guest_tsc is used only for guests which already
      have TSC stability issues, and further, note that the previous
      patch has caused last_tsc_write to be incremented by the difference
      in nanoseconds, converted back into guest cycles.  As such, only
      boundary rounding errors should be visible, which given the
      resolution in nanoseconds, is going to only be a few cycles and
      only visible in cross-CPU consistency tests.  The problem can be
      fixed by adding a new set of variables to track the start offset
      and start write value for the current sync cycle.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Improve TSC offset matching · 5d3cb0f6
      Zachary Amsden committed
      There are a few improvements that can be made to the TSC offset
      matching code.  First, we don't need to call the 128-bit multiply
      (especially on a constant number); the code is much cleaner when
      the computation is done in nanosecond units.
      
      Second, the way everything is set up with software TSC rate scaling,
      we currently have per-cpu rates.  Obviously this isn't too desirable
      to use in practice, but if for some reason we do change the rate of
      all VCPUs at runtime, then reset the TSCs, we will only want to
      match offsets for VCPUs running at the same rate.
      
      Finally, for the case where we have an unstable host TSC, but
      rate scaling is being done in hardware, we should call the platform
      code to compute the TSC offset, so the math is reorganized to recompute
      the base instead, then transform the base into an offset using the
      existing API.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      
      KVM: Fix 64-bit division in kvm_write_tsc()
      
      Breaks i386 build.
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287
      Zachary Amsden committed
      This requires some restructuring; rather than use 'virtual_tsc_khz'
      to indicate whether hardware rate scaling is in effect, we consider
      each VCPU to always have a virtual TSC rate.  Instead, there is new
      logic above the vendor-specific hardware scaling that decides whether
      it is even necessary to use and updates all rate variables used by
      common code.  This means we can simply query the virtual rate at
      any point, which is needed for software rate scaling.
      
      There is also now a threshold added to the TSC rate scaling; minor
      differences and variations of measured TSC rate can accidentally
      provoke rate scaling to be used when it is not needed.  Instead,
      we have a tolerance variable called tsc_tolerance_ppm, which is
      the maximum variation from user requested rate at which scaling
      will be used.  The default is 250ppm, which is half the
      threshold for NTP adjustment, allowing for some hardware variation.
      
      In the event that hardware rate scaling is not available, we can
      kludge a bit by forcing TSC catchup to turn on when a faster than
      hardware speed has been requested, but there is nothing available
      yet for the reverse case; this requires a trap-and-emulate software
      implementation for RDTSC, which is still forthcoming.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  2. 05 Mar 2012: 3 commits
  3. 01 Feb 2012: 2 commits
  4. 13 Jan 2012: 1 commit
  5. 27 Dec 2011: 18 commits
  6. 26 Dec 2011: 1 commit
    • KVM: Don't automatically expose the TSC deadline timer in cpuid · 4d25a066
      Jan Kiszka committed
      Unlike all of the other cpuid bits, the TSC deadline timer bit is set
      unconditionally, regardless of what userspace wants.
      
      This is broken in several ways:
       - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
         deadline timer feature, a guest that uses the feature will break
       - live migration to older host kernels that don't support the TSC deadline
         timer will cause the feature to be pulled from under the guest's feet;
         breaking it
       - guests that are broken wrt the feature will fail.
      
      Fix by not enabling the feature automatically; instead report it to userspace.
      Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
      will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
      KVM_GET_SUPPORTED_CPUID.
      
      Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
      
      [avi: add the KVM_CAP + documentation]
      Reported-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
  7. 21 Oct 2011: 1 commit
    • iommu/core: Convert iommu_found to iommu_present · a1b60c1c
      Joerg Roedel committed
      With per-bus iommu_ops the iommu_found function needs to
      work on a bus_type too. This patch adds a bus_type parameter
      to that function and converts all call-places.
      The function is also renamed to iommu_present because the
      function now checks if an iommu is present for a given bus
      and does not check for a global iommu anymore.
      Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
  8. 05 Oct 2011: 1 commit
  9. 26 Sep 2011: 4 commits
    • KVM: Fix simultaneous NMIs · 7460fb4a
      Avi Kivity committed
      If simultaneous NMIs happen, we're supposed to queue the second
      and next (collapsing them), but currently we sometimes collapse
      the second into the first.
      
      Fix by using a counter for pending NMIs instead of a bool; since
      the counter limit depends on whether the processor is currently
      in an NMI handler, which can only be checked in vcpu context
      (via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
      of the counter.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
    • KVM: L1 TSC handling · d5c1785d
      Nadav Har'El committed
      KVM assumed in several places that reading the TSC MSR returns the value for
      L1. This is incorrect, because when L2 is running, the correct TSC read exit
      emulation is to return L2's value.
      
      We therefore add a new x86_ops function, read_l1_tsc, to use in places that
      specifically need to read the L1 TSC, NOT the TSC of the current level of
      guest.
      
      Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
      by a different patch sent by Zachary Amsden (and not yet applied):
      kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
      course we didn't have to change the call of kvm_get_msr() to read_l1_tsc().
      
      [avi: moved callback to kvm_x86_ops tsc block]
      Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
      Acked-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86: report valid microcode update ID · 742bc670
      Marcelo Tosatti committed
      Windows Server 2008 SP2 checked build with smp > 1 BSODs during
      boot due to lack of a microcode update:
      
      *** Assertion failed: The system BIOS on this machine does not properly
      support the processor.  The system BIOS did not load any microcode update.
      A BIOS containing the latest microcode update is needed for system reliability.
      (CurrentUpdateRevision != 0)
      ***   Source File: d:\longhorn\base\hals\update\intelupd\update.c, line 440
      
      Report a non-zero microcode update signature to make it happy.
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
    • KVM: x86 emulator: Make x86_decode_insn() return proper macros · 1d2887e2
      Takuya Yoshikawa committed
      Return EMULATION_OK/FAILED consistently.  Also treat instruction fetch
      errors, not restricted to X86EMUL_UNHANDLEABLE, as EMULATION_FAILED;
      although this cannot happen in practice, the current logic will continue
      the emulation even if the decoder fails to fetch the instruction.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Avi Kivity <avi@redhat.com>