1. 08 March 2012, 14 commits
    • KVM: x86 emulator: Fix task switch privilege checks · 7f3d35fd
      Committed by Kevin Wolf
      Currently, all task switches check privileges against the DPL of the
      TSS. This is only correct for jmp/call to a TSS. If a task gate is used,
      the DPL of this task gate is used for the check instead. Exceptions,
      external interrupts and iret shouldn't perform any check.
      
      [avi: kill kvm-kmod remnants]
      Signed-off-by: Kevin Wolf <kwolf@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      7f3d35fd
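The privilege rule described above can be sketched as a small model (a hypothetical Python illustration, not the kernel's actual code; all names are invented for clarity):

```python
# Toy model of the task-switch privilege check described in the commit:
# jmp/call check against the TSS DPL, a task gate is checked against the
# gate's own DPL, and exceptions/external interrupts/iret skip the check.
JMP, CALL, GATE, EXCEPTION, EXT_INT, IRET = range(6)

def task_switch_allowed(reason, cpl, rpl, tss_dpl, gate_dpl=None):
    """Return True if the task switch passes the privilege check."""
    if reason in (EXCEPTION, EXT_INT, IRET):
        return True                        # no privilege check for these
    dpl = gate_dpl if reason == GATE else tss_dpl
    return cpl <= dpl and rpl <= dpl       # standard CPL/RPL vs. DPL check

# A user-mode jmp to a DPL-0 TSS fails, but going through a DPL-3 task
# gate to the same TSS succeeds, and iret is never checked.
assert task_switch_allowed(JMP, cpl=3, rpl=3, tss_dpl=0) is False
assert task_switch_allowed(GATE, cpl=3, rpl=3, tss_dpl=0, gate_dpl=3) is True
assert task_switch_allowed(IRET, cpl=3, rpl=3, tss_dpl=0) is True
```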
    • KVM: Introduce kvm_memory_slot::arch and move lpage_info into it · db3fe4eb
      Committed by Takuya Yoshikawa
      Some members of kvm_memory_slot are not used by every architecture.
      
      This patch is the first step to make this difference clear by
      introducing kvm_memory_slot::arch;  lpage_info is moved into it.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      db3fe4eb
    • KVM: Introduce gfn_to_index() which returns the index for a given level · fb03cb6f
      Committed by Takuya Yoshikawa
      This patch cleans up the code and removes the "(void)level;" warning
      suppressor.
      
      Note that we can also use this for PT_PAGE_TABLE_LEVEL to treat every
      level uniformly later.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      fb03cb6f
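The semantics of a gfn_to_index()-style helper can be illustrated as follows (a hedged Python sketch assuming x86's 9 bits of page-table index per level; constants and names are illustrative, not the kernel's):

```python
# Sketch: the index of a guest frame number (gfn) within a memslot at a
# given large-page level is the gfn and the slot base, both shifted down
# by the level's huge-page shift, subtracted.
def hpage_gfn_shift(level):
    return (level - 1) * 9          # 9 index bits per level on x86

def gfn_to_index(gfn, base_gfn, level):
    return (gfn >> hpage_gfn_shift(level)) - (base_gfn >> hpage_gfn_shift(level))

# At level 1 every gfn gets its own index; at level 2 (2MB pages on
# x86-64) 512 consecutive gfns share one index.
assert gfn_to_index(0x205, 0x200, 1) == 5
assert gfn_to_index(0x205, 0x200, 2) == 0
```

Note how level 1 reduces to a plain subtraction, which is why the same helper can later treat PT_PAGE_TABLE_LEVEL uniformly with the huge-page levels.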
    • KVM: Fix write protection race during dirty logging · 6dbf79e7
      Committed by Takuya Yoshikawa
      This patch fixes a race introduced by:
      
        commit 95d4c16c
        KVM: Optimize dirty logging by rmap_write_protect()
      
      During protecting pages for dirty logging, other threads may also try
      to protect a page in mmu_sync_children() or kvm_mmu_get_page().
      
      In such a case, because get_dirty_log releases mmu_lock before flushing
      TLBs, the following race condition can happen:
      
        A (get_dirty_log)     B (another thread)
      
        lock(mmu_lock)
        clear pte.w
        unlock(mmu_lock)
                              lock(mmu_lock)
                              pte.w is already cleared
                              unlock(mmu_lock)
                              skip TLB flush
                              return
        ...
        TLB flush
      
      Though thread B assumes the page has already been protected when it
      returns, the remaining TLB entry will break that assumption.
      
      This patch fixes this problem by making get_dirty_log hold the mmu_lock
      until it flushes the TLBs.
      Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      6dbf79e7
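The fixed ordering can be modeled in a toy form (a hypothetical Python sketch of the invariant, not KVM code; the dict fields stand in for the pte write bit and TLB state):

```python
import threading

# Toy model of the fix: get_dirty_log keeps mmu_lock held across the TLB
# flush, so any other thread that observes "pte.w already cleared" under
# the lock can safely skip its own flush -- no stale TLB entry can remain.
mmu_lock = threading.Lock()
state = {"pte_writable": True, "tlb_flushed": False}

def get_dirty_log_fixed():
    with mmu_lock:                      # lock held until AFTER the flush
        state["pte_writable"] = False   # clear pte.w
        state["tlb_flushed"] = True     # flush TLBs before dropping lock

def other_thread_protect():
    with mmu_lock:
        if not state["pte_writable"]:
            # Skipping the flush is safe only because the flush already
            # happened under this same lock.
            assert state["tlb_flushed"]

get_dirty_log_fixed()
other_thread_protect()
```

In the buggy ordering, the flush happened after the `with mmu_lock:` block, so the assertion in the second thread could fire.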
    • KVM: VMX: remove yield_on_hlt · 10166744
      Committed by Raghavendra K T
      yield_on_hlt was introduced for CPU bandwidth capping. It is now
      redundant with the CFS hard limit.
      
      yield_on_hlt also complicates scenarios in paravirtualized environments
      that need to trap halt, e.g. paravirtualized ticket spinlocks.
      Acked-by: Anthony Liguori <aliguori@us.ibm.com>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      10166744
    • KVM: Track TSC synchronization in generations · e26101b1
      Committed by Zachary Amsden
      This allows us to track the original nanosecond and counter values
      at each phase of TSC writing by the guest.  This gets us perfect
      offset matching for stable TSC systems, and perfect software
      computed TSC matching for machines with unstable TSC.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      e26101b1
    • KVM: Dont mark TSC unstable due to S4 suspend · 0dd6a6ed
      Committed by Zachary Amsden
      During a host suspend, TSC may go backwards, which KVM interprets
      as an unstable TSC.  Technically, KVM should not be marking the
      TSC unstable, which causes the TSC clocksource to go bad, but we
      need to be adjusting the TSC offsets in such a case.
      
      Dealing with this issue is a little tricky as the only place we
      can reliably do it is before much of the timekeeping infrastructure
      is up and running.  On top of this, we are not in a KVM thread
      context, so we may not be able to safely access VCPU fields.
      Instead, we compute our best known hardware offset at power-up and
      stash it to be applied to all VCPUs when they actually start running.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      0dd6a6ed
    • KVM: Allow adjust_tsc_offset to be in host or guest cycles · f1e2b260
      Committed by Marcelo Tosatti
      Redefine the API to take a parameter indicating whether an
      adjustment is in host or guest cycles.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      f1e2b260
    • KVM: Add last_host_tsc tracking back to KVM · 6f526ec5
      Committed by Zachary Amsden
      The variable last_host_tsc was removed from upstream code.  I am adding
      it back for two reasons.  First, it is unnecessary to use guest TSC
      computation to conclude information about the host TSC.  The guest may
      set the TSC backwards (this case is handled by the previous patch), but
      the computation of guest TSC (and fetching an MSR) is significantly more
      work and complexity than simply reading the hardware counter.  In addition,
      we don't actually need the guest TSC for any part of the computation;
      by always recomputing the offset, we can eliminate the need to deal with
      the current offset and any scaling factors that may apply.
      
      The second reason is that later on, we are going to be using the host
      TSC value to restore TSC offsets after a host S4 suspend, so we need to
      be reading the host values, not the guest values here.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      6f526ec5
    • KVM: Fix last_guest_tsc / tsc_offset semantics · b183aa58
      Committed by Zachary Amsden
      The variable last_guest_tsc was being used as an ad-hoc indicator
      that guest TSC has been initialized and recorded correctly.  However,
      it may not have been: the guest TSC could have been set to some
      large value, then back to a small value (by, say, a software reboot).
      
      This defeats the logic and causes KVM to falsely assume that the
      guest TSC has gone backwards, marking the host TSC unstable, which
      is undesirable behavior.
      
      In addition, rather than try to compute an offset adjustment for the
      TSC on unstable platforms, just recompute the whole offset.  This
      allows us to get rid of one callsite for adjust_tsc_offset, which
      is problematic because the units it takes are in guest units, but
      here, the computation was originally being done in host units.
      
      Doing this, and also recording last_guest_tsc when the TSC is written,
      allows us to remove the tricky logic which depended on last_guest_tsc
      being zero to indicate a reset or an uninitialized value.
      
      Instead, we now have the guarantee that the guest TSC offset is
      always at least something which will get us last_guest_tsc.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      b183aa58
    • KVM: Leave TSC synchronization window open with each new sync · 4dd7980b
      Committed by Zachary Amsden
      Currently, when the TSC is written by the guest, the variable
      ns is updated to force the current write to appear to have taken
      place at the time of the first write in this sync phase.  This
      leaves a cliff at the end of the match window where updates will
      fall off the end.  There are two scenarios where this can be a
      problem in practice - first, on a system with a large number of
      VCPUs, the sync period may last for an extended period of time.
      
      The second way this can happen is if the VM reboots very rapidly
      and we catch a VCPU TSC synchronization just around the edge.
      We may be unaware of the reboot, and thus the first VCPU might
      synchronize with an old set of the timer (at, say 0.97 seconds
      ago, when first powered on).  The second VCPU can come in 0.04
      seconds later to try to synchronize, but it misses the window
      because it is just over the threshold.
      
      Instead, stop doing this artificial setback of the ns variable
      and just update it with every write of the TSC.
      
      It may be observed that doing so causes values computed by
      compute_guest_tsc to diverge slightly across CPUs - note that
      the last_tsc_ns and last_tsc_write variables are used here, and
      now last_tsc_ns will be different for each VCPU, reflecting
      the actual time of the update.
      
      However, compute_guest_tsc is used only for guests which already
      have TSC stability issues, and further, note that the previous
      patch has caused last_tsc_write to be incremented by the difference
      in nanoseconds, converted back into guest cycles.  As such, only
      boundary rounding errors should be visible, which given the
      resolution in nanoseconds, is going to only be a few cycles and
      only visible in cross-CPU consistency tests.  The problem can be
      fixed by adding a new set of variables to track the start offset
      and start write value for the current sync cycle.
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      4dd7980b
    • KVM: Improve TSC offset matching · 5d3cb0f6
      Committed by Zachary Amsden
      There are a few improvements that can be made to the TSC offset
      matching code.  First, we don't need to call the 128-bit multiply
      (especially on a constant number); the code works much more nicely
      when the computation is done in nanosecond units.
      
      Second, the way everything is setup with software TSC rate scaling,
      we currently have per-cpu rates.  Obviously this isn't too desirable
      to use in practice, but if for some reason we do change the rate of
      all VCPUs at runtime, then reset the TSCs, we will only want to
      match offsets for VCPUs running at the same rate.
      
      Finally, for the case where we have an unstable host TSC, but
      rate scaling is being done in hardware, we should call the platform
      code to compute the TSC offset, so the math is reorganized to recompute
      the base instead, then transform the base into an offset using the
      existing API.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      
      KVM: Fix 64-bit division in kvm_write_tsc()
      
      Breaks i386 build.
      Signed-off-by: Avi Kivity <avi@redhat.com>
      5d3cb0f6
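Matching in nanosecond units rather than via a 128-bit multiply might look roughly like this (a hedged Python sketch of the idea only; the window size, names, and exact matching rule are assumptions, not KVM's actual code):

```python
# Sketch: a guest TSC write "matches" the previous sync if it lands close
# to where the previous write, advanced by the elapsed wall-clock time at
# the VCPU's virtual rate, would now be.  Working in nanoseconds avoids
# the 128-bit cycle arithmetic the commit mentions.
def tsc_write_matches(data, last_write, elapsed_ns, virtual_tsc_khz):
    expected = last_write + elapsed_ns * virtual_tsc_khz // 1_000_000
    window = virtual_tsc_khz * 1000     # one second of cycles at this rate
    return abs(data - expected) < window

# 2 GHz virtual rate, write arriving 0.5 s after the last one, with the
# value the guest "should" have reached by now.
khz = 2_000_000
assert tsc_write_matches(1_000_000_000, 0, 500_000_000, khz)
```

Per the commit, the rate also has to match: offsets are only matched between VCPUs running at the same virtual TSC rate.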
    • KVM: Infrastructure for software and hardware based TSC rate scaling · cc578287
      Committed by Zachary Amsden
      This requires some restructuring; rather than use 'virtual_tsc_khz'
      to indicate whether hardware rate scaling is in effect, we consider
      each VCPU to always have a virtual TSC rate.  Instead, there is new
      logic above the vendor-specific hardware scaling that decides whether
      it is even necessary to use and updates all rate variables used by
      common code.  This means we can simply query the virtual rate at
      any point, which is needed for software rate scaling.
      
      There is also now a threshold added to the TSC rate scaling; minor
      differences and variations of the measured TSC rate can accidentally
      provoke rate scaling to be used when it is not needed.  Instead,
      we have a tolerance variable called tsc_tolerance_ppm, which is
      the maximum variation from the user-requested rate at which scaling
      will be used.  The default is 250ppm, which is half the
      threshold for NTP adjustment, allowing for some hardware variation.
      
      In the event that hardware rate scaling is not available, we can
      kludge a bit by forcing TSC catchup to turn on when a faster than
      hardware speed has been requested, but there is nothing available
      yet for the reverse case; this requires a trap and emulate software
      implementation for RDTSC, which is still forthcoming.
      
      [avi: fix 64-bit division on i386]
      Signed-off-by: Zachary Amsden <zamsden@gmail.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Avi Kivity <avi@redhat.com>
      cc578287
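The 250ppm tolerance rule reduces to a one-line comparison (an illustrative Python sketch; the function name and integer-ppm rounding are assumptions, only the 250ppm default comes from the commit):

```python
# Sketch of the tsc_tolerance_ppm rule: engage rate scaling only when the
# host TSC rate differs from the user-requested rate by more than the
# tolerance, expressed in parts per million.
tsc_tolerance_ppm = 250     # default from the commit: half the NTP threshold

def needs_rate_scaling(host_tsc_khz, requested_tsc_khz):
    if requested_tsc_khz == 0:
        return False        # no user-requested rate: nothing to match
    diff_ppm = abs(host_tsc_khz - requested_tsc_khz) * 1_000_000 // requested_tsc_khz
    return diff_ppm > tsc_tolerance_ppm

assert not needs_rate_scaling(2_000_000, 2_000_300)   # ~150 ppm: tolerated
assert needs_rate_scaling(2_000_000, 2_001_000)       # ~500 ppm: scale
```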
  2. 05 March 2012, 19 commits
  3. 22 February 2012, 1 commit
  4. 21 February 2012, 5 commits
    • i387: export 'fpu_owner_task' per-cpu variable · 27e74da9
      Committed by Linus Torvalds
      (And define it properly for x86-32, which had its 'current_task'
      declaration separate from x86-64's)
      
      Bitten by my dislike for modules on the machines I use, and the fact
      that apparently nobody else actually wanted to test the patches I sent
      out.
      
      Snif. Nobody else cares.
      
      Anyway, we probably should uninline the 'kernel_fpu_begin()' function
      that is what modules actually use and that references this, but this is
      the minimal fix for now.
      Reported-by: Josh Boyer <jwboyer@gmail.com>
      Reported-and-tested-by: Jongman Heo <jongman.heo@samsung.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27e74da9
    • x86: Specify a size for the cmp in the NMI handler · a38449ef
      Committed by Steven Rostedt
      Linus noticed that the cmp used to check if the code segment is
      __KERNEL_CS or not did not specify a size. Perhaps it does not matter,
      as H. Peter Anvin noted that user space cannot set the bottom two
      bits of the %cs register. But it's best not to let the assembler
      choose, possibly changing behavior between different versions of gas,
      and instead just pick the size.
      
      Four bytes are used to compare the saved code segment against
      __KERNEL_CS. Perhaps this might mess up Xen, but we can fix that when
      the time comes.
      
      Also, I noticed that there was another non-specified cmp that checks
      whether the special stack variable is 1 or 0. It probably doesn't
      matter which cmp is used there either, but this patch uses cmpl just
      to make it unambiguous.
      
      Link: http://lkml.kernel.org/r/CA+55aFxfAn9MWRgS3O5k2tqN5ys1XrhSFVO5_9ZAoZKDVgNfGA@mail.gmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      a38449ef
    • i387: support lazy restore of FPU state · 7e16838d
      Committed by Linus Torvalds
      This makes us recognize when we try to restore FPU state that matches
      what we already have in the FPU on this CPU, and avoids the restore
      entirely if so.
      
      To do this, we add two new data fields:
      
       - a percpu 'fpu_owner_task' variable that gets written any time we
         update the "has_fpu" field, and thus acts as a kind of back-pointer
         to the task that owns the CPU.  The exception is when we save the FPU
         state as part of a context switch - if the save can keep the FPU
         state around, we leave the 'fpu_owner_task' variable pointing at the
         task whose FP state still remains on the CPU.
      
       - a per-thread 'last_cpu' field, that indicates which CPU that thread
         used its FPU on last.  We update this on every context switch
         (writing an invalid CPU number if the last context switch didn't
         leave the FPU in a lazily usable state), so we know that *that*
         thread has done nothing else with the FPU since.
      
      These two fields together can be used when next switching back to the
      task to see if the CPU still matches: if 'fpu_owner_task' matches the
      task we are switching to, we know that no other task (or kernel FPU
      usage) touched the FPU on this CPU in the meantime, and if the current
      CPU number matches the 'last_cpu' field, we know that this thread did no
      other FP work on any other CPU, so the FPU state on the CPU must match
      what was saved on last context switch.
      
      In that case, we can avoid the 'f[x]rstor' entirely, and just clear the
      CR0.TS bit.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e16838d
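The two-field check described above can be condensed into a small model (a hypothetical Python illustration of the invariant; the data structures are stand-ins, not the kernel's):

```python
# Toy model of the lazy-restore decision: the f[x]rstor can be skipped
# only if BOTH fields still agree -- this CPU's fpu_owner_task points at
# the incoming task (nothing else touched the FPU here), and the task's
# last_cpu is this CPU (it did no FP work anywhere else).
def can_skip_fpu_restore(cpu, task, fpu_owner_task):
    return fpu_owner_task.get(cpu) is task and task["last_cpu"] == cpu

task_a = {"last_cpu": 0}
fpu_owner = {0: task_a}          # per-cpu back-pointer to the FPU owner

assert can_skip_fpu_restore(0, task_a, fpu_owner)    # state still live here

task_a["last_cpu"] = 1           # task used its FPU on another CPU since
assert not can_skip_fpu_restore(0, task_a, fpu_owner)
```

Either field going stale (kernel FPU use overwriting the owner, or the task migrating) independently forces a real restore, which is why both are needed.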
    • i387: use 'restore_fpu_checking()' directly in task switching code · 80ab6f1e
      Committed by Linus Torvalds
      This inlines what is usually just a couple of instructions, but more
      importantly it also fixes the theoretical error case (can that FPU
      restore really ever fail? Maybe we should remove the checking).
      
      We can't start sending signals from within the scheduler, we're much too
      deep in the kernel and are holding the runqueue lock etc.  So don't
      bother even trying.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      80ab6f1e
    • i387: fix up some fpu_counter confusion · cea20ca3
      Committed by Linus Torvalds
      This makes sure we clear the FPU usage counter for newly created tasks,
      just so that we start off in a known state (for example, don't try to
      preload the FPU state on the first task switch etc).
      
      It also fixes a thinko in when we increment the fpu_counter at task
      switch time, introduced by commit 34ddc81a ("i387: re-introduce FPU
      state preloading at context switch time").  We should increment the
      *new* task fpu_counter, not the old task, and only if we decide to use
      that state (whether lazily or preloaded).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cea20ca3
  5. 20 February 2012, 1 commit
    • xen/pat: Disable PAT support for now. · 8eaffa67
      Committed by Konrad Rzeszutek Wilk
      [Pls also look at https://lkml.org/lkml/2012/2/10/228]
      
      Using PAT to change pages from WB to WC works quite nicely.
      Changing them back to WB - not so much. The crux of the matter is
      that the code that does this (__change_page_attr_set_clr) has only
      limited information, so when it tries to make the change it gets
      the "raw" unfiltered information instead of the properly filtered one -
      and the "raw" one tells it that the PSE bit is on (while in fact it
      is not).  As a result, when the PTE is set back to WB from WC, we get
      tons of:
      
      :WARNING: at arch/x86/xen/mmu.c:475 xen_make_pte+0x67/0xa0()
      :Hardware name: HP xw4400 Workstation
      .. snip..
      :Pid: 27, comm: kswapd0 Tainted: G        W    3.2.2-1.fc16.x86_64 #1
      :Call Trace:
      : [<ffffffff8106dd1f>] warn_slowpath_common+0x7f/0xc0
      : [<ffffffff8106dd7a>] warn_slowpath_null+0x1a/0x20
      : [<ffffffff81005a17>] xen_make_pte+0x67/0xa0
      : [<ffffffff810051bd>] __raw_callee_save_xen_make_pte+0x11/0x1e
      : [<ffffffff81040e15>] ? __change_page_attr_set_clr+0x9d5/0xc00
      : [<ffffffff8114c2e8>] ? __purge_vmap_area_lazy+0x158/0x1d0
      : [<ffffffff8114cca5>] ? vm_unmap_aliases+0x175/0x190
      : [<ffffffff81041168>] change_page_attr_set_clr+0x128/0x4c0
      : [<ffffffff81041542>] set_pages_array_wb+0x42/0xa0
      : [<ffffffff8100a9b2>] ? check_events+0x12/0x20
      : [<ffffffffa0074d4c>] ttm_pages_put+0x1c/0x70 [ttm]
      : [<ffffffffa0074e98>] ttm_page_pool_free+0xf8/0x180 [ttm]
      : [<ffffffffa0074f78>] ttm_pool_mm_shrink+0x58/0x90 [ttm]
      : [<ffffffff8112ba04>] shrink_slab+0x154/0x310
      : [<ffffffff8112f17a>] balance_pgdat+0x4fa/0x6c0
      : [<ffffffff8112f4b8>] kswapd+0x178/0x3d0
      : [<ffffffff815df134>] ? __schedule+0x3d4/0x8c0
      : [<ffffffff81090410>] ? remove_wait_queue+0x50/0x50
      : [<ffffffff8112f340>] ? balance_pgdat+0x6c0/0x6c0
      : [<ffffffff8108fb6c>] kthread+0x8c/0xa0
      
      for every page. The proper fix for this has been posted at
      https://lkml.org/lkml/2012/2/10/228,
      "x86/cpa: Use pte_attrs instead of pte_flags on CPA/set_p.._wb/wc operations.",
      along with a detailed description of the problem and solution.
      
      But since that posting has gone nowhere, I am proposing
      this band-aid solution so that at least users don't get
      page corruption (the pages that are WC don't get changed back to WB
      and end up being recycled for the filesystem or other uses, causing
      mysterious crashes).
      
      The negative impact of this patch is that users of the WC flag
      (the InfiniBand, radeon, and nouveau drivers) won't be able
      to set that flag, so they are going to see performance degradation.
      But stability is more important here.
      
      Fixes RH BZ# 742032, 787403, and 745574
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      8eaffa67