1. 09 7月, 2012 1 次提交
    • A
      KVM: MMU: Force cr3 reload with two dimensional paging on mov cr3 emulation · e676505a
      Avi Kivity 提交于
      Currently the MMU's ->new_cr3() callback does nothing when guest paging
      is disabled or when two-dimentional paging (e.g. EPT on Intel) is active.
      This means that an emulated write to cr3 can be lost; kvm_set_cr3() will
      write vcpu-arch.cr3, but the GUEST_CR3 field in the VMCS will retain its
      old value and this is what the guest sees.
      
      This bug did not have any effect until now because:
      - with unrestricted guest, or with svm, we never emulate a mov cr3 instruction
      - without unrestricted guest, and with paging enabled, we also never emulate a
        mov cr3 instruction
      - without unrestricted guest, but with paging disabled, the guest's cr3 is
        ignored until the guest enables paging; at this point the value from arch.cr3
        is loaded correctly my the mov cr0 instruction which turns on paging
      
      However, the patchset that enables big real mode causes us to emulate mov cr3
      instructions in protected mode sometimes (when guest state is not virtualizable
      by vmx); this mov cr3 is effectively ignored and will crash the guest.
      
      The fix is to make nonpaging_new_cr3() call mmu_free_roots() to force a cr3
      reload.  This is awkward because now all the new_cr3 callbacks to the same
      thing, and because mmu_free_roots() is somewhat of an overkill; but fixing
      that is more complicated and will be done after this minimal fix.
      
      Observed in the Window XP 32-bit installer while bringing up secondary vcpus.
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      e676505a
  2. 04 7月, 2012 2 次提交
  3. 25 6月, 2012 7 次提交
    • M
      KVM: host side for eoi optimization · ae7a2a3f
      Michael S. Tsirkin 提交于
      Implementation of PV EOI using shared memory.
      This reduces the number of exits an interrupt
      causes as much as by half.
      
      The idea is simple: there's a bit, per APIC, in guest memory,
      that tells the guest that it does not need EOI.
      We set it before injecting an interrupt and clear
      before injecting a nested one. Guest tests it using
      a test and clear operation - this is necessary
      so that host can detect interrupt nesting -
      and if set, it can skip the EOI MSR.
      
      There's a new MSR to set the address of said register
      in guest memory. Otherwise not much changed:
      - Guest EOI is not required
      - Register is tested & ISR is automatically cleared on exit
      
      For testing results see description of previous patch
      'kvm_para: guest side for eoi avoidance'.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      ae7a2a3f
    • M
      KVM: rearrange injection cancelling code · d905c069
      Michael S. Tsirkin 提交于
      Each time we need to cancel injection we invoke same code
      (cancel_injection callback).  Move it towards the end of function using
      the familiar goto on error pattern.
      
      Will make it easier to do more cleanups for PV EOI.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      d905c069
    • M
      KVM: only sync when attention bits set · 5cfb1d5a
      Michael S. Tsirkin 提交于
      Commit eb0dc6d0368072236dcd086d7fdc17fd3c4574d4 introduced apic
      attention bitmask but kvm still syncs lapic unconditionally.
      As that commit suggested and in anticipation of adding more attention
      bits, only sync lapic if(apic_attention).
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      5cfb1d5a
    • M
      x86, bitops: note on __test_and_clear_bit atomicity · d0a69d63
      Michael S. Tsirkin 提交于
      __test_and_clear_bit is actually atomic with respect
      to the local CPU. Add a note saying that KVM on x86
      relies on this behaviour so people don't accidentaly break it.
      Also warn not to rely on this in portable code.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      d0a69d63
    • M
      KVM guest: guest side for eoi avoidance · ab9cf499
      Michael S. Tsirkin 提交于
      The idea is simple: there's a bit, per APIC, in guest memory,
      that tells the guest that it does not need EOI.
      Guest tests it using a single est and clear operation - this is
      necessary so that host can detect interrupt nesting - and if set, it can
      skip the EOI MSR.
      
      I run a simple microbenchmark to show exit reduction
      (note: for testing, need to apply follow-up patch
      'kvm: host side for eoi optimization' + a qemu patch
       I posted separately, on host):
      
      Before:
      
      Performance counter stats for 'sleep 1s':
      
                  47,357 kvm:kvm_entry                                                [99.98%]
                       0 kvm:kvm_hypercall                                            [99.98%]
                       0 kvm:kvm_hv_hypercall                                         [99.98%]
                   5,001 kvm:kvm_pio                                                  [99.98%]
                       0 kvm:kvm_cpuid                                                [99.98%]
                  22,124 kvm:kvm_apic                                                 [99.98%]
                  49,849 kvm:kvm_exit                                                 [99.98%]
                  21,115 kvm:kvm_inj_virq                                             [99.98%]
                       0 kvm:kvm_inj_exception                                        [99.98%]
                       0 kvm:kvm_page_fault                                           [99.98%]
                  22,937 kvm:kvm_msr                                                  [99.98%]
                       0 kvm:kvm_cr                                                   [99.98%]
                       0 kvm:kvm_pic_set_irq                                          [99.98%]
                       0 kvm:kvm_apic_ipi                                             [99.98%]
                  22,207 kvm:kvm_apic_accept_irq                                      [99.98%]
                  22,421 kvm:kvm_eoi                                                  [99.98%]
                       0 kvm:kvm_pv_eoi                                               [99.99%]
                       0 kvm:kvm_nested_vmrun                                         [99.99%]
                       0 kvm:kvm_nested_intercepts                                    [99.99%]
                       0 kvm:kvm_nested_vmexit                                        [99.99%]
                       0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                       0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                       0 kvm:kvm_invlpga                                              [99.99%]
                       0 kvm:kvm_skinit                                               [99.99%]
                      57 kvm:kvm_emulate_insn                                         [99.99%]
                       0 kvm:vcpu_match_mmio                                          [99.99%]
                       0 kvm:kvm_userspace_exit                                       [99.99%]
                       2 kvm:kvm_set_irq                                              [99.99%]
                       2 kvm:kvm_ioapic_set_irq                                       [99.99%]
                  23,609 kvm:kvm_msi_set_irq                                          [99.99%]
                       1 kvm:kvm_ack_irq                                              [99.99%]
                     131 kvm:kvm_mmio                                                 [99.99%]
                     226 kvm:kvm_fpu                                                  [100.00%]
                       0 kvm:kvm_age_page                                             [100.00%]
                       0 kvm:kvm_try_async_get_page                                    [100.00%]
                       0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                       0 kvm:kvm_async_pf_not_present                                    [100.00%]
                       0 kvm:kvm_async_pf_ready                                       [100.00%]
                       0 kvm:kvm_async_pf_completed
      
             1.002100578 seconds time elapsed
      
      After:
      
       Performance counter stats for 'sleep 1s':
      
                  28,354 kvm:kvm_entry                                                [99.98%]
                       0 kvm:kvm_hypercall                                            [99.98%]
                       0 kvm:kvm_hv_hypercall                                         [99.98%]
                   1,347 kvm:kvm_pio                                                  [99.98%]
                       0 kvm:kvm_cpuid                                                [99.98%]
                   1,931 kvm:kvm_apic                                                 [99.98%]
                  29,595 kvm:kvm_exit                                                 [99.98%]
                  24,884 kvm:kvm_inj_virq                                             [99.98%]
                       0 kvm:kvm_inj_exception                                        [99.98%]
                       0 kvm:kvm_page_fault                                           [99.98%]
                   1,986 kvm:kvm_msr                                                  [99.98%]
                       0 kvm:kvm_cr                                                   [99.98%]
                       0 kvm:kvm_pic_set_irq                                          [99.98%]
                       0 kvm:kvm_apic_ipi                                             [99.99%]
                  25,953 kvm:kvm_apic_accept_irq                                      [99.99%]
                  26,132 kvm:kvm_eoi                                                  [99.99%]
                  26,593 kvm:kvm_pv_eoi                                               [99.99%]
                       0 kvm:kvm_nested_vmrun                                         [99.99%]
                       0 kvm:kvm_nested_intercepts                                    [99.99%]
                       0 kvm:kvm_nested_vmexit                                        [99.99%]
                       0 kvm:kvm_nested_vmexit_inject                                    [99.99%]
                       0 kvm:kvm_nested_intr_vmexit                                    [99.99%]
                       0 kvm:kvm_invlpga                                              [99.99%]
                       0 kvm:kvm_skinit                                               [99.99%]
                     284 kvm:kvm_emulate_insn                                         [99.99%]
                      68 kvm:vcpu_match_mmio                                          [99.99%]
                      68 kvm:kvm_userspace_exit                                       [99.99%]
                       2 kvm:kvm_set_irq                                              [99.99%]
                       2 kvm:kvm_ioapic_set_irq                                       [99.99%]
                  28,288 kvm:kvm_msi_set_irq                                          [99.99%]
                       1 kvm:kvm_ack_irq                                              [99.99%]
                     131 kvm:kvm_mmio                                                 [100.00%]
                     588 kvm:kvm_fpu                                                  [100.00%]
                       0 kvm:kvm_age_page                                             [100.00%]
                       0 kvm:kvm_try_async_get_page                                    [100.00%]
                       0 kvm:kvm_async_pf_doublefault                                    [100.00%]
                       0 kvm:kvm_async_pf_not_present                                    [100.00%]
                       0 kvm:kvm_async_pf_ready                                       [100.00%]
                       0 kvm:kvm_async_pf_completed
      
             1.002039622 seconds time elapsed
      
      We see that # of exits is almost halved.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      ab9cf499
    • M
      KVM: optimize ISR lookups · 8680b94b
      Michael S. Tsirkin 提交于
      We perform ISR lookups twice: during interrupt
      injection and on EOI. Typical workloads only have
      a single bit set there. So we can avoid ISR scans by
      1. counting bits as we set/clear them in ISR
      2. on set, caching the injected vector number
      3. on clear, invalidating the cache
      
      The real purpose of this is enabling PV EOI
      which needs to quickly validate the vector.
      But non PV guests also benefit: with this patch,
      and without interrupt nesting, apic_find_highest_isr
      will always return immediately without scanning ISR.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      8680b94b
    • M
      KVM: document lapic regs field · 5eadf916
      Michael S. Tsirkin 提交于
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      5eadf916
  4. 19 6月, 2012 1 次提交
  5. 18 6月, 2012 1 次提交
  6. 14 6月, 2012 1 次提交
  7. 12 6月, 2012 1 次提交
  8. 06 6月, 2012 2 次提交
    • M
      KVM: disable uninitialized var warning · 79f702a6
      Michael S. Tsirkin 提交于
      I see this in 3.5-rc1:
      
      arch/x86/kvm/mmu.c: In function ‘kvm_test_age_rmapp’:
      arch/x86/kvm/mmu.c:1271: warning: ‘iter.desc’ may be used uninitialized in this function
      
      The line in question was introduced by commit
      1e3f42f0
      
       static int kvm_test_age_rmapp(struct kvm *kvm, unsigned long *rmapp,
                                    unsigned long data)
       {
      -       u64 *spte;
      +       u64 *sptep;
      +       struct rmap_iterator iter;   <- line 1271
              int young = 0;
      
              /*
      
      The reason I think is that the compiler assumes that
      the rmap value could be 0, so
      
      static u64 *rmap_get_first(unsigned long rmap, struct rmap_iterator
      *iter)
      {
              if (!rmap)
                      return NULL;
      
              if (!(rmap & 1)) {
                      iter->desc = NULL;
                      return (u64 *)rmap;
              }
      
              iter->desc = (struct pte_list_desc *)(rmap & ~1ul);
              iter->pos = 0;
              return iter->desc->sptes[iter->pos];
      }
      
      will not initialize iter.desc, but the compiler isn't
      smart enough to see that
      
              for (sptep = rmap_get_first(*rmapp, &iter); sptep;
                   sptep = rmap_get_next(&iter)) {
      
      will immediately exit in this case.
      I checked by adding
              if (!*rmapp)
                      goto out;
      on top which is clearly equivalent but disables the warning.
      
      This patch uses uninitialized_var to disable the warning without
      increasing code size.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      79f702a6
    • C
      KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers · a737f256
      Christoffer Dall 提交于
      Introduces a couple of print functions, which are essentially wrappers
      around standard printk functions, with a KVM: prefix.
      
      Functions introduced or modified are:
       - kvm_err(fmt, ...)
       - kvm_info(fmt, ...)
       - kvm_debug(fmt, ...)
       - kvm_pr_unimpl(fmt, ...)
       - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
      Signed-off-by: NChristoffer Dall <c.dall@virtualopensystems.com>
      Signed-off-by: NAvi Kivity <avi@redhat.com>
      a737f256
  9. 05 6月, 2012 7 次提交
  10. 02 6月, 2012 12 次提交
  11. 01 6月, 2012 5 次提交
    • S
      ftrace/x86: Do not change stacks in DEBUG when calling lockdep · 5963e317
      Steven Rostedt 提交于
      When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
      will call into the lockdep code. The lockdep code can call lots of
      functions that may be traced by ftrace. When ftrace is updating its
      code and hits a breakpoint, the breakpoint handler will call into
      lockdep. If lockdep happens to call a function that also has a breakpoint
      attached, it will jump back into the breakpoint handler resetting
      the stack to the debug stack and corrupt the contents currently on
      that stack.
      
      The 'do_sym' call that calls do_int3() is protected by modifying the
      IST table to point to a different location if another breakpoint is
      hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
      a breakpoint is hit from those, the stack will get corrupted, and
      the kernel will crash:
      
      [ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
      [ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
      [ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
      [ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
      [ 1013.310600] CPU 2
      [ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
      [ 1013.401848]
      [ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
      [ 1013.437943] RIP: 8eb8:[<ffff88014630a000>]  [<ffff88014630a000>] 0xffff880146309fff
      [ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408  EFLAGS: 00010046
      [ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
      [ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
      [ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
      [ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
      [ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
      [ 1013.586108] FS:  0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
      [ 1013.609458] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      [ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
      [ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      [ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
      [ 1013.717028] Stack:
      [ 1013.724131]  ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
      [ 1013.745918]  cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
      [ 1013.767870]  ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
      [ 1013.790021] Call Trace:
      [ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
      [ 1013.861443] RIP  [<ffff88014630a000>] 0xffff880146309fff
      [ 1013.884466]  RSP <ffff88014780f408>
      [ 1013.901507] CR2: 0000000000000002
      
      The solution was to reuse the NMI functions that change the IDT table to make the debug
      stack keep its current stack (in kernel mode) when hitting a breakpoint:
      
        call debug_stack_set_zero
        TRACE_IRQS_ON
        call debug_stack_reset
      
      If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
      and not crash the box.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      5963e317
    • S
      x86: Allow nesting of the debug stack IDT setting · f8988175
      Steven Rostedt 提交于
      When the NMI handler runs, it checks if it preempted a debug handler
      and if that handler is using the debug stack. If it is, it changes the
      IDT table not to update the stack, otherwise it will reset the debug
      stack and corrupt the debug handler it preempted.
      
      Now that ftrace uses breakpoints to change functions from nops to
      callers, many more places may hit a breakpoint. Unfortunately this
      includes some of the calls that lockdep performs. Which causes issues
      with the debug stack. It too needs to change the debug stack before
      tracing (if called from the debug handler).
      
      Allow the debug_stack_set_zero() and debug_stack_reset() to be nested
      so that the debug handlers can take advantage of them too.
      
      [ Used this_cpu_*() over __get_cpu_var() as suggested by H. Peter Anvin ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      f8988175
    • S
      x86: Reset the debug_stack update counter · c0525a69
      Steven Rostedt 提交于
      When an NMI goes off and it sees that it preempted the debug stack,
      to keep the debug stack safe, it changes the IDT to point to one that
      does not modify the stack on breakpoint (to allow breakpoints in NMIs).
      
      But the variable that gets set to know to undo it on exit never gets
      cleared on exit. Thus every NMI will reset it on exit the first time
      it is done even if it does not need to be reset.
      
      [ Added H. Peter Anvin's suggestion to use this_cpu_read/write ]
      
      Cc: <stable@vger.kernel.org> # v3.3
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      c0525a69
    • S
      ftrace: Use breakpoint method to update ftrace caller · 8a4d0a68
      Steven Rostedt 提交于
      On boot up and module load, it is fine to modify the code directly,
      without the use of breakpoints. This is because boot up modification
      is done before SMP is initialized, thus the modification is serial,
      and module load is done before the module executes.
      
      But after that we must use a SMP safe method to modify running code.
      Otherwise, if we are running the function tracer and update its
      function (by starting off the stack tracer, or perf tracing)
      the change of the function called by the ftrace trampoline is done
      directly. If this is being executed on another CPU, that CPU may
      take a GPF and crash the kernel.
      
      The breakpoint method is used to change the nops at all the functions, but
      the change of the ftrace callback handler itself was still using a
      direct modification. If tracing was enabled and the function callback
      was changed then another CPU could fault if it was currently calling
      the original callback. This modification must use the breakpoint method
      too.
      
      Note, the direct method is still used for boot up and module load.
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8a4d0a68
    • S
      ftrace: Synchronize variable setting with breakpoints · a192cd04
      Steven Rostedt 提交于
      When the function tracer starts modifying the code via breakpoints
      it sets a variable (modifying_ftrace_code) to inform the breakpoint
      handler to call the ftrace int3 code.
      
      But there's no synchronization between setting this code and the
      handler, thus it is possible for the handler to be called on another
      CPU before it sees the variable. This will cause a kernel crash as
      the int3 handler will not know what to do with it.
      
      I originally added smp_mb()'s to force the visibility of the variable
      but H. Peter Anvin suggested that I just make it atomic.
      
      [ Added comments as suggested by Peter Zijlstra ]
      Suggested-by: NH. Peter Anvin <hpa@zytor.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      a192cd04