1. 22 11月, 2016 1 次提交
    • P
      x86/kvm: Support the vCPU preemption check · 0b9f6c46
      Pan Xinhui 提交于
      Support the vcpu_is_preempted() functionality under KVM. This will
      enhance lock performance on overcommitted hosts (more runnable vCPUs
      than physical CPUs in the system) as doing busy waits for preempted
      vCPUs will hurt system performance far worse than early yielding.
      
      Use struct kvm_steal_time::preempted to indicate that if a vCPU
      is running or not.
      Signed-off-by: NPan Xinhui <xinhui.pan@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: benh@kernel.crashing.org
      Cc: boqun.feng@gmail.com
      Cc: borntraeger@de.ibm.com
      Cc: bsingharora@gmail.com
      Cc: dave@stgolabs.net
      Cc: jgross@suse.com
      Cc: kernellwp@gmail.com
      Cc: konrad.wilk@oracle.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: mpe@ellerman.id.au
      Cc: paulmck@linux.vnet.ibm.com
      Cc: paulus@samba.org
      Cc: rkrcmar@redhat.com
      Cc: virtualization@lists.linux-foundation.org
      Cc: will.deacon@arm.com
      Cc: xen-devel-request@lists.xenproject.org
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/1478077718-37424-9-git-send-email-xinhui.pan@linux.vnet.ibm.com
      [ Typo fixes. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0b9f6c46
  2. 20 11月, 2016 4 次提交
  3. 03 11月, 2016 1 次提交
    • P
      KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK · ea26e4ec
      Paolo Bonzini 提交于
      Since commit a545ab6a ("kvm: x86: add tsc_offset field to struct
      kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
      cached and need not be fished out of the VMCS or VMCB.  This means
      that we can implement adjust_tsc_offset_guest and read_l1_tsc
      entirely in generic code.  The simplification is particularly
      significant for VMX code, where vmx->nested.vmcs01_tsc_offset
      was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
      the vmcs01_tsc_offset can be dropped completely.
      
      More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
      which, after commit 108b249c ("KVM: x86: introduce get_kvmclock_ns",
      2016-09-01) called read_l1_tsc while the VMCS was not loaded.
      It thus returned bogus values on Intel CPUs.
      
      Fixes: 108b249cReported-by: NRoman Kagan <rkagan@virtuozzo.com>
      Reviewed-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ea26e4ec
  4. 28 10月, 2016 2 次提交
  5. 20 10月, 2016 1 次提交
  6. 20 9月, 2016 4 次提交
  7. 16 9月, 2016 2 次提交
  8. 08 9月, 2016 1 次提交
  9. 05 9月, 2016 1 次提交
    • W
      KVM: lapic: adjust preemption timer correctly when goes TSC backward · e12c8f36
      Wanpeng Li 提交于
      TSC_OFFSET will be adjusted if discovers TSC backward during vCPU load.
      The preemption timer, which relies on the guest tsc to reprogram its
      preemption timer value, is also reprogrammed if vCPU is scheded in to
      a different pCPU. However, the current implementation reprogram preemption
      timer before TSC_OFFSET is adjusted to the right value, resulting in the
      preemption timer firing prematurely.
      
      This patch fix it by adjusting TSC_OFFSET before reprogramming preemption
      timer if TSC backward.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krċmář <rkrcmar@redhat.com>
      Cc: Yunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e12c8f36
  10. 15 7月, 2016 2 次提交
  11. 14 7月, 2016 7 次提交
    • P
      x86/kvm: Audit and remove any unnecessary uses of module.h · 1767e931
      Paul Gortmaker 提交于
      Historically a lot of these existed because we did not have
      a distinction between what was modular code and what was providing
      support to modules via EXPORT_SYMBOL and friends.  That changed
      when we forked out support for the latter into the export.h file.
      
      This means we should be able to reduce the usage of module.h
      in code that is obj-y Makefile or bool Kconfig.  In the case of
      kvm where it is modular, we can extend that to also include files
      that are building basic support functionality but not related
      to loading or registering the final module; such files also have
      no need whatsoever for module.h
      
      The advantage in removing such instances is that module.h itself
      sources about 15 other headers; adding significantly to what we feed
      cpp, and it can obscure what headers we are effectively using.
      
      Since module.h was the source for init.h (for __init) and for
      export.h (for EXPORT_SYMBOL) we consider each instance for the
      presence of either and replace as needed.
      
      Several instances got replaced with moduleparam.h since that was
      really all that was required for those particular files.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kvm@vger.kernel.org
      Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1767e931
    • R
      KVM: x86: bump KVM_MAX_VCPU_ID to 1023 · af1bae54
      Radim Krčmář 提交于
      kzalloc was replaced with kvm_kvzalloc to allow non-contiguous areas and
      rcu had to be modified to cope with it.
      
      The practical limit for KVM_MAX_VCPU_ID right now is INT_MAX, but lower
      value was chosen in case there were bugs.  1023 is sufficient maximum
      APIC ID for 288 VCPUs.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      af1bae54
    • R
      KVM: x86: add a flag to disable KVM x2apic broadcast quirk · c519265f
      Radim Krčmář 提交于
      Add KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK as a feature flag to
      KVM_CAP_X2APIC_API.
      
      The quirk made KVM interpret 0xff as a broadcast even in x2APIC mode.
      The enableable capability is needed in order to support standard x2APIC and
      remain backward compatible.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      [Expand kvm_apic_mda comment. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c519265f
    • R
      KVM: x86: add KVM_CAP_X2APIC_API · 37131313
      Radim Krčmář 提交于
      KVM_CAP_X2APIC_API is a capability for features related to x2APIC
      enablement.  KVM_X2APIC_API_32BIT_FORMAT feature can be enabled to
      extend APIC ID in get/set ioctl and MSI addresses to 32 bits.
      Both are needed to support x2APIC.
      
      The feature has to be enableable and disabled by default, because
      get/set ioctl shifted and truncated APIC ID to 8 bits by using a
      non-standard protocol inspired by xAPIC and the change is not
      backward-compatible.
      
      Changes to MSI addresses follow the format used by interrupt remapping
      unit.  The upper address word, that used to be 0, contains upper 24 bits
      of the LAPIC address in its upper 24 bits.  Lower 8 bits are reserved as
      0.  Using the upper address word is not backward-compatible either as we
      didn't check that userspace zeroed the word.  Reserved bits are still
      not explicitly checked, but non-zero data will affect LAPIC addresses,
      which will cause a bug.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      37131313
    • R
      KVM: x86: use hardware-compatible format for APIC ID register · a92e2543
      Radim Krčmář 提交于
      We currently always shift APIC ID as if APIC was in xAPIC mode.
      x2APIC mode wants to use more bits and storing a hardware-compabible
      value is the the sanest option.
      
      KVM API to set the lapic expects that bottom 8 bits of APIC ID are in
      top 8 bits of APIC_ID register, so the register needs to be shifted in
      x2APIC mode.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a92e2543
    • B
      kvm: mmu: don't set the present bit unconditionally · ffb128c8
      Bandan Das 提交于
      To support execute only mappings on behalf of L1
      hypervisors, we need to teach set_spte() to honor all three of
      L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
      that will be set for non-EPT shadow paging and clear for EPT.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ffb128c8
    • B
      kvm: mmu: remove is_present_gpte() · 812f30b2
      Bandan Das 提交于
      We have two versions of the above function.
      To prevent confusion and bugs in the future, remove
      the non-FNAME version entirely and replace all calls
      with the actual check.
      Signed-off-by: NBandan Das <bsd@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      812f30b2
  12. 01 7月, 2016 2 次提交
  13. 27 6月, 2016 1 次提交
  14. 24 6月, 2016 2 次提交
  15. 16 6月, 2016 3 次提交
    • Y
      kvm: vmx: hook preemption timer support · 64672c95
      Yunhong Jiang 提交于
      Hook the VMX preemption timer to the "hv timer" functionality added
      by the previous patch.  This includes: checking if the feature is
      supported, if the feature is broken on the CPU, the hooks to
      setup/clean the VMX preemption timer, arming the timer on vmentry
      and handling the vmexit.
      
      A module parameter states if the VMX preemption timer should be
      utilized.
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      [Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
       Put all VMX bits here.  Enable it by default #yolo. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      64672c95
    • Y
      KVM: x86: support using the vmx preemption timer for tsc deadline timer · ce7a058a
      Yunhong Jiang 提交于
      The VMX preemption timer can be used to virtualize the TSC deadline timer.
      The VMX preemption timer is armed when the vCPU is running, and a VMExit
      will happen if the virtual TSC deadline timer expires.
      
      When the vCPU thread is blocked because of HLT, KVM will switch to use
      an hrtimer, and then go back to the VMX preemption timer when the vCPU
      thread is unblocked.
      
      This solution avoids the complex OS's hrtimer system, and the host
      timer interrupt handling cost, replacing them with a little math
      (for guest->host TSC and host TSC->preemption timer conversion)
      and a cheaper VMexit.  This benefits latency for isolated pCPUs.
      
      [A word about performance... Yunhong reported a 30% reduction in average
       latency from cyclictest.  I made a similar test with tscdeadline_latency
       from kvm-unit-tests, and measured
      
       - ~20 clock cycles loss (out of ~3200, so less than 1% but still
         statistically significant) in the worst case where the test halts
         just after programming the TSC deadline timer
      
       - ~800 clock cycles gain (25% reduction in latency) in the best case
         where the test busy waits.
      
       I removed the VMX bits from Yunhong's patch, to concentrate them in the
       next patch - Paolo]
      Signed-off-by: NYunhong Jiang <yunhong.jiang@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ce7a058a
    • P
      KVM: remove kvm_vcpu_compatible · 557abc40
      Paolo Bonzini 提交于
      The new created_vcpus field makes it possible to avoid the race between
      irqchip and VCPU creation in a much nicer way; just check under kvm->lock
      whether a VCPU has already been created.
      
      We can then remove KVM_APIC_ARCHITECTURE too, because at this point the
      symbol is only governing the default definition of kvm_vcpu_compatible.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      557abc40
  16. 14 6月, 2016 1 次提交
  17. 08 6月, 2016 1 次提交
    • D
      x86/fpu: Add tracepoints to dump FPU state at key points · d1898b73
      Dave Hansen 提交于
      I've been carrying this patch around for a bit and it's helped me
      solve at least a couple FPU-related bugs.  In addition to using
      it for debugging, I also drug it out because using AVX (and
      AVX2/AVX-512) can have serious power consequences for a modern
      core.  It's very important to be able to figure out who is using
      it.
      
      It's also insanely useful to go out and see who is using a given
      feature, like MPX or Memory Protection Keys.  If you, for
      instance, want to find all processes using protection keys, you
      can do:
      
      	echo 'xfeatures & 0x200' > filter
      
      Since 0x200 is the protection keys feature bit.
      
      Note that this touches the KVM code.  KVM did a CREATE_TRACE_POINTS
      and then included a bunch of random headers.  If anyone one of
      those included other tracepoints, it would have defined the *OTHER*
      tracepoints.  That's bogus, so move it to the right place.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20160601174220.3CDFB90E@viggo.jf.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d1898b73
  18. 03 6月, 2016 3 次提交
    • P
      KVM: x86: protect KVM_CREATE_PIT/KVM_CREATE_PIT2 with kvm->lock · 250715a6
      Paolo Bonzini 提交于
      The syzkaller folks reported a NULL pointer dereference that seems
      to be cause by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2.
      The former takes kvm->lock (except when registering the devices,
      which needs kvm->slots_lock); the latter takes kvm->slots_lock only.
      Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP.
      
      Testcase:
      
          #include <pthread.h>
          #include <linux/kvm.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
          #include <stdint.h>
          #include <string.h>
          #include <stdlib.h>
          #include <sys/syscall.h>
          #include <unistd.h>
      
          long r[23];
      
          void* thr1(void* arg)
          {
              struct kvm_pit_config pitcfg = { .flags = 4 };
              switch ((long)arg) {
              case 0: r[2]  = open("/dev/kvm", O_RDONLY|O_ASYNC);    break;
              case 1: r[3]  = ioctl(r[2], KVM_CREATE_VM, 0);         break;
              case 2: r[4]  = ioctl(r[3], KVM_CREATE_IRQCHIP, 0);    break;
              case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break;
              }
              return 0;
          }
      
          int main(int argc, char **argv)
          {
              long i;
              pthread_t th[4];
      
              memset(r, -1, sizeof(r));
              for (i = 0; i < 4; i++) {
                  pthread_create(&th[i], 0, thr, (void*)i);
                  if (argc > 1 && rand()%2) usleep(rand()%1000);
              }
              usleep(20000);
              return 0;
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      250715a6
    • P
      KVM: x86: rename process_smi to enter_smm, process_smi_request to process_smi · ee2cd4b7
      Paolo Bonzini 提交于
      Make the function names more similar between KVM_REQ_NMI and KVM_REQ_SMI.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ee2cd4b7
    • P
      KVM: x86: avoid simultaneous queueing of both IRQ and SMI · c43203ca
      Paolo Bonzini 提交于
      If the processor exits to KVM while delivering an interrupt,
      the hypervisor then requeues the interrupt for the next vmentry.
      Trying to enter SMM in this same window causes to enter non-root
      mode in emulated SMM (i.e. with IF=0) and with a request to
      inject an IRQ (i.e. with a valid VM-entry interrupt info field).
      This is invalid guest state (SDM 26.3.1.4 "Check on Guest RIP
      and RFLAGS") and the processor fails vmentry.
      
      The fix is to defer the injection from KVM_REQ_SMI to KVM_REQ_EVENT,
      like we already do for e.g. NMIs.  This patch doesn't change the
      name of the process_smi function so that it can be applied to
      stable releases.  The next patch will modify the names so that
      process_nmi and process_smi handle respectively KVM_REQ_NMI and
      KVM_REQ_SMI.
      
      This is especially common with Windows, probably due to the
      self-IPI trick that it uses to deliver deferred procedure
      calls (DPCs).
      Reported-by: NLaszlo Ersek <lersek@redhat.com>
      Reported-by: NMichał Zegan <webczat_200@poczta.onet.pl>
      Fixes: 64d60670
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      c43203ca
  19. 02 6月, 2016 1 次提交
    • P
      KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGS · d14bdb55
      Paolo Bonzini 提交于
      MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to
      any of bits 63:32.  However, this is not detected at KVM_SET_DEBUGREGS
      time, and the next KVM_RUN oopses:
      
         general protection fault: 0000 [#1] SMP
         CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
         Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
         [...]
         Call Trace:
          [<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm]
          [<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
          [<ffffffff81241648>] do_vfs_ioctl+0x298/0x480
          [<ffffffff812418a9>] SyS_ioctl+0x79/0x90
          [<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
         Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
         RIP  [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40
          RSP <ffff88005836bd50>
      
      Testcase (beautified/reduced from syzkaller output):
      
          #include <unistd.h>
          #include <sys/syscall.h>
          #include <string.h>
          #include <stdint.h>
          #include <linux/kvm.h>
          #include <fcntl.h>
          #include <sys/ioctl.h>
      
          long r[8];
      
          int main()
          {
              struct kvm_debugregs dr = { 0 };
      
              r[2] = open("/dev/kvm", O_RDONLY);
              r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
              r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
      
              memcpy(&dr,
                     "\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72"
                     "\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8"
                     "\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9"
                     "\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb",
                     48);
              r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr);
              r[6] = ioctl(r[4], KVM_RUN, 0);
          }
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d14bdb55