1. 21 4月, 2017 3 次提交
    • M
      KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK · e891a32e
      Marcelo Tosatti 提交于
      The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
      attempts to disable software suspend from causing "non atomic behaviour" of
      the operation:
      
          Add a helper function to compute the kernel time and convert nanoseconds
          back to CPU specific cycles.  Note that these must not be called in preemptible
          context, as that would mean the kernel could enter software suspend state,
          which would cause non-atomic operation.
      
      However, assume the kernel can enter software suspend at the following 2 points:
      
              ktime_get_ts(&ts);
      1.
      						hypothetical_ktime_get_ts(&ts)
              monotonic_to_bootbased(&ts);
      2.
      
      monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
      performed after point 1 (that is after resuming from software suspend),
      hypothetical_ktime_get_ts()
      
      Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
      which is
      
      	ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code
      
      Note CLOCK_MONOTONIC does not count during suspension.
      
      So remove the irq disablement, which causes the following warning on
      -RT kernels:
      
       With this reasoning, and the -RT bug that the irq disablement causes
       (because spin_lock is now a sleeping lock), remove the IRQ protection as it
       causes:
      
       [ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
       [ 1064.668110] INFO: lockdep is turned off.
       [ 1064.668110] irq event stamp: 0
       [ 1064.668112] hardirqs last  enabled at (0): [<          (null)>]  )
       [ 1064.668116] hardirqs last disabled at (0): [] c0
       [ 1064.668118] softirqs last  enabled at (0): [] c0
       [ 1064.668118] softirqs last disabled at (0): [<          (null)>]  )
       [ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
       [ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
       [ 1064.668123]  ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
       [ 1064.668125]  ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
       [ 1064.668126]  00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
       [ 1064.668126] Call Trace:
       [ 1064.668132]  [] dump_stack+0x19/0x1b
       [ 1064.668135]  [] __might_sleep+0x12d/0x1f0
       [ 1064.668138]  [] rt_spin_lock+0x24/0x60
       [ 1064.668155]  [] __get_kvmclock_ns+0x36/0x110 [k]
       [ 1064.668159]  [] ? futex_wait_queue_me+0x103/0x10
       [ 1064.668171]  [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
       [ 1064.668173]  [] ? futex_wait+0x1ac/0x2a0
      
      v2: notice get_kvmclock_ns with the same problem (Pankaj).
      v3: remove useless helper function (Pankaj).
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e891a32e
    • M
      kvm: better MWAIT emulation for guests · 668fffa3
      Michael S. Tsirkin 提交于
      Guests that are heavy on futexes end up IPI'ing each other a lot. That
      can lead to significant slowdowns and latency increase for those guests
      when running within KVM.
      
      If only a single guest is needed on a host, we have a lot of spare host
      CPU time we can throw at the problem. Modern CPUs implement a feature
      called "MWAIT" which allows guests to wake up sleeping remote CPUs without
      an IPI - thus without an exit - at the expense of never going out of guest
      context.
      
      The decision whether this is something sensible to use should be up to the
      VM admin, so to user space. We can however allow MWAIT execution on systems
      that support it properly hardware wise.
      
      This patch adds a CAP to user space and a KVM cpuid leaf to indicate
      availability of native MWAIT execution. With that enabled, the worst a
      guest can do is waste as many cycles as a "jmp ." would do, so it's not
      a privilege problem.
      
      We consciously do *not* expose the feature in our CPUID bitmap, as most
      people will want to benefit from sleeping vCPUs to allow for over commit.
      Reported-by: N"Gabriel L. Somlo" <gsomlo@gmail.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      [agraf: fix amd, change commit message]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      668fffa3
    • K
      KVM: x86: virtualize cpuid faulting · db2336a8
      Kyle Huey 提交于
      Hardware support for faulting on the cpuid instruction is not required to
      emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
      MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
      cpuid-induced VM exit checks the cpuid faulting state and the CPL.
      kvm_require_cpl is even kind enough to inject the GP fault for us.
      Signed-off-by: NKyle Huey <khuey@kylehuey.com>
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      [Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      db2336a8
  2. 13 4月, 2017 8 次提交
  3. 07 4月, 2017 4 次提交
    • J
      kvm: nVMX: Disallow userspace-injected exceptions in guest mode · 28d06353
      Jim Mattson 提交于
      The userspace exception injection API and code path are entirely
      unprepared for exceptions that might cause a VM-exit from L2 to L1, so
      the best course of action may be to simply disallow this for now.
      
      1. The API provides no mechanism for userspace to specify the new DR6
      bits for a #DB exception or the new CR2 value for a #PF
      exception. Presumably, userspace is expected to modify these registers
      directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in
      the event that L1 intercepts the exception, these registers should not
      be changed. Instead, the new values should be provided in the
      exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1).
      
      2. In the case of a userspace-injected #DB, inject_pending_event()
      clears DR7.GD before calling vmx_queue_exception(). However, in the
      event that L1 intercepts the exception, this is too early, because
      DR7.GD should not be modified by a #DB that causes a VM-exit directly
      (Intel SDM vol 3, section 27.1).
      
      3. If the injected exception is a #PF, nested_vmx_check_exception()
      doesn't properly check whether or not L1 is interested in the
      associated error code (using the #PF error code mask and match fields
      from vmcs12). It may either return 0 when it should call
      nested_vmx_vmexit() or vice versa.
      
      4. nested_vmx_check_exception() assumes that it is dealing with a
      hardware-generated exception intercept from L2, with some of the
      relevant details (the VM-exit interruption-information and the exit
      qualification) live in vmcs02. For userspace-injected exceptions, this
      is not the case.
      
      5. prepare_vmcs12() assumes that when its exit_intr_info argument
      specifies valid information with a valid error code that it can VMREAD
      the VM-exit interruption error code from vmcs02. For
      userspace-injected exceptions, this is not the case.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      28d06353
    • D
      KVM: x86: fix user triggerable warning in kvm_apic_accept_events() · 28bf2888
      David Hildenbrand 提交于
      If we already entered/are about to enter SMM, don't allow switching to
      INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events()
      will report a warning.
      
      Same applies if we are already in MP state INIT_RECEIVED and SMM is
      requested to be turned on. Refuse to set the VCPU events in this case.
      
      Fixes: cd7764fe ("KVM: x86: latch INITs while in system management mode")
      Cc: stable@vger.kernel.org # 4.2+
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      28bf2888
    • P
      kvm: make KVM_CAP_COALESCED_MMIO architecture agnostic · 30422558
      Paolo Bonzini 提交于
      Remove code from architecture files that can be moved to virt/kvm, since there
      is already common code for coalesced MMIO.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      [Removed a pointless 'break' after 'return'.]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      30422558
    • P
      KVM: x86: drop legacy device assignment · ad6260da
      Paolo Bonzini 提交于
      Legacy device assignment has been deprecated since 4.2 (released
      1.5 years ago).  VFIO is better and everyone should have switched to it.
      If they haven't, this should convince them. :)
      Reviewed-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ad6260da
  4. 28 3月, 2017 1 次提交
  5. 24 3月, 2017 2 次提交
  6. 02 3月, 2017 1 次提交
  7. 17 2月, 2017 3 次提交
  8. 15 2月, 2017 4 次提交
    • P
      kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection · b95234c8
      Paolo Bonzini 提交于
      Since bf9f6ac8 ("KVM: Update Posted-Interrupts Descriptor when vCPU
      is blocked", 2015-09-18) the posted interrupt descriptor is checked
      unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
      trigger the scan and, if NMIs or SMIs are not involved, we can avoid
      the complicated event injection path.
      
      Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
      there since APICv was introduced.
      
      However, without the KVM_REQ_EVENT safety net KVM needs to be much
      more careful about races between vmx_deliver_posted_interrupt and
      vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
      between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
      If that happens, kvm_trigger_posted_interrupt returns true, but
      smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
      entered with PIR.ON, but the posted interrupt IPI has not been sent
      and the interrupt is only delivered to the guest on the next vmentry
      (if any).  To fix this, disable interrupts before setting vcpu->mode.
      This ensures that the IPI is delayed until the guest enters non-root mode;
      it is then trapped by the processor causing the interrupt to be injected.
      
      Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
      and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
      but it (correctly) doesn't do anything because it sees vcpu->mode ==
      OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
      posted interrupt IPI is pending; this time, the fix for this is to move
      the RVI update after IN_GUEST_MODE.
      
      Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
      though the second could actually happen with VT-d posted interrupts.
      In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
      in another vmentry which would inject the interrupt.
      
      This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b95234c8
    • P
      KVM: x86: do not scan IRR twice on APICv vmentry · 76dfafd5
      Paolo Bonzini 提交于
      Calls to apic_find_highest_irr are scanning IRR twice, once
      in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
      sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
      now that it does the computation, it can also do the RVI write.
      
      In order to avoid complications in svm.c, make the callback optional.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      76dfafd5
    • P
    • P
      kvm: nVMX: move nested events check to kvm_vcpu_running · 0ad3bed6
      Paolo Bonzini 提交于
      vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable,
      and the former does not call check_nested_events.
      
      Once KVM_REQ_EVENT is removed from the APICv interrupt injection
      path, however, this would leave no place to trigger a vmexit
      from L2 to L1, causing a missed interrupt delivery while in guest
      mode.  This is caught by the "ack interrupt on exit" test in
      vmx.flat.
      
      [This does not change the calls to check_nested_events in
       inject_pending_event.  That is material for a separate cleanup.]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0ad3bed6
  9. 09 2月, 2017 1 次提交
    • A
      KVM: x86: hide KVM_HC_CLOCK_PAIRING on 32 bit · 8ef81a9a
      Arnd Bergmann 提交于
      The newly added hypercall doesn't work on x86-32:
      
      arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing':
      arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread';did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration]
      
      This adds an #ifdef around it, matching the one around the related
      functions that are also only implemented on 64-bit systems.
      
      Fixes: 55dd00a7 ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      8ef81a9a
  10. 08 2月, 2017 2 次提交
  11. 04 2月, 2017 1 次提交
  12. 27 1月, 2017 1 次提交
  13. 17 1月, 2017 1 次提交
    • D
      KVM: x86: fix fixing of hypercalls · ce2e852e
      Dmitry Vyukov 提交于
      emulator_fix_hypercall() replaces hypercall with vmcall instruction,
      but it does not handle GP exception properly when writes the new instruction.
      It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
      This leads to incorrect emulation and triggers
      WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
      as discovered by syzkaller fuzzer:
      
      WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
      Call Trace:
       warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
       x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
       x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
       emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
       handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
       vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
       vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
       vcpu_run arch/x86/kvm/x86.c:6947 [inline]
      
      Set exception information when write in emulator_fix_hypercall() fails.
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: kvm@vger.kernel.org
      Cc: syzkaller@googlegroups.com
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      ce2e852e
  14. 12 1月, 2017 2 次提交
    • W
      KVM: x86: fix NULL deref in vcpu_scan_ioapic · 546d87e5
      Wanpeng Li 提交于
      Reported by syzkaller:
      
          BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
          IP: _raw_spin_lock+0xc/0x30
          PGD 3e28eb067
          PUD 3f0ac6067
          PMD 0
          Oops: 0002 [#1] SMP
          CPU: 0 PID: 2431 Comm: test Tainted: G           OE   4.10.0-rc1+ #3
          Call Trace:
           ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
           kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
           ? pick_next_task_fair+0xe1/0x4e0
           ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
           kvm_vcpu_ioctl+0x33a/0x600 [kvm]
           ? hrtimer_try_to_cancel+0x29/0x130
           ? do_nanosleep+0x97/0xf0
           do_vfs_ioctl+0xa1/0x5d0
           ? __hrtimer_init+0x90/0x90
           ? do_nanosleep+0x5b/0xf0
           SyS_ioctl+0x79/0x90
           do_syscall_64+0x6e/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
      
      The syzkaller folks reported a NULL pointer dereference due to
      ENABLE_CAP succeeding even without an irqchip.  The Hyper-V
      synthetic interrupt controller is activated, resulting in a
      wrong request to rescan the ioapic and a NULL pointer dereference.
      
          #include <sys/ioctl.h>
          #include <sys/mman.h>
          #include <sys/types.h>
          #include <linux/kvm.h>
          #include <pthread.h>
          #include <stddef.h>
          #include <stdint.h>
          #include <stdlib.h>
          #include <string.h>
          #include <unistd.h>
      
          #ifndef KVM_CAP_HYPERV_SYNIC
          #define KVM_CAP_HYPERV_SYNIC 123
          #endif
      
          void* thr(void* arg)
          {
      	struct kvm_enable_cap cap;
      	cap.flags = 0;
      	cap.cap = KVM_CAP_HYPERV_SYNIC;
      	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
      	return 0;
          }
      
          int main()
          {
      	void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
      			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
      	int kvmfd = open("/dev/kvm", 0);
      	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
      	struct kvm_userspace_memory_region memreg;
      	memreg.slot = 0;
      	memreg.flags = 0;
      	memreg.guest_phys_addr = 0;
      	memreg.memory_size = 0x1000;
      	memreg.userspace_addr = (unsigned long)host_mem;
      	host_mem[0] = 0xf4;
      	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
      	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
      	struct kvm_sregs sregs;
      	ioctl(cpufd, KVM_GET_SREGS, &sregs);
      	sregs.cr0 = 0;
      	sregs.cr4 = 0;
      	sregs.efer = 0;
      	sregs.cs.selector = 0;
      	sregs.cs.base = 0;
      	ioctl(cpufd, KVM_SET_SREGS, &sregs);
      	struct kvm_regs regs = { .rflags = 2 };
      	ioctl(cpufd, KVM_SET_REGS, &regs);
      	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
      	pthread_t th;
      	pthread_create(&th, 0, thr, (void*)(long)cpufd);
      	usleep(rand() % 10000);
      	ioctl(cpufd, KVM_RUN, 0);
      	pthread_join(th, 0);
      	return 0;
          }
      
      This patch fixes it by failing ENABLE_CAP if without an irqchip.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Fixes: 5c919412 (kvm/x86: Hyper-V synthetic interrupt controller)
      Cc: stable@vger.kernel.org # 4.5+
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      546d87e5
    • D
      KVM: x86: flush pending lapic jump label updates on module unload · cef84c30
      David Matlack 提交于
      KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
      These are implemented with delayed_work structs which can still be
      pending when the KVM module is unloaded. We've seen this cause kernel
      panics when the kvm_intel module is quickly reloaded.
      
      Use the new static_key_deferred_flush() API to flush pending updates on
      module unload.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cef84c30
  15. 09 1月, 2017 6 次提交