1. 27 7月, 2017 1 次提交
    • P
      KVM: x86: do mask out upper bits of PAE CR3 · a512177e
      Paolo Bonzini 提交于
      This reverts the change of commit f85c758d,
      as the behavior it modified was intended.
      
      The VM is running in 32-bit PAE mode, and Table 4-7 of the Intel manual
      says:
      
      Table 4-7. Use of CR3 with PAE Paging
      Bit Position(s)	Contents
      4:0		Ignored
      31:5		Physical address of the 32-Byte aligned
      		page-directory-pointer table used for linear-address
      		translation
      63:32		Ignored (these bits exist only on processors supporting
      		the Intel-64 architecture)
      
      To placate the static checker, write the mask explicitly as an
      unsigned long constant instead of using a 32-bit unsigned constant.
      
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Fixes: f85c758dSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a512177e
  2. 19 7月, 2017 1 次提交
  3. 14 7月, 2017 4 次提交
    • R
      kvm: x86: hyperv: make VP_INDEX managed by userspace · d3457c87
      Roman Kagan 提交于
      Hyper-V identifies vCPUs by Virtual Processor Index, which can be
      queried via HV_X64_MSR_VP_INDEX msr.  It is defined by the spec as a
      sequential number which can't exceed the maximum number of vCPUs per VM.
      APIC ids can be sparse and thus aren't a valid replacement for VP
      indices.
      
      Current KVM uses its internal vcpu index as VP_INDEX.  However, to make
      it predictable and persistent across VM migrations, the userspace has to
      control the value of VP_INDEX.
      
      This patch achieves that, by storing vp_index explicitly on vcpu, and
      allowing HV_X64_MSR_VP_INDEX to be set from the host side.  For
      compatibility it's initialized to KVM vcpu index.  Also a few variables
      are renamed to make clear distinction betweed this Hyper-V vp_index and
      KVM vcpu_id (== APIC id).  Besides, a new capability,
      KVM_CAP_HYPERV_VP_INDEX, is added to allow the userspace to skip
      attempting msr writes where unsupported, to avoid spamming error logs.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      d3457c87
    • W
      KVM: async_pf: Let guest support delivery of async_pf from guest mode · 52a5c155
      Wanpeng Li 提交于
      Adds another flag bit (bit 2) to MSR_KVM_ASYNC_PF_EN. If bit 2 is 1,
      async page faults are delivered to L1 as #PF vmexits; if bit 2 is 0,
      kvm_can_do_async_pf returns 0 if in guest mode.
      
      This is similar to what svm.c wanted to do all along, but it is only
      enabled for Linux as L1 hypervisor.  Foreign hypervisors must never
      receive async page faults as vmexits, because they'd probably be very
      confused about that.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      52a5c155
    • W
      KVM: async_pf: Force a nested vmexit if the injected #PF is async_pf · adfe20fb
      Wanpeng Li 提交于
      Add an nested_apf field to vcpu->arch.exception to identify an async page
      fault, and constructs the expected vm-exit information fields. Force a
      nested VM exit from nested_vmx_check_exception() if the injected #PF is
      async page fault.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      adfe20fb
    • W
      KVM: x86: Simplify kvm_x86_ops->queue_exception parameter list · cfcd20e5
      Wanpeng Li 提交于
      This patch removes all arguments except the first in
      kvm_x86_ops->queue_exception since they can extract the arguments from
      vcpu->arch.exception themselves.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      cfcd20e5
  4. 13 7月, 2017 3 次提交
    • R
      kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2 · efc479e6
      Roman Kagan 提交于
      There is a flaw in the Hyper-V SynIC implementation in KVM: when message
      page or event flags page is enabled by setting the corresponding msr,
      KVM zeroes it out.  This is problematic because on migration the
      corresponding MSRs are loaded on the destination, so the content of
      those pages is lost.
      
      This went unnoticed so far because the only user of those pages was
      in-KVM hyperv synic timers, which could continue working despite that
      zeroing.
      
      Newer QEMU uses those pages for Hyper-V VMBus implementation, and
      zeroing them breaks the migration.
      
      Besides, in newer QEMU the content of those pages is fully managed by
      QEMU, so zeroing them is undesirable even when writing the MSRs from the
      guest side.
      
      To support this new scheme, introduce a new capability,
      KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
      pages aren't zeroed out in KVM.
      Signed-off-by: NRoman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      efc479e6
    • L
      KVM: x86: make backwards_tsc_observed a per-VM variable · a826faf1
      Ladi Prosek 提交于
      The backwards_tsc_observed global introduced in commit 16a96021 is never
      reset to false. If a VM happens to be running while the host is suspended
      (a common source of the TSC jumping backwards), master clock will never
      be enabled again for any VM. In contrast, if no VM is running while the
      host is suspended, master clock is unaffected. This is inconsistent and
      unnecessarily strict. Let's track the backwards_tsc_observed variable
      separately and let each VM start with a clean slate.
      
      Real world impact: My Windows VMs get slower after my laptop undergoes a
      suspend/resume cycle. The only way to get the perf back is unloading and
      reloading the kvm module.
      Signed-off-by: NLadi Prosek <lprosek@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      a826faf1
    • R
      KVM: x86: update master clock before computing kvmclock_offset · 0bc48bea
      Radim Krčmář 提交于
      kvm master clock usually has a different frequency than the kernel boot
      clock.  This is not a problem until the master clock is updated;
      update uses the current kernel boot clock to compute new kvm clock,
      which erases any kvm clock cycles that might have built up due to
      frequency difference over a long period.
      
      KVM_SET_CLOCK is one of places where we can safely update master clock
      as the guest-visible clock is going to be shifted anyway.
      
      The problem with current code is that it updates the kvm master clock
      after updating the offset.  If the master clock was enabled before
      calling KVM_SET_CLOCK, then it might have built up a significant delta
      from kernel boot clock.
      In the worst case, the time set by userspace would be shifted by so much
      that it couldn't have been set at any point during KVM_SET_CLOCK.
      
      To fix this, move kvm_gen_update_masterclock() before computing
      kvmclock_offset, which means that the master clock and kernel boot clock
      will be sufficiently close together.
      Another solution would be to replace get_kvmclock_ns() with
      "ktime_get_boot_ns() + ka->kvmclock_offset", which is marginally more
      accurate, but would break symmetry with KVM_GET_CLOCK.
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      0bc48bea
  5. 03 7月, 2017 1 次提交
  6. 30 6月, 2017 1 次提交
  7. 22 6月, 2017 1 次提交
    • P
      KVM: x86: fix singlestepping over syscall · c8401dda
      Paolo Bonzini 提交于
      TF is handled a bit differently for syscall and sysret, compared
      to the other instructions: TF is checked after the instruction completes,
      so that the OS can disable #DB at a syscall by adding TF to FMASK.
      When the sysret is executed the #DB is taken "as if" the syscall insn
      just completed.
      
      KVM emulates syscall so that it can trap 32-bit syscall on Intel processors.
      Fix the behavior, otherwise you could get #DB on a user stack which is not
      nice.  This does not affect Linux guests, as they use an IST or task gate
      for #DB.
      
      This fixes CVE-2017-7518.
      
      Cc: stable@vger.kernel.org
      Reported-by: NAndy Lutomirski <luto@kernel.org>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      c8401dda
  8. 11 6月, 2017 1 次提交
    • W
      KVM: async_pf: avoid async pf injection when in guest mode · 9bc1f09f
      Wanpeng Li 提交于
       INFO: task gnome-terminal-:1734 blocked for more than 120 seconds.
             Not tainted 4.12.0-rc4+ #8
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       gnome-terminal- D    0  1734   1015 0x00000000
       Call Trace:
        __schedule+0x3cd/0xb30
        schedule+0x40/0x90
        kvm_async_pf_task_wait+0x1cc/0x270
        ? __vfs_read+0x37/0x150
        ? prepare_to_swait+0x22/0x70
        do_async_page_fault+0x77/0xb0
        ? do_async_page_fault+0x77/0xb0
        async_page_fault+0x28/0x30
      
      This is triggered by running both win7 and win2016 on L1 KVM simultaneously,
      and then gives stress to memory on L1, I can observed this hang on L1 when
      at least ~70% swap area is occupied on L0.
      
      This is due to async pf was injected to L2 which should be injected to L1,
      L2 guest starts receiving pagefault w/ bogus %cr2(apf token from the host
      actually), and L1 guest starts accumulating tasks stuck in D state in
      kvm_async_pf_task_wait() since missing PAGE_READY async_pfs.
      
      This patch fixes the hang by doing async pf when executing L1 guest.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      9bc1f09f
  9. 04 6月, 2017 1 次提交
  10. 01 6月, 2017 1 次提交
    • Z
      KVM: x86: Fix nmi injection failure when vcpu got blocked · 47a66eed
      ZhuangYanying 提交于
      When spin_lock_irqsave() deadlock occurs inside the guest, vcpu threads,
      other than the lock-holding one, would enter into S state because of
      pvspinlock. Then inject NMI via libvirt API "inject-nmi", the NMI could
      not be injected into vm.
      
      The reason is:
      1 It sets nmi_queued to 1 when calling ioctl KVM_NMI in qemu, and sets
      cpu->kvm_vcpu_dirty to true in do_inject_external_nmi() meanwhile.
      2 It sets nmi_queued to 0 in process_nmi(), before entering guest, because
      cpu->kvm_vcpu_dirty is true.
      
      It's not enough just to check nmi_queued to decide whether to stay in
      vcpu_block() or not. NMI should be injected immediately at any situation.
      Add checking nmi_pending, and testing KVM_REQ_NMI replaces nmi_queued
      in vm_vcpu_has_events().
      
      Do the same change for SMIs.
      Signed-off-by: NZhuang Yanying <ann.zhuangyanying@huawei.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      47a66eed
  11. 20 5月, 2017 3 次提交
    • R
      KVM: x86: zero base3 of unusable segments · f0367ee1
      Radim Krčmář 提交于
      Static checker noticed that base3 could be used uninitialized if the
      segment was not present (useable).  Random stack values probably would
      not pass VMCS entry checks.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Fixes: 1aa36616 ("KVM: x86 emulator: consolidate segment accessors")
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      f0367ee1
    • W
      KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation · cbfc6c91
      Wanpeng Li 提交于
      Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation.
      
      - "inb" instruction to access PIT Mod/Command register (ioport 0x43, write only,
        a read should be ignored) in guest can get a random number.
      - "rep insb" instruction to access PIT register port 0x43 can control memcpy()
        in emulator_pio_in_emulated() to copy max 0x400 bytes but only read 1 bytes,
        which will disclose the unimportant kernel memory in host but no crash.
      
      The similar test program below can reproduce the read out-of-bounds vulnerability:
      
      void hexdump(void *mem, unsigned int len)
      {
              unsigned int i, j;
      
              for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
              {
                      /* print offset */
                      if(i % HEXDUMP_COLS == 0)
                      {
                              printf("0x%06x: ", i);
                      }
      
                      /* print hex data */
                      if(i < len)
                      {
                              printf("%02x ", 0xFF & ((char*)mem)[i]);
                      }
                      else /* end of block, just aligning for ASCII dump */
                      {
                              printf("   ");
                      }
      
                      /* print ASCII dump */
                      if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
                      {
                              for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
                              {
                                      if(j >= len) /* end of block, not really printing */
                                      {
                                              putchar(' ');
                                      }
                                      else if(isprint(((char*)mem)[j])) /* printable char */
                                      {
                                              putchar(0xFF & ((char*)mem)[j]);
                                      }
                                      else /* other char */
                                      {
                                              putchar('.');
                                      }
                              }
                              putchar('\n');
                      }
              }
      }
      
      int main(void)
      {
      	int i;
      	if (iopl(3))
      	{
      		err(1, "set iopl unsuccessfully\n");
      		return -1;
      	}
      	static char buf[0x40];
      
      	/* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x40, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x41, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x42, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x43, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x44, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("mov $0x45, %rdx;");
      	asm volatile ("in %dx,%al;");
      	asm volatile ("stosb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      
      	/* ins port 0x40 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x20, %rcx;");
      	asm volatile ("mov $0x40, %rdx;");
      	asm volatile ("rep insb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      
      	/* ins port 0x43 */
      
      	memset(buf, 0xab, sizeof(buf));
      
      	asm volatile("push %rdi;");
      	asm volatile("mov %0, %%rdi;"::"q"(buf));
      
      	asm volatile ("mov $0x20, %rcx;");
      	asm volatile ("mov $0x43, %rdx;");
      	asm volatile ("rep insb;");
      
      	asm volatile ("pop %rdi;");
      	hexdump(buf, 0x40);
      
      	printf("\n");
      	return 0;
      }
      
      The vcpu->arch.pio_data buffer is used by both in/out instrutions emulation
      w/o clear after using which results in some random datas are left over in
      the buffer. Guest reads port 0x43 will be ignored since it is write only,
      however, the function kernel_pio() can't distigush this ignore from successfully
      reads data from device's ioport. There is no new data fill the buffer from
      port 0x43, however, emulator_pio_in_emulated() will copy the stale data in
      the buffer to the guest unconditionally. This patch fixes it by clearing the
      buffer before in instruction emulation to avoid to grant guest the stale data
      in the buffer.
      
      In addition, string I/O is not supported for in kernel device. So there is no
      iteration to read ioport %RCX times for string I/O. The function kernel_pio()
      just reads one round, and then copy the io size * %RCX to the guest unconditionally,
      actually it copies the one round ioport data w/ other random datas which are left
      over in the vcpu->arch.pio_data buffer to the guest. This patch fixes it by
      introducing the string I/O support for in kernel device in order to grant the right
      ioport datas to the guest.
      
      Before the patch:
      
      0x000000: fe 38 93 93 ff ff ab ab .8......
      0x000008: ab ab ab ab ab ab ab ab ........
      0x000010: ab ab ab ab ab ab ab ab ........
      0x000018: ab ab ab ab ab ab ab ab ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: f6 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
      0x000018: 30 30 20 33 20 20 20 20 00 3
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: f6 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
      0x000018: 30 30 20 33 20 20 20 20 00 3
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      After the patch:
      
      0x000000: 1e 02 f8 00 ff ff ab ab ........
      0x000008: ab ab ab ab ab ab ab ab ........
      0x000010: ab ab ab ab ab ab ab ab ........
      0x000018: ab ab ab ab ab ab ab ab ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: d2 e2 d2 df d2 db d2 d7 ........
      0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
      0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
      0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      
      0x000000: 00 00 00 00 00 00 00 00 ........
      0x000008: 00 00 00 00 00 00 00 00 ........
      0x000010: 00 00 00 00 00 00 00 00 ........
      0x000018: 00 00 00 00 00 00 00 00 ........
      0x000020: ab ab ab ab ab ab ab ab ........
      0x000028: ab ab ab ab ab ab ab ab ........
      0x000030: ab ab ab ab ab ab ab ab ........
      0x000038: ab ab ab ab ab ab ab ab ........
      Reported-by: NMoguofang <moguofang@huawei.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Moguofang <moguofang@huawei.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      cbfc6c91
    • W
      KVM: x86: Fix potential preemption when get the current kvmclock timestamp · e2c2206a
      Wanpeng Li 提交于
       BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/2809
       caller is __this_cpu_preempt_check+0x13/0x20
       CPU: 2 PID: 2809 Comm: qemu-system-x86 Not tainted 4.11.0+ #13
       Call Trace:
        dump_stack+0x99/0xce
        check_preemption_disabled+0xf5/0x100
        __this_cpu_preempt_check+0x13/0x20
        get_kvmclock_ns+0x6f/0x110 [kvm]
        get_time_ref_counter+0x5d/0x80 [kvm]
        kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
        ? kvm_hv_process_stimers+0x2a1/0x8a0 [kvm]
        ? kvm_arch_vcpu_ioctl_run+0xac9/0x1ce0 [kvm]
        kvm_arch_vcpu_ioctl_run+0x5bf/0x1ce0 [kvm]
        kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? kvm_vcpu_ioctl+0x384/0x7b0 [kvm]
        ? __fget+0xf3/0x210
        do_vfs_ioctl+0xa4/0x700
        ? __fget+0x114/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
       RIP: 0033:0x7f9d164ed357
        ? __this_cpu_preempt_check+0x13/0x20
      
      This can be reproduced by run kvm-unit-tests/hyperv_stimer.flat w/
      CONFIG_PREEMPT and CONFIG_DEBUG_PREEMPT enabled.
      
      Safe access to per-CPU data requires a couple of constraints, though: the
      thread working with the data cannot be preempted and it cannot be migrated
      while it manipulates per-CPU variables. If the thread is preempted, the
      thread that replaces it could try to work with the same variables; migration
      to another CPU could also cause confusion. However there is no preemption
      disable when reads host per-CPU tsc rate to calculate the current kvmclock
      timestamp.
      
      This patch fixes it by utilizing get_cpu/put_cpu pair to guarantee both
      __this_cpu_read() and rdtsc() are not preempted.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      e2c2206a
  12. 15 5月, 2017 1 次提交
    • W
      KVM: x86: Fix load damaged SSEx MXCSR register · a575813b
      Wanpeng Li 提交于
      Reported by syzkaller:
      
         BUG: unable to handle kernel paging request at ffffffffc07f6a2e
         IP: report_bug+0x94/0x120
         PGD 348e12067
         P4D 348e12067
         PUD 348e14067
         PMD 3cbd84067
         PTE 80000003f7e87161
      
         Oops: 0003 [#1] SMP
         CPU: 2 PID: 7091 Comm: kvm_load_guest_ Tainted: G           OE   4.11.0+ #8
         task: ffff92fdfb525400 task.stack: ffffbda6c3d04000
         RIP: 0010:report_bug+0x94/0x120
         RSP: 0018:ffffbda6c3d07b20 EFLAGS: 00010202
          do_trap+0x156/0x170
          do_error_trap+0xa3/0x170
          ? kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
          ? mark_held_locks+0x79/0xa0
          ? retint_kernel+0x10/0x10
          ? trace_hardirqs_off_thunk+0x1a/0x1c
          do_invalid_op+0x20/0x30
          invalid_op+0x1e/0x30
         RIP: 0010:kvm_load_guest_fpu.part.175+0x12a/0x170 [kvm]
          ? kvm_load_guest_fpu.part.175+0x1c/0x170 [kvm]
          kvm_arch_vcpu_ioctl_run+0xed6/0x1b70 [kvm]
          kvm_vcpu_ioctl+0x384/0x780 [kvm]
          ? kvm_vcpu_ioctl+0x384/0x780 [kvm]
          ? sched_clock+0x13/0x20
          ? __do_page_fault+0x2a0/0x550
          do_vfs_ioctl+0xa4/0x700
          ? up_read+0x1f/0x40
          ? __do_page_fault+0x2a0/0x550
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
      
      SDM mentioned that "The MXCSR has several reserved bits, and attempting to write
      a 1 to any of these bits will cause a general-protection exception(#GP) to be
      generated". The syzkaller forks' testcase overrides xsave area w/ random values
      and steps on the reserved bits of MXCSR register. The damaged MXCSR register
      values of guest will be restored to SSEx MXCSR register before vmentry. This
      patch fixes it by catching userspace override MXCSR register reserved bits w/
      random values and bails out immediately.
      Reported-by: NAndrey Konovalov <andreyknvl@google.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      a575813b
  13. 09 5月, 2017 1 次提交
    • M
      mm: introduce kv[mz]alloc helpers · a7c3e901
      Michal Hocko 提交于
      Patch series "kvmalloc", v5.
      
      There are many open coded kmalloc with vmalloc fallback instances in the
      tree.  Most of them are not careful enough or simply do not care about
      the underlying semantic of the kmalloc/page allocator which means that
      a) some vmalloc fallbacks are basically unreachable because the kmalloc
      part will keep retrying until it succeeds b) the page allocator can
      invoke a really disruptive steps like the OOM killer to move forward
      which doesn't sound appropriate when we consider that the vmalloc
      fallback is available.
      
      As it can be seen implementing kvmalloc requires quite an intimate
      knowledge if the page allocator and the memory reclaim internals which
      strongly suggests that a helper should be implemented in the memory
      subsystem proper.
      
      Most callers, I could find, have been converted to use the helper
      instead.  This is patch 6.  There are some more relying on __GFP_REPEAT
      in the networking stack which I have converted as well and Eric Dumazet
      was not opposed [2] to convert them as well.
      
      [1] http://lkml.kernel.org/r/20170130094940.13546-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1485273626.16328.301.camel@edumazet-glaptop3.roam.corp.google.com
      
      This patch (of 9):
      
      Using kmalloc with the vmalloc fallback for larger allocations is a
      common pattern in the kernel code.  Yet we do not have any common helper
      for that and so users have invented their own helpers.  Some of them are
      really creative when doing so.  Let's just add kv[mz]alloc and make sure
      it is implemented properly.  This implementation makes sure to not make
      a large memory pressure for > PAGE_SZE requests (__GFP_NORETRY) and also
      to not warn about allocation failures.  This also rules out the OOM
      killer as the vmalloc is a more approapriate fallback than a disruptive
      user visible action.
      
      This patch also changes some existing users and removes helpers which
      are specific for them.  In some cases this is not possible (e.g.
      ext4_kvmalloc, libcfs_kvzalloc) because those seems to be broken and
      require GFP_NO{FS,IO} context which is not vmalloc compatible in general
      (note that the page table allocation is GFP_KERNEL).  Those need to be
      fixed separately.
      
      While we are at it, document that __vmalloc{_node} about unsupported gfp
      mask because there seems to be a lot of confusion out there.
      kvmalloc_node will warn about GFP_KERNEL incompatible (which are not
      superset) flags to catch new abusers.  Existing ones would have to die
      slowly.
      
      [sfr@canb.auug.org.au: f2fs fixup]
        Link: http://lkml.kernel.org/r/20170320163735.332e64b7@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170306103032.2540-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>	[ext4 part]
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7c3e901
  14. 03 5月, 2017 1 次提交
    • P
      Revert "KVM: Support vCPU-based gfn->hva cache" · 4e335d9e
      Paolo Bonzini 提交于
      This reverts commit bbd64115.
      
      I've been sitting on this revert for too long and it unfortunately
      missed 4.11.  It's also the reason why I haven't merged ring-based
      dirty tracking for 4.12.
      
      Using kvm_vcpu_memslots in kvm_gfn_to_hva_cache_init and
      kvm_vcpu_write_guest_offset_cached means that the MSR value can
      now be used to access SMRAM, simply by making it point to an SMRAM
      physical address.  This is problematic because it lets the guest
      OS overwrite memory that it shouldn't be able to touch.
      
      Cc: stable@vger.kernel.org
      Fixes: bbd64115Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4e335d9e
  15. 02 5月, 2017 1 次提交
  16. 27 4月, 2017 4 次提交
    • L
      KVM: x86: fix emulation of RSM and IRET instructions · 6ed071f0
      Ladi Prosek 提交于
      On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
      on hflags is reverted later on in x86_emulate_instruction where hflags are
      overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
      as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
      
      Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
      an instruction is emulated, this commit deletes emul_flags altogether and
      makes the emulator access vcpu->arch.hflags using two new accessors. This
      way all changes, on the emulator side as well as in functions called from
      the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
      
      More details on the bug and its manifestation with Windows and OVMF:
      
        It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
        I believe that the SMM part explains why we started seeing this only with
        OVMF.
      
        KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
        the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
        later on in x86_emulate_instruction we overwrite arch.hflags with
        ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
        The AMD-specific hflag of interest here is HF_NMI_MASK.
      
        When rebooting the system, Windows sends an NMI IPI to all but the current
        cpu to shut them down. Only after all of them are parked in HLT will the
        initiating cpu finish the restart. If NMI is masked, other cpus never get
        the memo and the initiating cpu spins forever, waiting for
        hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
      
      Fixes: a584539b ("KVM: x86: pass the whole hflags field to emulator and back")
      Signed-off-by: NLadi Prosek <lprosek@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6ed071f0
    • A
      KVM: add explicit barrier to kvm_vcpu_kick · cde9af6e
      Andrew Jones 提交于
      kvm_vcpu_kick() must issue a general memory barrier prior to reading
      vcpu->mode in order to ensure correctness of the mutual-exclusion
      memory barrier pattern used with vcpu->requests.  While the cmpxchg
      called from kvm_vcpu_kick():
      
       kvm_vcpu_kick
         kvm_arch_vcpu_should_kick
           kvm_vcpu_exiting_guest_mode
             cmpxchg
      
      implies general memory barriers before and after the operation, that
      implication is only valid when cmpxchg succeeds.  We need an explicit
      barrier for when it fails, otherwise a VCPU thread on its entry path
      that reads zero for vcpu->requests does not exclude the possibility
      the requesting thread sees !IN_GUEST_MODE when it reads vcpu->mode.
      
      kvm_make_all_cpus_request already had a barrier, so we remove it, as
      now it would be redundant.
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cde9af6e
    • R
    • R
      KVM: add kvm_{test,clear}_request to replace {test,clear}_bit · 72875d8a
      Radim Krčmář 提交于
      Users were expected to use kvm_check_request() for testing and clearing,
      but request have expanded their use since then and some users want to
      only test or do a faster clear.
      
      Make sure that requests are not directly accessed with bit operations.
      Reviewed-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: NAndrew Jones <drjones@redhat.com>
      Reviewed-by: NCornelia Huck <cornelia.huck@de.ibm.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      72875d8a
  17. 21 4月, 2017 3 次提交
    • M
      KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK · e891a32e
      Marcelo Tosatti 提交于
      The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
      attempts to disable software suspend from causing "non atomic behaviour" of
      the operation:
      
          Add a helper function to compute the kernel time and convert nanoseconds
          back to CPU specific cycles.  Note that these must not be called in preemptible
          context, as that would mean the kernel could enter software suspend state,
          which would cause non-atomic operation.
      
      However, assume the kernel can enter software suspend at the following 2 points:
      
              ktime_get_ts(&ts);
      1.
      						hypothetical_ktime_get_ts(&ts)
              monotonic_to_bootbased(&ts);
      2.
      
      monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
      performed after point 1 (that is after resuming from software suspend),
      hypothetical_ktime_get_ts()
      
      Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
      which is
      
      	ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code
      
      Note CLOCK_MONOTONIC does not count during suspension.
      
      So remove the irq disablement, which causes the following warning on
      -RT kernels:
      
       With this reasoning, and the -RT bug that the irq disablement causes
       (because spin_lock is now a sleeping lock), remove the IRQ protection as it
       causes:
      
       [ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
       [ 1064.668110] INFO: lockdep is turned off.
       [ 1064.668110] irq event stamp: 0
       [ 1064.668112] hardirqs last  enabled at (0): [<          (null)>]  )
       [ 1064.668116] hardirqs last disabled at (0): [] c0
       [ 1064.668118] softirqs last  enabled at (0): [] c0
       [ 1064.668118] softirqs last disabled at (0): [<          (null)>]  )
       [ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
       [ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
       [ 1064.668123]  ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
       [ 1064.668125]  ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
       [ 1064.668126]  00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
       [ 1064.668126] Call Trace:
       [ 1064.668132]  [] dump_stack+0x19/0x1b
       [ 1064.668135]  [] __might_sleep+0x12d/0x1f0
       [ 1064.668138]  [] rt_spin_lock+0x24/0x60
       [ 1064.668155]  [] __get_kvmclock_ns+0x36/0x110 [k]
       [ 1064.668159]  [] ? futex_wait_queue_me+0x103/0x10
       [ 1064.668171]  [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
       [ 1064.668173]  [] ? futex_wait+0x1ac/0x2a0
      
      v2: notice get_kvmclock_ns with the same problem (Pankaj).
      v3: remove useless helper function (Pankaj).
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e891a32e
    • M
      kvm: better MWAIT emulation for guests · 668fffa3
      Michael S. Tsirkin 提交于
      Guests that are heavy on futexes end up IPI'ing each other a lot. That
      can lead to significant slowdowns and latency increase for those guests
      when running within KVM.
      
      If only a single guest is needed on a host, we have a lot of spare host
      CPU time we can throw at the problem. Modern CPUs implement a feature
      called "MWAIT" which allows guests to wake up sleeping remote CPUs without
      an IPI - thus without an exit - at the expense of never going out of guest
      context.
      
      The decision whether this is something sensible to use should be up to the
      VM admin, so to user space. We can however allow MWAIT execution on systems
      that support it properly hardware wise.
      
      This patch adds a CAP to user space and a KVM cpuid leaf to indicate
      availability of native MWAIT execution. With that enabled, the worst a
      guest can do is waste as many cycles as a "jmp ." would do, so it's not
      a privilege problem.
      
      We consciously do *not* expose the feature in our CPUID bitmap, as most
      people will want to benefit from sleeping vCPUs to allow for over commit.
      Reported-by: N"Gabriel L. Somlo" <gsomlo@gmail.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      [agraf: fix amd, change commit message]
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      668fffa3
    • K
      KVM: x86: virtualize cpuid faulting · db2336a8
      Kyle Huey 提交于
      Hardware support for faulting on the cpuid instruction is not required to
      emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
      MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
      cpuid-induced VM exit checks the cpuid faulting state and the CPL.
      kvm_require_cpl is even kind enough to inject the GP fault for us.
      Signed-off-by: NKyle Huey <khuey@kylehuey.com>
      Reviewed-by: NDavid Matlack <dmatlack@google.com>
      [Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      db2336a8
  18. 13 4月, 2017 8 次提交
  19. 07 4月, 2017 3 次提交
    • J
      kvm: nVMX: Disallow userspace-injected exceptions in guest mode · 28d06353
      Jim Mattson 提交于
      The userspace exception injection API and code path are entirely
      unprepared for exceptions that might cause a VM-exit from L2 to L1, so
      the best course of action may be to simply disallow this for now.
      
      1. The API provides no mechanism for userspace to specify the new DR6
      bits for a #DB exception or the new CR2 value for a #PF
      exception. Presumably, userspace is expected to modify these registers
      directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in
      the event that L1 intercepts the exception, these registers should not
      be changed. Instead, the new values should be provided in the
      exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1).
      
      2. In the case of a userspace-injected #DB, inject_pending_event()
      clears DR7.GD before calling vmx_queue_exception(). However, in the
      event that L1 intercepts the exception, this is too early, because
      DR7.GD should not be modified by a #DB that causes a VM-exit directly
      (Intel SDM vol 3, section 27.1).
      
      3. If the injected exception is a #PF, nested_vmx_check_exception()
      doesn't properly check whether or not L1 is interested in the
      associated error code (using the #PF error code mask and match fields
      from vmcs12). It may either return 0 when it should call
      nested_vmx_vmexit() or vice versa.
      
      4. nested_vmx_check_exception() assumes that it is dealing with a
      hardware-generated exception intercept from L2, with some of the
      relevant details (the VM-exit interruption-information and the exit
      qualification) live in vmcs02. For userspace-injected exceptions, this
      is not the case.
      
      5. prepare_vmcs12() assumes that when its exit_intr_info argument
      specifies valid information with a valid error code that it can VMREAD
      the VM-exit interruption error code from vmcs02. For
      userspace-injected exceptions, this is not the case.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      28d06353
    • D
      KVM: x86: fix user triggerable warning in kvm_apic_accept_events() · 28bf2888
      David Hildenbrand 提交于
      If we already entered/are about to enter SMM, don't allow switching to
      INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events()
      will report a warning.
      
      Same applies if we are already in MP state INIT_RECEIVED and SMM is
      requested to be turned on. Refuse to set the VCPU events in this case.
      
      Fixes: cd7764fe ("KVM: x86: latch INITs while in system management mode")
      Cc: stable@vger.kernel.org # 4.2+
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      28bf2888
    • P
      kvm: make KVM_CAP_COALESCED_MMIO architecture agnostic · 30422558
      Paolo Bonzini 提交于
      Remove code from architecture files that can be moved to virt/kvm, since there
      is already common code for coalesced MMIO.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      [Removed a pointless 'break' after 'return'.]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      30422558