1. 05 9月, 2017 2 次提交
  2. 25 8月, 2017 11 次提交
    • J
      kvm: nVMX: Validate the virtual-APIC address on nested VM-entry · 712b12d7
      Jim Mattson 提交于
      According to the SDM, if the "use TPR shadow" VM-execution control is
      1, bits 11:0 of the virtual-APIC address must be 0 and the address
      should set any bits beyond the processor's physical-address width.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      712b12d7
    • W
      KVM: nVMX: Fix trying to cancel vmlauch/vmresume · bfcf83b1
      Wanpeng Li 提交于
      ------------[ cut here ]------------
      WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc4+ #11
      RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      Call Trace:
       ? kvm_multiple_exception+0x149/0x170 [kvm]
       ? handle_emulation_failure+0x79/0x230 [kvm]
       ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
       ? check_chain_key+0x137/0x1e0
       ? reexecute_instruction.part.168+0x130/0x130 [kvm]
       nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       vmx_queue_exception+0x197/0x300 [kvm_intel]
       kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
       ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
       ? preempt_count_sub+0x18/0xc0
       ? restart_apic_timer+0x17d/0x300 [kvm]
       ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
       ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
       kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
      
      The flag "nested_run_pending", which can override the decision of which should run
      next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
      is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
      be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
      is especially intended to avoid switching  to L1 in the injection decision-point.
      
      This can be handled just like the other cases in vmx_check_nested_events, instead of
      having a special case in vmx_queue_exception.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bfcf83b1
    • W
      KVM: X86: Fix loss of exception which has not yet been injected · 664f8e26
      Wanpeng Li 提交于
      vmx_complete_interrupts() assumes that the exception is always injected,
      so it can be dropped by kvm_clear_exception_queue().  However,
      an exception cannot be injected immediately if it is: 1) originally
      destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
      right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
      
      This patch applies to exceptions the same algorithm that is used for
      NMIs, replacing exception.reinject with "exception.injected" (equivalent
      to nmi_injected).
      
      exception.pending now represents an exception that is queued and whose
      side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
      If exception.pending is true, the exception might result in a nested
      vmexit instead, too (in which case the side effects must not be applied).
      
      exception.injected instead represents an exception that is going to be
      injected into the guest at the next vmentry.
      Reported-by: NRadim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      664f8e26
    • W
      KVM: VMX: use kvm_event_needs_reinjection · 274bba52
      Wanpeng Li 提交于
      Use kvm_event_needs_reinjection() encapsulation.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      274bba52
    • P
      KVM: MMU: speedup update_permission_bitmask · 09f037aa
      Paolo Bonzini 提交于
      update_permission_bitmask currently does a 128-iteration loop to,
      essentially, compute a constant array.  Computing the 8 bits in parallel
      reduces it to 16 iterations, and is enough to speed it up substantially
      because many boolean operations in the inner loop become constants or
      simplify noticeably.
      
      Because update_permission_bitmask is actually the top item in the profile
      for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
      clock cycles, or up to 30%:
      
                                               before     after
         cpuid                                 35173      25954
         vmcall                                35122      27079
         inl_from_pmtimer                      52635      42675
         inl_from_qemu                         53604      44599
         inl_from_kernel                       38498      30798
         outl_to_kernel                        34508      28816
         wr_tsc_adjust_msr                     34185      26818
         rd_tsc_adjust_msr                     37409      27049
         mmio-no-eventfd:pci-mem               50563      45276
         mmio-wildcard-eventfd:pci-mem         34495      30823
         mmio-datamatch-eventfd:pci-mem        35612      31071
         portio-no-eventfd:pci-io              44925      40661
         portio-wildcard-eventfd:pci-io        29708      27269
         portio-datamatch-eventfd:pci-io       31135      27164
      
      (I wrote a small C program to compare the tables for all values of CR0.WP,
      CR4.SMAP and CR4.SMEP, and they match).
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      09f037aa
    • Y
      KVM: MMU: Expose the LA57 feature to VM. · fd8cb433
      Yu Zhang 提交于
      This patch exposes 5 level page table feature to the VM.
      At the same time, the canonical virtual address checking is
      extended to support both 48-bits and 57-bits address width.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fd8cb433
    • Y
      KVM: MMU: Add 5 level EPT & Shadow page table support. · 855feb67
      Yu Zhang 提交于
      Extends the shadow paging code, so that 5 level shadow page
      table can be constructed if VM is running in 5 level paging
      mode.
      
      Also extends the ept code, so that 5 level ept table can be
      constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
      shadow logic, KVM should still use 4 level ept table for a VM
      whose physical address width is less than 48 bits, even when
      the VM is running in 5 level paging mode.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      [Unconditionally reset the MMU context in kvm_cpuid_update.
       Changing MAXPHYADDR invalidates the reserved bit bitmasks.
       - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      855feb67
    • Y
      KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL. · 2a7266a8
      Yu Zhang 提交于
      Now we have 4 level page table and 5 level page table in 64 bits
      long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
      then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
      helpful to make the code more clear.
      
      Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
      redefine it to 5 whenever a replacement is needed for 5 level
      paging.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2a7266a8
    • Y
      KVM: MMU: check guest CR3 reserved bits based on its physical address width. · d1cd3ce9
      Yu Zhang 提交于
      Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
      reserved bits in CR3. Yet the length of reserved bits in
      guest CR3 should be based on the physical address width
      exposed to the VM. This patch changes CR3 check logic to
      calculate the reserved bits at runtime.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1cd3ce9
    • Y
      KVM: x86: Add return value to kvm_cpuid(). · e911eb3b
      Yu Zhang 提交于
      Return false in kvm_cpuid() when it fails to find the cpuid
      entry. Also, this routine(and its caller) is optimized with
      a new argument - check_limit, so that the check_cpuid_limit()
      fall back can be avoided.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e911eb3b
    • P
      kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS · 3db13480
      Paolo Bonzini 提交于
      A guest may not be configured to support XSAVES/XRSTORS, even when the host
      does. If the guest does not support XSAVES/XRSTORS, clear the secondary
      execution control so that the processor will raise #UD.
      
      Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
      IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
      the VMCS02.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3db13480
  3. 24 8月, 2017 5 次提交
  4. 18 8月, 2017 6 次提交
  5. 16 8月, 2017 1 次提交
    • A
      kvm: avoid uninitialized-variable warnings · 076b925d
      Arnd Bergmann 提交于
      When PAGE_OFFSET is not a compile-time constant, we run into
      warnings from the use of kvm_is_error_hva() that the compiler
      cannot optimize out:
      
      arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function '__kvm_gfn_to_hva_cache_init':
      arch/arm/kvm/../../../virt/kvm/kvm_main.c:1978:14: error: 'nr_pages_avail' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      arch/arm/kvm/../../../virt/kvm/kvm_main.c: In function 'gfn_to_page_many_atomic':
      arch/arm/kvm/../../../virt/kvm/kvm_main.c:1660:5: error: 'entry' may be used uninitialized in this function [-Werror=maybe-uninitialized]
      
      This adds fake initializations to the two instances I ran into.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      076b925d
  6. 12 8月, 2017 5 次提交
    • J
      kvm: x86: Disallow illegal IA32_APIC_BASE MSR values · d3802286
      Jim Mattson 提交于
      Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow
      local APIC state transition constraints, but the value written must be
      valid.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d3802286
    • W
      KVM: MMU: Bail out immediately if there is no available mmu page · 26eeb53c
      Wanpeng Li 提交于
      Bailing out immediately if there is no available mmu page to alloc.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      26eeb53c
    • W
      KVM: MMU: Fix softlockup due to mmu_lock is held too long · 42bcbebf
      Wanpeng Li 提交于
      watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
       irq event stamp: 20532
       hardirqs last  enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
       hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
       softirqs last  enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
       softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
       CPU: 5 PID: 3089 Comm: warn_test Tainted: G           OE   4.13.0-rc3+ #8
       RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
       Call Trace:
        make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
        kvm_mmu_load+0x1cf/0x410 [kvm]
        kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
        ? __this_cpu_preempt_check+0x13/0x20
      
      This can be reproduced readily by ept=N and running syzkaller tests since
      many syzkaller testcases don't setup any memory regions. However, if ept=Y
      rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will
      extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES
      which just hide the issue.
      
      I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1,
      so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails
      to zap any pages, however prepare_zap_oldest_mmu_page() always returns true.
      It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock
      softlockup.
      
      This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
      according to whether or not there is mmu page zapped.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      42bcbebf
    • D
      KVM: nVMX: validate eptp pointer · a057e0e2
      David Hildenbrand 提交于
      Let's reuse the function introduced with eptp switching.
      
      We don't explicitly have to check against enable_ept_ad_bits, as this
      is implicitly done when checking against nested_vmx_ept_caps in
      valid_ept_address().
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a057e0e2
    • A
      KVM: MAINTAINERS improvements · a170504f
      Andrew Jones 提交于
      Remove nonexistent files, allow less awkward expressions when
      extracting arch-specific information, and only return relevant
      information when using arch-specific expressions. Additionally
      add include/trace/events/kvm.h, arch/*/include/uapi/asm/kvm*,
      and arch/powerpc/kernel/kvm* to appropriate sections. The arch-
      specific expressions are now:
      
       /KVM/                                        -- All KVM
       /\(KVM\)|\(KVM\/x86\)/                       -- X86
       /\(KVM\)|\(KVM\/x86\)|\(KVM\/amd\)/          -- X86 plus AMD
       /\(KVM\)|\(KVM\/arm\)/                       -- ARM
       /\(KVM\)|\(KVM\/arm\)|\(KVM\/arm64\)/        -- ARM plus ARM64
       /\(KVM\)|\(KVM\/powerpc\)/                   -- POWERPC
       /\(KVM\)|\(KVM\/s390\)/                      -- S390
       /\(KVM\)|\(KVM\/mips\)/                      -- MIPS
      Signed-off-by: NAndrew Jones <drjones@redhat.com>
      Acked-by: NCornelia Huck <cohuck@redhat.com>
      Acked-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a170504f
  7. 10 8月, 2017 3 次提交
    • P
      kvm: nVMX: Add support for fast unprotection of nested guest page tables · eebed243
      Paolo Bonzini 提交于
      This is the same as commit 14727754 ("kvm: svm: Add support for
      additional SVM NPF error codes", 2016-11-23), but for Intel processors.
      In this case, the exit qualification field's bit 8 says whether the
      EPT violation occurred while translating the guest's final physical
      address or rather while translating the guest page tables.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      eebed243
    • B
      KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guest · 64531a3b
      Brijesh Singh 提交于
      Commit 14727754 ("kvm: svm: Add support for additional SVM NPF error
      codes", 2016-11-23) added a new error code to aid nested page fault
      handling.  The commit unprotects (kvm_mmu_unprotect_page) the page when
      we get a NPF due to guest page table walk where the page was marked RO.
      
      However, if an L0->L2 shadow nested page table can also be marked read-only
      when a page is read only in L1's nested page table.  If such a page
      is accessed by L2 while walking page tables it can cause a nested
      page fault (page table walks are write accesses).  However, after
      kvm_mmu_unprotect_page we may get another page fault, and again in an
      endless stream.
      
      To cover this use case, we qualify the new error_code check with
      vcpu->arch.mmu_direct_map so that the error_code check would run on L1
      guest, and not the L2 guest.  This avoids hitting the above scenario.
      
      Fixes: 14727754
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      64531a3b
    • W
      KVM: X86: Fix residual mmio emulation request to userspace · bbeac283
      Wanpeng Li 提交于
      Reported by syzkaller:
      
      The kvm-intel.unrestricted_guest=0
      
         WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         CPU: 5 PID: 1014 Comm: warn_test Tainted: G        W  OE   4.13.0-rc3+ #8
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         Call Trace:
          ? put_pid+0x3a/0x50
          ? rcu_read_lock_sched_held+0x79/0x80
          ? kmem_cache_free+0x2f2/0x350
          kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? __fget+0xfc/0x210
          do_vfs_ioctl+0xa4/0x6a0
          ? __fget+0x11d/0x210
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
          ? __this_cpu_preempt_check+0x13/0x20
      
      The syszkaller folks reported a residual mmio emulation request to userspace
      due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
      incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
      and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
      several threads to launch the same vCPU, the thread which lauch this vCPU after
      the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
      trigger the warning.
      
         #define _GNU_SOURCE
         #include <pthread.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/wait.h>
         #include <sys/types.h>
         #include <sys/stat.h>
         #include <sys/mman.h>
         #include <fcntl.h>
         #include <unistd.h>
         #include <linux/kvm.h>
         #include <stdio.h>
      
         int kvmcpu;
         struct kvm_run *run;
      
         void* thr(void* arg)
         {
           int res;
           res = ioctl(kvmcpu, KVM_RUN, 0);
           printf("ret1=%d exit_reason=%d suberror=%d\n",
               res, run->exit_reason, run->internal.suberror);
           return 0;
         }
      
         void test()
         {
           int i, kvm, kvmvm;
           pthread_t th[4];
      
           kvm = open("/dev/kvm", O_RDWR);
           kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
           kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
           run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
           srand(getpid());
           for (i = 0; i < 4; i++) {
             pthread_create(&th[i], 0, thr, 0);
             usleep(rand() % 10000);
           }
           for (i = 0; i < 4; i++)
             pthread_join(th[i], 0);
         }
      
         int main()
         {
           for (;;) {
             int pid = fork();
             if (pid < 0)
               exit(1);
             if (pid == 0) {
               test();
               exit(0);
             }
             int status;
             while (waitpid(pid, &status, __WALL) != pid) {}
           }
           return 0;
         }
      
      This patch fixes it by resetting the vcpu->mmio_needed once we receive
      the triple fault to avoid the residue.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bbeac283
  8. 08 8月, 2017 4 次提交
  9. 07 8月, 2017 3 次提交