1. 12 Aug, 2017 3 commits
    • KVM: MMU: Bail out immediately if there is no available mmu page · 26eeb53c
      Wanpeng Li committed
      Bail out immediately if there is no mmu page available to allocate.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: Fix softlockup due to mmu_lock is held too long · 42bcbebf
      Wanpeng Li committed
      watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089]
       irq event stamp: 20532
       hardirqs last  enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d
       hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0
       softirqs last  enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1
       softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100
       CPU: 5 PID: 3089 Comm: warn_test Tainted: G           OE   4.13.0-rc3+ #8
       RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm]
       Call Trace:
        make_mmu_pages_available.isra.120+0x71/0xc0 [kvm]
        kvm_mmu_load+0x1cf/0x410 [kvm]
        kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        entry_SYSCALL_64_fastpath+0x23/0xc2
        ? __this_cpu_preempt_check+0x13/0x20
      
      This can be reproduced readily with ept=N by running syzkaller tests, since
      many syzkaller testcases don't set up any memory regions. With ept=Y, however,
      an rmode identity map is created, and kvm_mmu_calculate_mmu_pages() then
      extends the number of the VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES,
      which just hides the issue.

      I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1:
      there is one active mmu page on the list and kvm_mmu_prepare_zap_page() fails
      to zap any pages, yet prepare_zap_oldest_mmu_page() always returns true.
      This causes an infinite loop in make_mmu_pages_available() while mmu_lock is
      held, which triggers the soft lockup.

      This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page()
      according to whether any mmu page was actually zapped.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
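
The two MMU commits above (26eeb53c and 42bcbebf) can be illustrated with a heavily simplified model. This is only a sketch: `struct vm_model`, `zap_oldest_page()`, `make_pages_available()` and the `MIN_FREE_PAGES` threshold are hypothetical stand-ins, not the kernel's actual code. It shows why the reclaim loop must stop when a zap attempt frees nothing, and why the caller should get an error when no page is available.

```c
#include <stdbool.h>

/* Hypothetical threshold; stands in for the kernel's minimum free-page count. */
#define MIN_FREE_PAGES 5u

/* Hypothetical, simplified model of a VM's shadow-page accounting. */
struct vm_model {
    unsigned int n_max_mmu_pages;   /* page budget for this VM */
    unsigned int n_used_mmu_pages;  /* pages currently on the active list */
    unsigned int n_zappable;        /* pages that can actually be freed */
};

static unsigned int available_pages(const struct vm_model *vm)
{
    if (vm->n_max_mmu_pages > vm->n_used_mmu_pages)
        return vm->n_max_mmu_pages - vm->n_used_mmu_pages;
    return 0;
}

/* Returns true only if a page was really zapped (the fixed behaviour). */
static bool zap_oldest_page(struct vm_model *vm)
{
    if (vm->n_used_mmu_pages == 0 || vm->n_zappable == 0)
        return false;
    vm->n_used_mmu_pages--;
    vm->n_zappable--;
    return true;
}

/*
 * Returns 0 if at least one page is available afterwards, -1 otherwise.
 * The loop ends as soon as a zap attempt makes no progress, so it cannot
 * spin forever while a lock is held.
 */
static int make_pages_available(struct vm_model *vm)
{
    while (available_pages(vm) < MIN_FREE_PAGES) {
        if (!zap_oldest_page(vm))
            break;
    }
    return available_pages(vm) ? 0 : -1;
}
```

With a budget of 0 and one unzappable page in use (the scenario from the commit message), `make_pages_available()` returns -1 instead of looping.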
    • KVM: nVMX: validate eptp pointer · a057e0e2
      David Hildenbrand committed
      Let's reuse the function introduced with eptp switching.
      
      We don't explicitly have to check against enable_ept_ad_bits, as this
      is implicitly done when checking against nested_vmx_ept_caps in
      valid_ept_address().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 10 Aug, 2017 3 commits
    • kvm: nVMX: Add support for fast unprotection of nested guest page tables · eebed243
      Paolo Bonzini committed
      This is the same as commit 14727754 ("kvm: svm: Add support for
      additional SVM NPF error codes", 2016-11-23), but for Intel processors.
      In this case, the exit qualification field's bit 8 says whether the
      EPT violation occurred while translating the guest's final physical
      address or rather while translating the guest page tables.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
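
The bit-8 test described in the commit above (eebed243) can be sketched as follows. The macro and function names are hypothetical. The polarity (bit 8 set means the fault hit the final guest-physical address, clear means it hit a guest paging-structure access) is my reading of the Intel SDM and is not stated in the commit message. The bit is only meaningful when bit 7 (guest linear address valid) is also set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for two EPT-violation exit qualification bits. */
#define EPT_QUAL_GLA_VALID      (1ULL << 7)  /* guest linear address is valid */
#define EPT_QUAL_GLA_TRANSLATED (1ULL << 8)  /* access was to the final translation */

/*
 * Returns true when the violation occurred on the guest's final physical
 * address, false when it occurred while walking the guest page tables.
 * Callers are assumed to have checked EPT_QUAL_GLA_VALID first.
 */
static bool ept_fault_on_final_address(uint64_t exit_qualification)
{
    return (exit_qualification & EPT_QUAL_GLA_TRANSLATED) != 0;
}
```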
    • KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guest · 64531a3b
      Brijesh Singh committed
      Commit 14727754 ("kvm: svm: Add support for additional SVM NPF error
      codes", 2016-11-23) added a new error code to aid nested page fault
      handling.  The commit unprotects (kvm_mmu_unprotect_page) the page when
      we get a NPF due to guest page table walk where the page was marked RO.
      
      However, an L0->L2 shadow nested page table can also be marked read-only
      when a page is read-only in L1's nested page table.  If such a page
      is accessed by L2 while walking page tables, it can cause a nested
      page fault (page table walks are write accesses).  After
      kvm_mmu_unprotect_page we may then get another page fault, and again in an
      endless stream.
      
      To cover this use case, we qualify the new error_code check with
      vcpu->arch.mmu_direct_map so that the error_code check would run on L1
      guest, and not the L2 guest.  This avoids hitting the above scenario.
      
      Fixes: 14727754
      Cc: stable@vger.kernel.org
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
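
The qualification described in the commit above (64531a3b) can be sketched as a single predicate. This is a minimal illustration, not the kernel's code: the function name is hypothetical, and the bit positions used for the error-code masks are assumptions for the sake of a self-contained example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layout of the page-fault error code, for illustration only. */
#define PFERR_PRESENT    (1ULL << 0)
#define PFERR_WRITE      (1ULL << 1)
#define PFERR_GUEST_PAGE (1ULL << 33)  /* fault during a guest page table walk */

#define PFERR_NESTED_GUEST_PAGE \
    (PFERR_GUEST_PAGE | PFERR_WRITE | PFERR_PRESENT)

/*
 * Unprotect the page only when the MMU maps guest-physical addresses
 * directly (the L1 case). For L2, the shadow nested page table may be
 * read-only for other reasons, so unprotecting would fault again forever.
 */
static bool should_unprotect(bool direct_map, uint64_t error_code)
{
    return direct_map &&
           (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE;
}
```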
    • KVM: X86: Fix residual mmio emulation request to userspace · bbeac283
      Wanpeng Li committed
      Reported by syzkaller:
      
      With kvm-intel.unrestricted_guest=0:
      
         WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         CPU: 5 PID: 1014 Comm: warn_test Tainted: G        W  OE   4.13.0-rc3+ #8
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         Call Trace:
          ? put_pid+0x3a/0x50
          ? rcu_read_lock_sched_held+0x79/0x80
          ? kmem_cache_free+0x2f2/0x350
          kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? __fget+0xfc/0x210
          do_vfs_ioctl+0xa4/0x6a0
          ? __fget+0x11d/0x210
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
          ? __this_cpu_preempt_check+0x13/0x20
      
      The syzkaller folks reported a residual mmio emulation request to userspace:
      vm86 fails to emulate injecting a real mode interrupt (it fails to read CS),
      which incurs a triple fault. The vCPU returns to userspace with
      vcpu->mmio_needed == true and the KVM_EXIT_SHUTDOWN exit reason. However, the
      syzkaller testcase creates several threads that run the same vCPU; a thread
      that launches this vCPU after the thread that got vcpu->mmio_needed == true
      and KVM_EXIT_SHUTDOWN will trigger the warning.
      
         #define _GNU_SOURCE
         #include <pthread.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/ioctl.h>
         #include <sys/wait.h>
         #include <sys/types.h>
         #include <sys/stat.h>
         #include <sys/mman.h>
         #include <fcntl.h>
         #include <unistd.h>
         #include <linux/kvm.h>
      
         int kvmcpu;
         struct kvm_run *run;
      
         void* thr(void* arg)
         {
           int res;
           res = ioctl(kvmcpu, KVM_RUN, 0);
           printf("ret1=%d exit_reason=%d suberror=%d\n",
               res, run->exit_reason, run->internal.suberror);
           return 0;
         }
      
         void test()
         {
           int i, kvm, kvmvm;
           pthread_t th[4];
      
           kvm = open("/dev/kvm", O_RDWR);
           kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
           kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
           run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
           srand(getpid());
           for (i = 0; i < 4; i++) {
             pthread_create(&th[i], 0, thr, 0);
             usleep(rand() % 10000);
           }
           for (i = 0; i < 4; i++)
             pthread_join(th[i], 0);
         }
      
         int main()
         {
           for (;;) {
             int pid = fork();
             if (pid < 0)
               exit(1);
             if (pid == 0) {
               test();
               exit(0);
             }
             int status;
             while (waitpid(pid, &status, __WALL) != pid) {}
           }
           return 0;
         }
      
      This patch fixes it by resetting the vcpu->mmio_needed once we receive
      the triple fault to avoid the residue.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Tested-by: Dmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 08 Aug, 2017 4 commits
  4. 07 Aug, 2017 10 commits
  5. 05 Aug, 2017 1 commit
  6. 04 Aug, 2017 5 commits
  7. 03 Aug, 2017 6 commits
    • KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" · 6550c4df
      Wanpeng Li committed
      ------------[ cut here ]------------
       WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
       CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
       RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
      Call Trace:
        vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
        ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x8f/0x750
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced by booting the L1 guest with the 'noapic' grub
      parameter, which tells the kernel not to make use of any IOAPICs that may be
      present in the system.

      The external_intr variable in nested_vmx_vmexit() is actually the req_int_win
      variable passed from vcpu_enter_guest(), which means that L0's userspace
      requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
      L0's userspace requests an irq window) being true: there is no interrupt that
      L1 needs to inject into L2, so we should not attempt to emulate "Acknowledge
      interrupt on exit" for the irq window request in this scenario.

      This patch fixes it by not attempting to emulate "Acknowledge interrupt on exit"
      if L1 has no interrupt to inject into L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      [Added code comment to make it obvious that the behavior is not correct.
       We should do a userspace exit with open interrupt window instead of the
       nested VM exit.  This patch still improves the behavior, so it was
       accepted as a (temporary) workaround.]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: mark vmcs12 pages dirty on L2 exit · c9f04407
      David Matlack committed
      The host physical addresses of L1's Virtual APIC Page and Posted
      Interrupt descriptor are loaded into the VMCS02. The CPU may write
      to these pages via their host physical address while L2 is running,
      bypassing address-translation-based dirty tracking (e.g. EPT write
      protection). Mark them dirty on every exit from L2 to prevent them
      from getting out of sync with dirty tracking.
      
      Also mark the virtual APIC page and the posted interrupt descriptor
      dirty when KVM is virtualizing posted interrupt processing.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown · 8ca44e88
      David Matlack committed
      According to the Intel SDM, software cannot rely on the current VMCS to be
      coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12
      flushes.
      
      24.11.1 Software Use of Virtual-Machine Control Structures
      ...
        If a logical processor leaves VMX operation, any VMCSs active on
        that logical processor may be corrupted (see below). To prevent
        such corruption of a VMCS that may be used either after a return
        to VMX operation or on another logical processor, software should
        execute VMCLEAR for that VMCS before executing the VMXOFF instruction
        or removing power from the processor (e.g., as part of a transition
        to the S3 and S4 power states).
      ...
      
      This fixes a "suspicious rcu_dereference_check() usage!" warning during
      kvm_vm_release() because nested_release_vmcs12() calls
      kvm_vcpu_write_guest_page() without holding kvm->srcu.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: do not pin the VMCS12 · 9f744c59
      Paolo Bonzini committed
      Since the current implementation of VMCS12 does a memcpy in and out
      of guest memory, we do not need current_vmcs12 and current_vmcs12_page
      anymore.  current_vmptr is enough to read and write the VMCS12.
      
      And David Matlack noted:
      
        This patch also fixes dirty tracking (memslot->dirty_bitmap) of the
        VMCS12 page by using kvm_write_guest. nested_release_page() only marks
        the struct page dirty.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      [Added David Matlack's note and nested_release_page_clean() fix.]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: X86: init irq->level in kvm_pv_kick_cpu_op · ebd28fcb
      Longpeng(Mike) committed
      'lapic_irq' is a local variable and its 'level' field isn't
      initialized, so 'level' holds a random value. This doesn't matter
      functionally, but it makes UBSAN unhappy:
      
      UBSAN: Undefined behaviour in .../lapic.c:...
      load of value 10 is not a valid value for type '_Bool'
      ...
      Call Trace:
       [<ffffffff81f030b6>] dump_stack+0x1e/0x20
       [<ffffffff81f03173>] ubsan_epilogue+0x12/0x55
       [<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162
       [<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm]
       [<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm]
       [<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm]
       [<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm]
       [<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel]
       [<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel]
      ...
      Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
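
The class of bug fixed in the commit above (ebd28fcb) can be sketched like this. The struct and function names are hypothetical and only loosely modelled on the description in the commit message. The point is that a local struct with a `bool` member should be fully initialized before it is passed on, for example with a designated initializer, which zeroes every field not named.

```c
#include <stdbool.h>

/* Hypothetical, reduced stand-in for an interrupt request descriptor. */
struct irq_model {
    unsigned int dest_id;
    unsigned int delivery_mode;
    bool level;  /* reading this uninitialized is what UBSAN complains about */
};

/*
 * Build a "kick" request for the given APIC id. The designated initializer
 * guarantees that every member not listed, including 'level', is zero.
 */
static struct irq_model make_kick_irq(unsigned int apicid)
{
    struct irq_model irq = {
        .dest_id = apicid,
    };
    return irq;
}
```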
    • KVM: X86: Fix loss of pending INIT due to race · f4ef1910
      Wanpeng Li committed
      When an SMP VM starts, an AP may lose an INIT because it receives the INIT
      between kvm_vcpu_ioctl_x86_get_vcpu_events and kvm_vcpu_ioctl_x86_set_vcpu_events.
      
             vcpu 0                             vcpu 1
                                         kvm_vcpu_ioctl_x86_get_vcpu_events
                                           events->smi.latched_init = 0
        send INIT to vcpu1
          set vcpu1's pending_events
                                         kvm_vcpu_ioctl_x86_set_vcpu_events
                                            if (events->smi.latched_init == 0)
                                              clear INIT in pending_events
      
      This patch fixes it by updating the SMM-related flags only if we are in SMM.

      Thanks to Peng Hao for the report and the original commit message.
      Reported-by: Peng Hao <peng.hao2@zte.com.cn>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
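
The interleaving in the commit above (f4ef1910) can be replayed in a single-threaded model. Everything here is a hypothetical simplification (`struct vcpu_model`, `struct events_model`, `get_events()`, `set_events()`); it only illustrates why restoring the latched-INIT state must be skipped when the vCPU is not in SMM.

```c
#include <stdbool.h>

#define PENDING_INIT (1u << 0)  /* hypothetical bit for a pending INIT */

struct vcpu_model {
    unsigned int pending_events;
    bool in_smm;
};

struct events_model {
    bool latched_init;  /* snapshot taken by get_events() */
};

/* Snapshot the vCPU's latched-INIT state for userspace. */
static void get_events(const struct vcpu_model *v, struct events_model *e)
{
    e->latched_init = (v->pending_events & PENDING_INIT) != 0;
}

/*
 * Restore the snapshot. The fixed behaviour touches the SMM-related
 * state only while in SMM, so an INIT that arrived after the snapshot
 * is not wiped out by a stale latched_init == false.
 */
static void set_events(struct vcpu_model *v, const struct events_model *e)
{
    if (!v->in_smm)
        return;
    if (e->latched_init)
        v->pending_events |= PENDING_INIT;
    else
        v->pending_events &= ~PENDING_INIT;
}
```

Replaying the sequence from the diagram (get, INIT arrives, set) on a vCPU outside SMM leaves `PENDING_INIT` set.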
  8. 02 Aug, 2017 5 commits
    • ARM64: dts: marvell: armada-37xx: Fix the number of GPIO on south bridge · d7a65c49
      Gregory CLEMENT committed
      The number of pins in the South Bridge is 30, not 29. There is a fix for
      the pinctrl driver, but a fix is also needed at the device tree level
      for the GPIO.
      
      Fixes: afda007f ("ARM64: dts: marvell: Add pinctrl nodes for Armada
      3700")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Gregory CLEMENT <gregory.clement@free-electrons.com>
    • powerpc/83xx/mpc832x_rdb: fix of_irq_to_resource() error check · f29bb786
      Sergei Shtylyov committed
      of_irq_to_resource() has recently been fixed to return negative error numbers
      along with 0 in case of failure; however, the Freescale MPC832x RDB board
      code still treats only 0 as a failure indication. Fix it up.
      
      Fixes: 7a4228bb ("of: irq: use of_irq_get() in of_irq_to_resource()")
      Signed-off-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
      Acked-by: Scott Wood <oss@buserror.net>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
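
The error-check change in the commit above (f29bb786) comes down to the difference between the two predicates below. The helper names are hypothetical; they take the return value of a lookup that, as the commit message describes, yields a positive IRQ number on success and either 0 or a negative error number on failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Old-style check: only 0 counts as failure, so negative errors slip through. */
static bool irq_lookup_failed_old(int ret)
{
    return ret == 0;
}

/* Updated check: 0 and any negative error number both count as failure. */
static bool irq_lookup_failed(int ret)
{
    return ret <= 0;
}
```

For a return value of `-EINVAL`, the old predicate reports success while the updated one reports failure.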
    • KVM: async_pf: make rcu irq exit if not triggered from idle task · 337c017c
      Wanpeng Li committed
       WARNING: CPU: 5 PID: 1242 at kernel/rcu/tree_plugin.h:323 rcu_note_context_switch+0x207/0x6b0
       CPU: 5 PID: 1242 Comm: unity-settings- Not tainted 4.13.0-rc2+ #1
       RIP: 0010:rcu_note_context_switch+0x207/0x6b0
       Call Trace:
        __schedule+0xda/0xba0
        ? kvm_async_pf_task_wait+0x1b2/0x270
        schedule+0x40/0x90
        kvm_async_pf_task_wait+0x1cc/0x270
        ? prepare_to_swait+0x22/0x70
        do_async_page_fault+0x77/0xb0
        ? do_async_page_fault+0x77/0xb0
        async_page_fault+0x28/0x30
       RIP: 0010:__d_lookup_rcu+0x90/0x1e0
      
      I encountered this when stress-testing async page faults in an L1 guest with
      L2 guests running.

      Commit 9b132fbe (Add rcu user eqs exception hooks for async page
      fault) added rcu_irq_enter/exit() to kvm_async_pf_task_wait() to exit cpu
      idle eqs when needed, protecting the code that needs to use rcu.  However,
      we need to call the pair even if the function calls schedule(), as seen
      in the above backtrace.

      This patch fixes it by informing the RCU subsystem of the irq exit/enter
      towards/away from idle for both n.halted and !n.halted.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: fixes to nested virt interrupt injection · b96fb439
      Paolo Bonzini committed
      There are three issues in nested_vmx_check_exception:
      
      1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported
      by Wanpeng Li;
      
      2) it should rebuild the interruption info and exit qualification fields
      from scratch, as reported by Jim Mattson, because the values from the
      L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes
      a page fault, the EPT misconfig's exit qualification is incorrect).
      
      3) CR2 and DR6 should not be written for exception intercept vmexits
      (CR2 only for AMD).
      
      This patch fixes the first two and adds a comment about the last,
      outlining the fix.
      
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 · 7313c698
      Paolo Bonzini committed
      Do this in the caller of nested_vmx_vmexit instead.
      
      nested_vmx_check_exception was doing a vmwrite to the vmcs02's
      VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move
      the field to vmcs12->vm_exit_intr_error_code.  However that isn't
      possible on pre-Haswell machines.  Moving the vmcs12 write to the
      callers fixes it.
      Reported-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      [Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1,
       thanks to fengguang.wu@intel.com]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
  9. 01 Aug, 2017 2 commits
  10. 31 Jul, 2017 1 commit
    • parisc: Define CONFIG_CPU_BIG_ENDIAN · 74ad3d28
      Babu Moger committed
      While working on enabling queued rwlock on SPARC, I found the following
      code in include/asm-generic/qrwlock.h, which uses CONFIG_CPU_BIG_ENDIAN
      to clear a byte.

      static inline u8 *__qrwlock_write_byte(struct qrwlock *lock)
      {
      	return (u8 *)lock + 3 * IS_BUILTIN(CONFIG_CPU_BIG_ENDIAN);
      }

      The problem is that many of the fixed big-endian architectures don't define
      CPU_BIG_ENDIAN, so the wrong byte is cleared.

      Define CPU_BIG_ENDIAN for the parisc architecture to fix it.
      Signed-off-by: Babu Moger <babu.moger@oracle.com>
      Signed-off-by: Helge Deller <deller@gmx.de>
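
The byte-offset arithmetic in the commit above (74ad3d28) can be tried in userspace. This is a sketch under assumptions: it detects endianness at run time instead of using a kernel config symbol, all names are hypothetical, and it assumes the byte of interest is the least significant byte of a 32-bit word. It shows why the offset must be 3 on big-endian machines and 0 on little-endian ones.

```c
#include <stdint.h>
#include <string.h>

/* Returns 1 on a big-endian machine, 0 on a little-endian one. */
static int is_big_endian(void)
{
    const uint32_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);  /* inspect the byte at the lowest address */
    return first == 0;
}

/* Offset, in bytes, of the least significant byte of a 32-bit word. */
static unsigned int lsb_offset(void)
{
    return 3u * (unsigned int)is_big_endian();
}

/* Clear only the least significant byte, addressing it as a single byte. */
static void clear_low_byte(uint32_t *word)
{
    uint8_t *byte = (uint8_t *)word + lsb_offset();

    *byte = 0;
}
```

If the offset were always 0, as happens when a big-endian architecture leaves the config symbol undefined, `clear_low_byte()` would clear the most significant byte on that machine instead.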