1. 23 9月, 2017 1 次提交
    • J
      x86/asm: Fix inline asm call constraints for Clang · f5caf621
      Josh Poimboeuf 提交于
      For inline asm statements which have a CALL instruction, we list the
      stack pointer as a constraint to convince GCC to ensure the frame
      pointer is set up first:
      
        static inline void foo()
        {
      	register void *__sp asm(_ASM_SP);
      	asm("call bar" : "+r" (__sp))
        }
      
      Unfortunately, that pattern causes Clang to corrupt the stack pointer.
      
      The fix is easy: convert the stack pointer register variable to a global
      variable.
      
      It should be noted that the end result is different based on the GCC
      version.  With GCC 6.4, this patch has exactly the same result as
      before:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9820389		9491555		8816046		8516940
       after	9820389		9491555		8816046		8516940
      
      With GCC 7.2, however, GCC's behavior has changed.  It now changes its
      behavior based on the conversion of the register variable to a global.
      That somehow convinces it to *always* set up the frame pointer before
      inserting *any* inline asm.  (Therefore, listing the variable as an
      output constraint is a no-op and is no longer necessary.)  It's a bit
      overkill, but the performance impact should be negligible.  And in fact,
      there's a nice improvement with frame pointers disabled:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9796316		9468236		9076191		8790305
       after	9796957		9464267		9076381		8785949
      
      So in summary, while listing the stack pointer as an output constraint
      is no longer necessary for newer versions of GCC, it's still needed for
      older versions.
      Suggested-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NMatthias Kaehlcke <mka@chromium.org>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/3db862e970c432ae823cf515c52b54fec8270e0e.1505942196.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f5caf621
  2. 15 9月, 2017 5 次提交
  3. 14 9月, 2017 1 次提交
  4. 13 9月, 2017 1 次提交
  5. 29 8月, 2017 1 次提交
    • T
      x86/idt: Unify gate_struct handling for 32/64-bit kernels · 64b163fa
      Thomas Gleixner 提交于
      The first 32 bits of gate struct are the same for 32 and 64 bit kernels.
      
      The 32-bit version uses desc_struct and no designated data structure,
      so we need different accessors for 32 and 64 bit kernels.
      
      Aside of that the macros which are necessary to build the 32-bit
      gate descriptor are horrible to read.
      
      Unify the gate structs and switch all code fiddling with it over.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20170828064957.861974317@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      64b163fa
  6. 25 8月, 2017 8 次提交
    • J
      kvm: nVMX: Validate the virtual-APIC address on nested VM-entry · 712b12d7
      Jim Mattson 提交于
      According to the SDM, if the "use TPR shadow" VM-execution control is
      1, bits 11:0 of the virtual-APIC address must be 0 and the address
      should set any bits beyond the processor's physical-address width.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      712b12d7
    • P
      KVM: x86: simplify handling of PKRU · b9dd21e1
      Paolo Bonzini 提交于
      Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
      simple comparison against the host value of the register.  The write of
      PKRU in addition can be skipped if the guest has not enabled the feature.
      Once we do this, we need not test OSPKE in the host anymore, because
      guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      The static PKU test is kept to elide the code on older CPUs.
      Suggested-by: NYang Zhang <zy107165@alibaba-inc.com>
      Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b9dd21e1
    • W
      KVM: nVMX: Fix trying to cancel vmlauch/vmresume · bfcf83b1
      Wanpeng Li 提交于
      ------------[ cut here ]------------
      WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc4+ #11
      RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      Call Trace:
       ? kvm_multiple_exception+0x149/0x170 [kvm]
       ? handle_emulation_failure+0x79/0x230 [kvm]
       ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
       ? check_chain_key+0x137/0x1e0
       ? reexecute_instruction.part.168+0x130/0x130 [kvm]
       nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       vmx_queue_exception+0x197/0x300 [kvm_intel]
       kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
       ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
       ? preempt_count_sub+0x18/0xc0
       ? restart_apic_timer+0x17d/0x300 [kvm]
       ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
       ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
       kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
      
      The flag "nested_run_pending", which can override the decision of which should run
      next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
      is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
      be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
      is especially intended to avoid switching  to L1 in the injection decision-point.
      
      This can be handled just like the other cases in vmx_check_nested_events, instead of
      having a special case in vmx_queue_exception.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bfcf83b1
    • W
      KVM: X86: Fix loss of exception which has not yet been injected · 664f8e26
      Wanpeng Li 提交于
      vmx_complete_interrupts() assumes that the exception is always injected,
      so it can be dropped by kvm_clear_exception_queue().  However,
      an exception cannot be injected immediately if it is: 1) originally
      destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
      right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
      
      This patch applies to exceptions the same algorithm that is used for
      NMIs, replacing exception.reinject with "exception.injected" (equivalent
      to nmi_injected).
      
      exception.pending now represents an exception that is queued and whose
      side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
      If exception.pending is true, the exception might result in a nested
      vmexit instead, too (in which case the side effects must not be applied).
      
      exception.injected instead represents an exception that is going to be
      injected into the guest at the next vmentry.
      Reported-by: NRadim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      664f8e26
    • W
      KVM: VMX: use kvm_event_needs_reinjection · 274bba52
      Wanpeng Li 提交于
      Use kvm_event_needs_reinjection() encapsulation.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      274bba52
    • Y
      KVM: MMU: Expose the LA57 feature to VM. · fd8cb433
      Yu Zhang 提交于
      This patch exposes 5 level page table feature to the VM.
      At the same time, the canonical virtual address checking is
      extended to support both 48-bits and 57-bits address width.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fd8cb433
    • Y
      KVM: MMU: Add 5 level EPT & Shadow page table support. · 855feb67
      Yu Zhang 提交于
      Extends the shadow paging code, so that 5 level shadow page
      table can be constructed if VM is running in 5 level paging
      mode.
      
      Also extends the ept code, so that 5 level ept table can be
      constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
      shadow logic, KVM should still use 4 level ept table for a VM
      whose physical address width is less than 48 bits, even when
      the VM is running in 5 level paging mode.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      [Unconditionally reset the MMU context in kvm_cpuid_update.
       Changing MAXPHYADDR invalidates the reserved bit bitmasks.
       - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      855feb67
    • P
      kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS · 3db13480
      Paolo Bonzini 提交于
      A guest may not be configured to support XSAVES/XRSTORS, even when the host
      does. If the guest does not support XSAVES/XRSTORS, clear the secondary
      execution control so that the processor will raise #UD.
      
      Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
      IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
      the VMCS02.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3db13480
  7. 24 8月, 2017 3 次提交
    • J
      kvm: vmx: Raise #UD on unsupported RDSEED · 75f4fc8d
      Jim Mattson 提交于
      A guest may not be configured to support RDSEED, even when the host
      does. If the guest does not support RDSEED, intercept the instruction
      and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting
      in the IA32_VMX_PROCBASED_CTLS2 MSR.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      75f4fc8d
    • J
      kvm: vmx: Raise #UD on unsupported RDRAND · 45ec368c
      Jim Mattson 提交于
      A guest may not be configured to support RDRAND, even when the host
      does. If the guest does not support RDRAND, intercept the instruction
      and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting
      in the IA32_VMX_PROCBASED_CTLS2 MSR.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      45ec368c
    • P
      KVM: VMX: cache secondary exec controls · 80154d77
      Paolo Bonzini 提交于
      Currently, secondary execution controls are divided in three groups:
      
      - static, depending mostly on the module arguments or the processor
        (vmx_secondary_exec_control)
      
      - static, depending on CPUID (vmx_cpuid_update)
      
      - dynamic, depending on nested VMX or local APIC state
      
      Because walking CPUID is expensive, prepare_vmcs02 is using only
      the first group.  This however is unnecessarily complicated.  Just
      cache the static secondary execution controls, and then prepare_vmcs02
      does not need to compute them every time.  Computation of all static
      secondary execution controls is now kept in a single function,
      vmx_compute_secondary_exec_control.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      80154d77
  8. 18 8月, 2017 5 次提交
  9. 12 8月, 2017 1 次提交
  10. 10 8月, 2017 2 次提交
    • P
      kvm: nVMX: Add support for fast unprotection of nested guest page tables · eebed243
      Paolo Bonzini 提交于
      This is the same as commit 14727754 ("kvm: svm: Add support for
      additional SVM NPF error codes", 2016-11-23), but for Intel processors.
      In this case, the exit qualification field's bit 8 says whether the
      EPT violation occurred while translating the guest's final physical
      address or rather while translating the guest page tables.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      eebed243
    • W
      KVM: X86: Fix residual mmio emulation request to userspace · bbeac283
      Wanpeng Li 提交于
      Reported by syzkaller:
      
      The kvm-intel.unrestricted_guest=0
      
         WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         CPU: 5 PID: 1014 Comm: warn_test Tainted: G        W  OE   4.13.0-rc3+ #8
         RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
         Call Trace:
          ? put_pid+0x3a/0x50
          ? rcu_read_lock_sched_held+0x79/0x80
          ? kmem_cache_free+0x2f2/0x350
          kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
          ? __fget+0xfc/0x210
          do_vfs_ioctl+0xa4/0x6a0
          ? __fget+0x11d/0x210
          SyS_ioctl+0x79/0x90
          entry_SYSCALL_64_fastpath+0x23/0xc2
          ? __this_cpu_preempt_check+0x13/0x20
      
      The syszkaller folks reported a residual mmio emulation request to userspace
      due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and
      incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
      and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs
      several threads to launch the same vCPU, the thread which lauch this vCPU after
      the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
      trigger the warning.
      
         #define _GNU_SOURCE
         #include <pthread.h>
         #include <stdio.h>
         #include <stdlib.h>
         #include <string.h>
         #include <sys/wait.h>
         #include <sys/types.h>
         #include <sys/stat.h>
         #include <sys/mman.h>
         #include <fcntl.h>
         #include <unistd.h>
         #include <linux/kvm.h>
         #include <stdio.h>
      
         int kvmcpu;
         struct kvm_run *run;
      
         void* thr(void* arg)
         {
           int res;
           res = ioctl(kvmcpu, KVM_RUN, 0);
           printf("ret1=%d exit_reason=%d suberror=%d\n",
               res, run->exit_reason, run->internal.suberror);
           return 0;
         }
      
         void test()
         {
           int i, kvm, kvmvm;
           pthread_t th[4];
      
           kvm = open("/dev/kvm", O_RDWR);
           kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
           kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
           run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
           srand(getpid());
           for (i = 0; i < 4; i++) {
             pthread_create(&th[i], 0, thr, 0);
             usleep(rand() % 10000);
           }
           for (i = 0; i < 4; i++)
             pthread_join(th[i], 0);
         }
      
         int main()
         {
           for (;;) {
             int pid = fork();
             if (pid < 0)
               exit(1);
             if (pid == 0) {
               test();
               exit(0);
             }
             int status;
             while (waitpid(pid, &status, __WALL) != pid) {}
           }
           return 0;
         }
      
      This patch fixes it by resetting the vcpu->mmio_needed once we receive
      the triple fault to avoid the residue.
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Tested-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bbeac283
  11. 08 8月, 2017 2 次提交
  12. 07 8月, 2017 8 次提交
  13. 03 8月, 2017 2 次提交
    • W
      KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" · 6550c4df
      Wanpeng Li 提交于
      ------------[ cut here ]------------
       WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
       CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7
       RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel]
      Call Trace:
        vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel]
        kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm]
        ? vmx_vcpu_load+0x1be/0x220 [kvm_intel]
        ? kvm_arch_vcpu_load+0x62/0x230 [kvm]
        kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? kvm_vcpu_ioctl+0x340/0x700 [kvm]
        ? __fget+0xfc/0x210
        do_vfs_ioctl+0xa4/0x6a0
        ? __fget+0x11d/0x210
        SyS_ioctl+0x79/0x90
        do_syscall_64+0x8f/0x750
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which
      means that tells the kernel to not make use of any IOAPICs that may be present
      in the system.
      
      Actually external_intr variable in nested_vmx_vmexit() is the req_int_win
      variable passed from vcpu_enter_guest() which means that the L0's userspace
      requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) &&
      L0's userspace reqeusts an irq window) is true, so there is no interrupt which
      L1 requires to inject to L2, we should not attempt to emualte "Acknowledge
      interrupt on exit" for the irq window requirement in this scenario.
      
      This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit"
      if there is no L1 requirement to inject an interrupt to L2.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      [Added code comment to make it obvious that the behavior is not correct.
       We should do a userspace exit with open interrupt window instead of the
       nested VM exit.  This patch still improves the behavior, so it was
       accepted as a (temporary) workaround.]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      6550c4df
    • D
      KVM: nVMX: mark vmcs12 pages dirty on L2 exit · c9f04407
      David Matlack 提交于
      The host physical addresses of L1's Virtual APIC Page and Posted
      Interrupt descriptor are loaded into the VMCS02. The CPU may write
      to these pages via their host physical address while L2 is running,
      bypassing address-translation-based dirty tracking (e.g. EPT write
      protection). Mark them dirty on every exit from L2 to prevent them
      from getting out of sync with dirty tracking.
      
      Also mark the virtual APIC page and the posted interrupt descriptor
      dirty when KVM is virtualizing posted interrupt processing.
      Signed-off-by: NDavid Matlack <dmatlack@google.com>
      Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      c9f04407