1. 23 9月, 2017 1 次提交
    • J
      x86/asm: Fix inline asm call constraints for Clang · f5caf621
      Josh Poimboeuf 提交于
      For inline asm statements which have a CALL instruction, we list the
      stack pointer as a constraint to convince GCC to ensure the frame
      pointer is set up first:
      
        static inline void foo()
        {
      	register void *__sp asm(_ASM_SP);
      	asm("call bar" : "+r" (__sp))
        }
      
      Unfortunately, that pattern causes Clang to corrupt the stack pointer.
      
      The fix is easy: convert the stack pointer register variable to a global
      variable.
      
      It should be noted that the end result is different based on the GCC
      version.  With GCC 6.4, this patch has exactly the same result as
      before:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9820389		9491555		8816046		8516940
       after	9820389		9491555		8816046		8516940
      
      With GCC 7.2, however, GCC's behavior has changed.  It now changes its
      behavior based on the conversion of the register variable to a global.
      That somehow convinces it to *always* set up the frame pointer before
      inserting *any* inline asm.  (Therefore, listing the variable as an
      output constraint is a no-op and is no longer necessary.)  It's a bit
      overkill, but the performance impact should be negligible.  And in fact,
      there's a nice improvement with frame pointers disabled:
      
      	defconfig	defconfig-nofp	distro		distro-nofp
       before	9796316		9468236		9076191		8790305
       after	9796957		9464267		9076381		8785949
      
      So in summary, while listing the stack pointer as an output constraint
      is no longer necessary for newer versions of GCC, it's still needed for
      older versions.
      Suggested-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Reported-by: NMatthias Kaehlcke <mka@chromium.org>
      Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Dmitriy Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/3db862e970c432ae823cf515c52b54fec8270e0e.1505942196.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f5caf621
  2. 15 9月, 2017 7 次提交
    • J
      kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly · 4f350c6d
      Jim Mattson 提交于
      When emulating a nested VM-entry from L1 to L2, several control field
      validation checks are deferred to the hardware. Should one of these
      validation checks fail, vcpu_vmx_run will set the vmx->fail flag. When
      this happens, the L2 guest state is not loaded (even in part), and
      execution should continue in L1 with the next instruction after the
      VMLAUNCH/VMRESUME.
      
      The VMCS12 is not modified (except for the VM-instruction error
      field), the VMCS12 MSR save/load lists are not processed, and the CPU
      state is not loaded from the VMCS12 host area. Moreover, the vmcs02
      exit reason is stale, so it should not be consulted for any reason.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      4f350c6d
    • J
      kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly · b060ca3b
      Jim Mattson 提交于
      On an early VMLAUNCH/VMRESUME failure (i.e. one which sets the
      VM-instruction error field of the current VMCS), the launch state of
      the current VMCS is not set to "launched," and the VM-exit information
      fields of the current VMCS (including IDT-vectoring information and
      exit reason) are stale.
      
      On a late VMLAUNCH/VMRESUME failure (i.e. one which sets the high bit
      of the exit reason field), the launch state of the current VMCS is not
      set to "launched," and only two of the VM-exit information fields of
      the current VMCS are modified (exit reason and exit
      qualification). The remaining VM-exit information fields of the
      current VMCS (including IDT-vectoring information, in particular) are
      stale.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b060ca3b
    • J
      kvm: nVMX: Remove nested_vmx_succeed after successful VM-entry · 7881f96c
      Jim Mattson 提交于
      After a successful VM-entry, RFLAGS is cleared, with the exception of
      bit 1, which is always set. This is handled by load_vmcs12_host_state.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7881f96c
    • D
      kvm,lapic: Justify use of swait_active() · cc1b4680
      Davidlohr Bueso 提交于
      A comment might serve future readers.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cc1b4680
    • J
      KVM: VMX: Do not BUG() on out-of-bounds guest IRQ · 3a8b0677
      Jan H. Schönherr 提交于
      The value of the guest_irq argument to vmx_update_pi_irte() is
      ultimately coming from a KVM_IRQFD API call. Do not BUG() in
      vmx_update_pi_irte() if the value is out-of bounds. (Especially,
      since KVM as a whole seems to hang after that.)
      
      Instead, print a message only once if we find that we don't have a
      route for a certain IRQ (which can be out-of-bounds or within the
      array).
      
      This fixes CVE-2017-1000252.
      
      Fixes: efc64404 ("KVM: x86: Update IRTE for posted-interrupts")
      Signed-off-by: NJan H. Schönherr <jschoenh@amazon.de>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3a8b0677
    • J
      kvm: nVMX: Don't allow L2 to access the hardware CR8 · 51aa68e7
      Jim Mattson 提交于
      If L1 does not specify the "use TPR shadow" VM-execution control in
      vmcs12, then L0 must specify the "CR8-load exiting" and "CR8-store
      exiting" VM-execution controls in vmcs02. Failure to do so will give
      the L2 VM unrestricted read/write access to the hardware CR8.
      
      This fixes CVE-2017-12154.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      51aa68e7
    • W
      KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously · 9a6e7c39
      Wanpeng Li 提交于
      qemu-system-x86-8600  [004] d..1  7205.687530: kvm_entry: vcpu 2
      qemu-system-x86-8600  [004] ....  7205.687532: kvm_exit: reason EXCEPTION_NMI rip 0xffffffffa921297d info ffffeb2c0e44e018 80000b0e
      qemu-system-x86-8600  [004] ....  7205.687532: kvm_page_fault: address ffffeb2c0e44e018 error_code 0
      qemu-system-x86-8600  [004] ....  7205.687620: kvm_try_async_get_page: gva = 0xffffeb2c0e44e018, gfn = 0x427e4e
      qemu-system-x86-8600  [004] .N..  7205.687628: kvm_async_pf_not_present: token 0x8b002 gva 0xffffeb2c0e44e018
          kworker/4:2-7814  [004] ....  7205.687655: kvm_async_pf_completed: gva 0xffffeb2c0e44e018 address 0x7fcc30c4e000
      qemu-system-x86-8600  [004] ....  7205.687703: kvm_async_pf_ready: token 0x8b002 gva 0xffffeb2c0e44e018
      qemu-system-x86-8600  [004] d..1  7205.687711: kvm_entry: vcpu 2
      
      After running some memory intensive workload in guest, I catch the kworker
      which completes the GUP too quickly, and queues an "Page Ready" #PF exception
      after the "Page not Present" exception before the next vmentry as the above
      trace which will result in #DF injected to guest.
      
      This patch fixes it by clearing the queue for "Page not Present" if "Page Ready"
      occurs before the next vmentry since the GUP has already got the required page
      and shadow page table has already been fixed by "Page Ready" handler.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Fixes: 7c90705b ("KVM: Inject asynchronous page fault into a PV guest if page is swapped out.")
      [Changed indentation and added clearing of injected. - Radim]
      Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
      9a6e7c39
  3. 14 9月, 2017 4 次提交
  4. 13 9月, 2017 5 次提交
  5. 01 9月, 2017 1 次提交
  6. 29 8月, 2017 1 次提交
    • T
      x86/idt: Unify gate_struct handling for 32/64-bit kernels · 64b163fa
      Thomas Gleixner 提交于
      The first 32 bits of gate struct are the same for 32 and 64 bit kernels.
      
      The 32-bit version uses desc_struct and no designated data structure,
      so we need different accessors for 32 and 64 bit kernels.
      
      Aside of that the macros which are necessary to build the 32-bit
      gate descriptor are horrible to read.
      
      Unify the gate structs and switch all code fiddling with it over.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/20170828064957.861974317@linutronix.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
      64b163fa
  7. 26 8月, 2017 1 次提交
  8. 25 8月, 2017 14 次提交
    • J
      kvm: nVMX: Validate the virtual-APIC address on nested VM-entry · 712b12d7
      Jim Mattson 提交于
      According to the SDM, if the "use TPR shadow" VM-execution control is
      1, bits 11:0 of the virtual-APIC address must be 0 and the address
      should set any bits beyond the processor's physical-address width.
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      712b12d7
    • P
      KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state · 38cfd5e3
      Paolo Bonzini 提交于
      The host pkru is restored right after vcpu exit (commit 1be0e61c), so
      KVM_GET_XSAVE will return the host PKRU value instead.  Fix this by
      using the guest PKRU explicitly in fill_xsave and load_xsave.  This
      part is based on a patch by Junkang Fu.
      
      The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state,
      because it could have been changed by userspace since the last time
      it was saved, so skip loading it in kvm_load_guest_fpu.
      Reported-by: NJunkang Fu <junkang.fjk@alibaba-inc.com>
      Cc: Yang Zhang <zy107165@alibaba-inc.com>
      Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      38cfd5e3
    • P
      KVM: x86: simplify handling of PKRU · b9dd21e1
      Paolo Bonzini 提交于
      Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a
      simple comparison against the host value of the register.  The write of
      PKRU in addition can be skipped if the guest has not enabled the feature.
      Once we do this, we need not test OSPKE in the host anymore, because
      guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      The static PKU test is kept to elide the code on older CPUs.
      Suggested-by: NYang Zhang <zy107165@alibaba-inc.com>
      Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b9dd21e1
    • P
      KVM: x86: block guest protection keys unless the host has them enabled · c469268c
      Paolo Bonzini 提交于
      If the host has protection keys disabled, we cannot read and write the
      guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0.  Block
      the PKU cpuid bit in that case.
      
      This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1.
      
      Fixes: 1be0e61c
      Cc: stable@vger.kernel.org
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c469268c
    • W
      KVM: nVMX: Fix trying to cancel vmlauch/vmresume · bfcf83b1
      Wanpeng Li 提交于
      ------------[ cut here ]------------
      WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G        W  OE   4.13.0-rc4+ #11
      RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel]
      Call Trace:
       ? kvm_multiple_exception+0x149/0x170 [kvm]
       ? handle_emulation_failure+0x79/0x230 [kvm]
       ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel]
       ? check_chain_key+0x137/0x1e0
       ? reexecute_instruction.part.168+0x130/0x130 [kvm]
       nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel]
       vmx_queue_exception+0x197/0x300 [kvm_intel]
       kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm]
       ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm]
       ? preempt_count_sub+0x18/0xc0
       ? restart_apic_timer+0x17d/0x300 [kvm]
       ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm]
       ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm]
       kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm]
       ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm]
      
      The flag "nested_run_pending", which can override the decision of which should run
      next, L1 or L2. nested_run_pending=1 means that we *must* run L2 next, not L1. This
      is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to
      be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending
      is especially intended to avoid switching  to L1 in the injection decision-point.
      
      This can be handled just like the other cases in vmx_check_nested_events, instead of
      having a special case in vmx_queue_exception.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      bfcf83b1
    • W
      KVM: X86: Fix loss of exception which has not yet been injected · 664f8e26
      Wanpeng Li 提交于
      vmx_complete_interrupts() assumes that the exception is always injected,
      so it can be dropped by kvm_clear_exception_queue().  However,
      an exception cannot be injected immediately if it is: 1) originally
      destined to a nested guest; 2) trapped to cause a vmexit; 3) happening
      right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true.
      
      This patch applies to exceptions the same algorithm that is used for
      NMIs, replacing exception.reinject with "exception.injected" (equivalent
      to nmi_injected).
      
      exception.pending now represents an exception that is queued and whose
      side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet.
      If exception.pending is true, the exception might result in a nested
      vmexit instead, too (in which case the side effects must not be applied).
      
      exception.injected instead represents an exception that is going to be
      injected into the guest at the next vmentry.
      Reported-by: NRadim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      664f8e26
    • W
      KVM: VMX: use kvm_event_needs_reinjection · 274bba52
      Wanpeng Li 提交于
      Use kvm_event_needs_reinjection() encapsulation.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      274bba52
    • P
      KVM: MMU: speedup update_permission_bitmask · 09f037aa
      Paolo Bonzini 提交于
      update_permission_bitmask currently does a 128-iteration loop to,
      essentially, compute a constant array.  Computing the 8 bits in parallel
      reduces it to 16 iterations, and is enough to speed it up substantially
      because many boolean operations in the inner loop become constants or
      simplify noticeably.
      
      Because update_permission_bitmask is actually the top item in the profile
      for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand
      clock cycles, or up to 30%:
      
                                               before     after
         cpuid                                 35173      25954
         vmcall                                35122      27079
         inl_from_pmtimer                      52635      42675
         inl_from_qemu                         53604      44599
         inl_from_kernel                       38498      30798
         outl_to_kernel                        34508      28816
         wr_tsc_adjust_msr                     34185      26818
         rd_tsc_adjust_msr                     37409      27049
         mmio-no-eventfd:pci-mem               50563      45276
         mmio-wildcard-eventfd:pci-mem         34495      30823
         mmio-datamatch-eventfd:pci-mem        35612      31071
         portio-no-eventfd:pci-io              44925      40661
         portio-wildcard-eventfd:pci-io        29708      27269
         portio-datamatch-eventfd:pci-io       31135      27164
      
      (I wrote a small C program to compare the tables for all values of CR0.WP,
      CR4.SMAP and CR4.SMEP, and they match).
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      09f037aa
    • Y
      KVM: MMU: Expose the LA57 feature to VM. · fd8cb433
      Yu Zhang 提交于
      This patch exposes 5 level page table feature to the VM.
      At the same time, the canonical virtual address checking is
      extended to support both 48-bits and 57-bits address width.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      fd8cb433
    • Y
      KVM: MMU: Add 5 level EPT & Shadow page table support. · 855feb67
      Yu Zhang 提交于
      Extends the shadow paging code, so that 5 level shadow page
      table can be constructed if VM is running in 5 level paging
      mode.
      
      Also extends the ept code, so that 5 level ept table can be
      constructed if maxphysaddr of VM exceeds 48 bits. Unlike the
      shadow logic, KVM should still use 4 level ept table for a VM
      whose physical address width is less than 48 bits, even when
      the VM is running in 5 level paging mode.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      [Unconditionally reset the MMU context in kvm_cpuid_update.
       Changing MAXPHYADDR invalidates the reserved bit bitmasks.
       - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      855feb67
    • Y
      KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL. · 2a7266a8
      Yu Zhang 提交于
      Now we have 4 level page table and 5 level page table in 64 bits
      long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL,
      then we can use PT64_ROOT_5LEVEL for 5 level page table, it's
      helpful to make the code more clear.
      
      Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just
      redefine it to 5 whenever a replacement is needed for 5 level
      paging.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      2a7266a8
    • Y
      KVM: MMU: check guest CR3 reserved bits based on its physical address width. · d1cd3ce9
      Yu Zhang 提交于
      Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the
      reserved bits in CR3. Yet the length of reserved bits in
      guest CR3 should be based on the physical address width
      exposed to the VM. This patch changes CR3 check logic to
      calculate the reserved bits at runtime.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      d1cd3ce9
    • Y
      KVM: x86: Add return value to kvm_cpuid(). · e911eb3b
      Yu Zhang 提交于
      Return false in kvm_cpuid() when it fails to find the cpuid
      entry. Also, this routine(and its caller) is optimized with
      a new argument - check_limit, so that the check_cpuid_limit()
      fall back can be avoided.
      Signed-off-by: NYu Zhang <yu.c.zhang@linux.intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      e911eb3b
    • P
      kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS · 3db13480
      Paolo Bonzini 提交于
      A guest may not be configured to support XSAVES/XRSTORS, even when the host
      does. If the guest does not support XSAVES/XRSTORS, clear the secondary
      execution control so that the processor will raise #UD.
      
      Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the
      IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in
      the VMCS02.
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      3db13480
  9. 24 8月, 2017 5 次提交
  10. 18 8月, 2017 1 次提交