1. 22 August 2018 (1 commit)
    • x86: kvm: avoid unused variable warning · 7288bde1
      Authored by Arnd Bergmann
      Removing one of the two accesses of the maxphyaddr variable led to
      a harmless warning:
      
      arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask':
      arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable]
      
      Removing the #ifdef seems to be the nicest workaround, as it
      makes the code look cleaner than adding another #ifdef; a minimal
      reproduction of the warning follows the entry.
      
      Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Cc: stable@vger.kernel.org # L1TF
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
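      A minimal user-space reproduction of the warning pattern (function name
      and values are illustrative, not the kernel's code): when the only read
      of a variable sits under an #ifdef, configurations that do not take the
      branch trip -Werror=unused-variable, and hoisting the use out of the
      #ifdef silences it.

          #include <stdio.h>

          static unsigned long long set_mask(void)
          {
              int maxphyaddr = 36;            /* illustrative value */
              unsigned long long mask = 0;

          #ifdef CONFIG_X86_64
              /* Only use of maxphyaddr: with CONFIG_X86_64 unset, gcc's
               * -Werror=unused-variable fires. The fix above drops the
               * #ifdef so this line is always compiled. */
              mask = 1ULL << maxphyaddr;
          #endif
              return mask;
          }

          int main(void)
          {
              printf("mask=%llx\n", set_mask());
              return 0;
          }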
  2. 15 August 2018 (1 commit)
  3. 06 August 2018 (10 commits)
    • KVM: X86: Implement "send IPI" hypercall · 4180bf1b
      Authored by Wanpeng Li
      Use a hypercall to send IPIs with a single vmexit, instead of one vmexit
      per destination in xAPIC/x2APIC physical mode and one vmexit per cluster
      in x2APIC cluster mode. An Intel guest can enter x2APIC cluster mode when
      interrupt remapping is enabled in QEMU; the latest AMD EPYC parts,
      however, still support only xAPIC mode, which benefits greatly from
      exit-less IPIs. This patchset lets a guest send multicast IPIs, with at
      most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
      hypercall in 32-bit mode; a guest-side sketch follows the entry.
      
      Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads; the VM
      has 80 vCPUs. IPI microbenchmark (https://lkml.org/lkml/2017/12/19/141):
      
      x2apic cluster mode, vanilla
      
       Dry-run:                         0,            2392199 ns
       Self-IPI:                  6907514,           15027589 ns
       Normal IPI:              223910476,          251301666 ns
       Broadcast IPI:                   0,         9282161150 ns
       Broadcast lock:                  0,         8812934104 ns
      
      x2apic cluster mode, pv-ipi
      
       Dry-run:                         0,            2449341 ns
       Self-IPI:                  6720360,           15028732 ns
       Normal IPI:              228643307,          255708477 ns
       Broadcast IPI:                   0,         7572293590 ns  => 22% performance boost
       Broadcast lock:                  0,         8316124651 ns
      
      x2apic physical mode, vanilla
      
       Dry-run:                         0,            3135933 ns
       Self-IPI:                  8572670,           17901757 ns
       Normal IPI:              226444334,          255421709 ns
       Broadcast IPI:                   0,        19845070887 ns
       Broadcast lock:                  0,        19827383656 ns
      
      x2apic physical mode, pv-ipi
      
       Dry-run:                         0,            2446381 ns
       Self-IPI:                  6788217,           15021056 ns
       Normal IPI:              219454441,          249583458 ns
       Broadcast IPI:                   0,         7806540019 ns  => 154% performance boost
       Broadcast lock:                  0,         9143618799 ns
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
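      As a guest-side illustration, here is a hedged sketch of issuing the new
      hypercall. The register ABI (nr in RAX, args in RBX/RCX/RDX/RSI via
      vmcall) matches KVM's standard hypercall convention; the hypercall
      number and the exact argument layout (128-bit APIC-ID bitmap, lowest
      APIC ID in the bitmap, ICR contents) are taken from the patch
      description and should be treated as assumptions, not a drop-in
      implementation.

          /* Sketch of the guest side of the "send IPI" hypercall. */
          static inline long kvm_hypercall4(unsigned int nr, unsigned long p1,
                                            unsigned long p2, unsigned long p3,
                                            unsigned long p4)
          {
              long ret;

              /* Standard KVM hypercall ABI on x86. */
              asm volatile("vmcall"
                           : "=a"(ret)
                           : "a"(nr), "b"(p1), "c"(p2), "d"(p3), "S"(p4)
                           : "memory");
              return ret;
          }

          static void pv_send_ipi(unsigned long bitmap_low,
                                  unsigned long bitmap_high,
                                  unsigned long min_apic_id,
                                  unsigned long icr)
          {
              /* a0/a1: 128-bit destination bitmap of APIC IDs (offset by
               * min_apic_id), a3: ICR contents (vector etc.). The number
               * 10 for KVM_HC_SEND_IPI is an assumption. */
              kvm_hypercall4(10 /* KVM_HC_SEND_IPI */, bitmap_low,
                             bitmap_high, min_apic_id, icr);
          }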
    • KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs() · 74fec5b9
      Authored by Tianyu Lan
      The X86_CR4_OSXSAVE check belongs with the other sregs checks, so move
      it into kvm_valid_sregs().
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Remove CR3_PCID_INVD flag · 208320ba
      Authored by Junaid Shahid
      It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest · 956bf353
      Authored by Junaid Shahid
      When the guest indicates that the TLB doesn't need to be flushed in a
      CR3 switch, we can also skip resyncing the shadow page tables since an
      out-of-sync shadow page table is equivalent to an out-of-sync TLB.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Skip TLB flush on fast CR3 switch when indicated by guest · ade61e28
      Authored by Junaid Shahid
      When PCIDs are enabled, the most significant bit (bit 63) of the source
      operand for a MOV-to-CR3 instruction indicates that the TLB doesn't need
      to be flushed; a sketch of the check follows the entry.
      
      This change enables this optimization for MOV-to-CR3s in the guest
      that have been intercepted by KVM for shadow paging and are handled
      within the fast CR3 switch path.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
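      The bit in question is CR3 bit 63, named X86_CR3_PCID_NOFLUSH in the
      kernel headers. A small self-contained sketch of the check (helper names
      are illustrative):

          #include <stdbool.h>
          #include <stdint.h>

          #define X86_CR3_PCID_NOFLUSH (1ULL << 63)  /* bit 63 of the CR3 operand */

          /* Decide whether an intercepted MOV-to-CR3 may skip the TLB flush:
           * with PCIDs enabled, the guest sets bit 63 to request "switch
           * without flushing". The bit is not part of the physical address
           * and is masked off before the value is used. */
          static bool cr3_switch_skips_flush(uint64_t new_cr3, bool pcid_enabled)
          {
              return pcid_enabled && (new_cr3 & X86_CR3_PCID_NOFLUSH);
          }

          static uint64_t cr3_address_bits(uint64_t new_cr3)
          {
              return new_cr3 & ~X86_CR3_PCID_NOFLUSH;
          }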
    • kvm: x86: Introduce KVM_REQ_LOAD_CR3 · 6e42782f
      Authored by Junaid Shahid
      The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
      current root_hpa.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: Add fast CR3 switch code path · 7c390d35
      Authored by Junaid Shahid
      When using shadow paging, a CR3 switch in the guest results in a VM Exit.
      In the common case, that VM exit doesn't require much processing by KVM.
      However, it does acquire the MMU lock, which can start showing signs of
      contention under some workloads even on a 2 VCPU VM when the guest is
      using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
      lock in the most common cases, e.g. when switching back and forth between
      the kernel and user mode CR3s used by KPTI with no guest page table
      changes in between; a sketch of the fast path follows the entry.
      
      For now, this fast path is implemented only for 64-bit guests and hosts
      to avoid the handling of PDPTEs, but it can be extended later to 32-bit
      guests and/or hosts as well.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
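      A hedged sketch of the fast-path shape. Field and function names are
      illustrative and the real code tracks more state (role bits, sync
      status); the core idea is a lock-free swap with one cached previous
      root.

          #include <stdbool.h>
          #include <stdint.h>

          struct mmu_state {
              uint64_t root_hpa;          /* current shadow root */
              uint64_t root_cr3;          /* guest CR3 it was built for */
              uint64_t prev_root_hpa;     /* previously used root, kept cached */
              uint64_t prev_root_cr3;
          };

          /* On an intercepted CR3 write, if the new CR3 matches the cached
           * previous root -- the common KPTI kernel<->user toggle -- swap
           * roots without taking the MMU lock or rebuilding shadow tables. */
          static bool fast_cr3_switch(struct mmu_state *mmu, uint64_t new_cr3)
          {
              uint64_t hpa, cr3;

              if (new_cr3 != mmu->prev_root_cr3)
                  return false;           /* slow path: take mmu_lock */

              hpa = mmu->root_hpa;
              cr3 = mmu->root_cr3;
              mmu->root_hpa = mmu->prev_root_hpa;
              mmu->root_cr3 = mmu->prev_root_cr3;
              mmu->prev_root_hpa = hpa;
              mmu->prev_root_cr3 = cr3;
              return true;
          }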
    • kvm: nVMX: Introduce KVM_CAP_NESTED_STATE · 8fcc4b59
      Authored by Jim Mattson
      For nested virtualization, L0 KVM manages a bit of state for L2 guests
      that cannot be captured through the currently available IOCTLs. In fact,
      the state captured through all of these IOCTLs is usually a mix of L1
      and L2 state. It also depends on whether the L2 guest was running at
      the moment the process was interrupted to save its state.
      
      With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
      and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
      that is in VMX operation; a userspace sketch follows the entry.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [karahmed@ - rename structs and functions and make them ready for AMD and
                   address previous comments.
                 - handle nested.smm state.
                 - rebase & a bit of refactoring.
                 - Merge 7/8 and 8/8 into one patch. ]
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
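      A sketch of how userspace might drive the new ioctls for save/restore.
      Per the capability's documentation, KVM_CHECK_EXTENSION on
      KVM_CAP_NESTED_STATE returns the maximum state size (0 if unsupported);
      error handling is trimmed and the helper names are illustrative.

          #include <linux/kvm.h>
          #include <stdlib.h>
          #include <sys/ioctl.h>

          static struct kvm_nested_state *save_nested_state(int kvm_fd, int vcpu_fd)
          {
              int max_size = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_NESTED_STATE);
              struct kvm_nested_state *state;

              if (max_size <= 0)
                  return NULL;            /* no nested state to migrate */

              /* Caller allocates the maximum; KVM fills in the actual size. */
              state = calloc(1, max_size);
              state->size = max_size;
              if (ioctl(vcpu_fd, KVM_GET_NESTED_STATE, state) < 0) {
                  free(state);
                  return NULL;
              }
              return state;
          }

          static int restore_nested_state(int vcpu_fd, struct kvm_nested_state *state)
          {
              return ioctl(vcpu_fd, KVM_SET_NESTED_STATE, state);
          }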
    • KVM: x86: do not load vmcs12 pages while still in SMM · 7f7f1ba3
      Authored by Paolo Bonzini
      If the vCPU enters system management mode while running a nested guest,
      RSM starts processing the vmentry while still in SMM.  In that case,
      however, the pages pointed to by the vmcs12 might be incorrectly
      loaded from SMRAM.  To avoid this, delay the handling of the pages
      until just before the next vmentry.  This is done with a new request
      and a new entry in kvm_x86_ops, which we will be able to reuse for
      nested VMX state migration.
      
      Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: ensure all MSRs can always be KVM_GET/SET_MSR'd · 44883f01
      Authored by Paolo Bonzini
      Some of the MSRs returned by GET_MSR_INDEX_LIST currently cannot be sent
      back to KVM_GET_MSR and/or KVM_SET_MSR; either they can never be sent
      back, or they are only accepted under special conditions. This makes the
      API a pain to use.
      
      To avoid this pain, this patch makes it so that the result of the
      get-list ioctl can always be used for host-initiated get and set; a
      round-trip sketch follows the entry. Since we don't have a separate way
      to check for read-only MSRs, this means some Hyper-V MSRs are ignored
      when written. Arguably they should not even be in the result of
      GET_MSR_INDEX_LIST, but I am leaving them there in case userspace is
      using the outcome of GET_MSR_INDEX_LIST to derive the support for the
      corresponding Hyper-V feature.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
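      A sketch of the round-trip this commit guarantees: every index returned
      by KVM_GET_MSR_INDEX_LIST can be fed to KVM_GET_MSRS without the ioctl
      rejecting it. Error handling is trimmed; the two-call probe pattern
      (the first call fails with E2BIG but fills in nmsrs) is the documented
      way to size the list.

          #include <linux/kvm.h>
          #include <stdlib.h>
          #include <sys/ioctl.h>

          static void roundtrip_msrs(int kvm_fd, int vcpu_fd)
          {
              struct kvm_msr_list probe = { .nmsrs = 0 };
              struct kvm_msr_list *list;
              unsigned int i;

              ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe);  /* sets nmsrs */

              list = calloc(1, sizeof(*list) + probe.nmsrs * sizeof(__u32));
              list->nmsrs = probe.nmsrs;
              ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, list);

              for (i = 0; i < list->nmsrs; i++) {
                  struct {
                      struct kvm_msrs hdr;
                      struct kvm_msr_entry entry;
                  } req = { .hdr.nmsrs = 1, .entry.index = list->indices[i] };

                  /* Host-initiated read; after this commit it must succeed. */
                  ioctl(vcpu_fd, KVM_GET_MSRS, &req);
              }
              free(list);
          }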
  4. 05 August 2018 (1 commit)
  5. 15 July 2018 (1 commit)
  6. 05 July 2018 (1 commit)
    • x86/KVM/VMX: Add L1D flush logic · c595ceee
      Authored by Paolo Bonzini
      Add the logic for flushing L1D on VMENTER. The flush depends on the static
      key being enabled and the new l1tf_flush_l1d flag being set; a sketch of
      the flag protocol follows the entry.
      
      The flag is set:
       - Always, if the flush module parameter is 'always'
      
       - Conditionally at:
         - Entry to vcpu_run(), i.e. after executing user space
      
         - From the sched_in notifier, i.e. when switching to a vCPU thread.
      
         - From vmexit handlers which are considered unsafe, i.e. where
           sensitive data can be brought into L1D:
      
           - The emulator, which could be a good target for other speculative
             execution-based threats,
      
           - The MMU, which can bring host page tables into the L1 cache.
      
           - External interrupts
      
           - Nested operations that require the MMU (see above). That is
             vmptrld, vmptrst, vmclear, vmwrite, vmread.
      
           - When handling invept, invvpid
      
      [ tglx: Split out from combo patch and reduced to a single flag ]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
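      A simplified sketch of the flag protocol. The real implementation lives
      in vmx.c behind a static key and flushes via MSR_IA32_FLUSH_CMD or a
      software fill sequence; the names below are illustrative.

          #include <stdbool.h>

          enum l1d_flush_mode { L1D_NEVER, L1D_COND, L1D_ALWAYS };
          static enum l1d_flush_mode l1d_mode = L1D_COND;   /* module param */

          struct vcpu_flags {
              bool l1tf_flush_l1d;
          };

          /* Called from the paths listed above (emulator, MMU, external
           * interrupts, nested MMU operations) that may pull sensitive host
           * data into L1D. */
          static void mark_unsafe_path(struct vcpu_flags *v)
          {
              v->l1tf_flush_l1d = true;
          }

          /* Consumed on VMENTER: flush only when required by the mode. */
          static void maybe_flush_l1d(struct vcpu_flags *v)
          {
              if (l1d_mode == L1D_NEVER)
                  return;
              if (l1d_mode == L1D_COND && !v->l1tf_flush_l1d)
                  return;
              v->l1tf_flush_l1d = false;
              /* ... issue the actual L1D flush here ... */
          }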
  7. 14 June 2018 (1 commit)
  8. 13 June 2018 (1 commit)
    • treewide: kvzalloc() -> kvcalloc() · 778e1cdd
      Authored by Kees Cook
      The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
      patch replaces cases of:
      
              kvzalloc(a * b, gfp)
      
      with:
              kvcalloc(a, b, gfp)
      
      as well as handling cases of:
      
              kvzalloc(a * b * c, gfp)
      
      with:
      
              kvzalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kvcalloc(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kvzalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kvzalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kvzalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kvzalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kvzalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kvzalloc
      + kvcalloc
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kvzalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kvzalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kvzalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kvzalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kvzalloc(sizeof(THING) * C2, ...)
      |
        kvzalloc(sizeof(TYPE) * C2, ...)
      |
        kvzalloc(C1 * C2 * C3, ...)
      |
        kvzalloc(C1 * C2, ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kvzalloc
      + kvcalloc
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
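      The practical effect of the conversion, on an illustrative allocation
      site: kvcalloc() performs the count-times-size multiplication with
      overflow checking, so an attacker-influenced count can no longer wrap
      into a too-small allocation.

          /* Before: the multiplication can overflow silently. */
          entries = kvzalloc(nr_entries * sizeof(*entries), GFP_KERNEL);

          /* After: the count/size product is overflow-checked; on overflow
           * kvcalloc() returns NULL instead of a short buffer. */
          entries = kvcalloc(nr_entries, sizeof(*entries), GFP_KERNEL);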
  9. 12 June 2018 (3 commits)
  10. 04 June 2018 (1 commit)
    • KVM: VMX: Optimize tscdeadline timer latency · c5ce8235
      Authored by Wanpeng Li
      Commit d0659d94 ("KVM: x86: add option to advance tscdeadline
      hrtimer expiration") advances the tscdeadline expiration (when the timer
      is emulated by an hrtimer) so that the latency incurred by the hypervisor
      (apic_timer_fn -> vmentry) can be hidden. This patch adds the same
      advance-expiration support for the case where the tscdeadline timer is
      emulated by the VMX preemption timer, reducing the hypervisor latency
      (handle_preemption_timer -> vmentry). The guest can also set an
      expiration that is very small (for example, in Linux, when an hrtimer
      feeds an expiration in the past); in that case we set delta_tsc to 0,
      leading to an immediate vmexit when delta_tsc is not bigger than the
      advance time. A sketch of the clamping follows the entry.
      
      This patch reduces latency by ~63% (from ~4450 to ~1660 cycles on a
      Haswell desktop) in kvm-unit-tests/tscdeadline_latency when testing
      busy waits.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
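      A hedged sketch of the clamping described above; the names are
      illustrative and the real code converts between guest TSC and host
      cycles before programming the preemption timer.

          #include <stdint.h>

          /* Expire the preemption timer 'advance_cycles' early, but never
           * program a negative delta: a deadline already in the past (or
           * closer than the advance window) becomes delta_tsc = 0, i.e. an
           * immediate vmexit after vmentry. */
          static uint64_t preemption_timer_delta(uint64_t deadline_tsc,
                                                 uint64_t now_tsc,
                                                 uint64_t advance_cycles)
          {
              if (deadline_tsc <= now_tsc)
                  return 0;
              if (deadline_tsc - now_tsc <= advance_cycles)
                  return 0;
              return (deadline_tsc - now_tsc) - advance_cycles;
          }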
  11. 02 June 2018 (1 commit)
  12. 26 May 2018 (2 commits)
  13. 24 May 2018 (1 commit)
    • KVM: x86: Update cpuid properly when CR4.OSXSAVE or CR4.PKE is changed · c4d21882
      Authored by Wei Huang
      The OSXSAVE (function=0x1) and OSPKE (function=0x7, leaf=0x0) CPUID bits
      allow user apps to detect whether the OS has set CR4.OSXSAVE or CR4.PKE.
      KVM is supposed to update these CPUID bits when CR4 is updated, but the
      current KVM code doesn't handle some special cases when the update comes
      from the emulator. Here is one example:
      
        Step 1: guest boots
        Step 2: guest OS enables XSAVE ==> CR4.OSXSAVE=1 and CPUID.OSXSAVE=1
        Step 3: guest hot reboot ==> QEMU resets CR4 to 0, but CPUID.OSXSAVE==1
        Step 4: guest OS checks CPUID.OSXSAVE, sees 1, then executes xgetbv
      
      Step 4 above will cause a #UD and a guest crash because the guest OS
      hasn't turned on OSXSAVE yet. This patch solves the problem by comparing
      old_cr4 with cr4: if the related bits have been changed,
      kvm_update_cpuid() needs to be called (see the sketch after this entry).
      Signed-off-by: Wei Huang <wei@redhat.com>
      Reviewed-by: Bandan Das <bsd@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
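      A sketch of the fix. X86_CR4_OSXSAVE (bit 18) and X86_CR4_PKE (bit 22)
      are the real bit positions; refresh_guest_cpuid() is an illustrative
      stand-in for kvm_update_cpuid().

          #define X86_CR4_OSXSAVE (1UL << 18)
          #define X86_CR4_PKE     (1UL << 22)

          static void refresh_guest_cpuid(void);  /* stand-in for kvm_update_cpuid() */

          /* After any CR4 write -- including those coming from the emulator --
           * refresh guest CPUID if a CPUID-visible bit flipped in either
           * direction. */
          static void post_set_cr4(unsigned long old_cr4, unsigned long cr4)
          {
              if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
                  refresh_guest_cpuid();
          }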
  14. 23 May 2018 (1 commit)
  15. 17 May 2018 (1 commit)
  16. 15 May 2018 (3 commits)
  17. 11 May 2018 (2 commits)
  18. 16 April 2018 (2 commits)
  19. 11 April 2018 (1 commit)
    • X86/KVM: Do not allow DISABLE_EXITS_MWAIT when LAPIC ARAT is not available · 8e9b29b6
      Authored by KarimAllah Ahmed
      If the processor does not have an "Always Running APIC Timer" (aka ARAT),
      we should not give guests direct access to MWAIT. The LAPIC timer would
      stop ticking in deep C-states, so any host deadlines would not wake up
      the host kernel.
      
      The host kernel's intel_idle driver handles this by switching to
      broadcast mode when ARAT is not available and MWAIT is issued with a deep
      C-state that would stop the LAPIC timer. When MWAIT is passed through, we
      cannot tell when MWAIT is issued.
      
      So just disable this capability when LAPIC ARAT is not available (see the
      sketch after this entry). I am not even sure whether any CPUs with VMX
      support lack LAPIC ARAT.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Reported-by: Wanpeng Li <kernellwp@gmail.com>
      Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
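      A sketch of the gate. ARAT ("Always Running APIC Timer") is reported in
      CPUID.06H:EAX[2]; the cpu_has_*() probes below are illustrative
      placeholders, not kernel APIs.

          #include <stdbool.h>

          static bool cpu_has_mwait(void);    /* illustrative feature probes */
          static bool cpu_has_arat(void);     /* CPUID.06H:EAX[2] */

          /* Only let userspace disable MWAIT exits when the LAPIC timer keeps
           * ticking in deep C-states; without ARAT, a guest MWAIT into a deep
           * C-state stops the LAPIC timer and host deadlines are missed. */
          static bool can_disable_mwait_exits(void)
          {
              return cpu_has_mwait() && cpu_has_arat();
          }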
  20. 05 April 2018 (3 commits)
    • kvm: x86: fix a compile warning · 3140c156
      Authored by Peng Hao
      Fix a "warning: no previous prototype" compile warning.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Add Force Emulation Prefix for "emulate the next instruction" · 6c86eedc
      Authored by Wanpeng Li
      There is no easy way to force KVM to run an instruction through the
      emulator; this is by design, as it would expose the x86 emulator as a
      significant attack surface. However, we do wish to exercise the x86
      emulator when testing it (e.g. via kvm-unit-tests). Therefore, this
      patch adds a "force emulation prefix" that raises #UD; KVM traps the
      #UD, and its #UD exit handler matches the prefix and runs the
      instruction that follows it through the x86 emulator. To avoid exposing
      the x86 emulator by default, the feature is gated behind a module
      parameter that is off by default.
      
      A simple testcase here:
      
          #include <stdio.h>
          #include <string.h>

          #define HYPERVISOR_INFO 0x40000000

          /* "ud2a; .ascii \"kvm\"" is the force emulation prefix: the #UD
           * it raises makes KVM emulate the cpuid that follows. */
          #define CPUID(idx, eax, ebx, ecx, edx) \
              asm volatile (\
              "ud2a; .ascii \"kvm\"; cpuid" \
              :"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
                  :"0"(idx) );

          int main(void)
          {
              unsigned int eax, ebx, ecx, edx;
              char string[13];

              CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
              *(unsigned int *)(string + 0) = ebx;
              *(unsigned int *)(string + 4) = ecx;
              *(unsigned int *)(string + 8) = edx;

              string[12] = 0;
              if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
                  printf("kvm guest\n");
              else
                  printf("bare hardware\n");

              return 0;
          }
      Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
      Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Correctly handle usermode exits. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Introduce handle_ud() · 082d06ed
      Authored by Wanpeng Li
      Introduce handle_ud() to handle the invalid-opcode exception (#UD); this
      function will be used by later patches. A sketch of the prefix matching
      that the force-emulation patch layers on top of it follows the entry.
      Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
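      For reference, a hedged sketch of the prefix match that the
      force-emulation patch above builds on top of handle_ud(). The byte
      sequence is "ud2a; .ascii \"kvm\"", i.e. 0f 0b 6b 76 6d; the helper name
      and surrounding plumbing are illustrative.

          #include <stdbool.h>
          #include <stddef.h>
          #include <string.h>

          /* A #UD whose faulting bytes start with ud2 followed by "kvm"
           * means "emulate the instruction that follows the prefix". */
          static bool is_force_emulation_prefix(const unsigned char *insn, size_t len)
          {
              static const unsigned char sig[] = { 0x0f, 0x0b, 'k', 'v', 'm' };

              return len >= sizeof(sig) && memcmp(insn, sig, sizeof(sig)) == 0;
              /* On a match, KVM skips the 5 prefix bytes and hands the rest
               * of the stream to the x86 emulator (guarded by a module
               * parameter that defaults to off). */
          }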
  21. 29 March 2018 (2 commits)
    • KVM: nVMX: Require immediate-exit when event reinjected to L2 and L1 event pending · 1a680e35
      Authored by Liran Alon
      In case of an L2 VMExit to L0 during event delivery, VMCS02 is filled
      with IDT-vectoring info, which vmx_complete_interrupts() makes sure to
      reinject before the next resume of L2.
      
      While handling the VMExit in L0, an IPI could be sent by another L1 vCPU
      to the L1 vCPU which currently runs L2 and exited to L0.
      
      When L0 reaches vcpu_enter_guest() and calls inject_pending_event(),
      it will note that a previous event was reinjected to L2 (by
      IDT-vectoring info) and therefore won't check whether there are pending
      L1 events which require an exit from L2 to L1. Thus, L0 enters L2
      without an immediate VMExit even though there are pending L1 events!
      
      This commit fixes the issue by making sure to check for pending L1
      events even if a previous event was reinjected to L2, and by bailing
      out of inject_pending_event() before evaluating a new pending event
      if an event was already reinjected.
      
      The bug was observed with the following setup:
      * L0 is a 64-CPU machine which runs KVM.
      * L1 is a 16-CPU machine which runs KVM.
      * L0 & L1 run with APICv disabled.
      (Also reproduced with APICv enabled, but the information below is
      easier to analyze with APICv disabled.)
      * L1 runs a 16-CPU L2 Windows Server 2012 R2 guest.
      During L2 boot, L1 hangs completely, and analyzing the hang reveals that
      one L1 vCPU is holding KVM's mmu_lock and is waiting forever on an IPI
      that it has sent to another L1 vCPU, while all other L1 vCPUs are
      attempting to grab mmu_lock. Therefore, all L1 vCPUs are stuck forever
      (as L1 runs with kernel preemption disabled).
      
      Observing /sys/kernel/debug/tracing/trace_pipe reveals the following
      series of events:
      (1) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xfffff802c5dca82f reason: EPT_VIOLATION ext_inf1: 0x0000000000000182
      ext_inf2: 0x00000000800000d2 ext_int: 0x00000000 ext_int_err: 0x00000000
      (2) qemu-system-x86-19054 [028] kvm_apic_accept_irq: apicid f
      vec 252 (Fixed|edge)
      (3) qemu-system-x86-19066 [030] kvm_inj_virq: irq 210
      (4) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      (5) qemu-system-x86-19066 [030] kvm_exit: reason EPT_VIOLATION
      rip 0xffffe00069202690 info 83 0
      (6) qemu-system-x86-19066 [030] kvm_nested_vmexit: rip:
      0xffffe00069202690 reason: EPT_VIOLATION ext_inf1: 0x0000000000000083
      ext_inf2: 0x0000000000000000 ext_int: 0x00000000 ext_int_err: 0x00000000
      (7) qemu-system-x86-19066 [030] kvm_nested_vmexit_inject: reason:
      EPT_VIOLATION ext_inf1: 0x0000000000000083 ext_inf2: 0x0000000000000000
      ext_int: 0x00000000 ext_int_err: 0x00000000
      (8) qemu-system-x86-19066 [030] kvm_entry: vcpu 15
      
      Which can be analyzed as follows:
      (1) L2 VMExit to L0 on EPT_VIOLATION during delivery of vector 0xd2.
      Therefore, vmx_complete_interrupts() will set KVM_REQ_EVENT and reinject
      a pending-interrupt of 0xd2.
      (2) L1 sends an IPI of vector 0xfc (CALL_FUNCTION_VECTOR) to destination
      vCPU 15. This will set the relevant bit in the LAPIC's IRR and set
      KVM_REQ_EVENT.
      (3) L0 reaches vcpu_enter_guest(), which calls inject_pending_event();
      it notes that interrupt 0xd2 was reinjected and therefore calls
      vmx_inject_irq() and returns, without checking for pending L1 events!
      Note that at this point, KVM_REQ_EVENT was cleared by vcpu_enter_guest()
      before calling inject_pending_event().
      (4) L0 resumes L2 without immediate-exit even though there is a pending
      L1 event (The IPI pending in LAPIC's IRR).
      
      We have already reached the buggy scenario, but the events can be
      analyzed further:
      (5+6) L2 VMExit to L0 on EPT_VIOLATION.  This time not during
      event-delivery.
      (7) L0 decides to forward the VMExit to L1 for further handling.
      (8) L0 resumes into L1. Note that because KVM_REQ_EVENT is cleared, the
      LAPIC's IRR is not examined and therefore the IPI is still not delivered
      into L1!
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: x86: Fix misleading comments on handling pending exceptions · a042c26f
      Authored by Liran Alon
      The reason that exception.pending should block re-injection of an
      NMI/interrupt is not described correctly by the comment in the code.
      Instead, the comment describes why a pending exception should be
      injected before a pending NMI/interrupt.
      
      Therefore, move the existing comment to the code block that evaluates a
      new pending event, where it explains why exception.pending is evaluated
      first. In addition, add a new comment describing that exception.pending
      blocks re-injection of an NMI/interrupt because the exception was queued
      while handling a vmexit that was due to the NMI/interrupt delivery.
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@orcle.com>
      [Used a comment from Sean J <sean.j.christopherson@intel.com>. - Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>