1. 21 Feb, 2017 5 commits
    • x86/kvm/vmx: Defer TR reload after VM exit · b7ffc44d
      Andy Lutomirski authored
      Intel's VMX is daft and resets the hidden TSS limit register to 0x67
      on VMX reload, and the 0x67 is not configurable.  KVM currently
      reloads TR using the LTR instruction on every exit, but this is quite
      slow because LTR is serializing.
      
      The 0x67 limit is entirely harmless unless ioperm() is in use, so
      defer the reload until a task using ioperm() is actually running.
      
      Here's some poorly done benchmarking using kvm-unit-tests:
      
      Before:
      
      cpuid 1313
      vmcall 1195
      mov_from_cr8 11
      mov_to_cr8 17
      inl_from_pmtimer 6770
      inl_from_qemu 6856
      inl_from_kernel 2435
      outl_to_kernel 1402
      
      After:
      
      cpuid 1291
      vmcall 1181
      mov_from_cr8 11
      mov_to_cr8 16
      inl_from_pmtimer 6457
      inl_from_qemu 6209
      inl_from_kernel 2339
      outl_to_kernel 1391
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      [Force-reload TR in invalidate_tss_limit. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
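
The deferral described in the commit above can be sketched as a small userspace model. This is only an illustration under stated assumptions: the struct and every function name below are hypothetical and are not kernel APIs, and the counter merely stands in for executions of the serializing LTR instruction.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct cpu_model {
    bool tss_limit_invalid;   /* a VM exit left the hidden TSS limit at 0x67 */
    unsigned long ltr_count;  /* number of (expensive) TR reloads performed */
};

/* Stand-in for the serializing LTR instruction. */
static void model_reload_tr(struct cpu_model *cpu)
{
    cpu->ltr_count++;
    cpu->tss_limit_invalid = false;
}

/* After a VM exit: reload at once only if the current task uses ioperm(). */
static void model_vm_exit(struct cpu_model *cpu, bool task_uses_ioperm)
{
    if (task_uses_ioperm)
        model_reload_tr(cpu);
    else
        cpu->tss_limit_invalid = true;  /* defer the reload */
}

/* When a task that uses ioperm() is about to run: reload only if needed. */
static void model_switch_to_ioperm_task(struct cpu_model *cpu)
{
    if (cpu->tss_limit_invalid)
        model_reload_tr(cpu);
}
```

In this model, any number of exits without ioperm() in use costs no reload at all, and the first switch to an ioperm() task pays for exactly one.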
    • x86/kvm/vmx: Simplify segment_base() · 8c2e41f7
      Andy Lutomirski authored
      Use actual pointer types for pointers (instead of unsigned long) and
      replace hardcoded constants with the appropriate self-documenting
      macros.
      
      The function is still a bit messy, but this seems a lot better than
      before to me.
      
      This is mostly borrowed from a patch by Thomas Garnier.
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
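
For readers unfamiliar with what a segment-base helper has to do, the following sketch shows how a base address is assembled from the split fields of an x86 descriptor. It is not the kernel's code: the struct and function names are made up for illustration, and the struct simply mirrors the architectural 8-byte descriptor layout.

```c
#include <stdint.h>

/* Illustrative layout of an 8-byte x86 segment descriptor. */
struct seg_desc {
    uint16_t limit0;        /* limit bits 15:0 */
    uint16_t base0;         /* base bits 15:0 */
    uint8_t  base1;         /* base bits 23:16 */
    uint8_t  type_attr;     /* type, S, DPL, P */
    uint8_t  limit1_flags;  /* limit bits 19:16 and flags */
    uint8_t  base2;         /* base bits 31:24 */
};

/* Assemble the 32-bit base from its three scattered fields. */
static uint32_t seg_desc_base(const struct seg_desc *d)
{
    return (uint32_t)d->base0 |
           ((uint32_t)d->base1 << 16) |
           ((uint32_t)d->base2 << 24);
}

/*
 * In 64-bit mode, system descriptors such as the TSS are 16 bytes long;
 * the extra dword (passed here as base3) carries base bits 63:32.
 */
static uint64_t sys_desc_base64(const struct seg_desc *d, uint32_t base3)
{
    return (uint64_t)seg_desc_base(d) | ((uint64_t)base3 << 32);
}
```

The second helper is the kind of "64-bit fixup" that only applies to system segment types, which is why a generic helper is easy to get subtly wrong.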
    • x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels · e28baead
      Andy Lutomirski authored
      It was a bit buggy (it didn't list all segment types that needed
      64-bit fixups), but the bug was irrelevant because it wasn't called
      in any interesting context on 64-bit kernels and was only used for
      data segments on 32-bit kernels.
      
      To avoid confusion, make it explicitly 32-bit only.
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/vmx: Don't fetch the TSS base from the GDT · e0c23063
      Andy Lutomirski authored
      The current CPU's TSS base is a foregone conclusion, so there's no need
      to parse it out of the segment tables.  This should save a couple cycles
      (as STR is surely microcoded and poorly optimized) but, more importantly,
      it's a cleanup and it means that segment_base() will never be called on
      64-bit kernels.
      
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: fix page struct leak in handle_vmon · 06ce521a
      Paolo Bonzini authored
      handle_vmon gets a reference on the VMXON region page,
      but does not release it. Release the reference.
      
      Found by syzkaller; based on a patch by Dmitry.
      Reported-by: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 17 Feb, 2017 2 commits
  3. 15 Feb, 2017 11 commits
    • kvm: nVMX: Refactor nested_vmx_run() · 858e25c0
      Jim Mattson authored
      nested_vmx_run is split into two parts: the part that handles the
      VMLAUNCH/VMRESUME instruction, and the part that modifies the vcpu state
      to transition from VMX root mode to VMX non-root mode. The latter will
      be used when restoring the checkpointed state of a vCPU that was in VMX
      operation when a snapshot was taken.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Split VMCS checks from nested_vmx_run() · ca0bde28
      Jim Mattson authored
      The checks performed on the contents of the vmcs12 are extracted from
      nested_vmx_run so that they can be used to validate a vmcs12 that has
      been restored from a checkpoint.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [Change prepare_vmcs02 and nested_vmx_load_cr3's last argument to u32,
       to match check_vmentry_postreqs.  Update comments for singlestep
       handling. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Refactor nested_get_vmcs12_pages() · 6beb7bd5
      Jim Mattson authored
      Perform the checks on vmcs12 state early, but defer the gpa->hpa lookups
      until after prepare_vmcs02. Later, when we restore the checkpointed
      state of a vCPU in guest mode, we will not be able to do the gpa->hpa
      lookups when the restore is done.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Refactor handle_vmptrld() · a8bc284e
      Jim Mattson authored
      handle_vmptrld is split into two parts: the part that handles the
      VMPTRLD instruction, and the part that establishes the current VMCS
      pointer.  The latter will be used when restoring the checkpointed state
      of a vCPU that had a valid VMCS pointer when a snapshot was taken.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Refactor handle_vmon() · e29acc55
      Jim Mattson authored
      handle_vmon is split into two parts: the part that handles the VMXON
      instruction, and the part that modifies the vcpu state to transition
      from legacy mode to VMX operation. The latter will be used when
      restoring the checkpointed state of a vCPU that was in VMX operation
      when a snapshot was taken.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: nVMX: Prepare for checkpointing L2 state · cf8b84f4
      Jim Mattson authored
      Split prepare_vmcs12 into two parts: the part that stores the current L2
      guest state and the part that sets up the exit information fields. The
      former will be used when checkpointing the vCPU's VMX state.
      
      Modify prepare_vmcs02 so that it can construct a vmcs02 midway through
      L2 execution, using the checkpointed L2 guest state saved into the
      cached vmcs12 above.
      Signed-off-by: Jim Mattson <jmattson@google.com>
      [Rebasing: add from_vmentry argument to prepare_vmcs02 instead of using
       vmx->nested.nested_run_pending, because it is no longer 1 at the
       point prepare_vmcs02 is called. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection · b95234c8
      Paolo Bonzini authored
      Since bf9f6ac8 ("KVM: Update Posted-Interrupts Descriptor when vCPU
      is blocked", 2015-09-18) the posted interrupt descriptor is checked
      unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
      trigger the scan and, if NMIs or SMIs are not involved, we can avoid
      the complicated event injection path.
      
      Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
      there since APICv was introduced.
      
      However, without the KVM_REQ_EVENT safety net KVM needs to be much
      more careful about races between vmx_deliver_posted_interrupt and
      vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
      between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
      If that happens, kvm_trigger_posted_interrupt returns true, but
      smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
      entered with PIR.ON, but the posted interrupt IPI has not been sent
      and the interrupt is only delivered to the guest on the next vmentry
      (if any).  To fix this, disable interrupts before setting vcpu->mode.
      This ensures that the IPI is delayed until the guest enters non-root mode;
      it is then trapped by the processor causing the interrupt to be injected.
      
      Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
      and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
      but it (correctly) doesn't do anything because it sees vcpu->mode ==
      OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
      posted interrupt IPI is pending; this time, the fix for this is to move
      the RVI update after IN_GUEST_MODE.
      
      Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
      though the second could actually happen with VT-d posted interrupts.
      In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
      in another vmentry which would inject the interrupt.
      
      This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: do not scan IRR twice on APICv vmentry · 76dfafd5
      Paolo Bonzini authored
      Calls to apic_find_highest_irr scan IRR twice, once
      in vmx_sync_pir_to_irr and once in apic_search_irr.  Change
      sync_pir_to_irr to get the new maximum IRR from kvm_apic_update_irr;
      now that it does the computation, it can also do the RVI write.
      
      In order to avoid complications in svm.c, make the callback optional.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
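
The idea of folding the "find the highest pending vector" computation into the PIR-to-IRR merge can be illustrated with a simplified, single-threaded model. This is a sketch only: the real code operates on APIC register pages and needs atomic exchanges on the PIR, and the names below are hypothetical.

```c
#include <stdint.h>

#define APIC_VECTORS 256
#define IRR_WORDS    (APIC_VECTORS / 32)

/* Index of the highest set bit in a non-zero 32-bit word. */
static int highest_bit(uint32_t w)
{
    int bit = 0;

    while (w >>= 1)
        bit++;
    return bit;
}

/*
 * Merge posted-interrupt requests (pir) into the IRR and return the
 * highest pending vector, or -1 if none, in a single pass over the words.
 * Non-atomic model; a real implementation must xchg() each PIR word.
 */
static int merge_pir_into_irr(uint32_t pir[IRR_WORDS], uint32_t irr[IRR_WORDS])
{
    int max_irr = -1;

    for (int i = 0; i < IRR_WORDS; i++) {
        irr[i] |= pir[i];
        pir[i] = 0;
        if (irr[i])  /* words are visited low to high, so the last hit wins */
            max_irr = i * 32 + highest_bit(irr[i]);
    }
    return max_irr;
}
```

Because the loop already touches every IRR word, returning the maximum from it avoids a second scan by the caller.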
    • KVM: x86: preparatory changes for APICv cleanups · 810e6def
      Paolo Bonzini authored
      Add return value to __kvm_apic_update_irr/kvm_apic_update_irr.
      Move vmx_sync_pir_to_irr around.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: clear pending interrupts on KVM_SET_LAPIC · 967235d3
      Paolo Bonzini authored
      Pending interrupts might be in the PI descriptor when the
      LAPIC is restored from an external state; we do not want
      them to be injected.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: vmx: Use the hardware provided GPA instead of page walk · db1c056c
      Paolo Bonzini authored
      As in the SVM patch, the guest physical address is passed by
      VMX to x86_emulate_instruction already, so mark the GPA as available
      in vcpu->arch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 08 Feb, 2017 2 commits
  5. 27 Jan, 2017 2 commits
  6. 21 Jan, 2017 1 commit
  7. 09 Jan, 2017 5 commits
  8. 05 Jan, 2017 1 commit
  9. 22 Dec, 2016 1 commit
  10. 19 Dec, 2016 1 commit
  11. 15 Dec, 2016 1 commit
  12. 08 Dec, 2016 8 commits
    • KVM: nVMX: invvpid handling improvements · 16c2aec6
      Jan Dakinevich authored
       - Expose all invalidation types to L1
      
       - Reject the invvpid instruction if L1 passes a zero VPID to a
         single-context invalidation
      Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
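
The operand check described in the commit above might look roughly like the following. The type numbers follow the Intel SDM's INVVPID encoding; the function and enum names are hypothetical, and applying the zero-VPID rule to type 0 as well is an assumption taken from the SDM's description of the instruction rather than from this commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* INVVPID invalidation types as encoded in the Intel SDM. */
enum invvpid_type {
    INVVPID_INDIVIDUAL_ADDR   = 0,
    INVVPID_SINGLE_CONTEXT    = 1,
    INVVPID_ALL_CONTEXT       = 2,
    INVVPID_SINGLE_NON_GLOBAL = 3,
};

/* Returns true if the (type, vpid) pair should be accepted. */
static bool invvpid_operands_ok(uint64_t type, uint16_t vpid)
{
    switch (type) {
    case INVVPID_INDIVIDUAL_ADDR:
    case INVVPID_SINGLE_CONTEXT:
    case INVVPID_SINGLE_NON_GLOBAL:
        return vpid != 0;   /* VPID 0 is reserved for the host */
    case INVVPID_ALL_CONTEXT:
        return true;        /* the VPID operand is ignored */
    default:
        return false;       /* unknown invalidation type */
    }
}
```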
    • KVM: nVMX: check host CR3 on vmentry and vmexit · 1dc35dac
      Ladi Prosek authored
      This commit adds missing host CR3 checks. Before entering guest mode, the value
      of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
      called to set the new CR3 value and check and load PDPTRs.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry · 9ed38ffa
      Ladi Prosek authored
      Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
      as implemented in kvm_set_cr3, in several ways.
      
      * different rules are followed to check CR3 and it is desirable for the caller
      to distinguish between the possible failures
      * PDPTRs are not loaded if PAE paging and nested EPT are both enabled
      * many MMU operations are not necessary
      
      This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
      nested vmentry and vmexit, and makes use of it on the nested vmentry path.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
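
The three differences listed in the commit above can be made concrete with a toy model. Everything here is hypothetical (struct, enum, and function names alike), the "bits above MAXPHYADDR must be zero" rule is an assumed stand-in for the real CR3 validity check, and PDPTE loading is reduced to a single flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* Distinct results so the caller can tell the failure modes apart. */
enum cr3_load_result { CR3_LOAD_OK, CR3_LOAD_BAD_CR3, CR3_LOAD_BAD_PDPTE };

/* Hypothetical vCPU model; field names are illustrative only. */
struct vcpu_model {
    uint64_t cr3;
    unsigned maxphyaddr;    /* guest physical address width, must be < 64 */
    bool pae_paging;        /* guest uses PAE paging */
    bool nested_ept;        /* L1 enabled EPT for L2 */
    bool pdptes_loadable;   /* stand-in for reading and checking the PDPTEs */
    bool pdptrs_loaded;
};

static enum cr3_load_result model_load_cr3(struct vcpu_model *v, uint64_t cr3)
{
    /* Assumed validity rule: no bits set at or above MAXPHYADDR. */
    if (cr3 >> v->maxphyaddr)
        return CR3_LOAD_BAD_CR3;

    /* With PAE and nested EPT both on, the PDPTEs come from the VMCS. */
    if (v->pae_paging && !v->nested_ept) {
        if (!v->pdptes_loadable)
            return CR3_LOAD_BAD_PDPTE;
        v->pdptrs_loaded = true;
    }

    v->cr3 = cr3;   /* plain register write, no further MMU work modeled */
    return CR3_LOAD_OK;
}
```

Returning an enum rather than a bare error flag is one way to let the caller choose the right exit qualification for each failure.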
    • KVM: nVMX: propagate errors from prepare_vmcs02 · ee146c1c
      Ladi Prosek authored
      It is possible that prepare_vmcs02 fails to load the guest state. This
      patch adds the proper error handling for such a case. L1 will receive
      an INVALID_STATE vmexit with the appropriate exit qualification if it
      happens.
      
      A failure to set guest CR3 is the only error propagated from prepare_vmcs02
      at the moment.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT · 7ca29de2
      Ladi Prosek authored
      KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
      PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
      PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
      (see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
      dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
      related issues:
      
      1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
      overwritten in ept_load_pdptrs because the registers are believed to have
      been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
      running with PAE enabled but PDPTRs have been set up by L1.
      
      2) When L2 is about to enable paging and loads its CR3, we, again, attempt
      to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
      that this will succeed (it's just a CR3 load, paging is not enabled yet) and
      if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
      then lost and L2 crashes right after it enables paging.
      
      This patch replaces the kvm_set_cr3 call with a simple register write if PAE
      and EPT are both on. CR3 is not to be interpreted in this case.
      Signed-off-by: Ladi Prosek <lprosek@redhat.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
    • KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry · 5a6a9748
      David Matlack authored
      vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
      VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
      is more faithful to VMCS12.
      
      This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
      1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
      "IA-32e mode guest" would silently be disabled by KVM.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
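
The consistency rule mentioned in the commit above, that VM-entry must fail when "IA-32e mode guest" is 1 while GUEST_CR0.PG is 0, reduces to a one-line predicate. The helper name is hypothetical; CR0.PG being bit 31 is architectural.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_CR0_PG (1ULL << 31)   /* CR0 paging-enable bit */

/*
 * Returns true if the combination is acceptable for VM-entry:
 * an IA-32e mode guest requires paging to be enabled in CR0.
 */
static bool ia32e_entry_consistent(bool ia32e_mode_guest, uint64_t guest_cr0)
{
    return !ia32e_mode_guest || (guest_cr0 & X86_CR0_PG) != 0;
}
```

The ordering fix matters because a helper that silently clears the "IA-32e mode guest" control would make this predicate pass for a configuration that real hardware rejects.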
    • KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID · 8322ebbb
      David Matlack authored
      MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
      be 1 during VMX operation. Since the set of allowed-1 bits is the same
      in and out of VMX operation, we can generate these MSRs entirely from
      the guest's CPUID. This lets userspace avoid having to save/restore
      these MSRs.
      
      This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
      by default. This is saner than the current default of -1ull, which
      includes bits that the host CPU does not support.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
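
Deriving an allowed-1 mask from CPUID can be sketched as below. This is a deliberately partial illustration: it covers only four CR4 bits whose CPUID feature flags are well known, whereas a complete table would cover every CR4 bit. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical snapshot of the guest CPUID words used in this sketch. */
struct cpuid_model {
    uint32_t leaf1_ecx;   /* CPUID.01H:ECX */
    uint32_t leaf1_edx;   /* CPUID.01H:EDX */
    uint32_t leaf7_ebx;   /* CPUID.(EAX=07H,ECX=0):EBX */
};

#define CR4_PAE     (1ULL << 5)
#define CR4_OSXSAVE (1ULL << 18)
#define CR4_SMEP    (1ULL << 20)
#define CR4_SMAP    (1ULL << 21)

/* Build a (partial) CR4 allowed-1 mask from CPUID feature flags. */
static uint64_t cr4_fixed1_from_cpuid(const struct cpuid_model *c)
{
    uint64_t fixed1 = 0;

    if (c->leaf1_edx & (1u << 6))    /* PAE */
        fixed1 |= CR4_PAE;
    if (c->leaf1_ecx & (1u << 26))   /* XSAVE */
        fixed1 |= CR4_OSXSAVE;
    if (c->leaf7_ebx & (1u << 7))    /* SMEP */
        fixed1 |= CR4_SMEP;
    if (c->leaf7_ebx & (1u << 20))   /* SMAP */
        fixed1 |= CR4_SMAP;
    return fixed1;
}
```

Since the mask is a pure function of CPUID, it never needs to be saved or restored separately, which is the point the commit makes about userspace.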
    • KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation · 3899152c
      David Matlack authored
      KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
      all CR0 and CR4 bits are allowed to be 1 during VMX operation.
      
      This does not match real hardware, which disallows the high 32 bits of
      CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
      which are defined in the SDM but missing according to CPUID). A guest
      can induce a VM-entry failure by setting these bits in GUEST_CR0 and
      GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
      valid.
      
      Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
      checks on these registers do not verify must-be-0 bits. Fix these checks
      to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
      
      This patch should introduce no change in behavior in KVM, since these
      MSRs are still -1ULL.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
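
A check that honors both must-be-1 and must-be-0 bits can be written as a small predicate over the FIXED0/FIXED1 pair. The helper name is hypothetical; the semantics assumed are the usual ones, where a bit set in FIXED0 must be 1 and a bit clear in FIXED1 must be 0.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * val is valid if every must-be-1 bit (set in fixed0) is set in val
 * and no must-be-0 bit (clear in fixed1) is set in val.
 */
static bool cr_fixed_bits_valid(uint64_t val, uint64_t fixed0, uint64_t fixed1)
{
    return (val & fixed0) == fixed0 && (val & ~fixed1) == 0;
}
```

With FIXED1 equal to -1ULL the second clause can never fail, which matches the commit's note that the fix introduces no behavior change until the MSR values themselves are tightened.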