1. 17 Oct 2018, 10 commits
    • KVM: x86: hyperv: optimize 'all cpus' case in kvm_hv_flush_tlb() · a812297c
      Vitaly Kuznetsov committed
      We can use 'NULL' to represent the 'all cpus' case in
      kvm_make_vcpus_request_mask() and avoid building a vCPU mask with
      all vCPUs.
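      A minimal sketch of the idea (helper and field names follow upstream
      KVM of that era; the real diff may differ in detail):

          /* Sketch: a NULL vcpu_bitmap now means "request every vCPU",
           * so kvm_hv_flush_tlb() no longer builds an all-ones mask. */
          bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
                                           unsigned long *vcpu_bitmap,
                                           cpumask_var_t tmp)
          {
              struct kvm_vcpu *vcpu;
              bool called = false;
              int i;

              kvm_for_each_vcpu(i, vcpu, kvm) {
                  if (vcpu_bitmap && !test_bit(i, vcpu_bitmap))
                      continue;       /* not targeted */
                  kvm_make_request(req, vcpu);
                  /* ... kick the vCPU, tracking kicked CPUs in 'tmp' ... */
                  called = true;
              }
              return called;
          }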
      Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: hyperv: enforce vp_index < KVM_MAX_VCPUS · 9170200e
      Vitaly Kuznetsov committed
      Hyper-V TLFS (5.0b) states:
      
      > Virtual processors are identified by using an index (VP index). The
      > maximum number of virtual processors per partition supported by the
      > current implementation of the hypervisor can be obtained through CPUID
      > leaf 0x40000005. A virtual processor index must be less than the
      > maximum number of virtual processors per partition.
      
      Forbid userspace from setting VP_INDEX above KVM_MAX_VCPUS. get_vcpu_by_vpidx()
      can now be optimized to bail out early when the supplied vpidx is >= KVM_MAX_VCPUS.
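      A sketch of the resulting lookup (mirroring the upstream helper in
      arch/x86/kvm/hyperv.c; exact code may differ):

          static struct kvm_vcpu *get_vcpu_by_vpidx(struct kvm *kvm, u32 vpidx)
          {
              struct kvm_vcpu *vcpu;
              int i;

              if (vpidx >= KVM_MAX_VCPUS)
                  return NULL;        /* can never be a valid VP_INDEX now */

              /* fast path: VP_INDEX usually equals the vCPU index */
              vcpu = kvm_get_vcpu(kvm, vpidx);
              if (vcpu && vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx)
                  return vcpu;
              kvm_for_each_vcpu(i, vcpu, kvm)
                  if (vcpu_to_hv_vcpu(vcpu)->vp_index == vpidx)
                      return vcpu;
              return NULL;
          }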
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm/x86: return meaningful value from KVM_SIGNAL_MSI · 0624fca9
      Paolo Bonzini committed
      If kvm_apic_map_get_dest_lapic() finds a disabled LAPIC, it will return
      with bitmap==0, and *r == -1 will be returned to userspace.
      
      QEMU may then record "KVM: injection failed, MSI lost
      (Operation not permitted)" in its log, which is quite puzzling.
      Reported-by: Peng Hao <penghao122@sina.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: move definition PT_MAX_HUGEPAGE_LEVEL and KVM_NR_PAGE_SIZES together · 4fef0f49
      Wei Yang committed
      Currently there are two definitions related to huge pages, but they sit
      a little far from each other and seem only loosely connected:

       * KVM_NR_PAGE_SIZES defines the number of different sizes a page could map
       * PT_MAX_HUGEPAGE_LEVEL means the maximum level of huge page

      The number of different sizes a page could map equals the maximum level
      of huge page, which is implied by the current definitions.

      The current arrangement, however, is not kind to readers and future
      developers:

       * KVM_NR_PAGE_SIZES looks like a standalone definition at first sight
       * if we ever need to support more levels, two places need to change

      This patch moves these two definitions closer together, so that readers
      and developers can work with them more comfortably; a sketch of the
      result follows.
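      A sketch of the resulting coupling, where one definition is derived
      from the other (assumed header layout; the actual patch may differ):

          enum {
              PT_PAGE_TABLE_LEVEL   = 1,  /* 4K  */
              PT_DIRECTORY_LEVEL    = 2,  /* 2M  */
              PT_PDPE_LEVEL         = 3,  /* 1G  */
              PT_MAX_HUGEPAGE_LEVEL = PT_PDPE_LEVEL,
          };

          /* Adding a level now automatically updates the page-size count. */
          #define KVM_NR_PAGE_SIZES \
              (PT_MAX_HUGEPAGE_LEVEL - PT_PAGE_TABLE_LEVEL + 1)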
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/VMX: Remove unused function is_external_interrupt() · aaa45da2
      Tianyu Lan committed
      is_external_interrupt() is no longer used, so remove it.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: return 0 in case kvm_mmu_memory_cache has min number of objects · daefb794
      Wei Yang committed
      The code tries to pre-allocate *min* objects, so it is OK to return 0
      when the kvm_mmu_memory_cache already meets that requirement.
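      A sketch of the change (based on mmu_topup_memory_cache() in
      arch/x86/kvm/mmu.c; the early return and the min-aware error path are
      the point):

          static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
                                            struct kmem_cache *base_cache, int min)
          {
              void *obj;

              if (cache->nobjs >= min)
                  return 0;           /* requirement already met */
              while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
                  obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
                  if (!obj)
                      /* allocation failed, but 'min' may still be satisfied */
                      return cache->nobjs >= min ? 0 : -ENOMEM;
                  cache->objects[cache->nobjs++] = obj;
              }
              return 0;
          }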
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail · bd18bffc
      Sean Christopherson committed
      A VMEnter that VMFails (as opposed to VMExits) does not touch host
      state beyond registers that are explicitly noted in the VMFail path,
      e.g. EFLAGS.  Host state does not need to be loaded because VMFail
      is only signaled for consistency checks that occur before the CPU
      starts to load guest state, i.e. there is no need to restore any
      state as nothing has been modified.  But in the case where a VMFail
      is detected by hardware and not by KVM (due to deferring consistency
      checks to hardware), KVM has already loaded some amount of guest
      state.  Luckily, "loaded" only means loaded to KVM's software model,
      i.e. vmcs01 has not been modified.  So, unwind our software model to
      the pre-VMEntry host state.
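      A sketch of the resulting split in nested_vmx_vmexit() (function names
      follow the patch; surrounding logic is elided):

          if (likely(!vmx->fail)) {
              /* genuine VMExit: host state comes from vmcs12, as architected */
              sync_vmcs12(vcpu, vmcs12);
              load_vmcs12_host_state(vcpu, vmcs12);
          } else {
              /* VMFail: vmcs01 was never touched, so rebuild vcpu->arch
               * (CR0/CR3/CR4, EFER, segments, ...) from vmcs01 */
              nested_vmx_restore_host_state(vcpu);
              vmx->fail = 0;
          }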
      
      Not restoring host state in this VMFail path leads to a variety of
      failures because we end up with stale data in vcpu->arch, e.g. CR0,
      CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
      significant delta in the stale data is all but guaranteed to crash
      L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
      
      An alternative to this "soft" reload would be to load host state from
      vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
      wildly inconsistent with respect to the VMX architecture, e.g. an L1
      VMM with separate VMExit and VMFail paths would explode.
      
      Note that this approach does not mean KVM is 100% accurate with
      respect to VMX hardware behavior, even at an architectural level
      (the exact order of consistency checks is microarchitecture specific).
      But 100% emulation accuracy isn't the goal (with this patch), rather
      the goal is to be consistent in the information delivered to L1, e.g.
      a VMExit should not fall-through VMENTER, and a VMFail should not jump
      to HOST_RIP.
      
      This technically reverts commit "5af41573 (KVM: nVMX: Fix mmu
      context after VMLAUNCH/VMRESUME failure)", but retains the core
      aspects of that patch, just in an open coded form due to the need to
      pull state from vmcs01 instead of vmcs12.  Restoring host state
      resolves a variety of issues introduced by commit "4f350c6d
      (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
      which remedied the incorrect behavior of treating VMFail like VMExit
      but in doing so neglected to restore arch state that had been modified
      prior to attempting nested VMEnter.
      
      A sample failure that occurs due to stale vcpu.arch state is a fault
      of some form while emulating an LGDT (due to emulated UMIP) from L1
      after a failed VMEntry to L3, in this case when running the KVM unit
      test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
      case due to a stale arch.cr4.UMIP.
      
      L1:
        BUG: unable to handle kernel paging request at ffffc90000663b9e
        PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
        Oops: 0009 [#1] SMP
        CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:native_load_gdt+0x0/0x10
      
        ...
      
        Call Trace:
         load_fixmap_gdt+0x22/0x30
         __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
         vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
         nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
         vmx_handle_exit+0x246/0x15a0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x600
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4f/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      L0:
        WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
        ...
        CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
        Hardware name: Intel Corporation Kabylake Client platform/KBL S
        RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
      
        ...
      
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x5e0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x49/0xf0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 5af41573 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
      Fixes: 4f350c6d (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Clear reserved bits of #DB exit qualification · cfb634fe
      Jim Mattson committed
      According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
      qualification field for debug exceptions are reserved (cleared to
      0). However, the SDM is incorrect about bit 16 (corresponding to
      DR6.RTM). This bit should be set if a debug exception (#DB) or a
      breakpoint exception (#BP) occurred inside an RTM region while
      advanced debugging of RTM transactional regions was enabled. Note that
      this is the opposite of DR6.RTM, which "indicates (when clear) that a
      debug exception (#DB) or breakpoint exception (#BP) occurred inside an
      RTM region while advanced debugging of RTM transactional regions was
      enabled."
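      The fix builds the exit qualification from DR6 roughly along these
      lines (DR6_FIXED_1 covers the always-set/reserved bits; DR6_RTM is
      bit 16):

          case DB_VECTOR:
              *exit_qual = vcpu->arch.dr6;
              *exit_qual &= ~(DR6_FIXED_1 | DR6_BT);  /* clear reserved bits */
              *exit_qual ^= DR6_RTM;  /* bit 16 has inverted polarity vs DR6.RTM */
              break;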
      
      There is still an issue with stale DR6 bits potentially being
      misreported for the current debug exception.  DR6 should not have been
      modified before vectoring the #DB exception, and the "new DR6 bits"
      should be available somewhere, but it was and they aren't.
      
      Fixes: b96fb439 ("KVM: nVMX: fixes to nested virt interrupt injection")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: LAPIC: Tune lapic_timer_advance_ns automatically · 3b8a5df6
      Wanpeng Li committed
      In cloud environments, lapic_timer_advance_ns needs to be tuned for every
      CPU generation and every host kernel version
      (kvm-unit-tests/tscdeadline_latency.flat reports 5700 cycles on an
      upstream kernel and 9600 cycles on our 3.10 product kernel, both with
      preemption_timer=N, on a Skylake server).

      This patch adds the capability to tune lapic_timer_advance_ns
      automatically, step by step. The initial value is 1000ns, as recommended
      by commit d0659d94 ("KVM: x86: add option to advance tscdeadline hrtimer
      expiration"); it is reduced when the timer fires too early and increased
      when it fires too late. Since guest_tsc and tsc_deadline are hard to make
      exactly equal, we assume we are done when the delta is within a small
      scope, e.g. 100 cycles. This patch reduces latency
      (kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer
      enabled) from ~2600 cycles to ~1200 cycles on our Skylake server.
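      A condensed sketch of the adaptive step (constants and names follow the
      patch: step divisor 8, "done" threshold 100 cycles; the ns conversion
      uses the vCPU's virtual TSC frequency):

          if (!lapic_timer_advance_adjust_done) {
              if (guest_tsc < tsc_deadline) {         /* fired too early */
                  ns = (tsc_deadline - guest_tsc) * 1000000ULL;
                  do_div(ns, vcpu->arch.virtual_tsc_khz);
                  lapic_timer_advance_ns -= min((unsigned int)ns,
                      lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
              } else {                                /* fired too late */
                  ns = (guest_tsc - tsc_deadline) * 1000000ULL;
                  do_div(ns, vcpu->arch.virtual_tsc_khz);
                  lapic_timer_advance_ns += min((unsigned int)ns,
                      lapic_timer_advance_ns / LAPIC_TIMER_ADVANCE_ADJUST_STEP);
              }
              if (abs(guest_tsc - tsc_deadline) < LAPIC_TIMER_ADVANCE_ADJUST_DONE)
                  lapic_timer_advance_adjust_done = true;
          }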
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 13 Oct 2018, 5 commits
  3. 04 Oct 2018, 3 commits
    • kvm: nVMX: fix entry with pending interrupt if APICv is enabled · 7e712684
      Paolo Bonzini committed
      Commit b5861e5c introduced a check on
      the interrupt-window and NMI-window CPU execution controls in order to
      inject an external interrupt vmexit before the first guest instruction
      executes.  However, when APIC virtualization is enabled the host does not
      need a vmexit in order to inject an interrupt at the next interrupt window;
      instead, it just places the interrupt vector in RVI and the processor will
      inject it as soon as possible.  Therefore, on machines with APICv it is
      not enough to check the CPU execution controls: the same scenario can also
      happen if RVI>vPPR.
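      The fix adds a helper along these lines to detect that case:

          static u8 vmx_get_rvi(void)
          {
              return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
          }

          static bool vmx_has_apicv_interrupt(struct kvm_vcpu *vcpu)
          {
              u8 rvi  = vmx_get_rvi();
              u8 vppr = kvm_lapic_get_reg(vcpu->arch.apic, APIC_PROCPRI);

              /* a virtual interrupt is pending iff RVI[7:4] > VPPR[7:4] */
              return ((rvi & 0xf0) > (vppr & 0xf0));
          }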
      
      Fixes: b5861e5c
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: hide flexpriority from guest when disabled at the module level · 2cf7ea9f
      Paolo Bonzini committed
      As of commit 8d860bbe ("kvm: vmx: Basic APIC virtualization controls
      have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when
      a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0,
      whereas previously KVM would allow a nested guest to enable
      VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware.  That is,
      KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't
      (always) allow setting it when kvm-intel.flexpriority=0, and may even
      initially allow the control and then clear it when the nested guest
      writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause
      functional issues.
      
      Hide the control completely when the module parameter is cleared.
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: check for existence of secondary exec controls before accessing · fd6b6d9b
      Sean Christopherson committed
      Return early from vmx_set_virtual_apic_mode() if the processor doesn't
      support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of
      which reside in SECONDARY_VM_EXEC_CONTROL.  This eliminates warnings
      due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing
      on processors without secondary exec controls.
      
      Remove the similar check for TPR shadowing as it is incorporated in the
      flexpriority_enabled check and the APIC-related code in
      vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE.
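      Sketch of the early return (placed before any access to
      SECONDARY_VM_EXEC_CONTROL in vmx_set_virtual_apic_mode()):

          if (!cpu_has_vmx_virtualize_apic_accesses() &&
              !cpu_has_vmx_virtualize_x2apic_mode())
              return;     /* no secondary controls to touch on this CPU */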
      Reported-by: Gerhard Wiesinger <redhat@wiesinger.com>
      Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 01 Oct 2018, 4 commits
    • KVM: x86: fix L1TF's MMIO GFN calculation · daa07cbc
      Sean Christopherson committed
      One defense against L1TF in KVM is to always set the upper five bits
      of the *legal* physical address in the SPTEs for non-present and
      reserved SPTEs, e.g. MMIO SPTEs.  In the MMIO case, the GFN of the
      MMIO SPTE may overlap with the upper five bits that are being usurped
      to defend against L1TF.  To preserve the GFN, the bits of the GFN that
      overlap with the repurposed bits are shifted left into the reserved
      bits, i.e. the GFN in the SPTE will be split into high and low parts.
      When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
      access, get_mmio_spte_gfn() unshifts the affected bits and restores
      the original GFN for comparison.  Unfortunately, get_mmio_spte_gfn()
      neglects to mask off the reserved bits in the SPTE that were used to
      store the upper chunk of the GFN.  As a result, KVM fails to detect
      MMIO accesses whose GPA overlaps the repurposed bits, which in turn
      causes guest panics and hangs.
      
      Fix the bug by generating a mask that covers the lower chunk of the
      GFN, i.e. the bits that aren't shifted by the L1TF mitigation.  The
      alternative approach would be to explicitly zero the five reserved
      bits that are used to store the upper chunk of the GFN, but that
      requires additional run-time computation and makes an already-ugly
      bit of code even more inscrutable.
      
      I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
      warn if GENMASK_ULL() generated a nonsensical value, but that seemed
      silly since that would mean a system that supports VMX has less than
      18 bits of physical address space...
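      The fixed helper looks roughly like this (mask names as in the patch;
      shadow_nonpresent_or_rsvd_lower_gfn_mask is the new mask described
      above):

          static gfn_t get_mmio_spte_gfn(u64 spte)
          {
              u64 gpa = spte & shadow_nonpresent_or_rsvd_lower_gfn_mask;

              /* recover the high GFN chunk from the reserved bits */
              gpa |= (spte >> shadow_nonpresent_or_rsvd_mask_len)
                     & shadow_nonpresent_or_rsvd_mask;

              return gpa >> PAGE_SHIFT;
          }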
      Reported-by: Sakari Ailus <sakari.ailus@iki.fi>
      Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Cc: Junaid Shahid <junaids@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS · 62cf9bd8
      Liran Alon committed
      L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only
      when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls.
      
      Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which
      is L1 IA32_BNDCFGS.
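      Sketch of the intended logic in prepare_vmcs02() (vmcs01_guest_bndcfgs
      is the value KVM saves from vmcs01, per the patch):

          if (kvm_mpx_supported()) {
              if (vmx->nested.nested_run_pending &&
                  (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS))
                  vmcs_write64(GUEST_BNDCFGS, vmcs12->guest_bndcfgs);
              else
                  vmcs_write64(GUEST_BNDCFGS, vmx->nested.vmcs01_guest_bndcfgs);
          }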
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly · 503234b3
      Liran Alon committed
      Commit a87036ad ("KVM: x86: disable MPX if host did not enable
      MPX XSAVE features") introduced kvm_mpx_supported() to return true
      iff MPX is enabled in the host.
      
      However, that commit seems to have missed replacing some calls to
      kvm_x86_ops->mpx_supported() with kvm_mpx_supported().

      Complete the original commit by replacing the remaining calls with
      kvm_mpx_supported().
      
      Fixes: a87036ad ("KVM: x86: disable MPX if host did not enable
      MPX XSAVE features")
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled · 5f76f6f5
      Liran Alon committed
      Before this commit, KVM exposed MPX VMX controls to an L1 guest based
      only on whether KVM and the host processor support MPX virtualization.
      However, these controls should be exposed to the guest only if the guest
      vCPU supports MPX.

      Without this change, an L1 guest running a kernel that lacks
      commit 691bd434 ("kvm: vmx: allow host to access guest
      MSR_IA32_BNDCFGS") hits the following assertion in QEMU:
      	qemu-kvm: error: failed to set MSR 0xd90 to 0x0
      	qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs:
      	Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed'
      This is because L1 KVM's kvm_init_msr_list() sees that
      vmx_mpx_supported() returns true (as it only checks MPX VMX controls
      support), so the KVM_GET_MSR_INDEX_LIST ioctl includes MSR_IA32_BNDCFGS.
      However, when L1 later attempts to set this MSR via the KVM_SET_MSRS
      ioctl, it fails because !guest_cpuid_has_mpx(vcpu).

      Therefore, fix the issue by exposing MPX VMX controls to the L1 guest
      only when the vCPU supports MPX.
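      Sketch of the update helper this introduces (invoked from the CPUID
      update path; names approximate the patch):

          static void nested_vmx_entry_exit_ctls_update(struct kvm_vcpu *vcpu)
          {
              struct vcpu_vmx *vmx = to_vmx(vcpu);

              if (!kvm_mpx_supported())
                  return;
              if (guest_cpuid_has(vcpu, X86_FEATURE_MPX)) {
                  vmx->nested.msrs.entry_ctls_high |= VM_ENTRY_LOAD_BNDCFGS;
                  vmx->nested.msrs.exit_ctls_high  |= VM_EXIT_CLEAR_BNDCFGS;
              } else {
                  vmx->nested.msrs.entry_ctls_high &= ~VM_ENTRY_LOAD_BNDCFGS;
                  vmx->nested.msrs.exit_ctls_high  &= ~VM_EXIT_CLEAR_BNDCFGS;
              }
          }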
      
      Fixes: 36be0b9d ("KVM: x86: Add nested virtualization support for MPX")
      Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 25 Sep 2018, 1 commit
    • KVM: x86: never trap MSR_KERNEL_GS_BASE · 4679b61f
      Paolo Bonzini committed
      KVM has an old optimization whereby accesses to the kernel GS base MSR
      are trapped when the guest is in 32-bit and not when it is in 64-bit mode.
      The idea is that swapgs is not available in 32-bit mode, thus the
      guest has no reason to access the MSR unless in 64-bit mode and
      32-bit applications need not pay the price of switching the kernel GS
      base between the host and the guest values.
      
      However, this optimization adds complexity to the code for little
      benefit (these days most guests are going to be 64-bit anyway) and in fact
      broke after commit 678e315e ("KVM: vmx: add dedicated utility to
      access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base
      can be corrupted across SMIs and UEFI Secure Boot is therefore broken
      (a secure boot Linux guest, for example, fails to reach the login prompt
      about half the time).  This patch just removes the optimization; the
      kernel GS base MSR is now never trapped by KVM, similarly to the FS and
      GS base MSRs.
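      Conceptually, the MSR bitmap setup on 64-bit hosts now treats the
      kernel GS base exactly like the other base MSRs (sketch):

          #ifdef CONFIG_X86_64
              vmx_disable_intercept_for_msr(msr_bitmap, MSR_FS_BASE, MSR_TYPE_RW);
              vmx_disable_intercept_for_msr(msr_bitmap, MSR_GS_BASE, MSR_TYPE_RW);
              vmx_disable_intercept_for_msr(msr_bitmap, MSR_KERNEL_GS_BASE,
                                            MSR_TYPE_RW);
          #endif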
      
      Fixes: 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base")
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 21 Sep 2018, 1 commit
  7. 20 Sep 2018, 15 commits
    • KVM: x86: Control guest reads of MSR_PLATFORM_INFO · 6fbbde9a
      Drew Schmitt committed
      Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
      to reads of MSR_PLATFORM_INFO.
      
      Disabling read access to this MSR gives userspace clear control over
      whether to "expose" this platform-dependent information to guests. As it
      exists today, guests that read this MSR get unpopulated information if
      userspace hasn't already set it (and prior to this patch series, only the
      CPUID faulting information could have been populated). This existing
      interface can be confusing if guests don't handle the potential for
      incorrect/incomplete information gracefully (e.g. zero reported for base
      frequency).
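      A sketch of how userspace would use the new capability (the meaning of
      args[0] is an assumption here: 0 to disallow guest reads):

          struct kvm_enable_cap cap = {
              .cap  = KVM_CAP_MSR_PLATFORM_INFO,
              .args = { 0 },      /* forbid guest reads of MSR_PLATFORM_INFO */
          };

          if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
              perror("KVM_ENABLE_CAP(KVM_CAP_MSR_PLATFORM_INFO)");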
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Turbo bits in MSR_PLATFORM_INFO · d84f1cff
      Drew Schmitt committed
      Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only
      the CPUID faulting bit was settable; now any bit in MSR_PLATFORM_INFO is.
      This can be used, for example, to convey frequency information about the
      platform on which the guest is running.
      Signed-off-by: Drew Schmitt <dasch@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • nVMX x86: Check VPID value on vmentry of L2 guests · ba8e23db
      Krish Sadhukhan committed
      According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
      following check needs to be enforced on vmentry of L2 guests:
      
          If the 'enable VPID' VM-execution control is 1, the value of the
          VPID VM-execution control field must not be 0000H.
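      Sketch of the vmentry check this adds, as it would sit among KVM's
      other vmcs12 control checks:

          if (nested_cpu_has_vpid(vmcs12) && vmcs12->virtual_processor_id == 0)
              return VMXERR_ENTRY_INVALID_CONTROL_FIELD;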
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • nVMX x86: check posted-interrupt descriptor address on vmentry of L2 · 6de84e58
      Krish Sadhukhan committed
      According to section "Checks on VMX Controls" in Intel SDM vol 3C,
      the following check needs to be enforced on vmentry of L2 guests:
      
         - Bits 5:0 of the posted-interrupt descriptor address are all 0.
         - The posted-interrupt descriptor address does not set any bits
           beyond the processor's physical-address width.
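      Sketch of the added conditions in the existing posted-interrupt vmentry
      check (masking the low 6 bits covers bits 5:0; the shift tests bits
      above the physical-address width):

          if (nested_cpu_has_posted_intr(vmcs12) &&
              ((vmcs12->posted_intr_desc_addr & 0x3f) ||
               (vmcs12->posted_intr_desc_addr >> cpuid_maxphyaddr(vcpu))))
              return VMXERR_ENTRY_INVALID_CONTROL_FIELD;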
      Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Reviewed-by: Karl Heubaum <karl.heubaum@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: nVMX: Wake blocked vCPU in guest-mode if pending interrupt in virtual APICv · e6c67d8c
      Liran Alon committed
      If L1 does not intercept L2 HLT, or enters L2 in the HLT activity-state,
      it is possible for a vCPU to be blocked while it is in guest-mode.

      According to Intel SDM 26.6.5 Interrupt-Window Exiting and
      Virtual-Interrupt Delivery: "These events wake the logical processor
      if it just entered the HLT state because of a VM entry".
      Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending
      deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU
      should be woken from the HLT state and injected with the interrupt.

      In addition, if the vCPU receives a nested posted-interrupt while it is
      blocked (while it is in guest-mode), it should also be woken and
      injected with the posted interrupt.
      
      To handle these cases, this patch enhances kvm_vcpu_has_events() to also
      check if there is a pending interrupt in L2 virtual APICv provided by
      L1. That is, it evaluates if there is a pending virtual interrupt for L2
      by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1
      Evaluation of Pending Interrupts.
      
      Note that this also handles the case of nested posted-interrupt by the
      fact RVI is updated in vmx_complete_nested_posted_interrupt() which is
      called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() ->
      kvm_vcpu_running() -> vmx_check_nested_events() ->
      vmx_complete_nested_posted_interrupt().
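      Sketch of the hook as wired into kvm_vcpu_has_events() (names follow
      the patch; the VMX implementation compares RVI[7:4] with VPPR[7:4]):

          static inline bool kvm_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
          {
              return (is_guest_mode(vcpu) &&
                      kvm_x86_ops->guest_apic_has_interrupt &&
                      kvm_x86_ops->guest_apic_has_interrupt(vcpu));
          }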
      Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: check nested state and CR4.VMXE against SMM · 5bea5123
      Paolo Bonzini committed
      VMX cannot be enabled under SMM; check this when CR4 is set and when
      nested virtualization state is restored.
      
      This should fix some WARNs reported by syzkaller, mostly around
      alloc_shadow_vmcs.
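      Sketch of the CR4 side of the check in vmx_set_cr4() (the nested-state
      ioctl gains an equivalent guard):

          if (cr4 & X86_CR4_VMXE) {
              /* SMM forbids VMX; refuse to set CR4.VMXE from SMM */
              if (!nested_vmx_allowed(vcpu) || is_smm(vcpu))
                  return 1;
          }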
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: x86: make kvm_{load|put}_guest_fpu() static · 822f312d
      Sebastian Andrzej Siewior committed
      The functions
      	kvm_load_guest_fpu()
      	kvm_put_guest_fpu()
      
      are only used locally, so make them static. This also requires moving
      both functions, because they are used before their implementation.
      Those functions were exported (via EXPORT_SYMBOL) before commit
      e5bb4025 ("KVM: Drop kvm_{load,put}_guest_fpu() exports").
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: use preemption timer to force immediate VMExit · d264ee0c
      Sean Christopherson committed
      A VMX preemption timer value of '0' is guaranteed to cause a VMExit
      prior to the CPU executing any instructions in the guest.  Use the
      preemption timer (if it's supported) to trigger immediate VMExit
      in place of the current method of sending a self-IPI.  This ensures
      that pending VMExit injection to L1 occurs prior to executing any
      instructions in the guest (regardless of nesting level).
      
      When deferring VMExit injection, KVM generates an immediate VMExit
      from the (possibly nested) guest by sending itself an IPI.  Because
      hardware interrupts are blocked prior to VMEnter and are unblocked
      (in hardware) after VMEnter, this results in taking a VMExit(INTR)
      before any guest instruction is executed.  But, as this approach
      relies on the IPI being received before VMEnter executes, it only
      works as intended when KVM is running as L0.  Because there are no
      architectural guarantees regarding when IPIs are delivered, when
      running nested the INTR may "arrive" long after L2 is running e.g.
      L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.
      
      For the most part, this unintended delay is not an issue since the
      events being injected to L1 also do not have architectural guarantees
      regarding their timing.  The notable exception is the VMX preemption
      timer[1], which is architecturally guaranteed to cause a VMExit prior
      to executing any instructions in the guest if the timer value is '0'
      at VMEnter.  Specifically, the delay in injecting the VMExit causes
      the preemption timer KVM unit test to fail when run in a nested guest.
      
      Note: this approach is viable even on CPUs with a broken preemption
      timer, as broken in this context only means the timer counts at the
      wrong rate.  There are no known errata affecting timer value of '0'.
      
      [1] I/O SMIs also have guarantees on when they arrive, but I have
          no idea if/how those are emulated in KVM.
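      Sketch of the mechanism (hook and flag names follow the patch):

          static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
          {
              to_vmx(vcpu)->req_immediate_exit = true;
          }

          /* ...and when arming the timer just before VMEnter: */
          if (vmx->req_immediate_exit)
              /* a value of 0 architecturally guarantees an exit before
               * the guest executes any instruction */
              vmx_arm_hv_timer(vmx, 0);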
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      [Use a hook for SVM instead of leaving the default in x86.c - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: modify preemption timer bit only when arming timer · f459a707
      Sean Christopherson committed
      Provide a singular location where the VMX preemption timer bit is
      set/cleared so that future usages of the preemption timer can ensure
      the VMCS bit is up-to-date without having to modify unrelated code
      paths.  For example, the preemption timer can be used to force an
      immediate VMExit.  Cache the status of the timer to avoid redundant
      VMREAD and VMWRITE, e.g. if the timer stays armed across multiple
      VMEnters/VMExits.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: VMX: immediately mark preemption timer expired only for zero value · 4c008127
      Sean Christopherson committed
      A VMX preemption timer value of '0' at the time of VMEnter is
      architecturally guaranteed to cause a VMExit prior to the CPU
      executing any instructions in the guest.  This architectural
      definition is in place to ensure that a previously expired timer
      is correctly recognized by the CPU as it is possible for the timer
      to reach zero and not trigger a VMexit due to a higher priority
      VMExit being signalled instead, e.g. a pending #DB that morphs into
      a VMExit.
      
      Whether by design or coincidence, commit f4124500 ("KVM: nVMX:
      Fully emulate preemption timer") special cased timer values of '0'
      and '1' to ensure prompt delivery of the VMExit.  Unlike '0', a
      timer value of '1' has no architectural guarantees regarding
      when it is delivered.
      
      Modify the timer emulation to trigger immediate VMExit if and only
      if the timer value is '0', and document precisely why '0' is special.
      Do this even if calibration of the virtual TSC failed, i.e. VMExit
      will occur immediately regardless of the frequency of the timer.
      Making only '0' a special case gives KVM leeway to be more aggressive
      in ensuring the VMExit is injected prior to executing instructions in
      the nested guest, and also eliminates any ambiguity as to why '1' is
      a special case, e.g. why wasn't the threshold for a "short timeout"
      set to 10, 100, 1000, etc...
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: SVM: Switch to bitmap_zalloc() · a101c9d6
      Andy Shevchenko committed
      Switch to bitmap_zalloc() to show clearly what we are allocating.
      Besides that, it returns a pointer of bitmap type instead of an
      opaque void *.
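      For the SEV ASID bitmap, the conversion is roughly a one-liner:

          /* before: */
          sev_asid_bitmap = kcalloc(BITS_TO_LONGS(max_sev_asid),
                                    sizeof(long), GFP_KERNEL);
          /* after: */
          sev_asid_bitmap = bitmap_zalloc(max_sev_asid, GFP_KERNEL);
          /* ...and it is freed with bitmap_free() instead of kfree() */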
      Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM/MMU: Fix comment in walk_shadow_page_lockless_end() · 9a984586
      Tianyu Lan committed
      kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page().
      This patch fixes the comment accordingly.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: don't reset root in kvm_mmu_setup() · 83b20b28
      Wei Yang committed
      Here is the code path which shows that kvm_mmu_setup() is invoked after
      kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path,
      root_hpa and prev_roots are guaranteed to be invalid, and it is not
      necessary to reset them again.
      
          kvm_vm_ioctl_create_vcpu()
              kvm_arch_vcpu_create()
                  vmx_create_vcpu()
                      kvm_vcpu_init()
                          kvm_arch_vcpu_init()
                              kvm_mmu_create()
              kvm_arch_vcpu_setup()
                  kvm_mmu_setup()
                      kvm_init_mmu()
      
      This patch sets reset_roots to false in kvm_mmu_setup().
      
      Fixes: 50c28f21
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • kvm: mmu: Don't read PDPTEs when paging is not enabled · d35b34a9
      Junaid Shahid committed
      kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and
      CR4.PAE = 1.
      Signed-off-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • x86/kvm/lapic: always disable MMIO interface in x2APIC mode · d1766202
      Vitaly Kuznetsov committed
      When VMX is used with flexpriority disabled (because of no support or
      if disabled with module parameter) MMIO interface to lAPIC is still
      available in x2APIC mode while it shouldn't be (kvm-unit-tests):
      
      PASS: apic_disable: Local apic enabled in x2APIC mode
      PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set
      FAIL: apic_disable: *0xfee00030: 50014
      
      The issue appears because we basically do nothing while switching to
      x2APIC mode when the APIC access page is not used. apic_mmio_{read,write}
      only check if the lAPIC is disabled before proceeding to the actual access.
      
      When APIC access is virtualized we correctly manipulate with VMX controls
      in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes
      in x2APIC mode so there's no issue.
      
      Disabling the MMIO interface seems to be easy. The question is: what do
      we do with these reads and writes? If we add an apic_x2apic_mode() check
      to apic_mmio_in_range() and return -EOPNOTSUPP, these reads and writes
      will go to userspace. When the lAPIC is in kernel, QEMU uses this
      interface to inject MSIs only (see kvm_apic_mem_write() in
      hw/i386/kvm/apic.c). This somehow works with a disabled lAPIC, but when
      we're in xAPIC mode we will get a real injected MSI from every write to
      the lAPIC. Not good.
      
      The simplest solution seems to be to just ignore writes to the region
      and return ~0 for all reads when we're in x2APIC mode. This is what this
      patch does. However, this approach is inconsistent with what currently
      happens when flexpriority is enabled: we allocate APIC access page and
      create KVM memory region so in x2APIC modes all reads and writes go to
      this pre-allocated page which is, btw, the same for all vCPUs.
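      Sketch of the read side (the write side simply ignores the access;
      simplified from the patch, which also adds a quirk to opt out of this
      behavior):

          static int apic_mmio_read(struct kvm_vcpu *vcpu,
                                    struct kvm_io_device *this,
                                    gpa_t address, int len, void *data)
          {
              struct kvm_lapic *apic = to_lapic(this);

              if (!kvm_apic_hw_enabled(apic) || apic_x2apic_mode(apic)) {
                  memset(data, 0xff, len);    /* reads return ~0 */
                  return 0;
              }
              /* ... normal xAPIC MMIO handling ... */
          }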
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 08 Sep 2018, 1 commit
    • KVM: LAPIC: Fix pv ipis out-of-bounds access · bdf7ffc8
      Wanpeng Li committed
      Dan Carpenter reported that the untrusted data returned from
      kvm_register_read() results in the following static checker warning:
        arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
        error: buffer underflow 'map->phys_map' 's32min-s32max'
      
      A KVM guest can easily trigger this by executing the following assembly
      sequence in Ring0:
      
      mov $10, %rax
      mov $0xFFFFFFFF, %rbx
      mov $0xFFFFFFFF, %rdx
      mov $0, %rsi
      vmcall
      
      As this will cause KVM to execute the following code-path:
      vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
      which will reach out-of-bounds access.
      
      This patch fixes it by adding a check against map->max_apic_id in
      kvm_pv_send_ipi(), ignoring destinations that are not present and
      delivering the rest. We also check whether or not map->phys_map[min + i]
      is NULL, since max_apic_id is set to the highest APIC ID and some
      phys_map entries may be NULL when APIC IDs are sparse; in particular,
      kvm unconditionally sets max_apic_id to 255 to reserve enough space for
      any xAPIC ID.
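      Sketch of the fixed loop (one of the two bitmap halves; RCU locking and
      irq setup elided):

          if (min > map->max_apic_id)
              goto out;
          for_each_set_bit(i, &ipi_bitmap_low,
                           min((u32)BITS_PER_LONG, (map->max_apic_id - min + 1))) {
              if (map->phys_map[min + i]) {
                  vcpu = map->phys_map[min + i]->vcpu;
                  count += kvm_apic_set_irq(vcpu, &irq, NULL);
              }
          }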
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: Liran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
      Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>