1. 17 Oct, 2018 11 commits
    •
      KVM: x86: hyperv: optimize 'all cpus' case in kvm_hv_flush_tlb() · a812297c
      Vitaly Kuznetsov committed
      We can use 'NULL' to represent the 'all cpus' case in
      kvm_make_vcpus_request_mask() and avoid building a vCPU mask with
      all vCPUs.
      Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
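As a minimal sketch of the convention described above (not the kernel's actual code), a request helper can treat a NULL mask as "every vCPU" so that callers never build an all-ones mask. The names `NR_VCPUS` and `count_targets()` are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_VCPUS 8 /* hypothetical VM size, for illustration only */

/*
 * Hypothetical stand-in for a "make request on these vCPUs" helper.
 * Returns how many vCPUs would receive the request.
 * A NULL mask selects every vCPU, so no full mask has to be built.
 */
static unsigned int count_targets(const uint64_t *vcpu_mask)
{
    unsigned int i, hits = 0;

    for (i = 0; i < NR_VCPUS; i++) {
        if (vcpu_mask == NULL || (*vcpu_mask & (1ULL << i)))
            hits++;
    }
    return hits;
}
```

Calling `count_targets(NULL)` visits all eight vCPUs, while passing a pointer to a mask of `0x5` visits only vCPUs 0 and 2.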
    •
      KVM: x86: hyperv: enforce vp_index < KVM_MAX_VCPUS · 9170200e
      Vitaly Kuznetsov committed
      Hyper-V TLFS (5.0b) states:
      
      > Virtual processors are identified by using an index (VP index). The
      > maximum number of virtual processors per partition supported by the
      > current implementation of the hypervisor can be obtained through CPUID
      > leaf 0x40000005. A virtual processor index must be less than the
      > maximum number of virtual processors per partition.
      
      Forbid userspace from setting VP_INDEX to KVM_MAX_VCPUS or above.
      get_vcpu_by_vpidx() can now be optimized to bail early when the supplied
      vpidx is >= KVM_MAX_VCPUS.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
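The two halves of this change can be sketched as follows, assuming a simplified vCPU array. `MAX_VCPUS`, `struct vcpu`, `set_vp_index()` and `find_by_vp_index()` are hypothetical names for illustration and are not the kernel's definitions.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_VCPUS 4 /* hypothetical stand-in for KVM_MAX_VCPUS */

struct vcpu {
    uint32_t vp_index;
    int in_use;
};

/* Setter: reject any VP index at or above the limit. */
static int set_vp_index(struct vcpu *v, uint32_t new_idx)
{
    if (new_idx >= MAX_VCPUS)
        return -1;
    v->vp_index = new_idx;
    return 0;
}

/* Lookup: bail early, since no vCPU can hold an out-of-range index. */
static struct vcpu *find_by_vp_index(struct vcpu *vcpus, size_t n,
                                     uint32_t vpidx)
{
    size_t i;

    if (vpidx >= MAX_VCPUS)
        return NULL;
    for (i = 0; i < n; i++) {
        if (vcpus[i].in_use && vcpus[i].vp_index == vpidx)
            return &vcpus[i];
    }
    return NULL;
}
```

Because the setter guarantees the invariant, the lookup can skip the scan entirely for out-of-range values.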
    •
      kvm/x86: return meaningful value from KVM_SIGNAL_MSI · 0624fca9
      Paolo Bonzini committed
      If kvm_apic_map_get_dest_lapic() finds a disabled LAPIC,
      it will return with bitmap==0 and (*r == -1) will be returned to
      userspace.
      
      QEMU may then record "KVM: injection failed, MSI lost
      (Operation not permitted)" in its log, which is quite puzzling.
      Reported-by: Peng Hao <penghao122@sina.com.cn>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM: x86: move definition PT_MAX_HUGEPAGE_LEVEL and KVM_NR_PAGE_SIZES together · 4fef0f49
      Wei Yang committed
      Currently, there are two definitions related to huge pages, but they sit
      far apart and look only loosely connected:
      
       * KVM_NR_PAGE_SIZES defines the number of different page sizes that can be mapped
       * PT_MAX_HUGEPAGE_LEVEL is the maximum huge page level
      
      The number of different page sizes equals the maximum huge page level,
      which the current definitions imply but do not state.
      
      The current implementation is unkind to readers and future developers:
      
       * KVM_NR_PAGE_SIZES looks like a standalone definition at first sight
       * if more levels need to be supported, two places must change
      
      This patch moves the two definitions together so that they are easier
      to read and to modify.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM/VMX: Remve unused function is_external_interrupt(). · aaa45da2
      Tianyu Lan committed
      is_external_interrupt() is no longer used, so remove it.
      Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM: x86: return 0 in case kvm_mmu_memory_cache has min number of objects · daefb794
      Wei Yang committed
      The code tries to pre-allocate *min* number of objects, so it is ok to
      return 0 when the kvm_mmu_memory_cache meets the requirement.
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
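A user-space sketch of the behavior described above, assuming a fixed-capacity pointer cache and using `malloc()` in place of the kernel allocator. `struct obj_cache`, `CACHE_CAPACITY` and `topup_cache()` are hypothetical names for this illustration.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_CAPACITY 8 /* hypothetical cache size */

struct obj_cache {
    int nobjs;
    void *objects[CACHE_CAPACITY];
};

/*
 * Try to fill the cache completely, but report success as long as at
 * least `min` objects are available when an allocation fails.
 */
static int topup_cache(struct obj_cache *c, int min, size_t obj_size)
{
    if (c->nobjs >= min)
        return 0; /* requirement already met */

    while (c->nobjs < CACHE_CAPACITY) {
        void *obj = malloc(obj_size);

        if (!obj)
            return c->nobjs >= min ? 0 : -1;
        c->objects[c->nobjs++] = obj;
    }
    return 0;
}
```

The point of the sketch is the failure path: a failed allocation is only an error if the cache still holds fewer than `min` objects.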
    •
      KVM: x86: adjust kvm_mmu_page member to save 8 bytes · 3ff519f2
      Wei Yang committed
      On a 64-bit machine, the struct is naturally aligned to 8 bytes. Since the
      kvm_mmu_page members *unsync* and *role* are each no larger than 4 bytes, we can
      rearrange the sequence to compact the struct.
      
      As the comment shows, *role* and *gfn* are used to key the shadow page. In
      order to keep the comment valid, this patch moves *unsync* up and
      exchanges the positions of *role* and *gfn*.
      
      /proc/slabinfo shows that kvm_mmu_page is 8 bytes smaller, with
      one more object per slab, after applying this patch.
      
          # name            <active_objs> <num_objs> <objsize> <objperslab>
          kvm_mmu_page_header      0           0       168         24
      
          kvm_mmu_page_header      0           0       160         25
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
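The padding effect can be reproduced with a reduced model of the struct. The two layouts below are simplified assumptions (only a handful of members, with `role` modeled as a plain 32-bit word), not the real `kvm_mmu_page`, and the 8-byte saving assumes an LP64 ABI such as x86-64 Linux.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified "before" ordering: two padding holes of 4 and 3 bytes,
 * plus tail padding. */
struct page_before {
    uint64_t gfn;
    uint32_t role;
    uint64_t *spt;
    uint64_t *gfns;
    bool unsync;
    int root_count;
    unsigned int unsync_children;
};

/* Simplified "after" ordering: unsync and role share one 8-byte slot. */
struct page_after {
    bool unsync;
    uint32_t role;
    uint64_t gfn;
    uint64_t *spt;
    uint64_t *gfns;
    int root_count;
    unsigned int unsync_children;
};

/* Bytes saved by the reordering on the current ABI. */
static size_t bytes_saved(void)
{
    return sizeof(struct page_before) - sizeof(struct page_after);
}
```

On an LP64 target, `role` still sits directly before `gfn` in the reordered layout, which is what keeps the "role and gfn key the shadow page" comment valid.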
    •
      KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail · bd18bffc
      Sean Christopherson committed
      A VMEnter that VMFails (as opposed to VMExits) does not touch host
      state beyond registers that are explicitly noted in the VMFail path,
      e.g. EFLAGS.  Host state does not need to be loaded because VMFail
      is only signaled for consistency checks that occur before the CPU
      starts to load guest state, i.e. there is no need to restore any
      state as nothing has been modified.  But in the case where a VMFail
      is detected by hardware and not by KVM (due to deferring consistency
      checks to hardware), KVM has already loaded some amount of guest
      state.  Luckily, "loaded" only means loaded to KVM's software model,
      i.e. vmcs01 has not been modified.  So, unwind our software model to
      the pre-VMEntry host state.
      
      Not restoring host state in this VMFail path leads to a variety of
      failures because we end up with stale data in vcpu->arch, e.g. CR0,
      CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
      significant delta in the stale data is all but guaranteed to crash
      L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
      
      An alternative to this "soft" reload would be to load host state from
      vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
      wildly inconsistent with respect to the VMX architecture, e.g. an L1
      VMM with separate VMExit and VMFail paths would explode.
      
      Note that this approach does not mean KVM is 100% accurate with
      respect to VMX hardware behavior, even at an architectural level
      (the exact order of consistency checks is microarchitecture specific).
      But 100% emulation accuracy isn't the goal (with this patch), rather
      the goal is to be consistent in the information delivered to L1, e.g.
      a VMExit should not fall-through VMENTER, and a VMFail should not jump
      to HOST_RIP.
      
      This technically reverts commit "5af41573 (KVM: nVMX: Fix mmu
      context after VMLAUNCH/VMRESUME failure)", but retains the core
      aspects of that patch, just in an open coded form due to the need to
      pull state from vmcs01 instead of vmcs12.  Restoring host state
      resolves a variety of issues introduced by commit "4f350c6d
      (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
      which remedied the incorrect behavior of treating VMFail like VMExit
      but in doing so neglected to restore arch state that had been modified
      prior to attempting nested VMEnter.
      
      A sample failure that occurs due to stale vcpu.arch state is a fault
      of some form while emulating an LGDT (due to emulated UMIP) from L1
      after a failed VMEntry to L3, in this case when running the KVM unit
      test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
      case due to a stale arch.cr4.UMIP.
      
      L1:
        BUG: unable to handle kernel paging request at ffffc90000663b9e
        PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
        Oops: 0009 [#1] SMP
        CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:native_load_gdt+0x0/0x10
      
        ...
      
        Call Trace:
         load_fixmap_gdt+0x22/0x30
         __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
         vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
         nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
         vmx_handle_exit+0x246/0x15a0 [kvm_intel]
         kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x600
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4f/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      L0:
        WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
        ...
        CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
        Hardware name: Intel Corporation Kabylake Client platform/KBL S
        RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
      
        ...
      
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
         kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
         do_vfs_ioctl+0x9f/0x5e0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x49/0xf0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 5af41573 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
      Fixes: 4f350c6d (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM: nVMX: Clear reserved bits of #DB exit qualification · cfb634fe
      Jim Mattson committed
      According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
      qualification field for debug exceptions are reserved (cleared to
      0). However, the SDM is incorrect about bit 16 (corresponding to
      DR6.RTM). This bit should be set if a debug exception (#DB) or a
      breakpoint exception (#BP) occurred inside an RTM region while
      advanced debugging of RTM transactional regions was enabled. Note that
      this is the opposite of DR6.RTM, which "indicates (when clear) that a
      debug exception (#DB) or breakpoint exception (#BP) occurred inside an
      RTM region while advanced debugging of RTM transactional regions was
      enabled."
      
      There is still an issue with stale DR6 bits potentially being
      misreported for the current debug exception.  DR6 should not have been
      modified before vectoring the #DB exception, and the "new DR6 bits"
      should be available somewhere, but it was and they aren't.
      
      Fixes: b96fb439 ("KVM: nVMX: fixes to nested virt interrupt injection")
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
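The bit rules stated in the message can be sketched as a pure function. This is an illustration written from the description above, not the patch itself. The macro names are local to the sketch, and the choice of retained bits (3:0, 13, 14 and 16) follows directly from the reserved ranges the message lists.

```c
#include <stdint.h>

/* Bits of DR6 that survive into the #DB exit qualification. */
#define QUAL_B0_B3 0xfULL        /* breakpoint condition bits 3:0 */
#define QUAL_BD    (1ULL << 13)  /* debug register access detected */
#define QUAL_BS    (1ULL << 14)  /* single step */
#define QUAL_RTM   (1ULL << 16)  /* RTM bit, inverted relative to DR6 */

/* Derive the exit qualification from a DR6 value. */
static uint64_t db_exit_qualification(uint64_t dr6)
{
    /* Drop reserved bits 63:17, 15 and 12:4. */
    uint64_t qual = dr6 & (QUAL_B0_B3 | QUAL_BD | QUAL_BS | QUAL_RTM);

    /* DR6.RTM is clear when the event hit inside an RTM region;
     * the exit qualification reports the same event with the bit set. */
    qual ^= QUAL_RTM;
    return qual;
}
```

For example, a DR6 value of `0xffff0ff1` (B0 set, RTM bit set, meaning "not in an RTM region") yields `0x1`.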
    •
      KVM: LAPIC: Tune lapic_timer_advance_ns automatically · 3b8a5df6
      Wanpeng Li committed
      In cloud environments, lapic_timer_advance_ns needs to be tuned for every CPU
      generation and every host kernel version (kvm-unit-tests/tscdeadline_latency.flat
      reports 5700 cycles for the upstream kernel and 9600 cycles for our 3.10 product
      kernel, both with preemption_timer=N on a Skylake server).
      
      This patch adds the capability to tune lapic_timer_advance_ns automatically,
      step by step. The initial value is 1000ns, as recommended by commit d0659d94
      ("KVM: x86: add option to advance tscdeadline hrtimer expiration"); it is
      reduced when the timer fires too early and increased when it fires too late.
      guest_tsc and tsc_deadline rarely match exactly, so tuning is considered done
      when the delta is within a small range, e.g. 100 cycles. This patch reduces latency
      (kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer enabled)
      from ~2600 cycles to ~1200 cycles on our Skylake server.
      
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
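The feedback loop described above can be sketched roughly as follows. This is an assumption-laden illustration, not the patch's code: `struct advance_state`, `tune_advance()` and the damping divisor `ADJUST_STEP` are hypothetical, and only the 1000ns starting point and the 100-cycle "done" window come from the message.

```c
#include <stdbool.h>
#include <stdint.h>

#define ADJUST_DONE_CYCLES 100 /* "small range" named in the message */
#define ADJUST_STEP        8   /* hypothetical damping divisor */

struct advance_state {
    uint32_t advance_ns; /* current advance, starts at 1000 */
    bool done;           /* set once the delta is small enough */
};

/* One tuning step, run after the timer fired and guest_tsc was sampled. */
static void tune_advance(struct advance_state *s, uint64_t guest_tsc,
                         uint64_t tsc_deadline, uint64_t tsc_khz)
{
    uint64_t delta = guest_tsc > tsc_deadline ? guest_tsc - tsc_deadline
                                              : tsc_deadline - guest_tsc;
    uint64_t ns, step;

    if (s->done)
        return;
    if (delta < ADJUST_DONE_CYCLES) {
        s->done = true;
        return;
    }

    ns = delta * 1000000ULL / tsc_khz; /* cycles to nanoseconds */
    step = s->advance_ns / ADJUST_STEP;
    if (ns < step)
        step = ns;

    if (guest_tsc < tsc_deadline)
        s->advance_ns -= (uint32_t)step; /* fired too early: advance less */
    else
        s->advance_ns += (uint32_t)step; /* fired too late: advance more */
}
```

Bounding each step by a fraction of the current value keeps a single noisy sample from swinging the advance too far.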
  2. 13 Oct, 2018 5 commits
  3. 04 Oct, 2018 3 commits
    •
      kvm: nVMX: fix entry with pending interrupt if APICv is enabled · 7e712684
      Paolo Bonzini committed
      Commit b5861e5c introduced a check on
      the interrupt-window and NMI-window CPU execution controls in order to
      inject an external interrupt vmexit before the first guest instruction
      executes.  However, when APIC virtualization is enabled the host does not
      need a vmexit in order to inject an interrupt at the next interrupt window;
      instead, it just places the interrupt vector in RVI and the processor will
      inject it as soon as possible.  Therefore, on machines with APICv it is
      not enough to check the CPU execution controls: the same scenario can also
      happen if RVI>vPPR.
      
      Fixes: b5861e5c
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
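The extra condition can be sketched as a one-line predicate. The message only states "RVI>vPPR"; comparing the priority classes (the upper four bits of each value) is an assumption made here, and `apicv_interrupt_pending()` is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * With APICv, a virtual interrupt is deliverable without a vmexit when
 * the requesting vector (RVI) outranks the virtual processor priority.
 * Assumption: only the priority class, bits 7:4, takes part.
 */
static bool apicv_interrupt_pending(uint8_t rvi, uint8_t vppr)
{
    return (rvi & 0xf0) > (vppr & 0xf0);
}
```

A nested-entry check would then combine this predicate with the existing test of the interrupt-window and NMI-window execution controls.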
    •
      KVM: VMX: hide flexpriority from guest when disabled at the module level · 2cf7ea9f
      Paolo Bonzini committed
      As of commit 8d860bbe ("kvm: vmx: Basic APIC virtualization controls
      have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when
      a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0,
      whereas previously KVM would allow a nested guest to enable
      VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware.  That is,
      KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't
      (always) allow setting it when kvm-intel.flexpriority=0, and may even
      initially allow the control and then clear it when the nested guest
      writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause
      functional issues.
      
      Hide the control completely when the module parameter is cleared.
      Reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM: VMX: check for existence of secondary exec controls before accessing · fd6b6d9b
      Sean Christopherson committed
      Return early from vmx_set_virtual_apic_mode() if the processor doesn't
      support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of
      which reside in SECONDARY_VM_EXEC_CONTROL.  This eliminates warnings
      due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing
      on processors without secondary exec controls.
      
      Remove the similar check for TPR shadowing as it is incorporated in the
      flexpriority_enabled check and the APIC-related code in
      vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE.
      Reported-by: Gerhard Wiesinger <redhat@wiesinger.com>
      Fixes: 8d860bbe ("kvm: vmx: Basic APIC virtualization controls have three settings")
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 01 Oct, 2018 4 commits
    •
      KVM: x86: fix L1TF's MMIO GFN calculation · daa07cbc
      Sean Christopherson committed
      One defense against L1TF in KVM is to always set the upper five bits
      of the *legal* physical address in the SPTEs for non-present and
      reserved SPTEs, e.g. MMIO SPTEs.  In the MMIO case, the GFN of the
      MMIO SPTE may overlap with the upper five bits that are being usurped
      to defend against L1TF.  To preserve the GFN, the bits of the GFN that
      overlap with the repurposed bits are shifted left into the reserved
      bits, i.e. the GFN in the SPTE will be split into high and low parts.
      When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
      access, get_mmio_spte_gfn() unshifts the affected bits and restores
      the original GFN for comparison.  Unfortunately, get_mmio_spte_gfn()
      neglects to mask off the reserved bits in the SPTE that were used to
      store the upper chunk of the GFN.  As a result, KVM fails to detect
      MMIO accesses whose GPA overlaps the repurposed bits, which in turn
      causes guest panics and hangs.
      
      Fix the bug by generating a mask that covers the lower chunk of the
      GFN, i.e. the bits that aren't shifted by the L1TF mitigation.  The
      alternative approach would be to explicitly zero the five reserved
      bits that are used to store the upper chunk of the GFN, but that
      requires additional run-time computation and makes an already-ugly
      bit of code even more inscrutable.
      
      I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
      warn if GENMASK_ULL() generated a nonsensical value, but that seemed
      silly since that would mean a system that supports VMX has less than
      18 bits of physical address space...
      Reported-by: Sakari Ailus <sakari.ailus@iki.fi>
      Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
      Cc: Junaid Shahid <junaids@google.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Junaid Shahid <junaids@google.com>
      Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
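The split-and-restore scheme can be modeled in a few lines. This sketch assumes a hypothetical machine with 46 physical address bits and ignores every other SPTE field (MMIO marker, generation, access bits). The names `rsvd_mask`, `lower_gfn_mask`, `make_mmio_spte()` and `get_mmio_gfn()` are illustrative, not the kernel's.

```c
#include <stdint.h>

#define PAGE_SHIFT    12
#define PHYS_BITS     46 /* hypothetical physical address width */
#define RSVD_LEN      5  /* upper legal bits taken over for L1TF */
#define LOW_PHYS_BITS (PHYS_BITS - RSVD_LEN)

/* All bits from lo to hi, inclusive. */
#define BITS(hi, lo) ((~0ULL >> (63 - (hi))) & (~0ULL << (lo)))

/* Bits 45:41, always set in an MMIO SPTE as the L1TF defense. */
static const uint64_t rsvd_mask = BITS(PHYS_BITS - 1, LOW_PHYS_BITS);
/* Bits 40:12, the part of the GFN that is never shifted. */
static const uint64_t lower_gfn_mask = BITS(LOW_PHYS_BITS - 1, PAGE_SHIFT);

/* Store a GFN: set bits 45:41 and park the GFN bits that overlapped
 * them RSVD_LEN positions higher, in reserved bits. */
static uint64_t make_mmio_spte(uint64_t gfn)
{
    uint64_t gpa = gfn << PAGE_SHIFT;
    uint64_t spte = gpa | rsvd_mask;

    spte |= (gpa & rsvd_mask) << RSVD_LEN;
    return spte;
}

/* Recover the GFN: take the low chunk through a mask, so the parked
 * upper chunk cannot leak in, then unshift the upper chunk. */
static uint64_t get_mmio_gfn(uint64_t spte)
{
    uint64_t gpa = spte & lower_gfn_mask;

    gpa |= (spte >> RSVD_LEN) & rsvd_mask;
    return gpa >> PAGE_SHIFT;
}
```

Masking the low chunk with `lower_gfn_mask` is the step the fix adds; without it, the parked copy of the upper chunk would remain in the recovered address.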
    •
      KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS · 62cf9bd8
      Liran Alon committed
      L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only
      when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls.
      
      Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which
      is L1 IA32_BNDCFGS.
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
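The selection rule amounts to a single branch, sketched here as a pure function. `l2_bndcfgs()` and its parameter names are hypothetical; the control bit value is the architectural "load IA32_BNDCFGS" VM-entry control (bit 16).

```c
#include <stdint.h>

#define VM_ENTRY_LOAD_BNDCFGS 0x00010000u /* VM-entry control, bit 16 */

/*
 * Pick the IA32_BNDCFGS value L2 should run with: the vmcs12 value only
 * if L1 asked for it to be loaded, otherwise L1's own value (vmcs01).
 */
static uint64_t l2_bndcfgs(uint32_t vm_entry_controls,
                           uint64_t vmcs12_guest_bndcfgs,
                           uint64_t vmcs01_guest_bndcfgs)
{
    if (vm_entry_controls & VM_ENTRY_LOAD_BNDCFGS)
        return vmcs12_guest_bndcfgs;
    return vmcs01_guest_bndcfgs;
}
```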
    •
      KVM: x86: Do not use kvm_x86_ops->mpx_supported() directly · 503234b3
      Liran Alon committed
      Commit a87036ad ("KVM: x86: disable MPX if host did not enable
      MPX XSAVE features") introduced kvm_mpx_supported() to return true
      iff MPX is enabled in the host.
      
      However, that commit seems to have missed replacing some calls to
      kvm_x86_ops->mpx_supported() with kvm_mpx_supported().
      
      Complete the original commit by converting the remaining calls to
      kvm_mpx_supported().
      
      Fixes: a87036ad ("KVM: x86: disable MPX if host did not enable
      MPX XSAVE features")
      Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    •
      KVM: nVMX: Do not expose MPX VMX controls when guest MPX disabled · 5f76f6f5
      Liran Alon committed
      Before this commit, KVM exposed MPX VMX controls to an L1 guest based only
      on whether KVM and the host processor support MPX virtualization.
      However, these controls should be exposed to the guest only if the guest
      vCPU supports MPX.
      
      Without this change, an L1 guest running a kernel that lacks
      commit 691bd434 ("kvm: vmx: allow host to access guest
      MSR_IA32_BNDCFGS") asserts in QEMU on the following:
      	qemu-kvm: error: failed to set MSR 0xd90 to 0x0
      	qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs:
      	Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed'
      This is because kvm_init_msr_list() in L1 KVM will see that
      vmx_mpx_supported() returns true (as it only checks for MPX VMX controls
      support), and therefore the KVM_GET_MSR_INDEX_LIST IOCTL will include
      MSR_IA32_BNDCFGS. However, when L1 later attempts to set this MSR via the
      KVM_SET_MSRS IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu).
      
      Therefore, fix the issue by exposing MPX VMX controls to the L1 guest only
      when the vCPU supports MPX.
      
      Fixes: 36be0b9d ("KVM: x86: Add nested virtualization support for MPX")
      Reported-by: Eyal Moscovici <eyal.moscovici@oracle.com>
      Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
      Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
      Signed-off-by: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  5. 25 Sep, 2018 1 commit
    •
      KVM: x86: never trap MSR_KERNEL_GS_BASE · 4679b61f
      Paolo Bonzini committed
      KVM has an old optimization whereby accesses to the kernel GS base MSR
      are trapped when the guest is in 32-bit mode but not when it is in 64-bit mode.
      The idea is that swapgs is not available in 32-bit mode, so the
      guest has no reason to access the MSR unless it is in 64-bit mode, and
      32-bit applications need not pay the price of switching the kernel GS
      base between the host and the guest values.
      
      However, this optimization adds complexity to the code for little
      benefit (these days most guests are going to be 64-bit anyway) and in fact
      broke after commit 678e315e ("KVM: vmx: add dedicated utility to
      access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base
      can be corrupted across SMIs and UEFI Secure Boot is therefore broken
      (a secure boot Linux guest, for example, fails to reach the login prompt
      about half the time).  This patch just removes the optimization; the
      kernel GS base MSR is now never trapped by KVM, similarly to the FS and
      GS base MSRs.
      
      Fixes: 678e315e
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 21 Sep, 2018 1 commit
  7. 20 Sep, 2018 15 commits