1. 24 1月, 2020 3 次提交
    • S
      KVM: x86: Allocate vcpu struct in common x86 code · a9dd6f09
      Sean Christopherson 提交于
      Move allocation of VMX and SVM vcpus to common x86.  Although the struct
      being allocated is technically a VMX/SVM struct, it can be interpreted
      directly as a 'struct kvm_vcpu' because of the pre-existing requirement
      that 'struct kvm_vcpu' be located at offset zero of the arch/vendor vcpu
      struct.
      
      Remove the message from the build-time assertions regarding placement of
      the struct, as compatibility with the arch usercopy region is no longer
      the sole dependent on 'struct kvm_vcpu' being at offset zero.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      a9dd6f09
    • S
      KVM: x86: Free wbinvd_dirty_mask if vCPU creation fails · 16be9dde
      Sean Christopherson 提交于
      Free the vCPU's wbinvd_dirty_mask if vCPU creation fails after
      kvm_arch_vcpu_init(), e.g. when installing the vCPU's file descriptor.
      Do the freeing by calling kvm_arch_vcpu_free() instead of open coding
      the freeing.  This adds a likely superfluous, but ultimately harmless,
      call to kvmclock_reset(), which only clears vcpu->arch.pv_time_enabled.
      Using kvm_arch_vcpu_free() allows for additional cleanup in the future.
      
      Fixes: f5f48ee1 ("KVM: VMX: Execute WBINVD to keep data consistency with assigned devices")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      16be9dde
    • P
      KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL · 6441fa61
      Paolo Bonzini 提交于
      If the guest is configured to have SPEC_CTRL but the host does not
      (which is a nonsensical configuration but these are not explicitly
      forbidden) then a host-initiated MSR write can write vmx->spec_ctrl
      (respectively svm->spec_ctrl) and trigger a #GP when KVM tries to
      restore the host value of the MSR.  Add a more comprehensive check
      for valid bits of SPEC_CTRL, covering host CPUID flags and,
      since we are at it and it is more correct that way, guest CPUID
      flags too.
      
      For AMD, remove the unnecessary is_guest_mode check around setting
      the MSR interception bitmap, so that the code looks the same as
      for Intel.
      
      Cc: Jim Mattson <jmattson@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      6441fa61
  2. 23 1月, 2020 1 次提交
  3. 21 1月, 2020 7 次提交
    • S
      KVM: x86: Refactor and rename bit() to feature_bit() macro · 87382003
      Sean Christopherson 提交于
      Rename bit() to __feature_bit() to give it a more descriptive name, and
      add a macro, feature_bit(), to stuff the X68_FEATURE_ prefix to keep
      line lengths manageable for code that hardcodes the bit to be retrieved.
      
      No functional change intended.
      
      Cc: Jim Mattson <jmattson@google.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      87382003
    • S
      KVM: x86: Add dedicated emulator helpers for querying CPUID features · 5ae78e95
      Sean Christopherson 提交于
      Add feature-specific helpers for querying guest CPUID support from the
      emulator instead of having the emulator do a full CPUID and perform its
      own bit tests.  The primary motivation is to eliminate the emulator's
      usage of bit() so that future patches can add more extensive build-time
      assertions on the usage of bit() without having to expose yet more code
      to the emulator.
      
      Note, providing a generic guest_cpuid_has() to the emulator doesn't work
      due to the existing built-time assertions in guest_cpuid_has(), which
      require the feature being checked to be a compile-time constant.
      
      No functional change intended.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      5ae78e95
    • S
      KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync · 345599f9
      Sean Christopherson 提交于
      Add a helper macro to generate the set of reserved cr4 bits for both
      host and guest to ensure that adding a check on guest capabilities is
      also added for host capabilities, and vice versa.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      345599f9
    • S
      KVM: x86: Ensure all logical CPUs have consistent reserved cr4 bits · f1cdecf5
      Sean Christopherson 提交于
      Check the current CPU's reserved cr4 bits against the mask calculated
      for the boot CPU to ensure consistent behavior across all CPUs.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f1cdecf5
    • S
      KVM: x86: Don't let userspace set host-reserved cr4 bits · b11306b5
      Sean Christopherson 提交于
      Calculate the host-reserved cr4 bits at runtime based on the system's
      capabilities (using logic similar to __do_cpuid_func()), and use the
      dynamically generated mask for the reserved bit check in kvm_set_cr4()
      instead using of the static CR4_RESERVED_BITS define.  This prevents
      userspace from "enabling" features in cr4 that are not supported by the
      system, e.g. by ignoring KVM_GET_SUPPORTED_CPUID and specifying a bogus
      CPUID for the vCPU.
      
      Allowing userspace to set unsupported bits in cr4 can lead to a variety
      of undesirable behavior, e.g. failed VM-Enter, and in general increases
      KVM's attack surface.  A crafty userspace can even abuse CR4.LA57 to
      induce an unchecked #GP on a WRMSR.
      
      On a platform without LA57 support:
      
        KVM_SET_CPUID2 // CPUID_7_0_ECX.LA57 = 1
        KVM_SET_SREGS  // CR4.LA57 = 1
        KVM_SET_MSRS   // KERNEL_GS_BASE = 0x0004000000000000
        KVM_RUN
      
      leads to a #GP when writing KERNEL_GS_BASE into hardware:
      
        unchecked MSR access error: WRMSR to 0xc0000102 (tried to write 0x0004000000000000)
        at rIP: 0xffffffffa00f239a (vmx_prepare_switch_to_guest+0x10a/0x1d0 [kvm_intel])
        Call Trace:
         kvm_arch_vcpu_ioctl_run+0x671/0x1c70 [kvm]
         kvm_vcpu_ioctl+0x36b/0x5d0 [kvm]
         do_vfs_ioctl+0xa1/0x620
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x4c/0x170
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7fc08133bf47
      
      Note, the above sequence fails VM-Enter due to invalid guest state.
      Userspace can allow VM-Enter to succeed (after the WRMSR #GP) by adding
      a KVM_SET_SREGS w/ CR4.LA57=0 after KVM_SET_MSRS, in which case KVM will
      technically leak the host's KERNEL_GS_BASE into the guest.  But, as
      KERNEL_GS_BASE is a userspace-defined value/address, the leak is largely
      benign as a malicious userspace would simply be exposing its own data to
      the guest, and attacking a benevolent userspace would require multiple
      bugs in the userspace VMM.
      
      Cc: stable@vger.kernel.org
      Cc: Jun Nakajima <jun.nakajima@intel.com>
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      b11306b5
    • M
      KVM: x86: check kvm_pit outside kvm_vm_ioctl_reinject() · cad23e72
      Miaohe Lin 提交于
      check kvm_pit outside kvm_vm_ioctl_reinject() to keep codestyle consistent
      with other kvm_pit func and prepare for futher cleanups.
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cad23e72
    • W
      KVM: VMX: FIXED+PHYSICAL mode single target IPI fastpath · 1e9e2622
      Wanpeng Li 提交于
      ICR and TSCDEADLINE MSRs write cause the main MSRs write vmexits in our
      product observation, multicast IPIs are not as common as unicast IPI like
      RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR etc.
      
      This patch introduce a mechanism to handle certain performance-critical
      WRMSRs in a very early stage of KVM VMExit handler.
      
      This mechanism is specifically used for accelerating writes to x2APIC ICR
      that attempt to send a virtual IPI with physical destination-mode, fixed
      delivery-mode and single target. Which was found as one of the main causes
      of VMExits for Linux workloads.
      
      The reason this mechanism significantly reduce the latency of such virtual
      IPIs is by sending the physical IPI to the target vCPU in a very early stage
      of KVM VMExit handler, before host interrupts are enabled and before expensive
      operations such as reacquiring KVM’s SRCU lock.
      Latency is reduced even more when KVM is able to use APICv posted-interrupt
      mechanism (which allows to deliver the virtual IPI directly to target vCPU
      without the need to kick it to host).
      
      Testing on Xeon Skylake server:
      
      The virtual IPI latency from sender send to receiver receive reduces
      more than 200+ cpu cycles.
      Reviewed-by: NLiran Alon <liran.alon@oracle.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krčmář <rkrcmar@redhat.com>
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Liran Alon <liran.alon@oracle.com>
      Signed-off-by: NWanpeng Li <wanpengli@tencent.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      1e9e2622
  4. 09 1月, 2020 6 次提交
    • S
      KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM · 736c291c
      Sean Christopherson 提交于
      Convert a plethora of parameters and variables in the MMU and page fault
      flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.
      
      Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
      addresses.  When TDP is enabled, the fault address is a guest physical
      address and thus can be a 64-bit value, even when both KVM and its guest
      are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
      64-bit field, not a natural width field.
      
      Using a gva_t for the fault address means KVM will incorrectly drop the
      upper 32-bits of the GPA.  Ditto for gva_to_gpa() when it is used to
      translate L2 GPAs to L1 GPAs.
      
      Opportunistically rename variables and parameters to better reflect the
      dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
      "addr" instead of "vaddr" when the address may be either a GVA or an L2
      GPA.  Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
      a confusing "gpa_t gva" declaration; this also sets the stage for a
      future patch to combing nonpaging_page_fault() and tdp_page_fault() with
      minimal churn.
      
      Sprinkle in a few comments to document flows where an address is known
      to be a GVA and thus can be safely truncated to a 32-bit value.  Add
      WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
      document such cases and detect bugs.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      736c291c
    • S
      KVM: x86: Add a WARN on TIF_NEED_FPU_LOAD in kvm_load_guest_fpu() · 95145c25
      Sean Christopherson 提交于
      WARN once in kvm_load_guest_fpu() if TIF_NEED_FPU_LOAD is observed, as
      that would mean that KVM is corrupting userspace's FPU by saving
      unknown register state into arch.user_fpu.  Add a comment to explain
      why KVM WARNs on TIF_NEED_FPU_LOAD instead of implementing logic
      similar to fpu__copy().
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      95145c25
    • S
      KVM: x86: Fix potential put_fpu() w/o load_fpu() on MPX platform · f958bd23
      Sean Christopherson 提交于
      Unlike most state managed by XSAVE, MPX is initialized to zero on INIT.
      Because INITs are usually recognized in the context of a VCPU_RUN call,
      kvm_vcpu_reset() puts the guest's FPU so that the FPU state is resident
      in memory, zeros the MPX state, and reloads FPU state to hardware.  But,
      in the unlikely event that an INIT is recognized during
      kvm_arch_vcpu_ioctl_get_mpstate() via kvm_apic_accept_events(),
      kvm_vcpu_reset() will call kvm_put_guest_fpu() without a preceding
      kvm_load_guest_fpu() and corrupt the guest's FPU state (and possibly
      userspace's FPU state as well).
      
      Given that MPX is being removed from the kernel[*], fix the bug with the
      simple-but-ugly approach of loading the guest's FPU during
      KVM_GET_MP_STATE.
      
      [*] See commit f240652b ("x86/mpx: Remove MPX APIs").
      
      Fixes: f775b13e ("x86,kvm: move qemu/guest FPU switching out to vcpu_run")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      f958bd23
    • M
      KVM: x86: Fix some comment typos · 0a03cbda
      Miaohe Lin 提交于
      Fix some typos in comment.
      Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      0a03cbda
    • P
      KVM: X86: Convert the last users of "shorthand = 0" to use macros · 150a84fe
      Peter Xu 提交于
      Change the last users of "shorthand = 0" to use APIC_DEST_NOSHORT.
      Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      150a84fe
    • P
      KVM: X86: Use APIC_DEST_* macros properly in kvm_lapic_irq.dest_mode · c96001c5
      Peter Xu 提交于
      We were using either APIC_DEST_PHYSICAL|APIC_DEST_LOGICAL or 0|1 to
      fill in kvm_lapic_irq.dest_mode.  It's fine only because in most cases
      when we check against dest_mode it's against APIC_DEST_PHYSICAL (which
      equals to 0).  However, that's not consistent.  We'll have problem
      when we want to start checking against APIC_DEST_LOGICAL, which does
      not equals to 1.
      
      This patch firstly introduces kvm_lapic_irq_dest_mode() helper to take
      any boolean of destination mode and return the APIC_DEST_* macro.
      Then, it replaces the 0|1 settings of irq.dest_mode with the helper.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c96001c5
  5. 23 11月, 2019 3 次提交
    • S
      KVM: x86: Grab KVM's srcu lock when setting nested state · ad5996d9
      Sean Christopherson 提交于
      Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
      where nVMX derefences ->memslots without holding ->srcu or ->slots_lock.
      
      The other half of nested migration, ->get_nested_state(), does not need
      to acquire ->srcu as it is a purely a dump of internal KVM (and CPU)
      state to userspace.
      
      Detected as an RCU lockdep splat that is 100% reproducible by running
      KVM's state_test selftest with CONFIG_PROVE_LOCKING=y.  Note that the
      failing function, kvm_is_visible_gfn(), is only checking the validity of
      a gfn, it's not actually accessing guest memory (which is more or less
      unsupported during vmx_set_nested_state() due to incorrect MMU state),
      i.e. vmx_set_nested_state() itself isn't fundamentally broken.  In any
      case, setting nested state isn't a fast path so there's no reason to go
      out of our way to avoid taking ->srcu.
      
        =============================
        WARNING: suspicious RCU usage
        5.4.0-rc7+ #94 Not tainted
        -----------------------------
        include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
      
                     other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by evmcs_test/10939:
         #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
      
        stack backtrace:
        CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x68/0x9b
         kvm_is_visible_gfn+0x179/0x180 [kvm]
         mmu_check_root+0x11/0x30 [kvm]
         fast_cr3_switch+0x40/0x120 [kvm]
         kvm_mmu_new_cr3+0x34/0x60 [kvm]
         nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
         nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
         vmx_set_nested_state+0x256/0x340 [kvm_intel]
         kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
         kvm_vcpu_ioctl+0xde/0x630 [kvm]
         do_vfs_ioctl+0xa2/0x6c0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x54/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f59a2b95f47
      
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      ad5996d9
    • S
      KVM: x86: Open code shared_msr_update() in its only caller · 05c19c2f
      Sean Christopherson 提交于
      Fold shared_msr_update() into its sole user to eliminate its pointless
      bounds check, its godawful printk, its misleading comment (it's called
      under a global lock), and its woefully inaccurate name.
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      05c19c2f
    • S
      KVM: x86: Remove a spurious export of a static function · 24885d1d
      Sean Christopherson 提交于
      A recent change inadvertently exported a static function, which results
      in modpost throwing a warning.  Fix it.
      
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      24885d1d
  6. 21 11月, 2019 5 次提交
    • M
      KVM: x86: remove set but not used variable 'called' · db5a95ec
      Mao Wenan 提交于
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask:
      arch/x86/kvm/x86.c:7911:7: warning: variable called set but not
      used [-Wunused-but-set-variable]
      
      It is not used since commit 7ee30bc1 ("KVM: x86: deliver KVM
      IOAPIC scan request to target vCPUs")
      Signed-off-by: NMao Wenan <maowenan@huawei.com>
      Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      db5a95ec
    • P
      KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality · c11f83e0
      Paolo Bonzini 提交于
      The current guest mitigation of TAA is both too heavy and not really
      sufficient.  It is too heavy because it will cause some affected CPUs
      (those that have MDS_NO but lack TAA_NO) to fall back to VERW and
      get the corresponding slowdown.  It is not really sufficient because
      it will cause the MDS_NO bit to disappear upon microcode update, so
      that VMs started before the microcode update will not be runnable
      anymore afterwards, even with tsx=on.
      
      Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
      the guest and let it run without the VERW mitigation.  Even though
      MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
      it on every vmentry, we can use the shared MSR functionality because
      the host kernel need not protect itself from TSX-based side-channels.
      Tested-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      c11f83e0
    • P
      KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID · edef5c36
      Paolo Bonzini 提交于
      Because KVM always emulates CPUID, the CPUID clear bit
      (bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
      by the hypervisor when performing said emulation.
      
      Right now neither kvm-intel.ko nor kvm-amd.ko implement
      MSR_IA32_TSX_CTRL but this will change in the next patch.
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Tested-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      edef5c36
    • P
      KVM: x86: do not modify masked bits of shared MSRs · de1fca5d
      Paolo Bonzini 提交于
      "Shared MSRs" are guest MSRs that are written to the host MSRs but
      keep their value until the next return to userspace.  They support
      a mask, so that some bits keep the host value, but this mask is
      only used to skip an unnecessary MSR write and the value written
      to the MSR is always the guest MSR.
      
      Fix this and, while at it, do not update smsr->values[slot].curr if
      for whatever reason the wrmsr fails.  This should only happen due to
      reserved bits, so the value written to smsr->values[slot].curr
      will not match when the user-return notifier and the host value will
      always be restored.  However, it is untidy and in rare cases this
      can actually avoid spurious WRMSRs on return to userspace.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NJim Mattson <jmattson@google.com>
      Tested-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      de1fca5d
    • P
      KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES · cbbaa272
      Paolo Bonzini 提交于
      KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
      to the guests.  It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
      !RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
      hidden (it actually was), yet the value says that TSX is not vulnerable
      to microarchitectural data sampling.  Fix both.
      
      Cc: stable@vger.kernel.org
      Tested-by: NJim Mattson <jmattson@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      cbbaa272
  7. 20 11月, 2019 1 次提交
  8. 15 11月, 2019 5 次提交
  9. 13 11月, 2019 1 次提交
  10. 12 11月, 2019 1 次提交
    • C
      KVM: X86: Fix initialization of MSR lists · 7a5ee6ed
      Chenyi Qiang 提交于
      The three MSR lists(msrs_to_save[], emulated_msrs[] and
      msr_based_features[]) are global arrays of kvm.ko, which are
      adjusted (copy supported MSRs forward to override the unsupported MSRs)
      when insmod kvm-{intel,amd}.ko, but it doesn't reset these three arrays
      to their initial value when rmmod kvm-{intel,amd}.ko. Thus, at the next
      installation, kvm-{intel,amd}.ko will do operations on the modified
      arrays with some MSRs lost and some MSRs duplicated.
      
      So define three constant arrays to hold the initial MSR lists and
      initialize msrs_to_save[], emulated_msrs[] and msr_based_features[]
      based on the constant arrays.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: NXiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: NChenyi Qiang <chenyi.qiang@intel.com>
      [Remove now useless conditionals. - Paolo]
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      7a5ee6ed
  11. 11 11月, 2019 1 次提交
  12. 05 11月, 2019 1 次提交
  13. 04 11月, 2019 1 次提交
    • P
      kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Paolo Bonzini 提交于
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems shadow
      and direct EPT is treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: NJunaid Shahid <junaids@google.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      b8e8c830
  14. 02 11月, 2019 1 次提交
    • M
      KVM: x86: switch KVMCLOCK base to monotonic raw clock · 53fafdbb
      Marcelo Tosatti 提交于
      Commit 0bc48bea ("KVM: x86: update master clock before computing
      kvmclock_offset")
      switches the order of operations to avoid the conversion
      
      TSC (without frequency correction) ->
      system_timestamp (with frequency correction),
      
      which might cause a time jump.
      
      However, it leaves any other masterclock update unsafe, which includes,
      at the moment:
      
              * HV_X64_MSR_REFERENCE_TSC MSR write.
              * TSC writes.
              * Host suspend/resume.
      
      Avoid the time jump issue by using frequency uncorrected
      CLOCK_MONOTONIC_RAW clock.
      
      Its the guests time keeping software responsability
      to track and correct a reference clock such as UTC.
      
      This fixes forward time jump (which can result in
      failure to bring up a vCPU) during vCPU hotplug:
      
      Oct 11 14:48:33 storage kernel: CPU2 has been hot-added
      Oct 11 14:48:34 storage kernel: CPU3 has been hot-added
      Oct 11 14:49:22 storage kernel: smpboot: Booting Node 0 Processor 2 APIC 0x2          <-- time jump of almost 1 minute
      Oct 11 14:49:22 storage kernel: smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
      Oct 11 14:49:23 storage kernel: smpboot: Booting Node 0 Processor 3 APIC 0x3
      Oct 11 14:49:23 storage kernel: kvm-clock: cpu 3, msr 0:7ff640c1, secondary cpu clock
      
      Which happens because:
      
                      /*
                       * Wait 10s total for a response from AP
                       */
                      boot_error = -1;
                      timeout = jiffies + 10*HZ;
                      while (time_before(jiffies, timeout)) {
                               ...
                      }
      Analyzed-by: NIgor Mammedov <imammedo@redhat.com>
      Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      53fafdbb
  15. 28 10月, 2019 1 次提交
  16. 23 10月, 2019 1 次提交
    • J
      KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Jim Mattson 提交于
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: NDan Cross <dcross@google.com>
      Signed-off-by: NJim Mattson <jmattson@google.com>
      Reviewed-by: NPeter Shier <pshier@google.com>
      Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      671ddc70
  17. 22 10月, 2019 1 次提交