1. 09 Jan, 2020 1 commit
    • KVM: X86: Use APIC_DEST_* macros properly in kvm_lapic_irq.dest_mode · c96001c5
      Committed by Peter Xu
      We were using either APIC_DEST_PHYSICAL|APIC_DEST_LOGICAL or 0|1 to
      fill in kvm_lapic_irq.dest_mode.  That works only because in most cases
      we check dest_mode against APIC_DEST_PHYSICAL (which equals 0).
      However, it is inconsistent, and it will break as soon as we start
      checking against APIC_DEST_LOGICAL, which does not equal 1.
      
      This patch first introduces the kvm_lapic_irq_dest_mode() helper, which
      takes a boolean destination mode and returns the matching APIC_DEST_*
      macro.  It then replaces the 0|1 settings of irq.dest_mode with the helper.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
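
The helper described in this commit can be sketched as follows. This is a minimal userspace illustration, not the patch itself: the return type and parameter name are assumptions, and the two APIC_DEST_* values are reproduced here on the assumption that they match arch/x86/include/asm/apicdef.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's apicdef.h: note that LOGICAL is not 1. */
#define APIC_DEST_PHYSICAL 0x00000
#define APIC_DEST_LOGICAL  0x00800

/*
 * Map a boolean "is the destination mode logical?" flag onto the matching
 * APIC_DEST_* macro, so callers never store a raw 0 or 1 in dest_mode.
 * The signature is an assumption made for this sketch.
 */
static inline uint16_t kvm_lapic_irq_dest_mode(bool dest_mode_logical)
{
	return dest_mode_logical ? APIC_DEST_LOGICAL : APIC_DEST_PHYSICAL;
}
```

With this in place, a comparison such as `irq.dest_mode == APIC_DEST_LOGICAL` behaves as intended, which it would not if dest_mode had been filled with a plain 1.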
  2. 23 Nov, 2019 3 commits
    • KVM: x86: Grab KVM's srcu lock when setting nested state · ad5996d9
      Committed by Sean Christopherson
      Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug
      where nVMX dereferences ->memslots without holding ->srcu or ->slots_lock.
      
      The other half of nested migration, ->get_nested_state(), does not need
      to acquire ->srcu as it is purely a dump of internal KVM (and CPU)
      state to userspace.
      
      Detected as an RCU lockdep splat that is 100% reproducible by running
      KVM's state_test selftest with CONFIG_PROVE_LOCKING=y.  Note that the
      failing function, kvm_is_visible_gfn(), is only checking the validity of
      a gfn, it's not actually accessing guest memory (which is more or less
      unsupported during vmx_set_nested_state() due to incorrect MMU state),
      i.e. vmx_set_nested_state() itself isn't fundamentally broken.  In any
      case, setting nested state isn't a fast path so there's no reason to go
      out of our way to avoid taking ->srcu.
      
        =============================
        WARNING: suspicious RCU usage
        5.4.0-rc7+ #94 Not tainted
        -----------------------------
        include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage!
      
                     other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        1 lock held by evmcs_test/10939:
         #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm]
      
        stack backtrace:
        CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         dump_stack+0x68/0x9b
         kvm_is_visible_gfn+0x179/0x180 [kvm]
         mmu_check_root+0x11/0x30 [kvm]
         fast_cr3_switch+0x40/0x120 [kvm]
         kvm_mmu_new_cr3+0x34/0x60 [kvm]
         nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel]
         nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel]
         vmx_set_nested_state+0x256/0x340 [kvm_intel]
         kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm]
         kvm_vcpu_ioctl+0xde/0x630 [kvm]
         do_vfs_ioctl+0xa2/0x6c0
         ksys_ioctl+0x66/0x70
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x54/0x200
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
        RIP: 0033:0x7f59a2b95f47
      
      Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE")
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Open code shared_msr_update() in its only caller · 05c19c2f
      Committed by Sean Christopherson
      Fold shared_msr_update() into its sole user to eliminate its pointless
      bounds check, its godawful printk, its misleading comment (it's called
      under a global lock), and its woefully inaccurate name.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: Remove a spurious export of a static function · 24885d1d
      Committed by Sean Christopherson
      A recent change inadvertently exported a static function, which results
      in modpost throwing a warning.  Fix it.
      
      Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 21 Nov, 2019 5 commits
    • KVM: x86: remove set but not used variable 'called' · db5a95ec
      Committed by Mao Wenan
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      arch/x86/kvm/x86.c: In function kvm_make_scan_ioapic_request_mask:
      arch/x86/kvm/x86.c:7911:7: warning: variable called set but not
      used [-Wunused-but-set-variable]
      
      It is not used since commit 7ee30bc1 ("KVM: x86: deliver KVM
      IOAPIC scan request to target vCPUs")
      Signed-off-by: Mao Wenan <maowenan@huawei.com>
      Fixes: 7ee30bc1 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: implement MSR_IA32_TSX_CTRL disable RTM functionality · c11f83e0
      Committed by Paolo Bonzini
      The current guest mitigation of TAA is both too heavy and not really
      sufficient.  It is too heavy because it will cause some affected CPUs
      (those that have MDS_NO but lack TAA_NO) to fall back to VERW and
      get the corresponding slowdown.  It is not really sufficient because
      it will cause the MDS_NO bit to disappear upon microcode update, so
      that VMs started before the microcode update will not be runnable
      anymore afterwards, even with tsx=on.
      
      Instead, if tsx=on on the host, we can emulate MSR_IA32_TSX_CTRL for
      the guest and let it run without the VERW mitigation.  Even though
      MSR_IA32_TSX_CTRL is quite heavyweight, and we do not want to write
      it on every vmentry, we can use the shared MSR functionality because
      the host kernel need not protect itself from TSX-based side-channels.
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: implement MSR_IA32_TSX_CTRL effect on CPUID · edef5c36
      Committed by Paolo Bonzini
      Because KVM always emulates CPUID, the CPUID clear bit
      (bit 1) of MSR_IA32_TSX_CTRL must be emulated "manually"
      by the hypervisor when performing said emulation.
      
      Right now neither kvm-intel.ko nor kvm-amd.ko implements
      MSR_IA32_TSX_CTRL, but this will change in the next patch.
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
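
The "manual" emulation of the CPUID clear bit can be sketched as below. This is a hedged userspace illustration rather than the kernel code: the function name is hypothetical, and it assumes that the bits being hidden are the HLE and RTM feature flags in CPUID.(EAX=7,ECX=0):EBX (bits 4 and 11).

```c
#include <stdint.h>

/* Bit 1 of MSR_IA32_TSX_CTRL is the CPUID clear bit named in the commit. */
#define TSX_CTRL_RTM_DISABLE (1ULL << 0)
#define TSX_CTRL_CPUID_CLEAR (1ULL << 1)

/* TSX feature flags in CPUID.(EAX=7,ECX=0):EBX. */
#define CPUID_7_0_EBX_HLE (1u << 4)
#define CPUID_7_0_EBX_RTM (1u << 11)

/*
 * Hypothetical helper: given the EBX value the hypervisor would otherwise
 * report for CPUID leaf 7 and the guest's TSX_CTRL value, hide the TSX
 * feature bits when the guest has set the CPUID clear bit.
 */
static uint32_t emulate_cpuid_7_0_ebx(uint32_t ebx, uint64_t guest_tsx_ctrl)
{
	if (guest_tsx_ctrl & TSX_CTRL_CPUID_CLEAR)
		ebx &= ~(CPUID_7_0_EBX_HLE | CPUID_7_0_EBX_RTM);
	return ebx;
}
```

Setting only TSX_CTRL_RTM_DISABLE leaves the reported CPUID bits untouched in this sketch; only the CPUID clear bit changes what the guest sees.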
    • KVM: x86: do not modify masked bits of shared MSRs · de1fca5d
      Committed by Paolo Bonzini
      "Shared MSRs" are guest MSRs that are written to the host MSRs but
      keep their value until the next return to userspace.  They support
      a mask, so that some bits keep the host value, but this mask is
      only used to skip an unnecessary MSR write and the value written
      to the MSR is always the guest MSR.
      
      Fix this and, while at it, do not update smsr->values[slot].curr if
      for whatever reason the wrmsr fails.  This should only happen due to
      reserved bits, in which case the value written to
      smsr->values[slot].curr will not match when the user-return notifier
      runs, and the host value will always be restored.  However, it is
      untidy, and in rare cases this can actually avoid spurious WRMSRs on
      return to userspace.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
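
The two behaviours described in this commit (masked bits keep the host value, and the cached value is left alone when the write fails) can be sketched in userspace as follows. All names here are hypothetical stand-ins for illustration; the real code lives in KVM's shared MSR handling and uses the actual WRMSR instruction rather than a callback.

```c
#include <stdint.h>

/* Hypothetical per-slot state: the host's value and the last value written. */
struct shared_msr {
	uint64_t host;
	uint64_t curr;
};

/* Stand-in for the hardware write; returns non-zero on failure. */
typedef int (*wrmsr_fn)(uint64_t value);

/* Test doubles for the sketch: one write that succeeds, one that fails. */
static int wrmsr_ok(uint64_t value)   { (void)value; return 0; }
static int wrmsr_fail(uint64_t value) { (void)value; return 1; }

static int set_shared_msr(struct shared_msr *m, uint64_t guest,
			  uint64_t mask, wrmsr_fn wrmsr)
{
	/* Take guest bits only where the mask allows; keep host bits elsewhere. */
	uint64_t value = (guest & mask) | (m->host & ~mask);
	int err;

	if (value == m->curr)
		return 0;		/* nothing to write */

	err = wrmsr(value);
	if (err)
		return err;		/* leave curr untouched on failure */

	m->curr = value;
	return 0;
}
```

Because curr is only updated after a successful write, a later comparison of curr against host on return to userspace reflects what is actually in the register.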
    • KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES · cbbaa272
      Committed by Paolo Bonzini
      KVM does not implement MSR_IA32_TSX_CTRL, so it must not be presented
      to the guests.  It is also confusing to have !ARCH_CAP_TSX_CTRL_MSR &&
      !RTM && ARCH_CAP_TAA_NO: lack of MSR_IA32_TSX_CTRL suggests TSX was not
      hidden (it actually was), yet the value says that TSX is not vulnerable
      to microarchitectural data sampling.  Fix both.
      
      Cc: stable@vger.kernel.org
      Tested-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 20 Nov, 2019 1 commit
  5. 15 Nov, 2019 5 commits
  6. 13 Nov, 2019 1 commit
  7. 12 Nov, 2019 1 commit
    • KVM: X86: Fix initialization of MSR lists · 7a5ee6ed
      Committed by Chenyi Qiang
      The three MSR lists (msrs_to_save[], emulated_msrs[] and
      msr_based_features[]) are global arrays in kvm.ko.  They are adjusted
      (supported MSRs are copied forward to overwrite the unsupported ones)
      when kvm-{intel,amd}.ko is loaded, but they are not reset to their
      initial values when kvm-{intel,amd}.ko is unloaded.  Thus, on the next
      load, kvm-{intel,amd}.ko operates on the modified arrays, with some
      MSRs lost and some MSRs duplicated.
      
      So define three constant arrays to hold the initial MSR lists and
      initialize msrs_to_save[], emulated_msrs[] and msr_based_features[]
      based on the constant arrays.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
      Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
      [Remove now useless conditionals. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
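
The fix can be pictured with the following userspace sketch: a constant master list is never modified, and the working list is rebuilt from it on every initialization, so repeated load/unload cycles cannot lose or duplicate entries. The array name msrs_to_save_all, the callback type, and the two filter functions are assumptions made for illustration; only one of the three lists is shown.

```c
#include <stddef.h>
#include <stdint.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Constant master list (illustrative subset): SYSENTER_CS/ESP/EIP, SPEC_CTRL. */
static const uint32_t msrs_to_save_all[] = { 0x174, 0x175, 0x176, 0x48 };

/* Working list, rebuilt from the constant list on every initialization. */
static uint32_t msrs_to_save[ARRAY_SIZE(msrs_to_save_all)];
static size_t num_msrs_to_save;

/* Hypothetical predicate supplied by the vendor module. */
typedef int (*msr_supported_fn)(uint32_t msr);

static void init_msr_list(msr_supported_fn supported)
{
	size_t i;

	num_msrs_to_save = 0;
	for (i = 0; i < ARRAY_SIZE(msrs_to_save_all); i++) {
		if (!supported(msrs_to_save_all[i]))
			continue;
		msrs_to_save[num_msrs_to_save++] = msrs_to_save_all[i];
	}
}

/* Example predicates: one module lacks SPEC_CTRL, the other supports all. */
static int supports_all_but_spec_ctrl(uint32_t msr) { return msr != 0x48; }
static int supports_everything(uint32_t msr) { (void)msr; return 1; }
```

Calling init_msr_list() with one predicate and then the other models unloading one vendor module and loading another: the second call sees the full constant list again instead of the already-compacted array.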
  8. 11 Nov, 2019 1 commit
  9. 05 Nov, 2019 1 commit
  10. 04 Nov, 2019 1 commit
    • kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830
      Committed by Paolo Bonzini
      With some Intel processors, putting the same virtual address in the TLB
      as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
      and cause the processor to issue a machine check resulting in a CPU lockup.
      
      Unfortunately when EPT page tables use huge pages, it is possible for a
      malicious guest to cause this situation.
      
      Add a knob to mark huge pages as non-executable. When the nx_huge_pages
      parameter is enabled (and we are using EPT), all huge pages are marked as
      NX. If the guest attempts to execute in one of those pages, the page is
      broken down into 4K pages, which are then marked executable.
      
      This is not an issue for shadow paging (except nested EPT), because then
      the host is in control of TLB flushes and the problematic situation cannot
      happen.  With nested EPT, again the nested guest can cause problems, so
      shadow and direct EPT are treated in the same way.
      
      [ tglx: Fixup default to auto and massage wording a bit ]
      Originally-by: Junaid Shahid <junaids@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 02 Nov, 2019 1 commit
    • KVM: x86: switch KVMCLOCK base to monotonic raw clock · 53fafdbb
      Committed by Marcelo Tosatti
      Commit 0bc48bea ("KVM: x86: update master clock before computing
      kvmclock_offset")
      switches the order of operations to avoid the conversion
      
      TSC (without frequency correction) ->
      system_timestamp (with frequency correction),
      
      which might cause a time jump.
      
      However, it leaves any other masterclock update unsafe, which includes,
      at the moment:
      
              * HV_X64_MSR_REFERENCE_TSC MSR write.
              * TSC writes.
              * Host suspend/resume.
      
      Avoid the time jump issue by using frequency uncorrected
      CLOCK_MONOTONIC_RAW clock.
      
      It is the responsibility of the guest's timekeeping software
      to track and correct a reference clock such as UTC.
      
      This fixes forward time jump (which can result in
      failure to bring up a vCPU) during vCPU hotplug:
      
      Oct 11 14:48:33 storage kernel: CPU2 has been hot-added
      Oct 11 14:48:34 storage kernel: CPU3 has been hot-added
      Oct 11 14:49:22 storage kernel: smpboot: Booting Node 0 Processor 2 APIC 0x2          <-- time jump of almost 1 minute
      Oct 11 14:49:22 storage kernel: smpboot: do_boot_cpu failed(-1) to wakeup CPU#2
      Oct 11 14:49:23 storage kernel: smpboot: Booting Node 0 Processor 3 APIC 0x3
      Oct 11 14:49:23 storage kernel: kvm-clock: cpu 3, msr 0:7ff640c1, secondary cpu clock
      
      Which happens because:
      
                      /*
                       * Wait 10s total for a response from AP
                       */
                      boot_error = -1;
                      timeout = jiffies + 10*HZ;
                      while (time_before(jiffies, timeout)) {
                               ...
                      }
      Analyzed-by: Igor Mammedov <imammedo@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  12. 28 Oct, 2019 1 commit
  13. 23 Oct, 2019 1 commit
    • KVM: nVMX: Don't leak L1 MMIO regions to L2 · 671ddc70
      Committed by Jim Mattson
      If the "virtualize APIC accesses" VM-execution control is set in the
      VMCS, the APIC virtualization hardware is triggered when a page walk
      in VMX non-root mode terminates at a PTE wherein the address of the 4k
      page frame matches the APIC-access address specified in the VMCS. On
      hardware, the APIC-access address may be any valid 4k-aligned physical
      address.
      
      KVM's nVMX implementation enforces the additional constraint that the
      APIC-access address specified in the vmcs12 must be backed by
      a "struct page" in L1. If not, L0 will simply clear the "virtualize
      APIC accesses" VM-execution control in the vmcs02.
      
      The problem with this approach is that the L1 guest has arranged the
      vmcs12 EPT tables--or shadow page tables, if the "enable EPT"
      VM-execution control is clear in the vmcs12--so that the L2 guest
      physical address(es)--or L2 guest linear address(es)--that reference
      the L2 APIC map to the APIC-access address specified in the
      vmcs12. Without the "virtualize APIC accesses" VM-execution control in
      the vmcs02, the APIC accesses in the L2 guest will directly access the
      APIC-access page in L1.
      
      When there is no mapping whatsoever for the APIC-access address in L1,
      the L2 VM just loses the intended APIC virtualization. However, when
      the APIC-access address is mapped to an MMIO region in L1, the L2
      guest gets direct access to the L1 MMIO device. For example, if the
      APIC-access address specified in the vmcs12 is 0xfee00000, then L2
      gets direct access to L1's APIC.
      
      Since this vmcs12 configuration is something that KVM cannot
      faithfully emulate, the appropriate response is to exit to userspace
      with KVM_INTERNAL_ERROR_EMULATION.
      
      Fixes: fe3ef05c ("KVM: nVMX: Prepare vmcs02 from vmcs01 and vmcs12")
      Reported-by: Dan Cross <dcross@google.com>
      Signed-off-by: Jim Mattson <jmattson@google.com>
      Reviewed-by: Peter Shier <pshier@google.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  14. 22 Oct, 2019 8 commits
  15. 18 Oct, 2019 1 commit
  16. 04 Oct, 2019 1 commit
  17. 03 Oct, 2019 1 commit
  18. 01 Oct, 2019 1 commit
  19. 26 Sep, 2019 1 commit
    • KVM: X86: Fix userspace set invalid CR4 · 3ca94192
      Committed by Wanpeng Li
      Reported by syzkaller:
      
      	WARNING: CPU: 0 PID: 6544 at /home/kernel/data/kvm/arch/x86/kvm//vmx/vmx.c:4689 handle_desc+0x37/0x40 [kvm_intel]
      	CPU: 0 PID: 6544 Comm: a.out Tainted: G           OE     5.3.0-rc4+ #4
      	RIP: 0010:handle_desc+0x37/0x40 [kvm_intel]
      	Call Trace:
      	 vmx_handle_exit+0xbe/0x6b0 [kvm_intel]
      	 vcpu_enter_guest+0x4dc/0x18d0 [kvm]
      	 kvm_arch_vcpu_ioctl_run+0x407/0x660 [kvm]
      	 kvm_vcpu_ioctl+0x3ad/0x690 [kvm]
      	 do_vfs_ioctl+0xa2/0x690
      	 ksys_ioctl+0x6d/0x80
      	 __x64_sys_ioctl+0x1a/0x20
      	 do_syscall_64+0x74/0x720
      	 entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      When CR4.UMIP is set, the guest should have the UMIP CPUID flag.  The
      current KVM set_sregs function performs no such check when userspace
      supplies sregs values, so SECONDARY_EXEC_DESC is enabled on writes to
      CR4.UMIP in vmx_set_cr4 even though the guest doesn't have the UMIP
      CPUID flag.  The testcase triggers the handle_desc warning when
      executing the ltr instruction, since the guest's architectural CR4
      doesn't have UMIP set.  This patch fixes it by adding a check for valid
      CR4 and CPUID combinations in __set_sregs.
      
      syzkaller source: https://syzkaller.appspot.com/x/repro.c?x=138efb99600000
      
      Reported-by: syzbot+0f1819555fbdce992df9@syzkaller.appspotmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
      Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
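
The kind of check this commit adds can be sketched as below, reduced to the single UMIP case discussed in the message. The struct and function names are hypothetical, and the real check in __set_sregs presumably covers more CR4 bits than this; the sketch only shows the idea of rejecting a CR4 value that the guest's CPUID does not back.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* CR4.UMIP is bit 11 of CR4. */
#define X86_CR4_UMIP (1UL << 11)

/* Hypothetical view of the guest's CPUID, reduced to the one flag we need. */
struct guest_cpuid {
	bool has_umip;
};

/*
 * Reject a userspace-supplied CR4 that sets UMIP when the guest's CPUID
 * does not advertise UMIP. Returns 0 if the combination is acceptable.
 */
static int check_cr4_against_cpuid(const struct guest_cpuid *cpuid,
				   unsigned long cr4)
{
	if ((cr4 & X86_CR4_UMIP) && !cpuid->has_umip)
		return -EINVAL;
	return 0;
}
```

Failing early in the set-registers path keeps the inconsistent state from ever reaching vmx_set_cr4, which is where the message says SECONDARY_EXEC_DESC gets enabled.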
  20. 25 Sep, 2019 1 commit
  21. 24 Sep, 2019 3 commits
    • KVM: x86/mmu: Revert "KVM: x86/mmu: Remove is_obsolete() call" · 10605204
      Committed by Sean Christopherson
      Now that the fast invalidate mechanism has been reintroduced, restore
      the performance tweaks for fast invalidation that existed prior to its
      removal.
      
      Paraphrasing the original changelog (commit 5ff05683 was itself a
      partial revert):
      
        Don't force reloading the remote mmu when zapping an obsolete page, as
        a MMU_RELOAD request has already been issued by kvm_mmu_zap_all_fast()
        immediately after incrementing mmu_valid_gen, i.e. after marking pages
        obsolete.
      
      This reverts commit 5ff05683.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: vmx: Emulate MSR IA32_UMWAIT_CONTROL · 6e3ba4ab
      Committed by Tao Xu
      The UMWAIT and TPAUSE instructions use the 32-bit IA32_UMWAIT_CONTROL MSR
      at index E1H to determine the maximum time, in TSC-quanta, that the
      processor can reside in either C0.1 or C0.2.
      
      This patch emulates MSR IA32_UMWAIT_CONTROL in the guest and keeps
      separate IA32_UMWAIT_CONTROL values for host and guest. The variable
      mwait_control_cached in arch/x86/kernel/cpu/umwait.c caches the MSR value,
      so this patch uses it to avoid frequent rdmsr of IA32_UMWAIT_CONTROL.
      Co-developed-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Jingqi Liu <jingqi.liu@intel.com>
      Signed-off-by: Tao Xu <tao3.xu@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
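
For readers unfamiliar with the register, the following sketch decodes an IA32_UMWAIT_CONTROL value. The bit layout is not spelled out in the commit message; it is assumed here from Intel's documentation (bit 0 disables C0.2, bit 1 is reserved, bits 31:2 hold the maximum time in TSC-quanta), and every function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MSR_IA32_UMWAIT_CONTROL 0xe1	/* "E1H" in the commit message */

/* Assumed layout of the 32-bit register. */
#define UMWAIT_C02_DISABLE (1u << 0)	/* set: C0.2 is not allowed */
#define UMWAIT_RESERVED    (1u << 1)	/* must be zero */
#define UMWAIT_TIME_MASK   (~0x3u)	/* bits 31:2, max time in TSC-quanta */

/* A guest write is plausible only if reserved bit 1 and bits 63:32 are clear. */
static bool umwait_control_valid(uint64_t data)
{
	return !(data & (UMWAIT_RESERVED | 0xffffffff00000000ULL));
}

/* C0.2 is enabled when the disable bit is clear. */
static bool umwait_c02_enabled(uint32_t value)
{
	return !(value & UMWAIT_C02_DISABLE);
}

/* Maximum residency, with the two low control bits masked off. */
static uint32_t umwait_max_time(uint32_t value)
{
	return value & UMWAIT_TIME_MASK;
}
```

Keeping a per-guest copy of this value, separate from the host's cached copy, is what lets host and guest run with different limits as the commit describes.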
    • KVM: VMX: Handle single-step #DB for EMULTYPE_SKIP on EPT misconfig · 1957aa63
      Committed by Sean Christopherson
      VMX's EPT misconfig flow to handle fast-MMIO path falls back to decoding
      the instruction to determine the instruction length when running as a
      guest (Hyper-V doesn't fill VMCS.VM_EXIT_INSTRUCTION_LEN because it's
      technically not defined for EPT misconfigs).  Rather than implement the
      slow skip in VMX's generic skip_emulated_instruction(),
      handle_ept_misconfig() directly calls kvm_emulate_instruction() with
      EMULTYPE_SKIP, which intentionally doesn't do single-step detection, and
      so handle_ept_misconfig() misses a single-step #DB.
      
      Rework the EPT misconfig fallback case to route it through
      kvm_skip_emulated_instruction() so that single-step #DBs and interrupt
      shadow updates are handled automatically.  I.e. make VMX's slow skip
      logic match SVM's and have the SVM flow not intentionally avoid the
      shadow update.
      
      Alternatively, handle_ept_misconfig() could manually handle single-
      step detection, but that results in EMULTYPE_SKIP having split logic for
      the interrupt shadow vs. single-step #DBs, and split emulator logic is
      largely what led to this mess in the first place.
      
      Modifying SVM to mirror VMX flow isn't really an option as SVM's case
      isn't limited to a specific exit reason, i.e. handling the slow skip in
      skip_emulated_instruction() is mandatory for all intents and purposes.
      
      Drop VMX's skip_emulated_instruction() wrapper since it can now fail,
      and instead WARN if it fails unexpectedly, e.g. if exit_reason somehow
      becomes corrupted.
      
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Fixes: d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested")
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>