1. 08 Dec 2021, 1 commit
    • KVM: nVMX: Don't use Enlightened MSR Bitmap for L3 · 250552b9
      Vitaly Kuznetsov authored
      When KVM runs as a nested hypervisor on top of Hyper-V it uses Enlightened
      VMCS and enables Enlightened MSR Bitmap feature for its L1s and L2s (which
      are actually L2s and L3s from Hyper-V's perspective). When MSR bitmap is
      updated, KVM has to reset HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP from
      clean fields to make Hyper-V aware of the change. For KVM's L1s, this is
      done in vmx_disable_intercept_for_msr()/vmx_enable_intercept_for_msr().
      The MSR bitmap for L2 is built in nested_vmx_prepare_msr_bitmap() by blending
      the MSR bitmap for L1 with L1's idea of the MSR bitmap for L2. KVM, however, doesn't
      check if the resulting bitmap is different and never cleans
      HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP in eVMCS02. This is incorrect and
      may result in Hyper-V missing the update.
      
      The issue could have been solved by calling evmcs_touch_msr_bitmap() for
      eVMCS02 from nested_vmx_prepare_msr_bitmap() unconditionally, but doing so
      would give no performance benefit compared to not using Enlightened
      MSR Bitmap at all, and 3-level nesting is not a very common setup
      nowadays.
      
      Don't enable 'Enlightened MSR Bitmap' feature for KVM's L2s (real L3s) for
      now.
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211129094704.326635-2-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      250552b9
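The clean-fields handshake described above can be sketched in plain C. The struct layout and bit value below are simplified stand-ins, not the real Hyper-V ABI definitions, and touch_msr_bitmap() stands in for what evmcs_touch_msr_bitmap() does:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position; the real constant is
 * HV_VMX_ENLIGHTENED_CLEAN_FIELD_MSR_BITMAP. */
#define CLEAN_FIELD_MSR_BITMAP (1u << 1)

struct evmcs {
    uint32_t hv_clean_fields;
    uint8_t  msr_bitmap[16];    /* truncated; the real bitmap is 4 KiB */
};

/* After modifying the MSR bitmap, clear its clean bit so the underlying
 * hypervisor (Hyper-V) knows it must re-read the field. */
static void touch_msr_bitmap(struct evmcs *e)
{
    e->hv_clean_fields &= ~CLEAN_FIELD_MSR_BITMAP;
}

static void set_msr_intercept(struct evmcs *e, unsigned int msr_idx)
{
    e->msr_bitmap[msr_idx / 8] |= 1u << (msr_idx % 8);
    touch_msr_bitmap(e);        /* the step missing on the eVMCS02 path */
}
```

Skipping touch_msr_bitmap() here is exactly the bug: the bitmap changes but the clean bit still tells Hyper-V the field is unchanged.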
  2. 05 Dec 2021, 3 commits
  3. 04 Dec 2021, 4 commits
  4. 03 Dec 2021, 1 commit
    • x86/64/mm: Map all kernel memory into trampoline_pgd · 51523ed1
      Joerg Roedel authored
      The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff
      range of kernel memory (with 4-level paging). This range contains the
      kernel's text+data+bss mappings and the module mapping space but not the
      direct mapping and the vmalloc area.
      
      This is enough to get the application processors out of real-mode, but
      for code that switches back to real-mode the trampoline_pgd is missing
      important parts of the address space. For example, consider this code
      from arch/x86/kernel/reboot.c, function machine_real_restart() for a
      64-bit kernel:
      
        #ifdef CONFIG_X86_32
        	load_cr3(initial_page_table);
        #else
        	write_cr3(real_mode_header->trampoline_pgd);
      
        	/* Exiting long mode will fail if CR4.PCIDE is set. */
        	if (boot_cpu_has(X86_FEATURE_PCID))
        		cr4_clear_bits(X86_CR4_PCIDE);
        #endif
      
        	/* Jump to the identity-mapped low memory code */
        #ifdef CONFIG_X86_32
        	asm volatile("jmpl *%0" : :
        		     "rm" (real_mode_header->machine_real_restart_asm),
        		     "a" (type));
        #else
        	asm volatile("ljmpl *%0" : :
        		     "m" (real_mode_header->machine_real_restart_asm),
        		     "D" (type));
        #endif
      
      The code switches to the trampoline_pgd, which unmaps the direct mapping
      and also the kernel stack. The call to cr4_clear_bits() will find no
      stack and crash the machine. The real_mode_header pointer below points
      into the direct mapping, and dereferencing it also causes a crash.
      
      The only reason this does not always crash is that kernel mappings are
      global and the CR3 switch does not flush them. But if these
      mappings are not already in the TLB, the above code will crash before it
      can jump to the real-mode stub.
      
      Extend the trampoline_pgd to contain all kernel mappings to prevent
      these crashes and to make code which runs on this page-table more
      robust.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org
      51523ed1
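The coverage gap can be illustrated with the 4-level address-layout constants below (the direct-map base is the conventional x86-64 value); old_trampoline_maps() is a hypothetical simplification of the pre-fix coverage, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define OLD_TRAMPOLINE_LOW  0xfffffff000000000ull /* start of the old mapped range */
#define DIRECT_MAP_BASE     0xffff888000000000ull /* 4-level direct-mapping base */

/* Before the fix: only addresses in the top range (kernel text, data and
 * module space) were covered, so the kernel stack and real_mode_header,
 * which live in the direct map, were not mapped at all. */
static bool old_trampoline_maps(uint64_t vaddr)
{
    return vaddr >= OLD_TRAMPOLINE_LOW;
}
```

Any dereference of a direct-map address after the CR3 switch therefore depends on a stale global TLB entry, which is the intermittent crash described above.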
  5. 02 Dec 2021, 6 commits
    • KVM: x86/mmu: Retry page fault if root is invalidated by memslot update · a955cad8
      Sean Christopherson authored
      Bail from the page fault handler if the root shadow page was obsoleted by
      a memslot update.  Do the check _after_ acquiring mmu_lock, as the TDP MMU
      doesn't rely on the memslot/MMU generation, and instead relies on the
      root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes
      mmu_lock for write.
      
      For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
      kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
      moved past the gfn associated with the SP.
      
      For other MMUs, the resulting behavior is far more convoluted, though
      unlikely to be truly problematic.  Installing SPs/SPTEs into the obsolete
      root isn't directly problematic, as the obsolete root will be unloaded
      and dropped before the vCPU re-enters the guest.  But because the legacy
      MMU tracks shadow pages by their role, any SP created by the fault can
      be reused in the new post-reload root.  Again, that _shouldn't_ be
      problematic as any leaf child SPTEs will be created for the current/valid
      memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
      the old generation as they will be flagged as obsolete.  But, given that
      continuing with the fault is pointless (the root will be unloaded), apply
      the check to all MMUs.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211120045046.3940942-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a955cad8
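A minimal model of the retry check, with illustrative names rather than the actual KVM symbols (the real handler returns RET_PF_RETRY after the check, which runs only once mmu_lock is held):

```c
#include <assert.h>
#include <stdbool.h>

struct mmu_root { unsigned long gen; bool invalid; };
struct vm       { unsigned long memslot_gen; };

/* True if the fault should be retried instead of installing SPTEs into a
 * root that has been (or is about to be) zapped: either the root was
 * explicitly marked invalid (TDP MMU) or its generation no longer
 * matches the memslot generation (legacy MMU). */
static bool root_is_stale(const struct vm *vm, const struct mmu_root *root)
{
    return root->invalid || root->gen != vm->memslot_gen;
}

static int handle_fault(struct vm *vm, struct mmu_root *root)
{
    /* ... fault setup, then take mmu_lock ... */
    if (root_is_stale(vm, root))
        return -1;              /* retry; do not touch the obsolete root */
    /* ... install the SPTE ... */
    return 0;
}
```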
    • KVM: VMX: Set failure code in prepare_vmcs02() · bfbb307c
      Dan Carpenter authored
      The error paths in the prepare_vmcs02() function are supposed to set
      *entry_failure_code, but this path does not, which leads to the caller
      using an uninitialized variable.
      
      Fixes: 71f73470 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Message-Id: <20211130125337.GB24578@kili>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bfbb307c
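The bug class is easy to reproduce in miniature; the names below are hypothetical, not the real vmx code:

```c
#include <assert.h>

enum failure_code { FAIL_NONE = 0, FAIL_BAD_MSR_LOAD = 7 };

/* Before the fix, this early-return path left *code untouched, so the
 * caller went on to read an uninitialized value. The fix is simply to
 * set *code on every failing path before returning. */
static int prepare_state(int bad_msr, enum failure_code *code)
{
    if (bad_msr) {
        *code = FAIL_BAD_MSR_LOAD;  /* the assignment the fix adds */
        return -1;
    }
    return 0;
}
```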
    • KVM: ensure APICv is considered inactive if there is no APIC · ef8b4b72
      Paolo Bonzini authored
      kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
      local APIC; however, kvm_apicv_activated() might still be true if there are
      no reasons to disable APICv. In fact it is quite likely that there are none,
      because APICv is inhibited only by specific configurations of the local APIC,
      and without a local APIC those configurations cannot be programmed.  This
      triggers a WARN:
      
         WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
      
      To avoid this, introduce another cause for APICv inhibition, namely the
      absence of an in-kernel local APIC.  This cause is enabled by default,
      and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
      KVM_CAP_IRQCHIP_SPLIT.
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Fixes: ee49a893 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Ignat Korchagin <ignat@cloudflare.com>
      Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ef8b4b72
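The inhibition scheme can be modeled as a reason bitmask that starts with an "APIC absent" bit set and clears it once an in-kernel irqchip exists. The flag names below are illustrative, not the kernel's APICV_INHIBIT_REASON_* enum values:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative inhibit-reason bits. */
#define INHIBIT_ABSENT_IRQCHIP (1u << 0)

struct vm { unsigned int inhibit_reasons; };

static void vm_init(struct vm *vm)
{
    /* Inhibited by default: no in-kernel local APIC exists yet. */
    vm->inhibit_reasons = INHIBIT_ABSENT_IRQCHIP;
}

/* Models KVM_CREATE_IRQCHIP (or enabling KVM_CAP_IRQCHIP_SPLIT):
 * the default inhibition is dropped. */
static void create_irqchip(struct vm *vm)
{
    vm->inhibit_reasons &= ~INHIBIT_ABSENT_IRQCHIP;
}

/* APICv is activated only when no inhibit reason remains, so the
 * WARN's two sides now agree for APIC-less VMs. */
static bool apicv_activated(const struct vm *vm)
{
    return vm->inhibit_reasons == 0;
}
```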
    • KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register · cb1d220d
      Like Xu authored
      If we run the following perf command in an AMD Milan guest:
      
        perf stat \
        -e cpu/event=0x1d0/ \
        -e cpu/event=0x1c7/ \
        -e cpu/umask=0x1f,event=0x18e/ \
        -e cpu/umask=0x7,event=0x18e/ \
        -e cpu/umask=0x18,event=0x18e/ \
        ./workload
      
      dmesg will report a #GP warning from an unchecked MSR access
      error on MSR_F15H_PERF_CTLx.
      
      This is because, according to the APM (Revision 4.03) Figure 13-7,
      bits [35:32] of the AMD PerfEvtSeln register are part of the
      event select encoding, which extends the EVENT_SELECT field
      from 8 bits to 12 bits.
      
      Opportunistically update pmu->reserved_bits for reserved bit 19.
      Reported-by: Jim Mattson <jmattson@google.com>
      Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211118130320.95997-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb1d220d
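A sketch of the widened decoding; the helper name is hypothetical, but the bit layout follows the APM figure cited above:

```c
#include <assert.h>
#include <stdint.h>

/* Per APM Figure 13-7, EVENT_SELECT occupies bits [7:0] plus [35:32] of
 * PerfEvtSeln, giving a 12-bit event number. Bits [35:32] must therefore
 * not be treated as reserved; bit 19, by contrast, is reserved. */
static uint16_t amd_event_select(uint64_t evtsel)
{
    /* Low 8 bits stay in place; bits [35:32] become bits [11:8]. */
    return (uint16_t)((evtsel & 0xffull) | ((evtsel >> 24) & 0xf00ull));
}
```

With the old 8-bit mask, guest writes selecting events like 0x1d0 or 0x18e set bits the mask called reserved, producing the #GP warning in the report.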
    • x86/tsc: Disable clocksource watchdog for TSC on qualified platforms · b50db709
      Feng Tang authored
      There are cases where the TSC clocksource is wrongly judged as unstable by
      the clocksource watchdog mechanism, which tries to validate the TSC against
      HPET, PM_TIMER or jiffies. While there is hardly a generally reliable way to
      check the validity of a watchdog, Thomas Gleixner proposed [1]:
      
      "I'm inclined to lift that requirement when the CPU has:
      
          1) X86_FEATURE_CONSTANT_TSC
          2) X86_FEATURE_NONSTOP_TSC
          3) X86_FEATURE_NONSTOP_TSC_S3
          4) X86_FEATURE_TSC_ADJUST
          5) At max. 4 sockets
      
       After two decades of horrors we're finally at a point where TSC seems
       to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST
       was really key as we can now detect even small modifications reliably
       and the important point is that we can cure them as well (not pretty
       but better than all other options)."
      
      As feature #3, X86_FEATURE_NONSTOP_TSC_S3, only exists on several generations
      of Atom processors and is always coupled with X86_FEATURE_CONSTANT_TSC
      and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive
      by requiring at most 2 sockets.
      
      The check is done inside tsc_init() before registering the 'tsc-early' and
      'tsc' clocksources, as there have been cases where both of them were
      wrongly judged as unreliable.
      
      For more background on the TSC/watchdog interaction, there is a good summary
      in [2].
      
      [tglx] Update vs. jiffies:
      
        On systems where the only remaining clocksource aside of TSC is jiffies
        there is no way to make this work because that creates a circular
        dependency. Jiffies accuracy depends on not missing a periodic timer
        interrupt, which is not guaranteed. That could be detected by TSC, but as
        TSC is not trusted this cannot be compensated. The consequence is a
        circulus vitiosus which results in shutting down TSC and falling back to
        the jiffies clocksource which is even more unreliable.
      
      [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/
      [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/
      
      [ tglx: Refine comment and amend changelog ]
      
      Fixes: 6e3cd952 ("x86/hpet: Use another crystalball to evaluate HPET usability")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
      b50db709
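The qualification test reduces to a feature-and-topology predicate. The struct below models the CPU feature flags as plain booleans rather than the kernel's boot_cpu_has() checks:

```c
#include <assert.h>
#include <stdbool.h>

struct cpu_caps {
    bool constant_tsc;       /* X86_FEATURE_CONSTANT_TSC */
    bool nonstop_tsc;        /* X86_FEATURE_NONSTOP_TSC */
    bool tsc_adjust;         /* X86_FEATURE_TSC_ADJUST */
    int  nr_online_sockets;
};

/* True when the TSC is trustworthy enough that the clocksource watchdog
 * can be skipped: constant + nonstop TSC, a detectable/curable
 * TSC_ADJUST, and a defensively small socket count (at most 2). */
static bool tsc_watchdog_not_needed(const struct cpu_caps *c)
{
    return c->constant_tsc && c->nonstop_tsc && c->tsc_adjust &&
           c->nr_online_sockets <= 2;
}
```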
    • x86/tsc: Add a timer to make sure TSC_adjust is always checked · c7719e79
      Feng Tang authored
      The TSC_ADJUST register is checked every time a CPU enters an idle state, but
      Thomas Gleixner mentioned there is still a caveat: a system won't enter
      idle [1], either because it's too busy or because it is purposely configured
      not to enter idle.
      
      Set up a periodic timer (every 10 minutes) to make sure the check
      happens on a regular basis.
      
      [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/
      
      Fixes: 6e3cd952 ("x86/hpet: Use another crystalball to evaluate HPET usability")
      Requested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
      c7719e79
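The re-arming scheme can be modeled with plain timestamps. The kernel itself uses timer_setup()/mod_timer(); the simplified sketch below only captures the "always re-arm after a check" invariant, so a busy system that never idles still gets checked every interval:

```c
#include <assert.h>

#define CHECK_INTERVAL 600L /* seconds; the kernel uses 10 minutes */

struct tsc_check {
    long next_fire;   /* when the periodic check is next due */
    int  checks_done;
};

/* Runs the TSC_ADJUST verification and re-arms the timer, whether it was
 * triggered from the idle path or by the timer expiring. */
static void run_check(struct tsc_check *t, long now)
{
    t->checks_done++;
    t->next_fire = now + CHECK_INTERVAL;
}

/* Timer tick: fire the check only once the interval has elapsed. */
static void tick(struct tsc_check *t, long now)
{
    if (now >= t->next_fire)
        run_check(t, now);
}
```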
  6. 01 Dec 2021, 2 commits
  7. 30 Nov 2021, 17 commits
  8. 27 Nov 2021, 1 commit
  9. 26 Nov 2021, 5 commits