1. 14 Dec 2021, 1 commit
  2. 05 Dec 2021, 4 commits
    • x86/sme: Explicitly map new EFI memmap table as encrypted · 1ff2fc02
      Authored by Tom Lendacky
      Reserving memory using efi_mem_reserve() calls into the x86
      efi_arch_mem_reserve() function. This function will insert a new EFI
      memory descriptor into the EFI memory map representing the area of
      memory to be reserved and marking it as EFI runtime memory. As part
      of adding this new entry, a new EFI memory map is allocated and mapped.
      The mapping is where a problem can occur. This new memory map is mapped
      using early_memremap() and generally mapped encrypted, unless the new
      memory for the mapping happens to come from an area of memory that is
      marked as EFI_BOOT_SERVICES_DATA memory. In this case, the new memory will
      be mapped unencrypted. However, during replacement of the old memory map,
      efi_mem_type() is disabled, so the new memory map will now be long-term
      mapped encrypted (in efi.memmap), resulting in the map containing invalid
      data and causing the kernel boot to crash.
      
      Since it is known that the area will be mapped encrypted going forward,
      explicitly map the new memory map as encrypted using early_memremap_prot().
      
      Cc: <stable@vger.kernel.org> # 4.14.x
      Fixes: 8f716c9b ("x86/mm: Add support to access boot related data in the clear")
      Link: https://lore.kernel.org/all/ebf1eb2940405438a09d51d121ec0d02c8755558.1634752931.git.thomas.lendacky@amd.com/
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      [ardb: incorporate Kconfig fix by Arnd]
      Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
      1ff2fc02
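A minimal user-space sketch of the idea behind the fix above: under AMD SME, a page is "mapped encrypted" by setting the C-bit in its page-table entry, which is what passing an encrypted pgprot to early_memremap_prot() accomplishes. The bit position and all `demo_` names here are illustrative assumptions, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the SME C-bit position is reported by CPUID 0x8000001F;
 * 47 is used here purely for the demo. */
#define DEMO_SME_C_BIT 47ULL
#define DEMO_SME_MASK  (1ULL << DEMO_SME_C_BIT)

/* Model of mapping a PTE "encrypted": OR in the C-bit, as the fix does
 * by requesting an encrypted protection instead of relying on the
 * default chosen by early_memremap(). */
static uint64_t demo_map_encrypted(uint64_t pte)
{
    return pte | DEMO_SME_MASK;
}

static int demo_is_encrypted(uint64_t pte)
{
    return (pte & DEMO_SME_MASK) != 0;
}
```

The bug was precisely the asymmetry the sketch avoids: the initial, short-lived mapping could end up without the C-bit while the long-term mapping assumed it was set.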
    • KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure · ad5b3532
      Authored by Tom Lendacky
      Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
      exit code or exit parameters fails.
      
      The VMGEXIT instruction can be issued from userspace, even though
      userspace (likely) can't update the GHCB. To prevent userspace from being
      able to kill the guest, return an error through the GHCB when validation
      fails rather than terminating the guest. For cases where the GHCB can't be
      updated (e.g. the GHCB can't be mapped, etc.), just return back to the
      guest.
      
      The new error codes are documented in the latest update to the GHCB
      specification.
      
      Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad5b3532
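A sketch of the behavioral change in the commit above: on validation failure, report an error through the (simulated) GHCB and resume the guest rather than terminating it. The struct layout, constants, and `demo_` names are illustrative assumptions, not the GHCB specification's actual encodings.

```c
#include <assert.h>
#include <stdint.h>

struct demo_ghcb {
    uint64_t sw_exit_info_1;
    uint64_t sw_exit_info_2;
};

enum { DEMO_GUEST_RUNNING = 0, DEMO_GUEST_TERMINATED = 1 };
#define DEMO_EXITINFO1_ERROR   2ULL  /* assumed "error" marker */
#define DEMO_ERR_INVALID_EVENT 5ULL  /* assumed error code */

/* Old behavior: any validation failure killed the guest, letting a
 * userspace-issued VMGEXIT act as a denial of service. New behavior:
 * write an error into the GHCB when it is mapped; when it can't be
 * mapped, just return to the guest. Either way, never terminate. */
static int demo_handle_invalid_vmgexit(struct demo_ghcb *ghcb)
{
    if (ghcb) {
        ghcb->sw_exit_info_1 = DEMO_EXITINFO1_ERROR;
        ghcb->sw_exit_info_2 = DEMO_ERR_INVALID_EVENT;
    }
    return DEMO_GUEST_RUNNING;
}
```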
    • KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary · a655276a
      Authored by Sean Christopherson
      Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
      that KVM falls back to __vmalloc() if physically contiguous memory isn't
      available.  The buffer is purely a KVM software construct, i.e. there's
      no need for it to be physically contiguous.
      
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109222350.2266045-3-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a655276a
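The kvzalloc() pattern the commit above switches to can be modeled in user space: try a physically contiguous allocator first and fall back to a virtually contiguous one. Both paths are stand-ins built on calloc; the size threshold and `demo_` names are assumptions for the demo only.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for kzalloc(): pretend contiguous pages run out for large
 * requests, which is exactly when the real allocator may fail. */
static void *demo_alloc_contig(size_t size)
{
    return size > 4096 ? NULL : calloc(1, size);
}

/* Model of kvzalloc(): contiguous first, __vmalloc()-style fallback
 * second. The flag records which path was taken so the fallback is
 * observable in tests. */
static void *demo_kvzalloc(size_t size, int *used_fallback)
{
    void *p = demo_alloc_contig(size);

    *used_fallback = 0;
    if (!p) {
        p = calloc(1, size); /* stand-in for __vmalloc() */
        *used_fallback = 1;
    }
    return p;
}
```

Because the scratch buffer is a pure software construct, the fallback is safe: nothing ever DMAs into it or translates it through page tables that require physical contiguity.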
    • KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails · 75236f5f
      Authored by Sean Christopherson
      Return appropriate error codes if setting up the GHCB scratch area for an
      SEV-ES guest fails.  In particular, returning -EINVAL instead of -ENOMEM
      when allocating the kernel buffer could be confusing as userspace would
      likely suspect a guest issue.
      
      Fixes: 8f423a80 ("KVM: SVM: Support MMIO for an SEV-ES guest")
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211109222350.2266045-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      75236f5f
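The distinction the commit above draws can be sketched as a tiny error-path skeleton: invalid guest-supplied parameters yield -EINVAL, while a host-side allocation failure yields -ENOMEM, so userspace does not misattribute memory pressure to a guest bug. The function name, parameters, and the `alloc_ok` knob are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Errno values as used by the kernel ABI. */
#define DEMO_EINVAL 22
#define DEMO_ENOMEM 12

static int demo_setup_scratch(uint64_t scratch_gpa, uint64_t len,
                              int alloc_ok)
{
    if (len == 0 || scratch_gpa == 0)
        return -DEMO_EINVAL;  /* the guest supplied a bogus area */
    if (!alloc_ok)
        return -DEMO_ENOMEM;  /* host buffer allocation failed */
    return 0;
}
```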
  3. 04 Dec 2021, 4 commits
  4. 03 Dec 2021, 1 commit
    • x86/64/mm: Map all kernel memory into trampoline_pgd · 51523ed1
      Authored by Joerg Roedel
      The trampoline_pgd only maps the 0xfffffff000000000-0xffffffffffffffff
      range of kernel memory (with 4-level paging). This range contains the
      kernel's text+data+bss mappings and the module mapping space but not the
      direct mapping and the vmalloc area.
      
      This is enough to get the application processors out of real-mode, but
      for code that switches back to real-mode the trampoline_pgd is missing
      important parts of the address space. For example, consider this code
      from arch/x86/kernel/reboot.c, function machine_real_restart() for a
      64-bit kernel:
      
        #ifdef CONFIG_X86_32
        	load_cr3(initial_page_table);
        #else
        	write_cr3(real_mode_header->trampoline_pgd);
      
        	/* Exiting long mode will fail if CR4.PCIDE is set. */
        	if (boot_cpu_has(X86_FEATURE_PCID))
        		cr4_clear_bits(X86_CR4_PCIDE);
        #endif
      
        	/* Jump to the identity-mapped low memory code */
        #ifdef CONFIG_X86_32
        	asm volatile("jmpl *%0" : :
        		     "rm" (real_mode_header->machine_real_restart_asm),
        		     "a" (type));
        #else
        	asm volatile("ljmpl *%0" : :
        		     "m" (real_mode_header->machine_real_restart_asm),
        		     "D" (type));
        #endif
      
      The code switches to the trampoline_pgd, which unmaps the direct mapping
      and also the kernel stack. The call to cr4_clear_bits() will find no
      stack and crash the machine. The real_mode_header pointer below points
      into the direct mapping, and dereferencing it also causes a crash.
      
      The only reason this does not always crash is that kernel mappings are
      global and the CR3 switch does not flush those mappings. But if these
      mappings are not in the TLB already, the above code will crash before it
      can jump to the real-mode stub.
      
      Extend the trampoline_pgd to contain all kernel mappings to prevent
      these crashes and to make code which runs on this page-table more
      robust.
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20211202153226.22946-5-joro@8bytes.org
      51523ed1
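The fix above can be sketched as a copy of the kernel half of the reference page table into the trampoline page table. With 4-level paging, the kernel half of the address space occupies the upper 256 of the 512 PGD slots; the arrays and `demo_` names below model that layout in user space.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_PTRS_PER_PGD     512
/* pgd_index() of the first kernel address with 4-level paging. */
#define DEMO_KERNEL_PGD_START 256

/* Instead of mapping only the top entry (kernel text + modules), copy
 * every kernel-half entry from the reference PGD, so the direct map,
 * vmalloc area, and the kernel stack survive the CR3 switch to the
 * trampoline page table. */
static void demo_sync_kernel_mappings(uint64_t *trampoline_pgd,
                                      const uint64_t *init_pgd)
{
    for (int i = DEMO_KERNEL_PGD_START; i < DEMO_PTRS_PER_PGD; i++)
        trampoline_pgd[i] = init_pgd[i];
}
```

The user half is deliberately left untouched: the trampoline keeps its own identity mapping of low memory there, which the real-mode stub requires.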
  5. 02 Dec 2021, 6 commits
    • KVM: x86/mmu: Retry page fault if root is invalidated by memslot update · a955cad8
      Authored by Sean Christopherson
      Bail from the page fault handler if the root shadow page was obsoleted by
      a memslot update.  Do the check _after_ acquiring mmu_lock, as the TDP MMU
      doesn't rely on the memslot/MMU generation, and instead relies on the
      root being explicitly marked invalid by kvm_mmu_zap_all_fast(), which takes
      mmu_lock for write.
      
      For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
      kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
      moved past the gfn associated with the SP.
      
      For other MMUs, the resulting behavior is far more convoluted, though
      unlikely to be truly problematic.  Installing SPs/SPTEs into the obsolete
      root isn't directly problematic, as the obsolete root will be unloaded
      and dropped before the vCPU re-enters the guest.  But because the legacy
      MMU tracks shadow pages by their role, any SP created by the fault can
      be reused in the new post-reload root.  Again, that _shouldn't_ be
      problematic as any leaf child SPTEs will be created for the current/valid
      memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
      the old generation as they will be flagged as obsolete.  But, given that
      continuing with the fault is pointless (the root will be unloaded), apply
      the check to all MMUs.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Cc: stable@vger.kernel.org
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211120045046.3940942-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      a955cad8
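The staleness check described above reduces to: after mmu_lock is taken, a fault against a root that has been marked invalid returns "retry" instead of installing anything. The struct and return codes below are simplified stand-ins for KVM's, named hypothetically.

```c
#include <assert.h>

enum { DEMO_RET_PF_RETRY = 0, DEMO_RET_PF_CONTINUE = 1 };

struct demo_root_sp {
    int role_invalid; /* set by fast invalidation on memslot update */
};

/* Equivalent of the new is-stale check: if the root was obsoleted
 * while the fault was in flight, bail and retry. The root will be
 * unloaded and replaced before the vCPU re-enters the guest, so any
 * work done against it would be wasted at best and, for the TDP MMU,
 * could leak a shadow page at worst. */
static int demo_page_fault(const struct demo_root_sp *root)
{
    if (root->role_invalid)
        return DEMO_RET_PF_RETRY;
    return DEMO_RET_PF_CONTINUE;
}
```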
    • KVM: VMX: Set failure code in prepare_vmcs02() · bfbb307c
      Authored by Dan Carpenter
      The error paths in the prepare_vmcs02() function are supposed to set
      *entry_failure_code but this path does not.  It leads to using an
      uninitialized variable in the caller.
      
      Fixes: 71f73470 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Message-Id: <20211130125337.GB24578@kili>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bfbb307c
    • KVM: ensure APICv is considered inactive if there is no APIC · ef8b4b72
      Authored by Paolo Bonzini
      kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
      local APIC, however kvm_apicv_activated might still be true if there are
      no reasons to disable APICv; in fact it is quite likely that there is none
      because APICv is inhibited by specific configurations of the local APIC
      and those configurations cannot be programmed.  This triggers a WARN:
      
         WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));
      
      To avoid this, introduce another cause for APICv inhibition, namely the
      absence of an in-kernel local APIC.  This cause is enabled by default,
      and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
      KVM_CAP_IRQCHIP_SPLIT.
      Reported-by: Ignat Korchagin <ignat@cloudflare.com>
      Fixes: ee49a893 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
      Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Ignat Korchagin <ignat@cloudflare.com>
      Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ef8b4b72
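The inhibit-reason mechanism described above can be sketched as a bitmask: APICv counts as activated only when the mask is zero, the "absent APIC" reason is set at VM creation, and creating an in-kernel irqchip clears it. The bit assignment and `demo_` names are illustrative, not KVM's actual values.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit for the new inhibition cause: no in-kernel local APIC. */
#define DEMO_INHIBIT_ABSENT (1u << 0)

struct demo_vm {
    uint32_t apicv_inhibit_reasons;
};

/* Default at VM creation: no in-kernel APIC yet, so APICv is inhibited
 * and kvm_apicv_activated() agrees with kvm_vcpu_apicv_active(). */
static void demo_vm_init(struct demo_vm *vm)
{
    vm->apicv_inhibit_reasons = DEMO_INHIBIT_ABSENT;
}

/* KVM_CREATE_IRQCHIP / KVM_CAP_IRQCHIP_SPLIT drop the inhibition. */
static void demo_create_irqchip(struct demo_vm *vm)
{
    vm->apicv_inhibit_reasons &= ~DEMO_INHIBIT_ABSENT;
}

static int demo_apicv_activated(const struct demo_vm *vm)
{
    return vm->apicv_inhibit_reasons == 0;
}
```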
    • KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register · cb1d220d
      Authored by Like Xu
      If we run the following perf command in an AMD Milan guest:
      
        perf stat \
        -e cpu/event=0x1d0/ \
        -e cpu/event=0x1c7/ \
        -e cpu/umask=0x1f,event=0x18e/ \
        -e cpu/umask=0x7,event=0x18e/ \
        -e cpu/umask=0x18,event=0x18e/ \
        ./workload
      
      dmesg will report a #GP warning from an unchecked MSR access
      error on MSR_F15H_PERF_CTLx.
      
      This is because, according to APM (Revision 4.03) Figure 13-7,
      bits [35:32] of the AMD PerfEvtSeln register are part of the
      event select encoding, which extends the EVENT_SELECT field
      from 8 bits to 12 bits.
      
      Opportunistically update pmu->reserved_bits for reserved bit 19.
      Reported-by: Jim Mattson <jmattson@google.com>
      Fixes: ca724305 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
      Signed-off-by: Like Xu <likexu@tencent.com>
      Message-Id: <20211118130320.95997-1-likexu@tencent.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cb1d220d
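The mask arithmetic behind the fix above can be made explicit: un-reserve bits [35:32] (EVENT_SELECT[11:8]) and mark bits 19 and 21 reserved. This is built from bit positions per the commit's description of the APM layout rather than copied from kernel source, so treat the exact field boundaries as an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Construct the guest-writable-complement mask for PerfEvtSeln:
 * start from "everything reserved", open up the defined fields, then
 * re-reserve the individual holes the APM calls out. */
static uint64_t demo_amd_pmu_reserved_bits(void)
{
    uint64_t reserved = ~0ULL;

    reserved &= ~0xffffffffULL;  /* bits [31:0]: defined control fields */
    reserved &= ~(0xfULL << 32); /* EVENT_SELECT[11:8]: the actual fix */
    reserved |= 1ULL << 19;      /* reserved per the APM */
    reserved |= 1ULL << 21;      /* reserved per the APM */

    return reserved;
}
```

With the old mask, a guest programming event 0x1d0 (which needs EVENT_SELECT[11:8] != 0) tripped the reserved-bit check and KVM never wrote MSR_F15H_PERF_CTLx, producing the unchecked-MSR-access #GP warning.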
    • x86/tsc: Disable clocksource watchdog for TSC on qualified platforms · b50db709
      Authored by Feng Tang
      There are cases where the TSC clocksource is wrongly judged as unstable by
      the clocksource watchdog mechanism, which tries to validate the TSC against
      HPET, PM_TIMER or jiffies. While there is hardly a generally reliable way to
      check the validity of a watchdog, Thomas Gleixner proposed [1]:
      
      "I'm inclined to lift that requirement when the CPU has:
      
          1) X86_FEATURE_CONSTANT_TSC
          2) X86_FEATURE_NONSTOP_TSC
          3) X86_FEATURE_NONSTOP_TSC_S3
          4) X86_FEATURE_TSC_ADJUST
          5) At max. 4 sockets
      
       After two decades of horrors we're finally at a point where TSC seems
       to be halfway reliable and less abused by BIOS tinkerers. TSC_ADJUST
       was really key as we can now detect even small modifications reliably
       and the important point is that we can cure them as well (not pretty
       but better than all other options)."
      
      As feature #3, X86_FEATURE_NONSTOP_TSC_S3, only exists on several generations
      of Atom processors, and is always coupled with X86_FEATURE_CONSTANT_TSC
      and X86_FEATURE_NONSTOP_TSC, skip checking it, and also be more defensive
      and require at most 2 sockets.
      
      The check is done inside tsc_init() before registering the 'tsc-early' and
      'tsc' clocksources, as there have been cases where both of them were
      wrongly judged as unreliable.
      
      For more background on tsc/watchdog, there is a good summary in [2].
      
      [tglx] Update vs. jiffies:
      
        On systems where the only remaining clocksource aside of TSC is jiffies
        there is no way to make this work because that creates a circular
        dependency. Jiffies accuracy depends on not missing a periodic timer
        interrupt, which is not guaranteed. That could be detected by TSC, but as
        TSC is not trusted this cannot be compensated. The consequence is a
        circulus vitiosus which results in shutting down TSC and falling back to
        the jiffies clocksource which is even more unreliable.
      
      [1]. https://lore.kernel.org/lkml/87eekfk8bd.fsf@nanos.tec.linutronix.de/
      [2]. https://lore.kernel.org/lkml/87a6pimt1f.ffs@nanos.tec.linutronix.de/
      
      [ tglx: Refine comment and amend changelog ]
      
      Fixes: 6e3cd952 ("x86/hpet: Use another crystalball to evaluate HPET usability")
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20211117023751.24190-2-feng.tang@intel.com
      b50db709
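The qualification criteria listed above boil down to a simple predicate: skip the watchdog only when the invariant-TSC features are present and the machine is small enough. The struct fields below are illustrative stand-ins for the X86_FEATURE_* flags and the socket count.

```c
#include <assert.h>

struct demo_cpu {
    int constant_tsc;      /* X86_FEATURE_CONSTANT_TSC */
    int nonstop_tsc;       /* X86_FEATURE_NONSTOP_TSC */
    int tsc_adjust;        /* X86_FEATURE_TSC_ADJUST */
    int nr_online_sockets;
};

/* Returns 1 if the TSC still needs HPET/PM_TIMER cross-checking, 0 if
 * the watchdog can be skipped. NONSTOP_TSC_S3 is deliberately not
 * checked, per the commit: it is always coupled with the two TSC
 * features above on the Atom parts that have it. */
static int demo_tsc_needs_watchdog(const struct demo_cpu *c)
{
    if (c->constant_tsc && c->nonstop_tsc && c->tsc_adjust &&
        c->nr_online_sockets <= 2)
        return 0;
    return 1;
}
```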
    • x86/tsc: Add a timer to make sure TSC_adjust is always checked · c7719e79
      Authored by Feng Tang
      The TSC_ADJUST register is checked every time a CPU enters idle state, but
      Thomas Gleixner noted a remaining caveat: a system may never enter idle [1],
      either because it is too busy or because it is purposely configured not to
      enter idle.
      
      Set up a periodic timer (every 10 minutes) to make sure the check
      happens on a regular basis.
      
      [1] https://lore.kernel.org/lkml/875z286xtk.fsf@nanos.tec.linutronix.de/
      
      Fixes: 6e3cd952 ("x86/hpet: Use another crystalball to evaluate HPET usability")
      Requested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: stable@vger.kernel.org
      Link: https://lore.kernel.org/r/20211117023751.24190-1-feng.tang@intel.com
      c7719e79
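The re-arming pattern behind the commit above can be sketched as a self-rescheduling deadline check; the time bookkeeping is simplified to plain seconds and all `demo_` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CHECK_INTERVAL_SEC (10 * 60)

struct demo_tsc_timer {
    uint64_t next_check;  /* absolute deadline, seconds */
    int checks_done;
};

/* When the deadline passes, run the TSC_ADJUST verification (here just
 * counted) and re-arm for another interval, so the check fires even on
 * systems that never enter idle. */
static void demo_timer_tick(struct demo_tsc_timer *t, uint64_t now)
{
    if (now >= t->next_check) {
        t->checks_done++;  /* tsc_verify_tsc_adjust() would run here */
        t->next_check = now + DEMO_CHECK_INTERVAL_SEC;
    }
}
```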
  6. 01 Dec 2021, 3 commits
  7. 30 Nov 2021, 17 commits
  8. 27 Nov 2021, 1 commit
  9. 26 Nov 2021, 3 commits