1. 05 December 2021: 1 commit
    • KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure · ad5b3532
      Authored by Tom Lendacky
      Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
      exit code or exit parameters fails.
      
      The VMGEXIT instruction can be issued from userspace, even though
      userspace (likely) can't update the GHCB. To prevent userspace from being
      able to kill the guest, return an error through the GHCB when validation
      fails rather than terminating the guest. For cases where the GHCB can't be
      updated (e.g. the GHCB can't be mapped, etc.), just return back to the
      guest.
      
      The new error codes are documented in the latest update to the GHCB
      specification.
      
      Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad5b3532
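      A minimal sketch of the idea, for illustration only: the helper below
      reports the failure back through the GHCB instead of terminating the
      guest.  The accessor names and error values are assumptions loosely
      based on the GHCB specification, not the exact upstream diff.

          #include <linux/string.h>
          #include <asm/svm.h>            /* struct ghcb, ghcb_set_sw_exit_info_*() */

          /* Assumed error value; the real codes come from the GHCB spec. */
          #define GHCB_ERR_INVALID_EVENT  4

          static void sev_es_report_ghcb_error(struct ghcb *ghcb, u64 reason)
          {
                  /* Clear the valid bitmap so stale entries are not consumed. */
                  memset(ghcb->save.valid_bitmap, 0,
                         sizeof(ghcb->save.valid_bitmap));

                  /* A non-zero exit_info_1 tells the guest the request failed. */
                  ghcb_set_sw_exit_info_1(ghcb, 2);
                  ghcb_set_sw_exit_info_2(ghcb, reason);
          }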
  2. 02 December 2021: 1 commit
  3. 01 December 2021: 1 commit
  4. 27 November 2021: 1 commit
  5. 25 November 2021: 2 commits
  6. 20 November 2021: 1 commit
  7. 18 November 2021: 1 commit
    • KVM: x86/mmu: include EFER.LMA in extended mmu role · b8453cdc
      Authored by Maxim Levitsky
      Incorporate EFER.LMA into kvm_mmu_extended_role, as it is used to compute the
      guest root level and is not reflected in kvm_mmu_page_role.level when TDP
      is in use.  When simply running the guest, it is impossible for EFER.LMA
      and kvm_mmu.root_level to get out of sync, as the guest cannot transition
      from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
      first bouncing through a different MMU context.  And stuffing guest state
      via KVM_SET_SREGS{,2} also ensures a full MMU context reset.
      
      However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
      set guest state when migrating the VM while L2 is active, the vCPU state
      will reflect L2, not L1.  If L1 is using TDP for L2, then root_mmu will
      have been configured using L2's state, despite not being used for L2.  If
      L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
      be configured for guest PAE paging, but will match the mmu_role for 64-bit
      paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.
      
      Alternatively, the root_mmu's role could be invalidated after a successful
      KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
      i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
      tricky, and not even needed if L1 and L2 do have the same role (e.g., they
      are both 64-bit guests and run with the same CR4).
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b8453cdc
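      A sketch of the shape of the fix: record EFER.LMA in the extended role so
      a change in guest long-mode state forces an MMU reconfiguration.  The
      bitfield below is an illustrative subset and the helper is an assumption,
      not the exact upstream definition:

          #include <linux/types.h>
          #include <asm/msr-index.h>      /* EFER_LMA */

          union kvm_mmu_extended_role {
                  u32 word;
                  struct {
                          unsigned int valid:1;
                          unsigned int execonly:1;
                          unsigned int cr4_smep:1;
                          unsigned int cr4_smap:1;
                          unsigned int cr4_la57:1;
                          unsigned int efer_lma:1;  /* new: guest 64-bit paging */
                  };
          };

          static u32 mmu_ext_role_word(u64 guest_efer)
          {
                  union kvm_mmu_extended_role ext = { 0 };

                  ext.valid    = 1;
                  ext.efer_lma = !!(guest_efer & EFER_LMA);
                  /* ... CR0/CR4-derived bits omitted ... */

                  return ext.word;
          }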
  8. 13 November 2021: 1 commit
  9. 11 November 2021: 10 commits
    • KVM: x86: Drop arbitrary KVM_SOFT_MAX_VCPUS · da1bfd52
      Authored by Vitaly Kuznetsov
      KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
      VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
      either num_online_cpus() or num_present_cpus(), s390 has multiple
      constants depending on hardware features. On x86, KVM reports an
      arbitrary value of '710', which is supposed to be the maximum tested
      value, but it is possible to test all KVM_MAX_VCPUS even when fewer
      physical CPUs are available.
      
      Drop the arbitrary '710' value and return num_online_cpus() on x86 as
      well. The recommendation will match other architectures and will mean
      'no CPU overcommit'.
      
      For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
      when the requested vCPU number exceeds it. The static limit of '710'
      is quite weird as smaller systems with just a few physical CPUs should
      certainly "recommend" less.
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da1bfd52
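      A minimal sketch of the resulting recommendation on x86 (the helper name
      is illustrative and the clamp to KVM_MAX_VCPUS is an assumption, not
      necessarily the verbatim patch):

          #include <linux/cpumask.h>      /* num_online_cpus() */
          #include <linux/kvm_host.h>     /* KVM_MAX_VCPUS */
          #include <linux/minmax.h>

          /* "Recommended" vCPU count: no CPU overcommit, capped at the hard max. */
          static unsigned int kvm_recommended_vcpus(void)
          {
                  return min_t(unsigned int, num_online_cpus(), KVM_MAX_VCPUS);
          }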
    • KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES · 760849b1
      Authored by Paul Durrant
      Currently when kvm_update_cpuid_runtime() runs, it assumes that the
      KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true,
      however, if Hyper-V support is enabled. In this case the KVM leaves will
      be offset.
      
      This patch introduces a new 'kvm_cpuid_base' field into struct
      kvm_vcpu_arch to track the location of the KVM leaves and function
      kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the
      leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a
      definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now
      target the correct leaf.
      
      NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced
            into processor.h to avoid having duplicate code for the iteration
            over possible hypervisor base leaves.
      Signed-off-by: Paul Durrant <pdurrant@amazon.com>
      Message-Id: <20211105095101.5384-3-pdurrant@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      760849b1
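      A sketch of the signature scan over the vCPU's userspace-provided CPUID
      entries (structure and field names follow the KVM UAPI; the helper itself
      is an illustrative assumption, not the exact patch):

          #include <linux/string.h>
          #include <linux/kvm_host.h>     /* struct kvm_cpuid_entry2 */

          /* Walk the possible hypervisor bases (0x40000000, 0x40000100, ...)
           * and return the one carrying the "KVMKVMKVM\0\0\0" signature; the
           * KVM_CPUID_FEATURES leaf then lives at base + 1. */
          static u32 find_kvm_cpuid_base(struct kvm_cpuid_entry2 *entries, int nent)
          {
                  u32 base, signature[3];
                  int i;

                  for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                          for (i = 0; i < nent; i++) {
                                  if (entries[i].function != base)
                                          continue;
                                  signature[0] = entries[i].ebx;
                                  signature[1] = entries[i].ecx;
                                  signature[2] = entries[i].edx;
                                  if (!memcmp(signature, "KVMKVMKVM\0\0\0", 12))
                                          return base;
                          }
                  }
                  return 0;
          }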
    • KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active · cae72dcc
      Authored by Maxim Levitsky
      KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected through KVM's
      standard inject_pending_event() path, not via APICv/AVIC.
      
      Since this is a debug feature, just inhibit APICv/AVIC while
      KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
      
      Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cae72dcc
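      A sketch of the inhibit logic (the inhibit-reason constant and update
      helper mirror KVM's APICv machinery; shown as an assumption, not the
      verbatim patch):

          #include <linux/kvm_host.h>

          /* Inhibit APICv/AVIC whenever any vCPU has KVM_GUESTDBG_BLOCKIRQ set,
           * so interrupts go through the ordinary injection path. */
          static void update_blockirq_apicv_inhibit(struct kvm *kvm)
          {
                  struct kvm_vcpu *vcpu;
                  bool block = false;
                  int i;

                  kvm_for_each_vcpu(i, vcpu, kvm) {
                          if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ) {
                                  block = true;
                                  break;
                          }
                  }
                  kvm_request_apicv_update(kvm, !block,
                                           APICV_INHIBIT_REASON_BLOCKIRQ);
          }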
    • KVM: x86: Fix recording of guest steal time / preempted status · 7e2175eb
      Authored by David Woodhouse
      In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
      not missed") we switched to using a gfn_to_pfn_cache for accessing the
      guest steal time structure in order to allow for an atomic xchg of the
      preempted field. This has a couple of problems.
      
      Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
      atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
      guest vCPU using an IOMEM page for its steal time would never have its
      preempted field set.
      
      Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
      should have been. There are two stages to the GFN->PFN conversion;
      first the GFN is converted to a userspace HVA, and then that HVA is
      looked up in the process page tables to find the underlying host PFN.
      Correct invalidation of the latter would require being hooked up to the
      MMU notifiers, but that doesn't happen---so it just keeps mapping and
      unmapping the *wrong* PFN after the userspace page tables change.
      
      In the !IOMEM case at least the stale page *is* pinned all the time it's
      cached, so it won't be freed and reused by anyone else while still
      receiving the steal time updates. The map/unmap dance only takes care
      of the KVM administrivia such as marking the page dirty.
      
      Until the gfn_to_pfn cache handles the remapping automatically by
      integrating with the MMU notifiers, we might as well not get a
      kernel mapping of it, and use the perfectly serviceable userspace HVA
      that we already have.  We just need to implement the atomic xchg on
      the userspace address with appropriate exception handling, which is
      fairly trivial.
      
      Cc: stable@vger.kernel.org
      Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
      [I didn't entirely agree with David's assessment of the
       usefulness of the gfn_to_pfn cache, and integrated the outcome
       of the discussion in the above commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e2175eb
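      A simplified sketch of working on the userspace HVA with exception
      handling.  The real patch performs an atomic xchg via inline asm with an
      exception-table entry; the plain unsafe accessors below only illustrate
      the fault-tolerant access pattern, not the atomicity:

          #include <linux/uaccess.h>

          /* Set the 'preempted' flag at a guest steal-time HVA; a faulting
           * address is silently skipped rather than killing anything. */
          static u8 set_preempted_at_hva(u8 __user *preempted, u8 val)
          {
                  u8 old = 0;

                  if (!user_access_begin(preempted, sizeof(*preempted)))
                          return 0;

                  unsafe_get_user(old, preempted, out);
                  unsafe_put_user(old | val, preempted, out);
          out:
                  user_access_end();
                  return old;
          }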
    • KVM: SEV: Add support for SEV intra host migration · b5663931
      Authored by Peter Gonda
      For SEV to work with intra host migration, contents of the SEV info struct
      such as the ASID (used to index the encryption key in the AMD SP) and
      the list of memory regions need to be transferred to the target VM.
      This change adds a command for a target VMM to get a source SEV VM's SEV
      info.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-3-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5663931
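      A sketch of how a VMM might use this from userspace, assuming the
      KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability introduced by this series
      (error handling trimmed):

          #include <string.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          /* Move the SEV context (ASID, memory-region list, ...) from the
           * source VM fd to the destination VM fd. */
          static int sev_move_enc_context(int dst_vm_fd, int src_vm_fd)
          {
                  struct kvm_enable_cap cap;

                  memset(&cap, 0, sizeof(cap));
                  cap.cap = KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM;
                  cap.args[0] = src_vm_fd;

                  return ioctl(dst_vm_fd, KVM_ENABLE_CAP, &cap);
          }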
    • x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature. · f4495615
      Authored by Ashish Kalra
      The guest support for detecting and enabling the SEV Live Migration
      feature uses the following logic:
      
       - kvm_init_platform() checks whether the kernel booted under EFI
      
         - If not EFI:
      
           i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl()
              to enable the SEV live migration support
      
         - If EFI:
      
           i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read
              the UEFI variable which indicates OVMF support for live migration
      
           ii) if the variable indicates live migration is supported, issue a wrmsrl()
               to enable the SEV live migration support
      
      The EFI live migration check is done using a late_initcall() callback.
      
      Also, ensure that _bss_decrypted section is marked as decrypted in the
      hypervisor's guest page encryption status tracking.
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f4495615
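      A sketch of the guest-side enable path described above.  The MSR and
      feature names follow the KVM_FEATURE_MIGRATION_CONTROL interface from
      this series; the UEFI-variable check is reduced to a hypothetical
      placeholder helper:

          #include <linux/efi.h>
          #include <linux/kvm_para.h>
          #include <asm/msr.h>

          /* Hypothetical stand-in for the late_initcall() UEFI-variable read. */
          static bool ovmf_reports_live_migration_support(void)
          {
                  return false;
          }

          static void sev_live_migration_enable(void)
          {
                  if (!kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
                          return;

                  /* Non-EFI boot: nothing else gates the feature. */
                  if (!efi_enabled(EFI_BOOT)) {
                          wrmsrl(MSR_KVM_MIGRATION_CONTROL, KVM_MIGRATION_READY);
                          return;
                  }

                  /* EFI boot: enable only if OVMF advertised support. */
                  if (ovmf_reports_live_migration_support())
                          wrmsrl(MSR_KVM_MIGRATION_CONTROL, KVM_MIGRATION_READY);
          }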
    • mm: x86: Invoke hypercall when page encryption status is changed · 064ce6c5
      Authored by Brijesh Singh
      Invoke a hypercall when a memory region is changed from encrypted ->
      decrypted and vice versa. The hypervisor needs to know the page
      encryption status during guest migration.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      064ce6c5
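      A sketch of the notification, assuming the KVM_HC_MAP_GPA_RANGE
      para-virt hypercall and its attribute bits (the wrapper itself is
      illustrative, not the exact patch):

          #include <linux/kvm_para.h>
          #include <asm/page.h>

          /* Tell the hypervisor that 'npages' 4K pages starting at 'pfn'
           * changed encryption status. */
          static void notify_enc_status_changed(unsigned long pfn,
                                                unsigned long npages, bool enc)
          {
                  unsigned long attrs = enc ? KVM_MAP_GPA_RANGE_ENCRYPTED
                                            : KVM_MAP_GPA_RANGE_DECRYPTED;

                  kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, pfn << PAGE_SHIFT, npages,
                                 attrs | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
          }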
    • x86/kvm: Add AMD SEV specific Hypercall3 · 08c2336d
      Authored by Brijesh Singh
      The KVM hypercall framework relies on the alternatives framework to patch
      VMCALL -> VMMCALL on AMD platforms. If a hypercall is made before
      apply_alternatives() is called, it defaults to VMCALL. That approach
      works fine on a non-SEV guest: a VMCALL causes a #UD, and the hypervisor
      is able to decode the instruction and do the right thing. But when SEV
      is active, guest memory is encrypted with the guest key and the
      hypervisor is not able to decode the instruction bytes.
      
      To highlight the need for this interface, consider the flow around
      apply_alternatives():
      setup_arch() calls init_hypervisor_platform(), which detects
      the hypervisor platform the kernel is running under, and then the
      hypervisor-specific initialization code can make early hypercalls.
      For example, KVM-specific initialization in the SEV case will try
      to mark the "__bss_decrypted" section's encryption state via early
      page encryption status hypercalls.
      
      Now, apply_alternatives() is called much later when setup_arch()
      calls check_bugs(), so we do need some kind of an early,
      pre-alternatives hypercall interface. Other cases of pre-alternatives
      hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES
      and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for
      SEV generally.
      
      Add an SEV-specific hypercall3; it unconditionally uses VMMCALL. The
      hypercall will be used by the SEV guest to notify the hypervisor about
      encrypted pages.
      
      This kvm_sev_hypercall3() function is abstracted and used as follows:
      all these early hypercalls are made through early_set_memory_XX() interfaces,
      which in turn invoke pv_ops (paravirt_ops).
      
      This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed()
      is a generic interface and can easily have SEV, TDX and any other
      future platform specific abstractions added to it.
      
      Currently, pv_ops.mmu.notify_page_enc_status_changed() callback is setup to
      invoke kvm_sev_hypercall3() in case of SEV.
      
      Similarly, in case of TDX, pv_ops.mmu.notify_page_enc_status_changed()
      can be setup to a TDX specific callback.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08c2336d
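      A sketch of the pre-alternatives helper: it always emits VMMCALL, so it
      is safe inside an SEV guest before apply_alternatives() has run (shown
      here as an illustration of the interface described above):

          /* AMD/SEV specific three-argument hypercall. */
          static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
                                                unsigned long p2, unsigned long p3)
          {
                  long ret;

                  asm volatile("vmmcall"
                               : "=a"(ret)
                               : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
                               : "memory");
                  return ret;
          }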
    • x86/smp: Factor out parts of native_smp_prepare_cpus() · ce2612b6
      Authored by Boris Ostrovsky
      Commit 66558b73 ("sched: Add cluster scheduler level for x86")
      introduced cpu_l2c_shared_map mask which is expected to be initialized
      by smp_op.smp_prepare_cpus(). That commit only updated
      native_smp_prepare_cpus() version but not xen_pv_smp_prepare_cpus().
      As a result, Xen PV guests crash in set_cpu_sibling_map().
      
      While the new mask could simply be allocated in xen_pv_smp_prepare_cpus(),
      both versions of the smp_prepare_cpus op share a number of common
      operations that can be factored out. So do that instead.
      
      Fixes: 66558b73 ("sched: Add cluster scheduler level for x86")
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lkml.kernel.org/r/1635896196-18961-1-git-send-email-boris.ostrovsky@oracle.com
      ce2612b6
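      A sketch of the factored-out helper, written as if it lived in
      arch/x86/kernel/smpboot.c; the exact set of masks and the helper name
      are assumptions, not the verbatim patch:

          /* Common mask setup shared by native and Xen PV smp_prepare_cpus(). */
          static void smp_prepare_cpus_common(void)
          {
                  unsigned int i;

                  for_each_possible_cpu(i) {
                          zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
                  }

                  set_cpu_sibling_map(0);
          }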
    • static_call,x86: Robustify trampoline patching · 2105a927
      Authored by Peter Zijlstra
      Add a few signature bytes after the static call trampoline and verify
      those bytes match before patching the trampoline. This avoids patching
      random other JMPs (such as CFI jump-table entries) instead.
      
      These bytes decode as:
      
         d:   53                      push   %rbx
         e:   43 54                   rex.XB push %r12
      
      And happen to spell "SCT".
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211030074758.GT174703@worktop.programming.kicks-ass.net
      2105a927
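      A sketch of the verification step, under the assumption that the
      trampoline body is a 5-byte JMP followed immediately by the signature
      bytes (offset and helper name are illustrative):

          #include <linux/string.h>

          /* Refuse to patch anything not followed by "SCT" (0x53 0x43 0x54),
           * so stray JMPs such as CFI jump-table entries are left alone. */
          static bool static_call_tramp_has_signature(void *tramp)
          {
                  return !memcmp(tramp + 5, "SCT", 3);
          }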
  10. 04 November 2021: 1 commit
    • x86/fpu: Optimize out sigframe xfeatures when in init state · 30d02551
      Authored by Dave Hansen
      tl;dr: AMX state is ~8k.  Signal frames can have space for this
      ~8k and each signal entry writes out all 8k even if it is zeros.
      Skip writing zeros for AMX to speed up signal delivery by about
      4% overall when AMX is in its init state.
      
      This is a user-visible change to the sigframe ABI.
      
      == Hardware XSAVE Background ==
      
      XSAVE state components may be tracked by the processor as being
      in their initial configuration.  Software can detect which
      features are in this configuration by looking at the XSTATE_BV
      field in an XSAVE buffer or with the XGETBV(1) instruction.
      
      Both the XSAVE and XSAVEOPT instructions enumerate features as
      being in the initial configuration via the XSTATE_BV field in the
      XSAVE header.  However, XSAVEOPT declines to actually write
      features in their initial configuration to the buffer.  XSAVE
      writes the feature unconditionally, regardless of whether it is
      in the initial configuration or not.
      
      Basically, XSAVE users never need to inspect XSTATE_BV to
      determine if the feature has been written to the buffer.
      XSAVEOPT users *do* need to inspect XSTATE_BV.  They might also
      need to clear out the buffer if they want to make an isolated
      change to the state, like modifying one register.
      
      == Software Signal / XSAVE Background ==
      
      Signal frames have historically been written with XSAVE itself.
      Each state is written in its entirety, regardless of being in its
      initial configuration.
      
      In other words, the signal frame ABI uses the XSAVE behavior, not
      the XSAVEOPT behavior.
      
      == Problem ==
      
      This means that any application which has acquired permission to
      use AMX via ARCH_REQ_XCOMP_PERM will write 8k of state to the
      signal frame.  This 8k write will occur even when AMX was in its
      initial configuration and software *knows* this because of
      XSTATE_BV.
      
      This problem also exists to a lesser degree with AVX-512 and its
      2k of state.  However, AVX-512 use does not require
      ARCH_REQ_XCOMP_PERM and is more likely to have existing users
      which would be impacted by any change in behavior.
      
      == Solution ==
      
      Stop writing out AMX xfeatures which are in their initial state
      to the signal frame.  This effectively makes the signal frame
      XSAVE buffer look as if it were written with a combination of
      XSAVEOPT and XSAVE behavior.  Userspace which handles XSAVEOPT-
      style buffers should be able to handle this naturally.
      
      For now, include only the AMX xfeatures: XTILE and XTILEDATA in
      this new behavior.  These require new ABI to use anyway, which
      makes their users very unlikely to be broken.  This XSAVEOPT-like
      behavior should be expected for all future dynamic xfeatures.  It
      may also be extended to legacy features like AVX-512 in the
      future.
      
      Only attempt this optimization on systems with dynamic features.
      Disable dynamic feature support (XFD) if XGETBV1 is unavailable
      by adding a CPUID dependency.
      
      This has been measured to reduce the *overall* cycle cost of
      signal delivery by about 4%.
      
      Fixes: 2308ee57 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: "Chang S. Bae" <chang.seok.bae@intel.com>
      Link: https://lore.kernel.org/r/20211102224750.FA412E26@davehans-spike.ostc.intel.com
      30d02551
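      A sketch of the mask computation, assuming XGETBV(1) is available (names
      mirror the FPU core but the helper itself is illustrative, not the exact
      patch):

          #include <asm/fpu/xcr.h>        /* xgetbv() */
          #include <asm/fpu/xstate.h>     /* XFEATURE_MASK_XTILE */

          /* Drop AMX (XTILE) components from the sigframe write mask whenever
           * XGETBV(1) reports them as being in their init state; every other
           * feature keeps the historical always-written XSAVE behavior. */
          static u64 sigframe_xfeatures_mask(u64 uabi_mask)
          {
                  u64 in_use = xgetbv(1);

                  return uabi_mask & (~XFEATURE_MASK_XTILE | in_use);
          }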
  11. 02 November 2021: 5 commits
  12. 29 October 2021: 7 commits
  13. 28 October 2021: 3 commits
  14. 26 October 2021: 5 commits