1. 05 December 2021: 1 commit
    • KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure · ad5b3532
      Authored by Tom Lendacky
      Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
      exit code or exit parameters fails.
      
      The VMGEXIT instruction can be issued from userspace, even though
      userspace (likely) can't update the GHCB. To prevent userspace from being
      able to kill the guest, return an error through the GHCB when validation
      fails rather than terminating the guest. For cases where the GHCB can't be
      updated (e.g. the GHCB can't be mapped, etc.), just return back to the
      guest.
      
      The new error codes are documented in the latest update to the GHCB
      specification.
      
      Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ad5b3532
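      A minimal sketch of the idea, for illustration only: the helper below
      reports the failure back through the GHCB instead of terminating the
      guest.  The accessor names and error values are assumptions loosely
      based on the GHCB specification, not the exact upstream diff.

          #include <linux/string.h>
          #include <asm/svm.h>            /* struct ghcb, ghcb_set_sw_exit_info_*() */

          /* Assumed error value; the real codes come from the GHCB spec. */
          #define GHCB_ERR_INVALID_EVENT  4

          static void sev_es_report_ghcb_error(struct ghcb *ghcb, u64 reason)
          {
                  /* Clear the valid bitmap so stale entries are not consumed. */
                  memset(ghcb->save.valid_bitmap, 0,
                         sizeof(ghcb->save.valid_bitmap));

                  /* A non-zero exit_info_1 tells the guest the request failed. */
                  ghcb_set_sw_exit_info_1(ghcb, 2);
                  ghcb_set_sw_exit_info_2(ghcb, reason);
          }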
  2. 02 December 2021: 1 commit
  3. 01 December 2021: 1 commit
  4. 27 November 2021: 1 commit
  5. 25 November 2021: 2 commits
  6. 20 November 2021: 1 commit
  7. 18 November 2021: 1 commit
    • KVM: x86/mmu: include EFER.LMA in extended mmu role · b8453cdc
      Authored by Maxim Levitsky
      Incorporate EFER.LMA into kvm_mmu_extended_role, as it is used to compute the
      guest root level and is not reflected in kvm_mmu_page_role.level when TDP
      is in use.  When simply running the guest, it is impossible for EFER.LMA
      and kvm_mmu.root_level to get out of sync, as the guest cannot transition
      from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without
      first bouncing through a different MMU context.  And stuffing guest state
      via KVM_SET_SREGS{,2} also ensures a full MMU context reset.
      
      However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to
      set guest state when migrating the VM while L2 is active, the vCPU state
      will reflect L2, not L1.  If L1 is using TDP for L2, then root_mmu will
      have been configured using L2's state, despite not being used for L2.  If
      L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will
      be configured for guest PAE paging, but will match the mmu_role for 64-bit
      paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit.
      
      Alternatively, the root_mmu's role could be invalidated after a successful
      KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu,
      i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily
      tricky, and not even needed if L1 and L2 do have the same role (e.g., they
      are both 64-bit guests and run with the same CR4).
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b8453cdc
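      A sketch of the shape of the fix: record EFER.LMA in the extended role so
      a change in guest long-mode state forces an MMU reconfiguration.  The
      bitfield below is an illustrative subset and the helper is an assumption,
      not the exact upstream definition:

          #include <linux/types.h>
          #include <asm/msr-index.h>      /* EFER_LMA */

          union kvm_mmu_extended_role {
                  u32 word;
                  struct {
                          unsigned int valid:1;
                          unsigned int execonly:1;
                          unsigned int cr4_smep:1;
                          unsigned int cr4_smap:1;
                          unsigned int cr4_la57:1;
                          unsigned int efer_lma:1;  /* new: guest 64-bit paging */
                  };
          };

          static u32 mmu_ext_role_word(u64 guest_efer)
          {
                  union kvm_mmu_extended_role ext = { 0 };

                  ext.valid    = 1;
                  ext.efer_lma = !!(guest_efer & EFER_LMA);
                  /* ... CR0/CR4-derived bits omitted ... */

                  return ext.word;
          }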
  8. 13 November 2021: 1 commit
  9. 11 November 2021: 10 commits
    • KVM: x86: Drop arbitrary KVM_SOFT_MAX_VCPUS · da1bfd52
      Authored by Vitaly Kuznetsov
      KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of
      VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports
      either num_online_cpus() or num_present_cpus(), s390 has multiple
      constants depending on hardware features. On x86, KVM reports an
      arbitrary value of '710', which is supposed to be the maximum tested
      value, but it is possible to test all KVM_MAX_VCPUS even when fewer
      physical CPUs are available.
      
      Drop the arbitrary '710' value and return num_online_cpus() on x86 as
      well. The recommendation will match other architectures and will mean
      'no CPU overcommit'.
      
      For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning
      when the requested vCPU number exceeds it. The static limit of '710'
      is quite weird as smaller systems with just a few physical CPUs should
      certainly "recommend" less.
      Suggested-by: Eduardo Habkost <ehabkost@redhat.com>
      Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20211111134733.86601-1-vkuznets@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      da1bfd52
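      A minimal sketch of the resulting recommendation on x86 (the helper name
      is illustrative and the clamp to KVM_MAX_VCPUS is an assumption, not
      necessarily the verbatim patch):

          #include <linux/cpumask.h>      /* num_online_cpus() */
          #include <linux/kvm_host.h>     /* KVM_MAX_VCPUS */
          #include <linux/minmax.h>

          /* "Recommended" vCPU count: no CPU overcommit, capped at the hard max. */
          static unsigned int kvm_recommended_vcpus(void)
          {
                  return min_t(unsigned int, num_online_cpus(), KVM_MAX_VCPUS);
          }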
    • KVM: x86: Make sure KVM_CPUID_FEATURES really are KVM_CPUID_FEATURES · 760849b1
      Authored by Paul Durrant
      Currently when kvm_update_cpuid_runtime() runs, it assumes that the
      KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true,
      however, if Hyper-V support is enabled. In this case the KVM leaves will
      be offset.
      
      This patch introduces a new 'kvm_cpuid_base' field into struct
      kvm_vcpu_arch to track the location of the KVM leaves and function
      kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the
      leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a
      definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now
      target the correct leaf.
      
      NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is introduced
            into processor.h to avoid having duplicate code for the iteration
            over possible hypervisor base leaves.
      Signed-off-by: Paul Durrant <pdurrant@amazon.com>
      Message-Id: <20211105095101.5384-3-pdurrant@amazon.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      760849b1
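      A sketch of the signature scan over the vCPU's userspace-provided CPUID
      entries (structure and field names follow the KVM UAPI; the helper itself
      is an illustrative assumption, not the exact patch):

          #include <linux/string.h>
          #include <linux/kvm_host.h>     /* struct kvm_cpuid_entry2 */

          /* Walk the possible hypervisor bases (0x40000000, 0x40000100, ...)
           * and return the one carrying the "KVMKVMKVM\0\0\0" signature; the
           * KVM_CPUID_FEATURES leaf then lives at base + 1. */
          static u32 find_kvm_cpuid_base(struct kvm_cpuid_entry2 *entries, int nent)
          {
                  u32 base, signature[3];
                  int i;

                  for (base = 0x40000000; base < 0x40010000; base += 0x100) {
                          for (i = 0; i < nent; i++) {
                                  if (entries[i].function != base)
                                          continue;
                                  signature[0] = entries[i].ebx;
                                  signature[1] = entries[i].ecx;
                                  signature[2] = entries[i].edx;
                                  if (!memcmp(signature, "KVMKVMKVM\0\0\0", 12))
                                          return base;
                          }
                  }
                  return 0;
          }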
    • KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active · cae72dcc
      Authored by Maxim Levitsky
      KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected through KVM's
      standard inject_pending_event() path, not via APICv/AVIC.
      
      Since this is a debug feature, just inhibit APICv/AVIC while
      KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU.
      
      Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ")
      Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cae72dcc
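      A sketch of the inhibit logic (the inhibit-reason constant and update
      helper mirror KVM's APICv machinery; shown as an assumption, not the
      verbatim patch):

          #include <linux/kvm_host.h>

          /* Inhibit APICv/AVIC whenever any vCPU has KVM_GUESTDBG_BLOCKIRQ set,
           * so interrupts go through the ordinary injection path. */
          static void update_blockirq_apicv_inhibit(struct kvm *kvm)
          {
                  struct kvm_vcpu *vcpu;
                  bool block = false;
                  int i;

                  kvm_for_each_vcpu(i, vcpu, kvm) {
                          if (vcpu->guest_debug & KVM_GUESTDBG_BLOCKIRQ) {
                                  block = true;
                                  break;
                          }
                  }
                  kvm_request_apicv_update(kvm, !block,
                                           APICV_INHIBIT_REASON_BLOCKIRQ);
          }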
    • KVM: x86: Fix recording of guest steal time / preempted status · 7e2175eb
      Authored by David Woodhouse
      In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is
      not missed") we switched to using a gfn_to_pfn_cache for accessing the
      guest steal time structure in order to allow for an atomic xchg of the
      preempted field. This has a couple of problems.
      
      Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the
      atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a
      guest vCPU using an IOMEM page for its steal time would never have its
      preempted field set.
      
      Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it
      should have been. There are two stages to the GFN->PFN conversion;
      first the GFN is converted to a userspace HVA, and then that HVA is
      looked up in the process page tables to find the underlying host PFN.
      Correct invalidation of the latter would require being hooked up to the
      MMU notifiers, but that doesn't happen---so it just keeps mapping and
      unmapping the *wrong* PFN after the userspace page tables change.
      
      In the !IOMEM case at least the stale page *is* pinned all the time it's
      cached, so it won't be freed and reused by anyone else while still
      receiving the steal time updates. The map/unmap dance only takes care
      of the KVM administrivia such as marking the page dirty.
      
      Until the gfn_to_pfn cache handles the remapping automatically by
      integrating with the MMU notifiers, we might as well not get a
      kernel mapping of it, and use the perfectly serviceable userspace HVA
      that we already have.  We just need to implement the atomic xchg on
      the userspace address with appropriate exception handling, which is
      fairly trivial.
      
      Cc: stable@vger.kernel.org
      Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed")
      Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
      Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org>
      [I didn't entirely agree with David's assessment of the
       usefulness of the gfn_to_pfn cache, and integrated the outcome
       of the discussion in the above commit message. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      7e2175eb
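      A simplified sketch of working on the userspace HVA with exception
      handling.  The real patch performs an atomic xchg via inline asm with an
      exception-table entry; the plain unsafe accessors below only illustrate
      the fault-tolerant access pattern, not the atomicity:

          #include <linux/uaccess.h>

          /* Set the 'preempted' flag at a guest steal-time HVA; a faulting
           * address is silently skipped rather than killing anything. */
          static u8 set_preempted_at_hva(u8 __user *preempted, u8 val)
          {
                  u8 old = 0;

                  if (!user_access_begin(preempted, sizeof(*preempted)))
                          return 0;

                  unsafe_get_user(old, preempted, out);
                  unsafe_put_user(old | val, preempted, out);
          out:
                  user_access_end();
                  return old;
          }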
    • KVM: SEV: Add support for SEV intra host migration · b5663931
      Authored by Peter Gonda
      For SEV to work with intra host migration, contents of the SEV info struct
      such as the ASID (used to index the encryption key in the AMD SP) and
      the list of memory regions need to be transferred to the target VM.
      This change adds a command for a target VMM to get a source SEV VM's SEV
      info.
      Signed-off-by: Peter Gonda <pgonda@google.com>
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Marc Orr <marcorr@google.com>
      Cc: Marc Orr <marcorr@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wanpeng Li <wanpengli@tencent.com>
      Cc: Jim Mattson <jmattson@google.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Message-Id: <20211021174303.385706-3-pgonda@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      b5663931
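      A sketch of how a VMM might use this from userspace, assuming the
      KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability introduced by this series
      (error handling trimmed):

          #include <string.h>
          #include <sys/ioctl.h>
          #include <linux/kvm.h>

          /* Move the SEV context (ASID, memory-region list, ...) from the
           * source VM fd to the destination VM fd. */
          static int sev_move_enc_context(int dst_vm_fd, int src_vm_fd)
          {
                  struct kvm_enable_cap cap;

                  memset(&cap, 0, sizeof(cap));
                  cap.cap = KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM;
                  cap.args[0] = src_vm_fd;

                  return ioctl(dst_vm_fd, KVM_ENABLE_CAP, &cap);
          }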
    • x86/kvm: Add guest support for detecting and enabling SEV Live Migration feature. · f4495615
      Authored by Ashish Kalra
      The guest support for detecting and enabling the SEV Live Migration
      feature uses the following logic:
      
       - kvm_init_platform() checks whether the kernel booted under EFI
      
         - If not EFI:
      
           i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), issue a wrmsrl()
              to enable the SEV live migration support
      
         - If EFI:
      
           i) if kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL), read
              the UEFI variable which indicates OVMF support for live migration
      
           ii) if the variable indicates live migration is supported, issue a wrmsrl()
               to enable the SEV live migration support
      
      The EFI live migration check is done using a late_initcall() callback.
      
      Also, ensure that _bss_decrypted section is marked as decrypted in the
      hypervisor's guest page encryption status tracking.
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Message-Id: <b4453e4c87103ebef12217d2505ea99a1c3e0f0f.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f4495615
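      A sketch of the guest-side enable path described above.  The MSR and
      feature names follow the KVM_FEATURE_MIGRATION_CONTROL interface from
      this series; the UEFI-variable check is reduced to a hypothetical
      placeholder helper:

          #include <linux/efi.h>
          #include <linux/kvm_para.h>
          #include <asm/msr.h>

          /* Hypothetical stand-in for the late_initcall() UEFI-variable read. */
          static bool ovmf_reports_live_migration_support(void)
          {
                  return false;
          }

          static void sev_live_migration_enable(void)
          {
                  if (!kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
                          return;

                  /* Non-EFI boot: nothing else gates the feature. */
                  if (!efi_enabled(EFI_BOOT)) {
                          wrmsrl(MSR_KVM_MIGRATION_CONTROL, KVM_MIGRATION_READY);
                          return;
                  }

                  /* EFI boot: enable only if OVMF advertised support. */
                  if (ovmf_reports_live_migration_support())
                          wrmsrl(MSR_KVM_MIGRATION_CONTROL, KVM_MIGRATION_READY);
          }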
    • mm: x86: Invoke hypercall when page encryption status is changed · 064ce6c5
      Authored by Brijesh Singh
      Invoke a hypercall when a memory region is changed from encrypted ->
      decrypted and vice versa. The hypervisor needs to know the page
      encryption status during guest migration.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Reviewed-by: Borislav Petkov <bp@suse.de>
      Message-Id: <0a237d5bb08793916c7790a3e653a2cbe7485761.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      064ce6c5
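      A sketch of the notification, assuming the KVM_HC_MAP_GPA_RANGE
      para-virt hypercall and its attribute bits (the wrapper itself is
      illustrative, not the exact patch):

          #include <linux/kvm_para.h>
          #include <asm/page.h>

          /* Tell the hypervisor that 'npages' 4K pages starting at 'pfn'
           * changed encryption status. */
          static void notify_enc_status_changed(unsigned long pfn,
                                                unsigned long npages, bool enc)
          {
                  unsigned long attrs = enc ? KVM_MAP_GPA_RANGE_ENCRYPTED
                                            : KVM_MAP_GPA_RANGE_DECRYPTED;

                  kvm_hypercall3(KVM_HC_MAP_GPA_RANGE, pfn << PAGE_SHIFT, npages,
                                 attrs | KVM_MAP_GPA_RANGE_PAGE_SZ_4K);
          }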
    • x86/kvm: Add AMD SEV specific Hypercall3 · 08c2336d
      Authored by Brijesh Singh
      The KVM hypercall framework relies on the alternatives framework to patch
      VMCALL -> VMMCALL on AMD platforms. If a hypercall is made before
      apply_alternatives() is called, it defaults to VMCALL. That approach
      works fine on a non-SEV guest: a VMCALL causes a #UD, and the hypervisor
      is able to decode the instruction and do the right thing. But when SEV
      is active, guest memory is encrypted with the guest key and the
      hypervisor is not able to decode the instruction bytes.
      
      To highlight the need for this interface, consider the flow around
      apply_alternatives():
      setup_arch() calls init_hypervisor_platform(), which detects
      the hypervisor platform the kernel is running under, and then the
      hypervisor-specific initialization code can make early hypercalls.
      For example, KVM-specific initialization in the SEV case will try
      to mark the "__bss_decrypted" section's encryption state via early
      page encryption status hypercalls.
      
      Now, apply_alternatives() is called much later when setup_arch()
      calls check_bugs(), so we do need some kind of an early,
      pre-alternatives hypercall interface. Other cases of pre-alternatives
      hypercalls include marking per-cpu GHCB pages as decrypted on SEV-ES
      and per-cpu apf_reason, steal_time and kvm_apic_eoi as decrypted for
      SEV generally.
      
      Add an SEV-specific hypercall3; it unconditionally uses VMMCALL. The
      hypercall will be used by the SEV guest to notify the hypervisor about
      encrypted pages.
      
      This kvm_sev_hypercall3() function is abstracted and used as follows:
      all these early hypercalls are made through early_set_memory_XX() interfaces,
      which in turn invoke pv_ops (paravirt_ops).
      
      This early_set_memory_XX() -> pv_ops.mmu.notify_page_enc_status_changed()
      is a generic interface and can easily have SEV, TDX and any other
      future platform specific abstractions added to it.
      
      Currently, pv_ops.mmu.notify_page_enc_status_changed() callback is setup to
      invoke kvm_sev_hypercall3() in case of SEV.
      
      Similarly, in case of TDX, pv_ops.mmu.notify_page_enc_status_changed()
      can be setup to a TDX specific callback.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Joerg Roedel <joro@8bytes.org>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: x86@kernel.org
      Cc: kvm@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reviewed-by: Steve Rutherford <srutherford@google.com>
      Reviewed-by: Venu Busireddy <venu.busireddy@oracle.com>
      Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
      Signed-off-by: Ashish Kalra <ashish.kalra@amd.com>
      Message-Id: <6fd25c749205dd0b1eb492c60d41b124760cc6ae.1629726117.git.ashish.kalra@amd.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      08c2336d
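      A sketch of the pre-alternatives helper: it always emits VMMCALL, so it
      is safe inside an SEV guest before apply_alternatives() has run (shown
      here as an illustration of the interface described above):

          /* AMD/SEV specific three-argument hypercall. */
          static inline long kvm_sev_hypercall3(unsigned int nr, unsigned long p1,
                                                unsigned long p2, unsigned long p3)
          {
                  long ret;

                  asm volatile("vmmcall"
                               : "=a"(ret)
                               : "a"(nr), "b"(p1), "c"(p2), "d"(p3)
                               : "memory");
                  return ret;
          }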
    • x86/smp: Factor out parts of native_smp_prepare_cpus() · ce2612b6
      Authored by Boris Ostrovsky
      Commit 66558b73 ("sched: Add cluster scheduler level for x86")
      introduced cpu_l2c_shared_map mask which is expected to be initialized
      by smp_op.smp_prepare_cpus(). That commit only updated
      native_smp_prepare_cpus() version but not xen_pv_smp_prepare_cpus().
      As a result, Xen PV guests crash in set_cpu_sibling_map().
      
      While the new mask could simply be allocated in xen_pv_smp_prepare_cpus(),
      both versions of the smp_prepare_cpus op share a number of common
      operations that can be factored out. So do that instead.
      
      Fixes: 66558b73 ("sched: Add cluster scheduler level for x86")
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Link: https://lkml.kernel.org/r/1635896196-18961-1-git-send-email-boris.ostrovsky@oracle.com
      ce2612b6
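      A sketch of the factored-out helper, written as if it lived in
      arch/x86/kernel/smpboot.c; the exact set of masks and the helper name
      are assumptions, not the verbatim patch:

          /* Common mask setup shared by native and Xen PV smp_prepare_cpus(). */
          static void smp_prepare_cpus_common(void)
          {
                  unsigned int i;

                  for_each_possible_cpu(i) {
                          zalloc_cpumask_var(&per_cpu(cpu_sibling_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_core_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_llc_shared_map, i), GFP_KERNEL);
                          zalloc_cpumask_var(&per_cpu(cpu_l2c_shared_map, i), GFP_KERNEL);
                  }

                  set_cpu_sibling_map(0);
          }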
    • static_call,x86: Robustify trampoline patching · 2105a927
      Authored by Peter Zijlstra
      Add a few signature bytes after the static call trampoline and verify
      those bytes match before patching the trampoline. This avoids patching
      random other JMPs (such as CFI jump-table entries) instead.
      
      These bytes decode as:
      
         d:   53                      push   %rbx
         e:   43 54                   rex.XB push %r12
      
      And happen to spell "SCT".
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20211030074758.GT174703@worktop.programming.kicks-ass.net
      2105a927
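      A sketch of the verification step, under the assumption that the
      trampoline body is a 5-byte JMP followed immediately by the signature
      bytes (offset and helper name are illustrative):

          #include <linux/string.h>

          /* Refuse to patch anything not followed by "SCT" (0x53 0x43 0x54),
           * so stray JMPs such as CFI jump-table entries are left alone. */
          static bool static_call_tramp_has_signature(void *tramp)
          {
                  return !memcmp(tramp + 5, "SCT", 3);
          }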
  10. 04 November 2021: 1 commit
    • x86/fpu: Optimize out sigframe xfeatures when in init state · 30d02551
      Authored by Dave Hansen
      tl;dr: AMX state is ~8k.  Signal frames can have space for this
      ~8k and each signal entry writes out all 8k even if it is zeros.
      Skip writing zeros for AMX to speed up signal delivery by about
      4% overall when AMX is in its init state.
      
      This is a user-visible change to the sigframe ABI.
      
      == Hardware XSAVE Background ==
      
      XSAVE state components may be tracked by the processor as being
      in their initial configuration.  Software can detect which
      features are in this configuration by looking at the XSTATE_BV
      field in an XSAVE buffer or with the XGETBV(1) instruction.
      
      Both the XSAVE and XSAVEOPT instructions enumerate features as
      being in the initial configuration via the XSTATE_BV field in the
      XSAVE header.  However, XSAVEOPT declines to actually write
      features in their initial configuration to the buffer.  XSAVE
      writes the feature unconditionally, regardless of whether it is
      in the initial configuration or not.
      
      Basically, XSAVE users never need to inspect XSTATE_BV to
      determine if the feature has been written to the buffer.
      XSAVEOPT users *do* need to inspect XSTATE_BV.  They might also
      need to clear out the buffer if they want to make an isolated
      change to the state, like modifying one register.
      
      == Software Signal / XSAVE Background ==
      
      Signal frames have historically been written with XSAVE itself.
      Each state is written in its entirety, regardless of being in its
      initial configuration.
      
      In other words, the signal frame ABI uses the XSAVE behavior, not
      the XSAVEOPT behavior.
      
      == Problem ==
      
      This means that any application which has acquired permission to
      use AMX via ARCH_REQ_XCOMP_PERM will write 8k of state to the
      signal frame.  This 8k write will occur even when AMX was in its
      initial configuration and software *knows* this because of
      XSTATE_BV.
      
      This problem also exists to a lesser degree with AVX-512 and its
      2k of state.  However, AVX-512 use does not require
      ARCH_REQ_XCOMP_PERM and is more likely to have existing users
      which would be impacted by any change in behavior.
      
      == Solution ==
      
      Stop writing out AMX xfeatures which are in their initial state
      to the signal frame.  This effectively makes the signal frame
      XSAVE buffer look as if it were written with a combination of
      XSAVEOPT and XSAVE behavior.  Userspace which handles XSAVEOPT-
      style buffers should be able to handle this naturally.
      
      For now, include only the AMX xfeatures: XTILE and XTILEDATA in
      this new behavior.  These require new ABI to use anyway, which
      makes their users very unlikely to be broken.  This XSAVEOPT-like
      behavior should be expected for all future dynamic xfeatures.  It
      may also be extended to legacy features like AVX-512 in the
      future.
      
      Only attempt this optimization on systems with dynamic features.
      Disable dynamic feature support (XFD) if XGETBV1 is unavailable
      by adding a CPUID dependency.
      
      This has been measured to reduce the *overall* cycle cost of
      signal delivery by about 4%.
      
      Fixes: 2308ee57 ("x86/fpu/amx: Enable the AMX feature in 64-bit mode")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: "Chang S. Bae" <chang.seok.bae@intel.com>
      Link: https://lore.kernel.org/r/20211102224750.FA412E26@davehans-spike.ostc.intel.com
      30d02551
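      A sketch of the mask computation, assuming XGETBV(1) is available (names
      mirror the FPU core but the helper itself is illustrative, not the exact
      patch):

          #include <asm/fpu/xcr.h>        /* xgetbv() */
          #include <asm/fpu/xstate.h>     /* XFEATURE_MASK_XTILE */

          /* Drop AMX (XTILE) components from the sigframe write mask whenever
           * XGETBV(1) reports them as being in their init state; every other
           * feature keeps the historical always-written XSAVE behavior. */
          static u64 sigframe_xfeatures_mask(u64 uabi_mask)
          {
                  u64 in_use = xgetbv(1);

                  return uabi_mask & (~XFEATURE_MASK_XTILE | in_use);
          }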
  11. 02 November 2021: 5 commits
  12. 29 October 2021: 7 commits
  13. 28 October 2021: 3 commits
  14. 26 October 2021: 5 commits