提交 · 5719455fbd952a69ebc860d47bb0287e9198fe12 · openeuler / Kernel

15 12月, 2020 12 次提交

KVM: SVM: Do not report support for SMM for an SEV-ES guest · 5719455f

由 Tom Lendacky 提交于 12月 10, 2020

SEV-ES guests do not currently support SMM. Update the has_emulated_msr()
kvm_x86_ops function to take a struct kvm parameter so that the capability
can be reported at a VM level.

Since this op is also called during KVM initialization and before a struct
kvm instance is available, comments will be added to each implementation
of has_emulated_msr() to indicate the kvm parameter can be null.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <75de5138e33b945d2fb17f81ae507bda381808e3.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5719455f

KVM: SVM: Add support for CR8 write traps for an SEV-ES guest · d1949b93

由 Tom Lendacky 提交于 12月 10, 2020

For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR8 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5a01033f4c8b3106ca9374b7cadf8e33da852df1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

d1949b93

KVM: SVM: Add support for CR4 write traps for an SEV-ES guest · 5b51cb13

由 Tom Lendacky 提交于 12月 10, 2020

For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR4 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5b51cb13

KVM: SVM: Add support for CR0 write traps for an SEV-ES guest · f27ad38a

由 Tom Lendacky 提交于 12月 10, 2020

For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.

SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.

Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f27ad38a

KVM: SVM: Add support for EFER write traps for an SEV-ES guest · 2985afbc

由 Tom Lendacky 提交于 12月 10, 2020

For SEV-ES guests, the interception of EFER write access is not
recommended. EFER interception occurs prior to EFER being modified and
the hypervisor is unable to modify EFER itself because the register is
located in the encrypted register state.

SEV-ES support introduces a new EFER write trap. This trap provides
intercept support of an EFER write after it has been modified. The new
EFER value is provided in the VMCB EXITINFO1 field, allowing the
hypervisor to track the setting of the guest EFER.

Add support to track the value of the guest EFER value using the EFER
write trap so that the hypervisor understands the guest operating mode.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8993149352a3a87cd0625b3b61bfd31ab28977e1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

2985afbc

KVM: SVM: Support string IO operations for an SEV-ES guest · 7ed9abfe

由 Tom Lendacky 提交于 12月 10, 2020

For an SEV-ES guest, string-based port IO is performed to a shared
(un-encrypted) page so that both the hypervisor and guest can read or
write to it and each see the contents.

For string-based port IO operations, invoke SEV-ES specific routines that
can complete the operation using common KVM port IO support.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <9d61daf0ffda496703717218f415cdc8fd487100.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

7ed9abfe

KVM: SVM: Add initial support for a VMGEXIT VMEXIT · 291bd20d

由 Tom Lendacky 提交于 12月 10, 2020

SEV-ES adds a new VMEXIT reason code, VMGEXIT. Initial support for a
VMGEXIT includes mapping the GHCB based on the guest GPA, which is
obtained from a new VMCB field, and then validating the required inputs
for the VMGEXIT exit reason.

Since many of the VMGEXIT exit reasons correspond to existing VMEXIT
reasons, the information from the GHCB is copied into the VMCB control
exit code areas and KVM register areas. The standard exit handlers are
invoked, similar to standard VMEXIT processing. Before restarting the
vCPU, the GHCB is updated with any registers that have been updated by
the hypervisor.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c6a4ed4294a369bd75c44d03bd7ce0f0c3840e50.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

291bd20d

KVM: SVM: Add required changes to support intercepts under SEV-ES · f1c6366e

由 Tom Lendacky 提交于 12月 14, 2020

When a guest is running under SEV-ES, the hypervisor cannot access the
guest register state. There are numerous places in the KVM code where
certain registers are accessed that are not allowed to be accessed (e.g.
RIP, CR0, etc). Add checks to prevent register accesses and add intercept
update support at various points within the KVM code.

Also, when handling a VMGEXIT, exceptions are passed back through the
GHCB. Since the RDMSR/WRMSR intercepts (may) inject a #GP on error,
update the SVM intercepts to handle this for SEV-ES guests.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
[Redo MSR part using the .complete_emulated_msr callback. - Paolo]
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f1c6366e

KVM: x86: introduce complete_emulated_msr callback · f9a4d621

由 Paolo Bonzini 提交于 12月 14, 2020

This will be used by SEV-ES to inject MSR failure via the GHCB.
Reviewed-by: NTom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

f9a4d621

KVM: SVM: Add support for the SEV-ES VMSA · add5e2f0

由 Tom Lendacky 提交于 12月 10, 2020

Allocate a page during vCPU creation to be used as the encrypted VM save
area (VMSA) for the SEV-ES guest. Provide a flag in the kvm_vcpu_arch
structure that indicates whether the guest state is protected.

When freeing a VMSA page that has been encrypted, the cache contents must
be flushed using the MSR_AMD64_VM_PAGE_FLUSH before freeing the page.

[ i386 build warnings ]
Reported-by: Nkernel test robot <lkp@intel.com>
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <fde272b17eec804f3b9db18c131262fe074015c5.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

add5e2f0

KVM: SVM: Add GHCB accessor functions for retrieving fields · 0f60bde1

由 Tom Lendacky 提交于 12月 10, 2020

Update the GHCB accessor functions to add functions for retrieve GHCB
fields by name. Update existing code to use the new accessor functions.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <664172c53a5fb4959914e1a45d88e805649af0ad.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0f60bde1

x86/cpu: Add VM page flush MSR availablility as a CPUID feature · 69372cf0

由 Tom Lendacky 提交于 12月 10, 2020

On systems that do not have hardware enforced cache coherency between
encrypted and unencrypted mappings of the same physical page, the
hypervisor can use the VM page flush MSR (0xc001011e) to flush the cache
contents of an SEV guest page. When a small number of pages are being
flushed, this can be used in place of issuing a WBINVD across all CPUs.

CPUID 0x8000001f_eax[2] is used to determine if the VM page flush MSR is
available. Add a CPUID feature to indicate it is supported and define the
MSR.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Message-Id: <f1966379e31f9b208db5257509c4a089a87d33d0.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

69372cf0

12 12月, 2020 1 次提交

x86: Enumerate AVX512 FP16 CPUID feature flag · e1b35da5

由 Kyung Min Park 提交于 12月 07, 2020

Enumerate AVX512 Half-precision floating point (FP16) CPUID feature
flag. Compared with using FP32, using FP16 cut the number of bits
required for storage in half, reducing the exponent from 8 bits to 5,
and the mantissa from 23 bits to 10. Using FP16 also enables developers
to train and run inference on deep learning models fast when all
precision or magnitude (FP32) is not needed.

A processor supports AVX512 FP16 if CPUID.(EAX=7,ECX=0):EDX[bit 23]
is present. The AVX512 FP16 requires AVX512BW feature be implemented
since the instructions for manipulating 32bit masks are associated with
AVX512BW.

The only in-kernel usage of this is kvm passthrough. The CPU feature
flag is shown as "avx512_fp16" in /proc/cpuinfo.
Signed-off-by: NKyung Min Park <kyung.min.park@intel.com>
Acked-by: NDave Hansen <dave.hansen@intel.com>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Message-Id: <20201208033441.28207-2-kyung.min.park@intel.com>
Acked-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

e1b35da5

15 11月, 2020 4 次提交

KVM: X86: Implement ring-based dirty memory tracking · fb04a1ed

由 Peter Xu 提交于 9月 30, 2020

This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]

KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.

A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial. In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.

The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.

This patch enables dirty ring for X86 only. However it should be
easily extended to other archs as well.

[1] https://patchwork.kernel.org/patch/10471409/Signed-off-by: NLei Cao <lei.cao@stratus.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fb04a1ed

KVM: X86: Don't track dirty for KVM_SET_[TSS_ADDR|IDENTITY_MAP_ADDR] · ff5a983c

由 Peter Xu 提交于 9月 30, 2020

Originally, we have three code paths that can dirty a page without
vcpu context for X86:

  - init_rmode_identity_map
  - init_rmode_tss
  - kvmgt_rw_gpa

init_rmode_identity_map and init_rmode_tss will be setup on
destination VM no matter what (and the guest cannot even see them), so
it does not make sense to track them at all.

To do this, allow __x86_set_memory_region() to return the userspace
address that just allocated to the caller.  Then in both of the
functions we directly write to the userspace address instead of
calling kvm_write_*() APIs.

Another trivial change is that we don't need to explicitly clear the
identity page table root in init_rmode_identity_map() because no
matter what we'll write to the whole page with 4M huge page entries.
Suggested-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: NPeter Xu <peterx@redhat.com>
Message-Id: <20201001012044.5151-4-peterx@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

ff5a983c

KVM: x86: emulate wait-for-SIPI and SIPI-VMExit · bf0cd88c

由 Yadong Qi 提交于 11月 06, 2020

Background: We have a lightweight HV, it needs INIT-VMExit and
SIPI-VMExit to wake-up APs for guests since it do not monitor
the Local APIC. But currently virtual wait-for-SIPI(WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit which cause
the L2 guest cannot wake up the APs.

According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.

In this patch:
    1. introduce SIPI exit reason,
    2. introduce wait-for-SIPI state for nVMX,
    3. advertise wait-for-SIPI support to guest.

When L1 hypervisor is not monitoring Local APIC, L0 need to emulate
INIT-VMExit and SIPI-VMExit to L1 to emulate INIT-SIPI-SIPI for
L2. L2 LAPIC write would be traped by L0 Hypervisor(KVM), L0 should
emulate the INIT/SIPI vmexit to L1 hypervisor to set proper state
for L2's vcpu state.

Handle procdure:
Source vCPU:
    L2 write LAPIC.ICR(INIT).
    L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
       vCPU.
Target vCPU:
    L0 emulate an INIT VMExit to L1 if is guest mode.
    L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
    L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
       == WAIT_SIPI).

Source vCPU:
    L2 write LAPIC.ICR(SIPI).
    L0 trap LAPIC.ICR write(INIT): inject a latched SIPI event to traget
       vCPU.
Target vCPU:
    L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
    L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
    L0 resume to L2.
    L2 start-up.
Signed-off-by: NYadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

bf0cd88c

KVM: x86: Move vendor CR4 validity check to dedicated kvm_x86_ops hook · c2fe3cd4

由 Sean Christopherson 提交于 10月 06, 2020

Split out VMX's checks on CR4.VMXE to a dedicated hook, .is_valid_cr4(),
and invoke the new hook from kvm_valid_cr4(). This fixes an issue where
KVM_SET_SREGS would return success while failing to actually set CR4.

Fixing the issue by explicitly checking kvm_x86_ops.set_cr4()'s return
in __set_sregs() is not a viable option as KVM has already stuffed a
variety of vCPU state.

Note, kvm_valid_cr4() and is_valid_cr4() have different return types and
inverted semantics. This will be remedied in a future patch.

Fixes: 5e1746d6 ("KVM: nVMX: Allow setting the VMXE bit in CR4")
Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20201007014417.29276-5-sean.j.christopherson@intel.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

c2fe3cd4

13 11月, 2020 1 次提交

KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch · 0107973a

由 Babu Moger 提交于 11月 12, 2020

SEV guests fail to boot on a system that supports the PCID feature.

While emulating the RSM instruction, KVM reads the guest CR3
and calls kvm_set_cr3(). If the vCPU is in the long mode,
kvm_set_cr3() does a sanity check for the CR3 value. In this case,
it validates whether the value has any reserved bits set. The
reserved bit range is 63:cpuid_maxphysaddr(). When AMD memory
encryption is enabled, the memory encryption bit is set in the CR3
value. The memory encryption bit may fall within the KVM reserved
bit range, causing the KVM emulation failure.

Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
cache the reserved bits in the CR3 value. This will be initialized
to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).

If the architecture has any special bits(like AMD SEV encryption bit)
that needs to be masked from the reserved bits, should be cleared
in vendor specific kvm_x86_ops.vcpu_after_set_cpuid handler.

Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3")
Signed-off-by: NBabu Moger <babu.moger@amd.com>
Message-Id: <160521947657.32054.3264016688005356563.stgit@bmoger-ubuntu>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

0107973a

29 10月, 2020 1 次提交

x86/kvm: Reserve KVM_FEATURE_MSI_EXT_DEST_ID · 5a169bf0

由 David Woodhouse 提交于 10月 19, 2020

No functional change; just reserve the feature bit for now so that VMMs
can start to implement it.

This will allow the host to indicate that MSI emulation supports 15-bit
destination IDs, allowing up to 32768 CPUs without interrupt remapping.

cf. https://patchwork.kernel.org/patch/11816693/ for qemu
Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk>
Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
Message-Id: <4cd59bed05f4b7410d3d1ffd1e997ab53683874d.camel@infradead.org>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

5a169bf0

26 10月, 2020 1 次提交

treewide: Convert macro and uses of __section(foo) to __section("foo") · 33def849

由 Joe Perches 提交于 10月 21, 2020

Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.plSigned-off-by: NJoe Perches <joe@perches.com>
Reviewed-by: NNick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: NMiguel Ojeda <ojeda@kernel.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

33def849

24 10月, 2020 1 次提交

x86/uaccess: fix code generation in put_user() · 9c5743df

由 Rasmus Villemoes 提交于 10月 23, 2020

Quoting https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html:

  You can define a local register variable and associate it with a
  specified register...

  The only supported use for this feature is to specify registers for
  input and output operands when calling Extended asm (see Extended
  Asm). This may be necessary if the constraints for a particular
  machine don't provide sufficient control to select the desired
  register.

On 32-bit x86, this is used to ensure that gcc will put an 8-byte value
into the %edx:%eax pair, while all other cases will just use the single
register %eax (%rax on x86-64).  While the _ASM_AX actually just expands
to "%eax", note this comment next to get_user() which does something
very similar:

 * The use of _ASM_DX as the register specifier is a bit of a
 * simplification, as gcc only cares about it as the starting point
 * and not size: for a 64-bit value it will use %ecx:%edx on 32 bits
 * (%ecx being the next register in gcc's x86 register sequence), and
 * %rdx on 64 bits.

However, getting this to work requires that there is no code between the
assignment to the local register variable and its use as an input to the
asm() which can possibly clobber any of the registers involved -
including evaluation of the expressions making up other inputs.

In the current code, the ptr expression used directly as an input may
cause such code to be emitted.  For example, Sean Christopherson
observed that with KASAN enabled and ptr being current->set_child_tid
(from chedule_tail()), the load of current->set_child_tid causes a call
to __asan_load8() to be emitted immediately prior to the __put_user_4
call, and Naresh Kamboju reports that various mmstress tests fail on
KASAN-enabled builds.

It's also possible to synthesize a broken case without KASAN if one uses
"foo()" as the ptr argument, with foo being some "extern u64 __user
*foo(void);" (though I don't know if that appears in real code).

Fix it by making sure ptr gets evaluated before the assignment to
__val_pu, and add a comment that __val_pu must be the last thing
computed before the asm() is entered.

Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Fixes: d55564cf ("x86: Make __put_user() generate an out-of-line call")
Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c5743df

23 10月, 2020 1 次提交

kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU · 89c0fd49

由 Ben Gardon 提交于 10月 14, 2020

Attach struct kvm_mmu_pages to every page in the TDP MMU to track
metadata, facilitate NX reclaim, and enable inproved parallelism of MMU
operations in future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
	https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20201014182700.2888246-12-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

89c0fd49

22 10月, 2020 6 次提交

kvm: x86/mmu: Allocate and free TDP MMU roots · 02c00b3a

由 Ben Gardon 提交于 10月 14, 2020

The TDP MMU must be able to allocate paging structure root pages and track
the usage of those pages. Implement a similar, but separate system for root
page allocation to that of the x86 shadow paging implementation. When
future patches add synchronization model changes to allow for parallel
page faults, these pages will need to be handled differently from the
x86 shadow paging based MMU's root pages.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538Signed-off-by: NBen Gardon <bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

02c00b3a

kvm: x86/mmu: Init / Uninit the TDP MMU · fe5db27d

由 Ben Gardon 提交于 10月 14, 2020

The TDP MMU offers an alternative mode of operation to the x86 shadow
paging based MMU, optimized for running an L1 guest with TDP. The TDP MMU
will require new fields that need to be initialized and torn down. Add
hooks into the existing KVM MMU initialization process to do that
initialization / cleanup. Currently the initialization and cleanup
fucntions do not do very much, however more operations will be added in
future patches.

Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
machine. This series introduced no new failures.

This series can be viewed in Gerrit at:
https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538Signed-off-by: NBen Gardon <bgardon@google.com>
Message-Id: <20201014182700.2888246-4-bgardon@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

fe5db27d

KVM: x86: allow kvm_x86_ops.set_efer to return an error value · 72f211ec

由 Maxim Levitsky 提交于 10月 01, 2020

This will be used to signal an error to the userspace, in case
the vendor code failed during handling of this msr. (e.g -ENOMEM)
Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20201001112954.6258-4-mlevitsk@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

72f211ec

KVM: x86: bump KVM_MAX_CPUID_ENTRIES · 3f4e3eb4

由 Vitaly Kuznetsov 提交于 10月 01, 2020

As vcpu->arch.cpuid_entries is now allocated dynamically, the only
remaining use for KVM_MAX_CPUID_ENTRIES is to check KVM_SET_CPUID/
KVM_SET_CPUID2 input for sanity. Since it was reported that the
current limit (80) is insufficient for some CPUs, bump
KVM_MAX_CPUID_ENTRIES and use an arbitrary value '256' as the new
limit.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201001130541.1398392-4-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

3f4e3eb4

KVM: x86: allocate vcpu->arch.cpuid_entries dynamically · 255cbecf

由 Vitaly Kuznetsov 提交于 10月 01, 2020

The current limit for guest CPUID leaves (KVM_MAX_CPUID_ENTRIES, 80)
is reported to be insufficient but before we bump it let's switch to
allocating vcpu->arch.cpuid_entries[] array dynamically. Currently,
'struct kvm_cpuid_entry2' is 40 bytes so vcpu->arch.cpuid_entries is
3200 bytes which accounts for 1/4 of the whole 'struct kvm_vcpu_arch'
but having it pre-allocated (for all vCPUs which we also pre-allocate)
gives us no real benefits.

Another plus of the dynamic allocation is that we now do kvm_check_cpuid()
check before we assign anything to vcpu->arch.cpuid_nent/cpuid_entries so
no changes are made in case the check fails.

Opportunistically remove unneeded 'out' labels from
kvm_vcpu_ioctl_set_cpuid()/kvm_vcpu_ioctl_set_cpuid2() and return
directly whenever possible.
Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20201001130541.1398392-3-vkuznets@redhat.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>

255cbecf

kvm: x86: only provide PV features if enabled in guest's CPUID · 66570e96

由 Oliver Upton 提交于 8月 18, 2020

KVM unconditionally provides PV features to the guest, regardless of the
configured CPUID. An unwitting guest that doesn't check
KVM_CPUID_FEATURES before use could access paravirt features that
userspace did not intend to provide. Fix this by checking the guest's
CPUID before performing any paravirtual operations.

Introduce a capability, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, to gate the
aforementioned enforcement. Migrating a VM from a host w/o this patch to
a host with this patch could silently change the ABI exposed to the
guest, warranting that we default to the old behavior and opt-in for
the new one.
Reviewed-by: NJim Mattson <jmattson@google.com>
Reviewed-by: NPeter Shier <pshier@google.com>
Signed-off-by: NOliver Upton <oupton@google.com>
Change-Id: I202a0926f65035b872bfe8ad15307c026de59a98
Message-Id: <20200818152429.1923996-4-oupton@google.com>
Reviewed-by: NWanpeng Li <wanpengli@tencent.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>

66570e96

15 10月, 2020 1 次提交

Drivers: hv: vmbus: Add parsing of VMbus interrupt in ACPI DSDT · 626b901f

由 Michael Kelley 提交于 8月 14, 2020

On ARM64, Hyper-V now specifies the interrupt to be used by VMbus
in the ACPI DSDT. This information is not used on x86 because the
interrupt vector must be hardcoded. But update the generic
VMbus driver to do the parsing and pass the information to the
architecture specific code that sets up the Linux IRQ. Update
consumers of the interrupt to get it from an architecture specific
function.
Signed-off-by: NMichael Kelley <mikelley@microsoft.com>
Link: https://lore.kernel.org/r/1597434304-40631-1-git-send-email-mikelley@microsoft.comSigned-off-by: NWei Liu <wei.liu@kernel.org>

626b901f

14 10月, 2020 1 次提交

x86/numa: cleanup configuration dependent command-line options · 2dd57d34

由 Dan Williams 提交于 10月 13, 2020

Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5.

The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM.  It is the mechanism for converting persistent
memory (pmem) to be used as another volatile memory pool i.e.  the current
Memory Tiering hot topic on linux-mm.

In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3].  This series provides a
sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.

The motivations for this facility are:

1/ Allow performance differentiated memory ranges to be split between
   kernel-managed and directly-accessed use cases.

2/ Allow physical memory to be provisioned along performance relevant
   address boundaries. For example, divide a memory-side cache [4] along
   cache-color boundaries.

3/ Parcel out soft-reserved memory to VMs using device-dax as a security
   / permissions boundary [5]. Specifically I have seen people (ab)using
   memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
   device-dax interface on custom address ranges. A follow-on for the VM
   use case is to teach device-dax to dynamically allocate 'struct page' at
   runtime to reduce the duplication of 'struct page' space in both the
   guest and the host kernel for the same physical pages.

[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com

This patch (of 23):

In preparation for adding a new numa= option clean up the existing ones to
avoid ifdefs in numa_setup(), and provide feedback when the option is
numa=fake= option is invalid due to kernel config.  The same does not need
to be done for numa=noacpi, since the capability is already hard disabled
at compile-time.
Suggested-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

2dd57d34

13 10月, 2020 3 次提交

x86/uaccess: utilize CONFIG_CC_HAS_ASM_GOTO_OUTPUT · 865c50e1

由 Nick Desaulniers 提交于 8月 07, 2020

Clang-11 shipped support for outputs to asm goto statments along the
fallthrough path. Double up some of the get_user() and related macros
to be able to take advantage of this extended GNU C extension. This
should help improve the generated code's performance for these accesses.

Cc: Bill Wendling <morbo@google.com>
Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NNick Desaulniers <ndesaulniers@google.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

865c50e1

x86: Make __put_user() generate an out-of-line call · d55564cf

由 Linus Torvalds 提交于 4月 08, 2020

Instead of inlining the stac/mov/clac sequence (which also requires
individual exception table entries and several asm instruction
alternatives entries), just generate "call __put_user_nocheck_X" for the
__put_user() cases, the same way we changed __get_user earlier.

Unlike the get_user() case, we didn't have the same nice infrastructure
to just generate the call with a single case, so this actually has to
change some of the infrastructure in order to do this.  But that only
cleans up the code further.

So now, instead of using a case statement for the sizes, we just do the
same thing we've done on the get_user() side for a long time: use the
size as an immediate constant to the asm, and generate the asm that way
directly.

In order to handle the special case of 64-bit data on a 32-bit kernel, I
needed to change the calling convention slightly: the data is passed in
%eax[:%edx], the pointer in %ecx, and the return value is also returned
in %ecx.  It used to be returned in %eax, but because of how %eax can
now be a double register input, we don't want mix that with a
single-register output.

The actual low-level asm is easier to handle: we'll just share the code
between the checking and non-checking case, with the non-checking case
jumping into the middle of the function.  That may sound a bit too
special, but this code is all very very special anyway, so...

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d55564cf

x86: Make __get_user() generate an out-of-line call · ea6f043f

由 Linus Torvalds 提交于 4月 08, 2020

Instead of inlining the whole stac/lfence/mov/clac sequence (which also
requires individual exception table entries and several asm instruction
alternatives entries), just generate "call __get_user_nocheck_X" for the
__get_user() cases.

We can use all the same infrastructure that we already do for the
regular "get_user()", and the end result is simpler source code, and
much simpler code generation.

It also means that when I introduce asm goto with input for
"unsafe_get_user()", there are no nasty interactions with the
__get_user() code.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

ea6f043f

08 10月, 2020 1 次提交

x86/mce: Allow for copy_mc_fragile symbol checksum to be generated · b3149ffc

由 Borislav Petkov 提交于 10月 07, 2020

Add asm/mce.h to asm/asm-prototypes.h so that that asm symbol's checksum
can be generated in order to support CONFIG_MODVERSIONS with it and fix:

WARNING: modpost: EXPORT symbol "copy_mc_fragile" [vmlinux] version \
generation failed, symbol will not be versioned.

For reference see:

4efca4ed ("kbuild: modversions for EXPORT_SYMBOL() for asm")
334bb773 ("x86/kbuild: enable modversions for symbols exported from asm")

Fixes: ec6347bb ("x86, powerpc: Rename memcpy_mcsafe() to copy_mc_to_{user, kernel}()")
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201007111447.GA23257@zn.tnic

b3149ffc

07 10月, 2020 6 次提交

x86/asm: Add an enqcmds() wrapper for the ENQCMDS instruction · 7f5933f8

由 Dave Jiang 提交于 10月 05, 2020

Currently, the MOVDIR64B instruction is used to atomically submit
64-byte work descriptors to devices. Although it can encounter errors
like device queue full, command not accepted, device not ready, etc when
writing to a device MMIO, MOVDIR64B can not report back on errors from
the device itself. This means that MOVDIR64B users need to separately
interact with a device to see if a descriptor was successfully queued,
which slows down device interactions.

ENQCMD and ENQCMDS also atomically submit 64-byte work descriptors
to devices. But, they *can* report back errors directly from the
device, such as if the device was busy, or device not enabled or does
not support the command. This immediate feedback from the submission
instruction itself reduces the number of interactions with the device
and can greatly increase efficiency.

ENQCMD can be used at any privilege level, but can effectively only
submit work on behalf of the current process. ENQCMDS is a ring0-only
instruction and can explicitly specify a process context instead of
being tied to the current process or needing to reprogram the IA32_PASID
MSR.

Use ENQCMDS for work submission within the kernel because a Process
Address ID (PASID) is setup to translate the kernel virtual address
space. This PASID is provided to ENQCMDS from the descriptor structure
submitted to the device and not retrieved from IA32_PASID MSR, which is
setup for the current user address space.

See Intel Software Developer’s Manual for more information on the
instructions.

[ bp:
- Make operand constraints like movdir64b() because both insns are
basically doing the same thing, more or less.
- Fixup comments and cleanup. ]

Link: https://lkml.kernel.org/r/20200924180041.34056-3-dave.jiang@intel.comSigned-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Link: https://lkml.kernel.org/r/20201005151126.657029-3-dave.jiang@intel.com

7f5933f8

x86/asm: Carve out a generic movdir64b() helper for general usage · 0888e103

由 Dave Jiang 提交于 10月 05, 2020

Carve out the MOVDIR64B inline asm primitive into a generic helper so
that it can be used by other functions. Move it to special_insns.h and
have iosubmit_cmds512() call it.

 [ bp: Massage commit message. ]
Suggested-by: NMichael Matz <matz@suse.de>
Signed-off-by: NDave Jiang <dave.jiang@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005151126.657029-2-dave.jiang@intel.com

0888e103

x86/mce: Decode a kernel instruction to determine if it is copying from user · 30063810

由 Tony Luck 提交于 10月 06, 2020

All instructions copying data between kernel and user memory
are tagged with either _ASM_EXTABLE_UA or _ASM_EXTABLE_CPY
entries in the exception table. ex_fault_handler_type() returns
EX_HANDLER_UACCESS for both of these.

Recovery is only possible when the machine check was triggered
on a read from user memory. In this case the same strategy for
recovery applies as if the user had made the access in ring3. If
the fault was in kernel memory while copying to user there is no
current recovery plan.

For MOV and MOVZ instructions a full decode of the instruction
is done to find the source address. For MOVS instructions
the source address is in the %rsi register. The function
fault_in_kernel_space() determines whether the source address is
kernel or user, upgrade it from "static" so it can be used here.
Co-developed-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-7-tony.luck@intel.com

30063810

x86/mce: Add _ASM_EXTABLE_CPY for copy user access · 278b917f

由 Youquan Song 提交于 10月 06, 2020

_ASM_EXTABLE_UA is a general exception entry to record the exception fixup
for all exception spots between kernel and user space access.

To enable recovery from machine checks while coping data from user
addresses it is necessary to be able to distinguish the places that are
looping copying data from those that copy a single byte/word/etc.

Add a new macro _ASM_EXTABLE_CPY and use it in place of _ASM_EXTABLE_UA
in the copy functions.

Record the exception reason number to regs->ax at
ex_handler_uaccess which is used to check MCE triggered.

The new fixup routine ex_handler_copy() is almost an exact copy of
ex_handler_uaccess() The difference is that it sets regs->ax to the trap
number. Following patches use this to avoid trying to copy remaining
bytes from the tail of the copy and possibly hitting the poison again.

New mce.kflags bit MCE_IN_KERNEL_COPYIN will be used by mce_severity()
calculation to indicate that a machine check is recoverable because the
kernel was copying from user space.
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-4-tony.luck@intel.com

278b917f

x86/mce: Provide method to find out the type of an exception handler · a05d54c4

由 Tony Luck 提交于 10月 06, 2020

Avoid a proliferation of ex_has_*_handler() functions by having just
one function that returns the type of the handler (if any).

Drop the __visible attribute for this function. It is not called
from assembler so the attribute is not necessary.
Signed-off-by: NTony Luck <tony.luck@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201006210910.21062-3-tony.luck@intel.com

a05d54c4

x86/platform/uv: Update Copyrights to conform to HPE standards · 7a6d94f0

由 Mike Travis 提交于 10月 05, 2020

Add Copyrights to those files that have been updated for UV5 changes.
Signed-off-by: NMike Travis <mike.travis@hpe.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20201005203929.148656-14-mike.travis@hpe.com

7a6d94f0

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功