  1. 02 Aug 2021, 10 commits
    • KVM: x86: Preserve guest's CR0.CD/NW on INIT · 4c72ab5a
      Authored by Sean Christopherson
      Preserve CR0.CD and CR0.NW on INIT instead of forcing them to '1', as
      defined by both Intel's SDM and AMD's APM.
      
      Note, current versions of Intel's SDM are very poorly written with
      respect to INIT behavior.  Table 9-1. "IA-32 and Intel 64 Processor
      States Following Power-up, Reset, or INIT" quite clearly lists power-up,
      RESET, _and_ INIT as setting CR0=60000010H, i.e. CD/NW=1.  But the SDM
      then attempts to qualify CD/NW behavior in a footnote:
      
        2. The CD and NW flags are unchanged, bit 4 is set to 1, all other bits
           are cleared.
      
      Presumably that footnote is only meant for INIT, as the RESET case and
      especially the power-up case are rather non-sensical.  Another footnote
      all but confirms that:
      
        6. Internal caches are invalid after power-up and RESET, but left
           unchanged with an INIT.
      
      Bare metal testing shows that CD/NW are indeed preserved on INIT (someone
      else can hack their BIOS to check RESET and power-up :-D).
      Reported-by: Reiji Watanabe <reijiw@google.com>
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-47-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4c72ab5a
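
      A minimal sketch of the behavior described above, assuming a hypothetical
      helper name (the real change is inline in KVM's RESET/INIT path):

        /* Illustrative only: RESET forces CD/NW=1, INIT preserves the guest's bits. */
        static unsigned long reset_value_for_cr0(struct kvm_vcpu *vcpu, bool init_event)
        {
                unsigned long old_cr0 = kvm_read_cr0(vcpu);
                unsigned long new_cr0 = X86_CR0_ET;

                if (init_event)
                        new_cr0 |= old_cr0 & (X86_CR0_CD | X86_CR0_NW);
                else
                        new_cr0 |= X86_CR0_CD | X86_CR0_NW;

                return new_cr0; /* 0x60000010 on RESET, CD/NW carried over on INIT */
        }
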
    • KVM: SVM: Emulate #INIT in response to triple fault shutdown · 265e4353
      Authored by Sean Christopherson
      Emulate a full #INIT instead of simply initializing the VMCB if the
      guest hits a shutdown.  Initializing the VMCB but not other vCPU state,
      much of which is mirrored by the VMCB, results in incoherent and broken
      vCPU state.
      
      Ideally, KVM would not automatically init anything on shutdown, and
      instead put the vCPU into e.g. KVM_MP_STATE_UNINITIALIZED and force
      userspace to explicitly INIT or RESET the vCPU.  Even better would be to
      add KVM_MP_STATE_SHUTDOWN, since technically NMI can break shutdown
      (and SMI on Intel CPUs).
      
      But, that ship has sailed, and emulating #INIT is the next best thing as
      that has at least some connection with reality since there exist bare
      metal platforms that automatically INIT the CPU if it hits shutdown.
      
      Fixes: 46fe4ddd ("[PATCH] KVM: SVM: Propagate cpu shutdown events to userspace")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-45-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      265e4353
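
      An abbreviated sketch of the shape of the SVM shutdown handler after this
      change; error paths and VMCB re-initialization details are omitted:

        static int shutdown_interception(struct kvm_vcpu *vcpu)
        {
                /*
                 * Emulate a full INIT (init_event = true) rather than only
                 * re-initializing the VMCB, so vCPU state that is mirrored in
                 * the VMCB stays coherent with KVM's own bookkeeping.
                 */
                kvm_vcpu_reset(vcpu, true);

                vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
                return 0;
        }
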
    • KVM: x86: Move setting of sregs during vCPU RESET/INIT to common x86 · f39e805e
      Authored by Sean Christopherson
      Move the setting of CR0, CR4, EFER, RFLAGS, and RIP from vendor code to
      common x86.  VMX and SVM now have near-identical sequences, the only
      difference being that VMX updates the exception bitmap.  Updating the
      bitmap on SVM is unnecessary, but benign.  Unfortunately it can't be left
      behind in VMX due to the need to update exception intercepts after the
      control registers are set.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-37-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f39e805e
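
      A condensed sketch of the now-common sequence in kvm_vcpu_reset(); the CR0
      value shown ignores the CD/NW preservation described above:

        static_call(kvm_x86_set_cr0)(vcpu, X86_CR0_NW | X86_CR0_CD | X86_CR0_ET);
        static_call(kvm_x86_set_cr4)(vcpu, 0);
        static_call(kvm_x86_set_efer)(vcpu, 0);
        /* Exception intercepts depend on CR0/CR4, so refresh them afterwards. */
        static_call(kvm_x86_update_exception_bitmap)(vcpu);

        kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
        kvm_rip_write(vcpu, 0xfff0);
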
    • KVM: x86/mmu: Skip the permission_fault() check on MMIO if CR0.PG=0 · 908b7d43
      Authored by Sean Christopherson
      Skip the MMU permission_fault() check if paging is disabled when
      verifying the cached MMIO GVA is usable.  The check is unnecessary and
      can theoretically get a false positive since the MMU doesn't zero out
      "permissions" or "pkru_mask" when guest paging is disabled.
      
      The obvious alternative is to zero out all the bitmasks when configuring
      nonpaging MMUs, but that's unnecessary work and doesn't align with the
      MMU's general approach of doing as little as possible for flows that are
      supposed to be unreachable.
      
      This is nearly a nop as the false positive is nothing more than an
      insignificant performance blip, and more or less limited to string MMIO
      when L1 is running with paging disabled.  KVM doesn't cache MMIO if L2 is
      active with nested TDP since the "GVA" is really an L2 GPA.  If L2 is
      active without nested TDP, then paging can't be disabled as neither VMX
      nor SVM allows entering the guest without paging of some form.
      
      Jumping back to L1 with paging disabled, in that case direct_map is true
      and so KVM will use CR2 as a GPA; the only time it doesn't is if the
      fault from the emulator doesn't match or emulator_can_use_gpa(), and that
      fails only on string MMIO and other instructions with multiple memory
      operands.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-27-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      908b7d43
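
      The gist of the check, sketched with names approximating those in x86.c
      (not the exact diff):

        /* When reusing the cached MMIO GVA, only consult permission_fault() if
         * guest paging is enabled, since the permission/PKU masks are stale
         * (never zeroed) for nonpaging MMUs. */
        if (vcpu_match_mmio_gva(vcpu, gva) &&
            (!is_paging(vcpu) ||
             !permission_fault(vcpu, vcpu->arch.walk_mmu,
                               vcpu->arch.mmio_access, 0, access)))
                /* ... reuse the cached MMIO translation ... */;
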
    • KVM: x86: Move EDX initialization at vCPU RESET to common code · 49d8665c
      Authored by Sean Christopherson
      Move the EDX initialization at vCPU RESET, which is now identical between
      VMX and SVM, into common code.
      
      No functional change intended.
      Reviewed-by: Reiji Watanabe <reijiw@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      49d8665c
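
      Roughly, the shared RESET code derives EDX from the guest's CPUID.01H leaf,
      falling back to a default signature; the helper usage below is approximate,
      not the exact diff:

        struct kvm_cpuid_entry2 *cpuid_0x1 = kvm_find_cpuid_entry(vcpu, 1, 0);

        /* EDX after RESET holds the CPU signature (family/model/stepping). */
        kvm_rdx_write(vcpu, cpuid_0x1 ? cpuid_0x1->eax : 0x600);
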
    • KVM: x86: Flush the guest's TLB on INIT · df37ed38
      Authored by Sean Christopherson
      Flush the guest's TLB on INIT, as required by Intel's SDM.  Although
      AMD's APM states that the TLBs are unchanged by INIT, it's not clear that
      that's correct as the APM also states that the TLB is flushed on "External
      initialization of the processor."  Regardless, relying on the guest to be
      paranoid is unnecessarily risky, while an unnecessary flush is benign
      from a functional perspective and likely has no measurable impact on
      guest performance.
      
      Note, as of the April 2021 version of Intel's SDM, it also contradicts
      itself with respect to TLB flushing.  The overview of INIT explicitly
      calls out the TLBs as being invalidated, while a table later in the same
      section says they are unchanged.
      
        9.1 INITIALIZATION OVERVIEW:
          The major difference is that during an INIT, the internal caches, MSRs,
          MTRRs, and x87 FPU state are left unchanged (although, the TLBs and BTB
          are invalidated as with a hardware reset)
      
        Table 9-1:
      
        Register                    Power up    Reset      INIT
        Data and Code Cache, TLBs:  Invalid[6]  Invalid[6] Unchanged
      
      Given Core2's erratum[*] about global TLB entries not being flushed on INIT,
      it's safe to assume that the table is simply wrong.
      
        AZ28. INIT Does Not Clear Global Entries in the TLB
        Problem: INIT may not flush a TLB entry when:
          • The processor is in protected mode with paging enabled and the page global enable
            flag is set (PGE bit of CR4 register)
          • G bit for the page table entry is set
          • TLB entry is present in TLB when INIT occurs
        Implication: Software may encounter unexpected page fault or incorrect address
          translation due to a TLB entry erroneously left in TLB after INIT.
      
        Workaround: Write to CR3, CR4 (setting bits PSE, PGE or PAE) or CR0 (setting
                    bits PG or PE) registers before writing to memory early in BIOS
                    code to clear all the global entries from TLB.
      
        Status: For the steppings affected, see the Summary Tables of Changes.
      
      [*] https://www.intel.com/content/dam/support/us/en/documents/processors/mobile/celeron/sb/320121.pdf
      
      Fixes: 6aa8b732 ("[PATCH] kvm: userspace interface")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210713163324.627647-2-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df37ed38
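
      The fix amounts to something like the following in KVM's RESET/INIT path
      (sketch, not the exact diff):

        /* INIT, unlike RESET, reaches here with a live guest TLB to invalidate. */
        if (init_event)
                kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
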
    • KVM: x86: APICv: drop immediate APICv disablement on current vCPU · df63202f
      Authored by Maxim Levitsky
      Special-casing the disabling of APICv on the current vCPU right away in
      kvm_request_apicv_update doesn't bring much benefit over raising
      KVM_REQ_APICV_UPDATE on it instead, since the request will be processed
      on the next entry to the guest anyway.
      (The comment about having another #VMEXIT is wrong.)
      
      It also hides various assumptions that the APICv enable state matches
      the APICv inhibit state, as this special case only makes those states
      match on the current vCPU.

      Previous patches fixed a few such assumptions, so now it should be safe
      to drop this special case.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Message-Id: <20210713142023.106183-5-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      df63202f
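
      A stripped-down sketch of the resulting flow (locking and tracepoints
      omitted): every vCPU, including the current one, reconciles its APICv state
      via the request on its next guest entry.

        unsigned long old = kvm->arch.apicv_inhibit_reasons;

        if (activate)
                __clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
        else
                __set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

        /* Only kick everyone when the overall enable/disable state changes. */
        if (!!old != !!kvm->arch.apicv_inhibit_reasons)
                kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
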
    • KVM: X86: Add per-vm stat for max rmap list size · ec1cf69c
      Authored by Peter Xu
      Add a new statistic max_mmu_rmap_size, which stores the maximum size of rmap
      for the vm.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Message-Id: <20210625153214.43106-2-peterx@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ec1cf69c
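
      The bookkeeping is a single comparison on the rmap insertion path, sketched
      below with names approximating the MMU code:

        rmap_count = pte_list_add(vcpu, spte, rmap_head);
        if (rmap_count > vcpu->kvm->stat.max_mmu_rmap_size)
                vcpu->kvm->stat.max_mmu_rmap_size = rmap_count;
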
    • KVM: x86: Hoist kvm_dirty_regs check out of sync_regs() · e489a4a6
      Authored by Sean Christopherson
      Move the kvm_dirty_regs vs. KVM_SYNC_X86_VALID_FIELDS check out of
      sync_regs() and into its sole caller, kvm_arch_vcpu_ioctl_run().  This
      allows a future patch to allow synchronizing select state for protected
      VMs.
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <889017a8d31cea46472e0c64b234ef5919278ed9.1625186503.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      e489a4a6
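
      After the move, kvm_arch_vcpu_ioctl_run() does roughly the following before
      entering the vCPU (sketch):

        if (kvm_run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
                r = -EINVAL;            /* reject unknown sync flags up front */
                goto out;
        }
        if (kvm_run->kvm_dirty_regs)
                sync_regs(vcpu);        /* sync_regs() itself no longer validates */
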
    • KVM: x86: Use KVM_BUG/KVM_BUG_ON to handle bugs that are fatal to the VM · 67369273
      Authored by Sean Christopherson
      Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
      Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
      Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <0e8760a26151f47dc47052b25ca8b84fffe0641e.1625186503.git.isaku.yamahata@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      67369273
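
      A hypothetical usage sketch, assuming KVM_BUG_ON() behaves like a VM-scoped
      BUG_ON: it warns once, marks the VM as bugged so further ioctls fail, and
      returns the (unlikely) condition so the caller can bail out; semantics
      inferred from the title rather than quoted from the patch.

        if (KVM_BUG_ON(bad_vmcb_state, vcpu->kvm))
                return -EIO;
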
  2. 30 Jul 2021, 1 commit
    • KVM: x86: accept userspace interrupt only if no event is injected · fa7a549d
      Authored by Paolo Bonzini
      Once an exception has been injected, any side effects related to
      the exception (such as setting CR2 or DR6) have taken place.
      Therefore, once KVM sets the VM-entry interruption information
      field or the AMD EVENTINJ field, the next VM-entry must deliver that
      exception.
      
      Pending interrupts are processed after injected exceptions, so
      in theory it would not be a problem to use KVM_INTERRUPT when
      an injected exception is present.  However, DOSEMU is using
      run->ready_for_interrupt_injection to detect interrupt windows
      and then using KVM_SET_SREGS/KVM_SET_REGS to inject the
      interrupt manually.  For this to work, the interrupt window
      must be delayed after the completion of the previous event
      injection.
      
      Cc: stable@vger.kernel.org
      Reported-by: Stas Sergeev <stsp2@yandex.ru>
      Tested-by: Stas Sergeev <stsp2@yandex.ru>
      Fixes: 71cc849b ("KVM: x86: Fix split-irqchip vs interrupt injection window request")
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fa7a549d
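
      The gist of the fix, sketched with the existing helper names; the exact
      placement of the checks may differ from the real patch: an interrupt window
      is only reported once nothing is left half-injected.

        static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
        {
                return kvm_arch_interrupt_allowed(vcpu) &&
                       kvm_cpu_accept_dm_intr(vcpu) &&
                       !kvm_event_needs_reinjection(vcpu) &&
                       !vcpu->arch.exception.pending;
        }
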
  3. 26 Jul 2021, 1 commit
  4. 15 Jul 2021, 2 commits
    • KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() · f85d4016
      Authored by Lai Jiangshan
      When the host is using debug registers but the guest is not using them
      nor is the guest in guest-debug state, the kvm code does not reset
      the host debug registers before kvm_x86->run().  Rather, it relies on
      the hardware vmentry instruction to automatically reset the dr7 registers
      which ensures that the host breakpoints do not affect the guest.
      
      This however violates the non-instrumentable nature around VM entry
      and exit; for example, when a host breakpoint is set on vcpu->arch.cr2,
      
      Another issue is consistency.  When the guest debug registers are active,
      the host breakpoints are reset before kvm_x86->run(). But when the
      guest debug registers are inactive, the host breakpoints are delayed to
      be disabled.  The host tracing tools may see different results depending
      on what the guest is doing.
      
      To fix the problems, we clear %db7 unconditionally before kvm_x86->run()
      if the host has set any breakpoints, no matter if the guest is using
      them or not.
      Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
      Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com>
      Cc: stable@vger.kernel.org
      [Only clear %db7 instead of reloading all debug registers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f85d4016
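
      The fix itself is tiny; a sketch of the added branch in vcpu_enter_guest(),
      just before the vendor ->run() hook:

        /* Host breakpoints are armed but the guest isn't using debug registers:
         * clear %db7 now instead of relying on VM-entry to do it for us. */
        if (unlikely(hw_breakpoint_active()))
                set_debugreg(0, 7);
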
    • Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled" · f0414b07
      Authored by Sean Christopherson
      Let KVM load if EFER.NX=0 even if NX is supported; the analysis and
      testing (or lack thereof) for the non-PAE host case was garbage.
      
      If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S
      skips over the entire EFER sequence.  Hopefully that can be changed in
      the future to allow KVM to require EFER.NX, but the motivation behind
      KVM's requirement isn't yet merged.  Reverting and revisiting the mess
      at a later date is by far the safest approach.
      
      This reverts commit 8bbed95d.
      
      Fixes: 8bbed95d ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210625001853.318148-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f0414b07
  5. 25 Jun 2021, 8 commits
    • kvm: x86: Allow userspace to handle emulation errors · 19238e75
      Authored by Aaron Lewis
      Add a fallback mechanism to the in-kernel instruction emulator that
      allows userspace the opportunity to process an instruction the emulator
      was unable to.  When the in-kernel instruction emulator fails to process
      an instruction it will either inject a #UD into the guest or exit to
      userspace with exit reason KVM_INTERNAL_ERROR.  This is because it does
      not know how to proceed in an appropriate manner.  This feature lets
      userspace get involved to see if it can figure out a better path
      forward.
      Signed-off-by: Aaron Lewis <aaronlewis@google.com>
      Reviewed-by: David Edmondson <david.edmondson@oracle.com>
      Message-Id: <20210510144834.658457-2-aaronlewis@google.com>
      Reviewed-by: Jim Mattson <jmattson@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      19238e75
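
      From the userspace side the flow looks roughly like this; the capability
      name below (KVM_CAP_EXIT_ON_EMULATION_FAILURE) is the one this series adds,
      but treat the details as a sketch rather than complete API documentation:

        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_EXIT_ON_EMULATION_FAILURE,
                .args = { 1 },
        };

        if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
                perror("KVM_ENABLE_CAP");

        /* In the run loop: emulation failures now reach userspace instead of a #UD. */
        if (run->exit_reason == KVM_EXIT_INTERNAL_ERROR &&
            run->internal.suberror == KVM_INTERNAL_ERROR_EMULATION) {
                /* run->emulation_failure carries extra detail, e.g. insn bytes. */
        }
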
    • KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper · 20f632bd
      Authored by Sean Christopherson
      Grab all CR0/CR4 MMU role bits from current vCPU state when initializing
      a non-nested shadow MMU.  Extract the masks from kvm_post_set_cr{0,4}(),
      as the CR0/CR4 update masks must exactly match the mmu_role bits, with
      one exception (see below).  The "full" CR0/CR4 will be used by future
      commits to initialize the MMU and its role, as opposed to the current
      approach of pulling everything from vCPU, which is incorrect for certain
      flows, e.g. nested NPT.
      
      CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU)
      but can't be toggled via MOV CR4 while long mode is active.  I.e. LA57
      needs to be in the mmu_role, but technically doesn't need to be checked
      by kvm_post_set_cr4().  However, the extra check is completely benign as
      the hardware restrictions simply mean LA57 will never be _the_ cause of
      a MMU reset during MOV CR4.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20f632bd
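
      The extracted masks look approximately like this (values as of this series;
      check mmu.h for the authoritative definitions):

        #define KVM_MMU_CR0_ROLE_BITS (X86_CR0_PG | X86_CR0_WP)
        #define KVM_MMU_CR4_ROLE_BITS (X86_CR4_PSE | X86_CR4_PAE | X86_CR4_LA57 | \
                                       X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE)
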
    • KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER · dbc4739b
      Authored by Sean Christopherson
      When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER
      as a u64 in various flows (mostly MMU).  Passing the params as u32s is
      functionally ok since all of the affected registers reserve bits 63:32 to
      zero (enforced by KVM), but it's technically wrong.
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-15-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      dbc4739b
    • KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken · 63f5a190
      Authored by Sean Christopherson
      Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest
      instability.  Initialize last_vmentry_cpu to -1 and use it to detect if
      the vCPU has been run at least once when its CPUID model is changed.
      
      KVM does not correctly handle changes to paging related settings in the
      guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc...  KVM
      could theoretically zap all shadow pages, but actually making that happen
      is a mess due to lock inversion (vcpu->mutex is held).  And even then,
      updating paging settings on the fly would only work if all vCPUs are
      stopped, updated in concert with identical settings, then restarted.
      
      To support running vCPUs with different vCPU models (that affect paging),
      KVM would need to track all relevant information in kvm_mmu_page_role.
      Note, that's the _page_ role, not the full mmu_role.  Updating mmu_role
      isn't sufficient as a vCPU can reuse a shadow page translation that was
      created by a vCPU with different settings and thus completely skip the
      reserved bit checks (that are tied to CPUID).
      
      Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as
      it would require doubling gfn_track from a u16 to a u32, i.e. would
      increase KVM's memory footprint by 2 bytes for every 4kb of guest memory.
      E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT
      would all need to be tracked.
      
      In practice, there is no remotely sane use case for changing any paging
      related CPUID entries on the fly, so just sweep it under the rug (after
      yelling at userspace).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      63f5a190
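
      The detection is as simple as it sounds; a sketch of the warning added on
      the KVM_SET_CPUID{,2} path (wording approximate):

        /* last_vmentry_cpu is initialized to -1, so != -1 means "has run". */
        if (vcpu->arch.last_vmentry_cpu != -1)
                pr_warn_ratelimited("KVM: KVM_SET_CPUID{,2} after KVM_RUN may cause guest instability\n");
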
    • KVM: x86: Properly reset MMU context at vCPU RESET/INIT · 0aa18375
      Authored by Sean Christopherson
      Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG
      was set prior to INIT.  Simply re-initializing the current MMU is not
      sufficient as the current root HPA may not be usable in the new context.
      E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode,
      KVM will fail to switch to the 32-bit pae_root and bomb on the next
      VM-Enter due to running with a 64-bit CR3 in 32-bit mode.
      
      This bug was papered over in both VMX and SVM, but still managed to rear
      its head in the MMU role on VMX.  Because EFER.LMA=1 requires CR0.PG=1,
      kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first
      checking CR0.PG.  VMX's RESET/INIT flow writes CR0 before EFER, and so
      an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to
      generate the wrong MMU role.
      
      In VMX, the INIT issue is specific to running without unrestricted guest
      since unrestricted guest is available if and only if EPT is enabled.
      Commit 8668a3c4 ("KVM: VMX: Reset mmu context when entering real
      mode") resolved the issue by forcing a reset when entering emulated real
      mode.
      
      In SVM, commit ebae871a ("kvm: svm: reset mmu on VCPU reset") forced
      an MMU reset on every INIT to work around the flaw in common x86.  Note, at
      the time the bug was fixed, the SVM problem was exacerbated by a complete
      lack of a CR4 update.
      
      The vendor resets will be reverted in future patches, primarily to aid
      bisection in case there are non-INIT flows that rely on the existing VMX
      logic.
      
      Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and
      all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that
      CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU
      context reset.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20210622175739.3610207-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0aa18375
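
      A sketch of the check added at the end of kvm_vcpu_reset(); old_cr0 is
      sampled before the register state is rewritten:

        unsigned long old_cr0 = kvm_read_cr0(vcpu);

        /* CR0.PG is always cleared by INIT/RESET, so "PG was set before" is
         * enough to know the previous MMU context cannot remain valid. */
        if (old_cr0 & X86_CR0_PG)
                kvm_mmu_reset_context(vcpu);
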
    • KVM: debugfs: Reuse binary stats descriptors · bc9e9e67
      Authored by Jing Zhang
      To remove code duplication, use the binary stats descriptors in the
      implementation of the debugfs interface for statistics. This unifies
      the definition of statistics for the binary and debugfs interfaces.
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-8-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bc9e9e67
    • KVM: stats: Support binary stats retrieval for a VCPU · ce55c049
      Authored by Jing Zhang
      Add a VCPU ioctl that returns a statistics file descriptor, which
      userspace can read to retrieve the VCPU stats header, descriptors
      and data.
      Define VCPU statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-5-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ce55c049
    • KVM: stats: Support binary stats retrieval for a VM · fcfe1bae
      Authored by Jing Zhang
      Add a VM ioctl that returns a statistics file descriptor, which
      userspace can read to retrieve the VM stats header, descriptors
      and data.
      Define VM statistics descriptors and header for all architectures.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Ricardo Koller <ricarkol@google.com>
      Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
      Reviewed-by: Fuad Tabba <tabba@google.com>
      Tested-by: Fuad Tabba <tabba@google.com> #arm64
      Signed-off-by: Jing Zhang <jingzhangos@google.com>
      Message-Id: <20210618222709.1858088-4-jingzhangos@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      fcfe1bae
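
      From userspace, the VM and vCPU variants are used the same way; a minimal
      sketch (error handling omitted, see Documentation/virt/kvm/api.rst for the
      authoritative layout):

        int stats_fd = ioctl(vm_fd, KVM_GET_STATS_FD, NULL);   /* or a vCPU fd */
        struct kvm_stats_header hdr;

        pread(stats_fd, &hdr, sizeof(hdr), 0);
        /* hdr.desc_offset points at hdr.num_desc 'struct kvm_stats_desc' entries
         * (each followed by its name); hdr.data_offset points at the live values,
         * which can be re-read at any time without re-reading the descriptors. */
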
  6. 24 Jun 2021, 5 commits
  7. 23 Jun 2021, 1 commit
  8. 18 Jun 2021, 12 commits