1. 25 Jun 2022: 15 commits
  2. 24 Jun 2022: 25 commits
    • P
      KVM: nVMX: clean up posted interrupt descriptor try_cmpxchg · 4de5c54f
      Committed by Paolo Bonzini
      Rely on try_cmpxchg64 for re-reading the PID on failure, using READ_ONCE
      only right before the first iteration.
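
      For illustration, a minimal sketch of the resulting pattern (pi_desc,
      old/new and dest are taken from the surrounding KVM code; this is a
      simplified approximation, not the actual patch):

          /*
           * try_cmpxchg64() refreshes "old" with the current descriptor
           * value when it fails, so READ_ONCE() is only needed once,
           * before the first iteration.
           */
          old.control = READ_ONCE(pi_desc->control);
          do {
                  new.control = old.control;
                  new.ndst = dest;        /* retarget the notification vCPU */
                  new.sn = 0;             /* clear suppress-notification    */
          } while (!try_cmpxchg64(&pi_desc->control, &old.control, new.control));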
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4de5c54f
    • Z
      KVM: selftests: Enhance handling WRMSR ICR register in x2APIC mode · 4b88b1a5
      Committed by Zeng Guang
      In some circumstances, e.g. when Intel IPI virtualization is enabled,
      hardware writes the x2APIC ICR register directly instead of going
      through software emulation. Hardware still performs the normal
      reserved-bits checking and expects those bits to be written as zero,
      otherwise it raises a #GP. So the test needs to mask the reserved bits
      out of the data written to the vICR register.
      
      Remove the Delivery Status bit emulation from the test case, as this
      flag is invalid and not needed in x2APIC mode. KVM may skip clearing
      it during interrupt dispatch, which would lead to a spurious test
      failure.
      
      Opportunistically correct the vector number used by the test that
      sends IPIs to non-existent vCPUs.
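
      As a rough sketch of the idea (the helper and mask names below are
      illustrative rather than the selftest's exact code; the layout follows
      the SDM's x2APIC ICR format):

          /* Fields architecturally defined for the x2APIC ICR; everything
           * else is reserved and must be written as zero. */
          #define X2APIC_ICR_VALID_MASK  (GENMASK_ULL(63, 32) | /* destination  */ \
                                          GENMASK_ULL(19, 18) | /* shorthand    */ \
                                          BIT_ULL(15)         | /* trigger mode */ \
                                          BIT_ULL(14)         | /* level        */ \
                                          BIT_ULL(11)         | /* dest. mode   */ \
                                          GENMASK_ULL(10, 0))   /* delivery mode + vector */

          static void x2apic_write_icr(uint64_t val)
          {
                  /* Reserved bits set to one would trigger a #GP. */
                  wrmsr(APIC_BASE_MSR + (APIC_ICR >> 4), val & X2APIC_ICR_VALID_MASK);
          }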
      Signed-off-by: Zeng Guang <guang.zeng@intel.com>
      Message-Id: <20220623094511.26066-1-guang.zeng@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b88b1a5
    • J
      KVM: selftests: Add a self test for CMCI and UCNA emulations. · eede2065
      Committed by Jue Wang
      This patch adds a self test that verifies user space can inject
      UnCorrectable No Action required (UCNA) memory errors into the guest.
      It also verifies that incorrectly configured MSRs for Corrected
      Machine Check Interrupt (CMCI) emulation result in a #GP.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-9-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      eede2065
    • J
      KVM: x86: Enable CMCI capability by default and handle injected UCNA errors · aebc3ca1
      Committed by Jue Wang
      This patch enables MCG_CMCI_P by default in kvm_mce_cap_supported. It
      reuses the KVM_X86_SET_MCE ioctl to implement injection of
      UnCorrectable No Action required (UCNA) errors, which are signaled via
      Corrected Machine Check Interrupt (CMCI).
      
      Neither the CMCI nor the UCNA emulation depends on hardware.
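
      For a rough idea of how user space can use this (a sketch with an
      illustrative bank number and status bits, not the selftest's exact
      values):

          #include <linux/kvm.h>
          #include <sys/ioctl.h>

          #define MCI_STATUS_VAL    (1ULL << 63)
          #define MCI_STATUS_UC     (1ULL << 61)
          #define MCI_STATUS_EN     (1ULL << 60)
          #define MCI_STATUS_MISCV  (1ULL << 59)
          #define MCI_STATUS_ADDRV  (1ULL << 58)

          /* Inject a UCNA error into MC bank 4; if the guest has enabled
           * CMCI on that bank, KVM signals it via CMCI. */
          static int inject_ucna(int vcpu_fd, __u64 addr)
          {
                  struct kvm_x86_mce mce = {
                          .status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
                                    MCI_STATUS_MISCV | MCI_STATUS_ADDRV,
                          .addr   = addr,
                          .bank   = 4,
                  };

                  return ioctl(vcpu_fd, KVM_X86_SET_MCE, &mce);
          }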
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-8-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      aebc3ca1
    • J
      KVM: x86: Add emulation for MSR_IA32_MCx_CTL2 MSRs. · 281b5278
      Committed by Jue Wang
      This patch adds the emulation of IA32_MCi_CTL2 registers to KVM. A
      separate mci_ctl2_banks array is used to keep the existing mce_banks
      register layout intact.
      
      In the Machine Check Architecture, in addition to MCG_CMCI_P, bit 30 of
      the per-bank register IA32_MCi_CTL2 controls whether Corrected Machine
      Check error reporting is enabled.
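
      A minimal sketch of the shape of this emulation (constant and field
      names follow the kernel's MCE definitions, but the snippet is an
      approximation of the patch, not the literal diff):

          /* Bit 30 of IA32_MCi_CTL2 enables CMCI reporting for that bank. */
          #define MCI_CTL2_CMCI_EN  BIT_ULL(30)

          /* kvm_vcpu_arch grows a per-bank array next to mce_banks: */
          u64 *mci_ctl2_banks;    /* one IA32_MCi_CTL2 value per MC bank */

          /* MSR write handling, roughly (inside the existing switch): */
          case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
                  vcpu->arch.mci_ctl2_banks[msr - MSR_IA32_MC0_CTL2] = data;
                  break;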
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-7-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      281b5278
    • J
      KVM: x86: Use kcalloc to allocate the mce_banks array. · 087acc4e
      Committed by Jue Wang
      This patch updates the allocation of mce_banks to use the array
      allocation API (kcalloc), as a precedent for the mci_ctl2_banks array
      added later to implement per-bank control of Corrected Machine Check
      Interrupt (CMCI).
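
      Roughly (a sketch of the allocation change, not the verbatim diff;
      each bank occupies four u64 registers, hence the factor of 4):

          /* Before: a flat kzalloc() sized by hand. */
          vcpu->arch.mce_banks = kzalloc(KVM_MAX_MCE_BANKS * sizeof(u64) * 4,
                                         GFP_KERNEL_ACCOUNT);

          /* After: kcalloc() makes the element count and size explicit. */
          vcpu->arch.mce_banks = kcalloc(KVM_MAX_MCE_BANKS * 4, sizeof(u64),
                                         GFP_KERNEL_ACCOUNT);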
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-6-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      087acc4e
    • J
      KVM: x86: Add Corrected Machine Check Interrupt (CMCI) emulation to lapic. · 4b903561
      Committed by Jue Wang
      This patch calculates the number of LVT entries as part of
      KVM_X86_MCE_SETUP, conditioned on the presence of the MCG_CMCI_P bit
      in MCG_CAP, and stores the result in kvm_lapic. It translates from the
      APIC_LVTx register to an index in the lapic_lvt_entry enum. It extends
      the APIC_LVTx macro, as well as other lapic write/reset handling, to
      support the Corrected Machine Check Interrupt.
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-5-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      4b903561
    • J
      KVM: x86: Add APIC_LVTx() macro. · 987f625e
      Committed by Jue Wang
      An APIC_LVTx macro is introduced to calculate the APIC_LVTx register
      offset based on the index in the lapic_lvt_entry enum. Later patches
      will extend the APIC_LVTx macro to support the APIC_LVTCMCI register
      in order to implement Corrected Machine Check Interrupt signaling.
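
      Conceptually, the macro is just the usual 0x10 stride of the LVT
      registers (a sketch; the CMCI special case comes in a later patch):

          /* LVT registers are laid out 0x10 apart starting at APIC_LVTT. */
          #define APIC_LVTx(x)  (APIC_LVTT + 0x10 * (x))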
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-4-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      987f625e
    • J
      KVM: x86: Fill apic_lvt_mask with enums / explicit entries. · 1d8c681f
      Committed by Jue Wang
      This patch defines a lapic_lvt_entry enum used as explicit indices
      into the apic_lvt_mask array. In later patches an LVT_CMCI entry will
      be added to implement Corrected Machine Check Interrupt signaling.
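
      For reference, a sketch of what such an enum looks like (entry names
      are approximate):

          enum lapic_lvt_entry {
                  LVT_TIMER,
                  LVT_THERMAL_MONITOR,
                  LVT_PERFORMANCE_COUNTER,
                  LVT_LINT0,
                  LVT_LINT1,
                  LVT_ERROR,
                  /* LVT_CMCI is appended by a later patch in the series. */

                  KVM_APIC_MAX_NR_LVT_ENTRIES,
          };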
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-3-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1d8c681f
    • J
      KVM: x86: Make APIC_VERSION capture only the magic 0x14UL. · 951ceb94
      Committed by Jue Wang
      Refactor APIC_VERSION so that the maximum number of LVT entries is
      inserted at runtime rather than compile time. This will be used in a
      subsequent commit to expose the LVT CMCI Register to VMs that support
      Corrected Machine Check error counting/signaling
      (IA32_MCG_CAP.MCG_CMCI_P=1).
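
      In rough terms the before/after shape is (a sketch, not the literal
      diff):

          /* Before: the LVT count is baked into the constant. */
          #define APIC_VERSION  (0x14UL | ((KVM_APIC_LVT_NUM - 1) << 16))

          /* After: only the magic version number remains a constant ... */
          #define APIC_VERSION  0x14UL

          /* ... and the per-vCPU LVT count is folded in at runtime. */
          kvm_lapic_set_reg(apic, APIC_LVR,
                            APIC_VERSION | ((nr_lvt_entries - 1) << 16));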
      Suggested-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Jue Wang <juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Message-Id: <20220610171134.772566-2-juew@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      951ceb94
    • P
      KVM: x86/mmu: Avoid unnecessary flush on eager page split · 03787394
      Committed by Paolo Bonzini
      The TLB flush before installing the newly-populated lower level
      page table is unnecessary if the lower level page table maps
      the huge page identically.  KVM knows this is the case if it did not
      reuse an existing shadow page table; tell drop_large_spte() to skip
      the flush in that case.
      
      Extracted from a patch by David Matlack.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      03787394
    • D
      KVM: x86/mmu: Extend Eager Page Splitting to nested MMUs · ada51a9d
      Committed by David Matlack
      Add support for Eager Page Splitting pages that are mapped by nested
      MMUs. Walk through the rmap first splitting all 1GiB pages to 2MiB
      pages, and then splitting all 2MiB pages to 4KiB pages.
      
      Note, Eager Page Splitting is limited to nested MMUs as a policy rather
      than due to any technical reason (the sp->role.guest_mode check could
      just be deleted and Eager Page Splitting would work correctly for all
      shadow MMU pages). There is really no reason to support Eager Page
      Splitting for tdp_mmu=N, since such support will eventually be phased
      out, and there is no current use case supporting Eager Page Splitting on
      hosts where TDP is either disabled or unavailable in hardware.
      Furthermore, future improvements to nested MMU scalability may diverge
      the code from the legacy shadow paging implementation. These
      improvements will be simpler to make if Eager Page Splitting does not
      have to worry about legacy shadow paging.
      
      Splitting huge pages mapped by nested MMUs requires dealing with some
      extra complexity beyond that of the TDP MMU:
      
      (1) The shadow MMU has a limit on the number of shadow pages that are
          allowed to be allocated. So, as a policy, Eager Page Splitting
          refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
          pages available.
      
      (2) Splitting a huge page may end up re-using an existing lower level
          shadow page table. This is unlike the TDP MMU which always allocates
          new shadow page tables when splitting.
      
      (3) When installing the lower level SPTEs, they must be added to the
          rmap which may require allocating additional pte_list_desc structs.
      
      Case (2) is especially interesting since it may require a TLB flush,
      unlike the TDP MMU which can fully split huge pages without any TLB
      flushes. Specifically, an existing lower level page table may point to
      even lower level page tables that are not fully populated, effectively
      unmapping a portion of the huge page, which requires a flush.  As of
      this commit, a flush is always done after dropping the huge page
      and before installing the lower level page table.
      
      This TLB flush could instead be delayed until the MMU lock is about to be
      dropped, which would batch flushes for multiple splits.  However these
      flushes should be rare in practice (a huge page must be aliased in
      multiple SPTEs and have been split for NX Huge Pages in only some of
      them). Flushing immediately is simpler to plumb and also reduces the
      chances of tripping over a CPU bug (e.g. see iTLB multihit).
      
      [ This commit is based off of the original implementation of Eager Page
        Splitting from Peter in Google's kernel from 2016. ]
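
      At a high level, the split pass walks the rmaps one level at a time,
      roughly like this (a simplified sketch; walk_rmaps_at_level is a
      hypothetical stand-in for the slot rmap iteration helpers):

          /*
           * Split from the largest level down: first every 1GiB mapping to
           * 2MiB, then every 2MiB mapping to 4KiB, so pages end up fully
           * split even when an intermediate level already existed.
           */
          for (level = KVM_MAX_HUGEPAGE_LEVEL; level > PG_LEVEL_4K; level--)
                  walk_rmaps_at_level(kvm, slot, level,
                                      shadow_mmu_try_split_huge_pages);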
      Suggested-by: Peter Feiner <pfeiner@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      ada51a9d
    • D
      KVM: Allow for different capacities in kvm_mmu_memory_cache structs · 837f66c7
      Committed by David Matlack
      Allow the capacity of the kvm_mmu_memory_cache struct to be chosen at
      declaration time rather than being fixed for all declarations. This will
      be used in a follow-up commit to declare a cache in x86 with a capacity
      of 512+ objects without having to increase the capacity of all caches in
      KVM.
      
      This change requires that each cache now specify its capacity at runtime,
      since the cache struct itself no longer has a fixed capacity known at
      compile time. To protect against someone accidentally defining a
      kvm_mmu_memory_cache struct directly (without the extra storage), this
      commit includes a WARN_ON() in kvm_mmu_topup_memory_cache().
      
      In order to support different capacities, this commit changes the
      objects pointer array to be dynamically allocated the first time the
      cache is topped-up.
      
      While here, opportunistically clean up the stack-allocated
      kvm_mmu_memory_cache structs in riscv and arm64 to use designated
      initializers.
      
      No functional change intended.
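
      A sketch of the resulting shape (field order and names are
      approximate):

          struct kvm_mmu_memory_cache {
                  gfp_t gfp_zero;
                  struct kmem_cache *kmem_cache;
                  int capacity;     /* chosen per declaration */
                  int nobjs;
                  void **objects;   /* allocated on first topup */
          };

          /* Stack-allocated caches now use designated initializers, e.g.: */
          struct kvm_mmu_memory_cache cache = {
                  .gfp_zero = __GFP_ZERO,
          };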
      Reviewed-by: Marc Zyngier <maz@kernel.org>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-22-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      837f66c7
    • P
      KVM: x86/mmu: pull call to drop_large_spte() into __link_shadow_page() · 0cd8dc73
      Committed by Paolo Bonzini
      Before allocating a child shadow page table, all callers check
      whether the parent already points to a huge page and, if so, they
      drop that SPTE.  This is done by drop_large_spte().
      
      However, dropping the large SPTE is really only necessary before the
      sp is installed.  While the sp is returned by kvm_mmu_get_child_sp(),
      installing it happens later in __link_shadow_page().  Move the call
      there instead of having it in each and every caller.
      
      To ensure that the shadow page is not linked twice if it was present,
      do _not_ opportunistically make kvm_mmu_get_child_sp() idempotent:
      instead, return an error value if the shadow page already existed.
      This is a bit more verbose, but clearer than NULL.
      
      Finally, now that the drop_large_spte() name is not taken anymore,
      remove the two underscores in front of __drop_large_spte().
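
      Sketched out, the new contract looks roughly like this (simplified
      from the patch):

          static struct kvm_mmu_page *kvm_mmu_get_child_sp(struct kvm_vcpu *vcpu,
                                                           u64 *sptep, gfn_t gfn,
                                                           bool direct, unsigned int access)
          {
                  union kvm_mmu_page_role role;

                  /* A present, non-huge SPTE already points at a child. */
                  if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
                          return ERR_PTR(-EEXIST);

                  role = kvm_mmu_child_role(sptep, direct, access);
                  return kvm_mmu_get_shadow_page(vcpu, gfn, role);
          }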
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      0cd8dc73
    • D
      KVM: x86/mmu: Zap collapsible SPTEs in shadow MMU at all possible levels · 20d49186
      Committed by David Matlack
      Currently KVM only zaps collapsible 4KiB SPTEs in the shadow MMU. This
      is fine for now since KVM never creates intermediate huge pages during
      dirty logging. In other words, KVM always replaces 1GiB pages directly
      with 4KiB pages, so there is no reason to look for collapsible 2MiB
      pages.
      
      However, this will stop being true once the shadow MMU participates in
      eager page splitting. During eager page splitting, each 1GiB page is first
      split into 2MiB pages and then those are split into 4KiB pages. The
      intermediate 2MiB pages may be left behind if an error condition causes
      eager page splitting to bail early.
      
      No functional change intended.
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-20-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20d49186
    • D
      KVM: x86/mmu: Extend make_huge_page_split_spte() for the shadow MMU · 47855da0
      Committed by David Matlack
      Currently make_huge_page_split_spte() assumes execute permissions can be
      granted to any 4K SPTE when splitting huge pages. This is true for the
      TDP MMU but is not necessarily true for the shadow MMU, since KVM may be
      shadowing a non-executable huge page.
      
      To fix this, pass in the role of the child shadow page where the huge
      page will be split and derive the execution permission from that.  This
      is correct because huge pages are always split with a direct shadow page
      and thus the shadow page role contains the correct access permissions.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-19-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      47855da0
    • D
      KVM: x86/mmu: Cache the access bits of shadowed translations · 6a97575d
      Committed by David Matlack
      Splitting huge pages requires allocating/finding shadow pages to replace
      the huge page. Shadow pages are keyed, in part, off the guest access
      permissions they are shadowing. For fully direct MMUs, there is no
      shadowing so the access bits in the shadow page role are always ACC_ALL.
      But during shadow paging, the guest can enforce whatever access
      permissions it wants.
      
      In particular, eager page splitting needs to know the permissions to use
      for the subpages, but KVM cannot retrieve them from the guest page
      tables because eager page splitting does not have a vCPU.  Fortunately,
      the guest access permissions are easy to cache whenever page faults or
      FNAME(sync_page) update the shadow page tables; this is an extension of
      the existing cache of the shadowed GFNs in the gfns array of the shadow
      page.  The access bits take up only 3 bits, which leaves 61 bits for
      the gfn, more than enough.
      
      Now that the gfns array caches more information than just GFNs, rename
      it to shadowed_translation.
      
      While here, preemptively fix up the WARN_ON() that detects gfn
      mismatches in direct SPs. The WARN_ON() was paired with a
      pr_err_ratelimited(), which means that users could sometimes see the
      WARN without the accompanying error message. Fix this by outputting the
      error message as part of the WARN splat, and opportunistically make
      them WARN_ONCE() because if these ever fire, they are all but guaranteed
      to fire a lot and will bring down the kernel.
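
      Conceptually, each entry packs the shadowed gfn and the 3 access bits
      into a single 64-bit value, along these lines (helper names here are
      hypothetical; the patch uses its own accessors):

          /* Low bits hold the guest access permissions, the rest the gfn. */
          static void sp_set_translation(struct kvm_mmu_page *sp, int index,
                                         gfn_t gfn, unsigned int access)
          {
                  sp->shadowed_translation[index] = (gfn << PAGE_SHIFT) | access;
          }

          static gfn_t sp_get_gfn(struct kvm_mmu_page *sp, int index)
          {
                  return sp->shadowed_translation[index] >> PAGE_SHIFT;
          }

          static unsigned int sp_get_access(struct kvm_mmu_page *sp, int index)
          {
                  return sp->shadowed_translation[index] & ACC_ALL;
          }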
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-18-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6a97575d
    • D
      KVM: x86/mmu: Update page stats in __rmap_add() · 81cb4657
      Committed by David Matlack
      Update the page stats in __rmap_add() rather than at the call site. This
      will avoid having to manually update page stats when splitting huge
      pages in a subsequent commit.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-17-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      81cb4657
    • D
      KVM: x86/mmu: Decouple rmap_add() and link_shadow_page() from kvm_vcpu · 2ff9039a
      Committed by David Matlack
      Allow adding new entries to the rmap and linking shadow pages without a
      struct kvm_vcpu pointer by moving the implementation of rmap_add() and
      link_shadow_page() into inner helper functions.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-16-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2ff9039a
    • D
      KVM: x86/mmu: Pass const memslot to rmap_add() · 6ec6509e
      Committed by David Matlack
      Constify rmap_add()'s @slot parameter; it is simply passed on to
      gfn_to_rmap(), which takes a const memslot.
      
      No functional change intended.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-15-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      6ec6509e
    • D
      KVM: x86/mmu: Allow NULL @vcpu in kvm_mmu_find_shadow_page() · cbd858b1
      Committed by David Matlack
      Allow @vcpu to be NULL in kvm_mmu_find_shadow_page() (and its only
      caller __kvm_mmu_get_shadow_page()). @vcpu is only required to sync
      indirect shadow pages, so it's safe to pass in NULL when looking up
      direct shadow pages.
      
      This will be used for doing eager page splitting, which allocates direct
      shadow pages from the context of a VM ioctl without access to a vCPU
      pointer.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-14-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      cbd858b1
    • D
      KVM: x86/mmu: Pass kvm pointer separately from vcpu to kvm_mmu_find_shadow_page() · 3cc736b3
      Committed by David Matlack
      Get the kvm pointer from the caller, rather than deriving it from
      vcpu->kvm, and plumb the kvm pointer all the way from
      kvm_mmu_get_shadow_page(). With this change in place, the vcpu pointer
      is only needed to sync indirect shadow pages. In other words,
      __kvm_mmu_get_shadow_page() can now be used to get *direct* shadow pages
      without a vcpu pointer. This enables eager page splitting, which needs
      to allocate direct shadow pages during VM ioctls.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-13-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      3cc736b3
    • D
      KVM: x86/mmu: Replace vcpu with kvm in kvm_mmu_alloc_shadow_page() · 336081fb
      Committed by David Matlack
      The vcpu pointer in kvm_mmu_alloc_shadow_page() is only used to get the
      kvm pointer. So drop the vcpu pointer and just pass in the kvm pointer.
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-12-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      336081fb
    • D
      KVM: x86/mmu: Pass memory caches to allocate SPs separately · 2f8b1b53
      Committed by David Matlack
      Refactor kvm_mmu_alloc_shadow_page() to receive the caches from which it
      will allocate the various pieces of memory for shadow pages as a
      parameter, rather than deriving them from the vcpu pointer. This will be
      useful in a future commit where shadow pages are allocated during VM
      ioctls for eager page splitting, and thus will use a different set of
      caches.
      
      Preemptively pull the caches out all the way to
      kvm_mmu_get_shadow_page() since eager page splitting will not be calling
      kvm_mmu_alloc_shadow_page() directly.
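
      The caches end up bundled in a small struct along these lines (a
      sketch; member names are approximate):

          struct shadow_page_caches {
                  struct kvm_mmu_memory_cache *page_header_cache;
                  struct kvm_mmu_memory_cache *shadow_page_cache;
                  struct kvm_mmu_memory_cache *gfn_array_cache;
          };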
      
      No functional change intended.
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-11-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      2f8b1b53
    • D
      KVM: x86/mmu: Move guest PT write-protection to account_shadowed() · be911771
      Committed by David Matlack
      Move the code that write-protects newly-shadowed guest page tables into
      account_shadowed(). This avoids an extra gfn-to-memslot lookup and is a
      more logical place for this code to live. But most importantly, this
      reduces kvm_mmu_alloc_shadow_page()'s reliance on having a struct
      kvm_vcpu pointer, which will be necessary when creating new shadow pages
      during VM ioctls for eager page splitting.
      
      Note, it is safe to drop the role.level == PG_LEVEL_4K check since
      account_shadowed() returns early if role.level > PG_LEVEL_4K.
      
      No functional change intended.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: David Matlack <dmatlack@google.com>
      Message-Id: <20220516232138.1783324-10-dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      be911771