1. 30 April 2022, 12 commits
    • KVM: x86/mmu: cleanup computation of MMU roles for two-dimensional paging · 2ba67677
      Paolo Bonzini authored
      Inline kvm_calc_mmu_role_common into its sole caller, and simplify it
      by removing the computation of unnecessary bits.
      
      Extended bits are unnecessary because page walking uses the CPU role,
      and EFER.NX/CR0.WP can be set to one unconditionally---matching the
      format of shadow pages rather than the format of guest pages.
      
      The MMU role for two dimensional paging does still depend on the CPU role,
      even if only barely so, due to SMM and guest mode; for consistency,
      pass it down to kvm_calc_tdp_mmu_root_page_role instead of querying
      the vcpu with is_smm or is_guest_mode.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove kvm_calc_shadow_root_page_role_common · 19b5dcc3
      Paolo Bonzini authored
      kvm_calc_shadow_root_page_role_common is the same as
      kvm_calc_cpu_role except for the level, which is overwritten
      afterwards in kvm_calc_shadow_mmu_root_page_role
      and kvm_calc_shadow_npt_root_page_role.
      
      role.base.direct is already set correctly for the CPU role,
      and CR0.PG=1 is required for VMRUN so it will also be
      correct for nested NPT.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove ept_ad field · ec283cb1
      Paolo Bonzini authored
      The ept_ad field is used during page walk to determine if the guest PTEs
      have accessed and dirty bits.  In the MMU role, the ad_disabled
      bit represents whether the *shadow* PTEs have the bits, so it
      would be incorrect to replace PT_HAVE_ACCESSED_DIRTY with just
      !mmu->mmu_role.base.ad_disabled.
      
      However, the similar field in the CPU mode, ad_disabled, is initialized
      correctly: to the opposite value of ept_ad for shadow EPT, and zero
      for non-EPT guest paging modes (which always have A/D bits).  It is
      therefore possible to compute PT_HAVE_ACCESSED_DIRTY from the CPU mode,
      like other page-format fields; it just has to be inverted to account
      for the different polarity.
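      As a toy illustration of that inversion (a standalone sketch, not the
      kernel's structures; the field name mirrors the text above):

        #include <stdbool.h>
        #include <stdio.h>

        /* Placeholder for the page-walk-relevant CPU mode bits. */
        struct cpu_mode {
                bool ad_disabled;   /* true for shadow EPT without A/D bits */
        };

        /* The old ept_ad field answered "do guest PTEs have A/D bits?" directly;
         * the same answer falls out of ad_disabled with the polarity inverted. */
        static bool guest_has_ad_bits(const struct cpu_mode *mode)
        {
                return !mode->ad_disabled;
        }

        int main(void)
        {
                struct cpu_mode ept_no_ad = { .ad_disabled = true };
                struct cpu_mode legacy    = { .ad_disabled = false };

                printf("%d %d\n", guest_has_ad_bits(&ept_no_ad),
                       guest_has_ad_bits(&legacy));   /* prints "0 1" */
                return 0;
        }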
      
      In fact, now that the CPU mode is distinct from the MMU roles, it would
      even be possible to remove PT_HAVE_ACCESSED_DIRTY macro altogether, and
      use !mmu->cpu_role.base.ad_disabled instead.  I am not doing this because
      the macro has a small effect in terms of dead code elimination:
      
         text	   data	    bss	    dec	    hex
       103544	  16665	    112	 120321	  1d601    # as of this patch
       103746	  16665	    112	 120523	  1d6cb    # without PT_HAVE_ACCESSED_DIRTY
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not recompute root level from kvm_mmu_role_regs · 60f3cb60
      Paolo Bonzini authored
      The root_level can be found in the cpu_role (in fact the field
      is superfluous and could be removed, but one thing at a time).
      Since there is only one usage left of role_regs_to_root_level,
      inline it into kvm_calc_cpu_role.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: split cpu_role from mmu_role · e5ed0fb0
      Paolo Bonzini authored
      Snapshot the state of the processor registers that govern page walk into
      a new field of struct kvm_mmu.  This is a more natural representation
      than having it *mostly* in mmu_role but not exclusively; the delta
      right now is represented in other fields, such as root_level.
      
      The nested MMU now has only the CPU role; and in fact the new function
      kvm_calc_cpu_role is analogous to the previous kvm_calc_nested_mmu_role,
      except that it has role.base.direct equal to !CR0.PG.  For a walk-only
      MMU, "direct" has no meaning, but we set it to !CR0.PG so that
      role.ext.cr0_pg can go away in a future patch.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: remove "bool base_only" arguments · b8980508
      Paolo Bonzini authored
      The argument is always false now that kvm_mmu_calc_root_page_role has
      been removed.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: pull computation of kvm_mmu_role_regs to kvm_init_mmu · 39e7e2bf
      Paolo Bonzini authored
      The init_kvm_*mmu functions, with the exception of shadow NPT,
      do not need to know the full values of CR0/CR4/EFER; they only
      need to know the bits that make up the "role".  This cleanup
      however will take quite a few incremental steps.  As a start,
      pull the common computation of the struct kvm_mmu_role_regs
      into their caller: all of them extract the struct from the vcpu
      as the very first step.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: constify uses of struct kvm_mmu_role_regs · 82ffa13f
      Paolo Bonzini authored
      struct kvm_mmu_role_regs is computed just once and then accessed.  Use
      const to make this clearer, even though the const fields of struct
      kvm_mmu_role_regs already prevent (or at least hinder) modification of
      the contents of the struct.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: nested EPT cannot be used in SMM · daed87b8
      Paolo Bonzini authored
      The role.base.smm flag is always zero when setting up shadow EPT, so do
      not bother copying it over from vcpu->arch.root_mmu.
      Reviewed-by: David Matlack <dmatlack@google.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Use enable_mmio_caching to track if MMIO caching is enabled · 8b9e74bf
      Sean Christopherson authored
      Clear enable_mmio_caching if hardware can't support MMIO caching and use
      the dedicated flag to detect if MMIO caching is enabled instead of
      assuming shadow_mmio_value==0 means MMIO caching is disabled.  TDX will
      use a zero value even when caching is enabled, and is_mmio_spte() isn't
      so hot that it needs to avoid an extra memory access, i.e. there's no
      reason to be super clever.  And the clever approach may not even be more
      performant, e.g. gcc-11 lands the extra check on a non-zero value inline,
      but puts the enable_mmio_caching check out-of-line, i.e. avoids the few extra
      uops for non-MMIO SPTEs.
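      As a rough userspace illustration of the approach (the names and mask
      values below are placeholders, not KVM's actual definitions), testing a
      dedicated flag keeps an is_mmio_spte()-style check correct even when the
      magic value happens to be zero:

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        static bool mmio_caching_enabled;               /* cf. enable_mmio_caching */
        static uint64_t mmio_value;                     /* cf. shadow_mmio_value   */
        static const uint64_t mmio_mask = 0x7ULL << 52; /* arbitrary for the sketch */

        static void mmio_setup(uint64_t value, bool hw_supported)
        {
                /* Record "caching disabled" in the flag, not in value == 0. */
                mmio_caching_enabled = hw_supported;
                mmio_value = value;
        }

        static bool is_mmio_spte(uint64_t spte)
        {
                /* A zero magic value (as TDX would use) no longer reads as
                 * "caching disabled"; the flag decides. */
                return mmio_caching_enabled && (spte & mmio_mask) == mmio_value;
        }

        int main(void)
        {
                mmio_setup(0, true);
                printf("%d\n", is_mmio_spte(0));   /* 1: recognized despite value 0 */
                mmio_setup(0, false);
                printf("%d\n", is_mmio_spte(0));   /* 0: caching disabled */
                return 0;
        }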
      
      Cc: Isaku Yamahata <isaku.yamahata@intel.com>
      Cc: Kai Huang <kai.huang@intel.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220420002747.3287931-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: fix potential races when walking host page table · 44187235
      Mingwei Zhang authored
      KVM uses lookup_address_in_mm() to detect the hugepage size that the host
      uses to map a pfn.  The function suffers from several issues:
      
       - no usage of READ_ONCE(*). This allows multiple dereferences of the same
         page table entry. The resulting TOCTOU problem may cause KVM to
         incorrectly treat a newly generated leaf entry as a nonleaf one, and
         dereference the content by using its pfn value.
      
       - the information returned does not match what KVM needs; for non-present
         entries it returns the level at which the walk was terminated, as long
         as the entry is not 'none'.  KVM needs level information only for
         'present' entries, otherwise it may regard a non-present PXE entry as
         a present large page mapping.
      
       - the function is not safe for mappings that can be torn down, because it
         does not disable IRQs and because it returns a PTE pointer which is never
         safe to dereference after the function returns.
      
      So implement the logic for walking host page tables directly in KVM, and
      stop using lookup_address_in_mm().
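      A hedged sketch of the first point (plain userspace C, not the new KVM
      walker): read the entry exactly once into a local and derive both the
      "is it a leaf?" answer and the value to decode from that single snapshot,
      so a concurrent writer cannot change the answer in between:

        #include <stdbool.h>
        #include <stdint.h>

        /* Emulate READ_ONCE() for the sketch: one volatile load. */
        #define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

        #define PTE_PRESENT  (1ULL << 0)
        #define PTE_HUGE     (1ULL << 7)   /* placeholder "leaf at this level" bit */

        struct walk_result {
                bool present;
                bool leaf;
                uint64_t value;    /* the snapshot, safe to decode later */
        };

        /* Inspect one level of a hypothetical page table without re-reading
         * the entry: present/leaf and the returned pfn-bearing value all come
         * from the same snapshot. */
        static struct walk_result inspect_entry(const uint64_t *slot)
        {
                uint64_t pte = READ_ONCE(*slot);

                return (struct walk_result) {
                        .present = pte & PTE_PRESENT,
                        .leaf    = pte & PTE_HUGE,
                        .value   = pte,
                };
        }

        int main(void)
        {
                uint64_t fake_pte = PTE_PRESENT | PTE_HUGE | (0x1234ULL << 12);
                struct walk_result r = inspect_entry(&fake_pte);

                return (r.present && r.leaf) ? 0 : 1;
        }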
      
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: Mingwei Zhang <mizhang@google.com>
      Message-Id: <20220429031757.2042406-1-mizhang@google.com>
      [Inline in host_pfn_mapping_level, ensure no semantic change for its
       callers. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR · 86931ff7
      Sean Christopherson authored
      Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
      MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
      The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
      guest, possibly with help from userspace, manages to coerce KVM into
      creating a SPTE for an "impossible" gfn, KVM will leak the associated
      shadow pages (page tables):
      
        WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                      kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Modules linked in: kvm_intel kvm irqbypass
        CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
        Call Trace:
         <TASK>
         kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
         kvm_destroy_vm+0x162/0x2d0 [kvm]
         kvm_vm_release+0x1d/0x30 [kvm]
         __fput+0x82/0x240
         task_work_run+0x5b/0x90
         exit_to_user_mode_prepare+0xd2/0xe0
         syscall_exit_to_user_mode+0x1d/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
      
      On bare metal, encountering an impossible gpa in the page fault path is
      well and truly impossible, barring CPU bugs, as the CPU will signal #PF
      during the gva=>gpa translation (or a similar failure when stuffing a
      physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
      itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
      of the underlying hardware, in which case the hardware will not fault on
      the illegal-from-KVM's-perspective gpa.
      
      Alternatively, KVM could continue allowing the dodgy behavior and simply
      zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
      a (minor) waste of cycles, and more importantly, KVM can't reasonably
      support impossible memslots when running on bare metal (or with an
      accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
      KVM is running as a guest is not a safe option as the host isn't required
      to announce itself to the guest in any way, e.g. doesn't need to set the
      HYPERVISOR CPUID bit.
      
      A second alternative to disallowing the memslot behavior would be to
      disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
      restriction is undesirable as there are legitimate use cases for doing
      so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
      systems so that VMs can be migrated between hosts with different
      MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.
      
      Note that any guest.MAXPHYADDR is valid with shadow paging, and it is
      even useful in order to test KVM with MAXPHYADDR=52 (i.e. without
      any reserved physical address bits).
      
      The now common kvm_mmu_max_gfn() is inclusive instead of exclusive.
      The memslot and TDP MMU code want an exclusive value, but the name
      implies the returned value is inclusive, and the MMIO path needs an
      inclusive check.
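      As a non-authoritative sketch of that naming note (the helper names and
      checks are illustrative, not KVM's):

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define PAGE_SHIFT 12

        /* Inclusive: the largest gfn the host can possibly map. */
        static uint64_t max_gfn_inclusive(unsigned int host_maxphyaddr)
        {
                return (1ULL << (host_maxphyaddr - PAGE_SHIFT)) - 1;
        }

        /* Memslot-style callers think in exclusive ends, so compare the last gfn. */
        static bool memslot_in_range(uint64_t base_gfn, uint64_t npages,
                                     unsigned int host_maxphyaddr)
        {
                return base_gfn + npages - 1 <= max_gfn_inclusive(host_maxphyaddr);
        }

        /* The MMIO-style check is naturally inclusive. */
        static bool gfn_is_possible(uint64_t gfn, unsigned int host_maxphyaddr)
        {
                return gfn <= max_gfn_inclusive(host_maxphyaddr);
        }

        int main(void)
        {
                unsigned int maxphyaddr = 46;   /* example host MAXPHYADDR */

                printf("max gfn: 0x%llx\n",
                       (unsigned long long)max_gfn_inclusive(maxphyaddr));
                printf("slot ok: %d\n", memslot_in_range(0, 1ULL << 20, maxphyaddr));
                printf("gfn ok:  %d\n", gfn_is_possible(1ULL << 40, maxphyaddr));
                return 0;
        }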
      
      Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
      Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
      Cc: stable@vger.kernel.org
      Cc: Maxim Levitsky <mlevitsk@redhat.com>
      Cc: Ben Gardon <bgardon@google.com>
      Cc: David Matlack <dmatlack@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220428233416.2446833-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  2. 05 April 2022, 1 commit
    • KVM: x86/mmu: Resolve nx_huge_pages when kvm.ko is loaded · 1d0e8480
      Sean Christopherson authored
      Resolve nx_huge_pages to true/false when kvm.ko is loaded; leaving it as
      -1 is technically undefined behavior when its value is read out by
      param_get_bool(), as boolean values are supposed to be '0' or '1'.
      
      Alternatively, KVM could define a custom getter for the param, but the
      auto value doesn't depend on the vendor module in any way, and printing
      "auto" would be unnecessarily unfriendly to the user.
      
      In addition to fixing the undefined behavior, resolving the auto value
      also fixes the scenario where the auto value resolves to N and no vendor
      module is loaded.  Previously, -1 would result in Y being printed even
      though KVM would ultimately disable the mitigation.
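      A minimal sketch of the "resolve auto at load time" idea, in plain C
      rather than real module-param code (the vulnerability check is a
      stand-in, not the kernel's):

        #include <stdbool.h>
        #include <stdio.h>

        /* -1 = auto, 0 = off, 1 = on: the tri-state the user may request. */
        static int nx_huge_pages_param = -1;

        /* What is later read back as a boolean must really be 0 or 1; reading
         * -1 through a bool getter is the undefined behavior UBSAN flags. */
        static bool nx_huge_pages;

        static bool cpu_is_vulnerable(void)   /* stand-in for the real check */
        {
                return true;
        }

        /* Resolve the auto value once, when the module is loaded. */
        static void resolve_nx_huge_pages(void)
        {
                if (nx_huge_pages_param < 0)
                        nx_huge_pages = cpu_is_vulnerable();
                else
                        nx_huge_pages = nx_huge_pages_param;
        }

        int main(void)
        {
                resolve_nx_huge_pages();
                printf("nx_huge_pages: %s\n", nx_huge_pages ? "Y" : "N");
                return 0;
        }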
      
      Rename the existing MMU module init/exit helpers to clarify that they're
      invoked with respect to the vendor module, and add comments to document
      why KVM has two separate "module init" flows.
      
        =========================================================================
        UBSAN: invalid-load in kernel/params.c:320:33
        load of value 255 is not a valid value for type '_Bool'
        CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        Call Trace:
         <TASK>
         dump_stack_lvl+0x34/0x44
         ubsan_epilogue+0x5/0x40
         __ubsan_handle_load_invalid_value.cold+0x43/0x48
         param_get_bool.cold+0xf/0x14
         param_attr_show+0x55/0x80
         module_attr_show+0x1c/0x30
         sysfs_kf_seq_show+0x93/0xc0
         seq_read_iter+0x11c/0x450
         new_sync_read+0x11b/0x1a0
         vfs_read+0xf0/0x190
         ksys_read+0x5f/0xe0
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
         </TASK>
        =========================================================================
      
      Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation")
      Cc: stable@vger.kernel.org
      Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220331221359.3912754-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  3. 02 April 2022, 7 commits
    • KVM: x86/mmu: Don't rebuild page when the page is synced and no tlb flushing is required · 8d5678a7
      Hou Wenlong authored
      Before Commit c3e5e415 ("KVM: X86: Change kvm_sync_page()
      to return true when remote flush is needed"), the return value
      of kvm_sync_page() indicates whether the page is synced, and
      kvm_mmu_get_page() would rebuild page when the sync fails.
      But now, kvm_sync_page() returns false when the page is
      synced and no TLB flush is required, which leads
      kvm_mmu_get_page() to rebuild the page. So return the
      value of mmu->sync_page() directly and check it in
      kvm_mmu_get_page(). If the sync fails, the page will be
      zapped and the invalid_list will not be empty, so setting
      flush to true is acceptable in mmu_sync_children().
      
      Cc: stable@vger.kernel.org
      Fixes: c3e5e415 ("KVM: X86: Change kvm_sync_page() to return true when remote flush is needed")
      Signed-off-by: Hou Wenlong <houwenlong.hwl@antgroup.com>
      Acked-by: Lai Jiangshan <jiangshanlai@gmail.com>
      Message-Id: <0dabeeb789f57b0d793f85d073893063e692032d.1647336064.git.houwenlong.hwl@antgroup.com>
      [mmu_sync_children should not flush if the page is zapped. - Paolo]
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: mmu: trace kvm_mmu_set_spte after the new SPTE was set · 5959ff4a
      Maxim Levitsky authored
      It makes more sense to print the new SPTE value than the
      old value.
      Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Handle implicit supervisor access with SMAP · 4f4aa80e
      Lai Jiangshan authored
      There are two kinds of implicit supervisor access:
      	implicit supervisor access when CPL = 3
      	implicit supervisor access when CPL < 3
      
      Current permission_fault() handles only the first kind for SMAP.
      
      But if the access is implicit when SMAP is on, data may be neither read
      nor written from any user-mode address, regardless of the current CPL.
      
      So the second kind should be also supported.
      
      The first kind can be detected via CPL and access mode: if it is a
      supervisor access and CPL = 3, it must be implicit supervisor access.
      
      But it is not possible to detect the second kind without extra
      information, so this patch adds an artificial PFERR_IMPLICIT_ACCESS
      into @access. This extra information also works for the first kind, so
      the logic is changed to use this information for both cases.

      The value of PFERR_IMPLICIT_ACCESS is deliberately chosen to be bit 48,
      which is in the most significant 16 bits of u64 and less likely to be
      forced to change if future hardware uses it.
      
      This patch removes the call to ->get_cpl(), since the access mode is now
      determined by @access.  Not only does this save a function call, it also
      removes confusion when permissions are checked for nested TDP.  Nested
      TDP should not be subject to SMAP checking, nor should L2's CPL have any
      bearing on it.  The original code works only because the walk is always a
      user walk for NPT and the SMAP fault bit is never set for EPT in
      update_permission_bitmask.
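      To illustrate the idea only (this is not KVM's permission_fault(); the
      bit name and position follow the text above, the other mask is the
      usual x86 page-fault error-code U/S bit):

        #include <stdbool.h>
        #include <stdint.h>

        #define PFERR_USER_MASK        (1ULL << 2)
        #define PFERR_IMPLICIT_ACCESS  (1ULL << 48)

        /*
         * Would SMAP forbid this access to a user-mode page?  A supervisor
         * access is blocked if the access is implicit (regardless of CPL and
         * EFLAGS.AC) or if EFLAGS.AC is clear -- and the CPL never has to be
         * queried, because @access already carries the needed information.
         */
        static bool smap_blocks_access(uint64_t access, bool cr4_smap, bool eflags_ac)
        {
                bool supervisor = !(access & PFERR_USER_MASK);
                bool implicit = access & PFERR_IMPLICIT_ACCESS;

                if (!cr4_smap || !supervisor)
                        return false;

                return implicit || !eflags_ac;
        }

        int main(void)
        {
                /* Implicit supervisor access: blocked even with EFLAGS.AC set. */
                return smap_blocks_access(PFERR_IMPLICIT_ACCESS, true, true) ? 0 : 1;
        }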
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Fix comments in update_permission_bitmask · 94b4a2f1
      Lai Jiangshan authored
      The commit 09f037aa ("KVM: MMU: speedup update_permission_bitmask")
      refactored the code of update_permission_bitmask() and changed the
      comments.  It added a condition to a list to match the new code,
      so the number/order of the conditions in the comments should be updated
      too.
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: X86: Change the type of access u32 to u64 · 5b22bbe7
      Lai Jiangshan authored
      Change the type of access u32 to u64 for FNAME(walk_addr) and
      ->gva_to_gpa().
      
      The kinds of accesses are usually combinations of UWX, and VMX/SVM's
      nested paging adds a new factor of access: is it an access for a guest
      page table or for a final guest physical address.
      
      And SMAP adds another factor for supervisor access: explicit or implicit.

      So it is better for @access in FNAME(walk_addr) and ->gva_to_gpa() to
      include all of this information to do the walk.
      
      Although @access(u32) has enough bits to encode all the kinds, this
      patch extends it to u64:
      	o Extra bits will be in the higher 32 bits, so that we can
      	  easily obtain the traditional access mode (UWX) by converting
      	  it to u32.
      	o Reuse the value for the access kind defined by SVM's nested
      	  paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as
      	  @error_code in kvm_handle_page_fault().
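      A small standalone sketch of the resulting layout (the high-bit masks
      mirror the names above; treat the exact values as illustrative):

        #include <stdint.h>
        #include <stdio.h>

        /* Traditional page-fault-style bits stay in the low 32 bits... */
        #define PFERR_WRITE_MASK        (1ULL << 1)
        #define PFERR_USER_MASK         (1ULL << 2)
        #define PFERR_FETCH_MASK        (1ULL << 4)

        /* ...while the extra walk "factors" sit in the high 32 bits. */
        #define PFERR_GUEST_FINAL_MASK  (1ULL << 32)
        #define PFERR_GUEST_PAGE_MASK   (1ULL << 33)
        #define PFERR_IMPLICIT_ACCESS   (1ULL << 48)

        int main(void)
        {
                uint64_t access = PFERR_WRITE_MASK | PFERR_GUEST_PAGE_MASK;

                /* Truncating to u32 recovers the traditional UWX-style mode. */
                uint32_t uwx = (uint32_t)access;

                printf("full access: 0x%llx\n", (unsigned long long)access);
                printf("low 32 bits: 0x%x\n", uwx);
                return 0;
        }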
      Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
      Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only TDP MMU leafs in zap range and mmu_notifier unmap · f47e5bbb
      Sean Christopherson authored
      Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and
      kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB
      flush when processing multiple roots (including nested TDP shadow roots).
      Dropping the TLB flush resulted in random crashes when running Hyper-V
      Server 2019 in a guest with KSM enabled in the host (or any source of
      mmu_notifier invalidations, KSM is just the easiest to force).
      
      This effectively reverts commits 873dd122
      and fcb93eb6, and thus restores commit
      cf3e2642, plus this delta on top:
      
      bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
              struct kvm_mmu_page *root;
      
              for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
      -               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
      +               flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
      
              return flush;
       }
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Message-Id: <20220325230348.2587437-1-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: MMU: propagate alloc_workqueue failure · a1a39128
      Paolo Bonzini authored
      If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has
      to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm.
      kvm_arch_init_vm also has to undo all the initialization, so
      group all the MMU initialization code at the beginning and
      handle cleaning up of kvm_page_track_init.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  4. 21 March 2022, 1 commit
  5. 08 March 2022, 6 commits
    • KVM: x86/mmu: Zap invalidated roots via asynchronous worker · 22b94c4b
      Paolo Bonzini authored
      Use the system worker threads to zap the roots invalidated
      by the TDP MMU's "fast zap" mechanism, implemented by
      kvm_tdp_mmu_invalidate_all_roots().
      
      At this point, apart from allowing some parallelism in the zapping of
      roots, the workqueue is a glorified linked list: work items are added and
      flushed entirely within a single kvm->slots_lock critical section.  However,
      the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots()
      assumes that it owns a reference to all invalid roots; therefore, no
      one can set the invalid bit outside kvm_mmu_zap_all_fast().  Putting the
      invalidated roots on a linked list... erm, on a workqueue ensures that
      tdp_mmu_zap_root_work() only puts back those extra references that
      kvm_mmu_zap_all_invalidated_roots() had gifted to it.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Defer TLB flush to caller when freeing TDP MMU shadow pages · bb95dfb9
      Sean Christopherson authored
      Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead
      of immediately flushing.  Because the shadow pages are freed in an RCU
      callback, so long as at least one CPU holds RCU, all CPUs are protected.
      For vCPUs running in the guest, i.e. consuming TLB entries, KVM only
      needs to ensure the caller services the pending TLB flush before dropping
      its RCU protections.  I.e. use the caller's RCU as a proxy for all vCPUs
      running in the guest.
      
      Deferring the flushes allows batching flushes, e.g. when installing a
      1gb hugepage and zapping a pile of SPs.  And when zapping an entire root,
      deferring flushes allows skipping the flush entirely (because flushes are
      not needed in that case).
      
      Avoiding flushes when zapping an entire root is especially important as
      synchronizing with other CPUs via IPI after zapping every shadow page can
      cause significant performance issues for large VMs.  The issue is
      exacerbated by KVM zapping entire top-level entries without dropping
      RCU protection, which can lead to RCU stalls even when zapping roots
      backing relatively "small" amounts of guest memory, e.g. 2tb.  Removing
      the IPI bottleneck largely mitigates the RCU issues, though it's likely
      still a problem for 5-level paging.  A future patch will further address
      the problem by zapping roots in multiple passes to avoid holding RCU for
      an extended duration.
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-20-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range() · cf3e2642
      Sean Christopherson authored
      Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various
      functions accordingly.  When removing mappings for functional correctness
      (except for the stupid VFIO GPU passthrough memslots bug), zapping the
      leaf SPTEs is sufficient as the paging structures themselves do not point
      at guest memory and do not directly impact the final translation (in the
      TDP MMU).
      
      Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only
      the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and
      kvm_unmap_gfn_range().
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-18-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Document that zapping invalidated roots doesn't need to flush · 7ae5840e
      Sean Christopherson authored
      Remove the misleading flush "handling" when zapping invalidated TDP MMU
      roots, and document that flushing is unnecessary for all flavors of MMUs
      when zapping invalid/obsolete roots/pages.  The "handling" in the TDP MMU
      is dead code, as zap_gfn_range() is called with shared=true, in which
      case it will never return true due to the flushing being handled by
      tdp_mmu_zap_spte_atomic().
      
      No functional change intended.
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-6-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Formalize TDP MMU's (unintended?) deferred TLB flush logic · db01416b
      Sean Christopherson authored
      Explicitly ignore the result of zap_gfn_range() when putting the last
      reference to a TDP MMU root, and add a pile of comments to formalize the
      TDP MMU's behavior of deferring TLB flushes to alloc/reuse.  Note, this
      only affects the !shared case, as zap_gfn_range() subtly never returns
      true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic().
      
      Putting the root without a flush is ok because even if there are stale
      references to the root in the TLB, they are unreachable because KVM will
      not run the guest with the same ASID without first flushing (where ASID
      in this context refers to both SVM's explicit ASID and Intel's implicit
      ASID that is constructed from VPID+PCID+EPT4A+etc...).
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220226001546.360188-5-seanjc@google.com>
      Reviewed-by: Mingwei Zhang <mizhang@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Fix wrong/misleading comments in TDP MMU fast zap · f28e9c7f
      Sean Christopherson authored
      Fix misleading and arguably wrong comments in the TDP MMU's fast zap
      flow.  The comments, and the fact that actually zapping invalid roots was
      added separately, strongly suggests that zapping invalid roots is an
      optimization and not required for correctness.  That is a lie.
      
      KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(),
      because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(),
      KVM is relying on it to fully remove all references to the memslot.  Once
      the memslot is gone, KVM's mmu_notifier hooks will be unable to find the
      stale references as the hva=>gfn translation is done via the memslots.
      If KVM doesn't immediately zap SPTEs and userspace unmaps a range after
      deleting a memslot, KVM will fail to zap in response to the mmu_notifier
      due to not finding a memslot corresponding to the notifier's range, which
      leads to a variation of use-after-free.
      
      The other misleading comment (and code) explicitly states that roots
      without a reference should be skipped.  While that's technically true,
      it's also extremely misleading as it should be impossible for KVM to
      encounter a defunct root on the list while holding mmu_lock for write.
      Opportunistically add a WARN to enforce that invariant.
      
      Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
      Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns")
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220226001546.360188-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  6. 02 March 2022, 1 commit
  7. 01 March 2022, 3 commits
    • KVM: WARN if is_unsync_root() is called on a root without a shadow page · 5d6a3221
      Sean Christopherson authored
      WARN and bail if is_unsync_root() is passed a root for which there is no
      shadow page, i.e. is passed the physical address of one of the special
      roots, which do not have an associated shadow page.  The current usage
      squeaks by without bug reports because neither kvm_mmu_sync_roots() nor
      kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root,
      and 5-level AMD CPUs are not generally available, i.e. no one can coerce
      KVM into calling is_unsync_root() on pml5_root.
      
      Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully)
      prevents KVM from crashing.
      
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Message-Id: <20220225182248.3812651-8-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Zap only obsolete roots if a root shadow page is zapped · 527d5cd7
      Sean Christopherson authored
      Zap only obsolete roots when responding to zapping a single root shadow
      page.  Because KVM keeps root_count elevated when stuffing a previous
      root into its PGD cache, shadowing a 64-bit guest means that zapping any
      root causes all vCPUs to reload all roots, even if their current root is
      not affected by the zap.
      
      For many kernels, zapping a single root is a frequent operation, e.g. in
      Linux it happens whenever an mm is dropped, e.g. process exits, etc...
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-5-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: Drop kvm_reload_remote_mmus(), open code request in x86 users · 2f6f66cc
      Sean Christopherson authored
      Remove the generic kvm_reload_remote_mmus() and open code its
      functionality into the two x86 callers.  x86 is (obviously) the only
      architecture that uses the hook, and is also the only architecture that
      uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name.  That
      will change in a future patch: when zapping a single shadow page, x86
      doesn't actually _need_ to reload all vCPUs' MMUs; only the MMUs whose
      root is being zapped actually need to be reloaded.
      
      s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose.
      
      Drop the generic code in anticipation of implementing s390 and x86 arch
      specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely.
      
      Opportunistically reword the x86 TDP MMU comment to avoid making
      references to functions (and requests!) when possible, and to remove the
      rather ambiguous "this".
      
      No functional change intended.
      
      Cc: Ben Gardon <bgardon@google.com>
      Signed-off-by: Sean Christopherson <seanjc@google.com>
      Reviewed-by: Ben Gardon <bgardon@google.com>
      Message-Id: <20220225182248.3812651-4-seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  8. 25 February 2022, 9 commits
    • KVM: x86/mmu: clear MMIO cache when unloading the MMU · 6d58f275
      Paolo Bonzini authored
      For cleanliness, do not leave a stale GVA in the cache after all the roots are
      cleared.  In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if
      paging is on, and will not use vcpu_match_mmio_gva at all if paging is off.
      However, leaving data in the cache might cause bugs in the future.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: Always use current mmu's role when loading new PGD · d2e5f333
      Paolo Bonzini authored
      Since the guest PGD is now loaded after the MMU has been set up
      completely, the desired role for a cache hit is simply the current
      mmu_role.  There is no need to compute it again, so __kvm_mmu_new_pgd
      can be folded in kvm_mmu_new_pgd.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: load new PGD after the shadow MMU is initialized · 3cffc89d
      Paolo Bonzini authored
      Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and
      shadow_root_level anymore, pull the PGD load after the initialization of
      the shadow MMUs.
      
      Besides being more intuitive, this enables future simplifications
      and optimizations because it's not necessary anymore to compute the
      role outside kvm_init_mmu.  In particular, kvm_mmu_reset_context was not
      attempting to use a cached PGD to avoid having to figure out the new role.
      With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing,
      and avoid unloading all the cached roots.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: look for a cached PGD when going from 32-bit to 64-bit · 5499ea73
      Paolo Bonzini authored
      Right now, PGD caching avoids placing a PAE root in the cache by using the
      old value of mmu->root_level and mmu->shadow_root_level; it does not look
      for a cached PGD if the old root is a PAE one, and then frees it using
      kvm_mmu_free_roots.
      
      Change the logic instead to free the uncacheable root early.
      This way, __kvm_mmu_new_pgd is able to look up the cache when going from
      32-bit to 64-bit (if there is a hit, the invalid root becomes the least
      recently used).  An example of this is nested virtualization with shadow
      paging, when a 64-bit L1 runs a 32-bit L2.
      
      As a side effect (which is actually the reason why this patch was
      written), PGD caching does not use the old value of mmu->root_level
      and mmu->shadow_root_level anymore.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not pass vcpu to root freeing functions · 0c1c92f1
      Paolo Bonzini authored
      These functions only operate on a given MMU, of which there is more
      than one in a vCPU (we care about two, because the third does not have
      any roots and is only used to walk guest page tables).  They do need a
      struct kvm in order to lock the mmu_lock, but they do not need anything
      else in the struct kvm_vcpu.  So, pass vcpu->kvm directly to them.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: do not consult levels when freeing roots · 594bef79
      Paolo Bonzini authored
      Right now, PGD caching requires a complicated dance of first computing
      the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling
      kvm_init_mmu().
      
      Part of this is due to kvm_mmu_free_roots using mmu->root_level and
      mmu->shadow_root_level to distinguish whether the page table uses a single
      root or 4 PAE roots.  Because kvm_init_mmu() can overwrite mmu->root_level,
      kvm_mmu_free_roots() must be called before kvm_init_mmu().
      
      However, even after kvm_init_mmu() there is a way to detect whether the
      page table may hold PAE roots, as root.hpa isn't backed by a shadow when
      it points at PAE roots.  Using this method results in simpler code, and
      is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the
      MMU has been initialized.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86: use struct kvm_mmu_root_info for mmu->root · b9e5603c
      Paolo Bonzini authored
      The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info.
      Use the struct to have more consistency between mmu->root and
      mmu->prev_roots.
      
      The patch is entirely search and replace except for cached_root_available,
      which does not need a temporary struct kvm_mmu_root_info anymore.
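      For reference, the shape being consolidated is just a {pgd, hpa} pair; a
      hedged sketch (typedefs added so the snippet stands alone):

        #include <stdint.h>

        typedef uint64_t gpa_t;   /* guest physical address */
        typedef uint64_t hpa_t;   /* host physical address */

        /* One root: the guest PGD it was built for plus the host address of
         * the root table.  Using the same struct for mmu->root and
         * mmu->prev_roots is the consistency the patch is after. */
        struct kvm_mmu_root_info {
                gpa_t pgd;
                hpa_t hpa;
        };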
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: avoid NULL-pointer dereference on page freeing bugs · 9191b8f0
      Paolo Bonzini authored
      WARN and bail if KVM attempts to free a root that isn't backed by a shadow
      page.  KVM allocates a bare page for "special" roots, e.g. when using PAE
      paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa
      will be valid but won't be backed by a shadow page.  It's all too easy to
      blindly call mmu_free_root_page() on root_hpa; be nice and WARN instead of
      crashing KVM and possibly the kernel.
      Reviewed-by: Sean Christopherson <seanjc@google.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
    • KVM: x86/mmu: make apf token non-zero to fix bug · 6f3c1fc5
      Liang Zhang authored
      In the current async pagefault logic, when a page is ready, KVM relies on
      kvm_arch_can_dequeue_async_page_present() to determine whether to deliver
      a READY event to the guest. This function tests the token value of struct
      kvm_vcpu_pv_apf_data, which must be reset to zero by the guest kernel when
      a READY event has been handled. If the value is zero, the previous READY
      event is done and KVM can deliver another one.
      But kvm_arch_setup_async_pf() may produce a valid token whose value is
      zero, which is indistinguishable from the "event done" state and may lead
      to the loss of this READY event.
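      A minimal sketch of the kind of fix (the real token layout in KVM is not
      reproduced here): draw tokens from a wrapping counter and skip the one
      value that collides with "no event pending":

        #include <stdint.h>
        #include <stdio.h>

        static uint32_t apf_id;   /* per-vCPU counter in the real code */

        /* Never hand out 0, so the guest's "token == 0 means the previous
         * READY event was consumed" convention stays unambiguous. */
        static uint32_t next_apf_token(uint32_t vcpu_id)
        {
                uint32_t token;

                do {
                        token = (apf_id++ << 12) | vcpu_id;
                } while (token == 0);

                return token;
        }

        int main(void)
        {
                /* vcpu_id 0 with a fresh counter would previously have yielded 0. */
                printf("first token: 0x%x\n", next_apf_token(0));
                return 0;
        }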
      
      This bug may cause task blocked forever in Guest:
       INFO: task stress:7532 blocked for more than 1254 seconds.
             Not tainted 5.10.0 #16
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       task:stress          state:D stack:    0 pid: 7532 ppid:  1409
       flags:0x00000080
       Call Trace:
        __schedule+0x1e7/0x650
        schedule+0x46/0xb0
        kvm_async_pf_task_wait_schedule+0xad/0xe0
        ? exit_to_user_mode_prepare+0x60/0x70
        __kvm_handle_async_pf+0x4f/0xb0
        ? asm_exc_page_fault+0x8/0x30
        exc_page_fault+0x6f/0x110
        ? asm_exc_page_fault+0x8/0x30
        asm_exc_page_fault+0x1e/0x30
       RIP: 0033:0x402d00
       RSP: 002b:00007ffd31912500 EFLAGS: 00010206
       RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0
       RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0
       RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086
       R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000
       R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000
      Signed-off-by: Liang Zhang <zhangliang5@huawei.com>
      Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>