- 30 4月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR. The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the guest, possibly with help from userspace, manages to coerce KVM into creating a SPTE for an "impossible" gfn, KVM will leak the associated shadow pages (page tables): WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57 kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Modules linked in: kvm_intel kvm irqbypass CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G W 5.18.0-rc1+ #293 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm] Call Trace: <TASK> kvm_arch_destroy_vm+0x130/0x1b0 [kvm] kvm_destroy_vm+0x162/0x2d0 [kvm] kvm_vm_release+0x1d/0x30 [kvm] __fput+0x82/0x240 task_work_run+0x5b/0x90 exit_to_user_mode_prepare+0xd2/0xe0 syscall_exit_to_user_mode+0x1d/0x40 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> On bare metal, encountering an impossible gpa in the page fault path is well and truly impossible, barring CPU bugs, as the CPU will signal #PF during the gva=>gpa translation (or a similar failure when stuffing a physical address into e.g. the VMCS/VMCB). But if KVM is running as a VM itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR of the underlying hardware, in which case the hardware will not fault on the illegal-from-KVM's-perspective gpa. Alternatively, KVM could continue allowing the dodgy behavior and simply zap the max possible range. But, for hosts with MAXPHYADDR < 52, that's a (minor) waste of cycles, and more importantly, KVM can't reasonably support impossible memslots when running on bare metal (or with an accurate MAXPHYADDR as a VM). Note, limiting the overhead by checking if KVM is running as a guest is not a safe option as the host isn't required to announce itself to the guest in any way, e.g. doesn't need to set the HYPERVISOR CPUID bit. A second alternative to disallowing the memslot behavior would be to disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR. That restriction is undesirable as there are legitimate use cases for doing so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous systems so that VMs can be migrated between hosts with different MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess. Note that any guest.MAXPHYADDR is valid with shadow paging, and it is even useful in order to test KVM with MAXPHYADDR=52 (i.e. without any reserved physical address bits). The now common kvm_mmu_max_gfn() is inclusive instead of exclusive. The memslot and TDP MMU code want an exclusive value, but the name implies the returned value is inclusive, and the MMIO path needs an inclusive check. Fixes: faaf05b0 ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU") Fixes: 524a1e4e ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs") Cc: stable@vger.kernel.org Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Ben Gardon <bgardon@google.com> Cc: David Matlack <dmatlack@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220428233416.2446833-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 05 4月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Resolve nx_huge_pages to true/false when kvm.ko is loaded, leaving it as -1 is technically undefined behavior when its value is read out by param_get_bool(), as boolean values are supposed to be '0' or '1'. Alternatively, KVM could define a custom getter for the param, but the auto value doesn't depend on the vendor module in any way, and printing "auto" would be unnecessarily unfriendly to the user. In addition to fixing the undefined behavior, resolving the auto value also fixes the scenario where the auto value resolves to N and no vendor module is loaded. Previously, -1 would result in Y being printed even though KVM would ultimately disable the mitigation. Rename the existing MMU module init/exit helpers to clarify that they're invoked with respect to the vendor module, and add comments to document why KVM has two separate "module init" flows. ========================================================================= UBSAN: invalid-load in kernel/params.c:320:33 load of value 255 is not a valid value for type '_Bool' CPU: 6 PID: 892 Comm: tail Not tainted 5.17.0-rc3+ #799 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: <TASK> dump_stack_lvl+0x34/0x44 ubsan_epilogue+0x5/0x40 __ubsan_handle_load_invalid_value.cold+0x43/0x48 param_get_bool.cold+0xf/0x14 param_attr_show+0x55/0x80 module_attr_show+0x1c/0x30 sysfs_kf_seq_show+0x93/0xc0 seq_read_iter+0x11c/0x450 new_sync_read+0x11b/0x1a0 vfs_read+0xf0/0x190 ksys_read+0x5f/0xe0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ========================================================================= Fixes: b8e8c830 ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Reported-by: NBruno Goncalves <bgoncalv@redhat.com> Reported-by: NJan Stancek <jstancek@redhat.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220331221359.3912754-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 02 4月, 2022 6 次提交
-
-
由 Maxim Levitsky 提交于
It makes more sense to print new SPTE value than the old value. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220302102457.588450-1-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
There are two kinds of implicit supervisor access implicit supervisor access when CPL = 3 implicit supervisor access when CPL < 3 Current permission_fault() handles only the first kind for SMAP. But if the access is implicit when SMAP is on, data may not be read nor write from any user-mode address regardless the current CPL. So the second kind should be also supported. The first kind can be detect via CPL and access mode: if it is supervisor access and CPL = 3, it must be implicit supervisor access. But it is not possible to detect the second kind without extra information, so this patch adds an artificial PFERR_EXPLICIT_ACCESS into @access. This extra information also works for the first kind, so the logic is changed to use this information for both cases. The value of PFERR_EXPLICIT_ACCESS is deliberately chosen to be bit 48 which is in the most significant 16 bits of u64 and less likely to be forced to change due to future hardware uses it. This patch removes the call to ->get_cpl() for access mode is determined by @access. Not only does it reduce a function call, but also remove confusions when the permission is checked for nested TDP. The nested TDP shouldn't have SMAP checking nor even the L2's CPL have any bearing on it. The original code works just because it is always user walk for NPT and SMAP fault is not set for EPT in update_permission_bitmask. Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-5-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
The commit 09f037aa ("KVM: MMU: speedup update_permission_bitmask") refactored the code of update_permission_bitmask() and change the comments. It added a condition into a list to match the new code, so the number/order for conditions in the comments should be updated too. Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-3-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
Change the type of access u32 to u64 for FNAME(walk_addr) and ->gva_to_gpa(). The kinds of accesses are usually combinations of UWX, and VMX/SVM's nested paging adds a new factor of access: is it an access for a guest page table or for a final guest physical address. And SMAP relies a factor for supervisor access: explicit or implicit. So @access in FNAME(walk_addr) and ->gva_to_gpa() is better to include all these information to do the walk. Although @access(u32) has enough bits to encode all the kinds, this patch extends it to u64: o Extra bits will be in the higher 32 bits, so that we can easily obtain the traditional access mode (UWX) by converting it to u32. o Reuse the value for the access kind defined by SVM's nested paging (PFERR_GUEST_FINAL_MASK and PFERR_GUEST_PAGE_MASK) as @error_code in kvm_handle_page_fault(). Signed-off-by: NLai Jiangshan <jiangshan.ljs@antgroup.com> Message-Id: <20220311070346.45023-2-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Re-introduce zapping only leaf SPTEs in kvm_zap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range(), this time without losing a pending TLB flush when processing multiple roots (including nested TDP shadow roots). Dropping the TLB flush resulted in random crashes when running Hyper-V Server 2019 in a guest with KSM enabled in the host (or any source of mmu_notifier invalidations, KSM is just the easiest to force). This effectively revert commits 873dd122 and fcb93eb6, and thus restores commit cf3e2642, plus this delta on top: bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end, struct kvm_mmu_page *root; for_each_tdp_mmu_root_yield_safe(kvm, root, as_id) - flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false); + flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush); return flush; } Cc: Ben Gardon <bgardon@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Tested-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220325230348.2587437-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
If kvm->arch.tdp_mmu_zap_wq cannot be created, the failure has to be propagated up to kvm_mmu_init_vm and kvm_arch_init_vm. kvm_arch_init_vm also has to undo all the initialization, so group all the MMU initialization code at the beginning and handle cleaning up of kvm_page_track_init. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 21 3月, 2022 1 次提交
-
-
由 Paolo Bonzini 提交于
This reverts commit cf3e2642. Multi-vCPU Hyper-V guests started crashing randomly on boot with the latest kvm/queue and the problem can be bisected the problem to this particular patch. Basically, I'm not able to boot e.g. 16-vCPU guest successfully anymore. Both Intel and AMD seem to be affected. Reverting the commit saves the day. Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 3月, 2022 6 次提交
-
-
由 Paolo Bonzini 提交于
Use the system worker threads to zap the roots invalidated by the TDP MMU's "fast zap" mechanism, implemented by kvm_tdp_mmu_invalidate_all_roots(). At this point, apart from allowing some parallelism in the zapping of roots, the workqueue is a glorified linked list: work items are added and flushed entirely within a single kvm->slots_lock critical section. However, the workqueue fixes a latent issue where kvm_mmu_zap_all_invalidated_roots() assumes that it owns a reference to all invalid roots; therefore, no one can set the invalid bit outside kvm_mmu_zap_all_fast(). Putting the invalidated roots on a linked list... erm, on a workqueue ensures that tdp_mmu_zap_root_work() only puts back those extra references that kvm_mmu_zap_all_invalidated_roots() had gifted to it. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Defer TLB flushes to the caller when freeing TDP MMU shadow pages instead of immediately flushing. Because the shadow pages are freed in an RCU callback, so long as at least one CPU holds RCU, all CPUs are protected. For vCPUs running in the guest, i.e. consuming TLB entries, KVM only needs to ensure the caller services the pending TLB flush before dropping its RCU protections. I.e. use the caller's RCU as a proxy for all vCPUs running in the guest. Deferring the flushes allows batching flushes, e.g. when installing a 1gb hugepage and zapping a pile of SPs. And when zapping an entire root, deferring flushes allows skipping the flush entirely (because flushes are not needed in that case). Avoiding flushes when zapping an entire root is especially important as synchronizing with other CPUs via IPI after zapping every shadow page can cause significant performance issues for large VMs. The issue is exacerbated by KVM zapping entire top-level entries without dropping RCU protection, which can lead to RCU stalls even when zapping roots backing relatively "small" amounts of guest memory, e.g. 2tb. Removing the IPI bottleneck largely mitigates the RCU issues, though it's likely still a problem for 5-level paging. A future patch will further address the problem by zapping roots in multiple passes to avoid holding RCU for an extended duration. Reviewed-by: NBen Gardon <bgardon@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-20-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Zap only leaf SPTEs in the TDP MMU's zap_gfn_range(), and rename various functions accordingly. When removing mappings for functional correctness (except for the stupid VFIO GPU passthrough memslots bug), zapping the leaf SPTEs is sufficient as the paging structures themselves do not point at guest memory and do not directly impact the final translation (in the TDP MMU). Note, this aligns the TDP MMU with the legacy/full MMU, which zaps only the rmaps, a.k.a. leaf SPTEs, in kvm_zap_gfn_range() and kvm_unmap_gfn_range(). Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-18-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Remove the misleading flush "handling" when zapping invalidated TDP MMU roots, and document that flushing is unnecessary for all flavors of MMUs when zapping invalid/obsolete roots/pages. The "handling" in the TDP MMU is dead code, as zap_gfn_range() is called with shared=true, in which case it will never return true due to the flushing being handled by tdp_mmu_zap_spte_atomic(). No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-6-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Explicitly ignore the result of zap_gfn_range() when putting the last reference to a TDP MMU root, and add a pile of comments to formalize the TDP MMU's behavior of deferring TLB flushes to alloc/reuse. Note, this only affects the !shared case, as zap_gfn_range() subtly never returns true for "flush" as the flush is handled by tdp_mmu_zap_spte_atomic(). Putting the root without a flush is ok because even if there are stale references to the root in the TLB, they are unreachable because KVM will not run the guest with the same ASID without first flushing (where ASID in this context refers to both SVM's explicit ASID and Intel's implicit ASID that is constructed from VPID+PCID+EPT4A+etc...). Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220226001546.360188-5-seanjc@google.com> Reviewed-by: NMingwei Zhang <mizhang@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Fix misleading and arguably wrong comments in the TDP MMU's fast zap flow. The comments, and the fact that actually zapping invalid roots was added separately, strongly suggests that zapping invalid roots is an optimization and not required for correctness. That is a lie. KVM _must_ zap invalid roots before returning from kvm_mmu_zap_all_fast(), because when it's called from kvm_mmu_invalidate_zap_pages_in_memslot(), KVM is relying on it to fully remove all references to the memslot. Once the memslot is gone, KVM's mmu_notifier hooks will be unable to find the stale references as the hva=>gfn translation is done via the memslots. If KVM doesn't immediately zap SPTEs and userspace unmaps a range after deleting a memslot, KVM will fail to zap in response to the mmu_notifier due to not finding a memslot corresponding to the notifier's range, which leads to a variation of use-after-free. The other misleading comment (and code) explicitly states that roots without a reference should be skipped. While that's technically true, it's also extremely misleading as it should be impossible for KVM to encounter a defunct root on the list while holding mmu_lock for write. Opportunistically add a WARN to enforce that invariant. Fixes: b7cccd39 ("KVM: x86/mmu: Fast invalidation for TDP MMU") Fixes: 4c6654bd ("KVM: x86/mmu: Tear down roots before kvm_mmu_zap_all_fast returns") Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220226001546.360188-4-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 02 3月, 2022 1 次提交
-
-
由 Like Xu 提交于
Just like on the optional mmu_alloc_direct_roots() path, once shadow path reaches "r = -EIO" somewhere, the caller needs to know the actual state in order to enter error handling and avoid something worse. Fixes: 4a38162e ("KVM: MMU: load PDPTRs outside mmu_lock") Signed-off-by: NLike Xu <likexu@tencent.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220301124941.48412-1-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 3月, 2022 3 次提交
-
-
由 Sean Christopherson 提交于
WARN and bail if is_unsync_root() is passed a root for which there is no shadow page, i.e. is passed the physical address of one of the special roots, which do not have an associated shadow page. The current usage squeaks by without bug reports because neither kvm_mmu_sync_roots() nor kvm_mmu_sync_prev_roots() calls the helper with pae_root or pml4_root, and 5-level AMD CPUs are not generally available, i.e. no one can coerce KVM into calling is_unsync_root() on pml5_root. Note, this doesn't fix the mess with 5-level nNPT, it just (hopefully) prevents KVM from crashing. Cc: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220225182248.3812651-8-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Zap only obsolete roots when responding to zapping a single root shadow page. Because KVM keeps root_count elevated when stuffing a previous root into its PGD cache, shadowing a 64-bit guest means that zapping any root causes all vCPUs to reload all roots, even if their current root is not affected by the zap. For many kernels, zapping a single root is a frequent operation, e.g. in Linux it happens whenever an mm is dropped, e.g. process exits, etc... Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Remove the generic kvm_reload_remote_mmus() and open code its functionality into the two x86 callers. x86 is (obviously) the only architecture that uses the hook, and is also the only architecture that uses KVM_REQ_MMU_RELOAD in a way that's consistent with the name. That will change in a future patch, as x86's usage when zapping a single shadow page x86 doesn't actually _need_ to reload all vCPUs' MMUs, only MMUs whose root is being zapped actually need to be reloaded. s390 also uses KVM_REQ_MMU_RELOAD, but for a slightly different purpose. Drop the generic code in anticipation of implementing s390 and x86 arch specific requests, which will allow dropping KVM_REQ_MMU_RELOAD entirely. Opportunistically reword the x86 TDP MMU comment to avoid making references to functions (and requests!) when possible, and to remove the rather ambiguous "this". No functional change intended. Cc: Ben Gardon <bgardon@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NBen Gardon <bgardon@google.com> Message-Id: <20220225182248.3812651-4-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 2月, 2022 9 次提交
-
-
由 Paolo Bonzini 提交于
For cleanliness, do not leave a stale GVA in the cache after all the roots are cleared. In practice, kvm_mmu_load will go through kvm_mmu_sync_roots if paging is on, and will not use vcpu_match_mmio_gva at all if paging is off. However, leaving data in the cache might cause bugs in the future. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Since the guest PGD is now loaded after the MMU has been set up completely, the desired role for a cache hit is simply the current mmu_role. There is no need to compute it again, so __kvm_mmu_new_pgd can be folded in kvm_mmu_new_pgd. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Now that __kvm_mmu_new_pgd does not look at the MMU's root_level and shadow_root_level anymore, pull the PGD load after the initialization of the shadow MMUs. Besides being more intuitive, this enables future simplifications and optimizations because it's not necessary anymore to compute the role outside kvm_init_mmu. In particular, kvm_mmu_reset_context was not attempting to use a cached PGD to avoid having to figure out the new role. With this change, it could follow what nested_{vmx,svm}_load_cr3 are doing, and avoid unloading all the cached roots. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Right now, PGD caching avoids placing a PAE root in the cache by using the old value of mmu->root_level and mmu->shadow_root_level; it does not look for a cached PGD if the old root is a PAE one, and then frees it using kvm_mmu_free_roots. Change the logic instead to free the uncacheable root early. This way, __kvm_new_mmu_pgd is able to look up the cache when going from 32-bit to 64-bit (if there is a hit, the invalid root becomes the least recently used). An example of this is nested virtualization with shadow paging, when a 64-bit L1 runs a 32-bit L2. As a side effect (which is actually the reason why this patch was written), PGD caching does not use the old value of mmu->root_level and mmu->shadow_root_level anymore. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
These functions only operate on a given MMU, of which there is more than one in a vCPU (we care about two, because the third does not have any roots and is only used to walk guest page tables). They do need a struct kvm in order to lock the mmu_lock, but they do not needed anything else in the struct kvm_vcpu. So, pass the vcpu->kvm directly to them. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Right now, PGD caching requires a complicated dance of first computing the MMU role and passing it to __kvm_mmu_new_pgd(), and then separately calling kvm_init_mmu(). Part of this is due to kvm_mmu_free_roots using mmu->root_level and mmu->shadow_root_level to distinguish whether the page table uses a single root or 4 PAE roots. Because kvm_init_mmu() can overwrite mmu->root_level, kvm_mmu_free_roots() must be called before kvm_init_mmu(). However, even after kvm_init_mmu() there is a way to detect whether the page table may hold PAE roots, as root.hpa isn't backed by a shadow when it points at PAE roots. Using this method results in simpler code, and is one less obstacle in moving all calls to __kvm_mmu_new_pgd() after the MMU has been initialized. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The root_hpa and root_pgd fields form essentially a struct kvm_mmu_root_info. Use the struct to have more consistency between mmu->root and mmu->prev_roots. The patch is entirely search and replace except for cached_root_available, which does not need a temporary struct kvm_mmu_root_info anymore. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
WARN and bail if KVM attempts to free a root that isn't backed by a shadow page. KVM allocates a bare page for "special" roots, e.g. when using PAE paging or shadowing 2/3/4-level page tables with 4/5-level, and so root_hpa will be valid but won't be backed by a shadow page. It's all too easy to blindly call mmu_free_root_page() on root_hpa, be nice and WARN instead of crashing KVM and possibly the kernel. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liang Zhang 提交于
In current async pagefault logic, when a page is ready, KVM relies on kvm_arch_can_dequeue_async_page_present() to determine whether to deliver a READY event to the Guest. This function test token value of struct kvm_vcpu_pv_apf_data, which must be reset to zero by Guest kernel when a READY event is finished by Guest. If value is zero meaning that a READY event is done, so the KVM can deliver another. But the kvm_arch_setup_async_pf() may produce a valid token with zero value, which is confused with previous mention and may lead the loss of this READY event. This bug may cause task blocked forever in Guest: INFO: task stress:7532 blocked for more than 1254 seconds. Not tainted 5.10.0 #16 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack: 0 pid: 7532 ppid: 1409 flags:0x00000080 Call Trace: __schedule+0x1e7/0x650 schedule+0x46/0xb0 kvm_async_pf_task_wait_schedule+0xad/0xe0 ? exit_to_user_mode_prepare+0x60/0x70 __kvm_handle_async_pf+0x4f/0xb0 ? asm_exc_page_fault+0x8/0x30 exc_page_fault+0x6f/0x110 ? asm_exc_page_fault+0x8/0x30 asm_exc_page_fault+0x1e/0x30 RIP: 0033:0x402d00 RSP: 002b:00007ffd31912500 EFLAGS: 00010206 RAX: 0000000000071000 RBX: ffffffffffffffff RCX: 00000000021a32b0 RDX: 000000000007d011 RSI: 000000000007d000 RDI: 00000000021262b0 RBP: 00000000021262b0 R08: 0000000000000003 R09: 0000000000000086 R10: 00000000000000eb R11: 00007fefbdf2baa0 R12: 0000000000000000 R13: 0000000000000002 R14: 000000000007d000 R15: 0000000000001000 Signed-off-by: NLiang Zhang <zhangliang5@huawei.com> Message-Id: <20220222031239.1076682-1-zhangliang5@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 2月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Remove mmu_audit.c and all its collateral, the auditing code has suffered severe bitrot, ironically partly due to shadow paging being more stable and thus not benefiting as much from auditing, but mostly due to TDP supplanting shadow paging for non-nested guests and shadowing of nested TDP not heavily stressing the logic that is being audited. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 11 2月, 2022 11 次提交
-
-
由 David Matlack 提交于
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
restore_acc_track_spte() is pure SPTE bit manipulation, making it a good fit for spte.h. And now that the WARN_ON_ONCE() calls have been removed, there isn't any good reason to not inline it. This move also prepares for a follow-up commit that will need to call restore_acc_track_spte() from spte.c No functional change intended. Reviewed-by: NBen Gardon <bgardon@google.com> Reviewed-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-11-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
The new_spte local variable is unnecessary. Deleting it can save a line of code and simplify the remaining lines a bit. No functional change intended. Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-10-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
The warnings in restore_acc_track_spte() can be removed because the only caller checks is_access_track_spte(), and is_access_track_spte() checks !spte_ad_enabled(). In other words, the warning can never be triggered. No functional change intended. Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-9-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
The function formerly known as rmap_write_protect() has been renamed to kvm_vcpu_write_protect_gfn(), so we can get rid of the double underscores in front of __rmap_write_protect(). No functional change intended. Reviewed-by: NBen Gardon <bgardon@google.com> Reviewed-by: NPeter Xu <peterx@redhat.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-3-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
rmap_write_protect() is a poor name because it also write-protects SPTEs in the TDP MMU, not just SPTEs in the rmap. It is also confusing that rmap_write_protect() is not a simple wrapper around __rmap_write_protect(), since that is the common pattern for functions with double-underscore names. Rename rmap_write_protect() to kvm_vcpu_write_protect_gfn() to convey that KVM is write-protecting a specific gfn in the context of a vCPU. No functional change intended. Reviewed-by: NBen Gardon <bgardon@google.com> Reviewed-by: NPeter Xu <peterx@redhat.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-2-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Consolidate the large comment above DEFAULT_SPTE_HOST_WRITABLE with the large comment above is_writable_pte() into one comment. This comment explains the different reasons why an SPTE may be non-writable and KVM keeps track of that with the {Host,MMU}-writable bits. No functional change intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220125230723.1701061-1-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Both "writeable" and "writable" are valid, but we should be consistent about which we use. DEFAULT_SPTE_MMU_WRITEABLE was the odd one out in the SPTE code, so rename it to DEFAULT_SPTE_MMU_WRITABLE. No functional change intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220125230713.1700406-1-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Check SPTE writable invariants when setting SPTEs rather than in spte_can_locklessly_be_made_writable(). By the time KVM checks spte_can_locklessly_be_made_writable(), the SPTE has long been since corrupted. Note that these invariants only apply to shadow-present leaf SPTEs (i.e. not to MMIO SPTEs, non-leaf SPTEs, etc.). Add a comment explaining the restriction and only instrument the code paths that set shadow-present leaf SPTEs. To account for access tracking, also check the SPTE writable invariants when marking an SPTE as an access track SPTE. This also lets us remove a redundant WARN from mark_spte_for_access_track(). Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220125230518.1697048-3-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Rename a variety of kvm_x86_op function pointers so that preferred name for vendor implementations follows the pattern <vendor>_<function>, e.g. rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will allow vendor implementations to be wired up via the KVM_X86_OP macro. In many cases, VMX and SVM "disagree" on the preferred name, though in reality it's VMX and x86 that disagree as SVM blindly prepended _svm to the kvm_x86_ops name. Justification for using the VMX nomenclature: - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an event that has already been "set" in e.g. the vIRR. SVM's relevant VMCB field is even named event_inj, and KVM's stat is irq_injections. - prepare_guest_switch => prepare_switch_to_guest because the former is ambiguous, e.g. it could mean switching between multiple guests, switching from the guest to host, etc... - update_pi_irte => pi_update_irte to allow for matching match the rest of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>(). - start_assignment => pi_start_assignment to again follow VMX's posted interrupt naming scheme, and to provide context for what bit of code might care about an otherwise undescribed "assignment". The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave that change for another time as the Hyper-V hooks always start as NULL, i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all names requires an astounding amount of churn. VMX and SVM function names are intentionally left as is to minimize the diff. Both VMX and SVM will need to rename even more functions in order to fully utilize KVM_X86_OPS, i.e. an additional patch for each is inevitable. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-