- 22 10月, 2021 12 次提交
-
-
由 Lai Jiangshan 提交于
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context doesn't need to be reset. It is only required to flush all the guest tlb. Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wanpeng Li 提交于
SDM mentioned that, RDPMC: IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter)) THEN EAX := counter[31:0]; EDX := ZeroExtend(counter[MSCB:32]); ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *) #GP(0); FI; Let's add a comment why CR0.PE isn't tested since it's impossible for CPL to be >0 if CR0.PE=0. Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Message-Id: <1634724836-73721-1-git-send-email-wanpengli@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Paul pointed out the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel and kvm_amd, and use the name for relevant error message when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: NPaul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Junaid Shahid 提交于
Currently, the NX huge page recovery thread wakes up every minute and zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX huge pages at a time. This is intended to ensure that only a relatively small number of pages get zapped at a time. But for very large VMs (or more specifically, VMs with a large number of executable pages), a period of 1 minute could still result in this number being too high (unless the ratio is changed significantly, but that can result in split pages lingering on for too long). This change makes the period configurable instead of fixing it at 1 minute. Users of large VMs can then adjust the period and/or the ratio to reduce the number of pages zapped at one time while still maintaining the same overall duration for cycling through the entire list. By default, KVM derives a period from the ratio such that a page will remain on the list for 1 hour on average. Signed-off-by: NJunaid Shahid <junaids@google.com> Message-Id: <20211020010627.305925-1-junaids@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wanpeng Li 提交于
SDM section 18.2.3 mentioned that: "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of any general-purpose or fixed-function counters via a single WRMSR." It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing, the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop global_ovf_ctrl variable. Tested-by: NLike Xu <likexu@tencent.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
slot_handle_leaf is a misnomer because it only operates on 4K SPTEs whereas "leaf" is used to describe any valid terminal SPTE (4K or large page). Rename slot_handle_leaf to slot_handle_level_4k to avoid confusion. Making this change makes it more obvious there is a benign discrepency between the legacy MMU and the TDP MMU when it comes to dirty logging. The legacy MMU only iterates through 4K SPTEs when zapping for collapsing and when clearing D-bits. The TDP MMU, on the other hand, iterates through SPTEs on all levels. The TDP MMU behavior of zapping SPTEs at all levels is technically overkill for its current dirty logging implementation, which always demotes to 4k SPTES, but both the TDP MMU and legacy MMU zap if and only if the SPTE can be replaced by a larger page, i.e. will not spuriously zap 2m (or larger) SPTEs. Opportunistically add comments to explain this discrepency in the code. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20211019162223.3935109-1-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
Per Intel SDM, RTIT_CTL_BRANCH_EN bit has no dependency on any CPUID leaf 0x14. Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-5-xiaoyao.li@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
To better self explain the meaning of this field and match the PT_CAP_num_address_ranges constatn. Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
The number of valid PT ADDR MSRs for the guest is precomputed in vmx->pt_desc.addr_range. Use it instead of calculating again. Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-3-xiaoyao.li@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
A minor optimization to WRMSR MSR_IA32_RTIT_CTL when necessary. Opportunistically refine the comment to call out that KVM requires VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NXiaoyao Li <xiaoyao.li@intel.com> Message-Id: <20210827070249.924633-2-xiaoyao.li@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
"prefetch", "prefault" and "speculative" are used throughout KVM to mean the same thing. Use a single name, standardizing on "prefetch" which is already used by various functions such as direct_pte_prefetch, FNAME(prefetch_gpte), FNAME(pte_prefetch), etc. Suggested-by: NDavid Matlack <dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Stevens 提交于
Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: NDavid Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 10月, 2021 6 次提交
-
-
由 Oliver Upton 提交于
To date, VMM-directed TSC synchronization and migration has been a bit messy. KVM has some baked-in heuristics around TSC writes to infer if the VMM is attempting to synchronize. This is problematic, as it depends on host userspace writing to the guest's TSC within 1 second of the last write. A much cleaner approach to configuring the guest's views of the TSC is to simply migrate the TSC offset for every vCPU. Offsets are idempotent, and thus not subject to change depending on when the VMM actually reads/writes values from/to KVM. The VMM can then read the TSC once with KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when the guest is paused. Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-8-oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Oliver Upton 提交于
Refactor kvm_synchronize_tsc to make a new function that allows callers to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly for the sake of participating in TSC synchronization. Signed-off-by: NOliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-7-oupton@google.com> [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by Maxim Levitsky. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Protect the reference point for kvmclock with a seqcount, so that kvmclock updates for all vCPUs can proceed in parallel. Xen runstate updates will also run in parallel and not bounce the kvmclock cacheline. Of the variables that were protected by pvclock_gtod_sync_lock, nr_vcpus_matched_tsc is different because it is updated outside pvclock_update_vm_gtod_copy and read inside it. Therefore, we need to keep it protected by a spinlock. In fact it must now be a raw spinlock, because pvclock_update_vm_gtod_copy, being the write-side of a seqcount, is non-preemptible. Since we already have tsc_write_lock which is a raw spinlock, we can just use tsc_write_lock as the lock that protects the write-side of the seqcount. Co-developed-by: NOliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-6-oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Oliver Upton 提交于
Handling the migration of TSCs correctly is difficult, in part because Linux does not provide userspace with the ability to retrieve a (TSC, realtime) clock pair for a single instant in time. In lieu of a more convenient facility, KVM can report similar information in the kvm_clock structure. Provide userspace with a host TSC & realtime pair iff the realtime clock is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid realtime value, advance the KVM clock by the amount of elapsed time. Do not step the KVM clock backwards, though, as it is a monotonic oscillator. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-5-oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
This is a new warning in clang top-of-tree (will be clang 14): In file included from arch/x86/kvm/mmu/mmu.c:27: arch/x86/kvm/mmu/spte.h:318:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical] return __is_bad_mt_xwr(rsvd_check, spte) | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ || arch/x86/kvm/mmu/spte.h:318:9: note: cast one or both operands to int to silence this warning The code is fine, but change it anyway to shut up this clever clogs of a compiler. Reported-by: torvic9@mailbox.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be skipped lest KVM will overwrite the previous pointer and will indeed leak memory. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 18 10月, 2021 1 次提交
-
-
由 Andrei Vagin 提交于
This looks like a typo in 8f32d5e5. This change didn't intend to do any functional changes. The problem was caught by gVisor tests. Fixes: 8f32d5e5 ("KVM: x86/mmu: allow kvm_faultin_pfn to return page fault handling code") Cc: Maxim Levitsky <mlevitsk@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: NAndrei Vagin <avagin@gmail.com> Message-Id: <20211015163221.472508-1-avagin@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 15 10月, 2021 1 次提交
-
-
由 Paolo Bonzini 提交于
The size of the data in the scratch buffer is not divided by the size of each port I/O operation, so vcpu->arch.pio.count ends up being larger than it should be by a factor of size. Cc: stable@vger.kernel.org Fixes: 7ed9abfe ("KVM: SVM: Support string IO operations for an SEV-ES guest") Acked-by: NTom Lendacky <thomas.lendacky@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 10月, 2021 20 次提交
-
-
由 David Stevens 提交于
Avoid allocating the gfn_track arrays if nothing needs them. If there are no external to KVM users of the API (i.e. no GVT-g), then page tracking is only needed for shadow page tables. This means that when tdp is enabled and there are no external users, then the gfn_track arrays can be lazily allocated when the shadow MMU is actually used. This avoid allocations equal to .05% of guest memory when nested virtualization is not used, if the kernel is compiled without GVT-g. Signed-off-by: NDavid Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-3-stevensd@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Stevens 提交于
Add a config option that allows kvm to determine whether or not there are any external users of page tracking. Signed-off-by: NDavid Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-2-stevensd@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Krish Sadhukhan 提交于
According to section "TLB Flush" in APM vol 2, "Support for TLB_CONTROL commands other than the first two, is optional and is indicated by CPUID Fn8000_000A_EDX[FlushByAsid]. All encodings of TLB_CONTROL not defined in the APM are reserved." Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Message-Id: <20210920235134.101970-3-krish.sadhukhan@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Juergen Gross 提交于
By switching from kfree() to kvfree() in kvm_arch_free_vm() Arm64 can use the common variant. This can be accomplished by adding another macro __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now. Further simplification can be achieved by adding __kvm_arch_free_vm() doing the common part. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NJuergen Gross <jgross@suse.com> Message-Id: <20210903130808.30142-5-jgross@suse.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Babu Moger 提交于
Predictive Store Forwarding: AMD Zen3 processors feature a new technology called Predictive Store Forwarding (PSF). PSF is a hardware-based micro-architectural optimization designed to improve the performance of code execution by predicting address dependencies between loads and stores. How PSF works: It is very common for a CPU to execute a load instruction to an address that was recently written by a store. Modern CPUs implement a technique known as Store-To-Load-Forwarding (STLF) to improve performance in such cases. With STLF, data from the store is forwarded directly to the load without having to wait for it to be written to memory. In a typical CPU, STLF occurs after the address of both the load and store are calculated and determined to match. PSF expands on this by speculating on the relationship between loads and stores without waiting for the address calculation to complete. With PSF, the CPU learns over time the relationship between loads and stores. If STLF typically occurs between a particular store and load, the CPU will remember this. In typical code, PSF provides a performance benefit by speculating on the load result and allowing later instructions to begin execution sooner than they otherwise would be able to. The details of security analysis of AMD predictive store forwarding is documented here. https://www.amd.com/system/files/documents/security-analysis-predictive-store-forwarding.pdf Predictive Store Forwarding controls: There are two hardware control bits which influence the PSF feature: - MSR 48h bit 2 – Speculative Store Bypass (SSBD) - MSR 48h bit 7 – Predictive Store Forwarding Disable (PSFD) The PSF feature is disabled if either of these bits are set. These bits are controllable on a per-thread basis in an SMT system. By default, both SSBD and PSFD are 0 meaning that the speculation features are enabled. While the SSBD bit disables PSF and speculative store bypass, PSFD only disables PSF. PSFD may be desirable for software which is concerned with the speculative behavior of PSF but desires a smaller performance impact than setting SSBD. Support for PSFD is indicated in CPUID Fn8000_0008 EBX[28]. All processors that support PSF will also support PSFD. Linux kernel does not have the interface to enable/disable PSFD yet. Plan here is to expose the PSFD technology to KVM so that the guest kernel can make use of it if they wish to. Signed-off-by: NBabu Moger <Babu.Moger@amd.com> Message-Id: <163244601049.30292.5855870305350227855.stgit@bmoger-ubuntu> [Keep feature private to KVM, as requested by Borislav Petkov. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
mmu_try_to_unsync_pages checks if page tracking is active for the given gfn, which requires knowing the memslot. We can pass down the memslot via make_spte to avoid this lookup. The memslot is also handy for make_spte's marking of the gfn as dirty: we can test whether dirty page tracking is enabled, and if so ensure that pages are mapped as writable with 4K granularity. Apart from the warning, no functional change is intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-7-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Avoid the memslot lookup in rmap_add, by passing it down from the fault handling code to mmu_set_spte and then to rmap_add. No functional change intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-6-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
mmu_set_spte is called for either PTE prefetching or page faults. The three boolean arguments write_fault, speculative and host_writable are always respectively false/true/true for prefetching and coming from a struct kvm_page_fault for page faults. Let mmu_set_spte distinguish these two situation by accepting a possibly NULL struct kvm_page_fault argument. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The level and A/D bit support of the new SPTE can be found in the role, which is stored in the kvm_mmu_page struct. This merges two arguments into one. For the TDP MMU, the kvm_mmu_page was not used (kvm_tdp_mmu_map does not use it if the SPTE is already present) so we fetch it just before calling make_spte. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Prepare for removing the ad_disabled argument of make_spte; instead it can be found in the role of a struct kvm_mmu_page. First of all, the TDP MMU must set the role accurately. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
The level of the new SPTE can be found in the kvm_mmu_page struct; there is no need to pass it down. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Now that make_spte is called directly by the shadow MMU (rather than wrapped by set_spte), it only has to return one boolean value. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Since the two callers of set_spte do different things with the results, inlining it actually makes the code simpler to reason about. For example, FNAME(sync_page) already has a struct kvm_mmu_page *, but set_spte had to fish it back out of sptep's private page data. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Since the two callers of set_spte do different things with the results, inlining it actually makes the code simpler to reason about. For example, mmu_set_spte looks quite like tdp_mmu_map_handle_target_level, but the similarity is hidden by set_spte. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Now that kvm_page_fault has a pointer to the memslot it can be passed down to the page tracking code to avoid a redundant slot lookup. No functional change intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-5-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
The memslot for the faulting gfn is used throughout the page fault handling code, so capture it in kvm_page_fault as soon as we know the gfn and use it in the page fault handling code that has direct access to the kvm_page_fault struct. Replace various tests using is_noslot_pfn with more direct tests on fault->slot being NULL. This, in combination with the subsequent patch, improves "Populate memory time" in dirty_log_perf_test by 5% when using the legacy MMU. There is no discerable improvement to the performance of the TDP MMU. No functional change intended. Suggested-by: NBen Gardon <bgardon@google.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-4-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
tdp_mmu_map_set_spte_atomic is not taking care of dirty logging anymore, the only difference that remains is that it takes a vCPU instead of the struct kvm. Merge the two functions. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
This simplifies set_spte, which we want to remove, and unifies code between the shadow MMU and the TDP MMU. The warning will be added back later to make_spte as well. There is a small disadvantage in the TDP MMU; it may unnecessarily mark a page as dirty twice if two vCPUs end up mapping the same page twice. However, this is a very small cost for a case that is already rare. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
Consolidate rmap_recycle and rmap_add into a single function since they are only ever called together (and only from one place). This has a nice side effect of eliminating an extra kvm_vcpu_gfn_to_memslot(). In addition it makes mmu_set_spte(), which is a very long function, a little shorter. No functional change intended. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20210813203504.2742757-3-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
WARN and bail if the shadow walk for faulting in a SPTE terminates early, i.e. doesn't reach the expected level because the walk encountered a terminal SPTE. The shadow walks for page faults are subtle in that they install non-leaf SPTEs (zapping leaf SPTEs if necessary!) in the loop body, and consume the newly created non-leaf SPTE in the loop control, e.g. __shadow_walk_next(). In other words, the walks guarantee that the walk will stop if and only if the target level is reached by installing non-leaf SPTEs to guarantee the walk remains valid. Opportunistically use fault->goal-level instead of it.level in FNAME(fetch) to further clarify that KVM always installs the leaf SPTE at the target level. Reviewed-by: NLai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210906122547.263316-1-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-