- 22 October 2021, 3 commits
-
-
Committed by Lai Jiangshan
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the MMU context doesn't need to be reset; only a flush of the entire guest TLB is required. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Paul pointed out that the error messages when KVM fails to load are unhelpful in understanding exactly what went wrong if userspace probes the "wrong" module. Add a mandatory kvm_x86_ops field to track the vendor module names, kvm_intel and kvm_amd, and use the name in the relevant error messages when KVM fails to load so that the user knows which module failed to load. Opportunistically tweak the "disabled by bios" error message to clarify that _support_ was disabled, not that the module itself was magically disabled by BIOS. Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20211018183929.897461-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by David Stevens
Unify the flags for rmaps and page tracking data, using a single flag in struct kvm_arch and a single loop to go over all the address spaces and memslots. This avoids code duplication between alloc_all_memslots_rmaps and kvm_page_track_enable_mmu_write_tracking. Signed-off-by: David Stevens <stevensd@chromium.org> [This patch is the delta between David's v2 and v3, with conflicts fixed and my own commit message. - Paolo] Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 19 October 2021, 5 commits
-
-
Committed by Oliver Upton
To date, VMM-directed TSC synchronization and migration have been a bit messy. KVM has some baked-in heuristics around TSC writes to infer whether the VMM is attempting to synchronize. This is problematic, as it depends on host userspace writing to the guest's TSC within 1 second of the last write. A much cleaner approach to configuring the guest's views of the TSC is to simply migrate the TSC offset for every vCPU. Offsets are idempotent, and thus not subject to change depending on when the VMM actually reads/writes values from/to KVM. The VMM can then read the TSC once with KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when the guest is paused. Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-8-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
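A minimal VMM-side sketch of the capture step described above, assuming the KVM_VCPU_TSC_CTRL/KVM_VCPU_TSC_OFFSET vCPU device attribute introduced by this series; the helper name is illustrative and error handling is omitted.

```c
#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Save one vCPU's TSC offset at pause time; restore it the same way on the
 * destination with KVM_SET_DEVICE_ATTR. Offsets are idempotent, so it does
 * not matter exactly when the VMM reads or writes them.
 */
static int get_tsc_offset(int vcpu_fd, uint64_t *offset)
{
	struct kvm_device_attr attr = {
		.group = KVM_VCPU_TSC_CTRL,
		.attr  = KVM_VCPU_TSC_OFFSET,
		.addr  = (uint64_t)(unsigned long)offset,
	};

	return ioctl(vcpu_fd, KVM_GET_DEVICE_ATTR, &attr);
}
```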
-
Committed by Oliver Upton
Refactor kvm_synchronize_tsc to extract a new function that allows callers to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly for the sake of participating in TSC synchronization. Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-7-oupton@google.com> [Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by Maxim Levitsky. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini
Protect the reference point for kvmclock with a seqcount, so that kvmclock updates for all vCPUs can proceed in parallel. Xen runstate updates will also run in parallel and not bounce the kvmclock cacheline. Of the variables that were protected by pvclock_gtod_sync_lock, nr_vcpus_matched_tsc is different because it is updated outside pvclock_update_vm_gtod_copy and read inside it. Therefore, we need to keep it protected by a spinlock. In fact it must now be a raw spinlock, because pvclock_update_vm_gtod_copy, being the write side of a seqcount, is non-preemptible. Since we already have tsc_write_lock, which is a raw spinlock, we can just use tsc_write_lock as the lock that protects the write side of the seqcount. Co-developed-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-6-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
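An illustrative sketch (not the resulting KVM code) of the locking scheme described: a raw spinlock doubles as the write-side lock of the seqcount, so readers never block and simply retry if an update races.

```c
static raw_spinlock_t		tsc_write_lock;		/* already a raw spinlock in KVM */
static seqcount_raw_spinlock_t	pvclock_sc;		/* seqcount for the reference point */
static u64			ref_nsec, ref_tsc;	/* the protected reference point */

static void update_reference_point(u64 nsec, u64 tsc)
{
	raw_spin_lock(&tsc_write_lock);
	write_seqcount_begin(&pvclock_sc);	/* write side is non-preemptible */
	ref_nsec = nsec;
	ref_tsc  = tsc;
	write_seqcount_end(&pvclock_sc);
	raw_spin_unlock(&tsc_write_lock);
}

static void read_reference_point(u64 *nsec, u64 *tsc)
{
	unsigned int seq;

	do {					/* lockless reader, retries on a racing update */
		seq = read_seqcount_begin(&pvclock_sc);
		*nsec = ref_nsec;
		*tsc  = ref_tsc;
	} while (read_seqcount_retry(&pvclock_sc, seq));
}
```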
-
Committed by Oliver Upton
Handling the migration of TSCs correctly is difficult, in part because Linux does not provide userspace with the ability to retrieve a (TSC, realtime) clock pair for a single instant in time. In lieu of a more convenient facility, KVM can report similar information in the kvm_clock structure. Provide userspace with a host TSC & realtime pair iff the realtime clock is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid realtime value, advance the KVM clock by the amount of elapsed time. Do not step the KVM clock backwards, though, as it is a monotonic oscillator. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Oliver Upton <oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210916181538.968978-5-oupton@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
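A hedged userspace sketch of consuming the extended KVM_GET_CLOCK data; the flag and field names (KVM_CLOCK_REALTIME, KVM_CLOCK_HOST_TSC, realtime, host_tsc) follow this series.

```c
#include <linux/kvm.h>
#include <stdbool.h>
#include <sys/ioctl.h>

static bool capture_clock_pair(int vm_fd, struct kvm_clock_data *data)
{
	if (ioctl(vm_fd, KVM_GET_CLOCK, data))
		return false;

	/*
	 * data->clock, data->realtime and data->host_tsc were sampled at the
	 * same instant; the pair is only provided when the host clocksource
	 * is TSC-based, which KVM signals via the flags.
	 */
	return (data->flags & KVM_CLOCK_REALTIME) &&
	       (data->flags & KVM_CLOCK_HOST_TSC);
}
```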
-
Committed by Paolo Bonzini
If allocation of rmaps fails, but some of the pointers have already been written, those pointers can be cleaned up when the memslot is freed, or even reused later for another attempt at allocating the rmaps. Therefore there is no need to WARN, as done for example in memslot_rmap_alloc, but the allocation *must* be skipped lest KVM overwrite the previous pointer and indeed leak memory. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 01 October 2021, 6 commits
-
-
Committed by David Stevens
Avoid allocating the gfn_track arrays if nothing needs them. If there are no external-to-KVM users of the API (i.e. no GVT-g), then page tracking is only needed for shadow page tables. This means that when TDP is enabled and there are no external users, the gfn_track arrays can be lazily allocated when the shadow MMU is actually used. This avoids allocations equal to .05% of guest memory when nested virtualization is not used, if the kernel is compiled without GVT-g. Signed-off-by: David Stevens <stevensd@chromium.org> Message-Id: <20210922045859.2011227-3-stevensd@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Juergen Gross
By switching from kfree() to kvfree() in kvm_arch_free_vm(), Arm64 can use the common variant. This can be accomplished by adding another macro, __KVM_HAVE_ARCH_VM_FREE, which will be used only by x86 for now. Further simplification can be achieved by adding __kvm_arch_free_vm() doing the common part. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Message-Id: <20210903130808.30142-5-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
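A rough sketch of the macro/helper arrangement described; exact header placement and naming may differ from the resulting code.

```c
/* common code (include/linux/kvm_host.h) */
static inline void __kvm_arch_free_vm(struct kvm *kvm)
{
	kvfree(kvm);		/* kvfree() copes with both kmalloc'd and vmalloc'd VMs */
}

#ifndef __KVM_HAVE_ARCH_VM_FREE
static inline void kvm_arch_free_vm(struct kvm *kvm)
{
	__kvm_arch_free_vm(kvm);	/* default for all architectures */
}
#endif

/* x86 opts out and layers its own teardown on top of the common part */
#define __KVM_HAVE_ARCH_VM_FREE
void kvm_arch_free_vm(struct kvm *kvm);
```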
-
Committed by Oliver Upton
Sean noticed that KVM_GET_CLOCK was checking kvm_arch.use_master_clock outside of the pvclock sync lock. This is problematic, as the clock value written to the user may or may not actually correspond to a stable TSC. Fix the race by populating the entire kvm_clock_data structure behind the pvclock_gtod_sync_lock. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Oliver Upton <oupton@google.com> Message-Id: <20210916181538.968978-4-oupton@google.com> Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini
No functional change intended. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Paolo Bonzini
Updates to the kvmclock parameters need to do a complicated dance of KVM_REQ_MCLOCK_INPROGRESS and KVM_REQ_CLOCK_UPDATE in addition to taking pvclock_gtod_sync_lock. Place that in two functions that can be called from all of master clock update, KVM_SET_CLOCK, and Hyper-V reenlightenment. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
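A sketch of what the two helpers might look like; the function names and locking details are inferred from the description above, not copied from the resulting code.

```c
static void kvm_start_pvclock_update(struct kvm *kvm)
{
	/* Pin all vCPUs out of guest mode until the update is published. */
	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
	spin_lock_irq(&kvm->arch.pvclock_gtod_sync_lock);
}

static void kvm_end_pvclock_update(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	int i;

	spin_unlock_irq(&kvm->arch.pvclock_gtod_sync_lock);
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);	/* refresh each vCPU's clock */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);	/* let them re-enter the guest */
}
```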
-
Committed by Maxim Levitsky
This was tested by booting a nested guest with TSC=1GHz, observing the clocks, and doing about 100 cycles of migration. Note that a QEMU patch is needed to support migration because of a new MSR that needs to be placed in the migration state. The patch will be sent to the qemu mailing list soon. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210914154825.104886-14-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 30 September 2021, 12 commits
-
-
Committed by Longpeng (Mike)
All of the irqfds need to be updated when the IRQ routing is updated, which is too expensive if there are many irqfds. However, we can reduce the cost by avoiding some unnecessary updates: for MSI-type IRQs on x86, the update can be skipped if the MSI values have not changed. VFIO migration can benefit from this optimization. The test VM has 128 vCPUs and 8 VFs (with 65 vectors enabled), so the VM has more than 520 irqfds. We measure the cost of vfio_msix_enable (in QEMU, it sets routing for each irqfd) for each VF, and the total cost is significantly reduced:

       Origin   With this patch
1st    8        4
2nd    15       5
3rd    22       6
4th    24       6
5th    36       7
6th    44       7
7th    51       8
8th    58       8
Total  258ms    51ms

We're also trying to optimize the QEMU part [1], but it is still worthwhile to optimize KVM to gain more benefit.

[1] https://lists.gnu.org/archive/html/qemu-devel/2021-08/msg04215.html

Signed-off-by: Longpeng (Mike) <longpeng2@huawei.com> Message-Id: <20210827080003.2689-1-longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
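An illustrative sketch of the check that makes skipping possible: an irqfd's previously resolved MSI routing entry is compared with the new one, and the expensive update is skipped when nothing changed (the helper name is hypothetical).

```c
#include <linux/kvm_host.h>

static bool msi_route_unchanged(const struct kvm_kernel_irq_routing_entry *old,
				const struct kvm_kernel_irq_routing_entry *new)
{
	/* Only MSI entries can be compared field-by-field like this. */
	return old->type == KVM_IRQ_ROUTING_MSI &&
	       new->type == KVM_IRQ_ROUTING_MSI &&
	       old->msi.address_lo == new->msi.address_lo &&
	       old->msi.address_hi == new->msi.address_hi &&
	       old->msi.data == new->msi.data;
}
```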
-
Committed by Sean Christopherson
Manually look for a CPUID.0x1 entry instead of bouncing through kvm_cpuid() when retrieving the Family-Model-Stepping information for vCPU RESET/INIT. This fixes a potential undefined-behavior bug due to kvm_cpuid() using the uninitialized "dummy" param as the ECX _input_, a.k.a. the index. A more minimal fix would be to simply zero "dummy", but the extra work in kvm_cpuid() is wasteful, and KVM should be treating the FMS retrieval as an out-of-band access, e.g. same as how KVM computes guest.MAXPHYADDR. Both Intel's SDM and AMD's APM describe the RDX value at RESET/INIT as holding the CPU's FMS information, not as holding CPUID.0x1.EAX. KVM's usage of CPUID entries to get FMS is simply a pragmatic approach to avoid having yet another way for userspace to provide inconsistent data. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210929222426.1855730-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
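A hedged sketch of the lookup described above; the helper name and the 0x600 fallback are illustrative assumptions, not necessarily the exact values used by the patch.

```c
static u32 vcpu_reset_fms(struct kvm_vcpu *vcpu)
{
	struct kvm_cpuid_entry2 *entry;
	u32 eax = 0x600;	/* arbitrary-but-sane default family/model/stepping */

	/* The index (ECX) is passed explicitly instead of an uninitialized "dummy". */
	entry = kvm_find_cpuid_entry(vcpu, 0x1, 0);
	if (entry)
		eax = entry->eax;

	return eax;		/* written to RDX at RESET/INIT */
}
```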
-
Committed by Sean Christopherson
WARN if CR0, CR3, or CR4 are non-zero at RESET, which, given the current KVM implementation, really means WARN if they're not zeroed at vCPU creation. VMX in particular has several ->set_*() flows that read other registers to handle side effects, and because those flows are common to RESET and INIT, KVM subtly relies on emulated/virtualized registers to be zeroed at vCPU creation in order to do the right thing at RESET. Use CRs as a sentinel because they are most likely to be written as side effects, and because KVM specifically needs CR0.PG and CR0.PE to be '0' to correctly reflect the state of the vCPU's MMU. CRs are also loaded and stored from/to the VMCS, so this adds some level of coverage to verify that KVM doesn't conflate zero-allocating the VMCS with properly initializing the VMCS with VMWRITEs. Note, '0' is somewhat arbitrary; vCPU creation can technically stuff any value for a register so long as it's coherent with respect to the current vCPU state. In practice, '0' works for all registers and is convenient. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Move the few bits of relevant fx_init() code into kvm_arch_vcpu_create(), dropping the superfluous check on vcpu->arch.guest_fpu that was blindly and wrongly added by commit ed02b213 ("KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest"). Note, KVM currently allocates and then frees FPU state for SEV-ES guests, rather than avoiding the allocation in the first place. While that approach is inarguably inefficient and unnecessary, it's a cleanup for the future. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Drop code to initialize XCR0 during fx_init(), a.k.a. vCPU creation, as XCR0 has been initialized during kvm_vcpu_reset() (for RESET) since commit a554d207 ("KVM: X86: Processor States following Reset or INIT"). Back when XCR0 support was added by commit 2acf923e ("KVM: VMX: Enable XSAVE/XRSTOR for guest"), KVM didn't differentiate between RESET and INIT. Ignoring the fact that calling fx_init() for INIT is obviously wrong, e.g. FPU state after INIT is not the same as after RESET, setting XCR0 in fx_init() was correct. Eventually fx_init() got moved to kvm_arch_vcpu_init(), a.k.a. vCPU creation (ignore the terrible name) by commit 0ee6a517 ("x86/fpu, kvm: Simplify fx_init()"). Finally, commit 95a0d01e ("KVM: x86: Move all vcpu init code into kvm_arch_vcpu_create()") killed off kvm_arch_vcpu_init(), leaving behind the oddity of redundant setting of guest state during vCPU creation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Drop code to set CR0.ET for the guest during initialization of the guest FPU. The code was added as a misguided bug fix by commit 380102c8 ("KVM Set the ET flag in CR0 after initializing FX") to resolve an issue where vcpu->cr0 (now vcpu->arch.cr0) was not correctly initialized on SVM systems. While init_vmcb() did set CR0.ET, it only did so in the VMCB, and subtly did not update vcpu->cr0. Stuffing CR0.ET worked around the immediate problem, but did not fix the real bug of vcpu->cr0 and the VMCB being out of sync. That underlying bug was eventually remedied by commit 18fa000a ("KVM: SVM: Reset cr0 properly on vcpu reset"). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Do not blindly mark all registers as available+dirty at RESET/INIT, and instead rely on writes to registers to go through the proper mutators or to explicitly mark registers as dirty. INIT in particular does not blindly overwrite all registers, e.g. select bits in CR0 are preserved across INIT, thus marking registers available+dirty without first reading the register from hardware is incorrect. In practice this is a benign bug, as KVM doesn't let the guest control CR0 bits that are preserved across INIT, and all other true registers are explicitly written during the RESET/INIT flows. The PDPTRs and EX_INFO "registers" are not explicitly written, but accessing those values during RESET/INIT is nonsensical and would be a KVM bug regardless of register caching. Fixes: 66f7b72e ("KVM: x86: Make register state after reset conform to specification") [sean: !!! NOT FOR STABLE !!!] Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210921000303.400537-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Replace impressively complex "logic" for computing the page offset from CR3 when loading PDPTRs. Unlike other paging modes, the address held in CR3 for PAE paging is 32-byte aligned, i.e. occupies bits 31:5, thus bits 11:5 need to be used as the offset from the gfn when reading PDPTRs. The existing calculation originated in commit 1342d353 ("[PATCH] KVM: MMU: Load the pae pdptrs on cr3 change like the processor does"), which read the PDPTRs from guest memory as individual 8-byte loads. At the time, the so-called "offset" was the base index of PDPTR0 as a _u64_, not a byte offset. Naming aside, the computation was useful and arguably simplified the overall flow. Unfortunately, when commit 195aefde ("KVM: Add general accessors to read and write guest memory") added accessors with offsets at byte granularity, the cleverness of the original code was lost and KVM was left with convoluted code for a simple operation. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
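A minimal sketch of the simplified math: CR3 bits 11:5 give the byte offset of the 32-byte-aligned PDPT within its page, so all four PDPTEs can be fetched with one read (the function name and exact accessor are illustrative).

```c
static int load_pae_pdptes_sketch(struct kvm_vcpu *vcpu, unsigned long cr3,
				  u64 pdpte[4])
{
	gfn_t gfn = cr3 >> PAGE_SHIFT;
	unsigned int offset = cr3 & GENMASK(11, 5);	/* PDPT is 32-byte aligned */

	/* One 32-byte read covering all four PDPTEs. */
	return kvm_vcpu_read_guest_page(vcpu, gfn, pdpte, offset, 4 * sizeof(u64));
}
```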
-
Committed by Sean Christopherson
Open code the call to mmu->translate_gpa() when loading nested PDPTRs and kill off the existing helper, kvm_read_guest_page_mmu(), to discourage incorrect use. Reading guest memory straight from an L2 GPA is extremely rare (as evidenced by the lack of users), as very few constructs in x86 specify physical addresses, even fewer are virtualized by KVM, and even fewer yet require emulation of L2 by L0 KVM. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210831164224.1119728-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Juergen Gross
KVM_MAX_VCPU_ID does not specify the highest allowed vcpu-id, but rather the number of allowed vcpu-ids. This has already led to confusion, so rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS to make its semantics clearer. Suggested-by: Eduardo Habkost <ehabkost@redhat.com> Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210913135745.13944-3-jgross@suse.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Vitaly Kuznetsov
kvm_make_vcpus_request_mask() already disables preemption, so just like kvm_make_all_cpus_request_except() it can be switched to using pre-allocated per-cpu cpumasks. This allows for improvements for both users of the function: in Hyper-V emulation code 'tlb_flush' can now be dropped from 'struct kvm_vcpu_hv', and kvm_make_scan_ioapic_request_mask() gets rid of dynamic allocation. The cpumask_available() checks in kvm_make_vcpu_request() and kvm_kick_many_cpus() can now be dropped as they check for an impossible condition: kvm_init() makes sure per-cpu masks are allocated. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Vitaly Kuznetsov
Both remaining callers of kvm_make_vcpus_request_mask() pass 'NULL' for the 'except' parameter, so it can just be dropped. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210903075141.403071-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 22 September 2021, 6 commits
-
-
Committed by Fares Mehanna
Intel PMU MSRs are in msrs_to_save_all[], so add the AMD PMU MSRs as well to have consistent behavior between Intel and AMD when using KVM_GET_MSRS, KVM_SET_MSRS or KVM_GET_MSR_INDEX_LIST. We have to add both the legacy and new MSRs to handle guests running without X86_FEATURE_PERFCTR_CORE. Signed-off-by: Fares Mehanna <faresx@amazon.de> Message-Id: <20210915133951.22389-1-faresx@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Maxim Levitsky
When exiting SMM, the PDPTRs are loaded again from guest memory. This fixes a theoretical bug where an exit from SMM triggers entry to a nested guest, which reuses some of the migration code that relies on this flag as a workaround for legacy userspace. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210913140954.165665-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Use vcpu_idx to identify vCPU0 when updating Hyper-V's TSC page, which is shared by all vCPUs and "owned" by vCPU0 (because vCPU0 is the only vCPU that's guaranteed to exist). Using kvm_get_vcpu() to find vCPU0 works, but it's a rather odd and suboptimal method to check the index of a given vCPU. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210910183220.2397812-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Haimin Zhang
Check the return of init_srcu_struct(), which can fail due to OOM, when initializing the page track mechanism. Lack of checking leads to a NULL pointer deref found by a modified syzkaller. Reported-by: TCS Robot <tcs_robot@tencent.com> Signed-off-by: Haimin Zhang <tcs_kernel@tencent.com> Message-Id: <1630636626-12262-1-git-send-email-tcs_kernel@tencent.com> [Move the call towards the beginning of kvm_arch_init_vm. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
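A minimal sketch of the fix described, assuming kvm_page_track_init() is changed to propagate the error; field names follow the page-track notifier head of this era and details may differ.

```c
int kvm_page_track_init(struct kvm *kvm)
{
	struct kvm_page_track_notifier_head *head = &kvm->arch.track_notifier_head;
	int ret;

	ret = init_srcu_struct(&head->track_srcu);	/* can fail with -ENOMEM */
	if (ret)
		return ret;				/* propagate instead of ignoring */

	INIT_HLIST_HEAD(&head->track_notifier_list);
	return 0;
}
```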
-
Committed by Sean Christopherson
Explicitly zero the guest's CR3 and mark it available+dirty at RESET/INIT. Per Intel's SDM and AMD's APM, CR3 is zeroed at both RESET and INIT. For RESET, this is a nop as the vCPU is zero-allocated. For INIT, the bug has likely escaped notice because no firmware/kernel puts its page tables root at PA=0, let alone relies on INIT to get the desired CR3 for such page tables. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Sean Christopherson
Mark all registers as available and dirty at vCPU creation, as the vCPU has obviously not been loaded into hardware, let alone been given the chance to be modified in hardware. On SVM, reading from "uninitialized" hardware is a non-issue as VMCBs are zero allocated (thus not truly uninitialized) and hardware does not allow for arbitrary field encoding schemes. On VMX, backing memory for VMCSes is also zero allocated, but true initialization of the VMCS _technically_ requires VMWRITEs, as the VMX architectural specification technically allows CPU implementations to encode fields with arbitrary schemes. E.g. a CPU could theoretically store the inverted value of every field, which would result in a VMREAD of a zero-allocated field returning all ones. In practice, only the AR_BYTES fields are known to be manipulated by hardware during VMREAD/VMWRITE; no known hardware or VMM (for nested VMX) does fancy encoding of cacheable field values (CR0, CR3, CR4, etc...). In other words, this is technically a bug fix, but practically speaking it's a glorified nop. Failure to mark registers as available has been a lurking bug for quite some time. The original register caching supported only GPRs (+RIP, which is kinda sorta a GPR), with the masks initialized at ->vcpu_reset(). That worked because the two cacheable registers, RIP and RSP, are generally speaking not read as side effects in other flows. Arguably, commit aff48baa ("KVM: Fetch guest cr3 from hardware on demand") was the first instance of failure to mark regs available. While _just_ marking CR3 available during vCPU creation wouldn't have fixed the VMREAD from an uninitialized VMCS bug because ept_update_paging_mode_cr0() unconditionally read vmcs.GUEST_CR3, marking CR3 _and_ intentionally not reading GUEST_CR3 when it's available would have avoided VMREAD to a technically-uninitialized VMCS. Fixes: aff48baa ("KVM: Fetch guest cr3 from hardware on demand") Fixes: 6de4f3ad ("KVM: Cache pdptrs") Fixes: 6de12732 ("KVM: VMX: Optimize vmx_get_rflags()") Fixes: 2fb92db1 ("KVM: VMX: Cache vmcs segment fields") Fixes: bd31fe49 ("KVM: VMX: Add proper cache tracking for CR0") Fixes: f98c1e77 ("KVM: VMX: Add proper cache tracking for CR4") Fixes: 5addc235 ("KVM: VMX: Cache vmcs.EXIT_QUALIFICATION using arch avail_reg flags") Fixes: 87915858 ("KVM: VMX: Cache vmcs.EXIT_INTR_INFO using arch avail_reg flags") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210921000303.400537-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 06 September 2021, 1 commit
-
-
Committed by Zelin Deng
When MSR_IA32_TSC_ADJUST is written by the guest due to the TSC ADJUST feature, especially when there is a big TSC warp (e.g. a new vCPU is hot-added into a VM which has been up for a long time), tsc_offset is increased by a large value and then the vCPU returns to the guest. This causes the system time to jump, as tsc_timestamp is not adjusted in the meantime and pvclock is expected to be monotonic. To fix this, just notify KVM to update the vCPU's guest time before going back to the guest. Cc: stable@vger.kernel.org Signed-off-by: Zelin Deng <zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <1619576521-81399-2-git-send-email-zelin.deng@linux.alibaba.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
- 21 August 2021, 7 commits
-
-
Committed by Maxim Levitsky
KVM_GUESTDBG_BLOCKIRQ will allow KVM to block all interrupts while running. This change is mostly intended for more robust single stepping of the guest, and it has the following benefits when enabled:

* Resuming from a breakpoint is much more reliable. When resuming execution from a breakpoint, with interrupts enabled, more often than not KVM would inject an interrupt and make the CPU jump immediately to the interrupt handler and eventually return to the breakpoint, to trigger it again. From the user's point of view it looks like the CPU never executed a single instruction, and in some cases that can even prevent forward progress, for example when the breakpoint is placed by an automated script (e.g. lx-symbols), which does something in response to the breakpoint and then continues the guest automatically. If the script execution takes enough time for another interrupt to arrive, the guest will be stuck on the same breakpoint RIP forever.

* Normal single stepping is much more predictable, since it won't land the debugger in an interrupt handler.

* RFLAGS.TF has less chance of being leaked to the guest: we set that flag behind the guest's back to do single stepping, but if single stepping lands us in an interrupt/exception handler, it will be leaked to the guest in the form of being pushed to the stack. This doesn't completely eliminate the problem, as exceptions can still happen, but at least it reduces the chances of it happening.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210811122927.900604-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
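A userspace-side sketch of enabling the behavior described above: the new KVM_GUESTDBG_BLOCKIRQ flag is set together with single-stepping via KVM_SET_GUEST_DEBUG.

```c
#include <linux/kvm.h>
#include <sys/ioctl.h>

static int enable_blockirq_singlestep(int vcpu_fd)
{
	struct kvm_guest_debug dbg = {
		.control = KVM_GUESTDBG_ENABLE |
			   KVM_GUESTDBG_SINGLESTEP |
			   KVM_GUESTDBG_BLOCKIRQ,	/* no interrupt injection while stepping */
	};

	return ioctl(vcpu_fd, KVM_SET_GUEST_DEBUG, &dbg);
}
```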
-
Committed by Mingwei Zhang
Existing KVM code tracks the number of large pages regardless of their sizes. Therefore, when large pages of 1GB (or larger) are adopted, the information becomes less useful because lpages counts a mix of 1G and 2M pages. So remove lpages, since it is easy for user space to aggregate the info. Instead, provide comprehensive page stats for all sizes from 4K to 512G. Suggested-by: Ben Gardon <bgardon@google.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Mingwei Zhang <mizhang@google.com> Cc: Jing Zhang <jingzhangos@google.com> Cc: David Matlack <dmatlack@google.com> Cc: Sean Christopherson <seanjc@google.com> Message-Id: <20210803044607.599629-4-mizhang@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Jing Zhang
Add new types of KVM stats: linear and logarithmic histograms. Histograms are very useful for observing the value distribution of time- or size-related stats. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210802165633.1866976-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Maxim Levitsky
Since AVIC can be inhibited and uninhibited rapidly, it is possible that we have nothing to do by the time svm_refresh_apicv_exec_ctrl is called. Detect and avoid this, which will be useful when we start calling avic_vcpu_load/avic_vcpu_put when the AVIC inhibition state changes. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-14-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Maxim Levitsky
Currently on SVM, kvm_request_apicv_update toggles the APICv memslot without doing any synchronization. If there is a mismatch between that memslot state and the AVIC state on one of the vCPUs, an APIC MMIO access can be lost. For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: access an APIC MMIO register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted by it, and neither will it cause an MMIO fault; rather, it will just be read/written from/to the dummy page mapped into the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT. Fix that by adding a lock guarding the AVIC state changes, and carefully order the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the APIC inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of guest mode, but don't yet see the new AVIC state. Then only after (4) can all other vCPUs update their AVIC state and resume. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-10-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
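An illustrative sketch of that ordering, with simplified names; the actual lock type and request plumbing in KVM may differ.

```c
void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
{
	mutex_lock(&kvm->arch.apicv_update_lock);		/* 1. take the lock */

	/* 2. kick every vCPU out of guest mode; none sees the new state yet */
	kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);

	/* 3. update the inhibit reason while the vCPUs are held off */
	if (activate)
		clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
	else
		set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

	mutex_unlock(&kvm->arch.apicv_update_lock);		/* 4. release; vCPUs refresh and resume */
}
```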
-
Committed by Maxim Levitsky
Thanks to the former patches, it is now possible to keep the APICv memslot always enabled; it will be invisible to the guest when APICv is inhibited. This code is based on a suggestion from Sean Christopherson: https://lkml.org/lkml/2021/7/19/2970 Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210810205251.424103-9-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-
Committed by Peter Xu
Introduce kvm_mmu_slot_lpages() to calculate the lpage_info and rmap array sizes. The other helper, __kvm_mmu_slot_lpages(), can take an extra npages parameter rather than fetching it from the memslot pointer. Start to use the latter in kvm_alloc_memslot_metadata(). Signed-off-by: Peter Xu <peterx@redhat.com> Message-Id: <20210730220455.26054-4-peterx@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
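A sketch of the wrapper pair described (declarations simplified): the inner helper computes how many level-N large pages are spanned by npages starting at the slot's base gfn, and the outer one simply feeds it slot->npages.

```c
static inline unsigned long __kvm_mmu_slot_lpages(struct kvm_memory_slot *slot,
						  unsigned long npages, int level)
{
	/* Index of the last gfn at this level, plus one, gives the array size. */
	return gfn_to_index(slot->base_gfn + npages - 1, slot->base_gfn, level) + 1;
}

static inline unsigned long kvm_mmu_slot_lpages(struct kvm_memory_slot *slot,
						int level)
{
	return __kvm_mmu_slot_lpages(slot, slot->npages, level);
}
```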
-