- 11 2月, 2022 6 次提交
-
-
由 David Matlack 提交于
When using KVM_DIRTY_LOG_INITIALLY_SET, huge pages are not write-protected when dirty logging is enabled on the memslot. Instead they are write-protected once userspace invokes KVM_CLEAR_DIRTY_LOG for the first time and only for the specific sub-region being cleared. Enhance KVM_CLEAR_DIRTY_LOG to also try to split huge pages prior to write-protecting to avoid causing write-protection faults on vCPU threads. This also allows userspace to smear the cost of huge page splitting across multiple ioctls, rather than splitting the entire memslot as is the case when initially-all-set is not used. Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-17-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Matlack 提交于
When dirty logging is enabled without initially-all-set, try to split all huge pages in the memslot down to 4KB pages so that vCPUs do not have to take expensive write-protection faults to split huge pages. Eager page splitting is best-effort only. This commit only adds the support for the TDP MMU, and even there splitting may fail due to out of memory conditions. Failures to split a huge page is fine from a correctness standpoint because KVM will always follow up splitting by write-protecting any remaining huge pages. Eager page splitting moves the cost of splitting huge pages off of the vCPU threads and onto the thread enabling dirty logging on the memslot. This is useful because: 1. Splitting on the vCPU thread interrupts vCPUs execution and is disruptive to customers whereas splitting on VM ioctl threads can run in parallel with vCPU execution. 2. Splitting all huge pages at once is more efficient because it does not require performing VM-exit handling or walking the page table for every 4KiB page in the memslot, and greatly reduces the amount of contention on the mmu_lock. For example, when running dirty_log_perf_test with 96 virtual CPUs, 1GiB per vCPU, and 1GiB HugeTLB memory, the time it takes vCPUs to write to all of their memory after dirty logging is enabled decreased by 95% from 2.94s to 0.14s. Eager Page Splitting is over 100x more efficient than the current implementation of splitting on fault under the read lock. For example, taking the same workload as above, Eager Page Splitting reduced the CPU required to split all huge pages from ~270 CPU-seconds ((2.94s - 0.14s) * 96 vCPU threads) to only 1.55 CPU-seconds. Eager page splitting does increase the amount of time it takes to enable dirty logging since it has split all huge pages. For example, the time it took to enable dirty logging in the 96GiB region of the aforementioned test increased from 0.001s to 1.55s. Reviewed-by: NPeter Xu <peterx@redhat.com> Signed-off-by: NDavid Matlack <dmatlack@google.com> Message-Id: <20220119230739.2234394-16-dmatlack@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Use slightly more verbose names for the so called "memory encrypt", a.k.a. "mem enc", kvm_x86_ops hooks to bridge the gap between the current super short kvm_x86_ops names and SVM's more verbose, but non-conforming names. This is a step toward using kvm-x86-ops.h with KVM_X86_CVM_OP() to fill svm_x86_ops. Opportunistically rename mem_enc_op() to mem_enc_ioctl() to better reflect its true nature, as it really is a full fledged ioctl() of its own. Ideally, the hook would be named confidential_vm_ioctl() or so, as the ioctl() is a gateway to more than just memory encryption, and because its underlying purpose to support Confidential VMs, which can be provided without memory encryption, e.g. if the TCB of the guest includes the host kernel but not host userspace, or by isolation in hardware without encrypting memory. But, diverging from KVM_MEMORY_ENCRYPT_OP even further is undeseriable, and short of creating alises for all related ioctl()s, which introduces a different flavor of divergence, KVM is stuck with the nomenclature. Defer renaming SVM's functions to a future commit as there are additional changes needed to make SVM fully conforming and to match reality (looking at you, svm_vm_copy_asid_from()). No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-20-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move kvm_get_cs_db_l_bits() to SVM and rename it appropriately so that its svm_x86_ops entry can be filled via kvm-x86-ops, and to eliminate a superfluous export from KVM x86. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-16-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Rename a variety of kvm_x86_op function pointers so that preferred name for vendor implementations follows the pattern <vendor>_<function>, e.g. rename .run() to .vcpu_run() to match {svm,vmx}_vcpu_run(). This will allow vendor implementations to be wired up via the KVM_X86_OP macro. In many cases, VMX and SVM "disagree" on the preferred name, though in reality it's VMX and x86 that disagree as SVM blindly prepended _svm to the kvm_x86_ops name. Justification for using the VMX nomenclature: - set_{irq,nmi} => inject_{irq,nmi} because the helper is injecting an event that has already been "set" in e.g. the vIRR. SVM's relevant VMCB field is even named event_inj, and KVM's stat is irq_injections. - prepare_guest_switch => prepare_switch_to_guest because the former is ambiguous, e.g. it could mean switching between multiple guests, switching from the guest to host, etc... - update_pi_irte => pi_update_irte to allow for matching match the rest of VMX's posted interrupt naming scheme, which is vmx_pi_<blah>(). - start_assignment => pi_start_assignment to again follow VMX's posted interrupt naming scheme, and to provide context for what bit of code might care about an otherwise undescribed "assignment". The "tlb_flush" => "flush_tlb" creates an inconsistency with respect to Hyper-V's "tlb_remote_flush" hooks, but Hyper-V really is the one that's wrong. x86, VMX, and SVM all use flush_tlb, and even common KVM is on a variant of the bandwagon with "kvm_flush_remote_tlbs", e.g. a more appropriate name for the Hyper-V hooks would be flush_remote_tlbs. Leave that change for another time as the Hyper-V hooks always start as NULL, i.e. the name doesn't matter for using kvm-x86-ops.h, and changing all names requires an astounding amount of churn. VMX and SVM function names are intentionally left as is to minimize the diff. Both VMX and SVM will need to rename even more functions in order to fully utilize KVM_X86_OPS, i.e. an additional patch for each is inevitable. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jinrong Liang 提交于
The "struct kvm_vcpu *vcpu" parameter of kvm_scale_tsc() is not used, so remove it. No functional change intended. Signed-off-by: NJinrong Liang <cloudliang@tencent.com> Message-Id: <20220125095909.38122-18-cloudliang@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 2月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Handle non-APICv interrupt delivery in vendor code, even though it means VMX and SVM will temporarily have duplicate code. SVM's AVIC has a race condition that requires KVM to fall back to legacy interrupt injection _after_ the interrupt has been logged in the vIRR, i.e. to fix the race, SVM will need to open code the full flow anyways[*]. Refactor the code so that the SVM bug without introducing other issues, e.g. SVM would return "success" and thus invoke trace_kvm_apicv_accept_irq() even when delivery through the AVIC failed, and to opportunistically prepare for using KVM_X86_OP to fill each vendor's kvm_x86_ops struct, which will rely on the vendor function matching the kvm_x86_op pointer name. No functional change intended. [*] https://lore.kernel.org/all/20211213104634.199141-4-mlevitsk@redhat.comSigned-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220128005208.4008533-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 27 1月, 2022 2 次提交
-
-
由 Sean Christopherson 提交于
Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: stable@vger.kernel.org Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220125220358.2091737-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that a future commit can harden KVM's SEV support to WARN on emulation scenarios that should never happen. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NLiam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-6-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 1月, 2022 1 次提交
-
-
由 Quanfa Fu 提交于
Make kvm_vcpu_reload_apic_access_page() static as it is no longer invoked directly by vmx and it is also no longer exported. No functional change intended. Signed-off-by: NQuanfa Fu <quanfafu@gmail.com> Message-Id: <20211219091446.174584-1-quanfafu@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 1月, 2022 2 次提交
-
-
由 Sean Christopherson 提交于
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-10-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Reject KVM_RUN if emulation is required (because VMX is running without unrestricted guest) and an exception is pending, as KVM doesn't support emulating exceptions except when emulating real mode via vm86. The vCPU is hosed either way, but letting KVM_RUN proceed triggers a WARN due to the impossible condition. Alternatively, the WARN could be removed, but then userspace and/or KVM bugs would result in the vCPU silently running in a bad state, which isn't very friendly to users. Originally, the bug was hit by syzkaller with a nested guest as that doesn't require kvm_intel.unrestricted_guest=0. That particular flavor is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to trigger the WARN with a non-nested guest, and userspace can likely force bad state via ioctls() for a nested guest as well. Checking for the impossible condition needs to be deferred until KVM_RUN because KVM can't force specific ordering between ioctls. E.g. clearing exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with emulation_required would prevent userspace from queuing an exception and then stuffing sregs. Note, if KVM were to try and detect/prevent the condition prior to KVM_RUN, handle_invalid_guest_state() and/or handle_emulation_failure() would need to be modified to clear the pending exception prior to exiting to userspace. ------------[ cut here ]------------ WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel] CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel] Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202 RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000 RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006 R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000 FS: 00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0 Call Trace: kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm] kvm_vcpu_ioctl+0x279/0x690 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211228232437.1875318-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 15 1月, 2022 1 次提交
-
-
由 Kevin Tian 提交于
Always intercepting IA32_XFD causes non-negligible overhead when this register is updated frequently in the guest. Disable r/w emulation after intercepting the first WRMSR(IA32_XFD) with a non-zero value. Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync with the software states in fpstate and the per-cpu xfd cache. This leads to two additional changes accordingly: - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring software states back in-sync with the MSR, before handle_exit_irqoff() is called. - Always trap #NM once write interception is disabled for IA32_XFD. The #NM exception is rare if the guest doesn't use dynamic features. Otherwise, there is at most one exception per guest task given a dynamic feature. p.s. We have confirmed that SDM is being revised to say that when setting IA32_XFD[18] the AMX register state is not guaranteed to be preserved. This clarification avoids adding mess for a creative guest which sets IA32_XFD[18]=1 before saving active AMX state to its own storage. Signed-off-by: NKevin Tian <kevin.tian@intel.com> Signed-off-by: NJing Liu <jing2.liu@intel.com> Signed-off-by: NYang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-22-yang.zhong@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 1月, 2022 4 次提交
-
-
由 Michael Roth 提交于
Normally guests will set up CR3 themselves, but some guests, such as kselftests, and potentially CONFIG_PVH guests, rely on being booted with paging enabled and CR3 initialized to a pre-allocated page table. Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this is too late, since it will have switched over to using the VMSA page prior to that point, with the VMSA CR3 copied from the VMCB initial CR3 value: 0. Address this by sync'ing the CR3 value into the VMCB save area immediately when KVM_SET_SREGS* is issued so it will find it's way into the initial VMSA. Suggested-by: NTom Lendacky <thomas.lendacky@amd.com> Signed-off-by: NMichael Roth <michael.roth@amd.com> Message-Id: <20211216171358.61140-10-michael.roth@amd.com> [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the new hook. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
This adds basic support for delivering 2 level event channels to a guest. Initially, it only supports delivery via the IRQ routing table, triggered by an eventfd. In order to do so, it has a kvm_xen_set_evtchn_fast() function which will use the pre-mapped shared_info page if it already exists and is still valid, while the slow path through the irqfd_inject workqueue will remap the shared_info page if necessary. It sets the bits in the shared_info page but not the vcpu_info; that is deferred to __kvm_xen_has_interrupt() which raises the vector to the appropriate vCPU. Add a 'verbose' mode to xen_shinfo_test while adding test cases for this. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-5-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
Use the newly reinstated gfn_to_pfn_cache to maintain a kernel mapping of the Xen shared_info page so that it can be accessed in atomic context. Note that we do not participate in dirty tracking for the shared info page and we do not explicitly mark it dirty every single tim we deliver an event channel interrupts. We wouldn't want to do that even if we *did* have a valid vCPU context with which to do so. Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Message-Id: <20211210163625.2886-4-dwmw2@infradead.org> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
Depending on whether intr should be triggered or not, KVM registers two different event overflow callbacks in the perf_event context. The code skeleton of these two functions is very similar, so the pmc->intr can be stored into pmc from pmc_reprogram_counter() which provides smaller instructions footprint against the u-architecture branch predictor. The __kvm_perf_overflow() can be called in non-nmi contexts and a flag is needed to distinguish the caller context and thus avoid a check on kvm_is_in_guest(), otherwise we might get warnings from suspicious RCU or check_preemption_disabled(). Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20211130074221.93635-5-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 12月, 2021 1 次提交
-
-
由 Marc Orr 提交于
The kvm_run struct's if_flag is a part of the userspace/kernel API. The SEV-ES patches failed to set this flag because it's no longer needed by QEMU (according to the comment in the source code). However, other hypervisors may make use of this flag. Therefore, set the flag for guests with encrypted registers (i.e., with guest_state_protected set). Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Signed-off-by: NMarc Orr <marcorr@google.com> Message-Id: <20211209155257.128747-1-marcorr@google.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
-
- 10 12月, 2021 1 次提交
-
-
由 Vitaly Kuznetsov 提交于
Prior to commit 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()"), kvm_hv_flush_tlb() was using 'KVM_REQ_TLB_FLUSH | KVM_REQUEST_NO_WAKEUP' when making a request to flush TLBs on other vCPUs and KVM_REQ_TLB_FLUSH is/was defined as: (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP) so KVM_REQUEST_WAIT was lost. Hyper-V TLFS, however, requires that "This call guarantees that by the time control returns back to the caller, the observable effects of all flushes on the specified virtual processors have occurred." and without KVM_REQUEST_WAIT there's a small chance that the vCPU making the TLB flush will resume running before all IPIs get delivered to other vCPUs and a stale mapping can get read there. Fix the issue by adding KVM_REQUEST_WAIT flag to KVM_REQ_TLB_FLUSH_GUEST: kvm_hv_flush_tlb() is the sole caller which uses it for kvm_make_all_cpus_request()/kvm_make_vcpus_request_mask() where KVM_REQUEST_WAIT makes a difference. Cc: stable@kernel.org Fixes: 0baedd79 ("KVM: x86: make Hyper-V PV TLB flush use tlb_flush_guest()") Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211209102937.584397-1-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 12月, 2021 10 次提交
-
-
由 Hou Wenlong 提交于
The next patch would use kvm_emulate_instruction() with EMULTYPE_SKIP in complete_userspace_io callback to fix a problem in msr access emulation. However, EMULTYPE_SKIP only updates RIP, more things like updating interruptibility state and injecting single-step #DBs would be done in the callback. Since the emulator also does those things after x86_emulate_insn(), add a new emulation type to pair with EMULTYPE_SKIP to do those things for completion of user exits within the emulator. Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NHou Wenlong <houwenlong93@linux.alibaba.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <8f8c8e268b65f31d55c2881a4b30670946ecfa0d.1635842679.git.houwenlong93@linux.alibaba.com>
-
由 Lai Jiangshan 提交于
It uses vcpu->arch.walk_mmu always; nested EPT does not have PDPTRs, and nested NPT treats them like all other non-leaf page table levels instead of caching them. Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-11-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
This bit is very close to mean "role.quadrant is not in use", except that it is false also when the MMU is mapping guest physical addresses directly. In that case, role.quadrant is indeed not in use, but there are no guest PTEs at all. Changing the name and direction of the bit removes the special case, since a guest with paging disabled, or not considering guest paging structures as is the case for two-dimensional paging, does not have to deal with 4-byte guest PTEs. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-10-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
Reduce an indirect function call (retpoline) and some intialization code. Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-4-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
The mmu->gva_to_gpa() has no "struct kvm_mmu *mmu", so an extra FNAME(gva_to_gpa_nested) is needed. Add the parameter can simplify the code. And it makes it explicit that the walk is upon vcpu->arch.walk_mmu for gva and vcpu->arch.mmu for L2 gpa in translate_nested_gpa() via the new parameter. Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211124122055.64424-3-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
The body of __kvm_mmu_free_some_pages() has been removed. Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211118110814.2568-13-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Drop kvm_arch_vcpu_block_finish() now that all arch implementations are nops. No functional change intended. Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: NDavid Matlack <dmatlack@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-10-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Rename a variety of HLT-related helpers to free up the function name "kvm_vcpu_halt" for future use in generic KVM code, e.g. to differentiate between "block" and "halt". No functional change intended. Reviewed-by: NDavid Matlack <dmatlack@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211009021236.4122790-13-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maciej S. Szmigiero 提交于
There is no point in recalculating from scratch the total number of pages in all memslots each time a memslot is created or deleted. Use KVM's cached nr_memslot_pages to compute the default max number of MMU pages. Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can possibly overflow an unsigned long variable. Write this "* 20 / 1000" operation as "/ 50" instead to avoid such overflow. Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> [sean: use common KVM field and rework changelog accordingly] Reviewed-by: NSean Christopherson <seanjc@google.com> Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
-
由 Paolo Bonzini 提交于
Fix the number of bits in the role, and simplify the explanation of why several bits or combinations of bits are redundant. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 02 12月, 2021 1 次提交
-
-
由 Paolo Bonzini 提交于
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel local APIC, however kvm_apicv_activated might still be true if there are no reasons to disable APICv; in fact it is quite likely that there is none because APICv is inhibited by specific configurations of the local APIC and those configurations cannot be programmed. This triggers a WARN: WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu)); To avoid this, introduce another cause for APICv inhibition, namely the absence of an in-kernel local APIC. This cause is enabled by default, and is dropped by either KVM_CREATE_IRQCHIP or the enabling of KVM_CAP_IRQCHIP_SPLIT. Reported-by: NIgnat Korchagin <ignat@cloudflare.com> Fixes: ee49a893 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22) Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Tested-by: NIgnat Korchagin <ignat@cloudflare.com> Message-Id: <20211130123746.293379-1-pbonzini@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 18 11月, 2021 1 次提交
-
-
由 Maxim Levitsky 提交于
Incorporate EFER.LMA into kvm_mmu_extended_role, as it used to compute the guest root level and is not reflected in kvm_mmu_page_role.level when TDP is in use. When simply running the guest, it is impossible for EFER.LMA and kvm_mmu.root_level to get out of sync, as the guest cannot transition from PAE paging to 64-bit paging without toggling CR0.PG, i.e. without first bouncing through a different MMU context. And stuffing guest state via KVM_SET_SREGS{,2} also ensures a full MMU context reset. However, if KVM_SET_SREGS{,2} is followed by KVM_SET_NESTED_STATE, e.g. to set guest state when migrating the VM while L2 is active, the vCPU state will reflect L2, not L1. If L1 is using TDP for L2, then root_mmu will have been configured using L2's state, despite not being used for L2. If L2.EFER.LMA != L1.EFER.LMA, and L2 is using PAE paging, then root_mmu will be configured for guest PAE paging, but will match the mmu_role for 64-bit paging and cause KVM to not reconfigure root_mmu on the next nested VM-Exit. Alternatively, the root_mmu's role could be invalidated after a successful KVM_SET_NESTED_STATE that yields vcpu->arch.mmu != vcpu->arch.root_mmu, i.e. that switches the active mmu to guest_mmu, but doing so is unnecessarily tricky, and not even needed if L1 and L2 do have the same role (e.g., they are both 64-bit guests and run with the same CR4). Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211115131837.195527-3-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 17 11月, 2021 5 次提交
-
-
由 Sean Christopherson 提交于
Now that all state needed for VMX's PT interrupt handler is exposed to vmx.c (specifically the currently running vCPU), move the handler into vmx.c where it belongs. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20211111020738.2512932-14-seanjc@google.com
-
由 Sean Christopherson 提交于
Move x86's perf guest callbacks into common KVM, as they are semantically identical to arm64's callbacks (the only other such KVM callbacks). arm64 will convert to the common versions in a future patch. Implement the necessary arm64 arch hooks now to avoid having to provide stubs or a temporary #define (from x86) to avoid arm64 compilation errors when CONFIG_GUEST_PERF_EVENTS=y. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Acked-by: NMarc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20211111020738.2512932-13-seanjc@google.com
-
由 Sean Christopherson 提交于
Use the generic kvm_running_vcpu plus a new 'handling_intr_from_guest' variable in kvm_arch_vcpu instead of the semi-redundant current_vcpu. kvm_before/after_interrupt() must be called while the vCPU is loaded, (which protects against preemption), thus kvm_running_vcpu is guaranteed to be non-NULL when handling_intr_from_guest is non-zero. Switching to kvm_get_running_vcpu() will allows moving KVM's perf callbacks to generic code, and the new flag will be used in a future patch to more precisely identify the "NMI from guest" case. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20211111020738.2512932-11-seanjc@google.com
-
由 Like Xu 提交于
To prepare for using static_calls to optimize perf's guest callbacks, replace ->is_in_guest and ->is_user_mode with a new multiplexed hook ->state, tweak ->handle_intel_pt_intr to play nice with being called when there is no active guest, and drop "guest" from ->get_guest_ip. Return '0' from ->state and ->handle_intel_pt_intr to indicate "not in guest" so that DEFINE_STATIC_CALL_RET0 can be used to define the static calls, i.e. no callback == !guest. [sean: extracted from static_call patch, fixed get_ip() bug, wrote changelog] Suggested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Originally-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NLike Xu <like.xu@linux.intel.com> Signed-off-by: NZhu Lingshan <lingshan.zhu@intel.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Link: https://lore.kernel.org/r/20211111020738.2512932-7-seanjc@google.com
-
由 Sean Christopherson 提交于
Override the Processor Trace (PT) interrupt handler for guest mode if and only if PT is configured for host+guest mode, i.e. is being used independently by both host and guest. If PT is configured for system mode, the host fully controls PT and must handle all events. Fixes: 8479e04e ("KVM: x86: Inject PMI for KVM guest") Reported-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Reported-by: NArtem Kashkanov <artem.kashkanov@intel.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20211111020738.2512932-4-seanjc@google.com
-
- 11 11月, 2021 4 次提交
-
-
由 Vitaly Kuznetsov 提交于
KVM_CAP_NR_VCPUS is used to get the "recommended" maximum number of VCPUs and arm64/mips/riscv report num_online_cpus(). Powerpc reports either num_online_cpus() or num_present_cpus(), s390 has multiple constants depending on hardware features. On x86, KVM reports an arbitrary value of '710' which is supposed to be the maximum tested value but it's possible to test all KVM_MAX_VCPUS even when there are less physical CPUs available. Drop the arbitrary '710' value and return num_online_cpus() on x86 as well. The recommendation will match other architectures and will mean 'no CPU overcommit'. For reference, QEMU only queries KVM_CAP_NR_VCPUS to print a warning when the requested vCPU number exceeds it. The static limit of '710' is quite weird as smaller systems with just a few physical CPUs should certainly "recommend" less. Suggested-by: NEduardo Habkost <ehabkost@redhat.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211111134733.86601-1-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paul Durrant 提交于
Currently when kvm_update_cpuid_runtime() runs, it assumes that the KVM_CPUID_FEATURES leaf is located at 0x40000001. This is not true, however, if Hyper-V support is enabled. In this case the KVM leaves will be offset. This patch introdues as new 'kvm_cpuid_base' field into struct kvm_vcpu_arch to track the location of the KVM leaves and function kvm_update_kvm_cpuid_base() (called from kvm_set_cpuid()) to locate the leaves using the 'KVMKVMKVM\0\0\0' signature (which is now given a definition in kvm_para.h). Adjustment of KVM_CPUID_FEATURES will hence now target the correct leaf. NOTE: A new for_each_possible_hypervisor_cpuid_base() macro is intoduced into processor.h to avoid having duplicate code for the iteration over possible hypervisor base leaves. Signed-off-by: NPaul Durrant <pdurrant@amazon.com> Message-Id: <20211105095101.5384-3-pdurrant@amazon.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
KVM_GUESTDBG_BLOCKIRQ relies on interrupts being injected using standard kvm's inject_pending_event, and not via APICv/AVIC. Since this is a debug feature, just inhibit APICv/AVIC while KVM_GUESTDBG_BLOCKIRQ is in use on at least one vCPU. Fixes: 61e5f69e ("KVM: x86: implement KVM_GUESTDBG_BLOCKIRQ") Reported-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Tested-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211108090245.166408-1-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 David Woodhouse 提交于
In commit b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") we switched to using a gfn_to_pfn_cache for accessing the guest steal time structure in order to allow for an atomic xchg of the preempted field. This has a couple of problems. Firstly, kvm_map_gfn() doesn't work at all for IOMEM pages when the atomic flag is set, which it is in kvm_steal_time_set_preempted(). So a guest vCPU using an IOMEM page for its steal time would never have its preempted field set. Secondly, the gfn_to_pfn_cache is not invalidated in all cases where it should have been. There are two stages to the GFN->PFN conversion; first the GFN is converted to a userspace HVA, and then that HVA is looked up in the process page tables to find the underlying host PFN. Correct invalidation of the latter would require being hooked up to the MMU notifiers, but that doesn't happen---so it just keeps mapping and unmapping the *wrong* PFN after the userspace page tables change. In the !IOMEM case at least the stale page *is* pinned all the time it's cached, so it won't be freed and reused by anyone else while still receiving the steal time updates. The map/unmap dance only takes care of the KVM administrivia such as marking the page dirty. Until the gfn_to_pfn cache handles the remapping automatically by integrating with the MMU notifiers, we might as well not get a kernel mapping of it, and use the perfectly serviceable userspace HVA that we already have. We just need to implement the atomic xchg on the userspace address with appropriate exception handling, which is fairly trivial. Cc: stable@vger.kernel.org Fixes: b0431382 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag is not missed") Signed-off-by: NDavid Woodhouse <dwmw@amazon.co.uk> Message-Id: <3645b9b889dac6438394194bb5586a46b68d581f.camel@infradead.org> [I didn't entirely agree with David's assessment of the usefulness of the gfn_to_pfn cache, and integrated the outcome of the discussion in the above commit message. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-