- 28 1月, 2022 3 次提交
-
-
由 Vitaly Kuznetsov 提交于
vmcs_to_field_offset{,_table} may sound misleading as VMCS is an opaque blob which is not supposed to be accessed directly. In fact, vmcs_to_field_offset{,_table} are related to KVM defined VMCS12 structure. Rename vmcs_field_to_offset() to get_vmcs12_field_offset() for clarity. No functional change intended. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-4-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
Enlightened VMCS v1 doesn't have VMX_PREEMPTION_TIMER_VALUE field, PIN_BASED_VMX_PREEMPTION_TIMER is also filtered out already so it makes sense to filter out VM_EXIT_SAVE_VMX_PREEMPTION_TIMER too. Note, none of the currently existing Windows/Hyper-V versions are known to enable 'save VMX-preemption timer value' when eVMCS is in use, the change is aimed at making the filtering future proof. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-3-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
Similar to MSR_IA32_VMX_EXIT_CTLS/MSR_IA32_VMX_TRUE_EXIT_CTLS, MSR_IA32_VMX_ENTRY_CTLS/MSR_IA32_VMX_TRUE_ENTRY_CTLS pair, MSR_IA32_VMX_TRUE_PINBASED_CTLS needs to be filtered the same way MSR_IA32_VMX_PINBASED_CTLS is currently filtered as guests may solely rely on 'true' MSR data. Note, none of the currently existing Windows/Hyper-V versions are known to stumble upon the unfiltered MSR_IA32_VMX_TRUE_PINBASED_CTLS, the change is aimed at making the filtering future proof. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20220112170134.1904308-2-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 27 1月, 2022 4 次提交
-
-
由 Sean Christopherson 提交于
WARN if KVM attempts to allocate a shadow VMCS for vmcs02. KVM emulates VMCS shadowing but doesn't virtualize it, i.e. KVM should never allocate a "real" shadow VMCS for L2. The previous code WARNed but continued anyway with the allocation, presumably in an attempt to avoid NULL pointer dereference. However, alloc_vmcs (and hence alloc_shadow_vmcs) can fail, and indeed the sole caller does: if (enable_shadow_vmcs && !alloc_shadow_vmcs(vcpu)) goto out_shadow_vmcs; which makes it not a useful attempt. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220125220527.2093146-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Forcibly leave nested virtualization operation if userspace toggles SMM state via KVM_SET_VCPU_EVENTS or KVM_SYNC_X86_EVENTS. If userspace forces the vCPU out of SMM while it's post-VMXON and then injects an SMI, vmx_enter_smm() will overwrite vmx->nested.smm.vmxon and end up with both vmxon=false and smm.vmxon=false, but all other nVMX state allocated. Don't attempt to gracefully handle the transition as (a) most transitions are nonsencial, e.g. forcing SMM while L2 is running, (b) there isn't sufficient information to handle all transitions, e.g. SVM wants access to the SMRAM save state, and (c) KVM_SET_VCPU_EVENTS must precede KVM_SET_NESTED_STATE during state restore as the latter disallows putting the vCPU into L2 if SMM is active, and disallows tagging the vCPU as being post-VMXON in SMM if SMM is not active. Abuse of KVM_SET_VCPU_EVENTS manifests as a WARN and memory leak in nVMX due to failure to free vmcs01's shadow VMCS, but the bug goes far beyond just a memory leak, e.g. toggling SMM on while L2 is active puts the vCPU in an architecturally impossible state. WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] WARNING: CPU: 0 PID: 3606 at free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Modules linked in: CPU: 1 PID: 3606 Comm: syz-executor725 Not tainted 5.17.0-rc1-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:free_loaded_vmcs arch/x86/kvm/vmx/vmx.c:2665 [inline] RIP: 0010:free_loaded_vmcs+0x158/0x1a0 arch/x86/kvm/vmx/vmx.c:2656 Code: <0f> 0b eb b3 e8 8f 4d 9f 00 e9 f7 fe ff ff 48 89 df e8 92 4d 9f 00 Call Trace: <TASK> kvm_arch_vcpu_destroy+0x72/0x2f0 arch/x86/kvm/x86.c:11123 kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline] kvm_destroy_vcpus+0x11f/0x290 arch/x86/kvm/../../../virt/kvm/kvm_main.c:460 kvm_free_vcpus arch/x86/kvm/x86.c:11564 [inline] kvm_arch_destroy_vm+0x2e8/0x470 arch/x86/kvm/x86.c:11676 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1217 [inline] kvm_put_kvm+0x4fa/0xb00 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1250 kvm_vm_release+0x3f/0x50 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1273 __fput+0x286/0x9f0 fs/file_table.c:311 task_work_run+0xdd/0x1a0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0xb29/0x2a30 kernel/exit.c:806 do_group_exit+0xd2/0x2f0 kernel/exit.c:935 get_signal+0x4b0/0x28c0 kernel/signal.c:2862 arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300 do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Cc: stable@vger.kernel.org Reported-by: syzbot+8112db3ab20e70d50c31@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220125220358.2091737-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Pass the emulation type to kvm_x86_ops.can_emulate_insutrction() so that a future commit can harden KVM's SEV support to WARN on emulation scenarios that should never happen. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NLiam Merwick <liam.merwick@oracle.com> Message-Id: <20220120010719.711476-6-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
The maximum size of a VMCS (or VMXON region) is 4096. By definition, these are order 0 allocations. Signed-off-by: NJim Mattson <jmattson@google.com> Message-Id: <20220125004359.147600-1-jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 1月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Set vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS, a.k.a. the pending single-step breakpoint flag, when re-injecting a #DB with RFLAGS.TF=1, and STI or MOVSS blocking is active. Setting the flag is necessary to make VM-Entry consistency checks happy, as VMX has an invariant that if RFLAGS.TF is set and STI/MOVSS blocking is true, then the previous instruction must have been STI or MOV/POP, and therefore a single-step #DB must be pending since the RFLAGS.TF cannot have been set by the previous instruction, i.e. the one instruction delay after setting RFLAGS.TF must have already expired. Normally, the CPU sets vmcs.GUEST_PENDING_DBG_EXCEPTIONS.BS appropriately when recording guest state as part of a VM-Exit, but #DB VM-Exits intentionally do not treat the #DB as "guest state" as interception of the #DB effectively makes the #DB host-owned, thus KVM needs to manually set PENDING_DBG.BS when forwarding/re-injecting the #DB to the guest. Note, although this bug can be triggered by guest userspace, doing so requires IOPL=3, and guest userspace running with IOPL=3 has full access to all I/O ports (from the guest's perspective) and can crash/reboot the guest any number of ways. IOPL=3 is required because STI blocking kicks in if and only if RFLAGS.IF is toggled 0=>1, and if CPL>IOPL, STI either takes a #GP or modifies RFLAGS.VIF, not RFLAGS.IF. MOVSS blocking can be initiated by userspace, but can be coincident with a #DB if and only if DR7.GD=1 (General Detect enabled) and a MOV DR is executed in the MOVSS shadow. MOV DR #GPs at CPL>0, thus MOVSS blocking is problematic only for CPL0 (and only if the guest is crazy enough to access a DR in a MOVSS shadow). All other sources of #DBs are either suppressed by MOVSS blocking (single-step, code fetch, data, and I/O), are mutually exclusive with MOVSS blocking (T-bit task switch), or are already handled by KVM (ICEBP, a.k.a. INT1). This bug was originally found by running tests[1] created for XSA-308[2]. Note that Xen's userspace test emits ICEBP in the MOVSS shadow, which is presumably why the Xen bug was deemed to be an exploitable DOS from guest userspace. KVM already handles ICEBP by skipping the ICEBP instruction and thus clears MOVSS blocking as a side effect of its "emulation". [1] http://xenbits.xenproject.org/docs/xtf/xsa-308_2main_8c_source.html [2] https://xenbits.xen.org/xsa/advisory-308.htmlReported-by: NDavid Woodhouse <dwmw2@infradead.org> Reported-by: NAlexander Graf <graf@amazon.de> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220120000624.655815-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 24 1月, 2022 1 次提交
-
-
由 Sean Christopherson 提交于
Zero vmcs.HOST_IA32_SYSENTER_ESP when initializing *constant* host state if and only if SYSENTER cannot be used, i.e. the kernel is a 64-bit kernel and is not emulating 32-bit syscalls. As the name suggests, vmx_set_constant_host_state() is intended for state that is *constant*. When SYSENTER is used, SYSENTER_ESP isn't constant because stacks are per-CPU, and the VMCS must be updated whenever the vCPU is migrated to a new CPU. The logic in vmx_vcpu_load_vmcs() doesn't differentiate between "never loaded" and "loaded on a different CPU", i.e. setting SYSENTER_ESP on VMCS load also handles setting correct host state when the VMCS is first loaded. Because a VMCS must be loaded before it is initialized during vCPU RESET, zeroing the field in vmx_set_constant_host_state() obliterates the value that was written when the VMCS was loaded. If the vCPU is run before it is migrated, the subsequent VM-Exit will zero out MSR_IA32_SYSENTER_ESP, leading to a #DF on the next 32-bit syscall. double fault: 0000 [#1] SMP CPU: 0 PID: 990 Comm: stable Not tainted 5.16.0+ #97 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 EIP: entry_SYSENTER_32+0x0/0xe7 Code: <9c> 50 eb 17 0f 20 d8 a9 00 10 00 00 74 0d 25 ff ef ff ff 0f 22 d8 EAX: 000000a2 EBX: a8d1300c ECX: a8d13014 EDX: 00000000 ESI: a8f87000 EDI: a8d13014 EBP: a8d12fc0 ESP: 00000000 DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 EFLAGS: 00210093 CR0: 80050033 CR2: fffffffc CR3: 02c3b000 CR4: 00152e90 Fixes: 6ab8a405 ("KVM: VMX: Avoid to rdmsrl(MSR_IA32_SYSENTER_ESP)") Cc: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220122015211.1468758-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 1月, 2022 9 次提交
-
-
由 Sean Christopherson 提交于
When waking vCPUs in the posted interrupt wakeup handling, do exactly that and no more. There is no need to kick the vCPU as the wakeup handler just needs to get the vCPU task running, and if it's in the guest then it's definitely running. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-21-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move the fallback "wake_up" path into the helper to trigger posted interrupt helper now that the nested and non-nested paths are identical. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-20-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Refactor the posted interrupt helper to take the desired notification vector instead of a bool so that the callers are self-documenting. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-19-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Replace the full "kick" with just the "wake" in the fallback path when triggering a virtual interrupt via a posted interrupt fails because the guest is not IN_GUEST_MODE. If the guest transitions into guest mode between the check and the kick, then it's guaranteed to see the pending interrupt as KVM syncs the PIR to IRR (and onto GUEST_RVI) after setting IN_GUEST_MODE. Kicking the guest in this case is nothing more than an unnecessary VM-Exit (and host IRQ). Opportunistically update comments to explain the various ordering rules and barriers at play. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-17-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Drop kvm_x86_ops' pre/post_block() now that all implementations are nops. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-10-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Handle the switch to/from the hypervisor/software timer when a vCPU is blocking in common x86 instead of in VMX. Even though VMX is the only user of a hypervisor timer, the logic and all functions involved are generic x86 (unless future CPUs do something completely different and implement a hypervisor timer that runs regardless of mode). Handling the switch in common x86 will allow for the elimination of the pre/post_blocks hooks, and also lets KVM switch back to the hypervisor timer if and only if it was in use (without additional params). Add a comment explaining why the switch cannot be deferred to kvm_sched_out() or kvm_vcpu_block(). Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-8-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move the seemingly generic block_vcpu_list from kvm_vcpu to vcpu_vmx, and rename the list and all associated variables to clarify that it tracks the set of vCPU that need to be poked on a posted interrupt to the wakeup vector. The list is not used to track _all_ vCPUs that are blocking, and the term "blocked" can be misleading as it may refer to a blocking condition in the host or the guest, where as the PI wakeup case is specifically for the vCPUs that are actively blocking from within the guest. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-7-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Move the posted interrupt pre/post_block logic into vcpu_put/load respectively, using the kvm_vcpu_is_blocking() to determining whether or not the wakeup handler needs to be set (and unset). This avoids updating the PI descriptor if halt-polling is successful, reduces the number of touchpoints for updating the descriptor, and eliminates the confusing behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU is scheduled back in after preemption. The downside is that KVM will do the PID update twice if the vCPU is preempted after prepare_to_rcuwait() but before schedule(), but that's a rare case (and non-existent on !PREEMPT kernels). The notable wart is the need to send a self-IPI on the wakeup vector if an outstanding notification is pending after configuring the wakeup vector. Ideally, KVM would just do a kvm_vcpu_wake_up() in this case, but the scheduler doesn't support waking a task from its preemption notifier callback, i.e. while the task is right in the middle of being scheduled out. Note, setting the wakeup vector before halt-polling is not necessary: once the pending IRQ will be recorded in the PIR, kvm_vcpu_has_events() will detect this (via kvm_cpu_get_interrupt(), kvm_apic_get_interrupt(), apic_has_interrupt_for_ppr() and finally vmx_sync_pir_to_irr()) and terminate the polling. Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-5-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Reject KVM_RUN if emulation is required (because VMX is running without unrestricted guest) and an exception is pending, as KVM doesn't support emulating exceptions except when emulating real mode via vm86. The vCPU is hosed either way, but letting KVM_RUN proceed triggers a WARN due to the impossible condition. Alternatively, the WARN could be removed, but then userspace and/or KVM bugs would result in the vCPU silently running in a bad state, which isn't very friendly to users. Originally, the bug was hit by syzkaller with a nested guest as that doesn't require kvm_intel.unrestricted_guest=0. That particular flavor is likely fixed by commit cd0e615c ("KVM: nVMX: Synthesize TRIPLE_FAULT for L2 if emulation is required"), but it's trivial to trigger the WARN with a non-nested guest, and userspace can likely force bad state via ioctls() for a nested guest as well. Checking for the impossible condition needs to be deferred until KVM_RUN because KVM can't force specific ordering between ioctls. E.g. clearing exception.pending in KVM_SET_SREGS doesn't prevent userspace from setting it in KVM_SET_VCPU_EVENTS, and disallowing KVM_SET_VCPU_EVENTS with emulation_required would prevent userspace from queuing an exception and then stuffing sregs. Note, if KVM were to try and detect/prevent the condition prior to KVM_RUN, handle_invalid_guest_state() and/or handle_emulation_failure() would need to be modified to clear the pending exception prior to exiting to userspace. ------------[ cut here ]------------ WARNING: CPU: 6 PID: 137812 at arch/x86/kvm/vmx/vmx.c:1623 vmx_queue_exception+0x14f/0x160 [kvm_intel] CPU: 6 PID: 137812 Comm: vmx_invalid_nes Not tainted 5.15.2-7cc36c3e14ae-pop #279 Hardware name: ASUS Q87M-E/Q87M-E, BIOS 1102 03/03/2014 RIP: 0010:vmx_queue_exception+0x14f/0x160 [kvm_intel] Code: <0f> 0b e9 fd fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 RSP: 0018:ffffa45c83577d38 EFLAGS: 00010202 RAX: 0000000000000003 RBX: 0000000080000006 RCX: 0000000000000006 RDX: 0000000000000000 RSI: 0000000000010002 RDI: ffff9916af734000 RBP: ffff9916af734000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000006 R13: 0000000000000000 R14: ffff9916af734038 R15: 0000000000000000 FS: 00007f1e1a47c740(0000) GS:ffff99188fb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1e1a6a8008 CR3: 000000026f83b005 CR4: 00000000001726e0 Call Trace: kvm_arch_vcpu_ioctl_run+0x13a2/0x1f20 [kvm] kvm_vcpu_ioctl+0x279/0x690 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae Reported-by: syzbot+82112403ace4cbd780d8@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211228232437.1875318-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 18 1月, 2022 3 次提交
-
-
由 Marcelo Tosatti 提交于
blocked_vcpu_on_cpu_lock is taken from hard interrupt context (pi_wakeup_handler), therefore it cannot sleep. Switch it to a raw spinlock. Fixes: [41297.066254] BUG: scheduling while atomic: CPU 0/KVM/635218/0x00010001 [41297.066323] Preemption disabled at: [41297.066324] [<ffffffff902ee47f>] irq_enter_rcu+0xf/0x60 [41297.066339] Call Trace: [41297.066342] <IRQ> [41297.066346] dump_stack_lvl+0x34/0x44 [41297.066353] ? irq_enter_rcu+0xf/0x60 [41297.066356] __schedule_bug.cold+0x7d/0x8b [41297.066361] __schedule+0x439/0x5b0 [41297.066365] ? task_blocks_on_rt_mutex.constprop.0.isra.0+0x1b0/0x440 [41297.066369] schedule_rtlock+0x1e/0x40 [41297.066371] rtlock_slowlock_locked+0xf1/0x260 [41297.066374] rt_spin_lock+0x3b/0x60 [41297.066378] pi_wakeup_handler+0x31/0x90 [kvm_intel] [41297.066388] sysvec_kvm_posted_intr_wakeup_ipi+0x9d/0xd0 [41297.066392] </IRQ> [41297.066392] asm_sysvec_kvm_posted_intr_wakeup_ipi+0x12/0x20 ... Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
The new module parameter to control PMU virtualization should apply to Intel as well as AMD, for situations where userspace is not trusted. If the module parameter allows PMU virtualization, there could be a new KVM_CAP or guest CPUID bits whereby userspace can enable/disable PMU virtualization on a per-VM basis. If the module parameter does not allow PMU virtualization, there should be no userspace override, since we have no precedent for authorizing that kind of override. If it's false, other counter-based profiling features (such as LBR including the associated CPUID bits if any) will not be exposed. Change its name from "pmu" to "enable_pmu" as we have temporary variables with the same name in our code like "struct kvm_pmu *pmu". Fixes: b1d66dad ("KVM: x86/svm: Add module param to control PMU virtualization") Suggested-by : Jim Mattson <jmattson@google.com> Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20220111073823.21885-1-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
According to CPUID 0x0A.EBX bit vector, the event [7] should be the unrealized event "Topdown Slots" instead of the *kernel* generalized common hardware event "REF_CPU_CYCLES", so we need to skip the cpuid unavaliblity check in the intel_pmc_perf_hw_id() for the last REF_CPU_CYCLES event and update the confusing comment. If the event is marked as unavailable in the Intel guest CPUID 0AH.EBX leaf, we need to avoid any perf_event creation, whether it's a gp or fixed counter. To distinguish whether it is a rejected event or an event that needs to be programmed with PERF_TYPE_RAW type, a new special returned value of "PERF_COUNT_HW_MAX + 1" is introduced. Fixes: 62079d8a ("KVM: PMU: add proper support for fixed counter 2") Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20220105051509.69437-1-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 15 1月, 2022 3 次提交
-
-
由 Kevin Tian 提交于
Always intercepting IA32_XFD causes non-negligible overhead when this register is updated frequently in the guest. Disable r/w emulation after intercepting the first WRMSR(IA32_XFD) with a non-zero value. Disable WRMSR emulation implies that IA32_XFD becomes out-of-sync with the software states in fpstate and the per-cpu xfd cache. This leads to two additional changes accordingly: - Call fpu_sync_guest_vmexit_xfd_state() after vm-exit to bring software states back in-sync with the MSR, before handle_exit_irqoff() is called. - Always trap #NM once write interception is disabled for IA32_XFD. The #NM exception is rare if the guest doesn't use dynamic features. Otherwise, there is at most one exception per guest task given a dynamic feature. p.s. We have confirmed that SDM is being revised to say that when setting IA32_XFD[18] the AMX register state is not guaranteed to be preserved. This clarification avoids adding mess for a creative guest which sets IA32_XFD[18]=1 before saving active AMX state to its own storage. Signed-off-by: NKevin Tian <kevin.tian@intel.com> Signed-off-by: NJing Liu <jing2.liu@intel.com> Signed-off-by: NYang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-22-yang.zhong@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jing Liu 提交于
This saves one unnecessary VM-exit in guest #NM handler, given that the MSR is already restored with the guest value before the guest is resumed. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NJing Liu <jing2.liu@intel.com> Signed-off-by: NYang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-15-yang.zhong@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jing Liu 提交于
Guest IA32_XFD_ERR is generally modified in two places: - Set by CPU when #NM is triggered; - Cleared by guest in its #NM handler; Intercept #NM for the first case when a nonzero value is written to IA32_XFD. Nonzero indicates that the guest is willing to do dynamic fpstate expansion for certain xfeatures, thus KVM needs to manage and virtualize guest XFD_ERR properly. The vcpu exception bitmap is updated in XFD write emulation according to guest_fpu::xfd. Save the current XFD_ERR value to the guest_fpu container in the #NM VM-exit handler. This must be done with interrupt disabled, otherwise the unsaved MSR value may be clobbered by host activity. The saving operation is conducted conditionally only when guest_fpu:xfd includes a non-zero value. Doing so also avoids misread on a platform which doesn't support XFD but #NM is triggered due to L1 interception. Queueing #NM to the guest is postponed to handle_exception_nmi(). This goes through the nested_vmx check so a virtual vmexit is queued instead when #NM is triggered in L2 but L1 wants to intercept it. Restore the host value (always ZERO outside of the host #NM handler) before enabling interrupt. Restore the guest value from the guest_fpu container right before entering the guest (with interrupt disabled). Suggested-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NJing Liu <jing2.liu@intel.com> Signed-off-by: NKevin Tian <kevin.tian@intel.com> Signed-off-by: NYang Zhong <yang.zhong@intel.com> Message-Id: <20220105123532.12586-13-yang.zhong@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 1月, 2022 9 次提交
-
-
由 Michael Roth 提交于
Normally guests will set up CR3 themselves, but some guests, such as kselftests, and potentially CONFIG_PVH guests, rely on being booted with paging enabled and CR3 initialized to a pre-allocated page table. Currently CR3 updates via KVM_SET_SREGS* are not loaded into the guest VMCB until just prior to entering the guest. For SEV-ES/SEV-SNP, this is too late, since it will have switched over to using the VMSA page prior to that point, with the VMSA CR3 copied from the VMCB initial CR3 value: 0. Address this by sync'ing the CR3 value into the VMCB save area immediately when KVM_SET_SREGS* is issued so it will find it's way into the initial VMSA. Suggested-by: NTom Lendacky <thomas.lendacky@amd.com> Signed-off-by: NMichael Roth <michael.roth@amd.com> Message-Id: <20211216171358.61140-10-michael.roth@amd.com> [Remove vmx_post_set_cr3; add a remark about kvm_set_cr3 not calling the new hook. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Peter Zijlstra 提交于
Use asm-goto-output for smaller fast path code. Message-Id: <YbcbbGW2GcMx6KpD@hirez.programming.kicks-ass.net> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Eric Hankland 提交于
When KVM retires a guest branch instruction through emulation, increment any vPMCs that are configured to monitor "branch instructions retired," and update the sample period of those counters so that they will overflow at the right time. Signed-off-by: NEric Hankland <ehankland@google.com> [jmattson: - Split the code to increment "branch instructions retired" into a separate commit. - Moved/consolidated the calls to kvm_pmu_trigger_event() in the emulation of VMLAUNCH/VMRESUME to accommodate the evolution of that code. ] Fixes: f5132b01 ("KVM: Expose a version 2 architectural PMU to a guests") Signed-off-by: NJim Mattson <jmattson@google.com> Message-Id: <20211130074221.93635-7-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
Since we set the same semantic event value for the fixed counter in pmc->eventsel, returning the perf_hw_id for the fixed counter via find_fixed_event() can be painlessly replaced by pmc_perf_hw_id() with the help of pmc_is_fixed() check. Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20211130074221.93635-4-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
The find_arch_event() returns a "unsigned int" value, which is used by the pmc_reprogram_counter() to program a PERF_TYPE_HARDWARE type perf_event. The returned value is actually the kernel defined generic perf_hw_id, let's rename it to pmc_perf_hw_id() with simpler incoming parameters for better self-explanation. Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20211130074221.93635-3-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
The current pmc->eventsel for fixed counter is underutilised. The pmc->eventsel can be setup for all known available fixed counters since we have mapping between fixed pmc index and the intel_arch_events array. Either gp or fixed counter, it will simplify the later checks for consistency between eventsel and perf_hw_id. Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20211130074221.93635-2-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Because IceLake has 4 fixed performance counters but KVM only supports 3, it is possible for reprogram_fixed_counters to pass to reprogram_fixed_counter an index that is out of bounds for the fixed_pmc_events array. Ultimately intel_find_fixed_event, which is the only place that uses fixed_pmc_events, handles this correctly because it checks against the size of fixed_pmc_events anyway. Every other place operates on the fixed_counters[] array which is sized according to INTEL_PMC_MAX_FIXED. However, it is cleaner if the unsupported performance counters are culled early on in reprogram_fixed_counters. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
When !CR0_PG -> CR0_PG, vcpu->arch.cr3 becomes active, but GUEST_CR3 is still vmx->ept_identity_map_addr if EPT + !URG. So VCPU_EXREG_CR3 is considered to be dirty and GUEST_CR3 needs to be updated in this case. Reported-by: NMaxim Levitsky <mlevitsk@redhat.com> Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211216021938.11752-4-jiangshanlai@gmail.com> Fixes: c62c7bd4 ("KVM: VMX: Update vmcs.GUEST_CR3 only when the guest CR3 is dirty") Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lai Jiangshan 提交于
The host CR3 in the vcpu thread can only be changed when scheduling, so commit 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") changed vmx.c to only save it in vmx_prepare_switch_to_guest(). However, it also has to be synced in vmx_sync_vmcs_host_state() when switching VMCS. vmx_set_host_fs_gs() is called in both places, so rename it to vmx_set_vmcs_host_state() and make it update HOST_CR3. Fixes: 15ad9762 ("KVM: VMX: Save HOST_CR3 in vmx_prepare_switch_to_guest()") Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20211216021938.11752-2-jiangshanlai@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 22 12月, 2021 1 次提交
-
-
由 Sean Christopherson 提交于
Drop a check that guards triggering a posted interrupt on the currently running vCPU, and more importantly guards waking the target vCPU if triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE. If a vIRQ is delivered from asynchronous context, the target vCPU can be the currently running vCPU and can also be blocking, in which case skipping kvm_vcpu_wake_up() is effectively dropping what is supposed to be a wake event for the vCPU. The "do nothing" logic when "vcpu == running_vcpu" mostly works only because the majority of calls to ->deliver_posted_interrupt(), especially when using posted interrupts, come from synchronous KVM context. But if a device is exposed to the guest using vfio-pci passthrough, the VFIO IRQ and vCPU are bound to the same pCPU, and the IRQ is _not_ configured to use posted interrupts, wake events from the device will be delivered to KVM from IRQ context, e.g. vfio_msihandler() | |-> eventfd_signal() | |-> ... | |-> irqfd_wakeup() | |->kvm_arch_set_irq_inatomic() | |-> kvm_irq_delivery_to_apic_fast() | |-> kvm_apic_set_irq() This also aligns the non-nested and nested usage of triggering posted interrupts, and will allow for additional cleanups. Fixes: 379a3c8e ("KVM: VMX: Optimize posted-interrupt delivery for timer fastpath") Cc: stable@vger.kernel.org Reported-by: NLongpeng (Mike) <longpeng2@huawei.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20211208015236.1616697-18-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 12月, 2021 3 次提交
-
-
由 Sean Christopherson 提交于
Synthesize a triple fault if L2 guest state is invalid at the time of VM-Enter, which can happen if L1 modifies SMRAM or if userspace stuffs guest state via ioctls(), e.g. KVM_SET_SREGS. KVM should never emulate invalid guest state, since from L1's perspective, it's architecturally impossible for L2 to have invalid state while L2 is running in hardware. E.g. attempts to set CR0 or CR4 to unsupported values will either VM-Exit or #GP. Modifying vCPU state via RSM+SMRAM and ioctl() are the only paths that can trigger this scenario, as nested VM-Enter correctly rejects any attempt to enter L2 with invalid state. RSM is a straightforward case as (a) KVM follows AMD's SMRAM layout and behavior, and (b) Intel's SDM states that loading reserved CR0/CR4 bits via RSM results in shutdown, i.e. there is precedent for KVM's behavior. Following AMD's SMRAM layout is important as AMD's layout saves/restores the descriptor cache information, including CS.RPL and SS.RPL, and also defines all the fields relevant to invalid guest state as read-only, i.e. so long as the vCPU had valid state before the SMI, which is guaranteed for L2, RSM will generate valid state unless SMRAM was modified. Intel's layout saves/restores only the selector, which means that scenarios where the selector and cached RPL don't match, e.g. conforming code segments, would yield invalid guest state. Intel CPUs fudge around this issued by stuffing SS.RPL and CS.RPL on RSM. Per Intel's SDM on the "Default Treatment of RSM", paraphrasing for brevity: IF internal storage indicates that the [CPU was post-VMXON] THEN enter VMX operation (root or non-root); restore VMX-critical state as defined in Section 34.14.1; set to their fixed values any bits in CR0 and CR4 whose values must be fixed in VMX operation [unless coming from an unrestricted guest]; IF RFLAGS.VM = 0 AND (in VMX root operation OR the “unrestricted guest” VM-execution control is 0) THEN CS.RPL := SS.DPL; SS.RPL := SS.DPL; FI; restore current VMCS pointer; FI; Note that Intel CPUs also overwrite the fixed CR0/CR4 bits, whereas KVM will sythesize TRIPLE_FAULT in this scenario. KVM's behavior is allowed as both Intel and AMD define CR0/CR4 SMRAM fields as read-only, i.e. the only way for CR0 and/or CR4 to have illegal values is if they were modified by the L1 SMM handler, and Intel's SDM "SMRAM State Save Map" section states "modifying these registers will result in unpredictable behavior". KVM's ioctl() behavior is less straightforward. Because KVM allows ioctls() to be executed in any order, rejecting an ioctl() if it would result in invalid L2 guest state is not an option as KVM cannot know if a future ioctl() would resolve the invalid state, e.g. KVM_SET_SREGS, or drop the vCPU out of L2, e.g. KVM_SET_NESTED_STATE. Ideally, KVM would reject KVM_RUN if L2 contained invalid guest state, but that carries the risk of a false positive, e.g. if RSM loaded invalid guest state and KVM exited to userspace. Setting a flag/request to detect such a scenario is undesirable because (a) it's extremely unlikely to add value to KVM as a whole, and (b) KVM would need to consider ioctl() interactions with such a flag, e.g. if userspace migrated the vCPU while the flag were set. Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211207193006.120997-3-seanjc@google.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Revert a relatively recent change that set vmx->fail if the vCPU is in L2 and emulation_required is true, as that behavior is completely bogus. Setting vmx->fail and synthesizing a VM-Exit is contradictory and wrong: (a) it's impossible to have both a VM-Fail and VM-Exit (b) vmcs.EXIT_REASON is not modified on VM-Fail (c) emulation_required refers to guest state and guest state checks are always VM-Exits, not VM-Fails. For KVM specifically, emulation_required is handled before nested exits in __vmx_handle_exit(), thus setting vmx->fail has no immediate effect, i.e. KVM calls into handle_invalid_guest_state() and vmx->fail is ignored. Setting vmx->fail can ultimately result in a WARN in nested_vmx_vmexit() firing when tearing down the VM as KVM never expects vmx->fail to be set when L2 is active, KVM always reflects those errors into L1. ------------[ cut here ]------------ WARNING: CPU: 0 PID: 21158 at arch/x86/kvm/vmx/nested.c:4548 nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547 Modules linked in: CPU: 0 PID: 21158 Comm: syz-executor.1 Not tainted 5.16.0-rc3-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:nested_vmx_vmexit+0x16bd/0x17e0 arch/x86/kvm/vmx/nested.c:4547 Code: <0f> 0b e9 2e f8 ff ff e8 57 b3 5d 00 0f 0b e9 00 f1 ff ff 89 e9 80 Call Trace: vmx_leave_nested arch/x86/kvm/vmx/nested.c:6220 [inline] nested_vmx_free_vcpu+0x83/0xc0 arch/x86/kvm/vmx/nested.c:330 vmx_free_vcpu+0x11f/0x2a0 arch/x86/kvm/vmx/vmx.c:6799 kvm_arch_vcpu_destroy+0x6b/0x240 arch/x86/kvm/x86.c:10989 kvm_vcpu_destroy+0x29/0x90 arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 kvm_free_vcpus arch/x86/kvm/x86.c:11426 [inline] kvm_arch_destroy_vm+0x3ef/0x6b0 arch/x86/kvm/x86.c:11545 kvm_destroy_vm arch/x86/kvm/../../../virt/kvm/kvm_main.c:1189 [inline] kvm_put_kvm+0x751/0xe40 arch/x86/kvm/../../../virt/kvm/kvm_main.c:1220 kvm_vcpu_release+0x53/0x60 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3489 __fput+0x3fc/0x870 fs/file_table.c:280 task_work_run+0x146/0x1c0 kernel/task_work.c:164 exit_task_work include/linux/task_work.h:32 [inline] do_exit+0x705/0x24f0 kernel/exit.c:832 do_group_exit+0x168/0x2d0 kernel/exit.c:929 get_signal+0x1740/0x2120 kernel/signal.c:2852 arch_do_signal_or_restart+0x9c/0x730 arch/x86/kernel/signal.c:868 handle_signal_work kernel/entry/common.c:148 [inline] exit_to_user_mode_loop kernel/entry/common.c:172 [inline] exit_to_user_mode_prepare+0x191/0x220 kernel/entry/common.c:207 __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline] syscall_exit_to_user_mode+0x2e/0x70 kernel/entry/common.c:300 do_syscall_64+0x53/0xd0 arch/x86/entry/common.c:86 entry_SYSCALL_64_after_hwframe+0x44/0xae Fixes: c8607e4a ("KVM: x86: nVMX: don't fail nested VM entry on invalid guest state if !from_vmentry") Reported-by: syzbot+f1d2136db9c80d4733e8@syzkaller.appspotmail.com Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211207193006.120997-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Marc Orr 提交于
The kvm_run struct's if_flag is a part of the userspace/kernel API. The SEV-ES patches failed to set this flag because it's no longer needed by QEMU (according to the comment in the source code). However, other hypervisors may make use of this flag. Therefore, set the flag for guests with encrypted registers (i.e., with guest_state_protected set). Fixes: f1c6366e ("KVM: SVM: Add required changes to support intercepts under SEV-ES") Signed-off-by: NMarc Orr <marcorr@google.com> Message-Id: <20211209155257.128747-1-marcorr@google.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com>
-
- 09 12月, 2021 2 次提交
-
-
由 Sean Christopherson 提交于
Move the WARN sanity checks out of the PI descriptor update loop so as not to spam the kernel log if the condition is violated and the update takes multiple attempts due to another writer. This also eliminates a few extra uops from the retry path. Technically not checking every attempt could mean KVM will now fail to WARN in a scenario that would have failed before, but any such failure would be inherently racy as some other agent (CPU or device) would have to concurrent modify the PI descriptor. Add a helper to handle the actual write and more importantly to document why the write may need to be retried. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-4-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Add a memory barrier between writing vcpu->requests and reading vcpu->guest_mode to ensure the read is ordered after the write when (potentially) delivering an IRQ to L2 via nested posted interrupt. If the request were to be completed after reading vcpu->mode, it would be possible for the target vCPU to enter the guest without posting the interrupt and without handling the event request. Note, the barrier is only for documentation since atomic operations are serializing on x86. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Fixes: 6b697711 ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2") Fixes: 705699a1 ("KVM: nVMX: Enable nested posted interrupt processing") Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20211208015236.1616697-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 12月, 2021 1 次提交
-
-
由 Vitaly Kuznetsov 提交于
Updating MSR bitmap for L2 is not cheap and rearly needed. TLFS for Hyper-V offers 'Enlightened MSR Bitmap' feature which allows L1 hypervisor to inform L0 when it changes MSR bitmap, this eliminates the need to examine L1's MSR bitmap for L2 every time when 'real' MSR bitmap for L2 gets constructed. Use 'vmx->nested.msr_bitmap_changed' flag to implement the feature. Note, KVM already uses 'Enlightened MSR bitmap' feature when it runs as a nested hypervisor on top of Hyper-V. The newly introduced feature is going to be used by Hyper-V guests on KVM. When the feature is enabled for Win10+WSL2, it shaves off around 700 CPU cycles from a nested vmexit cost (tight cpuid loop test). Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20211129094704.326635-5-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-