- 08 6月, 2022 13 次提交
-
-
由 Sean Christopherson 提交于
Add kvm_caps to hold a variety of capabilites and defaults that aren't handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce the amount of boilerplate code required to add a new feature. The vast majority (all?) of the caps interact with vendor code and are written only during initialization, i.e. should be tagged __read_mostly, declared extern in x86.h, and exported. No functional change intended. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220524135624.22988-4-chenyi.qiang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
All gp or fixed counters have been reprogrammed using PERF_TYPE_RAW, which means that the table that maps perf_hw_id to event select values is no longer useful, at least for AMD. For Intel, the logic to check if the pmu event reported by Intel cpuid is not available is still required, in which case pmc_perf_hw_id() could be renamed to hw_event_is_unavail() and a bool value is returned to replace the semantics of "PERF_COUNT_HW_MAX+1". Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20220518132512.37864-12-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Since reprogram_counter(), reprogram_{gp, fixed}_counter() currently have the same incoming parameter "struct kvm_pmc *pmc", the callers can simplify the conetxt by using uniformly exported interface, which makes reprogram_ {gp, fixed}_counter() static and eliminates EXPORT_SYMBOL_GPL. Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20220518132512.37864-8-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
Because inside reprogram_gp_counter() it is bound to assign the requested eventel to pmc->eventsel, this assignment step can be moved forward, thus simplifying the passing of parameters to "struct kvm_pmc *pmc" only. No functional change intended. Signed-off-by: NLike Xu <likexu@tencent.com> Message-Id: <20220518132512.37864-6-likexu@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Whenever an MSR is part of KVM_GET_MSR_INDEX_LIST, it has to be always retrievable and settable with KVM_GET_MSR and KVM_SET_MSR. Accept the PMU MSRs unconditionally in intel_is_valid_msr, if the access was host-initiated. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maciej S. Szmigiero 提交于
A NMI that L1 wants to inject into its L2 should be directly re-injected, without causing L0 side effects like engaging NMI blocking for L1. It's also worth noting that in this case it is L1 responsibility to track the NMI window status for its L2 guest. Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <f894d13501cd48157b3069a4b4c7369575ddb60e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
In the IRQ injection tracepoint, differentiate between Hard IRQs and Soft "IRQs", i.e. interrupts that are reinjected after incomplete delivery of a software interrupt from an INTn instruction. Tag reinjected interrupts as such, even though the information is usually redundant since soft interrupts are only ever reinjected by KVM. Though rare in practice, a hard IRQ can be reinjected. Signed-off-by: NSean Christopherson <seanjc@google.com> [MSS: change "kvm_inj_virq" event "reinjected" field type to bool] Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <9664d49b3bd21e227caa501cff77b0569bebffe2.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Re-inject INTn software interrupts instead of retrying the instruction if the CPU encountered an intercepted exception while vectoring the INTn, e.g. if KVM intercepted a #PF when utilizing shadow paging. Retrying the instruction is architecturally wrong e.g. will result in a spurious #DB if there's a code breakpoint on the INT3/O, and lack of re-injection also breaks nested virtualization, e.g. if L1 injects a software interrupt and vectoring the injected interrupt encounters an exception that is intercepted by L0 but not L1. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <1654ad502f860948e4f2d57b8bd881d67301f785.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Re-inject INT3/INTO instead of retrying the instruction if the CPU encountered an intercepted exception while vectoring the software exception, e.g. if vectoring INT3 encounters a #PF and KVM is using shadow paging. Retrying the instruction is architecturally wrong, e.g. will result in a spurious #DB if there's a code breakpoint on the INT3/O, and lack of re-injection also breaks nested virtualization, e.g. if L1 injects a software exception and vectoring the injected exception encounters an exception that is intercepted by L0 but not L1. Due to, ahem, deficiencies in the SVM architecture, acquiring the next RIP may require flowing through the emulator even if NRIPS is supported, as the CPU clears next_rip if the VM-Exit is due to an exception other than "exceptions caused by the INT3, INTO, and BOUND instructions". To deal with this, "skip" the instruction to calculate next_rip (if it's not already known), and then unwind the RIP write and any side effects (RFLAGS updates). Save the computed next_rip and use it to re-stuff next_rip if injection doesn't complete. This allows KVM to do the right thing if next_rip was known prior to injection, e.g. if L1 injects a soft event into L2, and there is no backing INTn instruction, e.g. if L1 is injecting an arbitrary event. Note, it's impossible to guarantee architectural correctness given SVM's architectural flaws. E.g. if the guest executes INTn (no KVM injection), an exit occurs while vectoring the INTn, and the guest modifies the code stream while the exit is being handled, KVM will compute the incorrect next_rip due to "skipping" the wrong instruction. A future enhancement to make this less awful would be for KVM to detect that the decoded instruction is not the correct INTn and drop the to-be-injected soft event (retrying is a lesser evil compared to shoving the wrong RIP on the exception stack). Reported-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <65cb88deab40bc1649d509194864312a89bbe02e.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
If NRIPS is supported in hardware but disabled in KVM, set next_rip to the next RIP when advancing RIP as part of emulating INT3 injection. There is no flag to tell the CPU that KVM isn't using next_rip, and so leaving next_rip is left as is will result in the CPU pushing garbage onto the stack when vectoring the injected event. Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3") Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <cd328309a3b88604daa2359ad56f36cb565ce2d4.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Unwind the RIP advancement done by svm_queue_exception() when injecting an INT3 ultimately "fails" due to the CPU encountering a VM-Exit while vectoring the injected event, even if the exception reported by the CPU isn't the same event that was injected. If vectoring INT3 encounters an exception, e.g. #NP, and vectoring the #NP encounters an intercepted exception, e.g. #PF when KVM is using shadow paging, then the #NP will be reported as the event that was in-progress. Note, this is still imperfect, as it will get a false positive if the INT3 is cleanly injected, no VM-Exit occurs before the IRET from the INT3 handler in the guest, the instruction following the INT3 generates an exception (directly or indirectly), _and_ vectoring that exception encounters an exception that is intercepted by KVM. The false positives could theoretically be solved by further analyzing the vectoring event, e.g. by comparing the error code against the expected error code were an exception to occur when vectoring the original injected exception, but SVM without NRIPS is a complete disaster, trying to make it 100% correct is a waste of time. Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Fixes: 66b7138f ("KVM: SVM: Emulate nRIP feature when reinjecting INT3") Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <450133cf0a026cb9825a2ff55d02cb136a1cb111.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maciej S. Szmigiero 提交于
Don't BUG/WARN on interrupt injection due to GIF being cleared, since it's trivial for userspace to force the situation via KVM_SET_VCPU_EVENTS (even if having at least a WARN there would be correct for KVM internally generated injections). kernel BUG at arch/x86/kvm/svm/svm.c:3386! invalid opcode: 0000 [#1] SMP CPU: 15 PID: 926 Comm: smm_test Not tainted 5.17.0-rc3+ #264 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:svm_inject_irq+0xab/0xb0 [kvm_amd] Code: <0f> 0b 0f 1f 00 0f 1f 44 00 00 80 3d ac b3 01 00 00 55 48 89 f5 53 RSP: 0018:ffffc90000b37d88 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88810a234ac0 RCX: 0000000000000006 RDX: 0000000000000000 RSI: ffffc90000b37df7 RDI: ffff88810a234ac0 RBP: ffffc90000b37df7 R08: ffff88810a1fa410 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: ffff888109571000 R14: ffff88810a234ac0 R15: 0000000000000000 FS: 0000000001821380(0000) GS:ffff88846fdc0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f74fc550008 CR3: 000000010a6fe000 CR4: 0000000000350ea0 Call Trace: <TASK> inject_pending_event+0x2f7/0x4c0 [kvm] kvm_arch_vcpu_ioctl_run+0x791/0x17a0 [kvm] kvm_vcpu_ioctl+0x26d/0x650 [kvm] __x64_sys_ioctl+0x82/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> Fixes: 219b65dc ("KVM: SVM: Improve nested interrupt injection") Cc: stable@vger.kernel.org Co-developed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <35426af6e123cbe91ec7ce5132ce72521f02b1b5.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maciej S. Szmigiero 提交于
The next_rip field of a VMCB is *not* an output-only field for a VMRUN. This field value (instead of the saved guest RIP) in used by the CPU for the return address pushed on stack when injecting a software interrupt or INT3 or INTO exception. Make sure this field gets synced from vmcb12 to vmcb02 when entering L2 or loading a nested state and NRIPS is exposed to L1. If NRIPS is supported in hardware but not exposed to L1 (nrips=0 or hidden by userspace), stuff vmcb02's next_rip from the new L2 RIP to emulate a !NRIPS CPU (which saves RIP on the stack as-is). Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Co-developed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaciej S. Szmigiero <maciej.szmigiero@oracle.com> Message-Id: <c2e0a3d78db3ae30530f11d4e9254b452a89f42b.1651440202.git.maciej.szmigiero@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 6月, 2022 1 次提交
-
-
由 Maxim Levitsky 提交于
SVM uses a per-cpu variable to cache the current value of the tsc scaling multiplier msr on each cpu. Commit 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") broke this caching logic. Refactor the code so that all TSC scaling multiplier writes go through a single function which checks and updates the cache. This fixes the following scenario: 1. A CPU runs a guest with some tsc scaling ratio. 2. New guest with different tsc scaling ratio starts on this CPU and terminates almost immediately. This ensures that the short running guest had set the tsc scaling ratio just once when it was set via KVM_SET_TSC_KHZ. Due to the bug, the per-cpu cache is not updated. 3. The original guest continues to run, it doesn't restore the msr value back to its own value, because the cache matches, and thus continues to run with a wrong tsc scaling ratio. Fixes: 1ab9287a ("KVM: X86: Add vendor callbacks for writing the TSC multiplier") Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220606181149.103072-1-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 5月, 2022 1 次提交
-
-
由 Ashish Kalra 提交于
For some sev ioctl interfaces, the length parameter that is passed maybe less than or equal to SEV_FW_BLOB_MAX_SIZE, but larger than the data that PSP firmware returns. In this case, kmalloc will allocate memory that is the size of the input rather than the size of the data. Since PSP firmware doesn't fully overwrite the allocated buffer, these sev ioctl interface may return uninitialized kernel slab memory. Reported-by: NAndy Nguyen <theflow@google.com> Suggested-by: NDavid Rientjes <rientjes@google.com> Suggested-by: NPeter Gonda <pgonda@google.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: eaf78265 ("KVM: SVM: Move SEV code to separate file") Fixes: 2c07ded0 ("KVM: SVM: add support for SEV attestation command") Fixes: 4cfdd47d ("KVM: SVM: Add KVM_SEV SEND_START command") Fixes: d3d1af85 ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command") Fixes: eba04b20 ("KVM: x86: Account a variety of miscellaneous allocations") Signed-off-by: NAshish Kalra <ashish.kalra@amd.com> Reviewed-by: NPeter Gonda <pgonda@google.com> Message-Id: <20220516154310.3685678-1-Ashish.Kalra@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 12 5月, 2022 1 次提交
-
-
由 Kai Huang 提交于
Intel Multi-Key Total Memory Encryption (MKTME) repurposes couple of high bits of physical address bits as 'KeyID' bits. Intel Trust Domain Extentions (TDX) further steals part of MKTME KeyID bits as TDX private KeyID bits. TDX private KeyID bits cannot be set in any mapping in the host kernel since they can only be accessed by software running inside a new CPU isolated mode. And unlike to AMD's SME, host kernel doesn't set any legacy MKTME KeyID bits to any mapping either. Therefore, it's not legitimate for KVM to set any KeyID bits in SPTE which maps guest memory. KVM maintains shadow_zero_check bits to represent which bits must be zero for SPTE which maps guest memory. MKTME KeyID bits should be set to shadow_zero_check. Currently, shadow_me_mask is used by AMD to set the sme_me_mask to SPTE, and shadow_me_shadow is excluded from shadow_zero_check. So initializing shadow_me_mask to represent all MKTME keyID bits doesn't work for VMX (as oppositely, they must be set to shadow_zero_check). Introduce a new 'shadow_me_value' to replace existing shadow_me_mask, and repurpose shadow_me_mask as 'all possible memory encryption bits'. The new schematic of them will be: - shadow_me_value: the memory encryption bit(s) that will be set to the SPTE (the original shadow_me_mask). - shadow_me_mask: all possible memory encryption bits (which is a super set of shadow_me_value). - For now, shadow_me_value is supposed to be set by SVM and VMX respectively, and it is a constant during KVM's life time. This perhaps doesn't fit MKTME but for now host kernel doesn't support it (and perhaps will never do). - Bits in shadow_me_mask are set to shadow_zero_check, except the bits in shadow_me_value. Introduce a new helper kvm_mmu_set_me_spte_mask() to initialize them. Replace shadow_me_mask with shadow_me_value in almost all code paths, except the one in PT64_PERM_MASK, which is used by need_remote_flush() to determine whether remote TLB flush is needed. This should still use shadow_me_mask as any encryption bit change should need a TLB flush. And for AMD, move initializing shadow_me_value/shadow_me_mask from kvm_mmu_reset_all_pte_masks() to svm_hardware_setup(). Signed-off-by: NKai Huang <kai.huang@intel.com> Message-Id: <f90964b93a3398b1cf1c56f510f3281e0709e2ab.1650363789.git.kai.huang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 07 5月, 2022 1 次提交
-
-
由 Peter Gonda 提交于
svm_vm_migrate_from() uses sev_lock_vcpus_for_migration() to lock all source and target vcpu->locks. Unfortunately there is an 8 subclass limit, so a new subclass cannot be used for each vCPU. Instead maintain ownership of the first vcpu's mutex.dep_map using a role specific subclass: source vs target. Release the other vcpu's mutex.dep_maps. Fixes: b5663931 ("KVM: SEV: Add support for SEV intra host migration") Reported-by: John Sperbeck<jsperbeck@google.com> Suggested-by: NDavid Rientjes <rientjes@google.com> Suggested-by: NSean Christopherson <seanjc@google.com> Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NPeter Gonda <pgonda@google.com> Message-Id: <20220502165807.529624-1-pgonda@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 03 5月, 2022 1 次提交
-
-
由 Kyle Huey 提交于
Zen renumbered some of the performance counters that correspond to the well known events in perf_hw_id. This code in KVM was never updated for that, so guest that attempt to use counters on Zen that correspond to the pre-Zen perf_hw_id values will silently receive the wrong values. This has been observed in the wild with rr[0] when running in Zen 3 guests. rr uses the retired conditional branch counter 00d1 which is incorrectly recognized by KVM as PERF_COUNT_HW_STALLED_CYCLES_BACKEND. [0] https://rr-project.org/Signed-off-by: NKyle Huey <me@kylehuey.com> Message-Id: <20220503050136.86298-1-khuey@kylehuey.com> Cc: stable@vger.kernel.org [Check guest family, not host. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 30 4月, 2022 5 次提交
-
-
由 Suravee Suthikulpanit 提交于
This can help identify potential performance issues when handles AVIC incomplete IPI due vCPU not running. Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220420154954.19305-3-suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Currently, an AVIC-enabled VM suffers from performance bottleneck when scaling to large number of vCPUs for I/O intensive workloads. In such case, a vCPU often executes halt instruction to get into idle state waiting for interrupts, in which KVM would de-schedule the vCPU from physical CPU. When AVIC HW tries to deliver interrupt to the halting vCPU, it would result in AVIC incomplete IPI #vmexit to notify KVM to reschedule the target vCPU into running state. Investigation has shown the main hotspot is in the kvm_apic_match_dest() in the following call stack where it tries to find target vCPUs corresponding to the information in the ICRH/ICRL registers. - handle_exit - svm_invoke_exit_handler - avic_incomplete_ipi_interception - kvm_apic_match_dest However, AVIC provides hints in the #vmexit info, which can be used to retrieve the destination guest physical APIC ID. In addition, since QEMU defines guest physical APIC ID to be the same as vCPU ID, it can be used to quickly identify the target vCPU to deliver IPI, and avoid the overhead from searching through all vCPUs to match the target vCPU. Reviewed-by: NMaxim Levitsky <mlevitsk@redhat.com> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Message-Id: <20220420154954.19305-2-suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
root_role.level is always the same value as shadow_level: - it's kvm_mmu_get_tdp_level(vcpu) when going through init_kvm_tdp_mmu - it's the level argument when going through kvm_init_shadow_ept_mmu - it's assigned directly from new_role.base.level when going through shadow_mmu_init_context Remove the duplication and get the level directly from the role. Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Replace the per-vendor hack-a-fix for KVM's #PF => #PF => #DF workaround with an explicit, common workaround in kvm_inject_emulated_page_fault(). Aside from being a hack, the current approach is brittle and incomplete, e.g. nSVM's KVM_SET_NESTED_STATE fails to set ->inject_page_fault(), and nVMX fails to apply the workaround when VMX is intercepting #PF due to allow_smaller_maxphyaddr=1. Signed-off-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Babu Moger 提交于
The TSC_AUX virtualization feature allows AMD SEV-ES guests to securely use TSC_AUX (auxiliary time stamp counter data) in the RDTSCP and RDPID instructions. The TSC_AUX value is set using the WRMSR instruction to the TSC_AUX MSR (0xC0000103). It is read by the RDMSR, RDTSCP and RDPID instructions. If the read/write of the TSC_AUX MSR is intercepted, then RDTSCP and RDPID must also be intercepted when TSC_AUX virtualization is present. However, the RDPID instruction can't be intercepted. This means that when TSC_AUX virtualization is present, RDTSCP and TSC_AUX MSR read/write must not be intercepted for SEV-ES (or SEV-SNP) guests. Signed-off-by: NBabu Moger <babu.moger@amd.com> Message-Id: <165040164424.1399644.13833277687385156344.stgit@bmoger-ubuntu> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 22 4月, 2022 4 次提交
-
-
由 Mingwei Zhang 提交于
Flush the CPU caches when memory is reclaimed from an SEV guest (where reclaim also includes it being unmapped from KVM's memslots). Due to lack of coherency for SEV encrypted memory, failure to flush results in silent data corruption if userspace is malicious/broken and doesn't ensure SEV guest memory is properly pinned and unpinned. Cache coherency is not enforced across the VM boundary in SEV (AMD APM vol.2 Section 15.34.7). Confidential cachelines, generated by confidential VM guests have to be explicitly flushed on the host side. If a memory page containing dirty confidential cachelines was released by VM and reallocated to another user, the cachelines may corrupt the new user at a later time. KVM takes a shortcut by assuming all confidential memory remain pinned until the end of VM lifetime. Therefore, KVM does not flush cache at mmu_notifier invalidation events. Because of this incorrect assumption and the lack of cache flushing, malicous userspace can crash the host kernel: creating a malicious VM and continuously allocates/releases unpinned confidential memory pages when the VM is running. Add cache flush operations to mmu_notifier operations to ensure that any physical memory leaving the guest VM get flushed. In particular, hook mmu_notifier_invalidate_range_start and mmu_notifier_release events and flush cache accordingly. The hook after releasing the mmu lock to avoid contention with other vCPUs. Cc: stable@vger.kernel.org Suggested-by: NSean Christpherson <seanjc@google.com> Reported-by: NMingwei Zhang <mizhang@google.com> Signed-off-by: NMingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-4-mizhang@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Mingwei Zhang 提交于
Use clflush_cache_range() to flush the confidential memory when SME_COHERENT is supported in AMD CPU. Cache flush is still needed since SME_COHERENT only support cache invalidation at CPU side. All confidential cache lines are still incoherent with DMA devices. Cc: stable@vger.kerel.org Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA") Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-3-mizhang@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Rework sev_flush_guest_memory() to explicitly handle only a single page, and harden it to fall back to WBINVD if VM_PAGE_FLUSH fails. Per-page flushing is currently used only to flush the VMSA, and in its current form, the helper is completely broken with respect to flushing actual guest memory, i.e. won't work correctly for an arbitrary memory range. VM_PAGE_FLUSH takes a host virtual address, and is subject to normal page walks, i.e. will fault if the address is not present in the host page tables or does not have the correct permissions. Current AMD CPUs also do not honor SMAP overrides (undocumented in kernel versions of the APM), so passing in a userspace address is completely out of the question. In other words, KVM would need to manually walk the host page tables to get the pfn, ensure the pfn is stable, and then use the direct map to invoke VM_PAGE_FLUSH. And the latter might not even work, e.g. if userspace is particularly evil/clever and backs the guest with Secret Memory (which unmaps memory from the direct map). Signed-off-by: NSean Christopherson <seanjc@google.com> Fixes: add5e2f0 ("KVM: SVM: Add support for the SEV-ES VMSA") Reported-by: NMingwei Zhang <mizhang@google.com> Cc: stable@vger.kernel.org Signed-off-by: NMingwei Zhang <mizhang@google.com> Message-Id: <20220421031407.2516575-2-mizhang@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
NMI-watchdog is one of the favorite features of kernel developers, but it does not work in AMD guest even with vPMU enabled and worse, the system misrepresents this capability via /proc. This is a PMC emulation error. KVM does not pass the latest valid value to perf_event in time when guest NMI-watchdog is running, thus the perf_event corresponding to the watchdog counter will enter the old state at some point after the first guest NMI injection, forcing the hardware register PMC0 to be constantly written to 0x800000000001. Meanwhile, the running counter should accurately reflect its new value based on the latest coordinated pmc->counter (from vPMC's point of view) rather than the value written directly by the guest. Fixes: 168d918f ("KVM: x86: Adjust counter sample period after a wrmsr") Reported-by: NDongli Cao <caodongli@kingsoft.com> Signed-off-by: NLike Xu <likexu@tencent.com> Reviewed-by: NYanan Wang <wangyanan55@huawei.com> Tested-by: NYanan Wang <wangyanan55@huawei.com> Reviewed-by: NJim Mattson <jmattson@google.com> Message-Id: <20220409015226.38619-1-likexu@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 14 4月, 2022 3 次提交
-
-
由 Peter Gonda 提交于
If an SEV-ES guest requests termination, exit to userspace with KVM_EXIT_SYSTEM_EVENT and a dedicated SEV_TERM type instead of -EINVAL so that userspace can take appropriate action. See AMD's GHCB spec section '4.1.13 Termination Request' for more details. Suggested-by: NSean Christopherson <seanjc@google.com> Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NPeter Gonda <pgonda@google.com> Reported-by: Nkernel test robot <lkp@intel.com> Message-Id: <20220407210233.782250-1-pgonda@google.com> [Add documentatino. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Remove WARNs that sanity check that KVM never lets a triple fault for L2 escape and incorrectly end up in L1. In normal operation, the sanity check is perfectly valid, but it incorrectly assumes that it's impossible for userspace to induce KVM_REQ_TRIPLE_FAULT without bouncing through KVM_RUN (which guarantees kvm_check_nested_state() will see and handle the triple fault). The WARN can currently be triggered if userspace injects a machine check while L2 is active and CR4.MCE=0. And a future fix to allow save/restore of KVM_REQ_TRIPLE_FAULT, e.g. so that a synthesized triple fault isn't lost on migration, will make it trivially easy for userspace to trigger the WARN. Clearing KVM_REQ_TRIPLE_FAULT when forcibly leaving guest mode is tempting, but wrong, especially if/when the request is saved/restored, e.g. if userspace restores events (including a triple fault) and then restores nested state (which may forcibly leave guest mode). Ignoring the fact that KVM doesn't currently provide the necessary APIs, it's userspace's responsibility to manage pending events during save/restore. ------------[ cut here ]------------ WARNING: CPU: 7 PID: 1399 at arch/x86/kvm/vmx/nested.c:4522 nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Modules linked in: kvm_intel kvm irqbypass CPU: 7 PID: 1399 Comm: state_test Not tainted 5.17.0-rc3+ #808 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:nested_vmx_vmexit+0x7fe/0xd90 [kvm_intel] Call Trace: <TASK> vmx_leave_nested+0x30/0x40 [kvm_intel] vmx_set_nested_state+0xca/0x3e0 [kvm_intel] kvm_arch_vcpu_ioctl+0xf49/0x13e0 [kvm] kvm_vcpu_ioctl+0x4b9/0x660 [kvm] __x64_sys_ioctl+0x83/0xb0 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> ---[ end trace 0000000000000000 ]--- Fixes: cb6a32c2 ("KVM: x86: Handle triple fault in L2 without killing L1") Cc: stable@vger.kernel.org Cc: Chenyi Qiang <chenyi.qiang@intel.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220407002315.78092-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
The pmu_ops should be moved to kvm_x86_init_ops and tagged as __initdata. That'll save those precious few bytes, and more importantly make the original ops unreachable, i.e. make it harder to sneak in post-init modification bugs. Suggested-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NLike Xu <likexu@tencent.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20220329235054.3534728-4-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 12 4月, 2022 1 次提交
-
-
由 Suravee Suthikulpanit 提交于
Since current AVIC implementation cannot support encrypted memory, inhibit AVIC for SEV-enabled guest. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Message-Id: <20220408133710.54275-1-suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 06 4月, 2022 2 次提交
-
-
由 Tom Lendacky 提交于
The save area for SEV-ES/SEV-SNP guests, as used by the hardware, is different from the save area of a non SEV-ES/SEV-SNP guest. This is the first step in defining the multiple save areas to keep them separate and ensuring proper operation amongst the different types of guests. Create an SEV-ES/SEV-SNP save area and adjust usage to the new save area definition where needed. Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com> Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NVenu Busireddy <venu.busireddy@oracle.com> Link: https://lore.kernel.org/r/20220405182743.308853-1-brijesh.singh@amd.com
-
由 Brijesh Singh 提交于
The hypervisor uses the sev_features field (offset 3B0h) in the Save State Area to control the SEV-SNP guest features such as SNPActive, vTOM, ReflectVC etc. An SEV-SNP guest can read the sev_features field through the SEV_STATUS MSR. While at it, update dump_vmcb() to log the VMPL level. See APM2 Table 15-34 and B-4 for more details. Signed-off-by: NBrijesh Singh <brijesh.singh@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NVenu Busireddy <venu.busireddy@oracle.com> Link: https://lore.kernel.org/r/20220307213356.2797205-2-brijesh.singh@amd.com
-
- 05 4月, 2022 1 次提交
-
-
由 Peter Gonda 提交于
Add resched to avoid warning from sev_clflush_pages() with large number of pages. Signed-off-by: NPeter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Message-Id: <20220330164306.2376085-1-pgonda@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 02 4月, 2022 6 次提交
-
-
由 Maxim Levitsky 提交于
Inhibit the AVIC of the vCPU that is running nested for the duration of the nested run, so that all interrupts arriving from both its vCPU siblings and from KVM are delivered using normal IPIs and cause that vCPU to vmexit. Note that unlike normal AVIC inhibition, there is no need to update the AVIC mmio memslot, because the nested guest uses its own set of paging tables. That also means that AVIC doesn't need to be inhibited VM wide. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-7-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
In case L1 enables vGIF for L2, the L2 cannot affect L1's GIF, regardless of STGI/CLGI intercepts, and since VM entry enables GIF, this means that L1's GIF is always 1 while L2 is running. Thus in this case leave L1's vGIF in vmcb01, while letting L2 control the vGIF thus implementing nested vGIF. Also allow KVM to toggle L1's GIF during nested entry/exit by always using vmcb01. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-5-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
Expose the pause filtering and threshold in the guest CPUID and support PAUSE filtering when possible: - If the L0 doesn't intercept PAUSE (cpu_pm=on), then allow L1 to have full control over PAUSE filtering. - if the L1 doesn't intercept PAUSE, use host values and update the adaptive count/threshold even when running nested. - Otherwise always exit to L1; it is not really possible to merge the fields correctly. It is expected that in this case, userspace will not enable this feature in the guest CPUID, to avoid having the guest update both fields pointlessly. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-4-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
This was tested with kvm-unit-test that was developed for this purpose. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-3-mlevitsk@redhat.com> [Copy all of DEBUGCTL except for reserved bits. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
When L2 is running without LBR virtualization, we should ensure that L1's LBR msrs continue to update as usual. Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322174050.241850-2-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
KVM always uses vgif when allowed, thus there is no need to query current vmcb for it Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20220322172449.235575-9-mlevitsk@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-