- 13 2月, 2020 7 次提交
-
-
由 Paolo Bonzini 提交于
The check cpu->hv_clock.system_time < 0 is redundant since system_time is a u64 and hence can never be less than zero. But what was actually meant is to check that the result is positive, since kernel_ns and v->kvm->arch.kvmclock_offset are both s64. Reported-by: NColin King <colin.king@canonical.com> Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Addresses-Coverity: ("Macro compares unsigned to 0") Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to size the arrays that track guest pages table information, i.e. using a "max levels" of 4 causes KVM to access garbage beyond the end of an array when querying state for level 5 entries. E.g. FNAME(gpte_changed) will read garbage and most likely return %true for a level 5 entry, soft-hanging the guest because FNAME(fetch) will restart the guest instead of creating SPTEs because it thinks the guest PTE has changed. Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS gets to stay "4" for the PTTYPE_EPT case. Fixes: 855feb67 ("KVM: MMU: Add 5 level EPT & Shadow page table support.") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Hardcode the EPT page-walk level for L2 to be 4 levels, as KVM's MMU currently also hardcodes the page walk level for nested EPT to be 4 levels. The L2 guest is all but guaranteed to soft hang on its first instruction when L1 is using EPT, as KVM will construct 4-level page tables and then tell hardware to use 5-level page tables. Fixes: 855feb67 ("KVM: MMU: Add 5 level EPT & Shadow page table support.") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Miaohe Lin 提交于
Fix some typos in the comments. Also fix coding style. [Sean Christopherson rewrites the comment of write_fault_to_shadow_pgtable field in struct kvm_vcpu_arch.] Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Wrap calls to ->page_fault() with a small shim to directly invoke the TDP fault handler when the kernel is using retpolines and TDP is being used. Single out the TDP fault handler and annotate the TDP path as likely to coerce the compiler into preferring it over the indirect function call. Rename tdp_page_fault() to kvm_tdp_page_fault(), as it's exposed outside of mmu.c to allow inlining the shim. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Miaohe Lin 提交于
kvm_make_request() provides smp_wmb() so pending_events changes are guaranteed to be visible. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Miaohe Lin 提交于
The KVM_REQ_EVENT request is already made in kvm_set_rflags(). We should not make it again. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 12 2月, 2020 4 次提交
-
-
由 Oliver Upton 提交于
KVM allows the deferral of exception payloads when a vCPU is in guest mode to allow the L1 hypervisor to intercept certain events (#PF, #DB) before register state has been modified. However, this behavior is incompatible with the KVM_{GET,SET}_VCPU_EVENTS ABI, as userspace expects register state to have been immediately modified. Userspace may opt-in for the payload deferral behavior with the KVM_CAP_EXCEPTION_PAYLOAD per-VM capability. As such, kvm_multiple_exception() will immediately manipulate guest registers if the capability hasn't been requested. Since the deferral is only necessary if a userspace ioctl were to be serviced at the same as a payload bearing exception is recognized, this behavior can be relaxed. Instead, opportunistically defer the payload from kvm_multiple_exception() and deliver the payload before completing a KVM_GET_VCPU_EVENTS ioctl. Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Oliver Upton 提交于
SDM 27.3.4 states that the 'pending debug exceptions' VMCS field will be populated if a VM-exit caused by an INIT signal takes priority over a debug-trap. Emulate this behavior when synthesizing an INIT signal VM-exit into L1. Fixes: 4b9852f4 ("KVM: x86: Fix INIT signal handling in various CPU states") Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Oliver Upton 提交于
KVM defines the #DB payload as compatible with the 'pending debug exceptions' field under VMX, not DR6. Mask off bit 12 when applying the payload to DR6, as it is reserved on DR6 but not the 'pending debug exceptions' field. Fixes: f10c729f ("kvm: vmx: Defer setting of DR6 until #DB delivery") Signed-off-by: NOliver Upton <oupton@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Do not initialize the microcode version at RESET or INIT, only on vCPU creation. Microcode updates are not lost during INIT, and exact behavior across a warm RESET is not specified by the architecture. Since we do not support a microcode update directly from the hypervisor, but only as a result of userspace setting the microcode version MSR, it's simpler for userspace if we do nothing in KVM and let userspace emulate behavior for RESET as it sees fit. Userspace can tie the fix to the availability of MSR_IA32_UCODE_REV in the list of emulated MSRs. Reported-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 2月, 2020 2 次提交
-
-
由 Al Viro 提交于
The former contains nothing but a pointer to an array of the latter... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Eric Sandeen 提交于
Unused now. Signed-off-by: NEric Sandeen <sandeen@redhat.com> Acked-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 07 2月, 2020 1 次提交
-
-
由 Tony W Wang-oc 提交于
When a system suspends, the local APIC is disabled in the suspend sequence, but the IOAPIC is left in the current state. This means unmasked interrupt lines stay unmasked. This is usually the case for IOAPIC pin 9 to which the ACPI interrupt is connected. That means that in suspended state the IOAPIC can respond to an external interrupt, e.g. the wakeup via keyboard/RTC/ACPI, but the interrupt message cannot be handled by the disabled local APIC. As a consequence the Remote IRR bit is set, but the local APIC does not send an EOI to acknowledge it. This causes the affected interrupt line to become stale and the stale Remote IRR bit will cause a hang when __synchronize_hardirq() is invoked for that interrupt line. To prevent this, mask all IOAPIC entries before disabling the local APIC. The resume code already has the unmask operation inside. [ tglx: Massaged changelog ] Signed-off-by: NTony W Wang-oc <TonyWWang-oc@zhaoxin.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/1579076539-7267-1-git-send-email-TonyWWang-oc@zhaoxin.com
-
- 05 2月, 2020 26 次提交
-
-
由 Miaohe Lin 提交于
The function vmx_decache_cr0_guest_bits() is only called below its implementation. So this is meaningless and should be removed. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Re-add code to mark CR4.UMIP as reserved if UMIP is not supported by the host. The UMIP handling was unintentionally dropped during a recent refactoring. Not flagging CR4.UMIP allows the guest to set its CR4.UMIP regardless of host support or userspace desires. On CPUs with UMIP support, including emulated UMIP, this allows the guest to enable UMIP against the wishes of the userspace VMM. On CPUs without any form of UMIP, this results in a failed VM-Enter due to invalid guest state. Fixes: 345599f9 ("KVM: x86: Add macro to ensure reserved cr4 bits checks stay in sync") Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Three of the feature bits in vmxfeatures.h have names that are different from the Intel SDM. The names have been adjusted recently in KVM but they were using the old name in the tip tree's x86/cpu branch. Adjust for consistency. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Userspace that does not know about the AMD_IBRS bit might still allow the guest to protect itself with MSR_IA32_SPEC_CTRL using the Intel SPEC_CTRL bit. However, svm.c disallows this and will cause a #GP in the guest when writing to the MSR. Fix this by loosening the test and allowing the Intel CPUID bit, and in fact allow the AMD_STIBP bit as well since it allows writing to MSR_IA32_SPEC_CTRL too. Reported-by: NZhiyi Guo <zhguo@redhat.com> Analyzed-by: NDr. David Alan Gilbert <dgilbert@redhat.com> Analyzed-by: NLaszlo Ersek <lersek@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Eric Hankland 提交于
Correct the logic in intel_pmu_set_msr() for fixed and general purpose counters. This was recently changed to set pmc->counter without taking in to account the value of pmc_read_counter() which will be incorrect if the counter is currently running and non-zero; this changes back to the old logic which accounted for the value of currently running counters. Signed-off-by: NEric Hankland <ehankland@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
Sane L1 hypervisors are not supposed to turn any of the unsupported VMX controls on for its guests and nested_vmx_check_controls() checks for that. This is, however, not the case for the controls which are supported on the host but are missing in enlightened VMCS and when eVMCS is in use. It would certainly be possible to add these missing checks to nested_check_vm_execution_controls()/_vm_exit_controls()/.. but it seems preferable to keep eVMCS-specific stuff in eVMCS and reduce the impact on non-eVMCS guests by doing less unrelated checks. Create a separate nested_evmcs_check_controls() for this purpose. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
With fine grained VMX feature enablement QEMU>=4.2 tries to do KVM_SET_MSRS with default (matching CPU model) values and in case eVMCS is also enabled, fails. It would be possible to drop VMX feature filtering completely and make this a guest's responsibility: if it decides to use eVMCS it should know which fields are available and which are not. Hyper-V mostly complies to this, however, there are some problematic controls: SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES VM_{ENTRY,EXIT}_LOAD_IA32_PERF_GLOBAL_CTRL which Hyper-V enables. As there are no corresponding fields in eVMCS, we can't handle this properly in KVM. This is a Hyper-V issue. Move VMX controls sanitization from nested_enable_evmcs() to vmx_get_msr(), and do the bare minimum (only clear controls which are known to cause issues). This allows userspace to keep setting controls it wants and at the same time hides them from the guest. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Ben Gardon 提交于
Separate the functions for generating MMIO page table entries from the function that inserts them into the paging structure. This refactoring will facilitate changes to the MMU sychronization model to use atomic compare / exchanges (which are not guaranteed to succeed) instead of a monolithic MMU lock. No functional change expected. Tested by running kvm-unit-tests on an Intel Haswell machine. This commit introduced no new failures. Signed-off-by: NBen Gardon <bgardon@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NPeter Shier <pshier@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Ben Gardon 提交于
There are several functions which pass an access permission mask for SPTEs as an unsigned. This works, but checkpatch complains about it. Switch the occurrences of unsigned to unsigned int to satisfy checkpatch. No functional change expected. Tested by running kvm-unit-tests on an Intel Haswell machine. This commit introduced no new failures. Signed-off-by: NBen Gardon <bgardon@google.com> Reviewed-by: NOliver Upton <oupton@google.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
The blurb pertaining to the return value of nested_vmx_load_cr3() no longer matches reality, remove it entirely as the behavior it is attempting to document is quite obvious when reading the actual code. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
kvm_setup_pv_tlb_flush will waste memory and print a misguiding message when KVM paravirtualization is not available. Intel SDM says that the when cpuid is used with EAX higher than the maximum supported value for basic of extended function, the data for the highest supported basic function will be returned. So, in some systems, kvm_arch_para_features will return bogus data, causing kvm_setup_pv_tlb_flush to detect support for pv tlb flush. Testing for kvm_para_available will work as it checks for the hypervisor signature. Besides, when the "nopv" command line parameter is used, it should not continue as well, as kvm_guest_init will no be called in that case. Signed-off-by: NThadeu Lima de Souza Cascardo <cascardo@canonical.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build warning on i386 due to right-shifting a 32-bit value by 32 when checking for bits being set in dr7[63:32]. Alternatively, the warning could be resolved by rewriting the check to use an i386-friendly method, but taking a u64 fixes another oddity on 32-bit KVM. Beause KVM implements natural width VMCS fields as u64s to avoid layout issues between 32-bit and 64-bit, a devious guest can stuff vmcs12->guest_dr7 with a 64-bit value even when both the guest and host are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely on that behavior for correctness. Cc: Jim Mattson <jmattson@google.com> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests") Reported-by: NRandy Dunlap <rdunlap@infradead.org> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Commit 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") changed kvmclock to use tkr_raw instead of tkr_mono. However, the default kvmclock_offset for the VM was still based on the monotonic clock and, if the raw clock drifted enough from the monotonic clock, this could cause a negative system_time to be written to the guest's struct pvclock. RHEL5 does not like it and (if it boots fast enough to observe a negative time value) it hangs. There is another thing to be careful about: getboottime64 returns the host boot time with tkr_mono frequency, and subtracting the tkr_raw-based kvmclock value will cause the wallclock to be off if tkr_raw drifts from tkr_mono. To avoid this, compute the wallclock delta from the current time instead of being clever and using getboottime64. Fixes: 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") Cc: stable@vger.kernel.org Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
We will need a copy of tk->offs_boot in the next patch. Store it and cleanup the struct: instead of storing tk->tkr_xxx.base with the tk->offs_boot included, store the raw value in struct pvclock_clock and sum it in do_monotonic_raw and do_realtime. tk->tkr_xxx.xtime_nsec also moves to struct pvclock_clock. While at it, fix a (usually harmless) typo in do_monotonic_raw, which was using gtod->clock.shift instead of gtod->raw_clock.shift. Fixes: 53fafdbb ("KVM: x86: switch KVMCLOCK base to monotonic raw clock") Cc: stable@vger.kernel.org Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Miaohe Lin 提交于
The function nested_vmx_run() declaration is below its implementation. So this is meaningless and should be removed. Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
SVM is now able to disable AVIC dynamically whenever the in-kernel PIT sets up an ack notifier, so we can enable it even if in-kernel IOAPIC/PIC/PIT are in use. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
In-kernel IOAPIC does not receive EOI with AMD SVM AVIC since the processor accelerate write to APIC EOI register and does not trap if the interrupt is edge-triggered. Workaround this by lazy check for pending APIC EOI at the time when setting new IOPIC irq, and update IOAPIC EOI if no pending APIC EOI. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Refactor code for handling IOAPIC EOI for subsequent patch. There is no functional change. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
AMD SVM AVIC accelerates EOI write and does not trap. This causes in-kernel PIT re-injection mode to fail since it relies on irq-ack notifier mechanism. So, APICv is activated only when in-kernel PIT is in discard mode e.g. w/ qemu option: -global kvm-pit.lost_tick_policy=discard Also, introduce APICV_INHIBIT_REASON_PIT_REINJ bit to be used for this reason. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
AMD AVIC does not support ExtINT. Therefore, AVIC must be temporary deactivated and fall back to using legacy interrupt injection via vINTR and interrupt window. Also, introduce APICV_INHIBIT_REASON_IRQWIN to be used for this reason. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> [Rename svm_request_update_avic to svm_toggle_avic_for_extint. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Since AVIC does not currently work w/ nested virtualization, deactivate AVIC for the guest if setting CPUID Fn80000001_ECX[SVM] (i.e. indicate support for SVM, which is needed for nested virtualization). Also, introduce a new APICV_INHIBIT_REASON_NESTED bit to be used for this reason. Suggested-by: NAlexander Graf <graf@amazon.com> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Since disabling APICv has to be done for all vcpus on AMD-based system, adopt the newly introduced kvm_request_apicv_update() interface, and introduce a new APICV_INHIBIT_REASON_HYPERV. Also, remove the kvm_vcpu_deactivate_apicv() since no longer used. Cc: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Add necessary logics to support (de)activate AVIC at runtime. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
AMD SVM AVIC needs to update APIC backing page mapping before changing APICv mode. Introduce struct kvm_x86_ops.pre_update_apicv_exec_ctrl function hook to be called prior KVM APICv update request to each vcpu. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Inibit reason bits are used to determine if APICv deactivation is applicable for a particular hardware virtualization architecture. Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Suravee Suthikulpanit 提交于
Re-factor avic_init_access_page() to avic_update_access_page() since activate/deactivate AVIC requires setting/unsetting the memory region used for virtual APIC backing page (APIC_ACCESS_PAGE_PRIVATE_MEMSLOT). Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-