- 01 10月, 2018 3 次提交
-
-
由 Liran Alon 提交于
L2 IA32_BNDCFGS should be updated with vmcs12->guest_bndcfgs only when VM_ENTRY_LOAD_BNDCFGS is specified in vmcs12->vm_entry_controls. Otherwise, L2 IA32_BNDCFGS should be set to vmcs01->guest_bndcfgs which is L1 IA32_BNDCFGS. Reviewed-by: NNikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: NDarren Kenny <darren.kenny@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
Commit a87036ad ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") introduced kvm_mpx_supported() to return true iff MPX is enabled in the host. However, that commit seems to have missed replacing some calls to kvm_x86_ops->mpx_supported() to kvm_mpx_supported(). Complete original commit by replacing remaining calls to kvm_mpx_supported(). Fixes: a87036ad ("KVM: x86: disable MPX if host did not enable MPX XSAVE features") Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
Before this commit, KVM exposes MPX VMX controls to L1 guest only based on if KVM and host processor supports MPX virtualization. However, these controls should be exposed to guest only in case guest vCPU supports MPX. Without this change, a L1 guest running with kernel which don't have commit 691bd434 ("kvm: vmx: allow host to access guest MSR_IA32_BNDCFGS") asserts in QEMU on the following: qemu-kvm: error: failed to set MSR 0xd90 to 0x0 qemu-kvm: .../qemu-2.10.0/target/i386/kvm.c:1801 kvm_put_msrs: Assertion 'ret == cpu->kvm_msr_buf->nmsrs failed' This is because L1 KVM kvm_init_msr_list() will see that vmx_mpx_supported() (As it only checks MPX VMX controls support) and therefore KVM_GET_MSR_INDEX_LIST IOCTL will include MSR_IA32_BNDCFGS. However, later when L1 will attempt to set this MSR via KVM_SET_MSRS IOCTL, it will fail because !guest_cpuid_has_mpx(vcpu). Therefore, fix the issue by exposing MPX VMX controls to L1 guest only when vCPU supports MPX. Fixes: 36be0b9d ("KVM: x86: Add nested virtualization support for MPX") Reported-by: NEyal Moscovici <eyal.moscovici@oracle.com> Reviewed-by: NNikita Leshchenko <nikita.leshchenko@oracle.com> Reviewed-by: NDarren Kenny <darren.kenny@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 9月, 2018 1 次提交
-
-
由 Paolo Bonzini 提交于
KVM has an old optimization whereby accesses to the kernel GS base MSR are trapped when the guest is in 32-bit and not when it is in 64-bit mode. The idea is that swapgs is not available in 32-bit mode, thus the guest has no reason to access the MSR unless in 64-bit mode and 32-bit applications need not pay the price of switching the kernel GS base between the host and the guest values. However, this optimization adds complexity to the code for little benefit (these days most guests are going to be 64-bit anyway) and in fact broke after commit 678e315e ("KVM: vmx: add dedicated utility to access guest's kernel_gs_base", 2018-08-06); the guest kernel GS base can be corrupted across SMIs and UEFI Secure Boot is therefore broken (a secure boot Linux guest, for example, fails to reach the login prompt about half the time). This patch just removes the optimization; the kernel GS base MSR is now never trapped by KVM, similarly to the FS and GS base MSRs. Fixes: 678e315eReviewed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 21 9月, 2018 1 次提交
-
-
由 Liran Alon 提交于
The handlers of IOCTLs in kvm_arch_vcpu_ioctl() are expected to set their return value in "r" local var and break out of switch block when they encounter some error. This is because vcpu_load() is called before the switch block which have a proper cleanup of vcpu_put() afterwards. However, KVM_{GET,SET}_NESTED_STATE IOCTLs handlers just return immediately on error without performing above mentioned cleanup. Thus, change these handlers to behave as expected. Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE") Reviewed-by: NMark Kanda <mark.kanda@oracle.com> Reviewed-by: NPatrick Colp <patrick.colp@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 20 9月, 2018 15 次提交
-
-
由 Drew Schmitt 提交于
Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access to reads of MSR_PLATFORM_INFO. Disabling access to reads of this MSR gives userspace the control to "expose" this platform-dependent information to guests in a clear way. As it exists today, guests that read this MSR would get unpopulated information if userspace hadn't already set it (and prior to this patch series, only the CPUID faulting information could have been populated). This existing interface could be confusing if guests don't handle the potential for incorrect/incomplete information gracefully (e.g. zero reported for base frequency). Signed-off-by: NDrew Schmitt <dasch@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Drew Schmitt 提交于
Allow userspace to set turbo bits in MSR_PLATFORM_INFO. Previously, only the CPUID faulting bit was settable. But now any bit in MSR_PLATFORM_INFO would be settable. This can be used, for example, to convey frequency information about the platform on which the guest is running. Signed-off-by: NDrew Schmitt <dasch@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Krish Sadhukhan 提交于
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check needs to be enforced on vmentry of L2 guests: If the 'enable VPID' VM-execution control is 1, the value of the of the VPID VM-execution control field must not be 0000H. Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: NMark Kanda <mark.kanda@oracle.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Krish Sadhukhan 提交于
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check needs to be enforced on vmentry of L2 guests: - Bits 5:0 of the posted-interrupt descriptor address are all 0. - The posted-interrupt descriptor address does not set any bits beyond the processor's physical-address width. Signed-off-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: NMark Kanda <mark.kanda@oracle.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Reviewed-by: NDarren Kenny <darren.kenny@oracle.com> Reviewed-by: NKarl Heubaum <karl.heubaum@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
In case L1 do not intercept L2 HLT or enter L2 in HLT activity-state, it is possible for a vCPU to be blocked while it is in guest-mode. According to Intel SDM 26.6.5 Interrupt-Window Exiting and Virtual-Interrupt Delivery: "These events wake the logical processor if it just entered the HLT state because of a VM entry". Therefore, if L1 enters L2 in HLT activity-state and L2 has a pending deliverable interrupt in vmcs12->guest_intr_status.RVI, then the vCPU should be waken from the HLT state and injected with the interrupt. In addition, if while the vCPU is blocked (while it is in guest-mode), it receives a nested posted-interrupt, then the vCPU should also be waken and injected with the posted interrupt. To handle these cases, this patch enhances kvm_vcpu_has_events() to also check if there is a pending interrupt in L2 virtual APICv provided by L1. That is, it evaluates if there is a pending virtual interrupt for L2 by checking RVI[7:4] > VPPR[7:4] as specified in Intel SDM 29.2.1 Evaluation of Pending Interrupts. Note that this also handles the case of nested posted-interrupt by the fact RVI is updated in vmx_complete_nested_posted_interrupt() which is called from kvm_vcpu_check_block() -> kvm_arch_vcpu_runnable() -> kvm_vcpu_running() -> vmx_check_nested_events() -> vmx_complete_nested_posted_interrupt(). Reviewed-by: NNikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: NDarren Kenny <darren.kenny@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
VMX cannot be enabled under SMM, check it when CR4 is set and when nested virtualization state is restored. This should fix some WARNs reported by syzkaller, mostly around alloc_shadow_vmcs. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
The functions kvm_load_guest_fpu() kvm_put_guest_fpu() are only used locally, make them static. This requires also that both functions are moved because they are used before their implementation. Those functions were exported (via EXPORT_SYMBOL) before commit e5bb4025 ("KVM: Drop kvm_{load,put}_guest_fpu() exports"). Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
A VMX preemption timer value of '0' is guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. Use the preemption timer (if it's supported) to trigger immediate VMExit in place of the current method of sending a self-IPI. This ensures that pending VMExit injection to L1 occurs prior to executing any instructions in the guest (regardless of nesting level). When deferring VMExit injection, KVM generates an immediate VMExit from the (possibly nested) guest by sending itself an IPI. Because hardware interrupts are blocked prior to VMEnter and are unblocked (in hardware) after VMEnter, this results in taking a VMExit(INTR) before any guest instruction is executed. But, as this approach relies on the IPI being received before VMEnter executes, it only works as intended when KVM is running as L0. Because there are no architectural guarantees regarding when IPIs are delivered, when running nested the INTR may "arrive" long after L2 is running e.g. L0 KVM doesn't force an immediate switch to L1 to deliver an INTR. For the most part, this unintended delay is not an issue since the events being injected to L1 also do not have architectural guarantees regarding their timing. The notable exception is the VMX preemption timer[1], which is architecturally guaranteed to cause a VMExit prior to executing any instructions in the guest if the timer value is '0' at VMEnter. Specifically, the delay in injecting the VMExit causes the preemption timer KVM unit test to fail when run in a nested guest. Note: this approach is viable even on CPUs with a broken preemption timer, as broken in this context only means the timer counts at the wrong rate. There are no known errata affecting timer value of '0'. [1] I/O SMIs also have guarantees on when they arrive, but I have no idea if/how those are emulated in KVM. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> [Use a hook for SVM instead of leaving the default in x86.c - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Provide a singular location where the VMX preemption timer bit is set/cleared so that future usages of the preemption timer can ensure the VMCS bit is up-to-date without having to modify unrelated code paths. For example, the preemption timer can be used to force an immediate VMExit. Cache the status of the timer to avoid redundant VMREAD and VMWRITE, e.g. if the timer stays armed across multiple VMEnters/VMExits. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
A VMX preemption timer value of '0' at the time of VMEnter is architecturally guaranteed to cause a VMExit prior to the CPU executing any instructions in the guest. This architectural definition is in place to ensure that a previously expired timer is correctly recognized by the CPU as it is possible for the timer to reach zero and not trigger a VMexit due to a higher priority VMExit being signalled instead, e.g. a pending #DB that morphs into a VMExit. Whether by design or coincidence, commit f4124500 ("KVM: nVMX: Fully emulate preemption timer") special cased timer values of '0' and '1' to ensure prompt delivery of the VMExit. Unlike '0', a timer value of '1' has no has no architectural guarantees regarding when it is delivered. Modify the timer emulation to trigger immediate VMExit if and only if the timer value is '0', and document precisely why '0' is special. Do this even if calibration of the virtual TSC failed, i.e. VMExit will occur immediately regardless of the frequency of the timer. Making only '0' a special case gives KVM leeway to be more aggressive in ensuring the VMExit is injected prior to executing instructions in the nested guest, and also eliminates any ambiguity as to why '1' is a special case, e.g. why wasn't the threshold for a "short timeout" set to 10, 100, 1000, etc... Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Andy Shevchenko 提交于
Switch to bitmap_zalloc() to show clearly what we are allocating. Besides that it returns pointer of bitmap type instead of opaque void *. Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Tianyu Lan 提交于
kvm_commit_zap_page() has been renamed to kvm_mmu_commit_zap_page() This patch is to fix the commit. Signed-off-by: NLan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wei Yang 提交于
Here is the code path which shows kvm_mmu_setup() is invoked after kvm_mmu_create(). Since kvm_mmu_setup() is only invoked in this code path, this means the root_hpa and prev_roots are guaranteed to be invalid. And it is not necessary to reset it again. kvm_vm_ioctl_create_vcpu() kvm_arch_vcpu_create() vmx_create_vcpu() kvm_vcpu_init() kvm_arch_vcpu_init() kvm_mmu_create() kvm_arch_vcpu_setup() kvm_mmu_setup() kvm_init_mmu() This patch set reset_roots to false in kmv_mmu_setup(). Fixes: 50c28f21Signed-off-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Junaid Shahid 提交于
kvm should not attempt to read guest PDPTEs when CR0.PG = 0 and CR4.PAE = 1. Signed-off-by: NJunaid Shahid <junaids@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
When VMX is used with flexpriority disabled (because of no support or if disabled with module parameter) MMIO interface to lAPIC is still available in x2APIC mode while it shouldn't be (kvm-unit-tests): PASS: apic_disable: Local apic enabled in x2APIC mode PASS: apic_disable: CPUID.1H:EDX.APIC[bit 9] is set FAIL: apic_disable: *0xfee00030: 50014 The issue appears because we basically do nothing while switching to x2APIC mode when APIC access page is not used. apic_mmio_{read,write} only check if lAPIC is disabled before proceeding to actual write. When APIC access is virtualized we correctly manipulate with VMX controls in vmx_set_virtual_apic_mode() and we don't get vmexits from memory writes in x2APIC mode so there's no issue. Disabling MMIO interface seems to be easy. The question is: what do we do with these reads and writes? If we add apic_x2apic_mode() check to apic_mmio_in_range() and return -EOPNOTSUPP these reads and writes will go to userspace. When lAPIC is in kernel, Qemu uses this interface to inject MSIs only (see kvm_apic_mem_write() in hw/i386/kvm/apic.c). This somehow works with disabled lAPIC but when we're in xAPIC mode we will get a real injected MSI from every write to lAPIC. Not good. The simplest solution seems to be to just ignore writes to the region and return ~0 for all reads when we're in x2APIC mode. This is what this patch does. However, this approach is inconsistent with what currently happens when flexpriority is enabled: we allocate APIC access page and create KVM memory region so in x2APIC modes all reads and writes go to this pre-allocated page which is, btw, the same for all vCPUs. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 9月, 2018 2 次提交
-
-
由 Wanpeng Li 提交于
Dan Carpenter reported that the untrusted data returns from kvm_register_read() results in the following static checker warning: arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi() error: buffer underflow 'map->phys_map' 's32min-s32max' KVM guest can easily trigger this by executing the following assembly sequence in Ring0: mov $10, %rax mov $0xFFFFFFFF, %rbx mov $0xFFFFFFFF, %rdx mov $0, %rsi vmcall As this will cause KVM to execute the following code-path: vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi() which will reach out-of-bounds access. This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id, ignoring destinations that are not present and delivering the rest. We also check whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID. Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> [Add second "if (min > map->max_apic_id)" to complete the fix. -Radim] Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Liran Alon 提交于
Consider the case L1 had a IRQ/NMI event until it executed VMLAUNCH/VMRESUME which wasn't delivered because it was disallowed (e.g. interrupts disabled). When L1 executes VMLAUNCH/VMRESUME, L0 needs to evaluate if this pending event should cause an exit from L2 to L1 or delivered directly to L2 (e.g. In case L1 don't intercept EXTERNAL_INTERRUPT). Usually this would be handled by L0 requesting a IRQ/NMI window by setting VMCS accordingly. However, this setting was done on VMCS01 and now VMCS02 is active instead. Thus, when L1 executes VMLAUNCH/VMRESUME we force L0 to perform pending event evaluation by requesting a KVM_REQ_EVENT. Note that above scenario exists when L1 KVM is about to enter L2 but requests an "immediate-exit". As in this case, L1 will disable-interrupts and then send a self-IPI before entering L2. Reviewed-by: NNikita Leshchenko <nikita.leshchenko@oracle.com> Co-developed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
- 07 9月, 2018 1 次提交
-
-
由 Marc Zyngier 提交于
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to deal with. Drop the now obsolete code. Fixes: fb1522e0 ("KVM: update to new mmu_notifier semantic v2") Cc: James Hogan <jhogan@kernel.org> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com> Signed-off-by: NChristoffer Dall <christoffer.dall@arm.com>
-
- 30 8月, 2018 10 次提交
-
-
由 Sean Christopherson 提交于
Allowing x86_emulate_instruction() to be called directly has led to subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE in the emulation type. While most of the blame lies on re-execute being opt-out, exporting x86_emulate_instruction() also exposes its cr2 parameter, which may have contributed to commit d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested") using x86_emulate_instruction() instead of emulate_instruction() because "hey, I have a cr2!", which in turn introduced its EMULTYPE_NO_REEXECUTE bug. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM specific function, and there's no conflict that prevents adding the kvm_ prefix. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Commit a6f177ef ("KVM: Reenter guest after emulation failure if due to access to non-mmio address") added reexecute_instruction() to handle the scenario where two (or more) vCPUS race to write a shadowed page, i.e. reexecute_instruction() is intended to return true if and only if the instruction being emulated was accessing a shadowed page. As L0 is only explicitly shadowing L1 tables, an emulation failure of a nested VM instruction cannot be due to a race to write a shadowed page and so should never be re-executed. This fixes an issue where an "MMIO" emulation failure[1] in L2 is all but guaranteed to result in an infinite loop when TDP is enabled. Because "cr2" is actually an L2 GPA when TDP is enabled, calling kvm_mmu_gva_to_gpa_write() to translate cr2 in the non-direct mapped case (L2 is never direct mapped) will almost always yield UNMAPPED_GVA and cause reexecute_instruction() to immediately return true. The !mmio_info_in_cache() check in kvm_mmu_page_fault() doesn't catch this case because mmio_info_in_cache() returns false for a nested MMU (the MMIO caching currently handles L1 only, e.g. to cache nested guests' GPAs we'd have to manually flush the cache when switching between VMs and when L1 updated its page tables controlling the nested guest). Way back when, commit 68be0803 ("KVM: x86: never re-execute instruction with enabled tdp") changed reexecute_instruction() to always return false when using TDP under the assumption that KVM would only get into the emulator for MMIO. Commit 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp") effectively reverted that behavior in order to handle the scenario where emulation failed due to an access from L1 to the shadow page tables for L2, but it didn't account for the case where emulation failed in L2 with TDP enabled. All of the above logic also applies to retry_instruction(), added by commit 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions"). An indefinite loop in retry_instruction() should be impossible as it protects against retrying the same instruction over and over, but it's still correct to not retry an L2 instruction in the first place. Fix the immediate issue by adding a check for a nested guest when determining whether or not to allow retry in kvm_mmu_page_fault(). In addition to fixing the immediate bug, add WARN_ON_ONCE in the retry functions since they are not designed to handle nested cases, i.e. they need to be modified even if there is some scenario in the future where we want to allow retrying a nested guest. [1] This issue was encountered after commit 3a2936de ("kvm: mmu: Don't expose private memslots to L2") changed the page fault path to return KVM_PFN_NOSLOT when translating an L2 access to a prive memslot. Returning KVM_PFN_NOSLOT is semantically correct when we want to hide a memslot from L2, i.e. there effectively is no defined memory region for L2, but it has the unfortunate side effect of making KVM think the GFN is a MMIO page, thus triggering emulation. The failure occurred with in-development code that deliberately exposed a private memslot to L2, which L2 accessed with an instruction that is not emulated by KVM. Fixes: 95b3cf69 ("KVM: x86: let reexecute_instruction work for tdp") Fixes: 1cb3f3ae ("KVM: x86: retry non-page-table writing instructions") Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: Jim Mattson <jmattson@google.com> Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com> Cc: Xiao Guangrong <xiaoguangrong@tencent.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Effectively force kvm_mmu_page_fault() to opt-in to allowing retry to make it more obvious when and why it allows emulation to be retried. Previously this approach was less convenient due to retry and re-execute behavior being controlled by separate flags that were also inverted in their implementations (opt-in versus opt-out). Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
retry_instruction() and reexecute_instruction() are a package deal, i.e. there is no scenario where one is allowed and the other is not. Merge their controlling emulation type flags to enforce this in code. Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly clear that we are allowing re{try,execute} to occur, as opposed to explicitly requesting retry of a previously failed instruction. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Re-execution of an instruction after emulation decode failure is intended to be used only when emulating shadow page accesses. Invert the flag to make allowing re-execution opt-in since that behavior is by far in the minority. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Re-execution after an emulation decode failure is only intended to handle a case where two or vCPUs race to write a shadowed page, i.e. we should never re-execute an instruction as part of RSM emulation. Add a new helper, kvm_emulate_instruction_from_buffer(), to support emulating from a pre-defined buffer. This eliminates the last direct call to x86_emulate_instruction() outside of kvm_mmu_page_fault(), which means x86_emulate_instruction() can be unexported in a future patch. Fixes: 7607b717 ("KVM: SVM: install RSM intercept") Cc: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Sean Christopherson 提交于
Re-execution after an emulation decode failure is only intended to handle a case where two or vCPUs race to write a shadowed page, i.e. we should never re-execute an instruction as part of MMIO emulation. As handle_ept_misconfig() is only used for MMIO emulation, it should pass EMULTYPE_NO_REEXECUTE when using the emulator to skip an instr in the fast-MMIO case where VM_EXIT_INSTRUCTION_LEN is invalid. And because the cr2 value passed to x86_emulate_instruction() is only destined for use when retrying or reexecuting, we can simply call emulate_instruction(). Fixes: d391f120 ("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Colin Ian King 提交于
Variable dst_vaddr_end is being assigned but is never used hence it is redundant and can be removed. Cleans up clang warning: variable 'dst_vaddr_end' set but not used [-Wunused-but-set-variable] Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
由 Vitaly Kuznetsov 提交于
nested_run_pending is set 20 lines above and check_vmentry_prereqs()/ check_vmentry_postreqs() don't seem to be resetting it (the later, however, checks it). Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NJim Mattson <jmattson@google.com> Reviewed-by: NEduardo Valentin <eduval@amazon.com> Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com>
-
- 23 8月, 2018 1 次提交
-
-
由 Michal Hocko 提交于
There are several blockable mmu notifiers which might sleep in mmu_notifier_invalidate_range_start and that is a problem for the oom_reaper because it needs to guarantee a forward progress so it cannot depend on any sleepable locks. Currently we simply back off and mark an oom victim with blockable mmu notifiers as done after a short sleep. That can result in selecting a new oom victim prematurely because the previous one still hasn't torn its memory down yet. We can do much better though. Even if mmu notifiers use sleepable locks there is no reason to automatically assume those locks are held. Moreover majority of notifiers only care about a portion of the address space and there is absolutely zero reason to fail when we are unmapping an unrelated range. Many notifiers do really block and wait for HW which is harder to handle and we have to bail out though. This patch handles the low hanging fruit. __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks are not allowed to sleep if the flag is set to false. This is achieved by using trylock instead of the sleepable lock for most callbacks and continue as long as we do not block down the call chain. I think we can improve that even further because there is a common pattern to do a range lookup first and then do something about that. The first part can be done without a sleeping lock in most cases AFAICS. The oom_reaper end then simply retries if there is at least one notifier which couldn't make any progress in !blockable mode. A retry loop is already implemented to wait for the mmap_sem and this is basically the same thing. The simplest way for driver developers to test this code path is to wrap userspace code which uses these notifiers into a memcg and set the hard limit to hit the oom. This can be done e.g. after the test faults in all the mmu notifier managed memory and set the hard limit to something really small. Then we are looking for a proper process tear down. [akpm@linux-foundation.org: coding style fixes] [akpm@linux-foundation.org: minor code simplification] Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp Reported-by: NDavid Rientjes <rientjes@google.com> Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Alex Deucher <alexander.deucher@amd.com> Cc: David Airlie <airlied@linux.ie> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Sudeep Dutt <sudeep.dutt@intel.com> Cc: Ashutosh Dixit <ashutosh.dixit@intel.com> Cc: Dimitri Sivanich <sivanich@sgi.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Jérôme Glisse" <jglisse@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Felix Kuehling <felix.kuehling@amd.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 8月, 2018 5 次提交
-
-
由 Paolo Bonzini 提交于
Two bug fixes: 1) missing entries in the l1d_param array; this can cause a host crash if an access attempts to reach the missing entry. Future-proof the get function against any overflows as well. However, the two entries VMENTER_L1D_FLUSH_EPT_DISABLED and VMENTER_L1D_FLUSH_NOT_REQUIRED must not be accepted by the parse function, so disable them there. 2) invalid values must be rejected even if the CPU does not have the bug, so test for them before checking boot_cpu_has(X86_BUG_L1TF) ... and a small refactoring, since the .cmd field is redundant with the index in the array. Reported-by: NBandan Das <bsd@redhat.com> Cc: stable@vger.kernel.org Fixes: a7b9020bSigned-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Thomas Gleixner 提交于
Mikhail reported the following lockdep splat: WARNING: possible irq lock inversion dependency detected CPU 0/KVM/10284 just changed the state of lock: 000000000d538a88 (&st->lock){+...}, at: speculative_store_bypass_update+0x10b/0x170 but this lock was taken by another, HARDIRQ-safe lock in the past: (&(&sighand->siglock)->rlock){-.-.} and interrupts could create inverse lock ordering between them. Possible interrupt unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&st->lock); local_irq_disable(); lock(&(&sighand->siglock)->rlock); lock(&st->lock); <Interrupt> lock(&(&sighand->siglock)->rlock); *** DEADLOCK *** The code path which connects those locks is: speculative_store_bypass_update() ssb_prctl_set() do_seccomp() do_syscall_64() In svm_vcpu_run() speculative_store_bypass_update() is called with interupts enabled via x86_virt_spec_ctrl_set_guest/host(). This is actually a false positive, because GIF=0 so interrupts are disabled even if IF=1; however, we can easily move the invocations of x86_virt_spec_ctrl_set_guest/host() into the interrupt disabled region to cure it, and it's a good idea to keep the GIF=0/IF=1 area as small and self-contained as possible. Fixes: 1f50ddb4 ("x86/speculation: Handle HT correctly on AMD") Reported-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Borislav Petkov <bp@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: kvm@vger.kernel.org Cc: x86@kernel.org Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Virtualization of Intel SGX depends on Enclave Page Cache (EPC) management that is not yet available in the kernel, i.e. KVM support for exposing SGX to a guest cannot be added until basic support for SGX is upstreamed, which is a WIP[1]. Until SGX is properly supported in KVM, ensure a guest sees expected behavior for ENCLS, i.e. all ENCLS #UD. Because SGX does not have a true software enable bit, e.g. there is no CR4.SGXE bit, the ENCLS instruction can be executed[1] by the guest if SGX is supported by the system. Intercept all ENCLS leafs (via the ENCLS- exiting control and field) and unconditionally inject #UD. [1] https://www.spinics.net/lists/kvm/msg171333.html or https://lkml.org/lkml/2018/7/3/879 [2] A guest can execute ENCLS in the sense that ENCLS will not take an immediate #UD, but no ENCLS will ever succeed in a guest without explicit support from KVM (map EPC memory into the guest), unless KVM has a *very* egregious bug, e.g. accidentally mapped EPC memory into the guest SPTEs. In other words this patch is needed only to prevent the guest from seeing inconsistent behavior, e.g. #GP (SGX not enabled in Feature Control MSR) or #PF (leaf operand(s) does not point at EPC memory) instead of #UD on ENCLS. Intercepting ENCLS is not required to prevent the guest from truly utilizing SGX. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20180814163334.25724-3-sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Yi Wang 提交于
Substitute spaces with tab. No functional changes. Signed-off-by: NYi Wang <wang.yi59@zte.com.cn> Reviewed-by: NJiang Biao <jiang.biao2@zte.com.cn> Message-Id: <1534398159-48509-1-git-send-email-wang.yi59@zte.com.cn> Cc: stable@vger.kernel.org # L1TF Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Arnd Bergmann 提交于
Removing one of the two accesses of the maxphyaddr variable led to a harmless warning: arch/x86/kvm/x86.c: In function 'kvm_set_mmio_spte_mask': arch/x86/kvm/x86.c:6563:6: error: unused variable 'maxphyaddr' [-Werror=unused-variable] Removing the #ifdef seems to be the nicest workaround, as it makes the code look cleaner than adding another #ifdef. Fixes: 28a1f3ac ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs") Signed-off-by: NArnd Bergmann <arnd@arndb.de> Cc: stable@vger.kernel.org # L1TF Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 21 8月, 2018 1 次提交
-
-
由 Josh Poimboeuf 提交于
These are already defined higher up in the file. Fixes: 7db92e16 ("x86/kvm: Move l1tf setup function") Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/d7ca03ae210d07173452aeed85ffe344301219a5.1534253536.git.jpoimboe@redhat.com
-