- 05 6月, 2019 2 次提交
-
-
由 Wanpeng Li 提交于
wait_lapic_expire() call was moved above guest_enter_irqoff() because of its tracepoint, which violated the RCU extended quiescent state invoked by guest_enter_irqoff()[1][2]. This patch simply moves the tracepoint below guest_exit_irqoff() in vcpu_enter_guest(). Snapshot the delta before VM-Enter, but trace it after VM-Exit. This can help us to move wait_lapic_expire() just before vmentry in the later patch. [1] Commit 8b89fe1f ("kvm: x86: move tracepoints outside extended quiescent state") [2] https://patchwork.kernel.org/patch/7821111/ Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Liran Alon <liran.alon@oracle.com> Suggested-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> [Track whether wait_lapic_expire was called, and do not invoke the tracepoint if not. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Kai Huang 提交于
As a prerequisite to fix several SPTE reserved bits related calculation errors caused by MKTME, which requires kvm_set_mmio_spte_mask() to use local static variable defined in mmu.c. Also move call site of kvm_set_mmio_spte_mask() from kvm_arch_init() to kvm_mmu_module_init() so that kvm_set_mmio_spte_mask() can be static. Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NKai Huang <kai.huang@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 28 5月, 2019 1 次提交
-
-
由 Thomas Huth 提交于
KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it is depending on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hard- ware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too. Reviewed-by: NAndrew Jones <drjones@redhat.com> Reviewed-by: NCornelia Huck <cohuck@redhat.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NThomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NChristian Borntraeger <borntraeger@de.ibm.com>
-
- 25 5月, 2019 2 次提交
-
-
由 Paolo Bonzini 提交于
Commit 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes", 2019-04-02) introduced a "return false" in a function returning int, and anyway set_efer has a "nonzero on error" conventon so it should be returning 1. Reported-by: NPavel Machek <pavel@denx.de> Fixes: 11988499 ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes") Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wanpeng Li 提交于
After commit c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of timer advancement), '-1' enables adaptive tuning starting from default advancment of 1000ns. However, we should expose an int instead of an overflow uint module parameter. Before patch: /sys/module/kvm/parameters/lapic_timer_advance_ns:4294967295 After patch: /sys/module/kvm/parameters/lapic_timer_advance_ns:-1 Fixes: c3941d9e (KVM: lapic: Allow user to disable adaptive tuning of timer advancement) Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NWanpeng Li <wanpengli@tencent.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 10 5月, 2019 1 次提交
-
-
由 Viresh Kumar 提交于
Currently, the notifiers are called once for each CPU of the policy->cpus cpumask. It would be more optimal if the notifier can be called only once and all the relevant information be provided to it. Out of the 23 drivers that register for the transition notifiers today, only 4 of them do per-cpu updates and the callback for the rest can be called only once for the policy without any impact. This would also avoid multiple function calls to the notifier callbacks and reduce multiple iterations of notifier core's code (which does locking as well). This patch adds pointer to the cpufreq policy to the struct cpufreq_freqs, so the notifier callback has all the information available to it with a single call. The five drivers which perform per-cpu updates are updated to use the cpufreq policy. The freqs->cpu field is redundant now and is removed. Acked-by: David S. Miller <davem@davemloft.net> (sparc) Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
- 01 5月, 2019 8 次提交
-
-
由 Paolo Bonzini 提交于
Use specific inline functions for RIP and RSP instead of going through kvm_register_read and kvm_register_write, which are quite a mouthful. kvm_rsp_read and kvm_rsp_write did not exist, so add them. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always treated "available and dirtly" on both VMX and SVM, i.e. are unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit. Eliminating the unnecessary caching code reduces the size of KVM by a non-trivial amount, much of which comes from the most common code paths. E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM dropping by ~1000 bytes. With CONFIG_RETPOLINE=y, the numbers are even more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 KarimAllah Ahmed 提交于
Use kvm_vcpu_map in emulator_cmpxchg_emulated since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: NKarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: NKonrad Rzeszutek Wilk <kjonrad.wilk@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Borislav Petkov 提交于
The hardware configuration register has some useful bits which can be used by guests. Implement McStatusWrEn which can be used by guests when injecting MCEs with the in-kernel mce-inject module. For that, we need to set bit 18 - McStatusWrEn - first, before writing the MCi_STATUS registers (otherwise we #GP). Add the required machinery to do so. Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Jim Mattson <jmattson@google.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: KVM <kvm@vger.kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
Since commits 668fffa3 ("kvm: better MWAIT emulation for guestsâ€) and 4d5422ce ("KVM: X86: Provide a capability to disable MWAIT interceptsâ€), KVM was modified to allow an admin to configure certain guests to execute MONITOR/MWAIT inside guest without being intercepted by host. This is useful in case admin wishes to allocate a dedicated logical processor for each vCPU thread. Thus, making it safe for guest to completely control the power-state of the logical processor. The ability to use this new KVM capability was introduced to QEMU by commits 6f131f13e68d ("kvm: support -overcommit cpu-pm=on|offâ€) and 2266d4431132 ("i386/cpu: make -cpu host support monitor/mwaitâ€). However, exposing MONITOR/MWAIT to a Linux guest may cause it's intel_idle kernel module to execute c1e_promotion_disable() which will attempt to RDMSR/WRMSR from/to MSR_IA32_POWER_CTL to manipulate the "C1E Enable" bit. This behaviour was introduced by commit 32e95180 ("intel_idle: export both C1 and C1Eâ€). Becuase KVM doesn't emulate this MSR, running KVM with ignore_msrs=0 will cause the above guest behaviour to raise a #GP which will cause guest to kernel panic. Therefore, add support for nop emulation of MSR_IA32_POWER_CTL to avoid #GP in guest in this scenario. Future commits can optimise emulation further by reflecting guest MSR changes to host MSR to provide guest with the ability to fine-tune the dedicated logical processor power-state. Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Luwei Kang 提交于
Inject a PMI for KVM guest when Intel PT working in Host-Guest mode and Guest ToPA entry memory buffer was completely filled. Signed-off-by: NLuwei Kang <luwei.kang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
...to avoid dereferencing a null pointer when querying the per-vCPU timer advance. Fixes: 39497d76 ("KVM: lapic: Track lapic timer advance per vCPU") Reported-by: syzbot+f7e65445a40d3e0e4ebf@syzkaller.appspotmail.com Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: NKonrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
KVM's recent bug fix to update %rip after emulating I/O broke userspace that relied on the previous behavior of incrementing %rip prior to exiting to userspace. When running a Windows XP guest on AMD hardware, Qemu may patch "OUT 0x7E" instructions in reaction to the OUT itself. Because KVM's old behavior was to increment %rip before exiting to userspace to handle the I/O, Qemu manually adjusted %rip to account for the OUT instruction. Arguably this is a userspace bug as KVM requires userspace to re-enter the kernel to complete instruction emulation before taking any other actions. That being said, this is a bit of a grey area and breaking userspace that has worked for many years is bad. Pre-increment %rip on OUT to port 0x7e before exiting to userspace to hack around the issue. Fixes: 45def77e ("KVM: x86: update %rip after emulating IO") Reported-by: NSimon Becherer <simon@becherer.de> Reported-and-tested-by: NIakov Karpov <srid@rkmail.ru> Reported-by: NGabriele Balducci <balducci@units.it> Reported-by: NAntti Antinoja <reader@fennosys.fi> Cc: stable@vger.kernel.org Cc: Takashi Iwai <tiwai@suse.com> Cc: Jiri Slaby <jslaby@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 19 4月, 2019 2 次提交
-
-
由 Sean Christopherson 提交于
The introduction of adaptive tuning of lapic timer advancement did not allow for the scenario where userspace would want to disable adaptive tuning but still employ timer advancement, e.g. for testing purposes or to handle a use case where adaptive tuning is unable to settle on a suitable time. This is epecially pertinent now that KVM places a hard threshold on the maximum advancment time. Rework the timer semantics to accept signed values, with a value of '-1' being interpreted as "use adaptive tuning with KVM's internal default", and any other value being used as an explicit advancement time, e.g. a time of '0' effectively disables advancement. Note, this does not completely restore the original behavior of lapic_timer_advance_ns. Prior to tracking the advancement per vCPU, which is necessary to support autotuning, userspace could adjust lapic_timer_advance_ns for *running* vCPU. With per-vCPU tracking, the module params are snapshotted at vCPU creation, i.e. applying a new advancement effectively requires restarting a VM. Dynamically updating a running vCPU is possible, e.g. a helper could be added to retrieve the desired delay, choosing between the global module param and the per-VCPU value depending on whether or not auto-tuning is (globally) enabled, but introduces a great deal of complexity. The wrapper itself is not complex, but understanding and documenting the effects of dynamically toggling auto-tuning and/or adjusting the timer advancement is nigh impossible since the behavior would be dependent on KVM's implementation as well as compiler optimizations. In other words, providing stable behavior would require extremely careful consideration now and in the future. Given that the expected use of a manually-tuned timer advancement is to "tune once, run many", use the vastly simpler approach of recognizing changes to the module params only when creating a new vCPU. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Automatically adjusting the globally-shared timer advancement could corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting the advancement value. That could be partially fixed by using a local variable for the arithmetic, but it would still be susceptible to a race when setting timer_advance_adjust_done. And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the correct calibration for a given vCPU may not apply to all vCPUs. Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is effectively violated when finding a stable advancement takes an extended amount of timer. Opportunistically change the definition of lapic_timer_advance_ns to a u32 so that it matches the style of struct kvm_timer. Explicitly pass the param to kvm_create_lapic() so that it doesn't have to be exposed to lapic.c, thus reducing the probability of unintentionally using the global value instead of the per-vCPU value. Cc: Liran Alon <liran.alon@oracle.com> Cc: Wanpeng Li <wanpengli@tencent.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Fixes: 3b8a5df6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically") Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 16 4月, 2019 9 次提交
-
-
由 Paolo Bonzini 提交于
All architectures except MIPS were defining it in the same way, and memory slots are handled entirely by common code so there is no point in keeping the definition per-architecture. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
EFER.LME and EFER.NX are considered reserved if their respective feature bits are not advertised to the guest. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
KVM allows userspace to violate consistency checks related to the guest's CPUID model to some degree. Generally speaking, userspace has carte blanche when it comes to guest state so long as jamming invalid state won't negatively affect the host. Currently this is seems to be a non-issue as most of the interesting EFER checks are missing, e.g. NX and LME, but those will be added shortly. Proactively exempt userspace from the CPUID checks so as not to break userspace. Note, the efer_reserved_bits check still applies to userspace writes as that mask reflects the host's capabilities, e.g. KVM shouldn't allow a guest to run with NX=1 if it has been disabled in the host. Fixes: d8017474 ("KVM: SVM: Only allow setting of EFER_SVME when CPUID SVM is set") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Hariprasad Kelam 提交于
Changed passing argument as "0 to NULL" which resolves below sparse warning arch/x86/kvm/x86.c:3096:61: warning: Using plain integer as NULL pointer Signed-off-by: NHariprasad Kelam <hariprasad.kelam@gmail.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Invoking the 64-bit variation on a 32-bit kenrel will crash the guest, trigger a WARN, and/or lead to a buffer overrun in the host, e.g. rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64. KVM allows userspace to report long mode support via CPUID, even though the guest is all but guaranteed to crash if it actually tries to enable long mode. But, a pure 32-bit guest that is ignorant of long mode will happily plod along. SMM complicates things as 64-bit CPUs use a different SMRAM save state area. KVM handles this correctly for 64-bit kernels, e.g. uses the legacy save state map if userspace has hid long mode from the guest, but doesn't fare well when userspace reports long mode support on a 32-bit host kernel (32-bit KVM doesn't support 64-bit guests). Since the alternative is to crash the guest, e.g. by not loading state or explicitly requesting shutdown, unconditionally use the legacy SMRAM save state map for 32-bit KVM. If a guest has managed to get far enough to handle SMIs when running under a weird/buggy userspace hypervisor, then don't deliberately crash the guest since there are no downsides (from KVM's perspective) to allow it to continue running. Fixes: 660a5d51 ("KVM: x86: save/load state on SMM switch") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM save state map, i.e. kvm_smm_changed() needs to be called after state has been loaded and so cannot be done automatically when setting hflags from RSM. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
RSM emulation is currently broken on VMX when the interrupted guest has CR4.VMXE=1. Rather than dance around the issue of HF_SMM_MASK being set when loading SMSTATE into architectural state, ideally RSM emulation itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM architectural state. Ostensibly, the only motivation for having HF_SMM_MASK set throughout the loading of state from the SMRAM save state area is so that the memory accesses from GET_SMSTATE() are tagged with role.smm. Load all of the SMRAM save state area from guest memory at the beginning of RSM emulation, and load state from the buffer instead of reading guest memory one-by-one. This paves the way for clearing HF_SMM_MASK prior to loading state, and also aligns RSM with the enter_smm() behavior, which fills a buffer and writes SMRAM save state in a single go. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 WANG Chao 提交于
guest xcr0 could leak into host when MCE happens in guest mode. Because do_machine_check() could schedule out at a few places. For example: kvm_load_guest_xcr0 ... kvm_x86_ops->run(vcpu) { vmx_vcpu_run vmx_complete_atomic_exit kvm_machine_check do_machine_check do_memory_failure memory_failure lock_page In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule out, host cpu has guest xcr0 loaded (0xff). In __switch_to { switch_fpu_finish copy_kernel_to_fpregs XRSTORS If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in and tries to reinitialize fpu by restoring init fpu state. Same story as last #GP, except we get DOUBLE FAULT this time. Cc: stable@vger.kernel.org Signed-off-by: NWANG Chao <chao.wang@ucloud.cn> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Ben Gardon 提交于
KVM bases its memory usage limits on the total number of guest pages across all memslots. However, those limits, and the calculations to produce them, use 32 bit unsigned integers. This can result in overflow if a VM has more guest pages that can be represented by a u32. As a result of this overflow, KVM can use a low limit on the number of MMU pages it will allocate. This makes KVM unable to map all of guest memory at once, prompting spurious faults. Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch introduced no new failures. Signed-off-by: NBen Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 13 4月, 2019 1 次提交
-
-
由 Rik van Riel 提交于
Defer loading of FPU state until return to userspace. This gives the kernel the potential to skip loading FPU state for tasks that stay in kernel mode, or for tasks that end up with repeated invocations of kernel_fpu_begin() & kernel_fpu_end(). The fpregs_lock/unlock() section ensures that the registers remain unchanged. Otherwise a context switch or a bottom half could save the registers to its FPU context and the processor's FPU registers would became random if modified at the same time. KVM swaps the host/guest registers on entry/exit path. This flow has been kept as is. First it ensures that the registers are loaded and then saves the current (host) state before it loads the guest's registers. The swap is done at the very end with disabled interrupts so it should not change anymore before theg guest is entered. The read/save version seems to be cheaper compared to memcpy() in a micro benchmark. Each thread gets TIF_NEED_FPU_LOAD set as part of fork() / fpu__copy(). For kernel threads, this flag gets never cleared which avoids saving / restoring the FPU state for kernel threads and during in-kernel usage of the FPU registers. [ bp: Correct and update commit message and fix checkpatch warnings. s/register/registers/ where it is used in plural. minor comment corrections. remove unused trace_x86_fpu_activate_state() TP. ] Signed-off-by: NRik van Riel <riel@surriel.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NDave Hansen <dave.hansen@intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Babu Moger <Babu.Moger@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: Dmitry Safonov <dima@arista.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Nicolai Stange <nstange@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Waiman Long <longman@redhat.com> Cc: x86-ml <x86@kernel.org> Cc: Yi Wang <wang.yi59@zte.com.cn> Link: https://lkml.kernel.org/r/20190403164156.19645-24-bigeasy@linutronix.de
-
- 11 4月, 2019 1 次提交
-
-
After changing the argument of __raw_xsave_addr() from a mask to number Dave suggested to check if it makes sense to do the same for get_xsave_addr(). As it turns out it does. Only get_xsave_addr() needs the mask to check if the requested feature is part of what is supported/saved and then uses the number again. The shift operation is cheaper compared to fls64() (find last bit set). Also, the feature number uses less opcode space compared to the mask. :) Make the get_xsave_addr() argument a xfeature number instead of a mask and fix up its callers. Furthermore, use xfeature_nr and xfeature_mask consistently. This results in the following changes to the kvm code: feature -> xfeature_mask index -> xfeature_nr Suggested-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NDave Hansen <dave.hansen@intel.com> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: "Jason A. Donenfeld" <Jason@zx2c4.com> Cc: kvm ML <kvm@vger.kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Cc: Rik van Riel <riel@surriel.com> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Siarhei Liakh <Siarhei.Liakh@concurrent-rt.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20190403164156.19645-12-bigeasy@linutronix.de
-
- 29 3月, 2019 4 次提交
-
-
由 Sean Christopherson 提交于
Most (all?) x86 platforms provide a port IO based reset mechanism, e.g. OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM that it is doing a reset, e.g. Qemu jams vCPU state and resumes running. To avoid corruping %rip after such a reset, commit 0967b7bf ("KVM: Skip pio instruction when it is emulated, not executed") changed the behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the instruction prior to exiting to userspace. Full emulation doesn't need such tricks becase re-emulating the instruction will naturally handle %rip being changed to point at the reset vector. Updating %rip prior to executing to userspace has several drawbacks: - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation fails it will likely yell about the wrong address. - Single step exits to userspace for are effectively dropped as KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO. - Behavior of PIO emulation is different depending on whether it goes down the fast path or the slow path. Rather than skip the PIO instruction before exiting to userspace, snapshot the linear %rip and cancel PIO completion if the current value does not match the snapshot. For a 64-bit vCPU, i.e. the most common scenario, the snapshot and comparison has negligible overhead as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra VMREAD in this case. All other alternatives to snapshotting the linear %rip that don't rely on an explicit reset announcenment suffer from one corner case or another. For example, canceling PIO completion on any write to %rip fails if userspace does a save/restore of %rip, and attempting to avoid that issue by canceling PIO only if %rip changed then fails if PIO collides with the reset %rip. Attempting to zero in on the exact reset vector won't work for APs, which means adding more hooks such as the vCPU's MP_STATE, and so on and so forth. Checking for a linear %rip match technically suffers from corner cases, e.g. userspace could theoretically rewrite the underlying code page and expect a different instruction to execute, or the guest hardcodes a PIO reset at 0xfffffff0, but those are far, far outside of what can be considered normal operation. Fixes: 432baf60 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O") Cc: <stable@vger.kernel.org> Reported-by: NJim Mattson <jmattson@google.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Xiaoyao Li 提交于
Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if host doesn't suppot it. We should move it to array emulated_msrs from arry msrs_to_save, to report to userspace that guest support this msr. Signed-off-by: NXiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES regardless of hardware support under the pretense that KVM fully emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts). Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so that it's emulated on AMD hosts. Fixes: 1eaafe91 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported") Cc: stable@vger.kernel.org Reported-by: NXiaoyao Li <xiaoyao.li@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Wei Yang 提交于
* nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is non-zero. * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages() never return zero. Based on these two reasons, we can merge the two *if* clause and use the return value from kvm_mmu_calculate_mmu_pages() directly. This simplify the code and also eliminate the possibility for reader to believe nr_mmu_pages would be zero. Signed-off-by: NWei Yang <richard.weiyang@gmail.com> Reviewed-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 21 2月, 2019 6 次提交
-
-
由 Sean Christopherson 提交于
Revert to a slow kvm_mmu_zap_all() for kvm_arch_flush_shadow_all(). Flushing all shadow entries is only done during VM teardown, i.e. kvm_arch_flush_shadow_all() is only called when the associated MM struct is being released or when the VM instance is being freed. Although the performance of teardown itself isn't critical, KVM should still voluntarily schedule to play nice with the rest of the kernel; but that can be done without the fast invalidate mechanism in a future patch. This reverts commit 6ca18b69. Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Unwinding optimizations related to obsolete pages is a step towards removing x86 KVM's fast invalidate mechanism, i.e. this is one part of a revert all patches from the series that introduced the mechanism[1]. This reverts commit 365c8868. [1] https://lkml.kernel.org/r/1369960590-14138-1-git-send-email-xiaoguangrong@linux.vnet.ibm.com Cc: Xiao Guangrong <guangrong.xiao@gmail.com> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
kvm_arch_memslots_updated() is at this point in time an x86-specific hook for handling MMIO generation wraparound. x86 stashes 19 bits of the memslots generation number in its MMIO sptes in order to avoid full page fault walks for repeat faults on emulated MMIO addresses. Because only 19 bits are used, wrapping the MMIO generation number is possible, if unlikely. kvm_arch_memslots_updated() alerts x86 that the generation has changed so that it can invalidate all MMIO sptes in case the effective MMIO generation has wrapped so as to avoid using a stale spte, e.g. a (very) old spte that was created with generation==0. Given that the purpose of kvm_arch_memslots_updated() is to prevent consuming stale entries, it needs to be called before the new generation is propagated to memslots. Invalidating the MMIO sptes after updating memslots means that there is a window where a vCPU could dereference the new memslots generation, e.g. 0, and incorrectly reuse an old MMIO spte that was created with (pre-wrap) generation==0. Fixes: e59dbe09 ("KVM: Introduce kvm_arch_memslots_updated()") Cc: <stable@vger.kernel.org> Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Ben Gardon 提交于
There are many KVM kernel memory allocations which are tied to the life of the VM process and should be charged to the VM process's cgroup. If the allocations aren't tied to the process, the OOM killer will not know that killing the process will free the associated kernel memory. Add __GFP_ACCOUNT flags to many of the allocations which are not yet being charged to the VM process's cgroup. Tested: Ran all kvm-unit-tests on a 64 bit Haswell machine, the patch introduced no new failures. Ran a kernel memory accounting test which creates a VM to touch memory and then checks that the kernel memory allocated for the process is within certain bounds. With this patch we account for much more of the vmalloc and slab memory allocated for the VM. There remain a few allocations which should be charged to the VM's cgroup but are not. In x86, they include: vcpu->arch.pio_data There allocations are unaccounted in this patch because they are mapped to userspace, and accounting them to a cgroup causes problems. This should be addressed in a future patch. Signed-off-by: NBen Gardon <bgardon@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Luwei Kang 提交于
Some Posted-Interrupts from passthrough devices may be lost or overwritten when the vCPU is in runnable state. The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but not running on physical CPU). If a posted interrupt coming at this time, the irq remmaping facility will set the bit of PIR (Posted Interrupt Requests) without ON (Outstanding Notification). So this interrupt can't be sync to APIC virtualization register and will not be handled by Guest because ON is zero. Signed-off-by: NLuwei Kang <luwei.kang@intel.com> [Eliminate the pi_clear_sn fast path. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Currently apicv_active can be true even if in-kernel LAPIC emulation is disabled. Avoid this by properly initializing it in kvm_arch_vcpu_init, and then do not do anything to deactivate APICv when it is actually not used (Currently APICv is only deactivated by SynIC code that in turn is only reachable when in-kernel LAPIC is in use. However, it is cleaner if kvm_vcpu_deactivate_apicv avoids relying on this. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 14 2月, 2019 1 次提交
-
-
由 Luwei Kang 提交于
Some Posted-Interrupts from passthrough devices may be lost or overwritten when the vCPU is in runnable state. The SN (Suppress Notification) of PID (Posted Interrupt Descriptor) will be set when the vCPU is preempted (vCPU in KVM_MP_STATE_RUNNABLE state but not running on physical CPU). If a posted interrupt comes at this time, the irq remapping facility will set the bit of PIR (Posted Interrupt Requests) but not ON (Outstanding Notification). Then, the interrupt will not be seen by KVM, which always expects PID.ON=1 if PID.PIR=1 as documented in the Intel processor SDM but not in the VT-d specification. To fix this, restore the invariant after PID.SN is cleared. Signed-off-by: NLuwei Kang <luwei.kang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 08 2月, 2019 1 次提交
-
-
由 Paolo Bonzini 提交于
Bugzilla: 1671930 Emulation of certain instructions (VMXON, VMCLEAR, VMPTRLD, VMWRITE with memory operand, INVEPT, INVVPID) can incorrectly inject a page fault when passed an operand that points to an MMIO address. The page fault will use uninitialized kernel stack memory as the CR2 and error code. The right behavior would be to abort the VM with a KVM_EXIT_INTERNAL_ERROR exit to userspace; however, it is not an easy fix, so for now just ensure that the error code and CR2 are zero. Embargoed until Feb 7th 2019. Reported-by: NFelix Wilhelm <fwilhelm@google.com> Cc: stable@kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 26 1月, 2019 1 次提交
-
-
由 Gustavo A. R. Silva 提交于
In preparation to enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warnings: arch/x86/kvm/lapic.c:1037:27: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/lapic.c:1876:3: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/hyperv.c:1637:6: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/svm.c:4396:6: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/mmu.c:4372:36: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/x86.c:3835:6: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/x86.c:7938:23: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/vmx/vmx.c:2015:6: warning: this statement may fall through [-Wimplicit-fallthrough=] arch/x86/kvm/vmx/vmx.c:1773:6: warning: this statement may fall through [-Wimplicit-fallthrough=] Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enabling -Wimplicit-fallthrough. Signed-off-by: NGustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-