- 17 2月, 2021 2 次提交
-
-
由 Michal Hocko 提交于
Preemption mode selection is currently hardcoded on Kconfig choices. Introduce a dedicated option to tune preemption flavour at boot time, This will be only available on architectures efficiently supporting static calls in order not to tempt with the feature against additional overhead that might be prohibitive or undesirable. CONFIG_PREEMPT_DYNAMIC is automatically selected by CONFIG_PREEMPT if the architecture provides the necessary support (CONFIG_STATIC_CALL_INLINE, CONFIG_GENERIC_ENTRY, and provide with __preempt_schedule_function() / __preempt_schedule_notrace_function()). Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NMichal Hocko <mhocko@suse.com> Signed-off-by: NFrederic Weisbecker <frederic@kernel.org> [peterz: relax requirement to HAVE_STATIC_CALL] Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210118141223.123667-5-frederic@kernel.org
-
由 Peter Zijlstra 提交于
Provide a stub function that return 0 and wire up the static call site patching to replace the CALL with a single 5 byte instruction that clears %RAX, the return value register. The function can be cast to any function pointer type that has a single %RAX return (including pointers). Also provide a version that returns an int for convenience. We are clearing the entire %RAX register in any case, whether the return value is 32 or 64 bits, since %RAX is always a scratch register anyway. Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NFrederic Weisbecker <frederic@kernel.org> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NIngo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20210118141223.123667-2-frederic@kernel.org
-
- 11 2月, 2021 1 次提交
-
-
由 Thomas Gleixner 提交于
Invoking x86_init.irqs.create_pci_msi_domain() before x86_init.pci.arch_init() breaks XEN PV. The XEN_PV specific pci.arch_init() function overrides the default create_pci_msi_domain() which is obviously too late. As a consequence the XEN PV PCI/MSI allocation goes through the native path which runs out of vectors and causes malfunction. Invoke it after x86_init.pci.arch_init(). Fixes: 6b15ffa0 ("x86/irq: Initialize PCI/MSI domain at PCI init time") Reported-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJuergen Gross <jgross@suse.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/87pn18djte.fsf@nanos.tec.linutronix.de
-
- 09 2月, 2021 2 次提交
-
-
由 Borislav Petkov 提交于
Commit 20bf2b37 ("x86/build: Disable CET instrumentation in the kernel") disabled CET instrumentation which gets added by default by the Ubuntu gcc9 and 10 by default, but did that only for 64-bit builds. It would still fail when building a 32-bit target. So disable CET for all x86 builds. Fixes: 20bf2b37 ("x86/build: Disable CET instrumentation in the kernel") Reported-by: NAC <achirvasub@gmail.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com> Tested-by: NAC <achirvasub@gmail.com> Link: https://lkml.kernel.org/r/YCCIgMHkzh/xT4ex@arch-chirva.localdomain
-
由 Jarkko Sakkinen 提交于
This has been shown in tests: [ +0.000008] WARNING: CPU: 3 PID: 7620 at kernel/rcu/srcutree.c:374 cleanup_srcu_struct+0xed/0x100 This is essentially a use-after free, although SRCU notices it as an SRCU cleanup in an invalid context. == Background == SGX has a data structure (struct sgx_encl_mm) which keeps per-mm SGX metadata. This is separate from struct sgx_encl because, in theory, an enclave can be mapped from more than one mm. sgx_encl_mm includes a pointer back to the sgx_encl. This means that sgx_encl must have a longer lifetime than all of the sgx_encl_mm's that point to it. That's usually the case: sgx_encl_mm is freed only after the mmu_notifier is unregistered in sgx_release(). However, there's a race. If the process is exiting, sgx_mmu_notifier_release() can be called in parallel with sgx_release() instead of being called *by* it. The mmu_notifier path keeps encl_mm alive past when sgx_encl can be freed. This inverts the lifetime rules and means that sgx_mmu_notifier_release() can access a freed sgx_encl. == Fix == Increase encl->refcount when encl_mm->encl is established. Release this reference when encl_mm is freed. This ensures that encl outlives encl_mm. [ bp: Massage commit message. ] Fixes: 1728ab54 ("x86/sgx: Add a page reclaimer") Reported-by: NHaitao Huang <haitao.huang@linux.intel.com> Signed-off-by: NJarkko Sakkinen <jarkko@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NDave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20210207221401.29933-1-jarkko@kernel.org
-
- 08 2月, 2021 1 次提交
-
-
由 Rafael J. Wysocki 提交于
If the maximum performance level taken for computing the arch_max_freq_ratio value used in the x86 scale-invariance code is higher than the one corresponding to the cpuinfo.max_freq value coming from the acpi_cpufreq driver, the scale-invariant utilization falls below 100% even if the CPU runs at cpuinfo.max_freq or slightly faster, which causes the schedutil governor to select a frequency below cpuinfo.max_freq. That frequency corresponds to a frequency table entry below the maximum performance level necessary to get to the "boost" range of CPU frequencies which prevents "boost" frequencies from being used in some workloads. While this issue is related to scale-invariance, it may be amplified by commit db865272 ("cpufreq: Avoid configuring old governors as default with intel_pstate") from the 5.10 development cycle which made it extremely easy to default to schedutil even if the preferred driver is acpi_cpufreq as long as intel_pstate is built too, because the mere presence of the latter effectively removes the ondemand governor from the defaults. Distro kernels are likely to include both intel_pstate and acpi_cpufreq on x86, so their users who cannot use intel_pstate or choose to use acpi_cpufreq may easily be affectecd by this issue. If CPPC is available, it can be used to address this issue by extending the frequency tables created by acpi_cpufreq to cover the entire available frequency range (including "boost" frequencies) for each CPU, but if CPPC is not there, acpi_cpufreq has no idea what the maximum "boost" frequency is and the frequency tables created by it cannot be extended in a meaningful way, so in that case make it ask the arch scale-invariance code to to use the "nominal" performance level for CPU utilization scaling in order to avoid the issue at hand. Fixes: db865272 ("cpufreq: Avoid configuring old governors as default with intel_pstate") Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: NGiovanni Gherdovich <ggherdovich@suse.cz> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
-
- 06 2月, 2021 4 次提交
-
-
由 Borislav Petkov 提交于
With CONFIG_X86_5LEVEL, CONFIG_UBSAN and CONFIG_UBSAN_UNSIGNED_OVERFLOW enabled, clang fails the build with x86_64-linux-ld: arch/x86/platform/efi/efi_64.o: in function `efi_sync_low_kernel_mappings': efi_64.c:(.text+0x22c): undefined reference to `__compiletime_assert_354' which happens due to -fsanitize=unsigned-integer-overflow being enabled: -fsanitize=unsigned-integer-overflow: Unsigned integer overflow, where the result of an unsigned integer computation cannot be represented in its type. Unlike signed integer overflow, this is not undefined behavior, but it is often unintentional. This sanitizer does not check for lossy implicit conversions performed before such a computation (see -fsanitize=implicit-conversion). and that fires when the (intentional) EFI_VA_START/END defines overflow an unsigned long, leading to the assertion expressions not getting optimized away (on GCC they do)... However, those checks are superfluous: the runtime services mapping code already makes sure the ranges don't overshoot EFI_VA_END as the EFI mapping range is hardcoded. On each runtime services call, it is switched to the EFI-specific PGD and even if mappings manage to escape that last PGD, this won't remain unnoticed for long. So rip them out. See https://github.com/ClangBuiltLinux/linux/issues/256 for more info. Reported-by: NArnd Bergmann <arnd@arndb.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NNathan Chancellor <nathan@kernel.org> Acked-by: NArd Biesheuvel <ardb@kernel.org> Tested-by: NNick Desaulniers <ndesaulniers@google.com> Tested-by: NNathan Chancellor <nathan@kernel.org> Link: http://lkml.kernel.org/r/20210107223424.4135538-1-arnd@kernel.org
-
由 Gabriel Krisman Bertazi 提交于
Commit 29915524 ("entry: Drop usage of TIF flags in the generic syscall code") introduced a bug on architectures using the generic syscall entry code, in which processes stopped by PTRACE_SYSCALL do not trap on syscall return after receiving a TIF_SINGLESTEP. The reason is that the meaning of TIF_SINGLESTEP flag is overloaded to cause the trap after a system call is executed, but since the above commit, the syscall call handler only checks for the SYSCALL_WORK flags on the exit work. Split the meaning of TIF_SINGLESTEP such that it only means single-step mode, and create a new type of SYSCALL_WORK to request a trap immediately after a syscall in single-step mode. In the current implementation, the SYSCALL_WORK flag shadows the TIF_SINGLESTEP flag for simplicity. Update x86 to flip this bit when a tracer enables single stepping. Fixes: 29915524 ("entry: Drop usage of TIF flags in the generic syscall code") Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGabriel Krisman Bertazi <krisman@collabora.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NKyle Huey <me@kylehuey.com> Link: https://lore.kernel.org/r/87h7mtc9pr.fsf_-_@collabora.com
-
由 Lai Jiangshan 提交于
local_db_save() is called at the start of exc_debug_kernel(), reads DR7 and disables breakpoints to prevent recursion. When running in a guest (X86_FEATURE_HYPERVISOR), local_db_save() reads the per-cpu variable cpu_dr7 to check whether a breakpoint is active or not before it accesses DR7. A data breakpoint on cpu_dr7 therefore results in infinite #DB recursion. Disallow data breakpoints on cpu_dr7 to prevent that. Fixes: 84b6a349("x86/entry: Optimize local_db_save() for virt") Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210204152708.21308-2-jiangshanlai@gmail.com
-
由 Lai Jiangshan 提交于
When FSGSBASE is enabled, paranoid_entry() fetches the per-CPU GSBASE value via __per_cpu_offset or pcpu_unit_offsets. When a data breakpoint is set on __per_cpu_offset[cpu] (read-write operation), the specific CPU will be stuck in an infinite #DB loop. RCU will try to send an NMI to the specific CPU, but it is not working either since NMI also relies on paranoid_entry(). Which means it's undebuggable. Fixes: eaad9812("x86/entry/64: Introduce the FIND_PERCPU_BASE macro") Signed-off-by: NLai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20210204152708.21308-1-jiangshanlai@gmail.com
-
- 05 2月, 2021 3 次提交
-
-
由 Dave Hansen 提交于
Jan Kiszka reported that the x2apic_wrmsr_fence() function uses a plain MFENCE while the Intel SDM (10.12.3 MSR Access in x2APIC Mode) calls for MFENCE; LFENCE. Short summary: we have special MSRs that have weaker ordering than all the rest. Add fencing consistent with current SDM recommendations. This is not known to cause any issues in practice, only in theory. Longer story below: The reason the kernel uses a different semantic is that the SDM changed (roughly in late 2017). The SDM changed because folks at Intel were auditing all of the recommended fences in the SDM and realized that the x2apic fences were insufficient. Why was the pain MFENCE judged insufficient? WRMSR itself is normally a serializing instruction. No fences are needed because the instruction itself serializes everything. But, there are explicit exceptions for this serializing behavior written into the WRMSR instruction documentation for two classes of MSRs: IA32_TSC_DEADLINE and the X2APIC MSRs. Back to x2apic: WRMSR is *not* serializing in this specific case. But why is MFENCE insufficient? MFENCE makes writes visible, but only affects load/store instructions. WRMSR is unfortunately not a load/store instruction and is unaffected by MFENCE. This means that a non-serializing WRMSR could be reordered by the CPU to execute before the writes made visible by the MFENCE have even occurred in the first place. This means that an x2apic IPI could theoretically be triggered before there is any (visible) data to process. Does this affect anything in practice? I honestly don't know. It seems quite possible that by the time an interrupt gets to consume the (not yet) MFENCE'd data, it has become visible, mostly by accident. To be safe, add the SDM-recommended fences for all x2apic WRMSRs. This also leaves open the question of the _other_ weakly-ordered WRMSR: MSR_IA32_TSC_DEADLINE. While it has the same ordering architecture as the x2APIC MSRs, it seems substantially less likely to be a problem in practice. While writes to the in-memory Local Vector Table (LVT) might theoretically be reordered with respect to a weakly-ordered WRMSR like TSC_DEADLINE, the SDM has this to say: In x2APIC mode, the WRMSR instruction is used to write to the LVT entry. The processor ensures the ordering of this write and any subsequent WRMSR to the deadline; no fencing is required. But, that might still leave xAPIC exposed. The safest thing to do for now is to add the extra, recommended LFENCE. [ bp: Massage commit message, fix typos, drop accidentally added newline to tools/arch/x86/include/asm/barrier.h. ] Reported-by: NJan Kiszka <jan.kiszka@siemens.com> Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NThomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200305174708.F77040DD@viggo.jf.intel.com
-
由 Mike Rapoport 提交于
This reverts commit bde9cfa3. Changing the first memory page type from E820_TYPE_RESERVED to E820_TYPE_RAM makes it a part of "System RAM" resource rather than a reserved resource and this in turn causes devmem_is_allowed() to treat is as area that can be accessed but it is filled with zeroes instead of the actual data as previously. The change in /dev/mem output causes lilo to fail as was reported at slakware users forum, and probably other legacy applications will experience similar problems. Link: https://www.linuxquestions.org/questions/slackware-14/slackware-current-lilo-vesa-warnings-after-recent-updates-4175689617/#post6214439Signed-off-by: NMike Rapoport <rppt@linux.ibm.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sean Christopherson 提交于
Set cr3_lm_rsvd_bits, which is effectively an invalid GPA mask, at vCPU reset. The reserved bits check needs to be done even if userspace never configures the guest's CPUID model. Cc: stable@vger.kernel.org Fixes: 0107973a ("KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch") Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210204000117.3303214-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 04 2月, 2021 1 次提交
-
-
由 Ben Gardon 提交于
There is a bug in the TDP MMU function to zap SPTEs which could be replaced with a larger mapping which prevents the function from doing anything. Fix this by correctly zapping the last level SPTEs. Cc: stable@vger.kernel.org Fixes: 14881998 ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU") Signed-off-by: NBen Gardon <bgardon@google.com> Message-Id: <20210202185734.1680553-11-bgardon@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 03 2月, 2021 3 次提交
-
-
由 Paolo Bonzini 提交于
If not in long mode, the low bits of CR3 are reserved but not enforced to be zero, so remove those checks. If in long mode, however, the MBZ bits extend down to the highest physical address bit of the guest, excluding the encryption bit. Make the checks consistent with the above, and match them between nested_vmcb_checks and KVM_SET_SREGS. Cc: stable@vger.kernel.org Fixes: 761e4169 ("KVM: nSVM: Check that MBZ bits in CR3 and CR4 are not set on vmrun of nested guests") Fixes: a780a3ea ("KVM: X86: Fix reserved bits check for MOV to CR3") Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Don't let KVM load when running as an SEV guest, regardless of what CPUID says. Memory is encrypted with a key that is not accessible to the host (L0), thus it's impossible for L0 to emulate SVM, e.g. it'll see garbage when reading the VMCB. Technically, KVM could decrypt all memory that needs to be accessible to the L0 and use shadow paging so that L0 does not need to shadow NPT, but exposing such information to L0 largely defeats the purpose of running as an SEV guest. This can always be revisited if someone comes up with a use case for running VMs inside SEV guests. Note, VMLOAD, VMRUN, etc... will also #GP on GPAs with C-bit set, i.e. KVM is doomed even if the SEV guest is debuggable and the hypervisor is willing to decrypt the VMCB. This may or may not be fixed on CPUs that have the SVME_ADDR_CHK fix. Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210202212017.2486595-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Set the emulator context to PROT64 if SYSENTER transitions from 32-bit userspace (compat mode) to a 64-bit kernel, otherwise the RIP update at the end of x86_emulate_insn() will incorrectly truncate the new RIP. Note, this bug is mostly limited to running an Intel virtual CPU model on an AMD physical CPU, as other combinations of virtual and physical CPUs do not trigger full emulation. On Intel CPUs, SYSENTER in compatibility mode is legal, and unconditionally transitions to 64-bit mode. On AMD CPUs, SYSENTER is illegal in compatibility mode and #UDs. If the vCPU is AMD, KVM injects a #UD on SYSENTER in compat mode. If the pCPU is Intel, SYSENTER will execute natively and not trigger #UD->VM-Exit (ignoring guest TLB shenanigans). Fixes: fede8076 ("KVM: x86: handle wrap around 32-bit address space") Cc: stable@vger.kernel.org Signed-off-by: NJonny Barker <jonny@jonnybarker.com> [sean: wrote changelog] Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210202165546.2390296-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 02 2月, 2021 4 次提交
-
-
由 Fenghua Yu 提交于
Add Alder Lake mobile processor to CPU list to enumerate and enable the split lock feature. Signed-off-by: NFenghua Yu <fenghua.yu@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NTony Luck <tony.luck@intel.com> Link: https://lkml.kernel.org/r/20210201190007.4031869-1-fenghua.yu@intel.com
-
由 Vitaly Kuznetsov 提交于
Commit 7a873e45 ("KVM: selftests: Verify supported CR4 bits can be set before KVM_SET_CPUID2") reveals that KVM allows to set X86_CR4_PCIDE even when PCID support is missing: ==== Test Assertion Failure ==== x86_64/set_sregs_test.c:41: rc pid=6956 tid=6956 - Invalid argument 1 0x000000000040177d: test_cr4_feature_bit at set_sregs_test.c:41 2 0x00000000004014fc: main at set_sregs_test.c:119 3 0x00007f2d9346d041: ?? ??:0 4 0x000000000040164d: _start at ??:? KVM allowed unsupported CR4 bit (0x20000) Add X86_FEATURE_PCID feature check to __cr4_reserved_bits() to make kvm_is_valid_cr4() fail. Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210201142843.108190-1-vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Zheng Zhan Liang 提交于
Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NZheng Zhan Liang <zhengzhanliang@huorong.cn> Message-Id: <20210201055310.267029-1-zhengzhanliang@huorong.cn> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
Userspace that does not know about KVM_GET_MSR_FEATURE_INDEX_LIST will generally use the default value for MSR_IA32_ARCH_CAPABILITIES. When this happens and the host has tsx=on, it is possible to end up with virtual machines that have HLE and RTM disabled, but TSX_CTRL available. If the fleet is then switched to tsx=off, kvm_get_arch_capabilities() will clear the ARCH_CAP_TSX_CTRL_MSR bit and it will not be possible to use the tsx=off hosts as migration destinations, even though the guests do not have TSX enabled. To allow this migration, allow guests to write to their TSX_CTRL MSR, while keeping the host MSR unchanged for the entire life of the guests. This ensures that TSX remains disabled and also saves MSR reads and writes, and it's okay to do because with tsx=off we know that guests will not have the HLE and RTM features in their CPUID. (If userspace sets bogus CPUID data, we do not expect HLE and RTM to work in guests anyway). Cc: stable@vger.kernel.org Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES") Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 01 2月, 2021 1 次提交
-
-
由 Peter Zijlstra 提交于
Tom reported that one of the GDB test-cases failed, and Boris bisected it to commit: d53d9bc0 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6") The debugging session led us to commit: 6c0aca28 ("x86: Ignore trap bits on single step exceptions") It turns out that TF and data breakpoints are both traps and will be merged, while instruction breakpoints are faults and will not be merged. This means 6c0aca28 is wrong, only TF and instruction breakpoints need to be excluded while TF and data breakpoints can be merged. [ bp: Massage commit message. ] Fixes: d53d9bc0 ("x86/debug: Change thread.debugreg6 to thread.virtual_dr6") Fixes: 6c0aca28 ("x86: Ignore trap bits on single step exceptions") Reported-by: NTom de Vries <tdevries@suse.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/YBMAbQGACujjfz%2Bi@hirez.programming.kicks-ass.net Link: https://lkml.kernel.org/r/20210128211627.GB4348@worktop.programming.kicks-ass.net
-
- 30 1月, 2021 1 次提交
-
-
由 Josh Poimboeuf 提交于
With retpolines disabled, some configurations of GCC, and specifically the GCC versions 9 and 10 in Ubuntu will add Intel CET instrumentation to the kernel by default. That breaks certain tracing scenarios by adding a superfluous ENDBR64 instruction before the fentry call, for functions which can be called indirectly. CET instrumentation isn't currently necessary in the kernel, as CET is only supported in user space. Disable it unconditionally and move it into the x86's Makefile as CET/CFI... enablement should be a per-arch decision anyway. [ bp: Massage and extend commit message. ] Fixes: 29be86d7 ("kbuild: add -fcf-protection=none when using retpoline flags") Reported-by: NNikolay Borisov <nborisov@suse.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NNikolay Borisov <nborisov@suse.com> Tested-by: NNikolay Borisov <nborisov@suse.com> Cc: <stable@vger.kernel.org> Cc: Seth Forshee <seth.forshee@canonical.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Link: https://lkml.kernel.org/r/20210128215219.6kct3h2eiustncws@treble
-
- 29 1月, 2021 1 次提交
-
-
由 Peter Gonda 提交于
Grab kvm->lock before pinning memory when registering an encrypted region; sev_pin_memory() relies on kvm->lock being held to ensure correctness when checking and updating the number of pinned pages. Add a lockdep assertion to help prevent future regressions. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Sean Christopherson <seanjc@google.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 1e80fdc0 ("KVM: SVM: Pin guest memory when SEV is active") Signed-off-by: NPeter Gonda <pgonda@google.com> V2 - Fix up patch description - Correct file paths svm.c -> sev.c - Add unlock of kvm->lock on sev_pin_memory error V1 - https://lore.kernel.org/kvm/20210126185431.1824530-1-pgonda@google.com/ Message-Id: <20210127161524.2832400-1-pgonda@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 28 1月, 2021 1 次提交
-
-
由 Michael Roth 提交于
Recent commit 255cbecf modified struct kvm_vcpu_arch to make 'cpuid_entries' a pointer to an array of kvm_cpuid_entry2 entries rather than embedding the array in the struct. KVM_SET_CPUID and KVM_SET_CPUID2 were updated accordingly, but KVM_GET_CPUID2 was missed. As a result, KVM_GET_CPUID2 currently returns random fields from struct kvm_vcpu_arch to userspace rather than the expected CPUID values. Fix this by treating 'cpuid_entries' as a pointer when copying its contents to userspace buffer. Fixes: 255cbecf ("KVM: x86: allocate vcpu->arch.cpuid_entries dynamically") Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NMichael Roth <michael.roth@amd.com.com> Message-Id: <20210128024451.1816770-1-michael.roth@amd.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 27 1月, 2021 1 次提交
-
-
由 Juergen Gross 提交于
When booting a kernel which has been built with CONFIG_AMD_MEM_ENCRYPT enabled as a Xen pv guest a warning is issued for each processor: [ 5.964347] ------------[ cut here ]------------ [ 5.968314] WARNING: CPU: 0 PID: 1 at /home/gross/linux/head/arch/x86/xen/enlighten_pv.c:660 get_trap_addr+0x59/0x90 [ 5.972321] Modules linked in: [ 5.976313] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.11.0-rc5-default #75 [ 5.980313] Hardware name: Dell Inc. OptiPlex 9020/0PC5F7, BIOS A05 12/05/2013 [ 5.984313] RIP: e030:get_trap_addr+0x59/0x90 [ 5.988313] Code: 42 10 83 f0 01 85 f6 74 04 84 c0 75 1d b8 01 00 00 00 c3 48 3d 00 80 83 82 72 08 48 3d 20 81 83 82 72 0c b8 01 00 00 00 eb db <0f> 0b 31 c0 c3 48 2d 00 80 83 82 48 ba 72 1c c7 71 1c c7 71 1c 48 [ 5.992313] RSP: e02b:ffffc90040033d38 EFLAGS: 00010202 [ 5.996313] RAX: 0000000000000001 RBX: ffffffff82a141d0 RCX: ffffffff8222ec38 [ 6.000312] RDX: ffffffff8222ec38 RSI: 0000000000000005 RDI: ffffc90040033d40 [ 6.004313] RBP: ffff8881003984a0 R08: 0000000000000007 R09: ffff888100398000 [ 6.008312] R10: 0000000000000007 R11: ffffc90040246000 R12: ffff8884082182a8 [ 6.012313] R13: 0000000000000100 R14: 000000000000001d R15: ffff8881003982d0 [ 6.016316] FS: 0000000000000000(0000) GS:ffff888408200000(0000) knlGS:0000000000000000 [ 6.020313] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6.024313] CR2: ffffc900020ef000 CR3: 000000000220a000 CR4: 0000000000050660 [ 6.028314] Call Trace: [ 6.032313] cvt_gate_to_trap.part.7+0x3f/0x90 [ 6.036313] ? asm_exc_double_fault+0x30/0x30 [ 6.040313] xen_convert_trap_info+0x87/0xd0 [ 6.044313] xen_pv_cpu_up+0x17a/0x450 [ 6.048313] bringup_cpu+0x2b/0xc0 [ 6.052313] ? cpus_read_trylock+0x50/0x50 [ 6.056313] cpuhp_invoke_callback+0x80/0x4c0 [ 6.060313] _cpu_up+0xa7/0x140 [ 6.064313] cpu_up+0x98/0xd0 [ 6.068313] bringup_nonboot_cpus+0x4f/0x60 [ 6.072313] smp_init+0x26/0x79 [ 6.076313] kernel_init_freeable+0x103/0x258 [ 6.080313] ? rest_init+0xd0/0xd0 [ 6.084313] kernel_init+0xa/0x110 [ 6.088313] ret_from_fork+0x1f/0x30 [ 6.092313] ---[ end trace be9ecf17dceeb4f3 ]--- Reason is that there is no Xen pv trap entry for X86_TRAP_VC. Fix that by adding a generic trap handler for unknown traps and wire all unknown bare metal handlers to this generic handler, which will just crash the system in case such a trap will ever happen. Fixes: 0786138c ("x86/sev-es: Add a Runtime #VC Exception Handler") Cc: <stable@vger.kernel.org> # v5.10 Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NAndrew Cooper <andrew.cooper3@citrix.com> Signed-off-by: NJuergen Gross <jgross@suse.com>
-
- 26 1月, 2021 9 次提交
-
-
由 Paolo Bonzini 提交于
VMX also uses KVM_REQ_GET_NESTED_STATE_PAGES for the Hyper-V eVMCS, which may need to be loaded outside guest mode. Therefore we cannot WARN in that case. However, that part of nested_get_vmcs12_pages is _not_ needed at vmentry time. Split it out of KVM_REQ_GET_NESTED_STATE_PAGES handling, so that both vmentry and migration (and in the latter case, independent of is_guest_mode) do the parts that are needed. Cc: <stable@vger.kernel.org> # 5.10.x: f2c7ef3b: KVM: nSVM: cancel KVM_REQ_GET_NESTED_STATE_PAGES Cc: <stable@vger.kernel.org> # 5.10.x Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Revert the dirty/available tracking of GPRs now that KVM copies the GPRs to the GHCB on any post-VMGEXIT VMRUN, even if a GPR is not dirty. Per commit de3cd117 ("KVM: x86: Omit caching logic for always-available GPRs"), tracking for GPRs noticeably impacts KVM's code footprint. This reverts commit 1c04d8c9. Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-3-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Drop the per-GPR dirty checks when synchronizing GPRs to the GHCB, the GRPs' dirty bits are set from time zero and never cleared, i.e. will always be seen as dirty. The obvious alternative would be to clear the dirty bits when appropriate, but removing the dirty checks is desirable as it allows reverting GPR dirty+available tracking, which adds overhead to all flavors of x86 VMs. Note, unconditionally writing the GPRs in the GHCB is tacitly allowed by the GHCB spec, which allows the hypervisor (or guest) to provide unnecessary info; it's the guest's responsibility to consume only what it needs (the hypervisor is untrusted after all). The guest and hypervisor can supply additional state if desired but must not rely on that additional state being provided. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Fixes: 291bd20d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT") Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210122235049.3107620-2-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Maxim Levitsky 提交于
Even when we are outside the nested guest, some vmcs02 fields may not be in sync vs vmcs12. This is intentional, even across nested VM-exit, because the sync can be delayed until the nested hypervisor performs a VMCLEAR or a VMREAD/VMWRITE that affects those rarely accessed fields. However, during KVM_GET_NESTED_STATE, the vmcs12 has to be up to date to be able to restore it. To fix that, call copy_vmcs02_to_vmcs12_rare() before the vmcs12 contents are copied to userspace. Fixes: 7952d769 ("KVM: nVMX: Sync rarely accessed guest fields only when needed") Reviewed-by: NSean Christopherson <seanjc@google.com> Signed-off-by: NMaxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210114205449.8715-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Lorenzo Brescia 提交于
On VMX, if we exit and then re-enter immediately without leaving the vmx_vcpu_run() function, the kvm_entry event is not logged. That means we will see one (or more) kvm_exit, without its (their) corresponding kvm_entry, as shown here: CPU-1979 [002] 89.871187: kvm_entry: vcpu 1 CPU-1979 [002] 89.871218: kvm_exit: reason MSR_WRITE CPU-1979 [002] 89.871259: kvm_exit: reason MSR_WRITE It also seems possible for a kvm_entry event to be logged, but then we leave vmx_vcpu_run() right away (if vmx->emulation_required is true). In this case, we will have a spurious kvm_entry event in the trace. Fix these situations by moving trace_kvm_entry() inside vmx_vcpu_run() (where trace_kvm_exit() already is). A trace obtained with this patch applied looks like this: CPU-14295 [000] 8388.395387: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395392: kvm_exit: reason MSR_WRITE CPU-14295 [000] 8388.395393: kvm_entry: vcpu 0 CPU-14295 [000] 8388.395503: kvm_exit: reason EXTERNAL_INTERRUPT Of course, not calling trace_kvm_entry() in common x86 code any longer means that we need to adjust the SVM side of things too. Signed-off-by: NLorenzo Brescia <lorenzo.brescia@edu.unito.it> Signed-off-by: NDario Faggioli <dfaggioli@suse.com> Message-Id: <160873470698.11652.13483635328769030605.stgit@Wayrath> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jay Zhou 提交于
The injection process of smi has two steps: Qemu KVM Step1: cpu->interrupt_request &= \ ~CPU_INTERRUPT_SMI; kvm_vcpu_ioctl(cpu, KVM_SMI) call kvm_vcpu_ioctl_smi() and kvm_make_request(KVM_REQ_SMI, vcpu); Step2: kvm_vcpu_ioctl(cpu, KVM_RUN, 0) call process_smi() if kvm_check_request(KVM_REQ_SMI, vcpu) is true, mark vcpu->arch.smi_pending = true; The vcpu->arch.smi_pending will be set true in step2, unfortunately if vcpu paused between step1 and step2, the kvm_run->immediate_exit will be set and vcpu has to exit to Qemu immediately during step2 before mark vcpu->arch.smi_pending true. During VM migration, Qemu will get the smi pending status from KVM using KVM_GET_VCPU_EVENTS ioctl at the downtime, then the smi pending status will be lost. Signed-off-by: NJay Zhou <jianjay.zhou@huawei.com> Signed-off-by: NShengen Zhuang <zhuangshengen@huawei.com> Message-Id: <20210118084720.1585-1-jianjay.zhou@huawei.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
The HW_REF_CPU_CYCLES event on the fixed counter 2 is pseudo-encoded as 0x0300 in the intel_perfmon_event_map[]. Correct its usage. Fixes: 62079d8a ("KVM: PMU: add proper support for fixed counter 2") Signed-off-by: NLike Xu <like.xu@linux.intel.com> Message-Id: <20201230081916.63417-1-like.xu@linux.intel.com> Reviewed-by: NSean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Like Xu 提交于
Since we know vPMU will not work properly when (1) the guest bit_width(s) of the [gp|fixed] counters are greater than the host ones, or (2) guest requested architectural events exceeds the range supported by the host, so we can setup a smaller left shift value and refresh the guest cpuid entry, thus fixing the following UBSAN shift-out-of-bounds warning: shift exponent 197 is too large for 64-bit type 'long long unsigned int' Call Trace: __dump_stack lib/dump_stack.c:79 [inline] dump_stack+0x107/0x163 lib/dump_stack.c:120 ubsan_epilogue+0xb/0x5a lib/ubsan.c:148 __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395 intel_pmu_refresh.cold+0x75/0x99 arch/x86/kvm/vmx/pmu_intel.c:348 kvm_vcpu_after_set_cpuid+0x65a/0xf80 arch/x86/kvm/cpuid.c:177 kvm_vcpu_ioctl_set_cpuid2+0x160/0x440 arch/x86/kvm/cpuid.c:308 kvm_arch_vcpu_ioctl+0x11b6/0x2d70 arch/x86/kvm/x86.c:4709 kvm_vcpu_ioctl+0x7b9/0xdb0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3386 vfs_ioctl fs/ioctl.c:48 [inline] __do_sys_ioctl fs/ioctl.c:753 [inline] __se_sys_ioctl fs/ioctl.c:739 [inline] __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:739 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Reported-by: syzbot+ae488dc136a4cc6ba32b@syzkaller.appspotmail.com Signed-off-by: NLike Xu <like.xu@linux.intel.com> Message-Id: <20210118025800.34620-1-like.xu@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Sean Christopherson 提交于
Add compile-time asserts in rsvd_bits() to guard against KVM passing in garbage hardcoded values, and cap the upper bound at '63' for dynamic values to prevent generating a mask that would overflow a u64. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NSean Christopherson <seanjc@google.com> Message-Id: <20210113204515.3473079-1-seanjc@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 25 1月, 2021 1 次提交
-
-
由 Mike Rapoport 提交于
Patch series "mm: fix initialization of struct page for holes in memory layout", v3. Commit 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN") exposed several issues with the memory map initialization and these patches fix those issues. Initially there were crashes during compaction that Qian Cai reported back in April [1]. It seemed back then that the problem was fixed, but a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional discussion at [3]. [1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw [2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com [3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org This patch (of 2): The first 4Kb of memory is a BIOS owned area and to avoid its allocation for the kernel it was not listed in e820 tables as memory. As the result, pfn 0 was never recognised by the generic memory management and it is not a part of neither node 0 nor ZONE_DMA. If set_pfnblock_flags_mask() would be ever called for the pageblock corresponding to the first 2Mbytes of memory, having pfn 0 outside of ZONE_DMA would trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page), pfn), page); Along with reserving the first 4Kb in e820 tables, several first pages are reserved with memblock in several places during setup_arch(). These reservations are enough to ensure the kernel does not touch the BIOS area and it is not necessary to remove E820_TYPE_RAM for pfn 0. Remove the update of e820 table that changes the type of pfn 0 and move the comment describing why it was done to trim_low_memory_range() that reserves the beginning of the memory. Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 1月, 2021 1 次提交
-
-
由 Gayatri Kammela 提交于
Add Alder Lake mobile CPU model number to Intel family. Signed-off-by: NGayatri Kammela <gayatri.kammela@intel.com> Signed-off-by: NTony Luck <tony.luck@intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210121215004.11618-1-tony.luck@intel.com
-
- 21 1月, 2021 2 次提交
-
-
由 Andy Lutomirski 提交于
The default kernel_fpu_begin() doesn't work on systems that support XMM but haven't yet enabled CR4.OSFXSR. This causes crashes when _mmx_memcpy() is called too early because LDMXCSR generates #UD when the aforementioned bit is clear. Fix it by using kernel_fpu_begin_mask(KFPU_387) explicitly. Fixes: 7ad81676 ("x86/fpu: Reset MXCSR to default in kernel_fpu_begin()") Reported-by: NKrzysztof Mazur <krzysiek@podlesie.net> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Tested-by: NKrzysztof Piotr Olędzki <ole@ans.pl> Tested-by: NKrzysztof Mazur <krzysiek@podlesie.net> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/e7bf21855fe99e5f3baa27446e32623358f69e8d.1611205691.git.luto@kernel.org
-
由 Andy Lutomirski 提交于
Currently, requesting kernel FPU access doesn't distinguish which parts of the extended ("FPU") state are needed. This is nice for simplicity, but there are a few cases in which it's suboptimal: - The vast majority of in-kernel FPU users want XMM/YMM/ZMM state but do not use legacy 387 state. These users want MXCSR initialized but don't care about the FPU control word. Skipping FNINIT would save time. (Empirically, FNINIT is several times slower than LDMXCSR.) - Code that wants MMX doesn't want or need MXCSR initialized. _mmx_memcpy(), for example, can run before CR4.OSFXSR gets set, and initializing MXCSR will fail because LDMXCSR generates an #UD when the aforementioned CR4 bit is not set. - Any future in-kernel users of XFD (eXtended Feature Disable)-capable dynamic states will need special handling. Add a more specific API that allows callers to specify exactly what they want. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Tested-by: NKrzysztof Piotr Olędzki <ole@ans.pl> Link: https://lkml.kernel.org/r/aff1cac8b8fc7ee900cf73e8f2369966621b053f.1611205691.git.luto@kernel.org
-
- 20 1月, 2021 1 次提交
-
-
由 Rafael J. Wysocki 提交于
On x86 scale invariace tends to be disabled during resume from suspend-to-RAM, because the MPERF or APERF MSR values are not as expected then due to updates taking place after the platform firmware has been invoked to complete the suspend transition. That, of course, is not desirable, especially if the schedutil scaling governor is in use, because the lack of scale invariance causes it to be less reliable. To counter that effect, modify init_freq_invariance() to register a syscore_ops object for scale invariance with the ->resume callback pointing to init_counter_refs() which will run on the CPU starting the resume transition (the other CPUs will be taken care of the "online" operations taking place later). Fixes: e2b0d619 ("x86, sched: check for counters overflow in frequency invariant accounting") Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Acked-by: NGiovanni Gherdovich <ggherdovich@suse.cz> Link: https://lkml.kernel.org/r/1803209.Mvru99baaF@kreacher
-