- 08 8月, 2018 2 次提交
-
-
由 Andi Kleen 提交于
Some cases in THP like: - MADV_FREE - mprotect - split mark the PMD non present for temporarily to prevent races. The window for an L1TF attack in these contexts is very small, but it wants to be fixed for correctness sake. Use the proper low level functions for pmd/pud_mknotpresent() to address this. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Andi Kleen 提交于
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present mapping, but the inversion logic explicitely checks for !PRESENT and PROT_NONE. Remove the PROT_NONE check and make the inversion unconditional for all not present mappings. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 07 8月, 2018 1 次提交
-
-
由 Thomas Gleixner 提交于
Josh reported that the late SMT evaluation in cpu_smt_state_init() sets cpu_smt_control to CPU_SMT_NOT_SUPPORTED in case that 'nosmt' was supplied on the kernel command line as it cannot differentiate between SMT disabled by BIOS and SMT soft disable via 'nosmt'. That wreckages the state and makes the sysfs interface unusable. Rework this so that during bringup of the non boot CPUs the availability of SMT is determined in cpu_smt_allowed(). If a newly booted CPU is not a 'primary' thread then set the local cpu_smt_available marker and evaluate this explicitely right after the initial SMP bringup has finished. SMT evaulation on x86 is a trainwreck as the firmware has all the information _before_ booting the kernel, but there is no interface to query it. Fixes: 73d5e2b4 ("cpu/hotplug: detect SMT disabled by BIOS") Reported-by: NJosh Poimboeuf <jpoimboe@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 05 8月, 2018 11 次提交
-
-
由 Paolo Bonzini 提交于
When nested virtualization is in use, VMENTER operations from the nested hypervisor into the nested guest will always be processed by the bare metal hypervisor, and KVM's "conditional cache flushes" mode in particular does a flush on nested vmentry. Therefore, include the "skip L1D flush on vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting. Add the relevant Documentation. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Paolo Bonzini 提交于
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is not needed. Add a new value to enum vmx_l1d_flush_state, which is used either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Paolo Bonzini 提交于
Three changes to the content of the sysfs file: - If EPT is disabled, L1TF cannot be exploited even across threads on the same core, and SMT is irrelevant. - If mitigation is completely disabled, and SMT is enabled, print "vulnerable" instead of "vulnerable, SMT vulnerable" - Reorder the two parts so that the main vulnerability state comes first and the detail on SMT is second. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
For VMEXITs caused by external interrupts, vmx_handle_external_intr() indirectly calls into the interrupt handlers through the host's IDT. It follows that these interrupts get accounted for in the kvm_cpu_l1tf_flush_l1d per-cpu flag. The subsequently executed vmx_l1d_flush() will thus be aware that some interrupts have happened and conduct a L1d flush anyway. Setting l1tf_flush_l1d from vmx_handle_external_intr() isn't needed anymore. Drop it. Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
The last missing piece to having vmx_l1d_flush() take interrupts after VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on irq entry. Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(), ipi_entering_ack_irq(), smp_reschedule_interrupt() and uv_bau_message_interrupt(). Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
The next patch in this series will have to make the definition of irq_cpustat_t available to entering_irq(). Inclusion of asm/hardirq.h into asm/apic.h would cause circular header dependencies like asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/topology.h linux/smp.h asm/smp.h or linux/gfp.h linux/mmzone.h asm/mmzone.h asm/mmzone_64.h asm/smp.h asm/apic.h asm/hardirq.h linux/irq.h linux/irqdesc.h linux/kobject.h linux/sysfs.h linux/kernfs.h linux/idr.h linux/gfp.h and others. This causes compilation errors because of the header guards becoming effective in the second inclusion: symbols/macros that had been defined before wouldn't be available to intermediate headers in the #include chain anymore. A possible workaround would be to move the definition of irq_cpustat_t into its own header and include that from both, asm/hardirq.h and asm/apic.h. However, this wouldn't solve the real problem, namely asm/harirq.h unnecessarily pulling in all the linux/irq.h cruft: nothing in asm/hardirq.h itself requires it. Also, note that there are some other archs, like e.g. arm64, which don't have that #include in their asm/hardirq.h. Remove the linux/irq.h #include from x86' asm/hardirq.h. Fix resulting compilation errors by adding appropriate #includes to *.c files as needed. Note that some of these *.c files could be cleaned up a bit wrt. to their set of #includes, but that should better be done from separate patches, if at all. Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon VMENTRY. L1D flushes are costly and two modes of operations are provided to users: "always" and the more selective "conditional" mode. If operating in the latter, the cache would get flushed only if a host side code path considered unconfined had been traversed. "Unconfined" in this context means that it might have pulled in sensitive data like user data or kernel crypto keys. The need for L1D flushes is tracked by means of the per-vcpu flag l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a L1d flush based on its value and reset that flag again. Currently, interrupts delivered "normally" while in root operation between VMEXIT and VMENTER are not taken into account. Part of the reason is that these don't leave any traces and thus, the vmx code is unable to tell if any such has happened. As proposed by Paolo Bonzini, prepare for tracking all interrupts by introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in strong analogy to the per-vcpu ->l1tf_flush_l1d. A later patch will make interrupt handlers set it. For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86' per-cpu irq_cpustat_t as suggested by Peter Zijlstra. Provide the helpers kvm_set_cpu_l1tf_flush_l1d(), kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate. Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as l1tf_flush_l1d. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Nicolai Stange 提交于
An upcoming patch will extend KVM's L1TF mitigation in conditional mode to also cover interrupts after VMEXITs. For tracking those, stores to a new per-cpu flag from interrupt handlers will become necessary. In order to improve cache locality, this new flag will be added to x86's irq_cpustat_t. Make some space available there by shrinking the ->softirq_pending bitfield from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS, i.e. 10. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Nicolai Stange 提交于
Currently, vmx_vcpu_run() checks if l1tf_flush_l1d is set and invokes vmx_l1d_flush() if so. This test is unncessary for the "always flush L1D" mode. Move the check to vmx_l1d_flush()'s conditional mode code path. Notes: - vmx_l1d_flush() is likely to get inlined anyway and thus, there's no extra function call. - This inverts the (static) branch prediction, but there hadn't been any explicit likely()/unlikely() annotations before and so it stays as is. Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
The vmx_l1d_flush_always static key is only ever evaluated if vmx_l1d_should_flush is enabled. In that case however, there are only two L1d flushing modes possible: "always" and "conditional". The "conditional" mode's implementation tends to require more sophisticated logic than the "always" mode. Avoid inverted logic by replacing the 'vmx_l1d_flush_always' static key with a 'vmx_l1d_flush_cond' one. There is no change in functionality. Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Nicolai Stange 提交于
vmx_l1d_flush() gets invoked only if l1tf_flush_l1d is true. There's no point in setting l1tf_flush_l1d to true from there again. Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 27 7月, 2018 2 次提交
-
-
由 Shakeel Butt 提交于
The size of kvm's shadow page tables corresponds to the size of the guest virtual machines on the system. Large VMs can spend a significant amount of memory as shadow page tables which can not be left as system memory overhead. So, account shadow page tables to the kmemcg. [shakeelb@google.com: replace (GFP_KERNEL|__GFP_ACCOUNT) with GFP_KERNEL_ACCOUNT] Link: http://lkml.kernel.org/r/20180629140224.205849-1-shakeelb@google.com Link: http://lkml.kernel.org/r/20180627181349.149778-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Greg Thelen <gthelen@google.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Peter Feiner <pfeiner@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Make sure to initialize all VMAs properly, not only those which come from vm_area_cachep. Link: http://lkml.kernel.org/r/20180724121139.62570-3-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 7月, 2018 1 次提交
-
-
由 Nicolai Stange 提交于
The slow path in vmx_l1d_flush() reads from vmx_l1d_flush_pages in order to evict the L1d cache. However, these pages are never cleared and, in theory, their data could be leaked. More importantly, KSM could merge a nested hypervisor's vmx_l1d_flush_pages to fewer than 1 << L1D_CACHE_ORDER host physical pages and this would break the L1d flushing algorithm: L1D on x86_64 is tagged by physical addresses. Fix this by initializing the individual vmx_l1d_flush_pages with a different pattern each. Rename the "empty_zp" asm constraint identifier in vmx_l1d_flush() to "flush_pages" to reflect this change. Fixes: a47dd5f0 ("x86/KVM/VMX: Add L1D flush algorithm") Signed-off-by: NNicolai Stange <nstange@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
-
- 18 7月, 2018 2 次提交
-
-
由 Peng Hao 提交于
Inside a nested guest, access to hardware can be slow enough that tsc_read_refs always return ULLONG_MAX, causing tsc_refine_calibration_work to be called periodically and the nested guest to spend a lot of time reading the ACPI timer. However, if the TSC frequency is available from the pvclock page, we can just set X86_FEATURE_TSC_KNOWN_FREQ and avoid the recalibration. 'refine' operation. Suggested-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NPeng Hao <peng.hao2@zte.com.cn> [Commit message rewritten. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Liran Alon 提交于
When eVMCS is enabled, all VMCS allocated to be used by KVM are marked with revision_id of KVM_EVMCS_VERSION instead of revision_id reported by MSR_IA32_VMX_BASIC. However, even though not explictly documented by TLFS, VMXArea passed as VMXON argument should still be marked with revision_id reported by physical CPU. This issue was found by the following setup: * L0 = KVM which expose eVMCS to it's L1 guest. * L1 = KVM which consume eVMCS reported by L0. This setup caused the following to occur: 1) L1 execute hardware_enable(). 2) hardware_enable() calls kvm_cpu_vmxon() to execute VMXON. 3) L0 intercept L1 VMXON and execute handle_vmon() which notes vmxarea->revision_id != VMCS12_REVISION and therefore fails with nested_vmx_failInvalid() which sets RFLAGS.CF. 4) L1 kvm_cpu_vmxon() don't check RFLAGS.CF for failure and therefore hardware_enable() continues as usual. 5) L1 hardware_enable() then calls ept_sync_global() which executes INVEPT. 6) L0 intercept INVEPT and execute handle_invept() which notes !vmx->nested.vmxon and thus raise a #UD to L1. 7) Raised #UD caused L1 to panic. Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Cc: stable@vger.kernel.org Fixes: 773e8a04Signed-off-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 17 7月, 2018 1 次提交
-
-
由 Dewet Thibaut 提交于
commit b3b7c479 ("x86/MCE: Serialize sysfs changes") introduced a min interval limitation when setting the check interval for polled MCEs. However, the logic is that 0 disables polling for corrected MCEs, see Documentation/x86/x86_64/machinecheck. The limitation prevents disabling. Remove this limitation and allow the value 0 to disable polling again. Fixes: b3b7c479 ("x86/MCE: Serialize sysfs changes") Signed-off-by: NDewet Thibaut <thibaut.dewet@nokia.com> Signed-off-by: NAlexander Sverdlin <alexander.sverdlin@nokia.com> [ Massage commit message. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20180716084927.24869-1-alexander.sverdlin@nokia.com
-
- 16 7月, 2018 2 次提交
-
-
由 Ville Syrjälä 提交于
APM_DO_POP_SEGS does not restore fs/gs which were zeroed by APM_DO_ZERO_SEGS. Trying to access __preempt_count with zeroed fs doesn't really work. Move the ibrs call outside the APM_DO_SAVE_SEGS/APM_DO_RESTORE_SEGS invocations so that fs is actually restored before calling preempt_enable(). Fixes the following sort of oopses: [ 0.313581] general protection fault: 0000 [#1] PREEMPT SMP [ 0.313803] Modules linked in: [ 0.314040] CPU: 0 PID: 268 Comm: kapmd Not tainted 4.16.0-rc1-triton-bisect-00090-gdd84441a #19 [ 0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 [ 0.316161] EFLAGS: 00210016 CPU: 0 [ 0.316161] EAX: 00000102 EBX: 00000000 ECX: 00000102 EDX: 00000000 [ 0.316161] ESI: 0000530e EDI: dea95f64 EBP: dea95f18 ESP: dea95ef0 [ 0.316161] DS: 007b ES: 007b FS: 0000 GS: 0000 SS: 0068 [ 0.316161] CR0: 80050033 CR2: 00000000 CR3: 015d3000 CR4: 000006d0 [ 0.316161] Call Trace: [ 0.316161] ? cpumask_weight.constprop.15+0x20/0x20 [ 0.316161] on_cpu0+0x44/0x70 [ 0.316161] apm+0x54e/0x720 [ 0.316161] ? __switch_to_asm+0x26/0x40 [ 0.316161] ? __schedule+0x17d/0x590 [ 0.316161] kthread+0xc0/0xf0 [ 0.316161] ? proc_apm_show+0x150/0x150 [ 0.316161] ? kthread_create_worker_on_cpu+0x20/0x20 [ 0.316161] ret_from_fork+0x2e/0x38 [ 0.316161] Code: da 8e c2 8e e2 8e ea 57 55 2e ff 1d e0 bb 5d b1 0f 92 c3 5d 5f 07 1f 89 47 0c 90 8d b4 26 00 00 00 00 90 8d b4 26 00 00 00 00 90 <64> ff 0d 84 16 5c b1 74 7f 8b 45 dc 8e e0 8b 45 d8 8e e8 8b 45 [ 0.316161] EIP: __apm_bios_call_simple+0xc8/0x170 SS:ESP: 0068:dea95ef0 [ 0.316161] ---[ end trace 656253db2deaa12c ]--- Fixes: dd84441a ("x86/speculation: Use IBRS if available before calling into firmware") Signed-off-by: NVille Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: x86@kernel.org Cc: David Woodhouse <dwmw@amazon.co.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Link: https://lkml.kernel.org/r/20180709133534.5963-1-ville.syrjala@linux.intel.com
-
由 Dan Williams 提交于
All copy_to_user() implementations need to be prepared to handle faults accessing userspace. The __memcpy_mcsafe() implementation handles both mmu-faults on the user destination and machine-check-exceptions on the source buffer. However, the memcpy_mcsafe() wrapper may silently fallback to memcpy() depending on build options and cpu-capabilities. Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when available, and otherwise disable all of the copy_to_user_mcsafe() infrastructure when __memcpy_mcsafe() is not available, i.e. CONFIG_X86_MCE=n. This fixes crashes of the form: run fstests generic/323 at 2018-07-02 12:46:23 BUG: unable to handle kernel paging request at 00007f0d50001000 RIP: 0010:__memcpy+0x12/0x20 [..] Call Trace: copyout_mcsafe+0x3a/0x50 _copy_to_iter_mcsafe+0xa1/0x4a0 ? dax_alive+0x30/0x50 dax_iomap_actor+0x1f9/0x280 ? dax_iomap_rw+0x100/0x100 iomap_apply+0xba/0x130 ? dax_iomap_rw+0x100/0x100 dax_iomap_rw+0x95/0x100 ? dax_iomap_rw+0x100/0x100 xfs_file_dax_read+0x7b/0x1d0 [xfs] xfs_file_read_iter+0xa7/0xc0 [xfs] aio_read+0x11c/0x1a0 Reported-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Tested-by: NRoss Zwisler <ross.zwisler@linux.intel.com> Signed-off-by: NDan Williams <dan.j.williams@intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Fixes: 8780356e ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()") Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 15 7月, 2018 7 次提交
-
-
由 Radim Krčmář 提交于
pvti_cpu0_va is the address of shared kvmclock data structure. pvti_cpu0_va is currently kept unset (1) on 32 bit systems, (2) when kvmclock vsyscall is disabled, and (3) if kvmclock is not stable. This poses a problem, because kvm_ptp needs pvti_cpu0_va, but (1) can work on 32 bit, (2) has little relation to the vsyscall, and (3) does not need stable kvmclock (although kvmclock won't be used for system clock if it's not stable, so kvm_ptp is pointless in that case). Expose pvti_cpu0_va whenever kvmclock is enabled to allow all users to work with it. This fixes a regression found on Gentoo: https://bugs.gentoo.org/658544. Fixes: 9f08890a ("x86/pvclock: add setter for pvclock_pvti_cpu0_va") Cc: stable@vger.kernel.org Reported-by: NAndreas Steinmetz <ast@domdv.de> Signed-off-by: NRadim Krčmář <rkrcmar@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Janakarajan Natarajan 提交于
Prevent a config where KVM_AMD=y and CRYPTO_DEV_CCP_DD=m thereby ensuring that AMD Secure Processor device driver will be built-in when KVM_AMD is also built-in. v1->v2: * Removed usage of 'imply' Kconfig option. * Change patch commit message. Fixes: 505c9e94 ("KVM: x86: prefer "depends on" to "select" for SEV") Cc: <stable@vger.kernel.org> # 4.16.x Signed-off-by: NJanakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: NBrijesh Singh <brijesh.singh@amd.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Jim Mattson 提交于
This exit qualification was inadvertently dropped when the two VM-entry failure blocks were coalesced. Fixes: e79f245d ("X86/KVM: Properly update 'tsc_offset' to represent the running guest") Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NKrish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Vitaly Kuznetsov 提交于
When we switched from doing rdmsr() to reading FS/GS base values from current->thread we completely forgot about legacy 32-bit userspaces which we still support in KVM (why?). task->thread.{fsbase,gsbase} are only synced for 64-bit processes, calling save_fsgs_for_kvm() and using its result from current is illegal for legacy processes. There's no ARCH_SET_FS/GS prctls for legacy applications. Base MSRs are, however, not always equal to zero. Intel's manual says (3.4.4 Segment Loading Instructions in IA-32e Mode): "In order to set up compatibility mode for an application, segment-load instructions (MOV to Sreg, POP Sreg) work normally in 64-bit mode. An entry is read from the system descriptor table (GDT or LDT) and is loaded in the hidden portion of the segment register. ... The hidden descriptor register fields for FS.base and GS.base are physically mapped to MSRs in order to load all address bits supported by a 64-bit implementation. " The issue was found by strace test suite where 32-bit ioctl_kvm_run test started segfaulting. Reported-by: NDmitry V. Levin <ldv@altlinux.org> Bisected-by: NMasatake YAMATO <yamato@redhat.com> Fixes: 42b933b5 ("x86/kvm/vmx: read MSR_{FS,KERNEL_GS}_BASE from current->thread") Cc: stable@vger.kernel.org Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Paolo Bonzini 提交于
This lets userspace read the MSR_IA32_ARCH_CAPABILITIES and check that all requested features are available on the host. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Hugh Dickins 提交于
Markus reported that BTS is sporadically missing the tail of the trace in the perf_event data buffer: [decode error (1): instruction overflow] shown in GDB; and bisected it to the conversion of debug_store to PTI. A little "optimization" crept into alloc_bts_buffer(), which mistakenly placed bts_interrupt_threshold away from the 24-byte record boundary. Intel SDM Vol 3B 17.4.9 says "This address must point to an offset from the BTS buffer base that is a multiple of the BTS record size." Revert "max" from a byte count to a record count, to calculate the bts_interrupt_threshold correctly: which turns out to fix problem seen. Fixes: c1961a46 ("x86/events/intel/ds: Map debug buffers in cpu_entry_area") Reported-and-tested-by: NMarkus T Metzger <markus.t.metzger@intel.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@intel.com> Cc: Andi Kleen <andi.kleen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Stephane Eranian <eranian@google.com> Cc: stable@vger.kernel.org # v4.14+ Link: https://lkml.kernel.org/r/alpine.LSU.2.11.1807141248290.1614@eggly.anvils
-
由 Philipp Rudo 提交于
- Build the kernel without the fix - Add some flag to the purgatories KBUILD_CFLAGS,I used -fno-asynchronous-unwind-tables - Re-build the kernel When you look at makes output you see that sha256.o is not re-build in the last step. Also readelf -S still shows the .eh_frame section for sha256.o. With the fix sha256.o is rebuilt in the last step. Without FORCE make does not detect changes only made to the command line options. So object files might not be re-built even when they should be. Fix this by adding FORCE where it is missing. Link: http://lkml.kernel.org/r/20180704110044.29279-2-prudo@linux.ibm.com Fixes: df6f2801 ("kernel/kexec_file.c: move purgatories sha256 to common code") Signed-off-by: NPhilipp Rudo <prudo@linux.ibm.com> Acked-by: NDave Young <dyoung@redhat.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> [4.17+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 7月, 2018 10 次提交
-
-
由 Jiri Kosina 提交于
Introduce the 'l1tf=' kernel command line option to allow for boot-time switching of mitigation that is used on processors affected by L1TF. The possible values are: full Provides all available mitigations for the L1TF vulnerability. Disables SMT and enables all mitigations in the hypervisors. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. full,force Same as 'full', but disables SMT control. Implies the 'nosmt=force' command line option. sysfs control of SMT and the hypervisor flush control is disabled. flush Leaves SMT enabled and enables the conditional hypervisor mitigation. Hypervisors will issue a warning when the first VM is started in a potentially insecure configuration, i.e. SMT enabled or L1D flush disabled. flush,nosmt Disables SMT and enables the conditional hypervisor mitigation. SMT control via /sys/devices/system/cpu/smt/control is still possible after boot. If SMT is reenabled or flushing disabled at runtime hypervisors will issue a warning. flush,nowarn Same as 'flush', but hypervisors will not warn when a VM is started in a potentially insecure configuration. off Disables hypervisor mitigations and doesn't emit any warnings. Default is 'flush'. Let KVM adhere to these semantics, which means: - 'lt1f=full,force' : Performe L1D flushes. No runtime control possible. - 'l1tf=full' - 'l1tf-flush' - 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if SMT has been runtime enabled or L1D flushing has been run-time enabled - 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted. - 'l1tf=off' : L1D flushes are not performed and no warnings are emitted. KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush' module parameter except when lt1f=full,force is set. This makes KVM's private 'nosmt' option redundant, and as it is a bit non-systematic anyway (this is something to control globally, not on hypervisor level), remove that option. Add the missing Documentation entry for the l1tf vulnerability sysfs file while at it. Signed-off-by: NJiri Kosina <jkosina@suse.cz> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
-
由 Thomas Gleixner 提交于
The CPU_SMT_NOT_SUPPORTED state is set (if the processor does not support SMT) when the sysfs SMT control file is initialized. That was fine so far as this was only required to make the output of the control file correct and to prevent writes in that case. With the upcoming l1tf command line parameter, this needs to be set up before the L1TF mitigation selection and command line parsing happens. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142323.121795971@linutronix.de
-
由 Thomas Gleixner 提交于
All mitigation modes can be switched at run time with a static key now: - Use sysfs_streq() instead of strcmp() to handle the trailing new line from sysfs writes correctly. - Make the static key management handle multiple invocations properly. - Set the module parameter file to RW Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.954525119@linutronix.de
-
由 Thomas Gleixner 提交于
Writes to the parameter files are not serialized at the sysfs core level, so local serialization is required. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.873642605@linutronix.de
-
由 Thomas Gleixner 提交于
Avoid the conditional in the L1D flush control path. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.790914912@linutronix.de
-
由 Thomas Gleixner 提交于
In preparation of allowing run time control for L1D flushing, move the setup code to the module parameter handler. In case of pre module init parsing, just store the value and let vmx_init() do the actual setup after running kvm_init() so that enable_ept is having the correct state. During run-time invoke it directly from the parameter setter to prepare for run-time control. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.694063239@linutronix.de
-
由 Thomas Gleixner 提交于
If Extended Page Tables (EPT) are disabled or not supported, no L1D flushing is required. The setup function can just avoid setting up the L1D flush for the EPT=n case. Invoke it after the hardware setup has be done and enable_ept has the correct state and expose the EPT disabled state in the mitigation status as well. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de
-
由 Thomas Gleixner 提交于
The VMX module parameter to control the L1D flush should become writeable. The MSR list is set up at VM init per guest VCPU, but the run time switching is based on a static key which is global. Toggling the MSR list at run time might be feasible, but for now drop this optimization and use the regular MSR write to make run-time switching possible. The default mitigation is the conditional flush anyway, so for extra paranoid setups this will add some small overhead, but the extra code executed is in the noise compared to the flush itself. Aside of that the EPT disabled case is not handled correctly at the moment and the MSR list magic is in the way for fixing that as well. If it's really providing a significant advantage, then this needs to be revisited after the code is correct and the control is writable. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.516940445@linutronix.de
-
由 Thomas Gleixner 提交于
Store the effective mitigation of VMX in a status variable and use it to report the VMX state in the l1tf sysfs file. Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Tested-by: NJiri Kosina <jkosina@suse.cz> Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com> Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de
-
由 Juergen Gross 提交于
Setting pv_irq_ops for Xen PV domains should be done as early as possible in order to support e.g. very early printk() usage. The same applies to xen_vcpu_info_reset(0), as it is needed for the pv irq ops. Move the call of xen_setup_machphys_mapping() after initializing the pv functions as it contains a WARN_ON(), too. Remove the no longer necessary conditional in xen_init_irq_ops() from PVH V1 times to make clear this is a PV only function. Cc: <stable@vger.kernel.org> # 4.14 Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NJuergen Gross <jgross@suse.com>
-
- 12 7月, 2018 1 次提交
-
-
由 Juergen Gross 提交于
When removing the global bit from __supported_pte_mask do the same for __default_kernel_pte_mask in order to avoid the WARN_ONCE() in check_pgprot() when setting a kernel pte before having called init_mem_mapping(). Cc: <stable@vger.kernel.org> # 4.17 Reported-by: NMichael Young <m.a.young@durham.ac.uk> Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NJuergen Gross <jgross@suse.com>
-