- 11 December 2019, 1 commit
-
-
Authored by Krzysztof Kozlowski
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ /\t/' -i */Kconfig Signed-off-by: NKrzysztof Kozlowski <krzk@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juergen Gross <jgross@suse.com> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Cc: xen-devel@lists.xenproject.org Link: https://lkml.kernel.org/r/1574306470-10305-1-git-send-email-krzk@kernel.org
-
- 10 December 2019, 2 commits
-
-
Authored by Ingo Molnar
Update various comments, fix outright mistakes and meaningless descriptions. Also harmonize the style across the file, both in form and in language. Cc: linux-kernel@vger.kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Ingo Molnar
In 20 years we accumulated 89 #include lines in setup.c, but we only need 30 of them (!) ... Get rid of the excessive ones, and while at it, sort the remaining ones alphabetically. Also get rid of the incomplete changelogs at the header of the file, and explain better what this file does. Cc: linux-kernel@vger.kernel.org Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 05 December 2019, 2 commits
-
-
Authored by Masahiro Yamada
Userspace cannot compile <asm/sembuf.h> due to some missing type definitions. For example, building it for x86 fails as follows: CC usr/include/asm/sembuf.h.s In file included from <command-line>:32:0: usr/include/asm/sembuf.h:17:20: error: field `sem_perm' has incomplete type struct ipc64_perm sem_perm; /* permissions .. see ipc.h */ ^~~~~~~~ usr/include/asm/sembuf.h:24:2: error: unknown type name `__kernel_time_t' __kernel_time_t sem_otime; /* last semop time */ ^~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:25:2: error: unknown type name `__kernel_ulong_t' __kernel_ulong_t __unused1; ^~~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:26:2: error: unknown type name `__kernel_time_t' __kernel_time_t sem_ctime; /* last change time */ ^~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:27:2: error: unknown type name `__kernel_ulong_t' __kernel_ulong_t __unused2; ^~~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:29:2: error: unknown type name `__kernel_ulong_t' __kernel_ulong_t sem_nsems; /* no. of semaphores in array */ ^~~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:30:2: error: unknown type name `__kernel_ulong_t' __kernel_ulong_t __unused3; ^~~~~~~~~~~~~~~~ usr/include/asm/sembuf.h:31:2: error: unknown type name `__kernel_ulong_t' __kernel_ulong_t __unused4; ^~~~~~~~~~~~~~~~ It is just a matter of missing include directive. Include <asm/ipcbuf.h> to make it self-contained, and add it to the compile-test coverage. Link: http://lkml.kernel.org/r/20191030063855.9989-3-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
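A minimal stand-alone compile check in the spirit of the usr/include test might look like the sketch below (hypothetical file name; it assumes the sanitized headers were installed with "make headers_install"):

    /* sembuf_check.c - with the fix applied, <asm/sembuf.h> pulls in
     * <asm/ipcbuf.h> itself, so nothing else needs to be included first. */
    #include <asm/sembuf.h>

    int main(void)
    {
            struct semid64_ds ds = { 0 };   /* forces all member types to be complete */

            return (int)sizeof(ds) == 0;
    }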
-
Authored by Masahiro Yamada
Userspace cannot compile <asm/msgbuf.h> due to some missing type definitions. For example, building it for x86 fails as follows: CC usr/include/asm/msgbuf.h.s In file included from usr/include/asm/msgbuf.h:6:0, from <command-line>:32: usr/include/asm-generic/msgbuf.h:25:20: error: field `msg_perm' has incomplete type struct ipc64_perm msg_perm; ^~~~~~~~ usr/include/asm-generic/msgbuf.h:27:2: error: unknown type name `__kernel_time_t' __kernel_time_t msg_stime; /* last msgsnd time */ ^~~~~~~~~~~~~~~ usr/include/asm-generic/msgbuf.h:28:2: error: unknown type name `__kernel_time_t' __kernel_time_t msg_rtime; /* last msgrcv time */ ^~~~~~~~~~~~~~~ usr/include/asm-generic/msgbuf.h:29:2: error: unknown type name `__kernel_time_t' __kernel_time_t msg_ctime; /* last change time */ ^~~~~~~~~~~~~~~ usr/include/asm-generic/msgbuf.h:41:2: error: unknown type name `__kernel_pid_t' __kernel_pid_t msg_lspid; /* pid of last msgsnd */ ^~~~~~~~~~~~~~ usr/include/asm-generic/msgbuf.h:42:2: error: unknown type name `__kernel_pid_t' __kernel_pid_t msg_lrpid; /* last receive pid */ ^~~~~~~~~~~~~~ It is just a matter of missing include directive. Include <asm/ipcbuf.h> to make it self-contained, and add it to the compile-test coverage. Link: http://lkml.kernel.org/r/20191030063855.9989-2-yamada.masahiro@socionext.comSigned-off-by: NMasahiro Yamada <yamada.masahiro@socionext.com> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 04 December 2019, 2 commits
-
-
Authored by Jim Mattson
We will never need more guest_msrs than there are indices in vmx_msr_index. Thus, at present, the guest_msrs array will not exceed 168 bytes. Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
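The idea, dimensioning the value array by the index table that drives it so the two cannot drift apart, can be sketched in plain C (the names below are illustrative, not the actual KVM structures):

    #include <stdint.h>

    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    /* Illustrative stand-in for vmx_msr_index: the MSRs that are tracked. */
    static const uint32_t tracked_msr_index[] = { 0xc0000080, 0xc0000081, 0xc0000082 };

    /* Per-guest storage is sized from the index table, so it can never need
     * more slots than there are indices. */
    struct tracked_msrs {
            uint64_t value[ARRAY_SIZE(tracked_msr_index)];
    };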
-
Authored by Paolo Bonzini
The bounds check was present in KVM_GET_SUPPORTED_CPUID but not KVM_GET_EMULATED_CPUID. Reported-by: syzbot+e3f4897236c4eeb8af4f@syzkaller.appspotmail.com Fixes: 84cffe49 ("kvm: Emulate MOVBE", 2013-10-29) Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
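The bug class is a missing sanity check on a user-supplied entry count before it is used as an allocation size and loop bound; a generic hedged sketch of the check both ioctl paths need (the limit name is illustrative, analogous to KVM_MAX_CPUID_ENTRIES):

    #include <errno.h>

    #define MAX_CPUID_ENTRIES 256           /* illustrative limit */

    /* Reject absurd counts from userspace before allocating or iterating. */
    static int check_user_nent(unsigned int nent)
    {
            if (nent < 1 || nent > MAX_CPUID_ENTRIES)
                    return -E2BIG;
            return 0;
    }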
-
- 02 December 2019, 2 commits
-
-
Authored by Daniel Axtens
In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Link: http://lkml.kernel.org/r/20191031093909.9228-5-dja@axtens.netSigned-off-by: NDaniel Axtens <dja@axtens.net> Acked-by: NDmitry Vyukov <dvyukov@google.com> Reviewed-by: NAndrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
Authored by Ingo Molnar
There's a bug in the new PAT code, the conversion of memtype_check_conflict() is buggy: 8d04a5f9: ("x86/mm/pat: Convert the PAT tree to a generic interval tree") dprintk("Overlap at 0x%Lx-0x%Lx\n", match->start, match->end); found_type = match->type; - node = rb_next(&match->rb); - while (node) { - match = rb_entry(node, struct memtype, rb); - - if (match->start >= end) /* Checked all possible matches */ - goto success; - - if (is_node_overlap(match, start, end) && - match->type != found_type) { + match = memtype_interval_iter_next(match, start, end); + while (match) { + if (match->type != found_type) goto failure; - } - node = rb_next(&match->rb); + match = memtype_interval_iter_next(match, start, end); } Note how the '>= end' condition to end the interval check, got converted into: + match = memtype_interval_iter_next(match, start, end); This is subtly off by one, because the interval trees interfaces require closed interval parameters: include/linux/interval_tree_generic.h /* \ * Iterate over intervals intersecting [start;last] \ * \ * Note that a node's interval intersects [start;last] iff: \ * Cond1: ITSTART(node) <= last \ * and \ * Cond2: start <= ITLAST(node) \ */ \ ... if (ITSTART(node) <= last) { /* Cond1 */ \ if (start <= ITLAST(node)) /* Cond2 */ \ return node; /* node is leftmost match */ \ [start;last] is a closed interval (note that '<= last' check) - while the PAT 'end' parameter is 1 byte beyond the end of the range, because ioremap() and the other mapping APIs usually use the [start,end) half-open interval, derived from 'size'. This is what ioremap() does for example: /* * Mappings have to be page-aligned */ offset = phys_addr & ~PAGE_MASK; phys_addr &= PHYSICAL_PAGE_MASK; size = PAGE_ALIGN(last_addr+1) - phys_addr; retval = reserve_memtype(phys_addr, (u64)phys_addr + size, pcm, &new_pcm); phys_addr+size will be on a page boundary, after the last byte of the mapped interval. So the correct parameter to use in the interval tree searches is not 'end' but 'end-1'. This could have relevance if conflicting PAT ranges are exactly adjacent, for example a future WC region is followed immediately by an already mapped UC- region - in this case memtype_check_conflict() would incorrectly deny the WC memtype region and downgrade the memtype to UC-. BTW., rather annoyingly this downgrading is done silently in memtype_check_insert(): int memtype_check_insert(struct memtype *new, enum page_cache_mode *ret_type) { int err = 0; err = memtype_check_conflict(new->start, new->end, new->type, ret_type); if (err) return err; if (ret_type) new->type = *ret_type; memtype_interval_insert(new, &memtype_rbroot); return 0; } So on such a conflict we'd just silently get UC- in *ret_type, and write it into the new region, never the wiser ... So assuming that the patch below fixes the primary bug the diagnostics side of ioremap() cache attribute downgrades would be another thing to fix. Anyway, I checked all the interval-tree iterations, and most of them are off by one - but I think the one related to memtype_check_conflict() is the one causing this particular performance regression. The only correct interval-tree searches were these two: arch/x86/mm/pat_interval.c: match = memtype_interval_iter_first(&memtype_rbroot, 0, ULONG_MAX); arch/x86/mm/pat_interval.c: match = memtype_interval_iter_next(match, 0, ULONG_MAX); The ULONG_MAX was hiding the off-by-one in plain sight. 
:-) Note that the bug was probably benign in the sense of implementing a too strict cache attribute conflict policy and downgrading cache attributes, so AFAICS the worst outcome of this bug would be a performance regression, not any instabilities. Reported-by: Nkernel test robot <rong.a.chen@intel.com> Reported-by: NKenneth R. Crudup <kenny@panix.com> Reported-by: NMariusz Ceier <mceier+kernel@gmail.com> Tested-by: NMariusz Ceier <mceier@gmail.com> Tested-by: NKenneth R. Crudup <kenny@panix.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@surriel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191201144947.GA4167@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
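The off-by-one is easy to reproduce outside the kernel: two exactly adjacent half-open ranges only stay disjoint under a closed-interval test if 'end' is converted to 'end - 1'. A small stand-alone illustration (not the kernel code itself):

    #include <stdbool.h>
    #include <stdio.h>

    /* Closed-interval intersection, the convention of the generic interval tree:
     * a node [n_start, n_last] intersects a query [q_start, q_last] iff
     * n_start <= q_last && q_start <= n_last. */
    static bool intersects_closed(unsigned long n_start, unsigned long n_last,
                                  unsigned long q_start, unsigned long q_last)
    {
            return n_start <= q_last && q_start <= n_last;
    }

    int main(void)
    {
            /* An existing mapping [0x1000, 0x2000) and an adjacent request
             * [0x2000, 0x3000), both in the usual half-open form. */
            unsigned long a_start = 0x1000, a_end = 0x2000;
            unsigned long b_start = 0x2000, b_end = 0x3000;

            /* Passing the half-open 'end' straight through: false overlap. */
            printf("naive:     %d\n", intersects_closed(a_start, a_end, b_start, b_end));

            /* Converting end -> end - 1 (closed 'last'): correctly disjoint. */
            printf("corrected: %d\n",
                   intersects_closed(a_start, a_end - 1, b_start, b_end - 1));
            return 0;
    }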
-
- 01 December 2019, 1 commit
-
-
Authored by Borislav Petkov
... for better readability. No functional changes. [ Minor edit. ] Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 29 November 2019, 1 commit
-
-
Authored by Srinivas Pandruvada
While writing to MSR IA32_THERM_STATUS/IA32_PKG_THERM_STATUS, avoid writing 1 to read only and reserved fields because updating some fields generates exception. [ bp: Vertically align for better readability. ] Fixes: f6656208 ("x86/mce/therm_throt: Optimize notifications of thermal throttle") Reported-by: NDominik Brodowski <linux@dominikbrodowski.net> Tested-by: NDominik Brodowski <linux@dominikbrodowski.net> Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191128150824.22413-1-srinivas.pandruvada@linux.intel.com
-
- 28 November 2019, 1 commit
-
-
The state/owner of the FPU is saved to fpu_fpregs_owner_ctx by pointing to the context that is currently loaded. It never changed during the lifetime of a task - it remained stable/constant. After deferred FPU registers loading until return to userland was implemented, the content of fpu_fpregs_owner_ctx may change during preemption and must not be cached. This went unnoticed for some time and was now noticed, in particular since gcc 9 is caching that load in copy_fpstate_to_sigframe() and reusing it in the retry loop: copy_fpstate_to_sigframe() load fpu_fpregs_owner_ctx and save on stack fpregs_lock() copy_fpregs_to_sigframe() /* failed */ fpregs_unlock() *** PREEMPTION, another uses FPU, changes fpu_fpregs_owner_ctx *** fault_in_pages_writeable() /* succeed, retry */ fpregs_lock() __fpregs_load_activate() fpregs_state_valid() /* uses fpu_fpregs_owner_ctx from stack */ copy_fpregs_to_sigframe() /* succeeds, random FPU content */ This is a comparison of the assembly produced by gcc 9, without vs with this patch: | # arch/x86/kernel/fpu/signal.c:173: if (!access_ok(buf, size)) | cmpq %rdx, %rax # tmp183, _4 | jb .L190 #, |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |-#APP |-# 512 "arch/x86/include/asm/fpu/internal.h" 1 |- movq %gs:fpu_fpregs_owner_ctx,%rax #, pfo_ret__ |-# 0 "" 2 |-#NO_APP |- movq %rax, -88(%rbp) # pfo_ret__, %sfp … |-# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read_stable(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |- movq -88(%rbp), %rcx # %sfp, pfo_ret__ |- cmpq %rcx, -64(%rbp) # pfo_ret__, %sfp |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#APP |+# 512 "arch/x86/include/asm/fpu/internal.h" 1 |+ movq %gs:fpu_fpregs_owner_ctx(%rip),%rax # fpu_fpregs_owner_ctx, pfo_ret__ |+# 0 "" 2 |+# arch/x86/include/asm/fpu/internal.h:512: return fpu == this_cpu_read(fpu_fpregs_owner_ctx) && cpu == fpu->last_cpu; |+#NO_APP |+ cmpq %rax, -64(%rbp) # pfo_ret__, %sfp Use this_cpu_read() instead this_cpu_read_stable() to avoid caching of fpu_fpregs_owner_ctx during preemption points. The Fixes: tag points to the commit where deferred FPU loading was added. Since this commit, the compiler is no longer allowed to move the load of fpu_fpregs_owner_ctx somewhere else / outside of the locked section. A task preemption will change its value and stale content will be observed. [ bp: Massage. ] Debugged-by: NAustin Clements <austin@google.com> Debugged-by: NDavid Chase <drchase@golang.org> Debugged-by: NIan Lance Taylor <ian@airs.com> Fixes: 5f409e20 ("x86/fpu: Defer FPU state load until return to userspace") Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NRik van Riel <riel@surriel.com> Tested-by: NBorislav Petkov <bp@suse.de> Cc: Aubrey Li <aubrey.li@intel.com> Cc: Austin Clements <austin@google.com> Cc: Barret Rhoden <brho@google.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: David Chase <drchase@golang.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: ian@airs.com Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Bleecher Snyder <josharian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: x86-ml <x86@kernel.org> Link: https://lkml.kernel.org/r/20191128085306.hxfa2o3knqtu4wfn@linutronix.de Link: https://bugzilla.kernel.org/show_bug.cgi?id=205663
-
- 27 November 2019, 13 commits
-
-
Authored by Peter Gonda
Memory encryption support does not have module parameter dependencies and can be moved into the general x86 cpuid __do_cpuid_ent function. This changes maintains current behavior of passing through all of CPUID.8000001F. Suggested-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPeter Gonda <pgonda@google.com> Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
Authored by Borislav Petkov
Signed-off-by: NBorislav Petkov <bp@alien8.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Anthony Steinhauser
When you successfully write 0 to /sys/devices/cpu/rdpmc, the RDPMC instruction should be disabled unconditionally and immediately (after you close the SYSFS file) by the documentation. Instead, in the current implementation the PMU must be reloaded which happens only eventually some time in the future. Only after that the RDPMC instruction becomes disabled (on ring 3) on the respective core. This change makes the treatment of the 0 value as blocking and as unconditional as the current treatment of the 2 value, only the CR4.PCE bit is naturally set to false instead of true. Signed-off-by: NAnthony Steinhauser <asteinhauser@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: acme@kernel.org Link: https://lkml.kernel.org/r/20191125054838.137615-1-asteinhauser@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Jason A. Donenfeld
For glue code that's used by Zinc, the actual Crypto API functions might not necessarily exist, and don't need to exist either. Before this patch, there are valid build configurations that lead to a unbuildable kernel. This fixes it to conditionalize those symbols on the existence of the proper config entry. Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com> Acked-by: NArd Biesheuvel <ardb@kernel.org> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
-
Authored by Andy Lutomirski
Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Andy Lutomirski
seg_segment_reg() should be unreachable with task == current. Rather than confusingly trying to make it work, just explicitly disable this case. (regset->get is used for current in the coredump code, but the ->set interface is only used for ptrace, and you can't ptrace yourself.) Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Andy Lutomirski
A double fault has a decent chance of being recoverable by killing the offending thread. Use die() so that we at least try to recover. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Andy Lutomirski
The old x86_32 doublefault_fn() was old and crufty, and it did not even try to recover. do_double_fault() is much nicer. Rewrite the 32-bit double fault code to sanitize CPU state and call do_double_fault(). This is mostly an exercise i386 archaeology. With this patch applied, 32-bit double faults get a real stack trace, just like 64-bit double faults. [ mingo: merged the patch to a later kernel base. ] Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Andy Lutomirski
There are three problems with the current layout of the doublefault stack and TSS. First, the TSS is only cacheline-aligned, which is not enough -- if the hardware portion of the TSS (struct x86_hw_tss) crosses a page boundary, horrible things happen [0]. Second, the stack and TSS are global, so simultaneous double faults on different CPUs will cause massive corruption. Third, the whole mechanism won't work if user CR3 is loaded, resulting in a triple fault [1]. Let the doublefault stack and TSS share a page (which prevents the TSS from spanning a page boundary), make it percpu, and move it into cpu_entry_area. Teach the stack dump code about the doublefault stack. [0] Real hardware will read past the end of the page onto the next *physical* page if a task switch happens. Virtual machines may have any number of bugs, and I would consider it reasonable for a VM to summarily kill the guest if it tries to task-switch to a page-spanning TSS. [1] Real hardware triple faults. At least some VMs seem to hang. I'm not sure what's going on. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
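The layout constraint described above (the hardware TSS must never span a page boundary, so let it share a single page-aligned page with its stack) can be sketched as a stand-alone struct; the sizes here are only illustrative, the real x86_hw_tss layout differs:

    #include <stdint.h>

    #define PAGE_SIZE 4096

    /* Stand-in for the hardware TSS; only its size matters for the layout. */
    struct hw_tss_like {
            uint32_t fields[26];            /* roughly the 32-bit hardware TSS size */
    };

    /* The stack fills the space below the TSS; the whole object is exactly one
     * page and page-aligned, so the TSS can never cross a page boundary. */
    struct doublefault_area {
            char stack[PAGE_SIZE - sizeof(struct hw_tss_like)];
            struct hw_tss_like tss;
    } __attribute__((aligned(PAGE_SIZE)));

    _Static_assert(sizeof(struct doublefault_area) == PAGE_SIZE,
                   "stack plus TSS must fit exactly one page");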
-
Authored by Andy Lutomirski
doublefault.c now only contains 32-bit code. Rename it to doublefault_32.c. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Andy Lutomirski
The 64-bit doublefault handler is much nicer than the 32-bit one. As a first step toward unifying them, make the 64-bit handler self-contained. This should have no effect no functional effect except in the odd case of x86_64 with CONFIG_DOUBLEFAULT=n in which case it will change the logging a bit. This also gets rid of CONFIG_DOUBLEFAULT configurability on 64-bit kernels. It didn't do anything useful -- CONFIG_DOUBLEFAULT=n didn't actually disable doublefault handling on x86_64. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Joerg Roedel
The job of vmalloc_sync_all() is to help the lazy freeing of vmalloc() ranges: before such vmap ranges are reused we make sure that they are unmapped from every task's page tables. This is really easy on pagetable setups where the kernel page tables are shared between all tasks - this is the case on 32-bit kernels with SHARED_KERNEL_PMD = 1. But on !SHARED_KERNEL_PMD 32-bit kernels this involves iterating over the pgd_list and clearing all pmd entries in the pgds that are cleared in the init_mm.pgd, which is the reference pagetable that the vmalloc() code uses. In that context the current practice of vmalloc_sync_all() iterating until FIX_ADDR_TOP is buggy: for (address = VMALLOC_START & PMD_MASK; address >= TASK_SIZE_MAX && address < FIXADDR_TOP; address += PMD_SIZE) { struct page *page; Because iterating up to FIXADDR_TOP will involve a lot of non-vmalloc address ranges: VMALLOC -> PKMAP -> LDT -> CPU_ENTRY_AREA -> FIX_ADDR This is mostly harmless for the FIX_ADDR and CPU_ENTRY_AREA ranges that don't clear their pmds, but it's lethal for the LDT range, which relies on having different mappings in different processes, and 'synchronizing' them in the vmalloc sense corrupts those pagetable entries (clearing them). This got particularly prominent with PTI, which turns SHARED_KERNEL_PMD off and makes this the dominant mapping mode on 32-bit. To make LDT working again vmalloc_sync_all() must only iterate over the volatile parts of the kernel address range that are identical between all processes. So the correct check in vmalloc_sync_all() is "address < VMALLOC_END" to make sure the VMALLOC areas are synchronized and the LDT mapping is not falsely overwritten. The CPU_ENTRY_AREA and the FIXMAP area are no longer synced either, but this is not really a proplem since their PMDs get established during bootup and never change. This change fixes the ldt_gdt selftest in my setup. [ mingo: Fixed up the changelog to explain the logic and modified the copying to only happen up until VMALLOC_END. ] Reported-by: NBorislav Petkov <bp@suse.de> Tested-by: NBorislav Petkov <bp@suse.de> Signed-off-by: NJoerg Roedel <jroedel@suse.de> Cc: <stable@vger.kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: hpa@zytor.com Fixes: 7757d607: ("x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32") Link: https://lkml.kernel.org/r/20191126111119.GA110513@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
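Based on the description above, the corrected iteration stops at VMALLOC_END instead of walking on through PKMAP, LDT, cpu_entry_area and fixmap; a kernel-style sketch of the loop with the fixed bound (an illustration of the idea, not the literal patch):

    /* Only the vmalloc area is identical in every page table and may be
     * synchronized; everything above VMALLOC_END (PKMAP, LDT, cpu_entry_area,
     * fixmap) must be left untouched. */
    for (address = VMALLOC_START & PMD_MASK;
         address < VMALLOC_END;
         address += PMD_SIZE) {
            /* ... propagate init_mm's PMD entry for 'address' into each pgd ... */
    }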
-
Authored by Ingo Molnar
After the following commit: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise") 'struct cpu_entry_area' has to be Kconfig invariant, so that we always have a matching CPU_ENTRY_AREA_PAGES size. This commit added a CONFIG_X86_IOPL_IOPERM dependency to tss_struct: 111e7b15: ("x86/ioperm: Extend IOPL config to control ioperm() as well") Which, if CONFIG_X86_IOPL_IOPERM is turned off, reduces the size of cpu_entry_area by two pages, triggering the assert: ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_202’ declared with attribute error: BUILD_BUG_ON failed: (CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE Simplify the Kconfig dependencies and make cpu_entry_area constant size on 32-bit kernels again. Fixes: 05b042a1: ("x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 26 November 2019, 2 commits
-
-
Authored by Adrian Hunter
Add to the opcode map the following instructions: v4fmaddps v4fmaddss v4fnmaddps v4fnmaddss vaesdec vaesdeclast vaesenc vaesenclast vcvtne2ps2bf16 vcvtneps2bf16 vdpbf16ps gf2p8affineinvqb vgf2p8affineinvqb gf2p8affineqb vgf2p8affineqb gf2p8mulb vgf2p8mulb vp2intersectd vp2intersectq vp4dpwssd vp4dpwssds vpclmulqdq vpcompressb vpcompressw vpdpbusd vpdpbusds vpdpwssd vpdpwssds vpexpandb vpexpandw vpopcntb vpopcntd vpopcntq vpopcntw vpshldd vpshldq vpshldvd vpshldvq vpshldvw vpshldw vpshrdd vpshrdq vpshrdvd vpshrdvq vpshrdvw vpshrdw vpshufbitqmb For information about the instructions, refer Intel SDM May 2019 (325462-070US) and Intel Architecture Instruction Set Extensions May 2019 (319433-037). The instruction decoding can be tested using the perf tools' "x86 instruction decoder - new instructions" test e.g. $ perf test -v "new " 2>&1 | grep -i 'v4fmaddps' Decoded ok: 62 f2 7f 48 9a 20 v4fmaddps (%eax),%zmm0,%zmm4 Decoded ok: 62 f2 7f 48 9a a4 c8 78 56 34 12 v4fmaddps 0x12345678(%eax,%ecx,8),%zmm0,%zmm4 Decoded ok: 62 f2 7f 48 9a 20 v4fmaddps (%rax),%zmm0,%zmm4 Decoded ok: 67 62 f2 7f 48 9a 20 v4fmaddps (%eax),%zmm0,%zmm4 Decoded ok: 62 f2 7f 48 9a a4 c8 78 56 34 12 v4fmaddps 0x12345678(%rax,%rcx,8),%zmm0,%zmm4 Decoded ok: 67 62 f2 7f 48 9a a4 c8 78 56 34 12 v4fmaddps 0x12345678(%eax,%ecx,8),%zmm0,%zmm4 Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com> Acked-by: NMasami Hiramatsu <mhiramat@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yu-cheng Yu <yu-cheng.yu@intel.com> Cc: x86@kernel.org Link: http://lore.kernel.org/lkml/20191125125044.31879-3-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
-
Authored by Arnd Bergmann
The correct type on x32 is 64-bit wide, same as for the other struct members around it, so use __kernel_long_t in place of the original __kernel_time_t here, corresponding to the rest of the structure. Fixes: caf5e32d ("y2038: ipc: remove __kernel_time_t reference from headers") Reported-by: NBen Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: NArnd Bergmann <arnd@arndb.de>
-
- 25 November 2019, 6 commits
-
-
Authored by Andy Lutomirski
UNWIND_ESPFIX_STACK needs to read the GDT, and the GDT mapping that can be accessed via %fs is not mapped in the user pagetables. Use SGDT to find the cpu_entry_area mapping and read the espfix offset from that instead. Reported-and-tested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Signed-off-by: NIngo Molnar <mingo@kernel.org>
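SGDT is not a privileged instruction, so even user space (when UMIP is not enforced) can ask the CPU for the GDT's linear address, which is essentially the trick the entry code uses here instead of a %fs-relative access. A minimal user-space sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Memory operand layout used by SGDT: a 16-bit limit followed by the
     * linear base address (32-bit or 64-bit depending on mode). */
    struct gdt_ptr {
            uint16_t  limit;
            uintptr_t base;
    } __attribute__((packed));

    int main(void)
    {
            struct gdt_ptr gdtr;

            /* May raise SIGSEGV on kernels that enforce UMIP for user mode. */
            asm volatile("sgdt %0" : "=m" (gdtr));
            printf("GDT base %#lx, limit %#x\n", (unsigned long)gdtr.base, gdtr.limit);
            return 0;
    }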
-
Authored by Will Deacon
The generic implementation of refcount_t should be good enough for everybody, so remove ARCH_HAS_REFCOUNT and REFCOUNT_FULL entirely, leaving the generic implementation enabled unconditionally. Signed-off-by: NWill Deacon <will@kernel.org> Reviewed-by: NArd Biesheuvel <ardb@kernel.org> Acked-by: NKees Cook <keescook@chromium.org> Tested-by: NHanjun Guo <guohanjun@huawei.com> Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: https://lkml.kernel.org/r/20191121115902.2551-9-will@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
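With REFCOUNT_FULL gone, every user simply gets the one generic, saturating refcount_t; a typical usage sketch of that API (the object name is illustrative):

    #include <linux/refcount.h>
    #include <linux/slab.h>

    struct widget {                         /* illustrative object */
            refcount_t refs;
    };

    static struct widget *widget_get(struct widget *w)
    {
            refcount_inc(&w->refs);         /* saturates and WARNs instead of overflowing */
            return w;
    }

    static void widget_put(struct widget *w)
    {
            if (refcount_dec_and_test(&w->refs))
                    kfree(w);
    }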
-
Authored by Ingo Molnar
x86/pti/32: Calculate the various PTI cpu_entry_area sizes correctly, make the CPU_ENTRY_AREA_PAGES assert precise When two recent commits that increased the size of the 'struct cpu_entry_area' were merged in -tip, the 32-bit defconfig build started failing on the following build time assert: ./include/linux/compiler.h:391:38: error: call to ‘__compiletime_assert_189’ declared with attribute error: BUILD_BUG_ON failed: CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE arch/x86/mm/cpu_entry_area.c:189:2: note: in expansion of macro ‘BUILD_BUG_ON’ In function ‘setup_cpu_entry_area_ptes’, Which corresponds to the following build time assert: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); The purpose of this assert is to sanity check the fixed-value definition of CPU_ENTRY_AREA_PAGES arch/x86/include/asm/pgtable_32_types.h: #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 41) The '41' is supposed to match sizeof(struct cpu_entry_area)/PAGE_SIZE, which value we didn't want to define in such a low level header, because it would cause dependency hell. Every time the size of cpu_entry_area is changed, we have to adjust CPU_ENTRY_AREA_PAGES accordingly - and this assert is checking that constraint. But the assert is both imprecise and buggy, primarily because it doesn't include the single readonly IDT page that is mapped at CPU_ENTRY_AREA_BASE (which begins at a PMD boundary). This bug was hidden by the fact that by accident CPU_ENTRY_AREA_PAGES is defined too large upstream (v5.4-rc8): #define CPU_ENTRY_AREA_PAGES (NR_CPUS * 40) While 'struct cpu_entry_area' is 155648 bytes, or 38 pages. So we had two extra pages, which hid the bug. The following commit (not yet upstream) increased the size to 40 pages: x86/iopl: ("Restrict iopl() permission scope") ... but increased CPU_ENTRY_AREA_PAGES only 41 - i.e. shortening the gap to just 1 extra page. Then another not-yet-upstream commit changed the size again: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Which increased the cpu_entry_area size from 38 to 39 pages, but didn't change CPU_ENTRY_AREA_PAGES (kept it at 40). This worked fine, because we still had a page left from the accidental 'reserve'. But when these two commits were merged into the same tree, the combined size of cpu_entry_area grew from 38 to 40 pages, while CPU_ENTRY_AREA_PAGES finally caught up to 40 as well. Which is fine in terms of functionality, but the assert broke: BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE); because CPU_ENTRY_AREA_MAP_SIZE is the total size of the area, which is 1 page larger due to the IDT page. To fix all this, change the assert to two precise asserts: BUILD_BUG_ON((CPU_ENTRY_AREA_PAGES+1)*PAGE_SIZE != CPU_ENTRY_AREA_MAP_SIZE); BUILD_BUG_ON(CPU_ENTRY_AREA_TOTAL_SIZE != CPU_ENTRY_AREA_MAP_SIZE); This takes the IDT page into account, and also connects the size-based define of CPU_ENTRY_AREA_TOTAL_SIZE with the address-subtraction based define of CPU_ENTRY_AREA_MAP_SIZE. Also clean up some of the names which made it rather confusing: - 'CPU_ENTRY_AREA_TOT_SIZE' wasn't actually the 'total' size of the cpu-entry-area, but the per-cpu array size, so rename this to CPU_ENTRY_AREA_ARRAY_SIZE. - Introduce CPU_ENTRY_AREA_TOTAL_SIZE that _is_ the total mapping size, with the IDT included. - Add comments where '+1' denotes the IDT mapping - it wasn't obvious and took me about 3 hours to decode... 
Finally, because this particular commit is actually applied after this patch: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Fix the CPU_ENTRY_AREA_PAGES value from 40 pages to the correct 39 pages. All future commits that change cpu_entry_area will have to adjust this value precisely. As a side note, we should probably attempt to remove CPU_ENTRY_AREA_PAGES and derive its value directly from the structure, without causing header hell - but that is an adventure for another day! :-) Fixes: 880a98c3: ("x86/cpu_entry_area: Add guard page for entry stack on 32bit") Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: stable@kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
Authored by Daniel Borkmann
Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in old_addr as well as new_addr, it's a bit redundant and unnecessarily complicates __bpf_arch_text_poke() itself since we can derive the same from the *_addr that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP} as types which also allows to clean up call-sites. In addition to that, __bpf_arch_text_poke() currently verifies that text matches expected old_insn before we invoke text_poke_bp(). Also add a check on new_insn and skip rewrite if it already matches. Reason why this is rather useful is that it avoids making any special casing in prog_array_map_poke_run() when old and new prog were NULL and has the benefit that also for this case we perform a check on text whether it really matches our expectations. Suggested-by: NAndrii Nakryiko <andriin@fb.com> Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
-
Authored by Daniel Borkmann
Add initial code emission for *direct* jumps for tail call maps in order to avoid the retpoline overhead from a493a87f ("bpf, x64: implement retpoline for tail call") for situations that allow for it, meaning, for known constant keys at verification time which are used as index into the tail call map. In case of Cilium which makes heavy use of tail calls, constant keys are used in the vast majority, only for a single occurrence we use a dynamic key. High level outline is that if the target prog is NULL in the map, we emit a 5-byte nop for the fall-through case and if not, we emit a 5-byte direct relative jmp to the target bpf_func + skipped prologue offset. Later during runtime, we patch these 5-byte nop/jmps upon tail call map update or deletions dynamically. Note that on x86-64 the direct jmp works as we reuse the same stack frame and skip prologue (as opposed to some other JIT implementations). One of the issues is that the tail call map slots can change at any given time even during JITing. Therefore, we have two passes: i) emit nops for all patchable locations during main JITing phase until we declare prog->jited = 1 eventually. At this point the image is stable, not public yet and with all jmps disabled. While JITing, we collect additional info like poke->ip in order to remember the patch location for later modifications. In ii) bpf_tail_call_direct_fixup() walks over the progs poke_tab, locks the tail call maps poke_mutex to prevent from parallel updates and patches in the right locations via __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot be used at this point since we're not yet exposed to kallsyms. For the update we use plain memcpy() since the image is not public and still in read-write mode. After patching, we activate that poke entry through poke->ip_stable. Meaning, at this point any tail call map updates/deletions are not going to ignore that poke entry anymore. Then, bpf_arch_text_poke() might still occur on the read-write image until we finally locked it as read-only. Both modifications on the given image are under text_mutex to avoid interference with each other when update requests come in in parallel for different tail call maps (current one we have locked in JIT and different one where poke->ip_stable was already set). 
Example prog: # ./bpftool p d x i 1655 0: (b7) r3 = 0 1: (18) r2 = map[id:526] 3: (85) call bpf_tail_call#12 4: (b7) r0 = 1 5: (95) exit Before: # ./bpftool p d j i 1655 0xffffffffc076e55c: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 _ 19: xor %edx,%edx |_ index (arg 3) 1b: movabs $0xffff88d95cc82600,%rsi |_ map (arg 2) 25: mov %edx,%edx | index >= array->map.max_entries 27: cmp %edx,0x24(%rsi) | 2a: jbe 0x0000000000000066 |_ 2c: mov -0x224(%rbp),%eax | tail call limit check 32: cmp $0x20,%eax | 35: ja 0x0000000000000066 | 37: add $0x1,%eax | 3a: mov %eax,-0x224(%rbp) |_ 40: mov 0xd0(%rsi,%rdx,8),%rax |_ prog = array->ptrs[index] 48: test %rax,%rax | prog == NULL check 4b: je 0x0000000000000066 |_ 4d: mov 0x30(%rax),%rax | goto *(prog->bpf_func + prologue_size) 51: add $0x19,%rax | 55: callq 0x0000000000000061 | retpoline for indirect jump 5a: pause | 5c: lfence | 5f: jmp 0x000000000000005a | 61: mov %rax,(%rsp) | 65: retq |_ 66: mov $0x1,%eax 6b: pop %rbx 6c: pop %r15 6e: pop %r14 70: pop %r13 72: pop %rbx 73: leaveq 74: retq After; state after JIT: # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 _ 19: xor %edx,%edx |_ index (arg 3) 1b: movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2) 25: mov -0x224(%rbp),%eax | tail call limit check 2b: cmp $0x20,%eax | 2e: ja 0x000000000000003e | 30: add $0x1,%eax | 33: mov %eax,-0x224(%rbp) |_ 39: jmpq 0xfffffffffffd1785 |_ [direct] goto *(prog->bpf_func + prologue_size) 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq After; state after map update (target prog): # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 19: xor %edx,%edx 1b: movabs $0xffff9d8afd74c000,%rsi 25: mov -0x224(%rbp),%eax 2b: cmp $0x20,%eax . 2e: ja 0x000000000000003e . 30: add $0x1,%eax . 33: mov %eax,-0x224(%rbp) |_ 39: jmpq 0xffffffffffb09f55 |_ goto *(prog->bpf_func + prologue_size) 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq After; state after map update (no prog): # ./bpftool p d j i 1655 0xffffffffc08e8930: 0: nopl 0x0(%rax,%rax,1) 5: push %rbp 6: mov %rsp,%rbp 9: sub $0x200,%rsp 10: push %rbx 11: push %r13 13: push %r14 15: push %r15 17: pushq $0x0 19: xor %edx,%edx 1b: movabs $0xffff9d8afd74c000,%rsi 25: mov -0x224(%rbp),%eax 2b: cmp $0x20,%eax . 2e: ja 0x000000000000003e . 30: add $0x1,%eax . 33: mov %eax,-0x224(%rbp) |_ 39: nopl 0x0(%rax,%rax,1) |_ fall-through nop 3e: mov $0x1,%eax 43: pop %rbx 44: pop %r15 46: pop %r14 48: pop %r13 4a: pop %rbx 4b: leaveq 4c: retq Nice bonus is that this also shrinks the code emission quite a bit for every tail call invocation. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net
-
Authored by Daniel Borkmann
Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86 JIT in order to be able to patch direct jumps or nop them out. We need this facility in order to patch tail call jumps and in later work also BPF static keys. Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NAndrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
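The two 5-byte sequences involved are the classic ones: "jmp rel32" (opcode 0xE9 plus a displacement measured from the end of the instruction) and a 5-byte NOP. A small stand-alone sketch of how such a patch is computed (illustrative helper, not the JIT's actual code):

    #include <stdint.h>
    #include <string.h>

    #define X86_PATCH_SIZE 5

    /* Encode "jmp rel32": displacement is relative to the byte after the
     * 5-byte instruction at 'ip'. */
    static void emit_jmp5(uint8_t insn[X86_PATCH_SIZE], uintptr_t ip, uintptr_t target)
    {
            int32_t rel = (int32_t)(target - (ip + X86_PATCH_SIZE));

            insn[0] = 0xE9;
            memcpy(&insn[1], &rel, sizeof(rel));
    }

    /* A recommended 5-byte NOP form: nopl 0x0(%rax,%rax,1). */
    static void emit_nop5(uint8_t insn[X86_PATCH_SIZE])
    {
            static const uint8_t nop5[X86_PATCH_SIZE] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

            memcpy(insn, nop5, sizeof(nop5));
    }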
-
- 23 November 2019, 4 commits
-
-
Authored by Jim Mattson
Commit 37e4c997 ("KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL") broke the KVM_SET_MSRS ABI by instituting new constraints on the data values that kvm would accept for the guest MSR, IA32_FEATURE_CONTROL. Perhaps these constraints should have been opt-in via a new KVM capability, but they were applied indiscriminately, breaking at least one existing hypervisor. Relax the constraints to allow either or both of FEATURE_CONTROL_VMXON_ENABLED_OUTSIDE_SMX and FEATURE_CONTROL_VMXON_ENABLED_INSIDE_SMX to be set when nVMX is enabled. This change is sufficient to fix the aforementioned breakage. Fixes: 37e4c997 ("KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL") Signed-off-by: NJim Mattson <jmattson@google.com> Reviewed-by: NLiran Alon <liran.alon@oracle.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
Authored by Sean Christopherson
Acquire kvm->srcu for the duration of ->set_nested_state() to fix a bug where nVMX derefences ->memslots without holding ->srcu or ->slots_lock. The other half of nested migration, ->get_nested_state(), does not need to acquire ->srcu as it is a purely a dump of internal KVM (and CPU) state to userspace. Detected as an RCU lockdep splat that is 100% reproducible by running KVM's state_test selftest with CONFIG_PROVE_LOCKING=y. Note that the failing function, kvm_is_visible_gfn(), is only checking the validity of a gfn, it's not actually accessing guest memory (which is more or less unsupported during vmx_set_nested_state() due to incorrect MMU state), i.e. vmx_set_nested_state() itself isn't fundamentally broken. In any case, setting nested state isn't a fast path so there's no reason to go out of our way to avoid taking ->srcu. ============================= WARNING: suspicious RCU usage 5.4.0-rc7+ #94 Not tainted ----------------------------- include/linux/kvm_host.h:626 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 2, debug_locks = 1 1 lock held by evmcs_test/10939: #0: ffff88826ffcb800 (&vcpu->mutex){+.+.}, at: kvm_vcpu_ioctl+0x85/0x630 [kvm] stack backtrace: CPU: 1 PID: 10939 Comm: evmcs_test Not tainted 5.4.0-rc7+ #94 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 Call Trace: dump_stack+0x68/0x9b kvm_is_visible_gfn+0x179/0x180 [kvm] mmu_check_root+0x11/0x30 [kvm] fast_cr3_switch+0x40/0x120 [kvm] kvm_mmu_new_cr3+0x34/0x60 [kvm] nested_vmx_load_cr3+0xbd/0x1f0 [kvm_intel] nested_vmx_enter_non_root_mode+0xab8/0x1d60 [kvm_intel] vmx_set_nested_state+0x256/0x340 [kvm_intel] kvm_arch_vcpu_ioctl+0x491/0x11a0 [kvm] kvm_vcpu_ioctl+0xde/0x630 [kvm] do_vfs_ioctl+0xa2/0x6c0 ksys_ioctl+0x66/0x70 __x64_sys_ioctl+0x16/0x20 do_syscall_64+0x54/0x200 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x7f59a2b95f47 Fixes: 8fcc4b59 ("kvm: nVMX: Introduce KVM_CAP_NESTED_STATE") Cc: stable@vger.kernel.org Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
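The shape of the fix is the usual SRCU read-side bracket around the code that may dereference memslots; a hedged sketch of the pattern (variable names are illustrative, this is not the literal diff):

    int idx, ret;

    /* Hold kvm->srcu across anything that may look at kvm->memslots. */
    idx = srcu_read_lock(&vcpu->kvm->srcu);
    ret = kvm_x86_ops->set_nested_state(vcpu, user_state, &kvm_state);
    srcu_read_unlock(&vcpu->kvm->srcu, idx);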
-
Authored by Sean Christopherson
Fold shared_msr_update() into its sole user to eliminate its pointless bounds check, its godawful printk, its misleading comment (it's called under a global lock), and its woefully inaccurate name. Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
Authored by Sean Christopherson
A recent change inadvertently exported a static function, which results in modpost throwing a warning. Fix it. Fixes: cbbaa272 ("KVM: x86: fix presentation of TSX feature in ARCH_CAPABILITIES") Signed-off-by: NSean Christopherson <sean.j.christopherson@intel.com> Cc: stable@vger.kernel.org Reviewed-by: NJim Mattson <jmattson@google.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 22 November 2019, 3 commits
-
-
Authored by Eric Biggers
It's not valid to call crypto_unregister_skciphers() without a prior call to crypto_register_skciphers(). Fixes: 84e03fa3 ("crypto: x86/chacha - expose SIMD ChaCha routine as library function") Signed-off-by: NEric Biggers <ebiggers@google.com> Acked-by: NArd Biesheuvel <ardb@kernel.org> Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
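The rule the fix enforces is that unregistration in module exit must be guarded by exactly the same condition that guarded registration in module init; a hedged sketch of the pattern (registration_wanted() is a placeholder for whatever actually gates registration):

    static int __init chacha_simd_mod_init(void)
    {
            if (!registration_wanted())     /* placeholder for the real gating condition */
                    return 0;
            return crypto_register_skciphers(algs, ARRAY_SIZE(algs));
    }

    static void __exit chacha_simd_mod_fini(void)
    {
            if (registration_wanted())      /* same condition: only undo what was done */
                    crypto_unregister_skciphers(algs, ARRAY_SIZE(algs));
    }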
-
Authored by Dexuan Cui
The API will be used by the hv_balloon and hv_vmbus drivers. Balloon up/down and hot-add of memory must not be active if the user wants the Linux VM to support hibernation, because they are incompatible with hibernation according to Hyper-V team, e.g. upon suspend the balloon VSP doesn't save any info about the ballooned-out pages (if any); so, after Linux resumes, Linux balloon VSC expects that the VSP will return the pages if Linux is under memory pressure, but the VSP will never do that, since the VSP thinks it never stole the pages from the VM. So, if the user wants Linux VM to support hibernation, Linux must forbid balloon up/down and hot-add, and the only functionality of the balloon VSC driver is reporting the VM's memory pressure to the host. Ideally, when Linux detects that the user wants it to support hibernation, the balloon VSC should tell the VSP that it does not support ballooning and hot-add. However, the current version of the VSP requires the VSC should support these capabilities, otherwise the capability negotiation fails and the VSC can not load at all, so with the later changes to the VSC driver, Linux VM still reports to the VSP that the VSC supports these capabilities, but the VSC ignores the VSP's requests of balloon up/down and hot add, and reports an error to the VSP, when applicable. BTW, in the future the balloon VSP driver will allow the VSC to not support the capabilities of balloon up/down and hot add. The ACPI S4 state is not a must for hibernation to work, because Linux is able to hibernate as long as the system can shut down. However in practice we decide to artificially use the presence of the virtual ACPI S4 state as an indicator of the user's intent of using hibernation, because Linux VM must find a way to know if the user wants to use the hibernation feature or not. By default, Hyper-V does not enable the virtual ACPI S4 state; on recent Hyper-V hosts (e.g. RS5, 19H1), the administrator is able to enable the state for a VM by WMI commands. Once all the vmbus and VSC patches for the hibernation feature are accepted, an extra patch will be submitted to forbid hibernation if the virtual ACPI S4 state is absent, i.e. hv_is_hibernation_supported() is false. Signed-off-by: NDexuan Cui <decui@microsoft.com> Reviewed-by: NMichael Kelley <mikelley@microsoft.com> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NSasha Levin <sashal@kernel.org>
-
Authored by Himadri Pandya
Hyper-V assumes page size to be 4K. While this assumption holds true on x86 architecture, it might not be true for ARM64 architecture. Hence define hyper-v specific function to allocate a zeroed page which can have a different implementation on ARM64 architecture to handle the conflict between hyper-v's assumed page size and actual guest page size. Signed-off-by: NHimadri Pandya <himadri18.07@gmail.com> Reviewed-by: NMichael Kelley <mikelley@microsoft.com> Signed-off-by: NSasha Levin <sashal@kernel.org>
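One plausible shape for such a helper, centralizing "allocate a zeroed Hyper-V-sized page" so that ARM64 can supply a different implementation behind the same interface, is sketched below (an assumption-laden sketch for the x86 case where guest and Hyper-V page sizes are both 4K; the real helper may differ):

    /* On x86 the guest page size equals Hyper-V's 4K page size, so an ordinary
     * zeroed page satisfies the contract; other architectures can implement
     * the same interface differently. */
    void *hv_alloc_hyperv_zeroed_page(void)
    {
            return (void *)get_zeroed_page(GFP_KERNEL);
    }

    void hv_free_hyperv_page(unsigned long addr)
    {
            free_page(addr);
    }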
-