- 12 12月, 2017 1 次提交
-
-
由 Jan Beulich 提交于
Unconditionally reporting a value seen on the P4 or older invokes functionality like io_apic_get_unique_id() on 32-bit builds, resulting in a panic() with sufficiently many CPUs and/or IO-APICs. Doing what that function does would be the hypervisor's responsibility anyway, so makes no sense to be used when running on Xen. Uniformly report a more modern version; this shouldn't matter much as both LAPIC and IO-APIC are being managed entirely / mostly by the hypervisor. Signed-off-by: NJan Beulich <jbeulich@suse.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 09 11月, 2017 3 次提交
-
-
由 Joao Martins 提交于
In order to support pvclock vdso on xen we need to setup the time info page for vcpu 0 and register the page with Xen using the VCPUOP_register_vcpu_time_memory_area hypercall. This hypercall will also forcefully update the pvti which will set some of the necessary flags for vdso. Afterwards we check if it supports the PVCLOCK_TSC_STABLE_BIT flag which is mandatory for having vdso/vsyscall support. And if so, it will set the cpu 0 pvti that will be later on used when mapping the vdso image. The xen headers are also updated to include the new hypercall for registering the secondary vcpu_time_info struct. Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
由 Joao Martins 提交于
Specifically check for PVCLOCK_TSC_STABLE_BIT and if this bit is set, then set it too on pvclock flags. This allows Xen clocksource to use it and thus speeding up xen_clocksource_read() callers (i.e. sched_clock()) Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
由 Joao Martins 提交于
Right now there is only a pvclock_pvti_cpu0_va() which is defined on kvmclock since: commit dac16fba ("x86/vdso: Get pvclock data from the vvar VMA instead of the fixmap") The only user of this interface so far is kvm. This commit adds a setter function for the pvti page and moves pvclock_pvti_cpu0_va to pvclock, which is a more generic place to have it; and would allow other PV clocksources to use it, such as Xen. While moving pvclock_pvti_cpu0_va into pvclock, rename also this function to pvclock_get_pvti_cpu0_va (including its call sites) to be symmetric with the setter (pvclock_set_pvti_cpu0_va). Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Acked-by: NAndy Lutomirski <luto@kernel.org> Acked-by: NPaolo Bonzini <pbonzini@redhat.com> Acked-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 07 11月, 2017 2 次提交
-
-
由 Juergen Gross 提交于
Update arch/x86/include/asm/xen/cpuid.h from the Xen tree to get newest definitions. Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
由 Juergen Gross 提交于
The grant v2 support was removed from the kernel with commit 438b33c7 ("xen/grant-table: remove support for V2 tables") as the higher memory footprint of v2 grants resulted in less grants being possible for a kernel compared to the v1 grant interface. As machines with more than 16TB of memory are expected to be more common in the near future support of grant v2 is mandatory in order to be able to run a Xen pv domain at any memory location. So re-add grant v2 support basically by reverting above commit. Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 04 11月, 2017 1 次提交
-
-
由 Paul Durrant 提交于
If the domain has XENFEAT_auto_translated_physmap then use of the PV- specific HYPERVISOR_mmu_update hypercall is clearly incorrect. This patch adds checks in xen_remap_domain_gfn_array() and xen_unmap_domain_gfn_array() which call through to the approprate xlate_mmu function if the feature is present. A check is also added to xen_remap_domain_gfn_range() to fail with -EOPNOTSUPP since this should not be used in an HVM tools domain. Signed-off-by: NPaul Durrant <paul.durrant@citrix.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 03 11月, 2017 1 次提交
-
-
由 Boris Ostrovsky 提交于
For any other error sync_cmos_clock() will reschedule itself every second or so, for no good reason. Suggested-by: NPaolo Bonzini <pbonzini@redhat.com> Reviewed-by: NJuergen Gross <jgross@suse.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 31 10月, 2017 1 次提交
-
-
由 Juergen Gross 提交于
Physical addresses on processors supporting 5 level paging can be up to 52 bits wide. For a Xen pv guest running on such a machine those physical addresses have to be supported in order to be able to use any memory on the machine even if the guest itself does not support 5 level paging. So when reading/writing a MFN from/to a pte don't use the kernel's PTE_PFN_MASK but a new XEN_PTE_MFN_MASK allowing full 40 bit wide MFNs. Signed-off-by: NJuergen Gross <jgross@suse.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
- 27 10月, 2017 1 次提交
-
-
由 Ingo Molnar 提交于
This reverts commit ce56a86e. There's unanticipated interaction with some boot parameters like 'mem=', which now cause the new checks via valid_mmap_phys_addr_range() to be too restrictive, crashing a Qemu bootup in fact, as reported by Fengguang Wu. So while the motivation of the change is still entirely valid, we need a few more rounds of testing to get it right - it's way too late after -rc6, so revert it for now. Reported-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Acked-by: NCraig Bergstrom <craigb@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dsafonov@virtuozzo.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: oleg@redhat.com Cc: linux-kernel@vger.kernel.org Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 24 10月, 2017 1 次提交
-
-
由 Alexander Shishkin 提交于
Commit: d2878d64 ("perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems") ... adds a privilege check in the exactly wrong place in the event init path: after the 'LBR exclusive' reference has been taken, and doesn't release it in the case of insufficient privileges. After this, nobody in the system gets to use PT or LBR afterwards. This patch moves the privilege check to where it should have been in the first place. Signed-off-by: NAlexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: d2878d64 ("perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems") Link: http://lkml.kernel.org/r/20171023123533.16973-1-alexander.shishkin@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 23 10月, 2017 2 次提交
-
-
由 Josh Poimboeuf 提交于
Improve the warning messages to show the relevant function name+offset. This makes it much easier to diagnose problems with the ORC metadata. Before: WARNING: can't dereference iret registers at ffff8801c5f17fe0 for ip ffffffff95f0d94b After: WARNING: can't dereference iret registers at ffff880178f5ffe0 for ip int3+0x5b/0x60 Reported-by: NAndrei Vagin <avagin@virtuozzo.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: ee9f8fce ("x86/unwind: Add the ORC unwinder") Link: http://lkml.kernel.org/r/6bada6b9eac86017e16bd79e1e77877935cb50bb.1508516398.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
This fixes the following ORC warning in the 'int3' entry code: WARNING: can't dereference iret registers at ffff8801c5f17fe0 for ip ffffffff95f0d94b The ORC metadata had the wrong stack offset for the iret registers. Their location on the stack is dependent on whether the exception has an error code. Reported-and-tested-by: NAndrei Vagin <avagin@virtuozzo.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 8c1f7558 ("x86/entry/64: Add unwind hint annotations") Link: http://lkml.kernel.org/r/931d57f0551ed7979d5e7e05370d445c8e5137f8.1508516398.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 22 10月, 2017 1 次提交
-
-
由 Borislav Petkov 提交于
Some F14h machines have an erratum which, "under a highly specific and detailed set of internal timing conditions" can lead to skipping instructions and RIP corruption. Add the fix for those machines when their BIOS doesn't apply it or there simply isn't BIOS update for them. Tested-by: <mirh@protonmail.ch> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Link: http://lkml.kernel.org/r/20171022104731.28249-1-bp@alien8.de Link: https://bugzilla.kernel.org/show_bug.cgi?id=197285 [ Added pr_info() that we activated the workaround. ] Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 20 10月, 2017 1 次提交
-
-
由 Craig Bergstrom 提交于
Currently, it is possible to mmap() any offset from /dev/mem. If a program mmaps() /dev/mem offsets outside of the addressable limits of a system, the page table can be corrupted by setting reserved bits. For example if you mmap() offset 0x0001000000000000 of /dev/mem on an x86_64 system with a 48-bit bus, the page fault handler will be called with error_code set to RSVD. The kernel then crashes with a page table corruption error. This change prevents this page table corruption on x86 by refusing to mmap offsets higher than the highest valid address in the system. Signed-off-by: NCraig Bergstrom <craigb@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luis R. Rodriguez <mcgrof@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Toshi Kani <toshi.kani@hp.com> Cc: dsafonov@virtuozzo.com Cc: kirill.shutemov@linux.intel.com Cc: mhocko@suse.com Cc: oleg@redhat.com Link: http://lkml.kernel.org/r/20171019192856.39672-1-craigb@google.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 10月, 2017 4 次提交
-
-
由 Andy Lutomirski 提交于
Borislav thinks that we don't need this knob in a released kernel. Get rid of it. Requested-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b956575b ("x86/mm: Flush more aggressively in lazy TLB mode") Link: http://lkml.kernel.org/r/1fa72431924e81e86c164ff7881bf9240d1f1a6c.1508000261.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Due to timezones, commit: b956575b ("x86/mm: Flush more aggressively in lazy TLB mode") was an outdated patch that well tested and fixed the bug but didn't address Borislav's review comments. Tidy it up: - The name "tlb_use_lazy_mode()" was highly confusing. Change it to "tlb_defer_switch_to_init_mm()", which describes what it actually means. - Move the static_branch crap into a helper. - Improve comments. Actually removing the debugfs option is in the next patch. Reported-by: NBorislav Petkov <bp@alien8.de> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: b956575b ("x86/mm: Flush more aggressively in lazy TLB mode") Link: http://lkml.kernel.org/r/154ef95428d4592596b6e98b0af1d2747d6cfbf8.1508000261.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Let's avoid hard-to-diagnose crashes in the future. Signed-off-by: NAndy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/f423bbc97864089fbdeb813f1ea126c6eaed844a.1508000261.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Borislav Petkov 提交于
Blacklist Broadwell X model 79 for late loading due to an erratum. Signed-off-by: NBorislav Petkov <bp@suse.de> Acked-by: NTony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171018111225.25635-1-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 17 10月, 2017 1 次提交
-
-
由 Thomas Gleixner 提交于
Moving the early IDT setup out of assembly code breaks the boot on first generation 486 systems. The reason is that the call of idt_setup_early_handler, which sets up the early handlers was added after the call to cr4_init_shadow(). cr4_init_shadow() tries to read CR4 which is not available on those systems. The accessor function uses a extable fixup to handle the resulting fault. As the IDT is not set up yet, the cr4 read exception causes an instantaneous reboot for obvious reasons. Call idt_setup_early_handler() before cr4_init_shadow() so IDT is set up before the first exception hits. Fixes: 87e81786 ("x86/idt: Move early IDT setup out of 32-bit asm") Reported-and-tested-by: NMatthew Whitehead <whiteheadm@acm.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1710161210290.1973@nanos
-
- 16 10月, 2017 1 次提交
-
-
由 Colin Ian King 提交于
The 'this_leaf' variable is assigned a value that is never read and it is updated a little later with a newer value, hence we can remove the redundant assignment. Cleans up the following Clang warning: Value stored to 'this_leaf' is never read Signed-off-by: NColin Ian King <colin.king@canonical.com> Reviewed-by: NBorislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20171015160203.12332-1-colin.king@canonical.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 10月, 2017 2 次提交
-
-
由 Borislav Petkov 提交于
On CPUs like AMD's Geode, for example, we shouldn't even try to load microcode because they do not support the modern microcode loading interface. However, we do the family check *after* the other checks whether the loader has been disabled on the command line or whether we're running in a guest. So move the family checks first in order to exit early if we're being loaded on an unsupported family. Reported-and-tested-by: NSven Glodowski <glodi1@arcor.de> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: <stable@vger.kernel.org> # 4.11.. Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://bugzilla.suse.com/show_bug.cgi?id=1061396 Link: http://lkml.kernel.org/r/20171012112316.977-1-bp@alien8.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Since commit: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") x86's lazy TLB mode has been all the way lazy: when running a kernel thread (including the idle thread), the kernel keeps using the last user mm's page tables without attempting to maintain user TLB coherence at all. From a pure semantic perspective, this is fine -- kernel threads won't attempt to access user pages, so having stale TLB entries doesn't matter. Unfortunately, I forgot about a subtlety. By skipping TLB flushes, we also allow any paging-structure caches that may exist on the CPU to become incoherent. This means that we can have a paging-structure cache entry that references a freed page table, and the CPU is within its rights to do a speculative page walk starting at the freed page table. I can imagine this causing two different problems: - A speculative page walk starting from a bogus page table could read IO addresses. I haven't seen any reports of this causing problems. - A speculative page walk that involves a bogus page table can install garbage in the TLB. Such garbage would always be at a user VA, but some AMD CPUs have logic that triggers a machine check when it notices these bogus entries. I've seen a couple reports of this. Boris further explains the failure mode: > It is actually more of an optimization which assumes that paging-structure > entries are in WB DRAM: > > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables > performance optimization that assumes PML4, PDP, PDE, and PTE entries > are in cacheable WB-DRAM; memory type checks may be bypassed, and > addresses outside of WB-DRAM may result in undefined behavior or NB > protocol errors. 1=Disables performance optimization and allows PML4, > PDP, PDE and PTE entries to be in any memory type. Operating systems > that maintain page tables in memory types other than WB- DRAM must set > TlbCacheDis to insure proper operation." > > The MCE generated is an NB protocol error to signal that > > "Link: A specific coherent-only packet from a CPU was issued to an > IO link. This may be caused by software which addresses page table > structures in a memory type other than cacheable WB-DRAM without > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for > example, when page table structure addresses are above top of memory. In > such cases, the NB will generate an MCE if it sees a mismatch between > the memory operation generated by the core and the link type." > > I'm assuming coherent-only packets don't go out on IO links, thus the > error. To fix this, reinstate TLB coherence in lazy mode. With this patch applied, we do it in one of two ways: - If we have PCID, we simply switch back to init_mm's page tables when we enter a kernel thread -- this seems to be quite cheap except for the cost of serializing the CPU. - If we don't have PCID, then we set a flag and switch to init_mm the first time we would otherwise need to flush the TLB. The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed to override the default mode for benchmarking. In theory, we could optimize this better by only flushing the TLB in lazy CPUs when a page table is freed. Doing that would require auditing the mm code to make sure that all page table freeing goes through tlb_remove_page() as well as reworking some data structures to implement the improved flush logic. Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de> Reported-by: NAdam Borowski <kilobyte@angband.pl> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Biggers <ebiggers@google.com> Cc: Johannes Hirte <johannes.hirte@datenkhaos.de> Cc: Kees Cook <keescook@chromium.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Roman Kagan <rkagan@virtuozzo.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking") Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnicSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 12 10月, 2017 3 次提交
-
-
由 Len Brown 提交于
SKX stepping-3 fixed the TSC_DEADLINE issue in a different ucode version number than stepping-4. Linux needs to know this stepping-3 specific version number to also enable the TSC_DEADLINE on stepping-3. The steppings and ucode versions are documented in the SKX BIOS update: https://downloadmirror.intel.com/26978/eng/ReleaseNotes_R00.01.0004.txtSigned-off-by: NLen Brown <len.brown@intel.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: peterz@infradead.org Link: https://lkml.kernel.org/r/60f2bbf7cf617e212b522e663f84225bfebc50e5.1507756305.git.len.brown@intel.com
-
由 Paolo Bonzini 提交于
Commit 594a30fb ("x86/apic: Silence "FW_BUG TSC_DEADLINE disabled due to Errata" on CPUs without the feature", 2017-08-30) was also about silencing the warning on VirtualBox; however, KVM does expose the TSC deadline timer, and it's virtualized so that it is immune from CPU errata. Therefore, booting 4.13 with "-cpu Haswell" shows this in the logs: [ 0.000000] [Firmware Bug]: TSC_DEADLINE disabled due to Errata; please update microcode to version: 0xb2 (or later) Even if you had a hypervisor that does _not_ virtualize the TSC deadline and rather exposes the hardware one, it should be the hypervisors task to update microcode and possibly hide the flag from CPUID. So just hide the message when running on _any_ hypervisor, not just those that do not support the TSC deadline timer. The older check still makes sense, so keep it. Fixes: bd9240a1 ("x86/apic: Add TSC_DEADLINE quirk due to errata") Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Hans de Goede <hdegoede@redhat.com> Cc: kvm@vger.kernel.org Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1507630377-54471-1-git-send-email-pbonzini@redhat.com
-
由 Haozhong Zhang 提交于
When KVM emulates an exit from L2 to L1, it loads L1 CR4 into the guest CR4. Before this CR4 loading, the guest CR4 refers to L2 CR4. Because these two CR4's are in different levels of guest, we should vmx_set_cr4() rather than kvm_set_cr4() here. The latter, which is used to handle guest writes to its CR4, checks the guest change to CR4 and may fail if the change is invalid. The failure may cause trouble. Consider we start a L1 guest with non-zero L1 PCID in use, (i.e. L1 CR4.PCIDE == 1 && L1 CR3.PCID != 0) and a L2 guest with L2 PCID disabled, (i.e. L2 CR4.PCIDE == 0) and following events may happen: 1. If kvm_set_cr4() is used in load_vmcs12_host_state() to load L1 CR4 into guest CR4 (in VMCS01) for L2 to L1 exit, it will fail because of PCID check. As a result, the guest CR4 recorded in L0 KVM (i.e. vcpu->arch.cr4) is left to the value of L2 CR4. 2. Later, if L1 attempts to change its CR4, e.g., clearing VMXE bit, kvm_set_cr4() in L0 KVM will think L1 also wants to enable PCID, because the wrong L2 CR4 is used by L0 KVM as L1 CR4. As L1 CR3.PCID != 0, L0 KVM will inject GP to L1 guest. Fixes: 4704d0be ("KVM: nVMX: Exiting from L2 to L1") Cc: qemu-stable@nongnu.org Signed-off-by: NHaozhong Zhang <haozhong.zhang@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 11 10月, 2017 1 次提交
-
-
由 Tom Lendacky 提交于
Some routines in mem_encrypt.c are called very early in the boot process, e.g. sme_enable(). When CONFIG_KCOV=y is defined the resulting code added to sme_enable() (and others) for KCOV instrumentation results in a kernel crash. Disable the KCOV instrumentation for mem_encrypt.c by adding KCOV_INSTRUMENT_mem_encrypt.o := n to arch/x86/mm/Makefile. In order to avoid other possible early boot issues, model mem_encrypt.c after head64.c in regards to tools. In addition to disabling KCOV as stated above and a previous patch that disables branch profiling, also remove the "-pg" CFLAG if CONFIG_FUNCTION_TRACER is enabled and set KASAN_SANITIZE to "n", each of which are done on a file basis. Reported-by: Nkernel test robot <lkp@01.org> Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com> Reviewed-by: NBorislav Petkov <bp@suse.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171010194504.18887.38053.stgit@tlendack-t1.amdoffice.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 10 10月, 2017 11 次提交
-
-
由 Ladi Prosek 提交于
is_last_gpte() is not equivalent to the pseudo-code given in commit 6bb69c9b ("KVM: MMU: simplify last_pte_bitmap") because an incorrect value of last_nonleaf_level may override the result even if level == 1. It is critical for is_last_gpte() to return true on level == 1 to terminate page walks. Otherwise memory corruption may occur as level is used as an index to various data structures throughout the page walking code. Even though the actual bug would be wherever the MMU is initialized (as in the previous patch), be defensive and ensure here that is_last_gpte() returns the correct value. This patch is also enough to fix CVE-2017-12188. Fixes: 6bb69c9b Cc: stable@vger.kernel.org Cc: Andy Honig <ahonig@google.com> Signed-off-by: NLadi Prosek <lprosek@redhat.com> [Panic if walk_addr_generic gets an incorrect level; this is a serious bug and it's not worth a WARN_ON where the recovery path might hide further exploitable issues; suggested by Andrew Honig. - Paolo] Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Ladi Prosek 提交于
The function updates context->root_level but didn't call update_last_nonleaf_level so the previous and potentially wrong value was used for page walks. For example, a zero value of last_nonleaf_level would allow a potential out-of-bounds access in arch/x86/mmu/paging_tmpl.h's walk_addr_generic function (CVE-2017-12188). Fixes: 155a97a3Signed-off-by: NLadi Prosek <lprosek@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Zhenzhong Duan 提交于
As xen_cpuhp_setup is called by PV and PVHVM, the name of "x86/xen/hvm_guest" is confusing. Signed-off-by: NZhenzhong Duan <zhenzhong.duan@oracle.com> Reviewed-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
-
由 Marcelo Henrique Cerri 提交于
Do not consider the fixed size of hv_vp_set when passing the variable header size to hv_do_rep_hypercall(). The Hyper-V hypervisor specification states that for a hypercall with a variable header only the size of the variable portion should be supplied via the input control. For HVCALL_FLUSH_VIRTUAL_ADDRESS_SPACE_EX/LIST_EX calls that means the fixed portion of hv_vp_set should not be considered. That fixes random failures of some applications that are unexpectedly killed with SIGBUS or SIGSEGV. Signed-off-by: NMarcelo Henrique Cerri <marcelo.cerri@canonical.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: Josh Poulson <jopoulso@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: devel@linuxdriverproject.org Fixes: 628f54cc ("x86/hyper-v: Support extended CPU ranges for TLB flush hypercalls") Link: http://lkml.kernel.org/r/1507210469-29065-1-git-send-email-marcelo.cerri@canonical.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vitaly Kuznetsov 提交于
hv_do_hypercall() does virt_to_phys() translation and with some configs (CONFIG_SLAB) this doesn't work for percpu areas, we pass wrong memory to hypervisor and get #GP. We could use working slow_virt_to_phys() instead but doing so kills the performance. Move pcpu_flush/pcpu_flush_ex structures out of percpu areas and allocate memory on first call. The additional level of indirection gives us a small performance penalty, in future we may consider introducing hypercall functions which avoid virt_to_phys() conversion and cache physical addresses of pcpu_flush/pcpu_flush_ex structures somewhere. Reported-by: NSimon Xiao <sixiao@microsoft.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171005113924.28021-1-vkuznets@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Vitaly Kuznetsov 提交于
hv_flush_pcpu_ex structures are not cleared between calls for performance reasons (they're variable size up to PAGE_SIZE each) but we must clear hv_vp_set.bank_contents part of it to avoid flushing unneeded vCPUs. The rest of the structure is formed correctly. To do the clearing in an efficient way stash the maximum possible vCPU number (this may differ from Linux CPU id). Reported-by: NJork Loeser <Jork.Loeser@microsoft.com> Signed-off-by: NVitaly Kuznetsov <vkuznets@redhat.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20171006154854.18092-1-vkuznets@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Colin Ian King 提交于
Currently if an allocation fails then the error return paths don't free up any currently allocated pmus[].boxes and pmus causing a memory leak. Add an error clean up exit path that frees these objects. Detected by CoverityScan, CID#711632 ("Resource Leak") Signed-off-by: NColin Ian King <colin.king@canonical.com> Acked-by: NPeter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-janitors@vger.kernel.org Fixes: 087bfbb0 ("perf/x86: Add generic Intel uncore PMU support") Link: http://lkml.kernel.org/r/20171009172655.6132-1-colin.king@canonical.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
x86-32 doesn't have stack validation, so in most cases it doesn't make sense to warn about bad frame pointers. Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/a69658760800bf281e6353248c23e0fa0acf5230.1507597785.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
When printing the unwinder dump, the stack pointer could be unaligned, for one of two reasons: - stack corruption; or - GCC created an unaligned stack. There's no way for the unwinder to tell the difference between the two, so we have to assume one or the other. GCC unaligned stacks are very rare, and have only been spotted before GCC 5. Presumably, if we're doing an unwinder stack dump, stack corruption is more likely than a GCC unaligned stack. So always align the stack before starting the dump. Reported-and-tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/2f540c515946ab09ed267e1a1d6421202a0cce08.1507597785.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
On x86-32, Tetsuo Handa and Fengguang Wu reported unwinder warnings like: WARNING: kernel stack regs at f60bb9c8 in swapper:1 has bad 'bp' value 0ba00000 And also there were some stack dumps with a bunch of unreliable '?' symbols after an apic_timer_interrupt symbol, meaning the unwinder got confused when it tried to read the regs. The cause of those issues is that, with GCC 4.8 (and possibly older), there are cases where GCC misaligns the stack pointer in a leaf function for no apparent reason: c124a388 <acpi_rs_move_data>: c124a388: 55 push %ebp c124a389: 89 e5 mov %esp,%ebp c124a38b: 57 push %edi c124a38c: 56 push %esi c124a38d: 89 d6 mov %edx,%esi c124a38f: 53 push %ebx c124a390: 31 db xor %ebx,%ebx c124a392: 83 ec 03 sub $0x3,%esp ... c124a3e3: 83 c4 03 add $0x3,%esp c124a3e6: 5b pop %ebx c124a3e7: 5e pop %esi c124a3e8: 5f pop %edi c124a3e9: 5d pop %ebp c124a3ea: c3 ret If an interrupt occurs in such a function, the regs on the stack will be unaligned, which breaks the frame pointer encoding assumption. So on 32-bit, use the MSB instead of the LSB to encode the regs. This isn't an issue on 64-bit, because interrupts align the stack before writing to it. Reported-and-tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/279a26996a482ca716605c7dbc7f2db9d8d91e81.1507597785.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
Tetsuo Handa and Fengguang Wu reported a panic in the unwinder: BUG: unable to handle kernel NULL pointer dereference at 000001f2 IP: update_stack_state+0xd4/0x340 *pde = 00000000 Oops: 0000 [#1] PREEMPT SMP CPU: 0 PID: 18728 Comm: 01-cpu-hotplug Not tainted 4.13.0-rc4-00170-gb09be676 #592 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014 task: bb0b53c0 task.stack: bb3ac000 EIP: update_stack_state+0xd4/0x340 EFLAGS: 00010002 CPU: 0 EAX: 0000a570 EBX: bb3adccb ECX: 0000f401 EDX: 0000a570 ESI: 00000001 EDI: 000001ba EBP: bb3adc6b ESP: bb3adc3f DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 CR0: 80050033 CR2: 000001f2 CR3: 0b3a7000 CR4: 00140690 DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 DR6: fffe0ff0 DR7: 00000400 Call Trace: ? unwind_next_frame+0xea/0x400 ? __unwind_start+0xf5/0x180 ? __save_stack_trace+0x81/0x160 ? save_stack_trace+0x20/0x30 ? __lock_acquire+0xfa5/0x12f0 ? lock_acquire+0x1c2/0x230 ? tick_periodic+0x3a/0xf0 ? _raw_spin_lock+0x42/0x50 ? tick_periodic+0x3a/0xf0 ? tick_periodic+0x3a/0xf0 ? debug_smp_processor_id+0x12/0x20 ? tick_handle_periodic+0x23/0xc0 ? local_apic_timer_interrupt+0x63/0x70 ? smp_trace_apic_timer_interrupt+0x235/0x6a0 ? trace_apic_timer_interrupt+0x37/0x3c ? strrchr+0x23/0x50 Code: 0f 95 c1 89 c7 89 45 e4 0f b6 c1 89 c6 89 45 dc 8b 04 85 98 cb 74 bc 88 4d e3 89 45 f0 83 c0 01 84 c9 89 04 b5 98 cb 74 bc 74 3b <8b> 47 38 8b 57 34 c6 43 1d 01 25 00 00 02 00 83 e2 03 09 d0 83 EIP: update_stack_state+0xd4/0x340 SS:ESP: 0068:bb3adc3f CR2: 00000000000001f2 ---[ end trace 0d147fd4aba8ff50 ]--- Kernel panic - not syncing: Fatal exception in interrupt On x86-32, after decoding a frame pointer to get a regs address, regs_size() dereferences the regs pointer when it checks regs->cs to see if the regs are user mode. This is dangerous because it's possible that what looks like a decoded frame pointer is actually a corrupt value, and we don't want the unwinder to make things worse. Instead of calling regs_size() on an unsafe pointer, just assume they're kernel regs to start with. Later, once it's safe to access the regs, we can do the user mode check and corresponding safety check for the remaining two regs. Reported-and-tested-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reported-and-tested-by: NFengguang Wu <fengguang.wu@intel.com> Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Cc: Byungchul Park <byungchul.park@lge.com> Cc: LKP <lkp@01.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: 5ed8d8bb ("x86/unwind: Move common code into update_stack_state()") Link: http://lkml.kernel.org/r/7f95b9a6993dec7674b3f3ab3dcd3294f7b9644d.1507597785.git.jpoimboe@redhat.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 09 10月, 2017 2 次提交
-
-
由 Mathias Krause 提交于
The alt_max_short() macro in asm/alternative.h does not work as intended, leading to nasty bugs. E.g. alt_max_short("1", "3") evaluates to 3, but alt_max_short("3", "1") evaluates to 1 -- not exactly the maximum of 1 and 3. In fact, I had to learn it the hard way by crashing my kernel in not so funny ways by attempting to make use of the ALTENATIVE_2 macro with alternatives where the first one was larger than the second one. According to [1] and commit dbe4058a ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly") the right handed side should read "-(-(a < b))" not "-(-(a - b))". Fix that, to make the macro work as intended. While at it, fix up the comments regarding the additional "-", too. It's not about gas' usage of s32 but brain dead logic of having a "true" value of -1 for the < operator ... *sigh* Btw., the one in asm/alternative-asm.h is correct. And, apparently, all current users of ALTERNATIVE_2() pass same sized alternatives, avoiding to hit the bug. [1] http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMaxReviewed-and-tested-by: NBorislav Petkov <bp@suse.de> Fixes: dbe4058a ("x86/alternatives: Fix ALTERNATIVE_2 padding generation properly") Signed-off-by: NMathias Krause <minipli@googlemail.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1507228213-13095-1-git-send-email-minipli@googlemail.com
-
由 Andy Lutomirski 提交于
Trying to reboot via real mode fails with PCID on: long mode cannot be exited while CR4.PCIDE is set. (No, I have no idea why, but the SDM and actual CPUs are in agreement here.) The result is a GPF and a hang instead of a reboot. I didn't catch this in testing because neither my computer nor my VM reboots this way. I can trigger it with reboot=bios, though. Fixes: 660da7c9 ("x86/mm: Enable CR4.PCIDE on supported systems") Reported-and-tested-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NAndy Lutomirski <luto@kernel.org> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Link: https://lkml.kernel.org/r/f1e7d965998018450a7a70c2823873686a8b21c0.1507524746.git.luto@kernel.org
-