- 01 4月, 2015 1 次提交
-
-
由 Ingo Molnar 提交于
Currently x86-64 efi_call_phys_prolog() saves into a global variable (save_pgd), and efi_call_phys_epilog() restores the kernel pagetables from that global variable. Change this to a cleaner save/restore pattern where the saving function returns the saved object and the restore function restores that. Apply the same concept to the 32-bit code as well. Plus this approach, as an added bonus, allows us to express the !efi_enabled(EFI_OLD_MEMMAP) situation in a clean fashion as well, via a 'NULL' return value. Cc: Tapasweni Pathak <tapaswenipathak@gmail.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
-
- 06 3月, 2015 1 次提交
-
-
由 Quentin Casasnovas 提交于
Commit: f31a9f7c ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area") introduced alternative instructions for XSAVES/XRSTORS and commit: adb9d526 ("x86/xsaves: Add xsaves and xrstors support for booting time") added support for the XSAVES/XRSTORS instructions at boot time. Unfortunately both failed to properly protect them against faulting: The 'xstate_fault' macro will use the closest label named '1' backward and that ends up in the .altinstr_replacement section rather than in .text. This means that the kernel will never find in the __ex_table the .text address where this instruction might fault, leading to serious problems if userspace manages to trigger the fault. Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com> Signed-off-by: NJamie Iles <jamie.iles@oracle.com> [ Improved the changelog, fixed some whitespace noise. ] Acked-by: NBorislav Petkov <bp@alien8.de> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: <stable@vger.kernel.org> Cc: Allan Xavier <mr.a.xavier@gmail.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: adb9d526 ("x86/xsaves: Add xsaves and xrstors support for booting time") Fixes: f31a9f7c ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area") Signed-off-by: NIngo Molnar <mingo@kernel.org>
-
- 20 2月, 2015 2 次提交
-
-
由 Jiri Kosina 提交于
Commit f47233c2 ("x86/mm/ASLR: Propagate base load address calculation") causes PAGE_SIZE redefinition warnings for UML subarch builds. This is caused by added includes that were leftovers from previous patch versions are are not actually needed (especially page_types.h inlcude in module.c). Drop those stray includes. Reported-by: Nkbuild test robot <fengguang.wu@intel.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz> Cc: Borislav Petkov <bp@suse.de> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Kees Cook <keescook@chromium.org> Link: http://lkml.kernel.org/r/alpine.LNX.2.00.1502201017240.28769@pobox.suse.czSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 David Vrabel 提交于
Since _PAGE_PROTNONE aliases _PAGE_GLOBAL it is only valid if _PAGE_PRESENT is clear. Make pte_protnone() and pmd_protnone() check for this. This fixes a 64-bit Xen PV guest regression introduced by 8a0516ed ("mm: convert p[te|md]_numa users to p[te|md]_protnone_numa"). Any userspace process would endlessly fault. In a 64-bit PV guest, userspace page table entries have _PAGE_GLOBAL set by the hypervisor. This meant that any fault on a present userspace entry (e.g., a write to a read-only mapping) would be misinterpreted as a NUMA hinting fault and the fault would not be correctly handled, resulting in the access endlessly faulting. Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 2月, 2015 3 次提交
-
-
由 Jiri Kosina 提交于
Commit: e2b32e67 ("x86, kaslr: randomize module base load address") makes the base address for module to be unconditionally randomized in case when CONFIG_RANDOMIZE_BASE is defined and "nokaslr" option isn't present on the commandline. This is not consistent with how choose_kernel_location() decides whether it will randomize kernel load base. Namely, CONFIG_HIBERNATION disables kASLR (unless "kaslr" option is explicitly specified on kernel commandline), which makes the state space larger than what module loader is looking at. IOW CONFIG_HIBERNATION && CONFIG_RANDOMIZE_BASE is a valid config option, kASLR wouldn't be applied by default in that case, but module loader is not aware of that. Instead of fixing the logic in module.c, this patch takes more generic aproach. It introduces a new bootparam setup data_type SETUP_KASLR and uses that to pass the information whether kaslr has been applied during kernel decompression, and sets a global 'kaslr_enabled' variable accordingly, so that any kernel code (module loading, livepatching, ...) can make decisions based on its value. x86 module loader is converted to make use of this flag. Signed-off-by: NJiri Kosina <jkosina@suse.cz> Acked-by: NKees Cook <keescook@chromium.org> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz [ Always dump correct kaslr status when panicking ] Signed-off-by: NBorislav Petkov <bp@suse.de>
-
由 Bryan O'Donoghue 提交于
Intel's Quark X1000 SoC contains a set of registers called Isolated Memory Regions. IMRs are accessed over the IOSF mailbox interface. IMRs are areas carved out of memory that define read/write access rights to the various system agents within the Quark system. For a given agent in the system it is possible to specify if that agent may read or write an area of memory defined by an IMR with a granularity of 1 KiB. Quark_SecureBootPRM_330234_001.pdf section 4.5 details the concept of IMRs quark-x1000-datasheet.pdf section 12.7.4 details the implementation of IMRs in silicon. eSRAM flush, CPU Snoop write-only, CPU SMM Mode, CPU non-SMM mode, RMU and PCIe Virtual Channels (VC0 and VC1) can have individual read/write access masks applied to them for a given memory region in Quark X1000. This enables IMRs to treat each memory transaction type listed above on an individual basis and to filter appropriately based on the IMR access mask for the memory region. Quark supports eight IMRs. Since all of the DMA capable SoC components in the X1000 are mapped to VC0 it is possible to define sections of memory as invalid for DMA write operations originating from Ethernet, USB, SD and any other DMA capable south-cluster component on VC0. Similarly it is possible to mark kernel memory as non-SMM mode read/write only or to mark BIOS runtime memory as SMM mode accessible only depending on the particular memory footprint on a given system. On an IMR violation Quark SoC X1000 systems are configured to reset the system, so ensuring that the IMR memory map is consistent with the EFI provided memory map is critical to ensure no IMR violations reset the system. The API for accessing IMRs is based on MTRR code but doesn't provide a /proc or /sys interface to manipulate IMRs. Defining the size and extent of IMRs is exclusively the domain of in-kernel code. Quark firmware sets up a series of locked IMRs around pieces of memory that firmware owns such as ACPI runtime data. During boot a series of unlocked IMRs are placed around items in memory to guarantee no DMA modification of those items can take place. Grub also places an unlocked IMR around the kernel boot params data structure and compressed kernel image. It is necessary for the kernel to tear down all unlocked IMRs in order to ensure that the kernel's view of memory passed via the EFI memory map is consistent with the IMR memory map. Without tearing down all unlocked IMRs on boot transitory IMRs such as those used to protect the compressed kernel image will cause IMR violations and system reboots. The IMR init code tears down all unlocked IMRs and sets a protective IMR around the kernel .text and .rodata as one contiguous block. This sanitizes the IMR memory map with respect to the EFI memory map and protects the read-only portions of the kernel from unwarranted DMA access. Tested-by: NOng, Boon Leong <boon.leong.ong@intel.com> Signed-off-by: NBryan O'Donoghue <pure.logic@nexus-software.ie> Reviewed-by: NAndy Shevchenko <andy.schevchenko@gmail.com> Reviewed-by: NDarren Hart <dvhart@linux.intel.com> Reviewed-by: NOng, Boon Leong <boon.leong.ong@intel.com> Cc: andy.shevchenko@gmail.com Cc: dvhart@infradead.org Link: http://lkml.kernel.org/r/1422635379-12476-2-git-send-email-pure.logic@nexus-software.ieSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Ricardo Ribalda Delgado 提交于
Without this patch: LD init/built-in.o arch/x86/built-in.o: In function `dtb_lapic_setup': kernel/devicetree.c:155: undefined reference to `apic_force_enable' Makefile:923: recipe for target 'vmlinux' failed make: *** [vmlinux] Error 1 Signed-off-by: NRicardo Ribalda Delgado <ricardo.ribalda@gmail.com> Reviewed-by: NMaciej W. Rozycki <macro@linux-mips.org> Cc: David Rientjes <rientjes@google.com> Cc: Jan Beulich <JBeulich@suse.com> Link: http://lkml.kernel.org/r/1422905231-16067-1-git-send-email-ricardo.ribalda@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 18 2月, 2015 1 次提交
-
-
由 Raghavendra K T 提交于
Paravirt spinlock clears slowpath flag after doing unlock. As explained by Linus currently it does: prev = *lock; add_smp(&lock->tickets.head, TICKET_LOCK_INC); /* add_smp() is a full mb() */ if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG)) __ticket_unlock_slowpath(lock, prev); which is *exactly* the kind of things you cannot do with spinlocks, because after you've done the "add_smp()" and released the spinlock for the fast-path, you can't access the spinlock any more. Exactly because a fast-path lock might come in, and release the whole data structure. Linus suggested that we should not do any writes to lock after unlock(), and we can move slowpath clearing to fastpath lock. So this patch implements the fix with: 1. Moving slowpath flag to head (Oleg): Unlocked locks don't care about the slowpath flag; therefore we can keep it set after the last unlock, and clear it again on the first (try)lock. -- this removes the write after unlock. note that keeping slowpath flag would result in unnecessary kicks. By moving the slowpath flag from the tail to the head ticket we also avoid the need to access both the head and tail tickets on unlock. 2. use xadd to avoid read/write after unlock that checks the need for unlock_kick (Linus): We further avoid the need for a read-after-release by using xadd; the prev head value will include the slowpath flag and indicate if we need to do PV kicking of suspended spinners -- on modern chips xadd isn't (much) more expensive than an add + load. Result: setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest) benchmark overcommit %improve kernbench 1x -0.13 kernbench 2x 0.02 dbench 1x -1.77 dbench 2x -0.63 [Jeremy: Hinted missing TICKET_LOCK_INC for kick] [Oleg: Moved slowpath flag to head, ticket_equals idea] [PeterZ: Added detailed changelog] Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Reported-by: NSasha Levin <sasha.levin@oracle.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NOleg Nesterov <oleg@redhat.com> Cc: Andrew Jones <drjones@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christoph Lameter <cl@linux.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jones <davej@redhat.com> Cc: David Vrabel <david.vrabel@citrix.com> Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Waiman Long <Waiman.Long@hp.com> Cc: a.ryabinin@samsung.com Cc: dave@stgolabs.net Cc: hpa@zytor.com Cc: jasowang@redhat.com Cc: jeremy@goop.org Cc: paul.gortmaker@windriver.com Cc: riel@redhat.com Cc: tglx@linutronix.de Cc: waiman.long@hp.com Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 14 2月, 2015 3 次提交
-
-
由 Andrey Ryabinin 提交于
Stack instrumentation allows to detect out of bounds memory accesses for variables allocated on stack. Compiler adds redzones around every variable on stack and poisons redzones in function's prologue. Such approach significantly increases stack usage, so all in-kernel stacks size were doubled. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Ryabinin 提交于
Recently instrumentation of builtin functions calls was removed from GCC 5.0. To check the memory accessed by such functions, userspace asan always uses interceptors for them. So now we should do this as well. This patch declares memset/memmove/memcpy as weak symbols. In mm/kasan/kasan.c we have our own implementation of those functions which checks memory before accessing it. Default memset/memmove/memcpy now now always have aliases with '__' prefix. For files that built without kasan instrumentation (e.g. mm/slub.c) original mem* replaced (via #define) with prefixed variants, cause we don't want to check memory accesses there. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrey Ryabinin 提交于
This patch adds arch specific code for kernel address sanitizer. 16TB of virtual addressed used for shadow memory. It's located in range [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup stacks. At early stage we map whole shadow region with zero page. Latter, after pages mapped to direct mapping address range we unmap zero pages from corresponding shadow (see kasan_map_shadow()) and allocate and map a real shadow memory reusing vmemmap_populate() function. Also replace __pa with __pa_nodebug before shadow initialized. __pa with CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr) __phys_addr is instrumented, so __asan_load could be called before shadow area initialized. Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Konstantin Serebryany <kcc@google.com> Cc: Dmitry Chernenkov <dmitryc@google.com> Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com> Cc: Yuri Gribov <tetra2005@gmail.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Christoph Lameter <cl@linux.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Jim Davis <jim.epost@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 2月, 2015 4 次提交
-
-
由 Andy Lutomirski 提交于
If an attacker can cause a controlled kernel stack overflow, overwriting the restart block is a very juicy exploit target. This is because the restart_block is held in the same memory allocation as the kernel stack. Moving the restart block to struct task_struct prevents this exploit by making the restart_block harder to locate. Note that there are other fields in thread_info that are also easy targets, at least on some architectures. It's also a decent simplification, since the restart code is more or less identical on all architectures. [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack] Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: David Miller <davem@davemloft.net> Acked-by: NRichard Weinberger <richard@nod.at> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: Vineet Gupta <vgupta@synopsys.com> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Haavard Skinnemoen <hskinnemoen@gmail.com> Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no> Cc: Steven Miao <realmz6@gmail.com> Cc: Mark Salter <msalter@redhat.com> Cc: Aurelien Jacquiot <a-jacquiot@ti.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: David Howells <dhowells@redhat.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Michal Simek <monstr@monstr.eu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Jonas Bonn <jonas@southpole.se> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Chen Liqin <liqin.linux@gmail.com> Cc: Lennox Wu <lennox.wu@gmail.com> Cc: Chris Metcalf <cmetcalf@ezchip.com> Cc: Guan Xuetao <gxt@mprc.pku.edu.cn> Cc: Chris Zankel <chris@zankel.net> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Guenter Roeck <linux@roeck-us.net> Signed-off-by: NJames Hogan <james.hogan@imgtec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Commit b38af472 ("x86,mm: fix pte_special versus pte_numa") adjusted the pte_special check to take into account that a special pte had SPECIAL and neither PRESENT nor PROTNONE. Now that NUMA hinting PTEs are no longer modifying _PAGE_PRESENT it should be safe to restore the original pte_special behaviour. Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This patch removes the NUMA PTE bits and associated helpers. As a side-effect it increases the maximum possible swap space on x86-64. One potential source of problems is races between the marking of PTEs PROT_NONE, NUMA hinting faults and migration. It must be guaranteed that a PTE being protected is not faulted in parallel, seen as a pte_none and corrupting memory. The base case is safe but transhuge has problems in the past due to an different migration mechanism and a dependance on page lock to serialise migrations and warrants a closer look. task_work hinting update parallel fault ------------------------ -------------- change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault pmd_none do_huge_pmd_anonymous_page read? pmd_lock blocks until hinting complete, fail !pmd_none test write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none pmd_modify set_pmd_at task_work hinting update parallel migration ------------------------ ------------------ change_pmd_range change_huge_pmd __pmd_trans_huge_lock pmdp_get_and_clear __handle_mm_fault do_huge_pmd_numa_page migrate_misplaced_transhuge_page pmd_lock waits for updates to complete, recheck pmd_same pmd_modify set_pmd_at Both of those are safe and the case where a transhuge page is inserted during a protection update is unchanged. The case where two processes try migrating at the same time is unchanged by this series so should still be ok. I could not find a case where we are accidentally depending on the PTE not being cleared and flushed. If one is missed, it'll manifest as corruption problems that start triggering shortly after this series is merged and only happen when NUMA balancing is enabled. Signed-off-by: NMel Gorman <mgorman@suse.de> Tested-by: NSasha Levin <sasha.levin@oracle.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This is a preparatory patch that introduces protnone helpers for automatic NUMA balancing. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Acked-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Tested-by: NSasha Levin <sasha.levin@oracle.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Dave Jones <davej@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com> Cc: Paul Mackerras <paulus@samba.org> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 2月, 2015 1 次提交
-
-
由 Kirill A. Shutemov 提交于
LKP has triggered a compiler warning after my recent patch "mm: account pmd page tables to the process": mm/mmap.c: In function 'exit_mmap': >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default] The code: > 2857 WARN_ON(mm_nr_pmds(mm) > 2858 round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT); In this, on tile, we have FIRST_USER_ADDRESS defined as 0. round_up() has the same type -- int. PUD_SHIFT. I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned long. On every arch for consistency. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 2月, 2015 2 次提交
-
-
由 Rusty Russell 提交于
Disappointing, as this was kind of neat (especially getting to use RCU to manage the address -> eventfd mapping). But now the devices are PCI handled in userspace, we get rid of both the NOTIFY hypercall and the interface to connect an eventfd. Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
-
由 Kirill A. Shutemov 提交于
We've replaced remap_file_pages(2) implementation with emulation. Nobody creates non-linear mapping anymore. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 2月, 2015 1 次提交
-
-
由 Len Brown 提交于
The Processor generation code-named Haswell added MSR_{CORE | GFX | RING}_PERF_LIMIT_REASONS to explain when and how the processor limits frequency. turbostat -v will now decode these bits. Each MSR has an "Active" set of bits which describe current conditions, and a "Logged" set of bits, which describe what has happened since last cleared. Turbostat currently doesn't clear the log bits. Signed-off-by: NLen Brown <len.brown@intel.com>
-
- 06 2月, 2015 1 次提交
-
-
由 Paolo Bonzini 提交于
This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads---in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid that the guest sends pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in. With this patch performance reaches 60-65% of bare metal and, more important, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU. Of course there is some price to pay; every time an otherwise idle vCPUs is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU. The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids that Linux schedules the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second. Of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick. The effect is also visible on Marcelo's recently-introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance. Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 05 2月, 2015 2 次提交
-
-
由 Jiang Liu 提交于
Some PCI device drivers assume that pci_dev->irq won't change after calling pci_disable_device() and pci_enable_device() during suspend and resume. Commit c03b3b07 ("x86, irq, mpparse: Release IOAPIC pin when PCI device is disabled") frees PCI IRQ resources when pci_disable_device() is called and reallocate IRQ resources when pci_enable_device() is called again. This breaks above assumption. So commit 3eec5952 ("x86, irq, PCI: Keep IRQ assignment for PCI devices during suspend/hibernation") and 9eabc99a ("x86, irq, PCI: Keep IRQ assignment for runtime power management") fix the issue by avoiding freeing/reallocating IRQ resources during PCI device suspend/resume. They achieve this by checking dev.power.is_prepared and dev.power.runtime_status. PM maintainer, Rafael, then pointed out that it's really an ugly fix which leaking PM internal state information to IRQ subsystem. Recently David Vrabel <david.vrabel@citrix.com> also reports an regression in pciback driver caused by commit cffe0a2b ("x86, irq: Keep balance of IOAPIC pin reference count"). Please refer to: http://lkml.org/lkml/2015/1/14/546 So this patch refine the way to release PCI IRQ resources. Instead of releasing PCI IRQ resources in pci_disable_device()/ pcibios_disable_device(), we now release it at driver unbinding notification BUS_NOTIFY_UNBOUND_DRIVER. In other word, we only release PCI IRQ resources when there's no driver bound to the PCI device, and it keeps the assumption that pci_dev->irq won't through multiple invocation of pci_enable_device()/pci_disable_device(). Signed-off-by: NJiang Liu <jiang.liu@linux.intel.com> Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
-
由 Tiejun Chen 提交于
After f78146b0, "KVM: Fix page-crossing MMIO", and 87da7e66, "KVM: x86: fix vcpu->mmio_fragments overflow", actually KVM_MMIO_SIZE is gone. Signed-off-by: NTiejun Chen <tiejun.chen@intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 04 2月, 2015 6 次提交
-
-
由 Andy Lutomirski 提交于
While perfmon2 is a sufficiently evil library (it pokes MSRs directly) that breaking it is fair game, it's still useful, so we might as well try to support it. This allows users to write 2 to /sys/devices/cpu/rdpmc to disable all rdpmc protection so that hack like perfmon2 can continue to work. At some point, if perf_event becomes fast enough to replace perfmon2, then this can go. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/caac3c1c707dcca48ecbc35f4def21495856f479.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
We currently allow any process to use rdpmc. This significantly weakens the protection offered by PR_TSC_DISABLED, and it could be helpful to users attempting to exploit timing attacks. Since we can't enable access to individual counters, use a very coarse heuristic to limit access to rdpmc: allow access only when a perf_event is mmapped. This protects seccomp sandboxes. There is plenty of room to further tighen these restrictions. For example, this allows rdpmc for any x86_pmu event, but it's only useful for self-monitoring tasks. As a side effect, cap_user_rdpmc will now be false for AMD uncore events. This isn't a real regression, since .event_idx is disabled for these events anyway for the time being. Whenever that gets re-added, the cap_user_rdpmc code can be adjusted or refactored accordingly. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/a2bdb3cf3a1d70c26980d7c6dddfbaa69f3182bf.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
The code is correct, but only for a rather subtle reason. This confused me for quite a while when I read switch_mm, so clarify the code to avoid confusing other people, too. TBH, I wouldn't be surprised if this code was only correct by accident. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/0db86397f968996fb772c443c251415b0b430ddd.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
Context switches and TLB flushes can change individual bits of CR4. CR4 reads take several cycles, so store a shadow copy of CR4 in a per-cpu variable. To avoid wasting a cache line, I added the CR4 shadow to cpu_tlbstate, which is already touched in switch_mm. The heaviest users of the cr4 shadow will be switch_mm and __switch_to_xtra, and __switch_to_xtra is called shortly after switch_mm during context switch, so the cacheline is likely to be hot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/3a54dd3353fffbf84804398e00dfdc5b7c1afd7d.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Andy Lutomirski 提交于
CR4 manipulation was split, seemingly at random, between direct (write_cr4) and using a helper (set/clear_in_cr4). Unfortunately, the set_in_cr4 and clear_in_cr4 helpers also poke at the boot code, which only a small subset of users actually wanted. This patch replaces all cr4 access in functions that don't leave cr4 exactly the way they found it with new helpers cr4_set_bits, cr4_clear_bits, and cr4_set_bits_and_update_boot. Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Reviewed-by: NThomas Gleixner <tglx@linutronix.de> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Vince Weaver <vince@deater.net> Cc: "hillf.zj" <hillf.zj@alibaba-inc.com> Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Cc: Paul Mackerras <paulus@samba.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/495a10bdc9e67016b8fd3945700d46cfd5c12c2f.1414190806.git.luto@amacapital.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Josh Poimboeuf 提交于
Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of the config and the code more consistent. Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com> Reviewed-by: NJingoo Han <jg1.han@samsung.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 03 2月, 2015 2 次提交
-
-
由 Andy Shevchenko 提交于
Intel Moorestown platform support was removed few years ago. This is a follow up which removes Moorestown specific code for the serial devices. It includes mrst_max3110 and earlyprintk bits. This was used on SFI (Medfield, Clovertrail) based platforms as well, though new ones use normal serial interface for the console service. Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Acked-by: NDavid Cohen <david.a.cohen@linux.intel.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
由 Marcelo Tosatti 提交于
Revert 7c6a98df, given that testing PIR is not necessary anymore. Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 30 1月, 2015 1 次提交
-
-
由 Kai Huang 提交于
This patch adds PML support in VMX. A new module parameter 'enable_pml' is added to allow user to enable/disable it manually. Signed-off-by: NKai Huang <kai.huang@linux.intel.com> Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 29 1月, 2015 3 次提交
-
-
由 Kai Huang 提交于
This patch adds new kvm_x86_ops dirty logging hooks to enable/disable dirty logging for particular memory slot, and to flush potentially logged dirty GPAs before reporting slot->dirty_bitmap to userspace. kvm x86 common code calls these hooks when they are available so PML logic can be hidden to VMX specific. SVM won't be impacted as these hooks remain NULL there. Signed-off-by: NKai Huang <kai.huang@linux.intel.com> Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Kai Huang 提交于
This patch changes the second parameter of kvm_mmu_slot_remove_write_access from 'slot id' to 'struct kvm_memory_slot *' to align with kvm_x86_ops dirty logging hooks, which will be introduced in further patch. Better way is to change second parameter of kvm_arch_commit_memory_region from 'struct kvm_userspace_memory_region *' to 'struct kvm_memory_slot * new', but it requires changes on other non-x86 ARCH too, so avoid it now. Signed-off-by: NKai Huang <kai.huang@linux.intel.com> Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 Kai Huang 提交于
This patch adds new mmu layer functions to clear/set D-bit for memory slot, and to write protect superpages for memory slot. In case of PML, CPU logs the dirty GPA automatically to PML buffer when CPU updates D-bit from 0 to 1, therefore we don't have to write protect 4K pages, instead, we only need to clear D-bit in order to log that GPA. For superpages, we still write protect it and let page fault code to handle dirty page logging, as we still need to split superpage to 4K pages in PML. As PML is always enabled during guest's lifetime, to eliminate unnecessary PML GPA logging, we set D-bit manually for the slot with dirty logging disabled. Signed-off-by: NKai Huang <kai.huang@linux.intel.com> Reviewed-by: NXiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
- 28 1月, 2015 2 次提交
-
-
由 David Vrabel 提交于
The scratch frame mappings for ballooned pages and the m2p override are broken. Remove them in preparation for replacing them with simpler mechanisms that works. The scratch pages did not ensure that the page was not in use. In particular, the foreign page could still be in use by hardware. If the guest reused the frame the hardware could read or write that frame. The m2p override did not handle the same frame being granted by two different grant references. Trying an M2P override lookup in this case is impossible. With the m2p override removed, the grant map/unmap for the kernel mappings (for x86 PV) can be easily batched in set_foreign_p2m_mapping() and clear_foreign_p2m_mapping(). Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com> Reviewed-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
-
由 David Vrabel 提交于
When unmapping grants, instead of converting the kernel map ops to unmap ops on the fly, pre-populate the set of unmap ops. This allows the grant unmap for the kernel mappings to be trivially batched in the future. Signed-off-by: NDavid Vrabel <david.vrabel@citrix.com> Reviewed-by: NStefano Stabellini <stefano.stabellini@eu.citrix.com>
-
- 26 1月, 2015 2 次提交
-
-
由 Nadav Amit 提交于
The IRET instruction should clear NMI masking, but the current implementation does not do so. Signed-off-by: NNadav Amit <namit@cs.technion.ac.il> Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
-
由 K. Y. Srinivasan 提交于
Implement a clockevent device based on the timer support available on Hyper-V. In this version of the patch I have addressed Jason's review comments. Signed-off-by: NK. Y. Srinivasan <kys@microsoft.com> Reviewed-by: NJason Wang <jasowang@redhat.com> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 23 1月, 2015 2 次提交
-
-
由 Andy Lutomirski 提交于
The Witcher 2 did something like this to allocate a TLS segment index: struct user_desc u_info; bzero(&u_info, sizeof(u_info)); u_info.entry_number = (uint32_t)-1; syscall(SYS_set_thread_area, &u_info); Strictly speaking, this code was never correct. It should have set read_exec_only and seg_not_present to 1 to indicate that it wanted to find a free slot without putting anything there, or it should have put something sensible in the TLS slot if it wanted to allocate a TLS entry for real. The actual effect of this code was to allocate a bogus segment that could be used to exploit espfix. The set_thread_area hardening patches changed the behavior, causing set_thread_area to return -EINVAL and crashing the game. This changes set_thread_area to interpret this as a request to find a free slot and to leave it empty, which isn't *quite* what the game expects but should be close enough to keep it working. In particular, using the code above to allocate two segments will allocate the same segment both times. According to FrostbittenKing on Github, this fixes The Witcher 2. If this somehow still causes problems, we could instead allocate a limit==0 32-bit data segment, but that seems rather ugly to me. Fixes: 41bdc785 x86/tls: Validate TLS entries to protect espfix Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Cc: stable@vger.kernel.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-
由 Andy Lutomirski 提交于
32-bit programs don't have an lm bit in their ABI, so they can't reliably cause LDT_empty to return true without resorting to memset. They shouldn't need to do this. This should fix a longstanding, if minor, issue in all 64-bit kernels as well as a potential regression in the TLS hardening code. Fixes: 41bdc785 x86/tls: Validate TLS entries to protect espfix Cc: stable@vger.kernel.org Signed-off-by: NAndy Lutomirski <luto@amacapital.net> Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/72a059de55e86ad5e2935c80aa91880ddf19d07c.1421954363.git.luto@amacapital.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
-