1. 01 April 2015, 1 commit
    • efi: Clean up the efi_call_phys_[prolog|epilog]() save/restore interaction · 744937b0
      Ingo Molnar authored
      Currently x86-64 efi_call_phys_prolog() saves into a global variable (save_pgd),
      and efi_call_phys_epilog() restores the kernel pagetables from that global
      variable.
      
      Change this to a cleaner save/restore pattern where the saving function returns
      the saved object and the restore function restores that.
      
      Apply the same concept to the 32-bit code as well.
      
      As an added bonus, this approach also allows us to express the
      !efi_enabled(EFI_OLD_MEMMAP) case in a clean fashion, via a 'NULL'
      return value.
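
      A minimal sketch of the resulting calling convention (prototypes
      approximated from the changelog, not quoted from the patch): the save
      function returns the saved object, NULL meaning nothing was saved, and
      the restore function consumes exactly that return value.

        pgd_t * __init efi_call_phys_prolog(void)
        {
                pgd_t *save_pgd;

                if (!efi_enabled(EFI_OLD_MEMMAP))
                        return NULL;                    /* nothing to save */

                save_pgd = kmalloc(PAGE_SIZE, GFP_KERNEL);
                /* ... stash the kernel pagetables, install the 1:1 EFI map ... */
                return save_pgd;
        }

        void __init efi_call_phys_epilog(pgd_t *save_pgd)
        {
                if (!save_pgd)                          /* nothing was saved */
                        return;
                /* ... restore the kernel pagetables from save_pgd ... */
                kfree(save_pgd);
        }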
      
      Cc: Tapasweni Pathak <tapaswenipathak@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Matt Fleming <matt.fleming@intel.com>
      744937b0
  2. 06 March 2015, 1 commit
    • x86/fpu/xsaves: Fix improper uses of __ex_table · 06c8173e
      Quentin Casasnovas authored
      Commit:
      
        f31a9f7c ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area")
      
      introduced alternative instructions for XSAVES/XRSTORS and commit:
      
        adb9d526 ("x86/xsaves: Add xsaves and xrstors support for booting time")
      
      added support for the XSAVES/XRSTORS instructions at boot time.
      
      Unfortunately both failed to properly protect them against faulting:
      
      The 'xstate_fault' macro uses the closest preceding label named '1',
      and that label ends up in the .altinstr_replacement section rather
      than in .text. This means the kernel will never find, in the
      __ex_table, the .text address at which this instruction might fault,
      leading to serious problems if userspace manages to trigger the
      fault.
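
      For context, the xstate_fault helper of that era had roughly this
      shape (a hedged approximation, not a verbatim quote): the fixup at
      label 3 runs on a fault and jumps back to label 2, and the __ex_table
      entry maps label 1, the faulting instruction, to label 3. When XSAVES
      was emitted through alternatives, the nearest preceding '1:' lived in
      .altinstr_replacement, so the recorded address never matched the .text
      address the CPU actually faulted at.

        #define xstate_fault                    \
                ".section .fixup,\"ax\"\n"      \
                "3:  movl $-1, %[err]\n"        \
                "    jmp  2b\n"                 \
                ".previous\n"                   \
                _ASM_EXTABLE(1b, 3b)            \
                : [err] "=r" (err)
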
      Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      [ Improved the changelog, fixed some whitespace noise. ]
      Acked-by: Borislav Petkov <bp@alien8.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>
      Cc: Allan Xavier <mr.a.xavier@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: adb9d526 ("x86/xsaves: Add xsaves and xrstors support for booting time")
      Fixes: f31a9f7c ("x86/xsaves: Use xsaves/xrstors to save and restore xsave area")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      06c8173e
  3. 20 February 2015, 2 commits
  4. 19 February 2015, 3 commits
    • x86/mm/ASLR: Propagate base load address calculation · f47233c2
      Jiri Kosina authored
      Commit:
      
        e2b32e67 ("x86, kaslr: randomize module base load address")
      
      makes the module base load address unconditionally randomized when
      CONFIG_RANDOMIZE_BASE is defined and the "nokaslr" option isn't
      present on the command line.
      
      This is not consistent with how choose_kernel_location() decides whether
      it will randomize kernel load base.
      
      Namely, CONFIG_HIBERNATION disables kASLR (unless the "kaslr" option
      is explicitly specified on the kernel command line), which makes the
      state space larger than what the module loader is looking at. In other
      words, CONFIG_HIBERNATION && CONFIG_RANDOMIZE_BASE is a valid
      configuration in which kASLR wouldn't be applied by default, but the
      module loader is not aware of that.
      
      Instead of fixing the logic in module.c, this patch takes a more
      generic approach. It introduces a new bootparam setup_data type,
      SETUP_KASLR, and uses it to pass along whether KASLR was applied
      during kernel decompression, setting a global 'kaslr_enabled' variable
      accordingly so that any kernel code (module loading, livepatching,
      ...) can make decisions based on its value.

      The x86 module loader is converted to make use of this flag.
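
      A sketch of how a consumer such as the module loader can key off the
      flag (names follow the changelog; the offset formula is illustrative
      only):

        /* Set while parsing the SETUP_KASLR boot data. */
        extern bool kaslr_enabled;

        static unsigned long get_module_load_offset(void)
        {
                /* Randomize only when KASLR actually ran during decompression. */
                if (!kaslr_enabled)
                        return 0;

                /* Pick a page-aligned offset inside the module mapping area. */
                return (get_random_int() % 1024 + 1) * PAGE_SIZE;
        }
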
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz
      [ Always dump correct kaslr status when panicking ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      f47233c2
    • x86/intel/quark: Add Isolated Memory Regions for Quark X1000 · 28a375df
      Bryan O'Donoghue authored
      Intel's Quark X1000 SoC contains a set of registers called
      Isolated Memory Regions. IMRs are accessed over the IOSF mailbox
      interface. IMRs are areas carved out of memory that define
      read/write access rights to the various system agents within the
      Quark system. For a given agent in the system it is possible to
      specify if that agent may read or write an area of memory
      defined by an IMR with a granularity of 1 KiB.
      
      Quark_SecureBootPRM_330234_001.pdf section 4.5 details the concept of
      IMRs; quark-x1000-datasheet.pdf section 12.7.4 details the
      implementation of IMRs in silicon.
      
      eSRAM flush, CPU Snoop write-only, CPU SMM Mode, CPU non-SMM
      mode, RMU and PCIe Virtual Channels (VC0 and VC1) can have
      individual read/write access masks applied to them for a given
      memory region in Quark X1000. This enables IMRs to treat each
      memory transaction type listed above on an individual basis and
      to filter appropriately based on the IMR access mask for the
      memory region. Quark supports eight IMRs.
      
      Since all of the DMA capable SoC components in the X1000 are
      mapped to VC0 it is possible to define sections of memory as
      invalid for DMA write operations originating from Ethernet, USB,
      SD and any other DMA capable south-cluster component on VC0.
      Similarly it is possible to mark kernel memory as non-SMM mode
      read/write only or to mark BIOS runtime memory as SMM mode
      accessible only depending on the particular memory footprint on
      a given system.
      
      Quark SoC X1000 systems are configured to reset on an IMR violation,
      so keeping the IMR memory map consistent with the EFI-provided memory
      map is critical to ensure that no IMR violation resets the system.
      
      The API for accessing IMRs is based on MTRR code but doesn't
      provide a /proc or /sys interface to manipulate IMRs. Defining
      the size and extent of IMRs is exclusively the domain of
      in-kernel code.
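
      A hypothetical usage sketch of such an in-kernel API (function and
      mask names approximate what the patch adds and are illustrative only):

        /* Place an IMR over the kernel's read-only sections so that only
         * CPU (non-SMM) reads and writes are allowed; all DMA agents are
         * denied. Left unlocked so it can be replaced later if needed. */
        static int __init imr_protect_kernel_text(void)
        {
                phys_addr_t base = virt_to_phys(&_text);
                size_t size = ALIGN(virt_to_phys(&__end_rodata) - base, SZ_1K);

                return imr_add_range(base, size, IMR_CPU, IMR_CPU, false);
        }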
      
      Quark firmware sets up a series of locked IMRs around pieces of
      memory that firmware owns such as ACPI runtime data. During boot
      a series of unlocked IMRs are placed around items in memory to
      guarantee no DMA modification of those items can take place.
      Grub also places an unlocked IMR around the kernel boot params
      data structure and compressed kernel image. It is necessary for
      the kernel to tear down all unlocked IMRs in order to ensure
      that the kernel's view of memory passed via the EFI memory map
      is consistent with the IMR memory map. Without tearing down all
      unlocked IMRs on boot, transitory IMRs, such as those used to protect
      the compressed kernel image, will cause IMR violations and system
      reboots.
      
      The IMR init code tears down all unlocked IMRs and sets a
      protective IMR around the kernel .text and .rodata as one
      contiguous block. This sanitizes the IMR memory map with respect
      to the EFI memory map and protects the read-only portions of the
      kernel from unwarranted DMA access.
      Tested-by: Ong, Boon Leong <boon.leong.ong@intel.com>
      Signed-off-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
      Reviewed-by: Andy Shevchenko <andy.schevchenko@gmail.com>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Reviewed-by: Ong, Boon Leong <boon.leong.ong@intel.com>
      Cc: andy.shevchenko@gmail.com
      Cc: dvhart@infradead.org
      Link: http://lkml.kernel.org/r/1422635379-12476-2-git-send-email-pure.logic@nexus-software.ie
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      28a375df
    • x86/apic: Fix the devicetree build in certain configs · b273c2c2
      Ricardo Ribalda Delgado authored
      Without this patch:
      
        LD      init/built-in.o
        arch/x86/built-in.o: In function `dtb_lapic_setup': kernel/devicetree.c:155:
        undefined reference to `apic_force_enable'
        Makefile:923: recipe for target 'vmlinux' failed
        make: *** [vmlinux] Error 1
      Signed-off-by: Ricardo Ribalda Delgado <ricardo.ribalda@gmail.com>
      Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Link: http://lkml.kernel.org/r/1422905231-16067-1-git-send-email-ricardo.ribalda@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b273c2c2
  5. 18 February 2015, 1 commit
    • x86/spinlocks/paravirt: Fix memory corruption on unlock · d6abfdb2
      Raghavendra K T authored
      The paravirt spinlock code clears the slowpath flag after doing the
      unlock. As explained by Linus, it currently does:
      
                      prev = *lock;
                      add_smp(&lock->tickets.head, TICKET_LOCK_INC);
      
                      /* add_smp() is a full mb() */
      
                      if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                              __ticket_unlock_slowpath(lock, prev);
      
      which is *exactly* the kind of things you cannot do with spinlocks,
      because after you've done the "add_smp()" and released the spinlock
      for the fast-path, you can't access the spinlock any more.  Exactly
      because a fast-path lock might come in, and release the whole data
      structure.
      
      Linus suggested that we should not do any writes to lock after unlock(),
      and we can move slowpath clearing to fastpath lock.
      
      So this patch implements the fix with:
      
       1. Move the slowpath flag to the head (Oleg):
          Unlocked locks don't care about the slowpath flag; therefore we can keep
          it set after the last unlock, and clear it again on the first (try)lock.
          This removes the write after unlock. Note that keeping the slowpath flag
          set can result in unnecessary kicks.
          By moving the slowpath flag from the tail to the head ticket we also avoid
          the need to access both the head and tail tickets on unlock.
      
       2. Use xadd to avoid a read/write after unlock when checking the need
          for unlock_kick (Linus):
          We further avoid the need for a read-after-release by using xadd;
          the previous head value will include the slowpath flag and indicate
          whether we need to do PV kicking of suspended spinners -- on modern
          chips xadd isn't (much) more expensive than an add + load (see the
          sketch below).
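
      A simplified sketch of the resulting unlock path (hedged; the real
      code also keeps a plain-add fast path for the non-paravirt case). The
      single xadd both releases the lock and returns the previous head, so
      nothing reads or writes the lock word after it has been released:

        static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
        {
                __ticket_t head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

                if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
                        head &= ~TICKET_SLOWPATH_FLAG;
                        __ticket_unlock_kick(lock, head + TICKET_LOCK_INC);
                }
        }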
      
      Result:
       setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
       benchmark overcommit %improve
       kernbench  1x           -0.13
       kernbench  2x            0.02
       dbench     1x           -1.77
       dbench     2x           -0.63
      
      [Jeremy: Hinted missing TICKET_LOCK_INC for kick]
      [Oleg: Moved slowpath flag to head, ticket_equals idea]
      [PeterZ: Added detailed changelog]
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: a.ryabinin@samsung.com
      Cc: dave@stgolabs.net
      Cc: hpa@zytor.com
      Cc: jasowang@redhat.com
      Cc: jeremy@goop.org
      Cc: paul.gortmaker@windriver.com
      Cc: riel@redhat.com
      Cc: tglx@linutronix.de
      Cc: waiman.long@hp.com
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d6abfdb2
  6. 14 February 2015, 3 commits
    • kasan: enable stack instrumentation · c420f167
      Andrey Ryabinin authored
      Stack instrumentation allows detection of out-of-bounds memory
      accesses to variables allocated on the stack.  The compiler adds
      redzones around every stack variable and poisons the redzones in the
      function's prologue.

      This approach significantly increases stack usage, so the size of all
      in-kernel stacks was doubled.
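
      The doubling amounts to roughly the following x86-64 definitions (a
      hedged sketch): one extra order of stack pages when KASAN is enabled,
      to make room for the compiler-inserted redzones.

        #ifdef CONFIG_KASAN
        #define KASAN_STACK_ORDER 1
        #else
        #define KASAN_STACK_ORDER 0
        #endif

        #define THREAD_SIZE_ORDER       (2 + KASAN_STACK_ORDER)
        #define THREAD_SIZE             (PAGE_SIZE << THREAD_SIZE_ORDER)
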
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c420f167
    • x86_64: kasan: add interceptors for memset/memmove/memcpy functions · 393f203f
      Andrey Ryabinin authored
      Instrumentation of calls to builtin functions was recently removed
      from GCC 5.0.  To check the memory accessed by such functions,
      userspace ASan always uses interceptors for them.
      
      So now we should do this as well.  This patch declares
      memset/memmove/memcpy as weak symbols.  In mm/kasan/kasan.c we have our
      own implementation of those functions which checks memory before accessing
      it.
      
      The default memset/memmove/memcpy now always have aliases with a '__'
      prefix.  For files built without kasan instrumentation (e.g.
      mm/slub.c) the original mem* calls are replaced (via #define) with the
      prefixed variants, because we don't want to check memory accesses
      there.
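
      A sketch of the interception scheme just described (helper names are
      assumed, not quoted from mm/kasan/kasan.c):

        /* In the kasan runtime: check both buffers, then defer to the
         * uninstrumented __memcpy alias for the actual copy. */
        void *memcpy(void *dst, const void *src, size_t len)
        {
                check_memory_region((unsigned long)src, len, false);  /* read  */
                check_memory_region((unsigned long)dst, len, true);   /* write */

                return __memcpy(dst, src, len);
        }

        /* In files compiled without instrumentation, the plain name is
         * redirected so no checks are performed there. */
        #define memcpy(dst, src, len) __memcpy(dst, src, len)
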
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      393f203f
    • x86_64: add KASan support · ef7f0d6a
      Andrey Ryabinin authored
      This patch adds arch specific code for kernel address sanitizer.
      
      16 TB of virtual address space is used for the shadow memory.  It is
      located in the range [ffffec0000000000 - fffffc0000000000], between
      vmemmap and the %esp fixup stacks.

      At an early stage we map the whole shadow region to the zero page.
      Later, after pages are mapped into the direct-mapping address range,
      we unmap the zero pages from the corresponding shadow (see
      kasan_map_shadow()) and allocate and map real shadow memory, reusing
      the vmemmap_populate() function.

      Also, __pa is replaced with __pa_nodebug before the shadow is
      initialized: with CONFIG_DEBUG_VIRTUAL=y, __pa makes an external
      function call (__phys_addr), __phys_addr is instrumented, and so
      __asan_load could be called before the shadow area is initialized.
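
      For reference, the shadow translation behind this layout is the usual
      KASan 1/8 mapping (a conceptual sketch; the offset constant is chosen
      so the kernel's shadow lands in the window quoted above):

        /* One shadow byte covers eight bytes of address space. */
        static inline void *kasan_mem_to_shadow(const void *addr)
        {
                return (void *)(((unsigned long)addr >> 3) + KASAN_SHADOW_OFFSET);
        }
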
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef7f0d6a
  7. 13 February 2015, 4 commits
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski authored
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f56141e3
    • x86: mm: restore original pte_special check · c819f37e
      Mel Gorman authored
      Commit b38af472 ("x86,mm: fix pte_special versus pte_numa") adjusted
      the pte_special check to take into account that a special pte had
      SPECIAL and neither PRESENT nor PROTNONE.  Now that NUMA hinting PTEs
      are no longer modifying _PAGE_PRESENT it should be safe to restore the
      original pte_special behaviour.
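
      The restored check is essentially the plain flag test (a hedged
      sketch of the x86 helper):

        /* With NUMA hinting no longer touching _PAGE_PRESENT, a special
         * pte can again be identified by the SPECIAL bit alone. */
        static inline int pte_special(pte_t pte)
        {
                return pte_flags(pte) & _PAGE_SPECIAL;
        }
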
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c819f37e
    • mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Mel Gorman authored
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe, but transhuge has had
      problems in the past due to a different migration mechanism and a
      dependence on the page lock to serialise migrations, and warrants a
      closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
    • mm: add p[te|md] protnone helpers for use by NUMA balancing · e7bb4b6d
      Mel Gorman authored
      This is a preparatory patch that introduces protnone helpers for automatic
      NUMA balancing.
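
      On x86 the new helpers likely boil down to a bit test of this shape
      (a hedged sketch; the exact definitions live in the patch). A
      "protnone" entry has _PAGE_PROTNONE set and _PAGE_PRESENT clear:
      still mapped from the kernel's point of view, but guaranteed to fault
      on user access, which is what NUMA hinting relies on.

        static inline int pte_protnone(pte_t pte)
        {
                return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                        == _PAGE_PROTNONE;
        }

        static inline int pmd_protnone(pmd_t pmd)
        {
                return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                        == _PAGE_PROTNONE;
        }
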
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e7bb4b6d
  8. 12 February 2015, 1 commit
  9. 11 February 2015, 2 commits
  10. 06 February 2015, 1 commit
    • kvm: add halt_poll_ns module parameter · f7819512
      Paolo Bonzini authored
      This patch introduces a new module parameter for the KVM module; when it
      is present, KVM attempts a bit of polling on every HLT before scheduling
      itself out via kvm_vcpu_block.
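
      Conceptually the polling works like the sketch below (hedged; this is
      not the actual kvm_vcpu_block() code, and the stat field name follows
      the changelog): spin for up to halt_poll_ns nanoseconds waiting for a
      wakeup condition before paying for a real schedule-out and back-in.

        static unsigned int halt_poll_ns = 500000;      /* module parameter */

        static bool poll_before_blocking(struct kvm_vcpu *vcpu)
        {
                ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);

                do {
                        if (kvm_arch_vcpu_runnable(vcpu)) {
                                ++vcpu->stat.halt_successful_poll;
                                return true;    /* work arrived while polling */
                        }
                        cpu_relax();
                } while (ktime_before(ktime_get(), stop));

                return false;                   /* fall back to blocking */
        }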
      
      This parameter helps a lot for latency-bound workloads---in particular
      I tested it with O_DSYNC writes with a battery-backed disk in the host.
      In this case, writes are fast (because the data doesn't have to go all
      the way to the platters) but they cannot be merged by either the host or
      the guest.  KVM's performance here is usually around 30% of bare metal,
      or 50% if you use cache=directsync or cache=writethrough (these
      parameters prevent the guest from sending pointless flush requests,
      and at the same time they are not slow because of the battery-backed
      cache).
      The bad performance happens because on every halt the host CPU decides
      to halt itself too.  When the interrupt comes, the vCPU thread is then
      migrated to a new physical CPU, and in general the latency is horrible
      because the vCPU thread has to be scheduled back in.
      
      With this patch performance reaches 60-65% of bare metal and, more
      important, 99% of what you get if you use idle=poll in the guest.  This
      means that the tunable gets rid of this particular bottleneck, and more
      work can be done to improve performance in the kernel or QEMU.
      
      Of course there is some price to pay; every time an otherwise idle
      vCPU is interrupted, it will poll unnecessarily and thus
      impose a little load on the host.  The above results were obtained with
      a mostly random value of the parameter (500000), and the load was around
      1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
      
      The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
      that can be used to tune the parameter.  It counts how many HLT
      instructions received an interrupt during the polling period; each
      successful poll avoids that Linux schedules the VCPU thread out and back
      in, and may also avoid a likely trip to C1 and back for the physical CPU.
      
      While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
      Of these halts, almost all are failed polls.  During the benchmark,
      instead, basically all halts end within the polling period, except a more
      or less constant stream of 50 per second coming from vCPUs that are not
      running the benchmark.  The wasted time is thus very low.  Things may
      be slightly different for Windows VMs, which have a ~10 ms timer tick.
      
      The effect is also visible on Marcelo's recently-introduced latency
      test for the TSC deadline timer.  Though of course a non-RT kernel has
      awful latency bounds, the latency of the timer is around 8000-10000 clock
      cycles compared to 20000-120000 without setting halt_poll_ns.  For the TSC
      deadline timer, thus, the effect is both a smaller average latency and
      a smaller variance.
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      f7819512
  11. 05 February 2015, 2 commits
    • x86/PCI: Refine the way to release PCI IRQ resources · b4b55cda
      Jiang Liu authored
      Some PCI device drivers assume that pci_dev->irq won't change after
      calling pci_disable_device() and pci_enable_device() during suspend and
      resume.
      
      Commit c03b3b07 ("x86, irq, mpparse: Release IOAPIC pin when
      PCI device is disabled") frees PCI IRQ resources when pci_disable_device()
      is called and reallocates IRQ resources when pci_enable_device() is
      called again. This breaks the above assumption. So commit 3eec5952
      ("x86, irq, PCI: Keep IRQ assignment for PCI devices during
      suspend/hibernation") and 9eabc99a ("x86, irq, PCI: Keep IRQ
      assignment for runtime power management") fix the issue by avoiding
      freeing/reallocating IRQ resources during PCI device suspend/resume.
      They achieve this by checking dev.power.is_prepared and
      dev.power.runtime_status.  The PM maintainer, Rafael, then pointed
      out that this is really an ugly fix which leaks PM-internal state
      information to the IRQ subsystem.
      
      Recently David Vrabel <david.vrabel@citrix.com> also reported a
      regression in the pciback driver caused by commit cffe0a2b ("x86, irq:
      Keep balance of IOAPIC pin reference count"). Please refer to:
      http://lkml.org/lkml/2015/1/14/546
      
      So this patch refines the way PCI IRQ resources are released. Instead
      of releasing them in pci_disable_device()/pcibios_disable_device(), we
      now release them on the driver-unbind notification
      BUS_NOTIFY_UNBOUND_DRIVER. In other words, PCI IRQ resources are only
      released when no driver is bound to the PCI device, which preserves
      the assumption that pci_dev->irq won't change across multiple
      invocations of pci_enable_device()/pci_disable_device().
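
      Roughly, the new release point looks like a bus notifier of this
      shape (a hedged sketch; the x86 helper that actually frees the IOAPIC
      pin is assumed):

        static int pci_irq_notifier(struct notifier_block *nb,
                                    unsigned long action, void *data)
        {
                struct pci_dev *dev = to_pci_dev(data);

                /* Release the pin only once no driver is bound, so
                 * pci_dev->irq stays stable across enable/disable cycles. */
                if (action == BUS_NOTIFY_UNBOUND_DRIVER)
                        mp_unmap_irq(dev->irq);

                return NOTIFY_OK;
        }
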
      Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      b4b55cda
    • kvm: remove KVM_MMIO_SIZE · 1c2b364b
      Tiejun Chen authored
      After f78146b0 ("KVM: Fix page-crossing MMIO") and 87da7e66
      ("KVM: x86: fix vcpu->mmio_fragments overflow"), KVM_MMIO_SIZE is
      no longer needed.
      Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      1c2b364b
  12. 04 February 2015, 6 commits
  13. 03 February 2015, 2 commits
  14. 30 January 2015, 1 commit
  15. 29 January 2015, 3 commits
  16. 28 January 2015, 2 commits
  17. 26 January 2015, 1 commit
  18. 23 January 2015, 3 commits
    • x86, tls: Interpret an all-zero struct user_desc as "no segment" · 3669ef9f
      Andy Lutomirski authored
      The Witcher 2 did something like this to allocate a TLS segment index:
      
              struct user_desc u_info;
              bzero(&u_info, sizeof(u_info));
              u_info.entry_number = (uint32_t)-1;
      
              syscall(SYS_set_thread_area, &u_info);
      
      Strictly speaking, this code was never correct.  It should have set
      read_exec_only and seg_not_present to 1 to indicate that it wanted
      to find a free slot without putting anything there, or it should
      have put something sensible in the TLS slot if it wanted to allocate
      a TLS entry for real.  The actual effect of this code was to
      allocate a bogus segment that could be used to exploit espfix.
      
      The set_thread_area hardening patches changed the behavior, causing
      set_thread_area to return -EINVAL and crashing the game.
      
      This changes set_thread_area to interpret this as a request to find
      a free slot and to leave it empty, which isn't *quite* what the game
      expects but should be close enough to keep it working.  In
      particular, using the code above to allocate two segments will
      allocate the same segment both times.
      
      According to FrostbittenKing on Github, this fixes The Witcher 2.
      
      If this somehow still causes problems, we could instead allocate
      a limit==0 32-bit data segment, but that seems rather ugly to me.
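
      The check amounts to treating an all-zero descriptor like an
      explicitly "empty" one (a hedged sketch; the helper name is assumed):
      the slot is reserved without installing a usable segment.

        static bool LDT_zero(const struct user_desc *info)
        {
                return (info->base_addr         == 0 &&
                        info->limit             == 0 &&
                        info->contents          == 0 &&
                        info->read_exec_only    == 0 &&
                        info->seg_32bit         == 0 &&
                        info->limit_in_pages    == 0 &&
                        info->seg_not_present   == 0 &&
                        info->useable           == 0);
        }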
      
      Fixes: 41bdc785 ("x86/tls: Validate TLS entries to protect espfix")
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: stable@vger.kernel.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/0cb251abe1ff0958b8e468a9a9a905b80ae3a746.1421954363.git.luto@amacapital.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      3669ef9f
    • x86, tls, ldt: Stop checking lm in LDT_empty · e30ab185
      Andy Lutomirski authored
      32-bit programs don't have an lm bit in their ABI, so they can't
      reliably cause LDT_empty to return true without resorting to memset.
      They shouldn't need to do this.
      
      This should fix a longstanding, if minor, issue in all 64-bit kernels
      as well as a potential regression in the TLS hardening code.
      
      Fixes: 41bdc785 ("x86/tls: Validate TLS entries to protect espfix")
      Cc: stable@vger.kernel.org
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/72a059de55e86ad5e2935c80aa91880ddf19d07c.1421954363.git.luto@amacapital.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e30ab185
    • x86, mpx: Fix potential performance issue on unmaps · c922228e
      Dave Hansen authored
      The 3.19 merge window saw some TLB modifications merged which caused a
      performance regression. They were fixed in commit 045bbb9fa.
      
      Once that fix was applied, I also noticed that there was a small
      but intermittent regression still present.  It was not present
      consistently enough to bisect reliably, but I'm fairly confident
      that it came from (my own) MPX patches.  The source was reading
      a relatively unused field in the mm_struct via arch_unmap.
      
      I also noted that this code was in the main instruction flow of
      do_munmap() and probably had more icache impact than we want.
      
      This patch does two things:
      1. Adds a static (via Kconfig) and dynamic (via cpuid) check
         for MPX with cpu_feature_enabled().  This keeps us from
         reading that cacheline in the mm and trades it for a check
         of the global CPUID variables at least on CPUs without MPX.
      2. Adds an unlikely() to ensure that the MPX call ends up out
         of the main instruction flow in do_munmap().  I've added
         a detailed comment about why this was done and why we want
          it even on systems where MPX is present (see the sketch below).
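
      Both points together give arch_unmap() roughly this shape (a hedged
      sketch): the global CPUID test keeps us off the mm cacheline on
      non-MPX parts, and unlikely() keeps the MPX work out of do_munmap()'s
      main instruction flow.

        static inline void arch_unmap(struct mm_struct *mm,
                                      struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end)
        {
                if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
                        mpx_notify_unmap(mm, vma, start, end);
        }
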
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: luto@amacapital.net
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20150108223021.AEEAB987@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      c922228e
  19. 22 January 2015, 1 commit