1. 05 3月, 2015 2 次提交
    • L
      x86/mm: Simplify enabling direct_gbpages · e5008abe
      Luis R. Rodriguez 提交于
      direct_gbpages can be force enabled as an early parameter
      but not really have taken effect when DEBUG_PAGEALLOC
      or KMEMCHECK is enabled. You can also enable direct_gbpages
      right now if you have an x86_64 architecture but your CPU
      doesn't really have support for this feature. In both cases
      PG_LEVEL_1G won't actually be enabled but direct_gbpages is used
      in other areas under the assumptions that PG_LEVEL_1G
      was set. Fix this by putting together all requirements
      which make this feature sensible to enable under, and only
      enable both finally flipping on PG_LEVEL_1G and leaving
      PG_LEVEL_1G set when this is true.
      
      We only enable this feature then to be possible on sensible
      builds defined by the new ENABLE_DIRECT_GBPAGES. If the
      CPU has support for it you can either enable this by using
      the DIRECT_GBPAGES option or using the early kernel parameter.
      If a platform had support for this you can always force disable
      it as well.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Link: http://lkml.kernel.org/r/1425518654-3403-3-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e5008abe
    • L
      x86/mm: Use IS_ENABLED() for direct_gbpages · d9fd579c
      Luis R. Rodriguez 提交于
      Replace #ifdef eyesore with IS_ENABLED() use.
      Signed-off-by: NLuis R. Rodriguez <mcgrof@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Dexuan Cui <decui@microsoft.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: JBeulich@suse.com
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Lindgren <tony@atomide.com>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: julia.lawall@lip6.fr
      Link: http://lkml.kernel.org/r/1425518654-3403-2-git-send-email-mcgrof@do-not-panic.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d9fd579c
  2. 28 2月, 2015 1 次提交
  3. 19 2月, 2015 4 次提交
    • H
      x86, mm/ASLR: Fix stack randomization on 64-bit systems · 4e7c22d4
      Hector Marco-Gisbert 提交于
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;
      
                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
                 return PAGE_ALIGN(stack_top) + random_variable;
                 return PAGE_ALIGN(stack_top) - random_variable;
        }
      
      Note that, it declares the "random_variable" variable as "unsigned int".
      Since the result of the shifting operation between STACK_RND_MASK (which
      is 0x3fffff on x86_64, 22 bits) and PAGE_SHIFT (which is 12 on x86_64):
      
      	  random_variable <<= PAGE_SHIFT;
      
      then the two leftmost bits are dropped when storing the result in the
      "random_variable". This variable shall be at least 34 bits long to hold
      the (22+12) result.
      
      These two dropped bits have an impact on the entropy of process stack.
      Concretely, the total stack entropy is reduced by four: from 2^28 to
      2^30 (One fourth of expected entropy).
      
      This patch restores back the entropy by correcting the types involved
      in the operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: NIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.netSigned-off-by: NBorislav Petkov <bp@suse.de>
      4e7c22d4
    • D
      x86/mm/init: Fix incorrect page size in init_memory_mapping() printks · f15e0518
      Dave Hansen 提交于
      With 32-bit non-PAE kernels, we have 2 page sizes available
      (at most): 4k and 4M.
      
      Enabling PAE replaces that 4M size with a 2M one (which 64-bit
      systems use too).
      
      But, when booting a 32-bit non-PAE kernel, in one of our
      early-boot printouts, we say:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 2M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 2M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
      
      Which is obviously wrong.  There is no 2M page available.  This
      is probably because of a badly-named variable: in the map_range
      code: PG_LEVEL_2M.
      
      Instead of renaming all the PG_LEVEL_2M's.  This patch just
      fixes the printout:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 4M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 4M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
        BRK [0x03206000, 0x03206fff] PGTABLE
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.comSigned-off-by: NBorislav Petkov <bp@suse.de>
      f15e0518
    • P
      x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases · 1f40a8bf
      Pavel Machek 提交于
      STRICT_DEVMEM and PAT produce same failure accessing /dev/mem,
      which is quite confusing to the user. Make printk messages
      different to lessen confusion.
      Signed-off-by: NPavel Machek <pavel@ucw.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1f40a8bf
    • F
      x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes · 1db491f7
      Fenghua Yu 提交于
      With more embedded systems emerging using Quark, among other
      things, 32-bit kernel matters again. 32-bit machine and kernel
      uses PAE paging, which currently wastes at least 4K of memory
      per process on Linux where we have to reserve an entire page to
      support a single 32-byte PGD structure. It would be a very good
      thing if we could eliminate that wastage.
      
      PAE paging is used to access more than 4GB memory on x86-32. And
      it is required for NX.
      
      In this patch, we still allocate one page for pgd for a Xen
      domain and 64-bit kernel because one page pgd is assumed in
      these cases. But we can save memory space by only allocating
      32-byte pgd for 32-bit PAE kernel when it is not running as a
      Xen domain.
      Signed-off-by: NFenghua Yu <fenghua.yu@intel.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Glenn Williamson <glenn.p.williamson@intel.com>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1421382601-46912-1-git-send-email-fenghua.yu@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1db491f7
  4. 14 2月, 2015 4 次提交
    • A
      kasan: enable instrumentation of global variables · bebf56a1
      Andrey Ryabinin 提交于
      This feature let us to detect accesses out of bounds of global variables.
      This will work as for globals in kernel image, so for globals in modules.
      Currently this won't work for symbols in user-specified sections (e.g.
      __init, __read_mostly, ...)
      
      The idea of this is simple.  Compiler increases each global variable by
      redzone size and add constructors invoking __asan_register_globals()
      function.  Information about global variable (address, size, size with
      redzone ...) passed to __asan_register_globals() so we could poison
      variable's redzone.
      
      This patch also forces module_alloc() to return 8*PAGE_SIZE aligned
      address making shadow memory handling (
      kasan_module_alloc()/kasan_module_free() ) more simple.  Such alignment
      guarantees that each shadow page backing modules address space correspond
      to only one module_alloc() allocation.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebf56a1
    • A
      kasan: enable stack instrumentation · c420f167
      Andrey Ryabinin 提交于
      Stack instrumentation allows to detect out of bounds memory accesses for
      variables allocated on stack.  Compiler adds redzones around every
      variable on stack and poisons redzones in function's prologue.
      
      Such approach significantly increases stack usage, so all in-kernel stacks
      size were doubled.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c420f167
    • A
      x86_64: add KASan support · ef7f0d6a
      Andrey Ryabinin 提交于
      This patch adds arch specific code for kernel address sanitizer.
      
      16TB of virtual addressed used for shadow memory.  It's located in range
      [ffffec0000000000 - fffffc0000000000] between vmemmap and %esp fixup
      stacks.
      
      At early stage we map whole shadow region with zero page.  Latter, after
      pages mapped to direct mapping address range we unmap zero pages from
      corresponding shadow (see kasan_map_shadow()) and allocate and map a real
      shadow memory reusing vmemmap_populate() function.
      
      Also replace __pa with __pa_nodebug before shadow initialized.  __pa with
      CONFIG_DEBUG_VIRTUAL=y make external function call (__phys_addr)
      __phys_addr is instrumented, so __asan_load could be called before shadow
      area initialized.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef7f0d6a
    • T
      x86: use %*pb[l] to print bitmaps including cpumasks and nodemasks · bf58b487
      Tejun Heo 提交于
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * Unnecessary buffer size calculation and condition on the lenght
        removed from intel_cacheinfo.c::show_shared_cpu_map_func().
      
      * uv_nmi_nr_cpus_pr() got overly smart and implemented "..."
        abbreviation if the output stretched over the predefined 1024 byte
        buffer.  Replaced with plain printk.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf58b487
  5. 13 2月, 2015 1 次提交
  6. 12 2月, 2015 4 次提交
    • A
      mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Andrea Arcangeli 提交于
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7b78075
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • N
      mm/hugetlb: pmd_huge() returns true for non-present hugepage · cbef8478
      Naoya Horiguchi 提交于
      Migrating hugepages and hwpoisoned hugepages are considered as non-present
      hugepages, and they are referenced via migration entries and hwpoison
      entries in their page table slots.
      
      This behavior causes race condition because pmd_huge() doesn't tell
      non-huge pages from migrating/hwpoisoned hugepages.  follow_page_mask() is
      one example where the kernel would call follow_page_pte() for such
      hugepage while this function is supposed to handle only normal pages.
      
      To avoid this, this patch makes pmd_huge() return true when pmd_none() is
      true *and* pmd_present() is false.  We don't have to worry about mixing up
      non-present pmd entry with normal pmd (pointing to leaf level pte entry)
      because pmd_present() is true in normal pmd.
      
      The same race condition could happen in (x86-specific) gup_pmd_range(),
      where this patch simply adds pmd_present() check instead of pmd_huge().
      This is because gup_pmd_range() is fast path.  If we have non-present
      hugepage in this function, we will go into gup_huge_pmd(), then return 0
      at flag mask check, and finally fall back to the slow path.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbef8478
    • N
      mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Naoya Horiguchi 提交于
      Currently we have many duplicates in definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove the m.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETL B), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
      default.
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc,) it's never
      called (the callsite is optimized away) no matter how implemented it is.
      So in such architectures, we don't need arch-specific implementation.
      
      In some architecture (like mips, s390 and tile,) their current
      arch-specific follow_huge_(pmd|pud)() are effectively identical with the
      common code, so this patch lets these architecture use the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepage, so follow_huge_pmd() can/should return some
      relevant value,) but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs use PMD_MASK, but it's OK because
        they are identical in both archs.
        In s390, both of HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is define as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f77eda
  7. 11 2月, 2015 1 次提交
  8. 04 2月, 2015 2 次提交
  9. 30 1月, 2015 1 次提交
    • L
      vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds 提交于
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's about just tying that new return
      value to the existing code, but it's all a bit annoying.
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: NTakashi Iwai <tiwai@suse.de>
      Tested-by: NJan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33692f27
  10. 23 1月, 2015 2 次提交
  11. 20 1月, 2015 1 次提交
  12. 02 1月, 2015 1 次提交
  13. 23 12月, 2014 1 次提交
    • J
      x86: Fix step size adjustment during initial memory mapping · 132978b9
      Jan Beulich 提交于
      The old scheme can lead to failure in certain cases - the
      problem is that after bumping step_size the next (non-final)
      iteration is only guaranteed to make available a memory block
      the size of what step_size was before. E.g. for a memory block
      [0,3004600000) we'd have:
      
       iter	start		end		step		amount
       1	3004400000	30045fffff	 2M		  2M
       2	3004000000	30043fffff	64M		  4M
       3	3000000000	3003ffffff	 2G		 64M
       4	2000000000	2fffffffff	64G		 64G
      
      Yet to map 64G with 4k pages (as happens e.g. under PV Xen) we
      need slightly over 128M, but the first three iterations made
      only about 70M available.
      
      The condition (new_mapped_ram_size > mapped_ram_size) for
      bumping step_size is just not suitable. Instead we want to bump
      it when we know we have enough memory available to cover a block
      of the new step_size. And rather than making that condition more
      complicated than needed, simply adjust step_size by the largest
      possible factor we know we can cover at that point - which is
      shifting it left by one less than the difference between page
      table level shifts. (Interestingly the original STEP_SIZE_SHIFT
      definition had a comment hinting at that having been the
      intention, just that it should have been PUD_SHIFT-PMD_SHIFT-1
      instead of (PUD_SHIFT-PMD_SHIFT)/2, and of course for non-PAE
      32-bit we can't really use these two constants as they're equal
      there.)
      
      Furthermore the comment in get_new_step_size() didn't get
      updated when the bottom-down mapping logic got added. Yet while
      an overflow (flushing step_size to zero) of the shift doesn't
      matter for the top-down method, it does for bottom-up because
      round_up(x, 0) = 0, and an upper range boundary of zero can't
      really work well.
      Signed-off-by: NJan Beulich <jbeulich@suse.com>
      Acked-by: NYinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/54945C1E020000780005114E@mail.emea.novell.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      132978b9
  14. 18 12月, 2014 2 次提交
  15. 16 12月, 2014 2 次提交
    • L
      x86: mm: consolidate VM_FAULT_RETRY handling · 26178ec1
      Linus Torvalds 提交于
      The VM_FAULT_RETRY handling was confusing and incorrect for the case of
      returning to kernel mode.  We need to handle the exception table fixup
      if we return to kernel mode due to a fatal signal - it will basically
      look to the kernel user mode access like the access failed due to the VM
      going away from udner it.  Which is correct - the process is dying - and
      avoids the whole "repeat endless kernel page faults" case.
      
      Handling the VM_FAULT_RETRY early and in just one place also simplifies
      the mmap_sem handling, since once we've taken care of VM_FAULT_RETRY we
      know that we can just drop the lock.  The remaining accounting and
      possible error handling is thread-local and does not need the mmap_sem.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26178ec1
    • L
      x86: mm: move mmap_sem unlock from mm_fault_error() to caller · 7fb08eca
      Linus Torvalds 提交于
      This replaces four copies in various stages of mm_fault_error() handling
      with just a single one.  It will also allow for more natural placement
      of the unlocking after some further cleanup.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7fb08eca
  16. 14 12月, 2014 1 次提交
    • J
      mm/debug-pagealloc: make debug-pagealloc boottime configurable · 031bc574
      Joonsoo Kim 提交于
      Now, we have prepared to avoid using debug-pagealloc in boottime.  So
      introduce new kernel-parameter to disable debug-pagealloc in boottime, and
      makes related functions to be disabled in this case.
      
      Only non-intuitive part is change of guard page functions.  Because guard
      page is effective only if debug-pagealloc is enabled, turning off
      according to debug-pagealloc is reasonable thing to do.
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Jungsoo Son <jungsoo.son@lge.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      031bc574
  17. 11 12月, 2014 1 次提交
    • X
      x86/mm: Fix zone ranges boot printout · c072b90c
      Xishi Qiu 提交于
      This is the usual physical memory layout boot printout:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]
      	[    0.000000]   Normal   [mem 0x100000000-0xc3fffffff]
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0xbf78ffff]
      	[    0.000000]   node   0: [mem 0x100000000-0x63fffffff]
      	[    0.000000]   node   1: [mem 0x640000000-0xc3fffffff]
      	...
      
      This is the log when we set "mem=2G" on the boot cmdline:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0xffffffff]  // should be 0x7fffffff, right?
      	[    0.000000]   Normal   empty
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
      	...
      
      This patch fixes the printout, the following log shows the right
      ranges:
      	...
      	[    0.000000] Zone ranges:
      	[    0.000000]   DMA      [mem 0x00001000-0x00ffffff]
      	[    0.000000]   DMA32    [mem 0x01000000-0x7fffffff]
      	[    0.000000]   Normal   empty
      	[    0.000000] Movable zone start for each node
      	[    0.000000] Early memory node ranges
      	[    0.000000]   node   0: [mem 0x00001000-0x00099fff]
      	[    0.000000]   node   0: [mem 0x00100000-0x7fffffff]
      	...
      Suggested-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
      Cc: Linux MM <linux-mm@kvack.org>
      Cc: <dave@sr71.net>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/5487AB3D.6070306@huawei.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c072b90c
  18. 08 12月, 2014 1 次提交
  19. 04 12月, 2014 1 次提交
  20. 19 11月, 2014 2 次提交
    • D
      x86 mpx: Change return type of get_reg_offset() · 68c009c4
      Dave Hansen 提交于
      get_reg_offset() used to return the register contents themselves
      instead of the register offset.  When it did that, it was an
      unsigned long.  I changed it to return an integer _offset_
      instead of the register.  But, I neglected to change the return
      type of the function or the variables in which we store the
      result of the call.
      
      This fixes up the code to clear up the warnings from the smatch
      bot:
      
      New smatch warnings:
      arch/x86/mm/mpx.c:178 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
      arch/x86/mm/mpx.c:184 mpx_get_addr_ref() warn: unsigned 'base_offset' is never less than zero.
      arch/x86/mm/mpx.c:188 mpx_get_addr_ref() warn: unsigned 'indx_offset' is never less than zero.
      arch/x86/mm/mpx.c:196 mpx_get_addr_ref() warn: unsigned 'addr_offset' is never less than zero.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: x86@kernel.org
      Link: http://lkml.kernel.org/r/20141118182343.C3E0C629@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      68c009c4
    • K
      x86, mm: Set NX across entire PMD at boot · 45e2a9d4
      Kees Cook 提交于
      When setting up permissions on kernel memory at boot, the end of the
      PMD that was split from bss remained executable. It should be NX like
      the rest. This performs a PMD alignment instead of a PAGE alignment to
      get the correct span of memory.
      
      Before:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82c00000    10M     RW   PSE GLB NX pmd
      0xffffffff82c00000-0xffffffff82df5000  2004K     RW       GLB NX pte
      0xffffffff82df5000-0xffffffff82e00000    44K     RW       GLB x  pte
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      After:
      ---[ High Kernel Mapping ]---
      ...
      0xffffffff8202d000-0xffffffff82200000  1868K     RW       GLB NX pte
      0xffffffff82200000-0xffffffff82e00000    12M     RW   PSE GLB NX pmd
      0xffffffff82e00000-0xffffffffc0000000   978M                     pmd
      
      [ tglx: Changed it to roundup(_brk_end, PMD_SIZE) and added a comment.
              We really should unmap the reminder along with the holes
              caused by init,initdata etc. but thats a different issue ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Wang Nan <wangnan0@huawei.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20141114194737.GA3091@www.outflux.netSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      45e2a9d4
  21. 18 11月, 2014 4 次提交
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen 提交于
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace use the prctl() indroduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an enture bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happends under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1de4fa14
    • D
      x86, mpx: On-demand kernel allocation of bounds tables · fe3d197f
      Dave Hansen 提交于
      This is really the meat of the MPX patch set.  If there is one patch to
      review in the entire series, this is the one.  There is a new ABI here
      and this kernel code also interacts with userspace memory in a
      relatively unusual manner.  (small FAQ below).
      
      Long Description:
      
      This patch adds two prctl() commands to provide enable or disable the
      management of bounds tables in kernel, including on-demand kernel
      allocation (See the patch "on-demand kernel allocation of bounds tables")
      and cleanup (See the patch "cleanup unused bound tables"). Applications
      do not strictly need the kernel to manage bounds tables and we expect
      some applications to use MPX without taking advantage of this kernel
      support. This means the kernel can not simply infer whether an application
      needs bounds table management from the MPX registers.  The prctl() is an
      explicit signal from userspace.
      
      PR_MPX_ENABLE_MANAGEMENT is meant to be a signal from userspace to
      require kernel's help in managing bounds tables.
      
      PR_MPX_DISABLE_MANAGEMENT is the opposite, meaning that userspace don't
      want kernel's help any more. With PR_MPX_DISABLE_MANAGEMENT, the kernel
      won't allocate and free bounds tables even if the CPU supports MPX.
      
      PR_MPX_ENABLE_MANAGEMENT will fetch the base address of the bounds
      directory out of a userspace register (bndcfgu) and then cache it into
      a new field (->bd_addr) in  the 'mm_struct'.  PR_MPX_DISABLE_MANAGEMENT
      will set "bd_addr" to an invalid address.  Using this scheme, we can
      use "bd_addr" to determine whether the management of bounds tables in
      kernel is enabled.
      
      Also, the only way to access that bndcfgu register is via an xsaves,
      which can be expensive.  Caching "bd_addr" like this also helps reduce
      the cost of those xsaves when doing table cleanup at munmap() time.
      Unfortunately, we can not apply this optimization to #BR fault time
      because we need an xsave to get the value of BNDSTATUS.
      
      ==== Why does the hardware even have these Bounds Tables? ====
      
      MPX only has 4 hardware registers for storing bounds information.
      If MPX-enabled code needs more than these 4 registers, it needs to
      spill them somewhere. It has two special instructions for this
      which allow the bounds to be moved between the bounds registers
      and some new "bounds tables".
      
      They are similar conceptually to a page fault and will be raised by
      the MPX hardware during both bounds violations or when the tables
      are not present. This patch handles those #BR exceptions for
      not-present tables by carving the space out of the normal processes
      address space (essentially calling the new mmap() interface indroduced
      earlier in this patch set.) and then pointing the bounds-directory
      over to it.
      
      The tables *need* to be accessed and controlled by userspace because
      the instructions for moving bounds in and out of them are extremely
      frequent. They potentially happen every time a register pointing to
      memory is dereferenced. Any direct kernel involvement (like a syscall)
      to access the tables would obviously destroy performance.
      
      ==== Why not do this in userspace? ====
      
      This patch is obviously doing this allocation in the kernel.
      However, MPX does not strictly *require* anything in the kernel.
      It can theoretically be done completely from userspace. Here are
      a few ways this *could* be done. I don't think any of them are
      practical in the real-world, but here they are.
      
      Q: Can virtual space simply be reserved for the bounds tables so
         that we never have to allocate them?
      A: As noted earlier, these tables are *HUGE*. An X-GB virtual
         area needs 4*X GB of virtual space, plus 2GB for the bounds
         directory. If we were to preallocate them for the 128TB of
         user virtual address space, we would need to reserve 512TB+2GB,
         which is larger than the entire virtual address space today.
         This means they can not be reserved ahead of time. Also, a
         single process's pre-popualated bounds directory consumes 2GB
         of virtual *AND* physical memory. IOW, it's completely
         infeasible to prepopulate bounds directories.
      
      Q: Can we preallocate bounds table space at the same time memory
         is allocated which might contain pointers that might eventually
         need bounds tables?
      A: This would work if we could hook the site of each and every
         memory allocation syscall. This can be done for small,
         constrained applications. But, it isn't practical at a larger
         scale since a given app has no way of controlling how all the
         parts of the app might allocate memory (think libraries). The
         kernel is really the only place to intercept these calls.
      
      Q: Could a bounds fault be handed to userspace and the tables
         allocated there in a signal handler instead of in the kernel?
      A: (thanks to tglx) mmap() is not on the list of safe async
         handler functions and even if mmap() would work it still
         requires locking or nasty tricks to keep track of the
         allocation state there.
      
      Having ruled out all of the userspace-only approaches for managing
      bounds tables that we could think of, we create them on demand in
      the kernel.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fe3d197f
    • D
      x86, mpx: Decode MPX instruction to get bound violation information · fcc7ffd6
      Dave Hansen 提交于
      This patch sets bound violation fields of siginfo struct in #BR
      exception handler by decoding the user instruction and constructing
      the faulting pointer.
      
      We have to be very careful when decoding these instructions.  They
      are completely controlled by userspace and may be changed at any
      time up to and including the point where we try to copy them in to
      the kernel.  They may or may not be MPX instructions and could be
      completely invalid for all we know.
      
      Note: This code is based on Qiaowei Ren's specialized MPX
      decoder, but uses the generic decoder whenever possible.  It was
      tested for robustness by generating a completely random data
      stream and trying to decode that stream.  I also unmapped random
      pages inside the stream to test the "partial instruction" short
      read code.
      
      We kzalloc() the siginfo instead of stack allocating it because
      we need to memset() it anyway, and doing this makes it much more
      clear when it got initialized by the MPX instruction decoder.
      
      Changes from the old decoder:
       * Use the generic decoder instead of custom functions.  Saved
         ~70 lines of code overall.
       * Remove insn->addr_bytes code (never used??)
       * Make sure never to possibly overflow the regoff[] array, plus
         check the register range correctly in 32 and 64-bit modes.
       * Allow get_reg() to return an error and have mpx_get_addr_ref()
         handle when it sees errors.
       * Only call insn_get_*() near where we actually use the values
         instead if trying to call them all at once.
       * Handle short reads from copy_from_user() and check the actual
         number of read bytes against what we expect from
         insn_get_length().  If a read stops in the middle of an
         instruction, we error out.
       * Actually check the opcodes intead of ignoring them.
       * Dynamically kzalloc() siginfo_t so we don't leak any stack
         data.
       * Detect and handle decoder failures instead of ignoring them.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151828.5BDD0915@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fcc7ffd6
    • Q
      x86, mpx: Add MPX-specific mmap interface · 57319d80
      Qiaowei Ren 提交于
      We have chosen to perform the allocation of bounds tables in
      kernel (See the patch "on-demand kernel allocation of bounds
      tables") and to mark these VMAs with VM_MPX.
      
      However, there is currently no suitable interface to actually do
      this.  Existing interfaces, like do_mmap_pgoff(), have no way to
      set a modified ->vm_ops or ->vm_flags and don't hold mmap_sem
      long enough to let a caller do it.
      
      This patch wraps mmap_region() and hold mmap_sem long enough to
      make the modifications to the VMA which we need.
      
      Also note the 32/64-bit #ifdef in the header.  We actually need
      to do this at runtime eventually.  But, for now, we don't support
      running 32-bit binaries on 64-bit kernels.  Support for this will
      come in later patches.
      Signed-off-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151827.CE440F67@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      57319d80
  22. 17 11月, 2014 1 次提交
    • T
      x86: mm: Move PAT only functions to mm/pat.c · 0dbcae88
      Thomas Gleixner 提交于
      Commit e00c8cc9 "x86: Use new cache mode type in memtype related
      functions" broke the ARCH=um build.
      
       arch/x86/include/asm/cacheflush.h:67:36: error: return type is an incomplete type
       static inline enum page_cache_mode get_page_memtype(struct page *pg)
      
      The reason is simple. get_page_memtype() and set_page_memtype()
      require enum page_cache_mode now, which is defined in
      asm/pgtable_types.h. UM does not include that file for obvious reasons.
      
      The simple solution is to move that functions to arch/x86/mm/pat.c
      where the only callsites of this are located. They should have been
      there in the first place.
      
      Fixes: e00c8cc9 "x86: Use new cache mode type in memtype related functions"
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Richard Weinberger <richard@nod.at>
      0dbcae88