1. 20 February 2015 (2 commits)
  2. 19 February 2015 (10 commits)
    • x86/microcode/intel: Handle truncated microcode images more robustly · 35a9ff4e
      Committed by Quentin Casasnovas
      We do not check the input data bounds containing the microcode before
      copying a struct microcode_intel_header from it. A specially crafted
      microcode could cause the kernel to read invalid memory and lead to a
      denial-of-service.
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Link: http://lkml.kernel.org/r/1422964824-22056-3-git-send-email-quentin.casasnovas@oracle.com
      [ Made error message differ from the next one and flipped comparison. ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      35a9ff4e
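      Purely as illustration of the bounds check described in the commit above, a minimal
      sketch; the function name, error path and surrounding context are assumptions, not
      the actual patch:

        /* Hypothetical sketch: verify the blob is large enough before reading the header. */
        static int sanity_check_ucode(const u8 *data, unsigned long data_size)
        {
                struct microcode_header_intel mc_header;

                if (data_size < sizeof(mc_header)) {
                        pr_err("error: truncated microcode image, skipping\n");
                        return -EINVAL;
                }

                memcpy(&mc_header, data, sizeof(mc_header));
                /* ... further per-field validation would follow here ... */
                return 0;
        }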
    • x86/microcode/intel: Guard against stack overflow in the loader · f84598bd
      Committed by Quentin Casasnovas
      mc_saved_tmp is a fixed-size array allocated on the stack, so we need to make
      sure mc_saved_count stays within its bounds; otherwise we overflow the stack
      in _save_mc(). A specially crafted microcode header could lead to a kernel
      crash or potentially to kernel code execution.
      Signed-off-by: NQuentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Link: http://lkml.kernel.org/r/1422964824-22056-1-git-send-email-quentin.casasnovas@oracle.com
      Signed-off-by: Borislav Petkov <bp@suse.de>
      f84598bd
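      For illustration only, the essence of the guard the commit above describes is a
      bounds check before writing into the fixed-size array; the helper below is a
      hypothetical sketch, not the actual _save_mc() code:

        /* Sketch: never write past the end of the saved-microcode array. */
        static unsigned int save_mc_bounded(unsigned long *mc_saved_tmp,
                                            unsigned int mc_saved_count,
                                            unsigned long mc, unsigned int max)
        {
                if (mc_saved_count >= max)
                        return mc_saved_count;  /* array full: refuse to save more */

                mc_saved_tmp[mc_saved_count++] = mc;
                return mc_saved_count;
        }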
    • x86, mm/ASLR: Fix stack randomization on 64-bit systems · 4e7c22d4
      Committed by Hector Marco-Gisbert
      The issue is that the stack for processes is not properly randomized on
      64 bit architectures due to an integer overflow.
      
      The affected function is randomize_stack_top() in file
      "fs/binfmt_elf.c":
      
        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                 unsigned int random_variable = 0;

                 if ((current->flags & PF_RANDOMIZE) &&
                         !(current->personality & ADDR_NO_RANDOMIZE)) {
                         random_variable = get_random_int() & STACK_RND_MASK;
                         random_variable <<= PAGE_SHIFT;
                 }
        #ifdef CONFIG_STACK_GROWSUP
                 return PAGE_ALIGN(stack_top) + random_variable;
        #else
                 return PAGE_ALIGN(stack_top) - random_variable;
        #endif
        }
      
      Note that it declares "random_variable" as an "unsigned int". The value
      masked with STACK_RND_MASK (which is 0x3fffff on x86_64, i.e. 22 bits) is
      then shifted left by PAGE_SHIFT (which is 12 on x86_64):

      	  random_variable <<= PAGE_SHIFT;

      so the two leftmost bits are dropped when the result is stored in
      "random_variable". The variable would need to be at least 34 bits wide to
      hold the (22+12)-bit result.
      
      These two dropped bits have an impact on the entropy of the process stack.
      Concretely, the total stack entropy is reduced by a factor of four: from
      2^30 to 2^28 (one fourth of the expected entropy).
      
      This patch restores the entropy by correcting the types involved in the
      operations in the functions randomize_stack_top() and
      stack_maxrandom_size().
      
      The successful fix can be tested with:
      
        $ for i in `seq 1 10`; do cat /proc/self/maps | grep stack; done
        7ffeda566000-7ffeda587000 rw-p 00000000 00:00 0                          [stack]
        7fff5a332000-7fff5a353000 rw-p 00000000 00:00 0                          [stack]
        7ffcdb7a1000-7ffcdb7c2000 rw-p 00000000 00:00 0                          [stack]
        7ffd5e2c4000-7ffd5e2e5000 rw-p 00000000 00:00 0                          [stack]
        ...
      
      Once corrected, the leading bytes should be between 7ffc and 7fff,
      rather than always being 7fff.
      Signed-off-by: NHector Marco-Gisbert <hecmargi@upv.es>
      Signed-off-by: NIsmael Ripoll <iripoll@upv.es>
      [ Rebased, fixed 80 char bugs, cleaned up commit message, added test example and CVE ]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: <stable@vger.kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Fixes: CVE-2015-1593
      Link: http://lkml.kernel.org/r/20150214173350.GA18393@www.outflux.net
      Signed-off-by: Borislav Petkov <bp@suse.de>
      4e7c22d4
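      For reference, a sketch of the corrected function as the changelog above describes
      it, with the random value widened to unsigned long so no bits are dropped; this is
      a simplified illustration (the actual patch also touches stack_maxrandom_size()):

        static unsigned long randomize_stack_top(unsigned long stack_top)
        {
                unsigned long random_variable = 0;

                if ((current->flags & PF_RANDOMIZE) &&
                        !(current->personality & ADDR_NO_RANDOMIZE)) {
                        /* Widen before shifting so bits 32..33 survive. */
                        random_variable = (unsigned long) get_random_int();
                        random_variable &= STACK_RND_MASK;
                        random_variable <<= PAGE_SHIFT;
                }
        #ifdef CONFIG_STACK_GROWSUP
                return PAGE_ALIGN(stack_top) + random_variable;
        #else
                return PAGE_ALIGN(stack_top) - random_variable;
        #endif
        }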
    • x86/mm/init: Fix incorrect page size in init_memory_mapping() printks · f15e0518
      Committed by Dave Hansen
      With 32-bit non-PAE kernels, we have 2 page sizes available
      (at most): 4k and 4M.
      
      Enabling PAE replaces that 4M size with a 2M one (which 64-bit
      systems use too).
      
      But, when booting a 32-bit non-PAE kernel, in one of our
      early-boot printouts, we say:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 2M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 2M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
      
      This is obviously wrong: there is no 2M page available. The cause is a
      badly-named constant in the map_range code: PG_LEVEL_2M.
      
      Instead of renaming all the PG_LEVEL_2M's, this patch just fixes the
      printout:
      
        init_memory_mapping: [mem 0x00000000-0x000fffff]
         [mem 0x00000000-0x000fffff] page 4k
        init_memory_mapping: [mem 0x37000000-0x373fffff]
         [mem 0x37000000-0x373fffff] page 4M
        init_memory_mapping: [mem 0x00100000-0x36ffffff]
         [mem 0x00100000-0x003fffff] page 4k
         [mem 0x00400000-0x36ffffff] page 4M
        init_memory_mapping: [mem 0x37400000-0x377fdfff]
         [mem 0x37400000-0x377fdfff] page 4k
        BRK [0x03206000, 0x03206fff] PGTABLE
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Link: http://lkml.kernel.org/r/20150210212030.665EC267@viggo.jf.intel.com
      Signed-off-by: Borislav Petkov <bp@suse.de>
      f15e0518
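      A hedged sketch of the kind of helper that can produce the right page-size string,
      close in spirit to what the changelog above describes; the struct and field names
      are assumptions for illustration, not necessarily the exact patch:

        static const char *page_size_string(struct map_range *mr)
        {
                if (mr->page_size_mask & (1 << PG_LEVEL_1G))
                        return "1G";

                /* 32-bit non-PAE uses 4M large pages; PG_LEVEL_2M is simply misnamed there. */
                if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_PAE) &&
                    (mr->page_size_mask & (1 << PG_LEVEL_2M)))
                        return "4M";

                if (mr->page_size_mask & (1 << PG_LEVEL_2M))
                        return "2M";

                return "4k";
        }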
    • x86/mm/ASLR: Propagate base load address calculation · f47233c2
      Committed by Jiri Kosina
      Commit:
      
        e2b32e67 ("x86, kaslr: randomize module base load address")
      
      makes the module base load address unconditionally randomized when
      CONFIG_RANDOMIZE_BASE is defined and the "nokaslr" option isn't present
      on the command line.
      
      This is not consistent with how choose_kernel_location() decides whether
      it will randomize kernel load base.
      
      Namely, CONFIG_HIBERNATION disables kASLR (unless "kaslr" option is
      explicitly specified on kernel commandline), which makes the state space
      larger than what module loader is looking at. IOW CONFIG_HIBERNATION &&
      CONFIG_RANDOMIZE_BASE is a valid config option, kASLR wouldn't be applied
      by default in that case, but module loader is not aware of that.
      
      Instead of fixing the logic in module.c, this patch takes a more generic
      approach. It introduces a new bootparam setup_data type, SETUP_KASLR, and
      uses it to pass along whether KASLR has been applied during kernel
      decompression, setting a global 'kaslr_enabled' variable accordingly, so
      that any kernel code (module loading, livepatching, ...) can make
      decisions based on its value.
      
      The x86 module loader is converted to make use of this flag.
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: "H. Peter Anvin" <hpa@linux.intel.com>
      Link: https://lkml.kernel.org/r/alpine.LNX.2.00.1502101411280.10719@pobox.suse.cz
      [ Always dump correct kaslr status when panicking ]
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      f47233c2
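      A rough sketch of how a consumer such as the x86 module loader can use such a flag;
      the function, the cached 'module_load_offset' variable and the offset formula here
      are assumptions for illustration, not the exact kernel code:

        static unsigned long module_load_offset;        /* assumed cached random offset */

        static unsigned long get_module_load_offset(void)
        {
                /* Honour the boot-time decision: no KASLR applied, no module randomization. */
                if (!kaslr_enabled)
                        return 0;

                if (!module_load_offset)
                        module_load_offset =
                                (get_random_int() % 1024 + 1) * PAGE_SIZE;

                return module_load_offset;
        }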
    • x86/apic: Fix the devicetree build in certain configs · b273c2c2
      Committed by Ricardo Ribalda Delgado
      Without this patch:
      
        LD      init/built-in.o
        arch/x86/built-in.o: In function `dtb_lapic_setup': kernel/devicetree.c:155:
        undefined reference to `apic_force_enable'
        Makefile:923: recipe for target 'vmlinux' failed
        make: *** [vmlinux] Error 1
      Signed-off-by: NRicardo Ribalda Delgado <ricardo.ribalda@gmail.com>
      Reviewed-by: NMaciej W. Rozycki <macro@linux-mips.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Link: http://lkml.kernel.org/r/1422905231-16067-1-git-send-email-ricardo.ribalda@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b273c2c2
    • kprobes/x86: Mark 2 bytes NOP as boostable · b7e37567
      Committed by Wang Nan
      Currently, x86 kprobes is unable to boost NOPs with a 2-byte opcode, such as:
      
        nopl 0x0(%rax,%rax,1)
      
      which encodes as 0x0f 0x1f 0x44 0x00 0x00.
      
      Such NOPs are exactly 5 bytes long, enough to hold a relative jmp
      instruction, so boosting them should obviously be safe.
      
      This patch enables boosting such NOPs by simply updating the
      twobyte_is_boostable[] array.
      Signed-off-by: NWang Nan <wangnan0@huawei.com>
      Acked-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: <lizefan@huawei.com>
      Link: http://lkml.kernel.org/r/1423532045-41049-1-git-send-email-wangnan0@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b7e37567
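      Conceptually the change above is a one-bit update in an opcode bitmap. The real
      kernel table is a hand-built const bit array; the sketch below only illustrates
      the idea with generic bitmap helpers (names of the two functions are assumptions):

        /* Sketch: a 256-bit map of which second opcode bytes (0x0f xx) may be boosted. */
        static unsigned long twobyte_is_boostable[256 / BITS_PER_LONG];

        static void mark_nopl_boostable(void)
        {
                /* 0x0f 0x1f ... is the multi-byte NOP family; mark byte 0x1f as boostable. */
                __set_bit(0x1f, twobyte_is_boostable);
        }

        static bool opcode_is_boostable(u8 second_byte)
        {
                return test_bit(second_byte, twobyte_is_boostable);
        }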
    • uprobes/x86: Fix 2-byte opcode table · 5154d4f2
      Committed by Denys Vlasenko
      Enabled probing of lar, lsl, popcnt, lddqu, prefetch insns.
      They should be safe to probe, they throw no exceptions.
      
      Enabled probing of 3-byte opcodes 0f 38-3f xx - these are
      vector insns, so they should be safe.
      
      Enabled probing of many currently undefined 0f xx insns.
      At the rate new vector instructions are getting added,
      we don't want to constantly enable more bits.
      We want to only occasionally *disable* ones which
      for some reason can't be probed.
      This includes 0f 24,26 opcodes, which are undefined
      since Pentium. On 486, they were "mov to/from test register".
      
      Explained more fully what 0f 78,79 opcodes are.
      
      Explained what 0f ae opcode is. (It's unclear why we don't allow
      probing it, but let's not change it for now).
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1423768732-32194-3-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5154d4f2
    • uprobes/x86: Fix 1-byte opcode tables · 67fc8092
      Committed by Denys Vlasenko
      This change fixes 1-byte opcode tables so that only insns
      for which we have real reasons to disallow probing are marked
      with unset bits.
      
      To that end:
      
      Set bits for all prefix bytes. Their setting is ignored anyway -
      we check the bitmap against OPCODE1(insn), not against the first
      byte. Keeping them set to 0 only confuses the reader of the code with
      the question "why don't we support that opcode?".
      
      Thus: enable bytes c4,c5 in 64-bit mode (VEX prefixes).
      Byte 62 (EVEX prefix) is not yet enabled since insn decoder
      does not support that yet.
      
      For 32-bit mode, enable probing of opcodes 63 (arpl) and d6
      (salc). They don't require any special handling.
      
      For 64-bit mode, disable 9a and ea - these undefined opcodes
      were mistakenly left enabled.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1423768732-32194-2-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      67fc8092
    • uprobes/x86: Add comment with insn opcodes, mnemonics and why we dont support them · 097f4e5e
      Committed by Denys Vlasenko
      After adding these, it's clear we have some awkward choices
      there. Some valid instructions are prohibited from uprobing
      while several invalid ones are allowed.
      
      Hopefully future edits to the good-opcode tables will fix wrong
      bits or explain why those bits are not wrong.
      
      No actual code changes.
      Signed-off-by: NDenys Vlasenko <dvlasenk@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1423768732-32194-1-git-send-email-dvlasenk@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      097f4e5e
  3. 18 February 2015 (3 commits)
    • x86/irq: Check for valid irq descriptor in check_irq_vectors_for_cpu_disable() · d97eb896
      Committed by Joerg Roedel
      When an interrupt is migrated away from a cpu it will stay
      in its vector_irq array until smp_irq_move_cleanup_interrupt
      has succeeded. The cfg->move_in_progress flag is cleared already
      when the IPI is sent.
      
      When the interrupt is destroyed after migration, its 'struct
      irq_desc' is freed and the vector_irq arrays are cleaned up.
      But since cfg->move_in_progress is already 0, the references
      on the cpus it was migrated away from are not cleared. This
      leaves a reference to an already destroyed irq alive.
      
      When the cpu is taken down at this point, the
      check_irq_vectors_for_cpu_disable() function finds a valid irq
      number in the vector_irq array, but gets NULL for its
      descriptor and dereferences it, causing a kernel panic.
      
      This has been observed on real systems at shutdown. Add a
      check to check_irq_vectors_for_cpu_disable() for a valid
      'struct irq_desc' to prevent this issue.
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NJiang Liu <jiang.liu@linux.intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <JBeulich@suse.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: alnovak@suse.com
      Cc: joro@8bytes.org
      Link: http://lkml.kernel.org/r/20150204132754.GA10078@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d97eb896
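      The essence of the fix above, sketched as a small counting loop; the function name
      is an assumption and the real check_irq_vectors_for_cpu_disable() does more than
      count, but the NULL-descriptor test is the relevant part:

        /* Sketch: count vectors on this CPU that still map to a live irq descriptor. */
        static int count_live_vectors(void)
        {
                int vector, irq, count = 0;

                for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
                        irq = __this_cpu_read(vector_irq[vector]);
                        if (irq < 0)
                                continue;
                        if (!irq_to_desc(irq))
                                continue;       /* descriptor already freed: skip it */
                        count++;
                }
                return count;
        }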
    • x86/irq: Fix regression caused by commit b568b860 · 1ea76fba
      Committed by Jiang Liu
      Commit b568b860 ("Treat SCI interrupt as normal GSI interrupt")
      accidentally removed support for legacy PIC interrupts while fixing a
      regression for Xen. This causes a nasty regression on the HP/Compaq
      nc6000, where we fail to register the ACPI interrupt and thus lose,
      e.g., thermal notifications, leading to a potentially overheated machine.
      
      So reintroduce support for the legacy PIC based ACPI SCI interrupt.
      Reported-by: NVille Syrjälä <syrjala@sci.fi>
      Tested-by: NVille Syrjälä <syrjala@sci.fi>
      Signed-off-by: NJiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Cc: <stable@vger.kernel.org> # 3.19+
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Cc: Sander Eikelenboom <linux@eikelenboom.it>
      Cc: linux-pm@vger.kernel.org
      Link: http://lkml.kernel.org/r/1424052673-22974-1-git-send-email-jiang.liu@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1ea76fba
    • x86/spinlocks/paravirt: Fix memory corruption on unlock · d6abfdb2
      Committed by Raghavendra K T
      The paravirt spinlock code clears the slowpath flag after doing the unlock.
      As explained by Linus, it currently does:
      
                      prev = *lock;
                      add_smp(&lock->tickets.head, TICKET_LOCK_INC);
      
                      /* add_smp() is a full mb() */
      
                      if (unlikely(lock->tickets.tail & TICKET_SLOWPATH_FLAG))
                              __ticket_unlock_slowpath(lock, prev);
      
      which is *exactly* the kind of things you cannot do with spinlocks,
      because after you've done the "add_smp()" and released the spinlock
      for the fast-path, you can't access the spinlock any more.  Exactly
      because a fast-path lock might come in, and release the whole data
      structure.
      
      Linus suggested that we should not do any writes to lock after unlock(),
      and we can move slowpath clearing to fastpath lock.
      
      So this patch implements the fix with:
      
       1. Moving slowpath flag to head (Oleg):
          Unlocked locks don't care about the slowpath flag; therefore we can keep
          it set after the last unlock, and clear it again on the first (try)lock.
          -- this removes the write after unlock. note that keeping slowpath flag would
          result in unnecessary kicks.
          By moving the slowpath flag from the tail to the head ticket we also avoid
          the need to access both the head and tail tickets on unlock.
      
       2. use xadd to avoid read/write after unlock that checks the need for
          unlock_kick (Linus):
          We further avoid the need for a read-after-release by using xadd;
          the prev head value will include the slowpath flag and indicate if we
          need to do PV kicking of suspended spinners -- on modern chips xadd
          isn't (much) more expensive than an add + load.
      
      Result:
       setup: 16core (32 cpu +ht sandy bridge 8GB 16vcpu guest)
       benchmark overcommit %improve
       kernbench  1x           -0.13
       kernbench  2x            0.02
       dbench     1x           -1.77
       dbench     2x           -0.63
      
      [Jeremy: Hinted missing TICKET_LOCK_INC for kick]
      [Oleg: Moved slowpath flag to head, ticket_equals idea]
      [PeterZ: Added detailed changelog]
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NRaghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Fernando Luis Vázquez Cao <fernando_b1@lab.ntt.co.jp>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Cc: a.ryabinin@samsung.com
      Cc: dave@stgolabs.net
      Cc: hpa@zytor.com
      Cc: jasowang@redhat.com
      Cc: jeremy@goop.org
      Cc: paul.gortmaker@windriver.com
      Cc: riel@redhat.com
      Cc: tglx@linutronix.de
      Cc: waiman.long@hp.com
      Cc: xen-devel@lists.xenproject.org
      Link: http://lkml.kernel.org/r/20150215173043.GA7471@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d6abfdb2
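      A condensed sketch of the unlock scheme the changelog above describes: the old head
      value is obtained by the same atomic xadd that releases the lock, so nothing reads
      or writes the lock after the release. This is a simplified assumption-laden
      illustration; the actual patch also gates this on the paravirt static key and keeps
      a plain-add path otherwise:

        static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
        {
                __ticket_t head;

                /* xadd() both bumps the head (releasing the lock) and returns
                 * the previous head value, so no access after unlock is needed. */
                head = xadd(&lock->tickets.head, TICKET_LOCK_INC);

                if (unlikely(head & TICKET_SLOWPATH_FLAG)) {
                        /* The slowpath flag now travels in the head ticket. */
                        head &= ~TICKET_SLOWPATH_FLAG;
                        __ticket_unlock_kick(lock, head + TICKET_LOCK_INC);
                }
        }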
  4. 14 February 2015 (7 commits)
    • kasan: enable instrumentation of global variables · bebf56a1
      Committed by Andrey Ryabinin
      This feature lets us detect out-of-bounds accesses to global variables.
      It works both for globals in the kernel image and for globals in modules.
      Currently it won't work for symbols in user-specified sections (e.g.
      __init, __read_mostly, ...).
      
      The idea is simple.  The compiler pads each global variable with a redzone
      and adds constructors invoking the __asan_register_globals() function.
      Information about each global variable (address, size, size with
      redzone, ...) is passed to __asan_register_globals() so that we can poison
      the variable's redzone.
      
      This patch also forces module_alloc() to return 8*PAGE_SIZE aligned
      addresses, making shadow memory handling
      (kasan_module_alloc()/kasan_module_free()) simpler.  Such alignment
      guarantees that each shadow page backing the modules' address space
      corresponds to only one module_alloc() allocation.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bebf56a1
    • mm: vmalloc: pass additional vm_flags to __vmalloc_node_range() · cb9e3c29
      Committed by Andrey Ryabinin
      For instrumenting global variables, KASan will need shadow memory backing
      the memory used for modules.  So on module loading we will need to
      allocate memory for the shadow and map it at the address in the shadow
      region that corresponds to the address returned by module_alloc().
      
      __vmalloc_node_range() could be used for this purpose, except it puts a
      guard hole after the allocated area.  A guard hole in shadow memory would
      be a problem, because at some future point we might need to have shadow
      memory at the address occupied by the guard hole, and so we could fail to
      allocate shadow for module_alloc().
      
      Now that we have the VM_NO_GUARD flag for disabling the guard page, we
      need a way to pass it into __vmalloc_node_range().  Add a new 'vm_flags'
      parameter to the __vmalloc_node_range() function.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cb9e3c29
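      A hedged sketch of how the KASan module-shadow allocation can use the new
      parameter; the wrapper function, argument ordering and flag choices are
      assumptions based on the changelog, not a quote of the actual caller:

        /* Sketch: allocate shadow for a module region without the trailing guard hole. */
        static void *alloc_module_shadow(unsigned long shadow_start, unsigned long shadow_size)
        {
                return __vmalloc_node_range(shadow_size, 1,
                                            shadow_start, shadow_start + shadow_size,
                                            GFP_KERNEL | __GFP_ZERO, PAGE_KERNEL,
                                            VM_NO_GUARD, NUMA_NO_NODE,
                                            __builtin_return_address(0));
        }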
    • kasan: enable stack instrumentation · c420f167
      Committed by Andrey Ryabinin
      Stack instrumentation allows detecting out-of-bounds memory accesses to
      variables allocated on the stack.  The compiler adds redzones around every
      variable on the stack and poisons the redzones in the function prologue.
      
      Such an approach significantly increases stack usage, so the size of all
      in-kernel stacks has been doubled.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c420f167
    • x86_64: kasan: add interceptors for memset/memmove/memcpy functions · 393f203f
      Committed by Andrey Ryabinin
      Recently, instrumentation of calls to builtin functions was removed from
      GCC 5.0.  To check the memory accessed by such functions, userspace asan
      always uses interceptors for them.
      
      So now we should do this as well.  This patch declares
      memset/memmove/memcpy as weak symbols.  In mm/kasan/kasan.c we have our
      own implementations of those functions which check memory before
      accessing it.
      
      The default memset/memmove/memcpy now always have aliases with a '__'
      prefix.  For files built without kasan instrumentation (e.g. mm/slub.c),
      the original mem* calls are replaced (via #define) with the prefixed
      variants, because we don't want to check memory accesses there.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      393f203f
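      A minimal sketch of what such an interceptor looks like, simplified from the
      description above; the check helper shown is an assumption about the kasan runtime,
      and the real memcpy/memmove interceptors also validate the source range:

        #undef memset
        void *memset(void *addr, int c, size_t len)
        {
                /* Validate the destination against shadow memory before writing. */
                check_memory_region((unsigned long)addr, len, true);

                /* Delegate to the uninstrumented '__' alias. */
                return __memset(addr, c, len);
        }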
    • x86_64: add KASan support · ef7f0d6a
      Committed by Andrey Ryabinin
      This patch adds arch-specific code for the kernel address sanitizer.
      
      16TB of virtual address space is used for shadow memory.  It's located in
      the range [ffffec0000000000 - fffffc0000000000], between vmemmap and the
      %esp fixup stacks.
      
      At an early stage we map the whole shadow region with the zero page.
      Later, after pages have been mapped into the direct-mapping address range,
      we unmap the zero pages from the corresponding shadow (see
      kasan_map_shadow()) and allocate and map real shadow memory, reusing the
      vmemmap_populate() function.
      
      Also replace __pa with __pa_nodebug before the shadow is initialized.
      With CONFIG_DEBUG_VIRTUAL=y, __pa makes an external function call
      (__phys_addr); __phys_addr is instrumented, so __asan_load could be called
      before the shadow area is initialized.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Signed-off-by: NAndrey Konovalov <adech.fo@gmail.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jim Davis <jim.epost@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef7f0d6a
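      The core of the shadow scheme above is a fixed linear mapping: every 8 bytes of
      kernel address space are described by one shadow byte. A sketch of the translation
      helper (the offset constant is configuration-dependent):

        static inline void *kasan_mem_to_shadow(const void *addr)
        {
                /* One shadow byte covers 2^KASAN_SHADOW_SCALE_SHIFT (= 8) real bytes. */
                return (void *)((unsigned long)addr >> KASAN_SHADOW_SCALE_SHIFT)
                        + KASAN_SHADOW_OFFSET;
        }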
    • x86: use %*pb[l] to print bitmaps including cpumasks and nodemasks · bf58b487
      Committed by Tejun Heo
      printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
      and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
      respectively which can be used to generate the two printf arguments
      necessary to format the specified cpu/nodemask.
      
      * Unnecessary buffer size calculation and condition on the length
        removed from intel_cacheinfo.c::show_shared_cpu_map_func().
      
      * uv_nmi_nr_cpus_pr() got overly smart and implemented "..."
        abbreviation if the output stretched over the predefined 1024 byte
        buffer.  Replaced with plain printk.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Mike Travis <travis@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf58b487
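      Example usage of the new format together with the helper macros this commit relies
      on; the call sites below are illustrative, not taken from the patch:

        /* Print a cpumask as a ranged list, e.g. "0-3,8-11". */
        printk(KERN_INFO "online CPUs: %*pbl\n", cpumask_pr_args(cpu_online_mask));

        /* Same idea for a nodemask. */
        printk(KERN_INFO "online nodes: %*pbl\n", nodemask_pr_args(&node_states[N_ONLINE]));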
    • Revert "x86/apic: Only disable CPU x2apic mode when necessary" · 8329aa9f
      Committed by Linus Torvalds
      This reverts commit 5fcee53c.
      
      It causes the suspend to fail on at least the Chromebook Pixel, possibly
      other platforms too.
      
      Joerg Roedel points out that the logic should probably have been
      
                      if (max_physical_apicid > 255 ||
                          !(IS_ENABLED(CONFIG_HYPERVISOR_GUEST) &&
                            hypervisor_x2apic_available())) {
      
      instead, but since the code is not in any fast path, we can just live
      without that optimization and simply revert to the original code.
      Acked-by: NJoerg Roedel <joro@8bytes.org>
      Acked-by: NJiang Liu <jiang.liu@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8329aa9f
  5. 13 February 2015 (8 commits)
    • x86/efi: Avoid triple faults during EFI mixed mode calls · 96738c69
      Committed by Matt Fleming
      Andy pointed out that if an NMI or MCE is received while we're in the
      middle of an EFI mixed mode call a triple fault will occur. This can
      happen, for example, when issuing an EFI mixed mode call while running
      perf.
      
      The reason for the triple fault is that we execute the mixed mode call
      in 32-bit mode with paging disabled but with 64-bit kernel IDT handlers
      installed throughout the call.
      
      At Andy's suggestion, stop playing the games we currently do at runtime,
      such as disabling paging and installing a 32-bit GDT for __KERNEL_CS. We
      can simply switch to the __KERNEL32_CS descriptor before invoking
      firmware services, and run in compatibility mode. This way, if an
      NMI/MCE does occur the kernel IDT handler will execute correctly, since
      it'll jump to __KERNEL_CS automatically.
      
      However, this change is only possible post-ExitBootServices(). Before
      then the firmware "owns" the machine and expects for its 32-bit IDT
      handlers to be left intact to service interrupts, etc.
      
      So, we now need to distinguish between early boot and runtime
      invocations of EFI services. During early boot, we need to restore the
      GDT that the firmware expects to be present. We can only jump to the
      __KERNEL32_CS code segment for mixed mode calls after ExitBootServices()
      has been invoked.
      
      A liberal sprinkling of comments in the thunking code should make the
      differences in early and late environments more apparent.
      Reported-by: NAndy Lutomirski <luto@amacapital.net>
      Tested-by: NBorislav Petkov <bp@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMatt Fleming <matt.fleming@intel.com>
      96738c69
    • lguest: don't look in console features to find emerg_wr. · 55c2d788
      Committed by Rusty Russell
      The 1.0 spec clearly states that you must set the ACKNOWLEDGE and
      DRIVER status bits before accessing the feature bits.  This is a
      problem for the early console code, which doesn't really want to
      acknowledge the device (the spec specifically excepts writing to the
      console's emerg_wr from the usual ordering constraints).
      
      Instead, we check that the *size* of the device configuration is
      sufficient to hold emerg_wr: at worst (if the device doesn't support
      the VIRTIO_CONSOLE_F_EMERG_WRITE feature), it will ignore the
      writes.
      Signed-off-by: NRusty Russell <rusty@rustcorp.com.au>
      55c2d788
    • kernel.h: remove ancient __FUNCTION__ hack · 02f1f217
      Committed by Rasmus Villemoes
      __FUNCTION__ hasn't been treated as a string literal since gcc 3.4, so
      this only helps people who only test-compile using 3.3 (compiler-gcc3.h
      barks at anything older than that).  Besides, there are almost no
      occurrences of __FUNCTION__ left in the tree.
      
      [akpm@linux-foundation.org: convert remaining __FUNCTION__ references]
      Signed-off-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Joe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02f1f217
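      The modern, standardized spelling is __func__; a representative conversion looks
      like this (illustrative call site, not from the patch):

        /* before */
        printk(KERN_ERR "%s: device not ready\n", __FUNCTION__);

        /* after */
        printk(KERN_ERR "%s: device not ready\n", __func__);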
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Committed by Andy Lutomirski
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: NAndy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: NRichard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f56141e3
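      After the move, architecture code reaches the restart block through the task struct
      rather than thread_info; a representative before/after (illustrative of the pattern,
      not a quote of any particular hunk):

        /* before */
        current_thread_info()->restart_block.fn = do_no_restart_syscall;

        /* after */
        current->restart_block.fn = do_no_restart_syscall;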
    • x86: mm: restore original pte_special check · c819f37e
      Committed by Mel Gorman
      Commit b38af472 ("x86,mm: fix pte_special versus pte_numa") adjusted
      the pte_special check to take into account that a special pte had
      SPECIAL and neither PRESENT nor PROTNONE.  Now that NUMA hinting PTEs
      are no longer modifying _PAGE_PRESENT it should be safe to restore the
      original pte_special behaviour.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c819f37e
    • mm: remove remaining references to NUMA hinting bits and helpers · 21d9ee3e
      Committed by Mel Gorman
      This patch removes the NUMA PTE bits and associated helpers.  As a
      side-effect it increases the maximum possible swap space on x86-64.
      
      One potential source of problems is races between the marking of PTEs
      PROT_NONE, NUMA hinting faults and migration.  It must be guaranteed that
      a PTE being protected is not faulted in parallel, seen as a pte_none and
      corrupting memory.  The base case is safe, but transhuge has had problems
      in the past due to a different migration mechanism and a dependence on the
      page lock to serialise migrations, and warrants a closer look.
      
      task_work hinting update			parallel fault
      ------------------------			--------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						pmd_none
      						  do_huge_pmd_anonymous_page
      						  read? pmd_lock blocks until hinting complete, fail !pmd_none test
      						  write? __do_huge_pmd_anonymous_page acquires pmd_lock, checks pmd_none
            pmd_modify
            set_pmd_at
      
      task_work hinting update			parallel migration
      ------------------------			------------------
      change_pmd_range
        change_huge_pmd
          __pmd_trans_huge_lock
            pmdp_get_and_clear
      						__handle_mm_fault
      						  do_huge_pmd_numa_page
      						    migrate_misplaced_transhuge_page
      						    pmd_lock waits for updates to complete, recheck pmd_same
            pmd_modify
            set_pmd_at
      
      Both of those are safe and the case where a transhuge page is inserted
      during a protection update is unchanged.  The case where two processes try
      migrating at the same time is unchanged by this series so should still be
      ok.  I could not find a case where we are accidentally depending on the
      PTE not being cleared and flushed.  If one is missed, it'll manifest as
      corruption problems that start triggering shortly after this series is
      merged and only happen when NUMA balancing is enabled.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mark Brown <broonie@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      21d9ee3e
    • mm: convert p[te|md]_numa users to p[te|md]_protnone_numa · 8a0516ed
      Committed by Mel Gorman
      Convert existing users of pte_numa and friends to the new helper.  Note
      that the kernel is broken after this patch is applied until the other page
      table modifiers are also altered.  This patch layout is to make review
      easier.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a0516ed
    • mm: add p[te|md] protnone helpers for use by NUMA balancing · e7bb4b6d
      Committed by Mel Gorman
      This is a preparatory patch that introduces protnone helpers for automatic
      NUMA balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Kirill Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7bb4b6d
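      For reference, a sketch of what the x86 flavour of these helpers looks like: an
      entry is "protnone" when _PAGE_PROTNONE is set and _PAGE_PRESENT is clear. This is
      a simplified illustration of the idea, not necessarily the exact patch text:

        static inline int pte_protnone(pte_t pte)
        {
                return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                        == _PAGE_PROTNONE;
        }

        static inline int pmd_protnone(pmd_t pmd)
        {
                return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
                        == _PAGE_PROTNONE;
        }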
  6. 12 February 2015 (5 commits)
    • mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
      Committed by Andrea Arcangeli
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a7b78075
    • mm: account pmd page tables to the process · dc6c9a35
      Committed by Kirill A. Shutemov
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and memory cgroups.  The trick is to allocate a lot of PMD page
      tables.  The Linux kernel doesn't account PMD tables to the process, only
      PTE tables.
      
      The use case below uses a few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  The oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process the
      same way we account PTE tables.
      
      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range(). But there are a few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles this by accounting
         the table to all processes that share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configurations where the PMD page table level
      is present (i.e. PMD is not folded).  As with nr_ptes, we use a per-mm
      counter.  The counter value is used to calculate the baseline for the
      badness score used by the oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • mm: make FIRST_USER_ADDRESS unsigned long on all archs · d016bf7e
      Committed by Kirill A. Shutemov
      LKP has triggered a compiler warning after my recent patch "mm: account
      pmd page tables to the process":
      
          mm/mmap.c: In function 'exit_mmap':
       >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
      
      The code:
      
       > 2857                WARN_ON(mm_nr_pmds(mm) >
         2858                                round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
      
      On tile, FIRST_USER_ADDRESS is defined as 0, an int, so round_up() also
      produces an int, and shifting that right by PUD_SHIFT triggers the
      "right shift count >= width of type" warning.
      
      I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
      long on every arch, for consistency.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NWu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d016bf7e
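      The per-arch change amounts to a one-character fix to the constant's type, e.g.:

        /* before */
        #define FIRST_USER_ADDRESS      0

        /* after: force unsigned long arithmetic wherever the constant is used */
        #define FIRST_USER_ADDRESS      0UL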
    • mm/hugetlb: pmd_huge() returns true for non-present hugepage · cbef8478
      Committed by Naoya Horiguchi
      Migrating hugepages and hwpoisoned hugepages are considered as non-present
      hugepages, and they are referenced via migration entries and hwpoison
      entries in their page table slots.
      
      This behavior causes a race condition because pmd_huge() doesn't
      distinguish non-huge pages from migrating/hwpoisoned hugepages.
      follow_page_mask() is one example where the kernel would call
      follow_page_pte() for such a hugepage while this function is supposed to
      handle only normal pages.
      
      To avoid this, this patch makes pmd_huge() also return true when the pmd
      entry is non-empty (pmd_none() is false) *and* pmd_present() is false.  We
      don't have to worry about mixing up a non-present pmd entry with a normal
      pmd (pointing to a leaf-level pte entry) because pmd_present() is true for
      a normal pmd.
      
      The same race condition could happen in the (x86-specific) gup_pmd_range(),
      where this patch simply adds a pmd_present() check instead of pmd_huge().
      This is because gup_pmd_range() is a fast path.  If we have a non-present
      hugepage in this function, we will go into gup_huge_pmd(), then return 0
      at the flag mask check, and finally fall back to the slow path.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbef8478
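      A sketch of the resulting x86 check: a pmd counts as huge when it is non-empty and
      is either not present (migration/hwpoison entry) or a real PSE mapping. Simplified
      illustration of the logic described above, not a quoted hunk:

        int pmd_huge(pmd_t pmd)
        {
                return !pmd_none(pmd) &&
                        (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) != _PAGE_PRESENT;
        }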
    • mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Committed by Naoya Horiguchi
      Currently we have many duplicates in the definitions of
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove them.  The basic idea is to put the default
      implementations for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch sets returning ERR_PTR(-EINVAL) as
      default.
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (as in ia64 or sparc), it's never
      called (the callsite is optimized away) no matter how it is implemented.
      So in such architectures, we don't need an arch-specific implementation.
      
      In some architecture (like mips, s390 and tile,) their current
      arch-specific follow_huge_(pmd|pud)() are effectively identical with the
      common code, so this patch lets these architecture use the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepage, so follow_huge_pmd() can/should return some
      relevant value,) but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs to use PMD_MASK, but it's OK because
        they are identical in both archs.
        In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f77eda
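      The generic defaults are provided as weak symbols so an architecture only overrides
      what it needs; for example, the common follow_huge_addr() simply rejects the request.
      A sketch of that default, matching the behaviour described above:

        struct page * __weak
        follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
        {
                /* Only archs with a real implementation (powerpc, ia64) override this. */
                return ERR_PTR(-EINVAL);
        }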
  7. 11 February 2015 (5 commits)