1. 24 May 2016 (4 commits)
    • mm: make vm_brk killable · 2d6c9282
      Committed by Michal Hocko
      Now that all the callers handle vm_brk failure we can change it to wait
      for mmap_sem in a killable fashion, which helps the oom_reaper avoid
      getting blocked just because vm_brk is stuck behind mmap_sem readers.
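
      For illustration, a minimal sketch of the resulting shape of vm_brk
      (reconstructed from the description above rather than copied from the
      diff; the do_brk/mm_populate helpers are assumed from the mm code of
      this era):

      	unsigned long vm_brk(unsigned long addr, unsigned long len)
      	{
      		struct mm_struct *mm = current->mm;
      		unsigned long ret;
      		bool populate;

      		/* killable wait: a fatal signal unblocks the waiter instead
      		 * of leaving the oom_reaper stuck behind this writer */
      		if (down_write_killable(&mm->mmap_sem))
      			return -EINTR;

      		ret = do_brk(addr, len);
      		populate = ((mm->def_flags & VM_LOCKED) != 0);
      		up_write(&mm->mmap_sem);
      		if (populate && !IS_ERR_VALUE(ret))
      			mm_populate(addr, len);
      		return ret;
      	}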
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make vm_munmap killable · ae798783
      Committed by Michal Hocko
      Almost all current users of vm_munmap ignore the return value and so do
      not handle potential errors.  This means that some VMAs might stay
      behind.  This patch doesn't try to solve those potential problems.
      Quite the contrary, it adds a new failure mode by using
      down_write_killable in vm_munmap.  This should be safer than other
      failure modes, though, because the process is guaranteed to die as soon
      as it leaves the kernel and exit_mmap will clean up the whole address
      space.
      
      This will help in OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write, which in turn can block the
      oom_reaper, which relies on the mmap_sem for read to make forward
      progress and reclaim the address space of the victim.
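
      A sketch of what the killable vm_munmap looks like after this change
      (reconstructed under the same assumptions as the rest of the series;
      do_munmap is the existing helper):

      	int vm_munmap(unsigned long start, size_t len)
      	{
      		int ret;
      		struct mm_struct *mm = current->mm;

      		/* the new failure mode: -EINTR when a fatal signal arrives
      		 * while waiting; safe because the task is about to die and
      		 * exit_mmap will clean up the whole address space anyway */
      		if (down_write_killable(&mm->mmap_sem))
      			return -EINTR;

      		ret = do_munmap(mm, start, len);
      		up_write(&mm->mmap_sem);
      		return ret;
      	}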
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make vm_mmap killable · 9fbeb5ab
      Committed by Michal Hocko
      All the callers of vm_mmap seem to check for failure already and bail
      out in one way or another on error, which means that we can change it
      to use the killable version of vm_mmap_pgoff and return -EINTR if the
      current task gets killed while waiting for mmap_sem.  This also means
      that vm_mmap_pgoff can be killable by default and the additional
      parameter can be dropped.
      
      This will help in OOM conditions when the oom victim might be stuck
      waiting for the mmap_sem for write, which in turn can block the
      oom_reaper, which relies on the mmap_sem for read to make forward
      progress and reclaim the address space of the victim.
      
      Please note that load_elf_binary ignores the vm_mmap error for the
      current->personality & MMAP_PAGE_ZERO case, but that shouldn't be a
      problem because the address is not used anywhere and we never return to
      userspace if we get killed.
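
      A sketch of the killable-by-default vm_mmap_pgoff this describes
      (reconstructed; security_mmap_file, do_mmap_pgoff and mm_populate are
      the surrounding helpers of this era):

      	unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
      		unsigned long len, unsigned long prot,
      		unsigned long flag, unsigned long pgoff)
      	{
      		unsigned long ret;
      		struct mm_struct *mm = current->mm;
      		unsigned long populate;

      		ret = security_mmap_file(file, prot, flag);
      		if (!ret) {
      			/* killable by default; the extra parameter is gone */
      			if (down_write_killable(&mm->mmap_sem))
      				return -EINTR;
      			ret = do_mmap_pgoff(file, addr, len, prot, flag,
      					    pgoff, &populate);
      			up_write(&mm->mmap_sem);
      			if (populate)
      				mm_populate(ret, populate);
      		}
      		return ret;
      	}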
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make mmap_sem for write waits killable for mm syscalls · dc0ef0df
      Committed by Michal Hocko
      This is a follow-up work for the oom_reaper [1].  As the async OOM
      killing depends on mmap_sem for read, we would really appreciate it if
      a holder for write didn't stand in the way.  This patchset changes many
      of the down_write calls to be killable, to help those cases when the
      writer is blocked waiting for readers to release the lock and so help
      __oom_reap_task to process the oom victim.
      
      Most of the patches are really trivial because the lock is held in
      shallow syscall paths where we can return EINTR trivially and allow the
      current task to die (note that EINTR will never get to userspace as the
      task has a fatal signal pending).  Others seem easy as well, because
      the callers already handle fatal errors, bail out, and return to
      userspace, which should be sufficient to handle the failure gracefully.
      I am not familiar with all those code paths so a deeper review is
      really appreciated.
      
      As this work touches several areas that are not directly connected, I
      have tried to keep the CC list as small as possible; people who I
      believed would be familiar are CCed only on the specific patches (all
      should have received the cover though).
      
      This patchset is based on linux-next and depends on
      down_write_killable for rw_semaphores, which got merged into the tip
      locking/rwsem branch and is included in the next tree.  I guess it
      would be easiest to route these patches via mmotm because of the
      dependency on the tip tree, but if the respective maintainers prefer
      another way I have no objections.
      
      I haven't covered all the down_write(mm->mmap_sem) instances here:
      
        $ git grep "down_write(.*\<mmap_sem\>)" next/master | wc -l
        98
        $ git grep "down_write(.*\<mmap_sem\>)" | wc -l
        62
      
      I have tried to cover those which should be relatively easy to review in
      this series because this alone should be a nice improvement.  Other
      places can be changed on top.
      
      [0] http://lkml.kernel.org/r/1456752417-9626-1-git-send-email-mhocko@kernel.org
      [1] http://lkml.kernel.org/r/1452094975-551-1-git-send-email-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/1456750705-7141-1-git-send-email-mhocko@kernel.org
      
      This patch (of 18):
      
      This is the first step in making mmap_sem write waiters killable.  It
      focuses on the trivial ones which take the lock early after entering
      the syscall and do not change any state before.

      Therefore it is very easy to change them to use down_write_killable
      and immediately return with -EINTR.  This will allow the waiter to
      pass away without blocking the mmap_sem, which might be required to
      make forward progress.  E.g. the oom reaper will need the lock for
      reading to dismantle the OOM victim's address space.
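
      The trivial conversion boils down to the following before/after
      pattern (a sketch, not any specific hunk from the series):

      	/* before: an uninterruptible wait at syscall entry */
      	down_write(&mm->mmap_sem);

      	/* after: bail out with -EINTR if a fatal signal arrives while
      	 * waiting; the error never reaches userspace because the task
      	 * already has a fatal signal pending */
      	if (down_write_killable(&mm->mmap_sem))
      		return -EINTR;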
      
      The only tricky function in this patch is vm_mmap_pgoff, which has
      many call sites via vm_mmap.  To reduce the risk, keep vm_mmap with
      the original non-killable semantics for now.

      vm_munmap callers do not bother checking the return value, so open
      code it in the munmap syscall path for now, for simplicity.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 21 May 2016 (1 commit)
  3. 20 May 2016 (1 commit)
  4. 18 March 2016 (3 commits)
  5. 19 February 2016 (3 commits)
    • mm: fix regression in remap_file_pages() emulation · 48f7df32
      Committed by Kirill A. Shutemov
      Grazvydas Ignotas has reported a regression in remap_file_pages()
      emulation.
      
      Testcase:
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	#define SIZE    (4096 * 3)
      
      	int main(int argc, char **argv)
      	{
      		unsigned long *p;
      		long i;
      
      		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return -1;
      		}
      
      		for (i = 0; i < SIZE / 4096; i++)
      			p[i * 4096 / sizeof(*p)] = i;
      
      		if (remap_file_pages(p, 4096, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		assert(p[0] == 1);
      
      		munmap(p, SIZE);
      
      		return 0;
      	}
      
      The second remap_file_pages() fails with -EINVAL.
      
      The reason is that the remap_file_pages() emulation assumes that the
      target vma covers the whole area we want to over-map.  That assumption
      is broken by the first remap_file_pages() call: it splits the area
      into two vmas.

      The solution is to check the adjacent vmas and see whether they map
      the same file with the same flags.
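
      A sketch of the adjacent-vma walk the fix adds (reconstructed from the
      description above; field names follow the usual vm_area_struct
      layout):

      	if (start + size > vma->vm_end) {
      		struct vm_area_struct *next;

      		for (next = vma->vm_next; next; next = next->vm_next) {
      			/* hole between vmas? */
      			if (next->vm_start != next->vm_prev->vm_end)
      				goto out;
      			/* must be the same mapping with the same flags */
      			if (next->vm_file != vma->vm_file ||
      			    next->vm_flags != vma->vm_flags)
      				goto out;
      			if (start + size <= next->vm_end)
      				break;
      		}
      		if (!next)
      			goto out;
      	}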
      
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Grazvydas Ignotas <notasas@gmail.com>
      Tested-by: Grazvydas Ignotas <notasas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/core, x86/mm/pkeys: Add execute-only protection keys support · 62b5f7d0
      Committed by Dave Hansen
      Protection keys provide new page-based protection in hardware.
      But, they have an interesting attribute: they only affect data
      accesses and never affect instruction fetches.  That means that
      if we set up some memory as "access-disabled" via protection keys, we
      can still execute from it.
      
      This patch uses protection keys to set up mappings to do just that.
      If a user calls:
      
      	mmap(..., PROT_EXEC);
      or
      	mprotect(ptr, sz, PROT_EXEC);
      
      (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
      notice this, and set a special protection key on the memory.  It
      also sets the appropriate bits in the Protection Keys User Rights
      (PKRU) register so that the memory becomes unreadable and
      unwritable.
      
      I haven't found any userspace that does this today.  With this
      facility in place, we expect userspace to move to use it
      eventually.  Userspace _could_ start doing this today.  Any
      PROT_EXEC calls get converted to PROT_READ inside the kernel, and
      would transparently be upgraded to "true" PROT_EXEC with this
      code.  IOW, userspace never has to do any PROT_EXEC runtime
      detection.
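
      A small userspace demonstration of the semantics described above (a
      sketch for x86-64; on pkeys-capable hardware the final mapping is
      executable but not readable, while on older hardware PROT_EXEC still
      implies read):

      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <string.h>
      	#include <sys/mman.h>

      	int main(void)
      	{
      		static const unsigned char ret_insn[] = { 0xc3 }; /* x86-64 'ret' */
      		void (*fn)(void);
      		void *p;

      		p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
      				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      		assert(p != MAP_FAILED);
      		memcpy(p, ret_insn, sizeof(ret_insn));

      		/* PROT_EXEC without PROT_READ/PROT_WRITE: the kernel can now
      		 * back this with the execute-only protection key */
      		assert(mprotect(p, 4096, PROT_EXEC) == 0);

      		fn = (void (*)(void))p;
      		fn();	/* instruction fetches are still allowed */

      		/* *(volatile char *)p; would SIGSEGV on pkeys hardware */
      		return 0;
      	}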
      
      This feature provides enhanced protection against leaking
      executable memory contents.  This helps thwart attacks which are
      attempting to find ROP gadgets on the fly.
      
      But, the security provided by this approach is not comprehensive.
      The PKRU register which controls access permissions is a normal
      user register writable from unprivileged userspace.  An attacker
      who can execute the 'wrpkru' instruction can easily disable the
      protection provided by this feature.
      
      The protection key that is used for execute-only support is
      permanently dedicated at compile time.  This is fine for now
      because there is currently no API to set a protection key other
      than this one.
      
      Despite there being a constant PKRU value across the entire
      system, we do not set it unless this feature is in use in a
      process.  That is to preserve the PKRU XSAVE 'init state',
      which can lead to faster context switches.
      
      PKRU *is* a user register and the kernel is modifying it.  That
      means that code doing:
      
      	pkru = rdpkru();
      	pkru |= 0x100;
      	mmap(..., PROT_EXEC);
      	wrpkru(pkru);
      
      could lose the bits in PKRU that enforce execute-only
      permissions.  To avoid this, we suggest avoiding ever calling
      mmap() or mprotect() when the PKRU value is expected to be
      unstable.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() · e6bfb709
      Committed by Dave Hansen
      This plumbs a protection key through calc_vm_flag_bits().  We could
      have done this in calc_vm_prot_bits(), but I did not feel strongly
      about which way to go; it was pretty arbitrary which one to use.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Leon Romanovsky <leon@leon.nu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Riley Andrews <riandrews@android.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: devel@driverdev.osuosl.org
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 06 February 2016 (2 commits)
  7. 04 February 2016 (1 commit)
    • mm: warn about VmData over RLIMIT_DATA · d977d56c
      Committed by Konstantin Khlebnikov
      This patch provides a way of working around a slight regression
      introduced by commit 84638335 ("mm: rework virtual memory
      accounting").

      Before that commit RLIMIT_DATA controlled only the size of the brk
      region.  But that change caused problems with all existing versions of
      valgrind, because valgrind sets RLIMIT_DATA to zero.

      This patch fixes the rlimit check (the limit is actually in bytes, not
      pages) and by default turns it into a warning printed at the first
      VmData misuse:
      
        "mmap: top (795): VmData 516096 exceed data ulimit 512000.  Will be forbidden soon."
      
      Behavior is controlled by the boot param ignore_rlimit_data=y/n and by
      the sysfs file /sys/module/kernel/parameters/ignore_rlimit_data.  For
      now it is set to "y".
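
      A sketch of the check this turns into inside may_expand_vm()
      (reconstructed; is_data_mapping() is the vm_flags classification
      helper from the accounting rework, and ignore_rlimit_data is the new
      module parameter):

      	if (is_data_mapping(flags) &&
      	    mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT) {
      		if (!ignore_rlimit_data)
      			return false;	/* hard enforcement */

      		/* default for now: warn once and allow */
      		pr_warn_once("%s (%d): VmData %lu exceed data ulimit %lu. Will be forbidden soon.\n",
      			     current->comm, current->pid,
      			     (mm->data_vm + npages) << PAGE_SHIFT,
      			     rlimit(RLIMIT_DATA));
      	}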
      
      [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 16 January 2016 (1 commit)
  9. 15 January 2016 (4 commits)
    • mm: rework virtual memory accounting · 84638335
      Committed by Konstantin Khlebnikov
      While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we're allowed
      to assign new @start_brk, @brk, @start_data and @end_data in
      mm_struct), it became clear that RLIMIT_DATA as currently implemented
      doesn't do anything useful, because most user-space libraries use the
      mmap() syscall for dynamic memory allocations.

      Linus suggested converting the RLIMIT_DATA rlimit into something
      suitable for anonymous memory accounting.  But in this patch we go
      further, and the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way the code looks cleaner: the code/stack/data classification
      now depends only on the vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be a strange beast like a readonly-private or
      VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
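
      The classification above reduces to small vm_flags predicates; a
      sketch (assuming VM_STACK is the arch's stack-growth flag, i.e.
      VM_GROWSUP or VM_GROWSDOWN):

      	static inline bool is_exec_mapping(vm_flags_t flags)
      	{
      		/* VM_EXEC & ~VM_WRITE -> code; anonymous executable
      		 * areas are accounted as executable too */
      		return (flags & (VM_EXEC | VM_WRITE | VM_STACK)) == VM_EXEC;
      	}

      	static inline bool is_stack_mapping(vm_flags_t flags)
      	{
      		/* growsup/growsdown -> stack, even when file-backed */
      		return (flags & VM_STACK) == VM_STACK;
      	}

      	static inline bool is_data_mapping(vm_flags_t flags)
      	{
      		/* VM_WRITE & ~VM_SHARED & !stack -> data */
      		return (flags & (VM_WRITE | VM_SHARED | VM_STACK)) == VM_WRITE;
      	}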
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Committed by Daniel Cashman
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds a
      piece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values.  The minimum was chosen to match the
      current hard-coded value, and the maximum was chosen so as to give the
      greatest flexibility without generating an invalid mmap_base address,
      generally 3-4 bits less than the number of bits in the user-space
      accessible virtual address space.

      When deciding whether or not to change the default value, a system
      developer should consider that the mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the
      mmap_base address such that the maximum vm_area_struct size may be
      reduced, preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
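
      A hypothetical sketch of how an architecture might consume the new
      tunable when picking the per-exec offset (the actual per-arch wiring
      lands in the follow-up patches of this series):

      	unsigned long arch_mmap_rnd(void)
      	{
      		unsigned long rnd;

      		/* mmap_rnd_bits defaults to the arch minimum and can be
      		 * raised via the new sysctl up to the arch maximum */
      		rnd = get_random_long() & ((1UL << mmap_rnd_bits) - 1);

      		return rnd << PAGE_SHIFT;
      	}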
      Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region · bc36f701
      Committed by Piotr Kwapulinski
      The following flag comparison in mmap_region makes no sense:
      
          if (!(vm_flags & MAP_FIXED))
              return -ENOMEM;
      
      The condition is always false and thus the above "return -ENOMEM" is
      never executed.  The vm_flags must not be compared with the MAP_FIXED
      flag; vm_flags may only be compared with VM_* flags.  MAP_FIXED has
      the same value as VM_MAYREAD.

      Hitting the rlimit is a slow path, and find_vma_intersection should
      realize that there is no overlapping VMA for the !MAP_FIXED case
      pretty quickly.
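
      With the bogus check dropped, the accounting path becomes (a sketch of
      the resulting flow):

      	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
      		unsigned long nr_pages;

      		/* MAP_FIXED may remove pages of mappings that intersect
      		 * with the requested mapping; account for the pages it
      		 * would unmap.  For !MAP_FIXED, count_vma_pages_range()
      		 * quickly finds no intersecting vmas and returns 0. */
      		nr_pages = count_vma_pages_range(mm, addr, addr + len);

      		if (!may_expand_vm(mm, vm_flags,
      					(len >> PAGE_SHIFT) - nr_pages))
      			return -ENOMEM;
      	}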
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove redundant local variables for may_expand_vm() · 0b57d6ba
      Committed by Chen Gang
      Simplify may_expand_vm().
      
      [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 12 January 2016 (1 commit)
    • mm: Add a vm_special_mapping.fault() method · f872f540
      Committed by Andy Lutomirski
      Requiring special mappings to give a list of struct pages is
      inflexible: it prevents sane use of IO memory in a special
      mapping, it's inefficient (it requires arch code to initialize a
      list of struct pages, and it requires the mm core to walk the
      entire list just to figure out how long it is), and it prevents
      arch code from doing anything fancy when a special mapping fault
      occurs.
      
      Add a .fault method as an alternative to filling in a .pages
      array.
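
      A sketch of the extended structure (comments are mine; the .fault
      callback is the new alternative to the .pages array):

      	struct vm_special_mapping {
      		const char *name;	/* e.g. "[vdso]" */

      		/* legacy backing: a NULL-terminated array of struct pages,
      		 * used when no .fault method is provided */
      		struct page **pages;

      		/* new: lets arch code handle the fault itself, e.g. to map
      		 * IO memory or do something fancy per fault */
      		int (*fault)(const struct vm_special_mapping *sm,
      			     struct vm_area_struct *vma,
      			     struct vm_fault *vmf);
      	};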
      
      Looks-OK-to: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 06 November 2015 (8 commits)
  12. 23 September 2015 (1 commit)
  13. 18 September 2015 (1 commit)
  14. 11 September 2015 (2 commits)
  15. 09 September 2015 (4 commits)
    • mm/mmap.c:insert_vm_struct(): check for failure before setting values · c9d13f5f
      Committed by Chen Gang
      There's no point in initializing vma->vm_pgoff if the insertion
      attempt will fail anyway.  Run the checks before performing the
      initialization.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: simplify the failure return working flow · e3975891
      Committed by Chen Gang
      __split_vma() doesn't need the out_err label, nor does err need
      initializing.

      copy_vma() can return NULL directly when kmem_cache_alloc() fails.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: fix the wrong !vma->vm_file check in copy_vma() · ce75799b
      Committed by Oleg Nesterov
      Test-case:
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		fclose(perl);
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	void *get_unmapped_area(void)
      	{
      		void *p = mmap(0, PAGE_SIZE, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1,0);
      		assert(p != MAP_FAILED);
      		munmap(p, PAGE_SIZE);
      		return p;
      	}
      
      	char save[2][PAGE_SIZE];
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		void *page[2];
      
      		assert(vdso);
      		memcpy(save, vdso, sizeof (save));
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		page[0] = mremap(vdso,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      		page[1] = mremap(vdso + PAGE_SIZE,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      
      		assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
      		printf("match: %d %d\n",
      			!memcmp(save[0], page[0], PAGE_SIZE),
      			!memcmp(save[1], page[1], PAGE_SIZE));
      
      		return 0;
      	}
      
      The test-case fails without this patch.  Before the previous commit it
      got the wrong page; now it segfaults (which is imho better).

      This is because copy_vma() wrongly assumes that vma->vm_file == NULL
      is irrelevant until the first fault, which will use
      do_anonymous_page().  This is obviously wrong for a special mapping.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmap: fix the usage of ->vm_pgoff in special_mapping paths · 8a9cc3b5
      Committed by Oleg Nesterov
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		fclose(perl);
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		assert(vdso);
      
      		// of course they should differ, and they do so far
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		// split into 2 vma's
      		assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
      
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		// now they no longer differ, the 2nd vm_pgoff is wrong
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		return 0;
      	}
      
      Output:
      
      	vdso pages differ: 1
      	vdso pages differ: 0
      
      This is because split_vma() correctly updates ->vm_pgoff, but the
      logic in insert_vm_struct() and special_mapping_fault() is absolutely
      broken, so the fault at vdso + PAGE_SIZE returns the 1st page.  The
      same happens if you simply unmap the 1st page.
      
      special_mapping_fault() does:
      
      	pgoff = vmf->pgoff - vma->vm_pgoff;
      
      and this is _only_ correct if vma->vm_start maps the first page of
      the ->vm_private_data array.

      vdso or any other user of install_special_mapping() is not anonymous;
      it has "backing storage" even if it is just an array of pages.  So we
      actually need to make vm_pgoff work as an offset into this array.
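
      Conceptually the fix reduces to this change in the fault path (a
      sketch):

      	/* before: only correct when vma->vm_start maps page 0 of the
      	 * ->vm_private_data array */
      	pgoff = vmf->pgoff - vma->vm_pgoff;

      	/* after: ->vm_pgoff is a real offset into the backing pages
      	 * array, and the core fault path has already folded it into
      	 * vmf->pgoff, so use it directly */
      	pgoff = vmf->pgoff;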
      
      Note: this also allows us to fix another problem: currently gdb can't
      access "[vvar]" memory because special_mapping_fault() doesn't work in
      this case.  Now that we can use ->vm_pgoff we can implement ->access()
      and fix this.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 05 September 2015 (1 commit)
  17. 10 July 2015 (1 commit)
    • vfs: Commit to never having executables on proc and sysfs. · 90f8572b
      Committed by Eric W. Biederman
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no executable files on proc or sysfs.
      Having any executable files show up on proc or sysfs would cause a
      user-space visible regression, and most likely security problems.
      
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforcing that flag.
      
      Test the flag where MNT_NOEXEC is tested today, so that the only
      user-visible effect will be that executables will be treated as if the
      execute bit is cleared.
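
      A sketch of the enforcement point (reconstructed; the helper and flag
      names assume the new superblock flag this commit describes):

      	static inline bool path_noexec(const struct path *path)
      	{
      		/* refuse exec both on noexec mounts and on filesystems
      		 * that committed to never containing executables */
      		return (path->mnt->mnt_flags & MNT_NOEXEC) ||
      		       (path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC);
      	}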
      
      The filesystems proc and sysfs do not currently incorporate any
      executable files, so this does not result in any user-visible effects.
      
      This makes it unnecessary to tightly vet changes to proc and sysfs for
      added executable files, or changes to chattr that would modify
      existing files, as no matter what the individual files say they will
      not be treated as executable files by the vfs.
      
      Not having to vet changes so closely is important, as without this we
      are only one proc_create call (or another goof-up in the
      implementation of notify_change) away from having problematic
      executables on proc.  Those mistakes are all too easy to make and
      would create a situation where there are security issues or where the
      assumptions of some program have to be broken (causing userspace
      regressions).
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
  18. 25 June 2015 (1 commit)