1. 21 May, 2016 1 commit
  2. 20 May, 2016 1 commit
  3. 18 Mar, 2016 3 commits
  4. 19 Feb, 2016 3 commits
    • mm: fix regression in remap_file_pages() emulation · 48f7df32
      Kirill A. Shutemov committed
      Grazvydas Ignotas has reported a regression in remap_file_pages()
      emulation.
      
      Testcase:
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	#define SIZE    (4096 * 3)
      
      	int main(int argc, char **argv)
      	{
      		unsigned long *p;
      		long i;
      
      		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return -1;
      		}
      
      		for (i = 0; i < SIZE / 4096; i++)
      			p[i * 4096 / sizeof(*p)] = i;
      
      		if (remap_file_pages(p, 4096, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		assert(p[0] == 1);
      
      		munmap(p, SIZE);
      
      		return 0;
      	}
      
      The second remap_file_pages() fails with -EINVAL.
      
      The reason is that the remap_file_pages() emulation assumes that the
      target vma covers the whole area we want to map over.  That assumption
      is broken by the first remap_file_pages() call: it splits the area into
      two vmas.
      
      The solution is to check the adjacent vmas that follow and see whether
      they map the same file with the same flags.
      
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Grazvydas Ignotas <notasas@gmail.com>
      Tested-by: Grazvydas Ignotas <notasas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/core, x86/mm/pkeys: Add execute-only protection keys support · 62b5f7d0
      Dave Hansen committed
      Protection keys provide new page-based protection in hardware.
      But, they have an interesting attribute: they only affect data
      accesses and never affect instruction fetches.  That means that
      if we set up some memory which is set as "access-disabled" via
      protection keys, we can still execute from it.
      
      This patch uses protection keys to set up mappings to do just that.
      If a user calls:
      
      	mmap(..., PROT_EXEC);
      or
      	mprotect(ptr, sz, PROT_EXEC);
      
      (note PROT_EXEC-only without PROT_READ/WRITE), the kernel will
      notice this, and set a special protection key on the memory.  It
      also sets the appropriate bits in the Protection Keys User Rights
      (PKRU) register so that the memory becomes unreadable and
      unwritable.
      
      I haven't found any userspace that does this today.  With this
      facility in place, we expect userspace to move to use it
      eventually.  Userspace _could_ start doing this today.  Any
      PROT_EXEC calls get converted to PROT_READ inside the kernel, and
      would transparently be upgraded to "true" PROT_EXEC with this
      code.  IOW, userspace never has to do any PROT_EXEC runtime
      detection.
      
      This feature provides enhanced protection against leaking
      executable memory contents.  This helps thwart attacks which are
      attempting to find ROP gadgets on the fly.
      
      But, the security provided by this approach is not comprehensive.
      The PKRU register which controls access permissions is a normal
      user register writable from unprivileged userspace.  An attacker
      who can execute the 'wrpkru' instruction can easily disable the
      protection provided by this feature.
      
      The protection key that is used for execute-only support is
      permanently dedicated at compile time.  This is fine for now
      because there is currently no API to set a protection key other
      than this one.
      
      Despite there being a constant PKRU value across the entire
      system, we do not set it unless this feature is in use in a
      process.  That is to preserve the PKRU XSAVE 'init state',
      which can lead to faster context switches.
      
      PKRU *is* a user register and the kernel is modifying it.  That
      means that code doing:
      
      	pkru = rdpkru();
      	pkru |= 0x100;
      	mmap(..., PROT_EXEC);
      	wrpkru(pkru);
      
      could lose the bits in PKRU that enforce execute-only
      permissions.  To avoid this, we suggest avoiding ever calling
      mmap() or mprotect() when the PKRU value is expected to be
      unstable.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: keescook@google.com
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Link: http://lkml.kernel.org/r/20160212210240.CB4BB5CA@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • mm/core, arch, powerpc: Pass a protection key in to calc_vm_flag_bits() · e6bfb709
      Dave Hansen committed
      This plumbs a protection key through calc_vm_flag_bits().  We
      could have done this in calc_vm_prot_bits(), but I did not feel
      super strongly which way to go.  It was pretty arbitrary which
      one to use.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arve Hjønnevåg <arve@android.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Geliang Tang <geliangtang@163.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Leon Romanovsky <leon@leon.nu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Riley Andrews <riandrews@android.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: devel@driverdev.osuosl.org
      Cc: linux-api@vger.kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: linuxppc-dev@lists.ozlabs.org
      Link: http://lkml.kernel.org/r/20160212210231.E6F1F0D6@viggo.jf.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  5. 06 Feb, 2016 2 commits
  6. 04 Feb, 2016 1 commit
    • mm: warn about VmData over RLIMIT_DATA · d977d56c
      Konstantin Khlebnikov committed
      This patch provides a way of working around a slight regression
      introduced by commit 84638335 ("mm: rework virtual memory
      accounting").
      
      Before that commit, RLIMIT_DATA controlled only the size of the brk
      region.  The change caused problems with all existing versions of
      valgrind, because valgrind sets RLIMIT_DATA to zero.
      
      This patch fixes the rlimit check (the limit is in bytes, not pages) and
      by default turns it into a warning that is printed on the first VmData
      violation:
      
        "mmap: top (795): VmData 516096 exceed data ulimit 512000.  Will be forbidden soon."
      
      The behavior is controlled by the boot parameter ignore_rlimit_data=y/n
      and by sysfs /sys/module/kernel/parameters/ignore_rlimit_data.  For now
      it is set to "y".
      
      [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 16 Jan, 2016 1 commit
  8. 15 Jan, 2016 4 commits
    • mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov committed
      While inspecting some vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we are allowed
      to assign new @start_brk, @brk, @start_data, @end_data in mm_struct), it
      was noticed that RLIMIT_DATA, in the form it is implemented now, doesn't
      do anything useful, because most user-space libraries use the mmap()
      syscall for dynamic memory allocations.
      
      Linus suggested converting the RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but it might be a strange beast such as a read-only private
      or VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman committed
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  With the
      hard-coded 8 bits, the average expected time to defeat the mmap ASLR
      was just over 10 minutes (128 tries at 5 seconds apiece).  With this
      patch, and an accompanying increase in the entropy value to 16 bits,
      the same attack would take an average expected time of over 45 hours
      (32768 tries), which makes it both less feasible and more likely to be
      noticed.
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When deciding whether or not to change the default value, a system
      developer should consider that mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region · bc36f701
      Piotr Kwapulinski committed
      The following flag comparison in mmap_region makes no sense:
      
          if (!(vm_flags & MAP_FIXED))
              return -ENOMEM;
      
      The condition is always false and thus the above "return -ENOMEM" is
      never executed.  The vm_flags must not be compared with MAP_FIXED flag.
      The vm_flags may only be compared with VM_* flags.  MAP_FIXED has the
      same value as VM_MAYREAD.
      
      Hitting the rlimit is a slow path and find_vma_intersection should
      realize that there is no overlapping VMA for !MAP_FIXED case pretty
      quickly.
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove redundant local variables for may_expand_vm() · 0b57d6ba
      Chen Gang committed
      Simplify may_expand_vm().
      
      [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 12 Jan, 2016 1 commit
    • mm: Add a vm_special_mapping.fault() method · f872f540
      Andy Lutomirski committed
      Requiring special mappings to give a list of struct pages is
      inflexible: it prevents sane use of IO memory in a special
      mapping, it's inefficient (it requires arch code to initialize a
      list of struct pages, and it requires the mm core to walk the
      entire list just to figure out how long it is), and it prevents
      arch code from doing anything fancy when a special mapping fault
      occurs.
      
      Add a .fault method as an alternative to filling in a .pages
      array.
      
      Looks-OK-to: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/a26d1677c0bc7e774c33f469451a78ca31e9e6af.1451446564.git.luto@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  10. 06 Nov, 2015 8 commits
  11. 23 Sep, 2015 1 commit
  12. 18 Sep, 2015 1 commit
  13. 11 Sep, 2015 2 commits
  14. 09 Sep, 2015 4 commits
    • mm/mmap.c:insert_vm_struct(): check for failure before setting values · c9d13f5f
      Chen Gang committed
      There's no point in initializing vma->vm_pgoff if the insertion attempt
      will be failing anyway.  Run the checks before performing the
      initialization.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: simplify the failure return working flow · e3975891
      Chen Gang committed
      __split_vma() needs neither the out_err label nor the initialization of
      err.
      
      copy_vma() can return NULL directly when kmem_cache_alloc() fails.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: fix the wrong !vma->vm_file check in copy_vma() · ce75799b
      Oleg Nesterov committed
      Test-case:
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	void *get_unmapped_area(void)
      	{
      		void *p = mmap(0, PAGE_SIZE, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1,0);
      		assert(p != MAP_FAILED);
      		munmap(p, PAGE_SIZE);
      		return p;
      	}
      
      	char save[2][PAGE_SIZE];
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		void *page[2];
      
      		assert(vdso);
      		memcpy(save, vdso, sizeof (save));
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		page[0] = mremap(vdso,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      		page[1] = mremap(vdso + PAGE_SIZE,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      
      		assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
      		printf("match: %d %d\n",
      			!memcmp(save[0], page[0], PAGE_SIZE),
      			!memcmp(save[1], page[1], PAGE_SIZE));
      
      		return 0;
      	}
      
      fails without this patch.  Before the previous commit it gets the wrong
      page; now it segfaults (which is imho better).
      
      This is because copy_vma() wrongly assumes that, if vma->vm_file == NULL,
      ->vm_pgoff is irrelevant until the first fault, which will use
      do_anonymous_page().  This is obviously wrong for a special mapping.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmap: fix the usage of ->vm_pgoff in special_mapping paths · 8a9cc3b5
      Oleg Nesterov committed
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		assert(vdso);
      
      		// of course they should differ, and they do so far
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		// split into 2 vma's
      		assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
      
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		// now they no longer differ, the 2nd vm_pgoff is wrong
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		return 0;
      	}
      
      Output:
      
      	vdso pages differ: 1
      	vdso pages differ: 0
      
      This is because split_vma() correctly updates ->vm_pgoff, but the logic
      in insert_vm_struct() and special_mapping_fault() is absolutely broken,
      so the fault at vdso + PAGE_SIZE returns the 1st page.  The same happens
      if you simply unmap the 1st page.
      
      special_mapping_fault() does:
      
      	pgoff = vmf->pgoff - vma->vm_pgoff;
      
      and this is _only_ correct if vma->vm_start mmaps the first page from
      ->vm_private_data array.
      
      vdso or any other user of install_special_mapping() is not anonymous,
      it has the "backing storage" even if it is just the array of pages.
      So we actually need to make vm_pgoff work as an offset in this array.
      
      Note: this also makes it possible to fix another problem: currently gdb
      can't access "[vvar]" memory because in this case special_mapping_fault()
      doesn't work.  Now that we can use ->vm_pgoff we can implement ->access()
      and fix this.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 05 Sep, 2015 1 commit
  16. 10 Jul, 2015 1 commit
    • vfs: Commit to never having exectuables on proc and sysfs. · 90f8572b
      Eric W. Biederman committed
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no executable files on proc or sysfs.
      Having any executable files show up on proc or sysfs would cause
      a user space visible regression, and most likely security problems.
      
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that executables will be treated as if the
      execute bit is cleared.
      
      The filesystems proc and sysfs do not currently incorporate any
      executable files so this does not result in any user visible effects.
      
      This makes it unnecessary to vet changes to proc and sysfs tightly for
      adding executable files or changes to chattr that would modify
      existing files, as no matter what the individual files say they will
      not be treated as executable files by the vfs.
      
      Not having to vet changes too closely is important as without this we
      are only one proc_create call (or another goof up in the
      implementation of notify_change) from having problematic executables
      on proc.  Those mistakes are all too easy to make and would create
      a situation where there are security issues or the assumptions of
      some program have to be broken (and cause userspace regressions).
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  17. 25 Jun, 2015 1 commit
  18. 16 Apr, 2015 2 commits
  19. 15 Apr, 2015 1 commit
  20. 26 Mar, 2015 1 commit