1. 19 Feb 2016, 1 commit
    • mm: fix regression in remap_file_pages() emulation · 48f7df32
      Kirill A. Shutemov authored
      Grazvydas Ignotas has reported a regression in remap_file_pages()
      emulation.
      
      Testcase:
      	#define _GNU_SOURCE
      	#include <assert.h>
      	#include <stdlib.h>
      	#include <stdio.h>
      	#include <sys/mman.h>
      
      	#define SIZE    (4096 * 3)
      
      	int main(int argc, char **argv)
      	{
      		unsigned long *p;
      		long i;
      
      		p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
      				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (p == MAP_FAILED) {
      			perror("mmap");
      			return -1;
      		}
      
      		for (i = 0; i < SIZE / 4096; i++)
      			p[i * 4096 / sizeof(*p)] = i;
      
      		if (remap_file_pages(p, 4096, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		if (remap_file_pages(p, 4096 * 2, 0, 1, 0)) {
      			perror("remap_file_pages");
      			return -1;
      		}
      
      		assert(p[0] == 1);
      
      		munmap(p, SIZE);
      
      		return 0;
      	}
      
      The second remap_file_pages() fails with -EINVAL.
      
      The reason is that the remap_file_pages() emulation assumes that the
      target vma covers the whole area we want to overmap.  That assumption is
      broken by the first remap_file_pages() call: it splits the area into two
      VMAs.

      The solution is to also check the following adjacent VMAs, and allow
      the remap if they map the same file with the same flags.
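      The adjacency check can be modelled in plain user-space C.  The sketch
      below is my own minimal model with a simplified stand-in for
      vm_area_struct (struct vma and range_covered() are illustrative names,
      not the kernel code): it walks the chain of following VMAs and only
      accepts the range if each neighbour is contiguous and maps the same
      file with the same flags.

              #include <stdio.h>

              struct vma {                       /* simplified vm_area_struct stand-in */
                      unsigned long start, end;
                      unsigned long flags;       /* stand-in for vm_flags */
                      const void *file;          /* stand-in for vm_file */
                      struct vma *next;
              };

              /* Return 1 if [addr, addr+size) is covered by vma plus following VMAs
               * that are contiguous and map the same file with the same flags. */
              static int range_covered(struct vma *vma, unsigned long addr,
                                       unsigned long size)
              {
                      unsigned long end = addr + size;

                      if (!vma || addr < vma->start)
                              return 0;
                      while (vma->end < end) {
                              struct vma *next = vma->next;

                              if (!next || next->start != vma->end ||
                                  next->file != vma->file || next->flags != vma->flags)
                                      return 0;
                              vma = next;
                      }
                      return 1;
              }

              int main(void)
              {
                      const char *file = "same-file";   /* both halves map the same file */
                      /* the two VMAs produced by the first remap_file_pages() split */
                      struct vma second = { 0x2000, 0x4000, 0x3, file, NULL };
                      struct vma first  = { 0x1000, 0x2000, 0x3, file, &second };

                      printf("covered: %d\n", range_covered(&first, 0x1000, 0x2000)); /* 1 */
                      return 0;
              }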
      
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Grazvydas Ignotas <notasas@gmail.com>
      Tested-by: Grazvydas Ignotas <notasas@gmail.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 06 Feb 2016, 2 commits
  3. 04 Feb 2016, 1 commit
    • mm: warn about VmData over RLIMIT_DATA · d977d56c
      Konstantin Khlebnikov authored
      This patch provides a way of working around a slight regression
      introduced by commit 84638335 ("mm: rework virtual memory
      accounting").
      
      Before that commit RLIMIT_DATA controlled only the size of the brk
      region.  But that change has caused problems with all existing versions
      of valgrind, because valgrind sets RLIMIT_DATA to zero.
      
      This patch fixes the rlimit check (the limit is actually in bytes, not
      pages) and by default turns it into a warning printed on the first
      VmData misuse:
      
        "mmap: top (795): VmData 516096 exceed data ulimit 512000.  Will be forbidden soon."
      
      Behavior is controlled by the boot parameter ignore_rlimit_data=y/n and by
      sysfs at /sys/module/kernel/parameters/ignore_rlimit_data.  For now it is set to "y".
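      A small user-space probe (my own illustration, not part of the commit)
      makes the two modes visible: with the default ignore_rlimit_data=y the
      oversized mapping succeeds and a warning like the one quoted above
      appears in the kernel log; with ignore_rlimit_data=n the mmap() should
      fail with ENOMEM.

              #include <errno.h>
              #include <stdio.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <sys/resource.h>

              int main(void)
              {
                      /* RLIMIT_DATA is in bytes; pick a limit smaller than the mapping */
                      struct rlimit rl = { .rlim_cur = 512 * 1024, .rlim_max = 512 * 1024 };
                      void *p;

                      if (setrlimit(RLIMIT_DATA, &rl))
                              perror("setrlimit");

                      /* a private writable mapping counts toward VmData */
                      p = mmap(NULL, 1024 * 1024, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                      if (p == MAP_FAILED)
                              printf("mmap: %s (limit enforced)\n", strerror(errno));
                      else
                              printf("mmap succeeded (warning mode, check dmesg)\n");
                      return 0;
              }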
      
      [akpm@linux-foundation.org: tweak kernel-parameters.txt text]
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Link: http://lkml.kernel.org/r/20151228211015.GL2194@uranus
      Reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 16 Jan 2016, 1 commit
  5. 15 Jan 2016, 4 commits
    • mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov authored
      When inspecting the vague code inside the prctl(PR_SET_MM_MEM) call
      (which tests the RLIMIT_DATA value to figure out whether we're allowed to
      assign new @start_brk, @brk, @start_data and @end_data in mm_struct), it
      became clear that RLIMIT_DATA, in the form it is implemented now, doesn't
      do anything useful, because most user-space libraries use the mmap()
      syscall for dynamic memory allocations.
      
      Linus suggested converting the RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  This patch goes further; the changes
      are bundled together as:
      
       * keep VMA counting even if CONFIG_PROC_FS=n; it will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way the code looks cleaner: the code/stack/data classification now
      depends only on the vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but it might be a strange beast like a read-only private or
      VM_IO area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
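      The classification can be read directly as predicates over vm_flags.
      Below is a stand-alone sketch with locally defined flag values; the
      helper names mirror the rules above but should be treated as an
      assumption rather than the exact kernel helpers.

              #include <stdio.h>

              #define VM_WRITE     0x02UL    /* values local to this example */
              #define VM_EXEC      0x04UL
              #define VM_SHARED    0x08UL
              #define VM_GROWSDOWN 0x0100UL
              #define VM_GROWSUP   0x0200UL

              static int is_exec_mapping(unsigned long flags)
              {
                      return (flags & (VM_EXEC | VM_WRITE)) == VM_EXEC;        /* code */
              }

              static int is_stack_mapping(unsigned long flags)
              {
                      return !!(flags & (VM_GROWSUP | VM_GROWSDOWN));          /* stack */
              }

              static int is_data_mapping(unsigned long flags)
              {
                      return (flags & (VM_WRITE | VM_SHARED |
                                       VM_GROWSUP | VM_GROWSDOWN)) == VM_WRITE; /* data */
              }

              int main(void)
              {
                      unsigned long heap = VM_WRITE;   /* private, writable, not stack */

                      printf("heap: exec=%d stack=%d data=%d\n",
                             is_exec_mapping(heap), is_stack_mapping(heap),
                             is_data_mapping(heap));   /* 0 0 1 */
                      return 0;
              }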
      Signed-off-by: Konstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: mmap: add new /proc tunable for mmap_base ASLR · d07e2259
      Daniel Cashman authored
      Address Space Layout Randomization (ASLR) provides a barrier to
      exploitation of user-space processes in the presence of security
      vulnerabilities by making it more difficult to find desired code/data
      which could help an attack.  This is done by adding a random offset to
      the location of regions in the process address space, with a greater
      range of potential offset values corresponding to better protection/a
      larger search-space for brute force, but also to greater potential for
      fragmentation.
      
      The offset added to the mmap_base address, which provides the basis for
      the majority of the mappings for a process, is set once on process exec
      in arch_pick_mmap_layout() and is done via hard-coded per-arch values,
      which reflect, hopefully, the best compromise for all systems.  The
      trade-off between increased entropy in the offset value generation and
      the corresponding increased variability in address space fragmentation
      is not absolute, however, and some platforms may tolerate higher amounts
      of entropy.  This patch introduces both new Kconfig values and a sysctl
      interface which may be used to change the amount of entropy used for
      offset generation on a system.
      
      The direct motivation for this change was in response to the
      libstagefright vulnerabilities that affected Android, specifically to
      information provided by Google's project zero at:
      
        http://googleprojectzero.blogspot.com/2015/09/stagefrightened.html
      
      The attack presented therein, by Google's project zero, specifically
      targeted the limited randomness used to generate the offset added to the
      mmap_base address in order to craft a brute-force-based attack.
      Concretely, the attack was against the mediaserver process, which was
      limited to respawning every 5 seconds, on an arm device.  The hard-coded
      8 bits used resulted in an average expected success rate of defeating
      the mmap ASLR after just over 10 minutes (128 tries at 5 seconds
      apiece).  With this patch, and an accompanying increase in the entropy
      value to 16 bits, the same attack would take an average expected time of
      over 45 hours (32768 tries), which makes it both less feasible and more
      likely to be noticed.
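      A quick back-of-the-envelope check of those numbers (my own arithmetic,
      using the 5-second respawn interval quoted above):

              #include <stdio.h>

              int main(void)
              {
                      const double respawn_seconds = 5.0;
                      int bits[] = { 8, 16 };

                      for (int i = 0; i < 2; i++) {
                              /* expected tries to hit one value out of 2^bits is 2^bits / 2 */
                              double tries = (double)(1UL << bits[i]) / 2.0;
                              double minutes = tries * respawn_seconds / 60.0;

                              printf("%2d bits: %6.0f tries, ~%.0f minutes (~%.1f hours)\n",
                                     bits[i], tries, minutes, minutes / 60.0);
                      }
                      return 0;
              }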
      
      The introduced Kconfig and sysctl options are limited by per-arch
      minimum and maximum values, the minimum of which was chosen to match the
      current hard-coded value and the maximum of which was chosen so as to
      give the greatest flexibility without generating an invalid mmap_base
      address, generally 3-4 bits less than the number of bits in the
      user-space accessible virtual address space.
      
      When deciding whether or not to change the default value, a system
      developer should consider that the mmap_base address could be placed
      anywhere up to 2^(value) bits away from the non-randomized location,
      which would introduce variable-sized areas above and below the mmap_base
      address such that the maximum vm_area_struct size may be reduced,
      preventing very large allocations.
      
      This patch (of 4):
      
      ASLR only uses as few as 8 bits to generate the random offset for the
      mmap base address on 32 bit architectures.  This value was chosen to
      prevent a poorly chosen value from dividing the address space in such a
      way as to prevent large allocations.  This may not be an issue on all
      platforms.  Allow the specification of a minimum number of bits so that
      platforms desiring greater ASLR protection may determine where to place
      the trade-off.
      Signed-off-by: Daniel Cashman <dcashman@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mark Salyzyn <salyzyn@android.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Nick Kralevich <nnk@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Hector Marco-Gisbert <hecmargi@upv.es>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove incorrect MAP_FIXED flag comparison from mmap_region · bc36f701
      Piotr Kwapulinski authored
      The following flag comparison in mmap_region makes no sense:
      
          if (!(vm_flags & MAP_FIXED))
              return -ENOMEM;
      
      The condition is always false, and thus the "return -ENOMEM" above is
      never executed.  vm_flags must not be compared with the MAP_FIXED flag;
      it may only be compared with VM_* flags.  MAP_FIXED happens to have the
      same value as VM_MAYREAD.

      Hitting the rlimit is a slow path, and find_vma_intersection should
      realize that there is no overlapping VMA for the !MAP_FIXED case pretty
      quickly.
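      A tiny illustration of why the test was meaningless (MAP_FIXED is a
      user-space mmap() flag, while vm_flags holds VM_* bits; the VM_MAYREAD
      value below is quoted from include/linux/mm.h of that era and is an
      assumption of this sketch):

              #include <stdio.h>
              #include <sys/mman.h>

              #define VM_MAYREAD 0x00000010UL   /* kernel-internal flag, for comparison */

              int main(void)
              {
                      /* on most architectures both evaluate to 0x10, so the removed
                       * check silently tested an unrelated, almost-always-set bit */
                      printf("MAP_FIXED  = %#lx\n", (unsigned long)MAP_FIXED);
                      printf("VM_MAYREAD = %#lx\n", VM_MAYREAD);
                      return 0;
              }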
      Signed-off-by: Piotr Kwapulinski <kwapulinski.piotr@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: remove redundant local variables for may_expand_vm() · 0b57d6ba
      Chen Gang authored
      Simplify may_expand_vm().
      
      [akpm@linux-foundation.org: further simplification, per Naoya Horiguchi]
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 06 Nov 2015, 8 commits
  7. 23 Sep 2015, 1 commit
  8. 18 Sep 2015, 1 commit
  9. 11 Sep 2015, 2 commits
  10. 09 Sep 2015, 4 commits
    • mm/mmap.c:insert_vm_struct(): check for failure before setting values · c9d13f5f
      Chen Gang authored
      There's no point in initializing vma->vm_pgoff if the insertion attempt
      is going to fail anyway.  Run the checks before performing the
      initialization.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmap.c: simplify the failure return working flow · e3975891
      Chen Gang authored
      __split_vma() needs neither the out_err label nor the initialization of err.
      
      copy_vma() can return NULL directly when kmem_cache_alloc() fails.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mremap: fix the wrong !vma->vm_file check in copy_vma() · ce75799b
      Oleg Nesterov authored
      Test-case:
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);	/* popen() streams are closed with pclose() */
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	void *get_unmapped_area(void)
      	{
      		void *p = mmap(0, PAGE_SIZE, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1,0);
      		assert(p != MAP_FAILED);
      		munmap(p, PAGE_SIZE);
      		return p;
      	}
      
      	char save[2][PAGE_SIZE];
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		void *page[2];
      
      		assert(vdso);
      		memcpy(save, vdso, sizeof (save));
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		page[0] = mremap(vdso,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      		page[1] = mremap(vdso + PAGE_SIZE,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      
      		assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
      		printf("match: %d %d\n",
      			!memcmp(save[0], page[0], PAGE_SIZE),
      			!memcmp(save[1], page[1], PAGE_SIZE));
      
      		return 0;
      	}
      
      The test case fails without this patch.  Before the previous commit it
      got the wrong page; now it segfaults (which is imho better).

      This is because copy_vma() wrongly assumes that vma->vm_file == NULL
      is irrelevant until the first fault, which will use do_anonymous_page().
      This is obviously wrong for a special mapping.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mmap: fix the usage of ->vm_pgoff in special_mapping paths · 8a9cc3b5
      Oleg Nesterov authored
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);	/* popen() streams are closed with pclose() */
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		assert(vdso);
      
      		// of course they should differ, and they do so far
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		// split into 2 vma's
      		assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
      
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		// now they no longer differ, the 2nd vm_pgoff is wrong
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		return 0;
      	}
      
      Output:
      
      	vdso pages differ: 1
      	vdso pages differ: 0
      
      This is because split_vma() correctly updates ->vm_pgoff, but the logic
      in insert_vm_struct() and special_mapping_fault() is absolutely broken,
      so the fault at vdso + PAGE_SIZE returns the 1st page.  The same happens
      if you simply unmap the 1st page.
      
      special_mapping_fault() does:
      
      	pgoff = vmf->pgoff - vma->vm_pgoff;
      
      and this is _only_ correct if vma->vm_start mmaps the first page from
      the ->vm_private_data array.
      
      vdso or any other user of install_special_mapping() is not anonymous,
      it has the "backing storage" even if it is just the array of pages.
      So we actually need to make vm_pgoff work as an offset in this array.
      
      Note: this also allows us to fix another problem: currently gdb can't access
      "[vvar]" memory because in this case special_mapping_fault() doesn't work.
      Now that we can use ->vm_pgoff we can implement ->access() and fix this.
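      A stand-alone model of the corrected lookup (my own sketch with
      simplified types; special_fault() and struct vma here are illustrative,
      not the kernel's special_mapping_fault()): vm_pgoff acts as an index
      into the backing pages array, so a split vma still resolves faults to
      the right page.

              #include <stddef.h>
              #include <stdio.h>

              struct page { int id; };

              struct vma {
                      unsigned long vm_pgoff;    /* offset of vm_start within pages[] */
                      struct page **pages;       /* stand-in for ->vm_private_data */
                      size_t nr_pages;
              };

              /* fault_pgoff = vma->vm_pgoff + (fault_addr - vm_start) / PAGE_SIZE,
               * which is what the generic fault path hands to the vm_ops handler */
              static struct page *special_fault(struct vma *vma, unsigned long fault_pgoff)
              {
                      return fault_pgoff < vma->nr_pages ? vma->pages[fault_pgoff] : NULL;
              }

              int main(void)
              {
                      struct page p0 = { 0 }, p1 = { 1 };
                      struct page *pages[] = { &p0, &p1 };
                      /* the vma covering only the second special page after a split */
                      struct vma second = { .vm_pgoff = 1, .pages = pages, .nr_pages = 2 };
                      struct page *hit = special_fault(&second, second.vm_pgoff + 0);

                      printf("faulted page id: %d\n", hit ? hit->id : -1);   /* 1, not 0 */
                      return 0;
              }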
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 05 Sep 2015, 1 commit
  12. 10 Jul 2015, 1 commit
    • vfs: Commit to never having executables on proc and sysfs. · 90f8572b
      Eric W. Biederman authored
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no executable files on proc or sysfs.
      Having any executable files show up on proc or sysfs would cause
      a user-space visible regression, and most likely security problems.
      
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that executables will be treated as if the
      execute bit is cleared.
      
      The filesystems proc and sysfs do not currently incorporate any
      executable files so this does not result in any user visible effects.
      
      This makes it unnecessary to vet changes to proc and sysfs tightly for
      added executable files, or for changes to chattr that would modify
      existing files, as no matter what the individual file says it will
      not be treated as an executable file by the vfs.
      
      Not having to vet changes so closely is important because, without this,
      we are only one proc_create call (or another goof-up in the
      implementation of notify_change) away from having problematic executables
      on proc.  Those mistakes are all too easy to make and would create a
      situation where there are security issues, or where the assumptions of
      some program have to be broken (causing userspace regressions).
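      A self-contained model of the flag test described above (flag values and
      struct layouts are simplified stand-ins; the names SB_I_NOEXEC and
      path_noexec follow the commit's intent but are assumptions of this
      sketch): the per-filesystem flag is honoured exactly where MNT_NOEXEC is
      honoured today.

              #include <stdbool.h>
              #include <stdio.h>

              #define MNT_NOEXEC  0x1   /* per-mount flag, set by the admin */
              #define SB_I_NOEXEC 0x2   /* per-filesystem flag, set by proc/sysfs */

              struct super_block { unsigned int s_iflags; };
              struct mount { unsigned int mnt_flags; struct super_block *sb; };

              static bool path_noexec(const struct mount *mnt)
              {
                      return (mnt->mnt_flags & MNT_NOEXEC) ||
                             (mnt->sb->s_iflags & SB_I_NOEXEC);
              }

              int main(void)
              {
                      struct super_block procfs = { .s_iflags = SB_I_NOEXEC };
                      struct mount m = { .mnt_flags = 0, .sb = &procfs };  /* mounted "exec" */

                      printf("exec allowed on proc: %s\n", path_noexec(&m) ? "no" : "yes");
                      return 0;
              }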
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
  13. 25 Jun 2015, 1 commit
  14. 16 Apr 2015, 2 commits
  15. 15 Apr 2015, 1 commit
  16. 26 Mar 2015, 1 commit
  17. 12 Feb 2015, 3 commits
    • mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Roman Gushchin authored
      I noticed that "allowed" can easily overflow by falling below 0, because
      (total_vm / 32) can be larger than "allowed".  The problem occurs in
      OVERCOMMIT_NONE mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set the overcommit_memory sysctl to 2
      2) mmap() a large file multiple times (with the VM_SHARED flag)
      3) try to malloc() a large amount of memory

      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
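      A minimal model of the wrap-around (my own illustration with made-up
      numbers; the signed subtraction shows the idea, and the kernel change
      additionally clamps the reserve with min_t, per the note below):

              #include <stdio.h>

              int main(void)
              {
                      unsigned long reserve = 5000;   /* e.g. total_vm / 32 */
                      unsigned long allowed_u = 1000; /* pages allowed by the limit */
                      long allowed_s = 1000;

                      unsigned long u = allowed_u - reserve;      /* wraps to a huge value */
                      long s = allowed_s - (long)reserve;         /* goes negative instead */

                      printf("unsigned: %lu (bogus, the allocation wrongly succeeds)\n", u);
                      printf("signed  : %ld (negative, the allocation is refused)\n", s);
                      return 0;
              }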
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix false-positive warning on exit due to mm_nr_pmds(mm) · b30fe6c7
      Kirill A. Shutemov authored
      The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which happens
      *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free(), we see an offset in the counters
      by the time of the checks.
      
      We tried to work around this by offsetting the expected counter value
      according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
      exit_mmap(), but that doesn't work in some cases:
      
      1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
         the upper addresses are occupied by huge pmd entries, so the trick of
         offsetting the expected counter value gets really ugly: we would have
         to apply it to nr_pmds, but not to nr_ptes.
      
      2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd page
         table allocation in pgd_alloc(); it just sets up a pgd entry which is
         allocated at boot and shared across all processes.
      
      The proposal is to move the check to check_mm(), which happens *after*
      pgd_free(), and to do proper accounting during pgd_alloc() and pgd_free(),
      which brings the counters to zero if nothing leaked.
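      A toy model of the proposal (plain C stand-ins, not the kernel's
      mm_struct or real pgd code): whatever page-table pages an architecture
      allocates in pgd_alloc() are counted, and uncounted in pgd_free(), so a
      check run after pgd_free() sees zero unless something really leaked.

              #include <stdio.h>

              struct mm { long nr_ptes, nr_pmds; };

              static void pgd_alloc(struct mm *mm)  /* arch pre-allocates one pmd table */
              {
                      mm->nr_pmds += 1;
              }

              static void pgd_free(struct mm *mm)   /* and releases it here */
              {
                      mm->nr_pmds -= 1;
              }

              static void check_mm(struct mm *mm)   /* runs after pgd_free() */
              {
                      if (mm->nr_ptes || mm->nr_pmds)
                              printf("leak: ptes=%ld pmds=%ld\n", mm->nr_ptes, mm->nr_pmds);
                      else
                              printf("counters balanced\n");
              }

              int main(void)
              {
                      struct mm mm = { 0, 0 };

                      pgd_alloc(&mm);
                      /* ... process lifetime, exit_mmap() tearing down user mappings ... */
                      pgd_free(&mm);
                      check_mm(&mm);
                      return 0;
              }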
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Nishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov authored
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and the memory cgroup.  The trick is to allocate a lot of PMD
      page tables.  The Linux kernel doesn't account PMD tables to the
      process, only PTE tables.

      The test case below uses a few tricks to allocate a lot of PMD page
      tables while keeping VmRSS and VmPTE low.  The oom_score for the process
      will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
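      A back-of-the-envelope check of the ">500 MiB" figure (my own
      arithmetic): each PUD-sized mapping in the loop pins one 4 KiB PMD page
      table.

              #include <stdio.h>

              int main(void)
              {
                      unsigned long nr_pud = 130000, pmd_table_bytes = 4096;

                      printf("~%lu MiB in PMD page tables\n",
                             nr_pud * pmd_table_bytes >> 20);   /* ~507 MiB */
                      return 0;
              }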
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.

      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range(), but there are a few corner cases:

       - HugeTLB can share PMD page tables.  The patch handles this by
         accounting the table to every process that shares it.

       - x86 PAE pre-allocates a few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configurations where the PMD page table level
      is actually present (PMD is not folded).  As with nr_ptes, we use a per-mm
      counter.  The counter value is used to calculate the baseline for the
      badness score used by the oom-killer.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 11 Feb 2015, 2 commits
    • rmap: drop support of non-linear mappings · 27ba0644
      Kirill A. Shutemov authored
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov authored
      remap_file_pages(2) was invented to make it possible to efficiently map
      parts of a huge file into a limited 32-bit virtual address space, such
      as in database workloads.

      Nonlinear mappings are a pain to support, and there seem to be no
      legitimate use-cases nowadays since 64-bit systems are widely available.

      Let's drop it and get rid of all this special-cased code.

      The patch replaces the syscall with emulation which creates a new VMA on
      each remap_file_pages(), unless it can be merged with an adjacent one.
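      For a shared mapping, the emulation boils down to re-mmap()ing the
      affected range MAP_FIXED at the new file offset.  The stand-alone demo
      below is my own illustration of that equivalence; memfd_create() is used
      only to obtain a file descriptor (it needs a reasonably recent glibc)
      and is not part of the commit.

              #define _GNU_SOURCE
              #include <assert.h>
              #include <string.h>
              #include <sys/mman.h>
              #include <unistd.h>

              int main(void)
              {
                      long psz = sysconf(_SC_PAGESIZE);
                      int fd = memfd_create("demo", 0);
                      char *p;

                      assert(fd >= 0);
                      assert(ftruncate(fd, 2 * psz) == 0);

                      p = mmap(NULL, 2 * psz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                      assert(p != MAP_FAILED);
                      memset(p, 'A', psz);
                      memset(p + psz, 'B', psz);

                      /* "remap" page 0 so it shows file page 1: the emulated syscall
                       * would create a new VMA doing exactly this */
                      assert(mmap(p, psz, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_FIXED, fd, psz) == p);
                      assert(p[0] == 'B');

                      munmap(p, 2 * psz);
                      close(fd);
                      return 0;
              }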
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are a few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test the performance impact, I've written a small test case which
      demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
      write the pgoff of the page to the beginning of every page, remap the
      pages in reverse order, and read every page.

      The test creates 1 million VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 12 Jan 2015, 2 commits
  20. 14 Dec 2014, 1 commit