1. 09 Sep 2015: 2 commits
    • mremap: fix the wrong !vma->vm_file check in copy_vma() · ce75799b
      Committed by Oleg Nesterov
      Test-case:
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);	/* popen()ed streams are closed with pclose() */
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	void *get_unmapped_area(void)
      	{
      		void *p = mmap(0, PAGE_SIZE, PROT_NONE,
      				MAP_PRIVATE|MAP_ANONYMOUS, -1,0);
      		assert(p != MAP_FAILED);
      		munmap(p, PAGE_SIZE);
      		return p;
      	}
      
      	char save[2][PAGE_SIZE];
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		void *page[2];
      
      		assert(vdso);
      		memcpy(save, vdso, sizeof (save));
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		page[0] = mremap(vdso,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      		page[1] = mremap(vdso + PAGE_SIZE,
      				PAGE_SIZE, PAGE_SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				get_unmapped_area());
      
      		assert(page[0] != MAP_FAILED && page[1] != MAP_FAILED);
      		printf("match: %d %d\n",
      			!memcmp(save[0], page[0], PAGE_SIZE),
      			!memcmp(save[1], page[1], PAGE_SIZE));
      
      		return 0;
      	}
      
      fails without this patch. Before the previous commit it gets the wrong
      page, now it segfaults (which is imho better).
      
      This is because copy_vma() wrongly assumes that vma->vm_file == NULL
      means the mapping is anonymous, i.e. that its contents do not matter
      until the first fault, which will use do_anonymous_page().  This is
      obviously wrong for a special mapping.
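      For illustration, a minimal self-contained model of the distinction the
      fix relies on (the structs below are mocks, not the kernel's real
      definitions): a mapping with no file is only truly anonymous when it
      also has no vm_ops; a special mapping such as the vdso has
      vm_file == NULL but a non-NULL vm_ops, so testing !vma->vm_file alone
      misclassifies it.

      	#include <stdbool.h>
      	#include <stddef.h>

      	struct file;				/* opaque in this model */
      	struct vm_operations_struct;		/* opaque in this model */

      	struct vm_area_struct {
      		struct file *vm_file;		/* NULL for anon *and* special mappings */
      		const struct vm_operations_struct *vm_ops;
      	};

      	/* a vma with no vm_ops is the truly anonymous case; this is the
      	 * kind of check copy_vma() needs instead of !vma->vm_file */
      	static bool vma_is_anonymous(const struct vm_area_struct *vma)
      	{
      		return vma->vm_ops == NULL;
      	}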
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ce75799b
    • mmap: fix the usage of ->vm_pgoff in special_mapping paths · 8a9cc3b5
      Committed by Oleg Nesterov
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <stdlib.h>
      	#include <string.h>
      	#include <sys/mman.h>
      	#include <assert.h>
      
      	void *find_vdso_vaddr(void)
      	{
      		FILE *perl;
      		char buf[32] = {};
      
      		perl = popen("perl -e 'open STDIN,qq|/proc/@{[getppid]}/maps|;"
      				"/^(.*?)-.*vdso/ && print hex $1 while <>'", "r");
      		fread(buf, sizeof(buf), 1, perl);
      		pclose(perl);	/* popen()ed streams are closed with pclose() */
      
      		return (void *)atol(buf);
      	}
      
      	#define PAGE_SIZE	4096
      
      	int main(void)
      	{
      		void *vdso = find_vdso_vaddr();
      		assert(vdso);
      
      		// of course they should differ, and they do so far
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		// split into 2 vma's
      		assert(mprotect(vdso, PAGE_SIZE, PROT_READ) == 0);
      
      		// force another fault on the next check
      		assert(madvise(vdso, 2 * PAGE_SIZE, MADV_DONTNEED) == 0);
      
      		// now they no longer differ, the 2nd vm_pgoff is wrong
      		printf("vdso pages differ: %d\n",
      			!!memcmp(vdso, vdso + PAGE_SIZE, PAGE_SIZE));
      
      		return 0;
      	}
      
      Output:
      
      	vdso pages differ: 1
      	vdso pages differ: 0
      
      This is because split_vma() correctly updates ->vm_pgoff, but the logic
      in insert_vm_struct() and special_mapping_fault() is absolutely broken,
      so the fault at vdso + PAGE_SIZE returns the 1st page.  The same happens
      if you simply unmap the 1st page.
      
      special_mapping_fault() does:
      
      	pgoff = vmf->pgoff - vma->vm_pgoff;
      
      and this is _only_ correct if vma->vm_start maps the first page of the
      ->vm_private_data array.
      
      The vdso, or any other user of install_special_mapping(), is not
      anonymous; it has "backing storage" even if it is just an array of
      pages.  So we actually need to make ->vm_pgoff work as an offset into
      this array.
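      For illustration, a self-contained model of the indexing this implies
      (mock types, not the kernel code): vmf->pgoff already includes
      vma->vm_pgoff, so once ->vm_pgoff is defined as an offset into the
      backing pages array, the fault handler can index the array with
      vmf->pgoff directly; the old "vmf->pgoff - vma->vm_pgoff" form is what
      breaks after a split.

      	#include <stddef.h>

      	struct page;	/* opaque in this model */

      	/* pages[]: the ->vm_private_data array behind the special mapping;
      	 * vmf_pgoff: the faulting page's offset, vm_pgoff included */
      	static struct page *special_mapping_page(struct page **pages,
      						 size_t nr_pages,
      						 unsigned long vmf_pgoff)
      	{
      		return vmf_pgoff < nr_pages ? pages[vmf_pgoff] : NULL;
      	}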
      
      Note: this also allows us to fix another problem: currently gdb can't
      access "[vvar]" memory because in this case special_mapping_fault()
      doesn't work.  Now that we can use ->vm_pgoff, we can implement
      ->access() and fix this.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8a9cc3b5
  2. 05 Sep 2015: 1 commit
  3. 10 Jul 2015: 1 commit
    • vfs: Commit to never having executables on proc and sysfs. · 90f8572b
      Committed by Eric W. Biederman
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no executable files on proc or sysfs.
      Having any executable files show up on proc or sysfs would cause
      a user space visible regression, and most likely security problems.
      
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that executables will be treated as if the
      execute bit is cleared.
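      A sketch of what such a check can look like (a self-contained model
      with mock types and made-up flag values; the real flag and helper
      names in the patch may differ):

      	#include <stdbool.h>

      	#define MNT_NOEXEC	0x01	/* mount-level: mounted noexec (illustrative value) */
      	#define SB_I_NOEXEC	0x02	/* fs-level: never has executables (illustrative value) */

      	struct super_block { unsigned long s_iflags; };
      	struct vfsmount    { int mnt_flags; struct super_block *mnt_sb; };
      	struct path        { struct vfsmount *mnt; };

      	/* exec is denied if either the mount or the filesystem forbids it */
      	static bool path_noexec(const struct path *path)
      	{
      		return (path->mnt->mnt_flags & MNT_NOEXEC) ||
      		       (path->mnt->mnt_sb->s_iflags & SB_I_NOEXEC);
      	}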
      
      The filesystems proc and sysfs do not currently incorporate any
      executable files, so this does not result in any user visible effects.

      This makes it unnecessary to vet changes to proc and sysfs tightly for
      adding executable files, or for changes to chattr that would modify
      existing files, as no matter what the individual files say they will
      not be treated as executable files by the vfs.
      
      Not having to vet changes so closely is important, as without this we
      are only one proc_create call (or another goof-up in the
      implementation of notify_change) away from having problematic
      executables on proc.  Those mistakes are all too easy to make and
      would create a situation where there are security issues or where the
      assumptions of some program have to be broken (causing userspace
      regressions).
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      90f8572b
  4. 25 Jun 2015: 1 commit
  5. 16 Apr 2015: 2 commits
  6. 15 Apr 2015: 1 commit
  7. 26 Mar 2015: 1 commit
  8. 12 Feb 2015: 3 commits
    • mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Committed by Roman Gushchin
      I noticed that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NONE mode.

      In this case, a huge allocation can succeed and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fail
      (system-wide), so the system becomes unusable.

      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce on older kernels:
      1) set the overcommit_memory sysctl to 2
      2) mmap() a large file multiple times (with the VM_SHARED flag)
      3) try to malloc() a large amount of memory

      It can also be reproduced on newer kernels, but a misconfigured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
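      A self-contained demonstration of the wrap-around being fixed (the page
      counts below are made up; only the unsigned-vs-signed behaviour matters):

      	#include <stdio.h>

      	int main(void)
      	{
      		unsigned long total_vm  = 100UL << 20;	/* lots of mappings, in pages */
      		unsigned long allowed_u = 2UL << 20;	/* pages allowed by policy */
      		long          allowed_s = 2L << 20;

      		allowed_u -= total_vm / 32;		/* wraps to a huge bogus value */
      		allowed_s -= (long)(total_vm / 32);	/* goes negative, as intended */

      		printf("unsigned: %lu (overcommit check would pass)\n", allowed_u);
      		printf("signed:   %ld (allocation correctly refused)\n", allowed_s);
      		return 0;
      	}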
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5703b087
    • mm: fix false-positive warning on exit due to mm_nr_pmds(mm) · b30fe6c7
      Committed by Kirill A. Shutemov
      The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which
      happens *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free(), we see an offset in the
      counters by the time of the checks.

      We tried to work around this by offsetting the expected counter value
      according to FIRST_USER_ADDRESS for both nr_ptes and nr_pmds in
      exit_mmap().  But it doesn't work in some cases:

      1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
         the upper addresses are occupied by huge pmd entries, so the trick
         of offsetting the expected counter value gets really ugly: we would
         have to apply it to nr_pmds, but not to nr_ptes.

      2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't do pte/pmd
         page table allocation in pgd_alloc(); it just sets up a pgd entry
         which is allocated at boot and shared across all processes.

      The proposal is to move the check to check_mm(), which happens *after*
      pgd_free(), and to do proper accounting during pgd_alloc() and
      pgd_free(), which would bring the counters to zero if nothing leaked.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Tyler Baker <tyler.baker@linaro.org>
      Tested-by: Nishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • mm: account pmd page tables to the process · dc6c9a35
      Committed by Kirill A. Shutemov
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and memory cgroup.  The trick is to allocate a lot of PMD
      page tables.  The Linux kernel doesn't account PMD tables to the
      process, only PTE tables.

      The test case below uses a few tricks to allocate a lot of PMD page
      tables while keeping VmRSS and VmPTE low.  The oom_score for the
      process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.

      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range().  But there are a few corner cases:

       - HugeTLB can share PMD page tables.  The patch handles this by
         accounting the table to all processes that share it.

       - x86 PAE pre-allocates a few PMD tables on fork.

       - Architectures with FIRST_USER_ADDRESS > 0.  We need to adjust the
         sanity check on exit(2).

      Accounting only happens on configurations where the PMD page table
      level is present (PMD is not folded).  As with nr_ptes, we use a
      per-mm counter.  The counter value is used to calculate the baseline
      for the badness score by the oom-killer.
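      A self-contained toy model of the accounting (mock types; the kernel
      keeps these as per-mm counters next to nr_ptes and folds them into the
      oom baseline, as described above):

      	#include <stdio.h>

      	struct toy_mm { long nr_ptes, nr_pmds; };

      	static void account_pmd_alloc(struct toy_mm *mm) { mm->nr_pmds++; }	/* as in __pmd_alloc() */
      	static void account_pmd_free(struct toy_mm *mm)  { mm->nr_pmds--; }	/* as in free_pmd_range() */

      	/* page-table pages now count toward the footprint even if RSS is tiny */
      	static long footprint_pages(const struct toy_mm *mm, long rss_pages)
      	{
      		return rss_pages + mm->nr_ptes + mm->nr_pmds;
      	}

      	int main(void)
      	{
      		struct toy_mm mm = { 0, 0 };
      		long i;

      		for (i = 0; i < 130000; i++)	/* one pmd table per PUD, as in the test above */
      			account_pmd_alloc(&mm);
      		printf("%ld KiB of pmd tables now visible\n",
      		       footprint_pages(&mm, 0) * 4096 / 1024);
      		while (mm.nr_pmds > 0)
      			account_pmd_free(&mm);
      		return 0;
      	}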
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
  9. 11 Feb 2015: 2 commits
    • rmap: drop support of non-linear mappings · 27ba0644
      Committed by Kirill A. Shutemov
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ba0644
    • mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Committed by Kirill A. Shutemov
      remap_file_pages(2) was invented to be able to efficiently map parts
      of a huge file into a limited 32-bit virtual address space, such as in
      database workloads.

      Nonlinear mappings are a pain to support, and it seems there are no
      legitimate use-cases nowadays since 64-bit systems are widely
      available.

      Let's drop it and get rid of all this special-cased code.

      The patch replaces the syscall with an emulation which creates a new
      VMA on each remap_file_pages() call, unless it can be merged with an
      adjacent one.
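      Conceptually, each emulated call behaves like a MAP_FIXED mmap() of the
      same file at the requested offset over the old range, which is why a
      separate VMA can appear per call.  A rough userspace picture of the
      equivalence (illustrative only; the kernel side also preserves the
      original flags, protections and locking):

      	#include <sys/mman.h>
      	#include <sys/types.h>
      	#include <unistd.h>

      	/* roughly what remap_file_pages(start, size, 0, pgoff, 0) now
      	 * amounts to for a MAP_SHARED mapping of fd */
      	static void *remap_emulated(void *start, size_t size, int fd,
      				    unsigned long pgoff, int prot)
      	{
      		return mmap(start, size, prot, MAP_SHARED | MAP_FIXED, fd,
      			    (off_t)pgoff * sysconf(_SC_PAGESIZE));
      	}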
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are a few basic tests in LTP for the syscall.  They work just
      fine with the emulation.

      To test the performance impact, I've written a small test case which
      demonstrates pretty much the worst-case scenario: map a 4G shmfs file,
      write the page's pgoff to the beginning of every page, remap the pages
      in reverse order, read every page.

      The test creates 1 million VMAs if the emulation is in use, so I had
      to set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
  10. 12 Jan 2015: 2 commits
  11. 14 Dec 2014: 3 commits
  12. 04 Dec 2014: 1 commit
    • mm: fix anon_vma_clone() error treatment · c4ea95d7
      Committed by Daniel Forrest
      Andrew Morton noticed that the error return from anon_vma_clone() was
      being dropped and replaced with -ENOMEM (which is not itself a bug
      because the only error return value from anon_vma_clone() is -ENOMEM).
      
      I did an audit of callers of anon_vma_clone() and discovered an actual
      bug where the error return was being lost.  In __split_vma(), between
      Linux 3.11 and 3.12 the code was changed so the err variable is used
      before the call to anon_vma_clone() and the default initial value of
      -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
      success since err at this point is now zero.
      
      Below is a patch which fixes this bug and also propagates the error
      return value from anon_vma_clone() in all cases.
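      The corrected pattern in __split_vma() is presumably along these lines
      (a hedged fragment, not the verbatim diff; the goto label names are
      illustrative): the return value of anon_vma_clone() is captured and
      propagated instead of falling through with a stale err.

      	err = vma_dup_policy(vma, new);
      	if (err)
      		goto out_free_vma;

      	err = anon_vma_clone(new, vma);
      	if (err)		/* propagate, don't continue with err == 0 */
      		goto out_free_mpol;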
      
      Fixes: ef0855d3 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
      Signed-off-by: Daniel Forrest <dan.forrest@ssec.wisc.edu>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Hartrick <tim@edgecast.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c4ea95d7
  13. 18 Nov 2014: 1 commit
    • x86, mpx: Cleanup unused bound tables · 1de4fa14
      Committed by Dave Hansen
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace uses the prctl() introduced earlier in this patchset to
      enable kernel management of bounds tables, then when it unmaps the
      first type of mapping (the one with the actual data), the kernel needs
      to free the mapping for the bounds table backing that data.  This
      patch hooks in at the very end of do_unmap() to do so.  We look at the
      addresses being unmapped and find the bounds directory entries and
      tables which cover those addresses.  If an entire table is unused, we
      clear the associated directory entry and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an entire bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use get_user_pages() to
      resolve the fault.
      
      The unmapping of bounds tables happens under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
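      The fault-avoidance pattern described above looks roughly like this
      (an illustrative fragment, not the actual MPX code; bd_entry is
      assumed to be a long __user * pointing at the bounds directory entry,
      and mpx_resolve_fault() is a hypothetical helper built on
      get_user_pages()):

      	int ret;

      	do {
      		pagefault_disable();	/* we hold mmap_sem for write: no sleeping faults */
      		ret = __put_user(0, bd_entry);	/* returns -EFAULT instead of faulting */
      		pagefault_enable();
      		if (!ret)
      			break;
      		/* fault the page in ourselves, then retry the write */
      		ret = mpx_resolve_fault(bd_entry);	/* hypothetical helper */
      	} while (!ret);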
      Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      1de4fa14
  14. 30 Oct 2014: 1 commit
    • mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      Committed by David Rientjes
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_madvise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
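      The MADV_HUGEPAGE side of the fix is presumably along these lines
      (a sketch, not the verbatim diff): khugepaged is registered with the
      flags the caller is about to install rather than the not-yet-updated
      vma->vm_flags.

      	*vm_flags &= ~VM_NOHUGEPAGE;
      	*vm_flags |= VM_HUGEPAGE;
      	/* register with khugepaged now, using the flags we are about to set */
      	if (khugepaged_enter_vma_merge(vma, *vm_flags))
      		return -ENOMEM;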
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Suleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
  15. 14 Oct 2014: 1 commit
    • mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared · 64e45507
      Committed by Peter Feiner
      For VMAs that don't want write notifications, PTEs created for read faults
      have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
      cleared, then the PTE's softdirty bit will remain clear after subsequent
      writes.
      
      Here's a simple code snippet to demonstrate the bug:
      
        char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(m));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(m));  /* fails */
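      (One way to implement the soft_dirty() helper the snippet assumes,
      reading bit 55 of the page's /proc/self/pagemap entry as described in
      Documentation/vm/soft-dirty.txt; error handling kept minimal.)

      	#include <fcntl.h>
      	#include <stdint.h>
      	#include <sys/types.h>
      	#include <unistd.h>

      	static int soft_dirty(const void *addr)
      	{
      		uint64_t entry = 0;
      		int fd = open("/proc/self/pagemap", O_RDONLY);
      		off_t off = ((uintptr_t)addr / getpagesize()) * sizeof(entry);

      		if (fd >= 0) {
      			if (pread(fd, &entry, sizeof(entry), off) != sizeof(entry))
      				entry = 0;
      			close(fd);
      		}
      		return (entry >> 55) & 1;	/* bit 55: soft-dirty */
      	}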
      
      With this patch, write notifications are enabled when VM_SOFTDIRTY is
      cleared.  Furthermore, to avoid unnecessary faults, write notifications
      are disabled when VM_SOFTDIRTY is set.
      
      As a side effect of enabling and disabling write notifications with
      care, this patch fixes a bug in mprotect where vm_page_prot bits set by
      drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
      commit c9d0bf24 ("mm: uncached vma support with writenotify").
      Signed-off-by: Peter Feiner <pfeiner@google.com>
      Reported-by: Peter Feiner <pfeiner@google.com>
      Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64e45507
  16. 10 Oct 2014: 5 commits
  17. 11 Sep 2014: 1 commit
  18. 08 Sep 2014: 1 commit
    • percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Committed by Tejun Heo
      The percpu allocator now supports an allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
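      A minimal usage sketch of the converted interface (the counter and
      function names below are made up; only the extra gfp argument reflects
      the change):

      	#include <linux/gfp.h>
      	#include <linux/init.h>
      	#include <linux/percpu_counter.h>

      	static struct percpu_counter nr_widgets;	/* illustrative counter */

      	static int __init widgets_init(void)
      	{
      		/* the new third argument: allocation mask for the percpu data */
      		return percpu_counter_init(&nr_widgets, 0, GFP_KERNEL);
      	}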
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  19. 09 Aug 2014: 1 commit
    • mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      Committed by David Herrmann
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that if you already own a writable
      mapping, you can increase the counter without using the helpers.  This
      is the same as what we do for i_writecount.
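      The helpers are presumably of roughly this shape (a sketch based on the
      description above and on the i_writecount pattern; see
      include/linux/fs.h for the real definitions and error codes):

      	/* take a new writable mapping: fails once a denial is in effect */
      	static inline int mapping_map_writable(struct address_space *mapping)
      	{
      		return atomic_inc_unless_negative(&mapping->i_mmap_writable) ?
      			0 : -EPERM;
      	}

      	/* deny new writable mappings: fails while any writable mapping exists */
      	static inline int mapping_deny_writable(struct address_space *mapping)
      	{
      		return atomic_dec_unless_positive(&mapping->i_mmap_writable) ?
      			0 : -EBUSY;
      	}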
      Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
  20. 07 Aug 2014: 1 commit
  21. 07 Jun 2014: 1 commit
  22. 05 Jun 2014: 2 commits
  23. 21 May 2014: 1 commit
  24. 08 Apr 2014: 1 commit
    • mm: per-thread vma caching · 615d6e87
      Committed by Davidlohr Bueso
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches was needed.  There are
      two things to consider when dealing with this: the cache hit rate and
      the latency of find_vma().  Improving the hit rate does not necessarily
      translate into finding the vma any faster, as the overhead of any fancy
      caching scheme can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question (a small
      sketch of this indexing follows the results below).  Concretely, the
      following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme improves the ~50% hit rate by just adding a few more slots
         to the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while the baseline's
         are just about non-existent.  The cycle counts (in billions) can
         fluctuate anywhere from ~60 to ~116 for the baseline scheme, but
         this approach reduces them considerably.  For instance, with 80
         threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
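      A self-contained model of the indexing and invalidation scheme described
      above (sizes and field names are illustrative, not the kernel's):

      	#include <stddef.h>
      	#include <stdint.h>

      	#define CACHE_SLOTS 4	/* small, per-thread */

      	struct vma;	/* stand-in for struct vm_area_struct */
      	struct mm_ctx   { uint32_t seqnum; };
      	struct task_ctx { uint32_t seqnum; struct vma *slots[CACHE_SLOTS]; };

      	static struct vma *vmacache_lookup(struct task_ctx *t, struct mm_ctx *mm,
      					   unsigned long addr, unsigned int page_shift)
      	{
      		size_t i;

      		if (t->seqnum != mm->seqnum) {	/* mm changed: lazy whole-cache flush */
      			for (i = 0; i < CACHE_SLOTS; i++)
      				t->slots[i] = NULL;
      			t->seqnum = mm->seqnum;
      		}
      		/* lookup/replacement slot chosen by the page number of addr;
      		 * the caller still verifies the returned vma covers addr */
      		return t->slots[(addr >> page_shift) & (CACHE_SLOTS - 1)];
      	}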
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Michel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      615d6e87
  25. 04 Apr 2014: 1 commit
  26. 31 Mar 2014: 1 commit
  27. 19 Mar 2014: 1 commit