1. 10 7月, 2015 1 次提交
    • E
      vfs: Commit to never having exectuables on proc and sysfs. · 90f8572b
      Eric W. Biederman 提交于
      Today proc and sysfs do not contain any executable files.  Several
      applications today mount proc or sysfs without noexec and nosuid and
      then depend on there being no exectuables files on proc or sysfs.
      Having any executable files show on proc or sysfs would cause
      a user space visible regression, and most likely security problems.
      
      Therefore commit to never allowing executables on proc and sysfs by
      adding a new flag to mark them as filesystems without executables and
      enforce that flag.
      
      Test the flag where MNT_NOEXEC is tested today, so that the only user
      visible effect will be that exectuables will be treated as if the
      execute bit is cleared.
      
      The filesystems proc and sysfs do not currently incoporate any
      executable files so this does not result in any user visible effects.
      
      This makes it unnecessary to vet changes to proc and sysfs tightly for
      adding exectuable files or changes to chattr that would modify
      existing files, as no matter what the individual file say they will
      not be treated as exectuable files by the vfs.
      
      Not having to vet changes to closely is important as without this we
      are only one proc_create call (or another goof up in the
      implementation of notify_change) from having problematic executables
      on proc.  Those mistakes are all too easy to make and would create
      a situation where there are security issues or the assumptions of
      some program having to be broken (and cause userspace regressions).
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      90f8572b
  2. 25 6月, 2015 1 次提交
  3. 16 4月, 2015 2 次提交
  4. 15 4月, 2015 1 次提交
  5. 26 3月, 2015 1 次提交
  6. 12 2月, 2015 3 次提交
    • R
      mm/mmap.c: fix arithmetic overflow in __vm_enough_memory() · 5703b087
      Roman Gushchin 提交于
      I noticed, that "allowed" can easily overflow by falling below 0,
      because (total_vm / 32) can be larger than "allowed".  The problem
      occurs in OVERCOMMIT_NONE mode.
      
      In this case, a huge allocation can success and overcommit the system
      (despite OVERCOMMIT_NONE mode).  All subsequent allocations will fall
      (system-wide), so system become unusable.
      
      The problem was masked out by commit c9b1d098
      ("mm: limit growth of 3% hardcoded other user reserve"),
      but it's easy to reproduce it on older kernels:
      1) set overcommit_memory sysctl to 2
      2) mmap() large file multiple times (with VM_SHARED flag)
      3) try to malloc() large amount of memory
      
      It also can be reproduced on newer kernels, but miss-configured
      sysctl_user_reserve_kbytes is required.
      
      Fix this issue by switching to signed arithmetic here.
      
      [akpm@linux-foundation.org: use min_t]
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Cc: Andrew Shewmaker <agshew@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5703b087
    • K
      mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
      Kirill A. Shutemov 提交于
      The problem is that we check nr_ptes/nr_pmds in exit_mmap() which happens
      *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free() we see offset in counters by the
      time of the checks.
      
      We tried to workaround this by offsetting expected counter value according
      to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap().  But it
      doesn't work in some cases:
      
      1. ARM with LPAE enabled also has non-zero USER_PGTABLES_CEILING, but
         upper addresses occupied with huge pmd entries, so the trick with
         offsetting expected counter value will get really ugly: we will have
         to apply it nr_pmds, but not nr_ptes.
      
      2. Metag has non-zero FIRST_USER_ADDRESS, but doesn't do allocation
         pte/pmd page tables allocation in pgd_alloc(), just setup a pgd entry
         which is allocated at boot and shared accross all processes.
      
      The proposal is to move the check to check_mm() which happens *after*
      pgd_free() and do proper accounting during pgd_alloc() and pgd_free()
      which would bring counters to zero if nothing leaked.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NTyler Baker <tyler.baker@linaro.org>
      Tested-by: NTyler Baker <tyler.baker@linaro.org>
      Tested-by: NNishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
  7. 11 2月, 2015 2 次提交
    • K
      rmap: drop support of non-linear mappings · 27ba0644
      Kirill A. Shutemov 提交于
      We don't create non-linear mappings anymore.  Let's drop code which
      handles them in rmap.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27ba0644
    • K
      mm: replace remap_file_pages() syscall with emulation · c8d78c18
      Kirill A. Shutemov 提交于
      remap_file_pages(2) was invented to be able efficiently map parts of
      huge file into limited 32-bit virtual address space such as in database
      workloads.
      
      Nonlinear mappings are pain to support and it seems there's no
      legitimate use-cases nowadays since 64-bit systems are widely available.
      
      Let's drop it and get rid of all these special-cased code.
      
      The patch replaces the syscall with emulation which creates new VMA on
      each remap_file_pages(), unless they it can be merged with an adjacent
      one.
      
      I didn't find *any* real code that uses remap_file_pages(2) to test
      emulation impact on.  I've checked Debian code search and source of all
      packages in ALT Linux.  No real users: libc wrappers, mentions in
      strace, gdb, valgrind and this kind of stuff.
      
      There are few basic tests in LTP for the syscall.  They work just fine
      with emulation.
      
      To test performance impact, I've written small test case which
      demonstrate pretty much worst case scenario: map 4G shmfs file, write to
      begin of every page pgoff of the page, remap pages in reverse order,
      read every page.
      
      The test creates 1 million of VMAs if emulation is in use, so I had to
      set vm.max_map_count to 1100000 to avoid -ENOMEM.
      
      Before:		23.3 ( +-  4.31% ) seconds
      After:		43.9 ( +-  0.85% ) seconds
      Slowdown:	1.88x
      
      I believe we can live with that.
      
      Test case:
      
              #define _GNU_SOURCE
              #include <assert.h>
              #include <stdlib.h>
              #include <stdio.h>
              #include <sys/mman.h>
      
              #define MB	(1024UL * 1024)
              #define SIZE	(4096 * MB)
      
              int main(int argc, char **argv)
              {
                      unsigned long *p;
                      long i, pass;
      
                      for (pass = 0; pass < 10; pass++) {
                              p = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                              if (p == MAP_FAILED) {
                                      perror("mmap");
                                      return -1;
                              }
      
                              for (i = 0; i < SIZE / 4096; i++)
                                      p[i * 4096 / sizeof(*p)] = i;
      
                              for (i = 0; i < SIZE / 4096; i++) {
                                      if (remap_file_pages(p + i * 4096 / sizeof(*p), 4096,
                                                      0, (SIZE - 4096 * (i + 1)) >> 12, 0)) {
                                              perror("remap_file_pages");
                                              return -1;
                                      }
                              }
      
                              for (i = SIZE / 4096 - 1; i >= 0; i--)
                                      assert(p[i * 4096 / sizeof(*p)] == SIZE / 4096 - i - 1);
      
                              munmap(p, SIZE);
                      }
      
                      return 0;
              }
      
      [akpm@linux-foundation.org: fix spello]
      [sasha.levin@oracle.com: initialize populate before usage]
      [sasha.levin@oracle.com: grab file ref to prevent race while mmaping]
      Signed-off-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Armin Rigo <arigo@tunes.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c8d78c18
  8. 12 1月, 2015 2 次提交
  9. 14 12月, 2014 3 次提交
  10. 04 12月, 2014 1 次提交
    • D
      mm: fix anon_vma_clone() error treatment · c4ea95d7
      Daniel Forrest 提交于
      Andrew Morton noticed that the error return from anon_vma_clone() was
      being dropped and replaced with -ENOMEM (which is not itself a bug
      because the only error return value from anon_vma_clone() is -ENOMEM).
      
      I did an audit of callers of anon_vma_clone() and discovered an actual
      bug where the error return was being lost.  In __split_vma(), between
      Linux 3.11 and 3.12 the code was changed so the err variable is used
      before the call to anon_vma_clone() and the default initial value of
      -ENOMEM is overwritten.  So a failure of anon_vma_clone() will return
      success since err at this point is now zero.
      
      Below is a patch which fixes this bug and also propagates the error
      return value from anon_vma_clone() in all cases.
      
      Fixes: ef0855d3 ("mm: mempolicy: turn vma_set_policy() into vma_dup_policy()")
      Signed-off-by: NDaniel Forrest <dan.forrest@ssec.wisc.edu>
      Reviewed-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tim Hartrick <tim@edgecast.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c4ea95d7
  11. 18 11月, 2014 1 次提交
    • D
      x86, mpx: Cleanup unused bound tables · 1de4fa14
      Dave Hansen 提交于
      The previous patch allocates bounds tables on-demand.  As noted in
      an earlier description, these can add up to *HUGE* amounts of
      memory.  This has caused OOMs in practice when running tests.
      
      This patch adds support for freeing bounds tables when they are no
      longer in use.
      
      There are two types of mappings in play when unmapping tables:
       1. The mapping with the actual data, which userspace is
          munmap()ing or brk()ing away, etc...
       2. The mapping for the bounds table *backing* the data
          (is tagged with VM_MPX, see the patch "add MPX specific
          mmap interface").
      
      If userspace use the prctl() indroduced earlier in this patchset
      to enable the management of bounds tables in kernel, when it
      unmaps the first type of mapping with the actual data, the kernel
      needs to free the mapping for the bounds table backing the data.
      This patch hooks in at the very end of do_unmap() to do so.
      We look at the addresses being unmapped and find the bounds
      directory entries and tables which cover those addresses.  If
      an entire table is unused, we clear associated directory entry
      and free the table.
      
      Once we unmap the bounds table, we would have a bounds directory
      entry pointing at empty address space. That address space might
      now be allocated for some other (random) use, and the MPX
      hardware might now try to walk it as if it were a bounds table.
      That would be bad.  So any unmapping of an enture bounds table
      has to be accompanied by a corresponding write to the bounds
      directory entry to invalidate it.  That write to the bounds
      directory can fault, which causes the following problem:
      
      Since we are doing the freeing from munmap() (and other paths
      like it), we hold mmap_sem for write. If we fault, the page
      fault handler will attempt to acquire mmap_sem for read and
      we will deadlock.  To avoid the deadlock, we pagefault_disable()
      when touching the bounds directory entry and use a
      get_user_pages() to resolve the fault.
      
      The unmapping of bounds tables happends under vm_munmap().  We
      also (indirectly) call vm_munmap() to _do_ the unmapping of the
      bounds tables.  We avoid unbounded recursion by disallowing
      freeing of bounds tables *for* bounds tables.  This would not
      occur normally, so should not have any practical impact.  Being
      strict about it here helps ensure that we do not have an
      exploitable stack overflow.
      Based-on-patch-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151831.E4531C4A@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1de4fa14
  12. 30 10月, 2014 1 次提交
    • D
      mm, thp: fix collapsing of hugepages on madvise · 6d50e60c
      David Rientjes 提交于
      If an anonymous mapping is not allowed to fault thp memory and then
      madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
      collapse this memory into thp memory.
      
      This occurs because the madvise(2) handler for thp, hugepage_madvise(),
      clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
      until the final action of madvise_behavior().  This causes the
      khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
      the vma had previously had VM_NOHUGEPAGE set.
      
      Fix this by passing the correct vma flags to the khugepaged mm slot
      handler.  There's no chance khugepaged can run on this vma until after
      madvise_behavior() returns since we hold mm->mmap_sem.
      
      It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
      in hugepage_advise(), but I didn't want to introduce special case
      behavior into madvise_behavior().  I think it's best to just let it
      always set vma->vm_flags itself.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NSuleiman Souhlal <suleiman@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d50e60c
  13. 14 10月, 2014 1 次提交
    • P
      mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared · 64e45507
      Peter Feiner 提交于
      For VMAs that don't want write notifications, PTEs created for read faults
      have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
      cleared, then the PTE's softdirty bit will remain clear after subsequent
      writes.
      
      Here's a simple code snippet to demonstrate the bug:
      
        char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(x));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(x));  /* fails */
      
      With this patch, write notifications are enabled when VM_SOFTDIRTY is
      cleared.  Furthermore, to avoid unnecessary faults, write notifications
      are disabled when VM_SOFTDIRTY is set.
      
      As a side effect of enabling and disabling write notifications with
      care, this patch fixes a bug in mprotect where vm_page_prot bits set by
      drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
      commit c9d0bf24 ("mm: uncached vma support with writenotify").
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Reported-by: NPeter Feiner <pfeiner@google.com>
      Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64e45507
  14. 10 10月, 2014 5 次提交
  15. 11 9月, 2014 1 次提交
  16. 08 9月, 2014 1 次提交
    • T
      percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo 提交于
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: N"David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  17. 09 8月, 2014 1 次提交
    • D
      mm: allow drivers to prevent new writable mappings · 4bb5f5d9
      David Herrmann 提交于
      This patch (of 6):
      
      The i_mmap_writable field counts existing writable mappings of an
      address_space.  To allow drivers to prevent new writable mappings, make
      this counter signed and prevent new writable mappings if it is negative.
      This is modelled after i_writecount and DENYWRITE.
      
      This will be required by the shmem-sealing infrastructure to prevent any
      new writable mappings after the WRITE seal has been set.  In case there
      exists a writable mapping, this operation will fail with EBUSY.
      
      Note that we rely on the fact that iff you already own a writable mapping,
      you can increase the counter without using the helpers.  This is the same
      that we do for i_writecount.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bb5f5d9
  18. 07 8月, 2014 1 次提交
  19. 07 6月, 2014 1 次提交
  20. 05 6月, 2014 2 次提交
  21. 21 5月, 2014 1 次提交
  22. 08 4月, 2014 1 次提交
    • D
      mm: per-thread vma caching · 615d6e87
      Davidlohr Bueso 提交于
      This patch is a continuation of efforts trying to optimize find_vma(),
      avoiding potentially expensive rbtree walks to locate a vma upon faults.
      The original approach (https://lkml.org/lkml/2013/11/1/410), where the
      largest vma was also cached, ended up being too specific and random,
      thus further comparison with other approaches were needed.  There are
      two things to consider when dealing with this, the cache hit rate and
      the latency of find_vma().  Improving the hit-rate does not necessarily
      translate in finding the vma any faster, as the overhead of any fancy
      caching schemes can be too high to consider.
      
      We currently cache the last used vma for the whole address space, which
      provides a nice optimization, reducing the total cycles in find_vma() by
      up to 250%, for workloads with good locality.  On the other hand, this
      simple scheme is pretty much useless for workloads with poor locality.
      Analyzing ebizzy runs shows that, no matter how many threads are
      running, the mmap_cache hit rate is less than 2%, and in many situations
      below 1%.
      
      The proposed approach is to replace this scheme with a small per-thread
      cache, maximizing hit rates at a very low maintenance cost.
      Invalidations are performed by simply bumping up a 32-bit sequence
      number.  The only expensive operation is in the rare case of a seq
      number overflow, where all caches that share the same address space are
      flushed.  Upon a miss, the proposed replacement policy is based on the
      page number that contains the virtual address in question.  Concretely,
      the following results are seen on an 80 core, 8 socket x86-64 box:
      
      1) System bootup: Most programs are single threaded, so the per-thread
         scheme does improve ~50% hit rate by just adding a few more slots to
         the cache.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 50.61%   | 19.90            |
      | patched        | 73.45%   | 13.58            |
      +----------------+----------+------------------+
      
      2) Kernel build: This one is already pretty good with the current
         approach as we're dealing with good locality.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 75.28%   | 11.03            |
      | patched        | 88.09%   | 9.31             |
      +----------------+----------+------------------+
      
      3) Oracle 11g Data Mining (4k pages): Similar to the kernel build workload.
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 70.66%   | 17.14            |
      | patched        | 91.15%   | 12.57            |
      +----------------+----------+------------------+
      
      4) Ebizzy: There's a fair amount of variation from run to run, but this
         approach always shows nearly perfect hit rates, while baseline is just
         about non-existent.  The amounts of cycles can fluctuate between
         anywhere from ~60 to ~116 for the baseline scheme, but this approach
         reduces it considerably.  For instance, with 80 threads:
      
      +----------------+----------+------------------+
      | caching scheme | hit-rate | cycles (billion) |
      +----------------+----------+------------------+
      | baseline       | 1.06%    | 91.54            |
      | patched        | 99.97%   | 14.18            |
      +----------------+----------+------------------+
      
      [akpm@linux-foundation.org: fix nommu build, per Davidlohr]
      [akpm@linux-foundation.org: document vmacache_valid() logic]
      [akpm@linux-foundation.org: attempt to untangle header files]
      [akpm@linux-foundation.org: add vmacache_find() BUG_ON]
      [hughd@google.com: add vmacache_valid_mm() (from Oleg)]
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: adjust and enhance comments]
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Tested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      615d6e87
  23. 04 4月, 2014 1 次提交
  24. 31 3月, 2014 1 次提交
  25. 19 3月, 2014 1 次提交
  26. 24 1月, 2014 2 次提交
    • C
      mm: ignore VM_SOFTDIRTY on VMA merging · 34228d47
      Cyrill Gorcunov 提交于
      The VM_SOFTDIRTY bit affects vma merge routine: if two VMAs has all bits
      in vm_flags matched except dirty bit the kernel can't longer merge them
      and this forces the kernel to generate new VMAs instead.
      
      It finally may lead to the situation when userspace application reaches
      vm.max_map_count limit and get crashed in worse case
      
       | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
       |
       | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
       | xinit: connection to X server lost
       |
       | waiting for X server to shut down
       | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
       | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
      
        https://bugzilla.kernel.org/show_bug.cgi?id=67651
        https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0
      
      Initial problem came from missed VM_SOFTDIRTY in do_brk() routine but
      even if we would set up VM_SOFTDIRTY here, there is still a way to
      prevent VMAs from merging: one can call
      
       | echo 4 > /proc/$PID/clear_refs
      
      and clear all VM_SOFTDIRTY over all VMAs presented in memory map, then
      new do_brk() will try to extend old VMA and finds that dirty bit doesn't
      match thus new VMA will be generated.
      
      As discussed with Pavel, the right approach should be to ignore
      VM_SOFTDIRTY bit when we're trying to merge VMAs and if merge successed
      we mark extended VMA with dirty bit where needed.
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reported-by: NBastian Hougaard <gnome@rvzt.net>
      Reported-by: NMel Gorman <mgorman@suse.de>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34228d47
    • P
      mm: audit/fix non-modular users of module_init in core code · a64fb3cd
      Paul Gortmaker 提交于
      Code that is obj-y (always built-in) or dependent on a bool Kconfig
      (built-in or absent) can never be modular.  So using module_init as an
      alias for __initcall can be somewhat misleading.
      
      Fix these up now, so that we can relocate module_init from init.h into
      module.h in the future.  If we don't do this, we'd have to add module.h
      to obviously non-modular code, and that would be a worse thing.
      
      The audit targets the following module_init users for change:
       mm/ksm.c                       bool KSM
       mm/mmap.c                      bool MMU
       mm/huge_memory.c               bool TRANSPARENT_HUGEPAGE
       mm/mmu_notifier.c              bool MMU_NOTIFIER
      
      Note that direct use of __initcall is discouraged, vs.  one of the
      priority categorized subgroups.  As __initcall gets mapped onto
      device_initcall, our use of subsys_initcall (which makes sense for these
      files) will thus change this registration from level 6-device to level
      4-subsys (i.e.  slightly earlier).
      
      However no observable impact of that difference has been observed during
      testing.
      
      One might think that core_initcall (l2) or postcore_initcall (l3) would
      be more appropriate for anything in mm/ but if we look at some actual
      init functions themselves, we see things like:
      
      mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
      mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
      mm/ksm.c         --> ksm_init          --> sysfs_create_group
      
      and hence the choice of subsys_initcall (l4) seems reasonable, and at
      the same time minimizes the risk of changing the priority too
      drastically all at once.  We can adjust further in the future.
      
      Also, several instances of missing ";" at EOL are fixed.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a64fb3cd
  27. 22 1月, 2014 1 次提交