1. 22 Mar, 2012 3 commits
  2. 07 Mar, 2012 2 commits
  3. 06 Mar, 2012 1 commit
  4. 11 Jan, 2012 2 commits
    • K
      mm: simplify find_vma_prev() · 6bd4837d
      KOSAKI Motohiro committed on
      commit 297c5eee ("mm: make the vma list be doubly linked") added the
      vm_prev member to vm_area_struct.  We can simplify find_vma_prev() by
      using it.  Also, this change helps to improve page fault performance
      because it has stronger locality of reference.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
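The idea of using vm_prev can be illustrated with a minimal user-space sketch. It is not the kernel code: `struct vma` is a simplified stand-in for vm_area_struct, and `find_vma()` here is a plain linear walk rather than the kernel's tree lookup. The case where no vma is found, so that the predecessor is the last vma in the list, is handled explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified user-space stand-in for the kernel's vm_area_struct. */
struct vma {
	unsigned long vm_start;        /* first address in the area */
	unsigned long vm_end;          /* first address past the area */
	struct vma *vm_next, *vm_prev; /* address-sorted doubly linked list */
};

/* Return the first area with addr < vm_end, or NULL if there is none. */
struct vma *find_vma(struct vma *head, unsigned long addr)
{
	struct vma *v;

	for (v = head; v != NULL; v = v->vm_next)
		if (addr < v->vm_end)
			return v;
	return NULL;
}

/* Same lookup, but also report the preceding area through *pprev. */
struct vma *find_vma_prev(struct vma *head, unsigned long addr,
			  struct vma **pprev)
{
	struct vma *vma = find_vma(head, addr);

	assert(pprev != NULL);
	if (vma != NULL) {
		/* Common case: the predecessor is one pointer away. */
		*pprev = vma->vm_prev;
	} else {
		/* addr lies past every area: the predecessor is the last one. */
		struct vma *last = head;

		while (last != NULL && last->vm_next != NULL)
			last = last->vm_next;
		*pprev = last;
	}
	return vma;
}
```

With vm_prev available, the common case needs no second walk from the head of the list, which is the locality benefit the commit message refers to.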
    • A
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli committed on
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lot of parents or children
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed that special care is needed in the error path, where
      move_page_tables goes in the reverse direction: a second
      anon_vma_moveto_tail() call is needed there.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      #define _GNU_SOURCE	/* for mremap() and MREMAP_* */
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      
      /* Assumed value: the original #define is missing from the commit message. */
      #define SIZE (4UL*1024*1024)
      
      int main(void)
      {
      	char *p, *p2, *p3, *p4;
      
      	/* p is remapped, p2 is a reference copy, p3 is the temporary target */
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	/* move the second half of p onto p3 ... */
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	/* ... and back again, so p must compare equal to p2 afterwards */
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Nai Xia <nai.xia@gmail.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
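The ordering fix described in this commit comes down to a "move to tail" operation on a list that is walked front to back. The following is a minimal user-space sketch of that operation, assuming a circular doubly linked list modelled on the kernel's list_head. All names here (`list_node`, `list_position()`) are hypothetical, and it does not reproduce anon_vma_moveto_tail() or its locking.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_node {
	struct list_node *prev, *next;
};

void list_init(struct list_node *head)
{
	head->prev = head->next = head;
}

/* Link node in just before head, i.e. at the tail of the list. */
void list_add_tail(struct list_node *node, struct list_node *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Unlink node and re-add it at the tail, so a forward walk visits it last. */
void list_move_tail(struct list_node *node, struct list_node *head)
{
	assert(node != head);
	node->prev->next = node->next;
	node->next->prev = node->prev;
	list_add_tail(node, head);
}

/* Return the 0-based position of node in a forward walk, or -1 if absent. */
int list_position(const struct list_node *node, const struct list_node *head)
{
	const struct list_node *cur;
	int i = 0;

	for (cur = head->next; cur != head; cur = cur->next, i++)
		if (cur == node)
			return i;
	return -1;
}
```

If the dst entry sits ahead of the src entry, calling `list_move_tail()` on dst guarantees that a forward walk reaches src first and dst last, which is the order the commit message asks for.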
  5. 01 Nov, 2011 1 commit
  6. 31 Oct, 2011 1 commit
  7. 26 Jul, 2011 1 commit
  8. 16 Jun, 2011 1 commit
    • L
      mm: get rid of the most spurious find_vma_prev() users · 9be34c9d
      Linus Torvalds committed on
      We have some users of this function that date back to before the vma
      list was doubly linked, and just are silly.  These days, you can find
      the previous vma by just following the vma->vm_prev pointer.
      
      In some cases you don't need any find_vma() lookup at all, and in other
      cases you're better off with the regular "find_vma()" that uses the vma
      cache front-end lookup.
      
      Some "find_vma_prev()" users are still valid, though.  For example, in
      the case of a stack that grows up, it can be the case that we don't find
      any 'vma' at all (because we're looking up an address that is past the
      last vma), and that the stack that we want to grow is the 'prev' vma.
      
      But that kind of special case aside, we generally should prefer to use
      'find_vma()'.
      
      Noticed due to a totally unrelated POWER memory corruption bug that just
      happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
      using that function here?".
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 27 May, 2011 1 commit
  10. 25 May, 2011 9 commits
    • P
      mm: convert anon_vma->lock to a mutex · 2b575eb6
      Peter Zijlstra committed on
      Straightforward conversion of anon_vma->lock to a mutex.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra committed on
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra committed on
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped?"
      
      Counterpoints:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • P
      mm: mmu_gather rework · d16dfc55
      Peter Zijlstra committed on
      Rework the existing mmu_gather infrastructure.
      
      The direct purpose of these patches was to allow preemptible mmu_gather,
      but even without that I think these patches provide an improvement to the
      status quo.
      
      The first 9 patches rework the mmu_gather infrastructure.  For review
      purpose I've split them into generic and per-arch patches with the last of
      those a generic cleanup.
      
      The next patch provides generic RCU page-table freeing, and the followup
      is a patch converting s390 to use this.  I've also got 4 patches from
      DaveM lined up (not included in this series) that use this to implement
      gup_fast() for sparc64.
      
      Then there is one patch that extends the generic mmu_gather batching.
      
      After that follow the mm preemptibility patches, which make part of the mm
      a lot more preemptible.  They convert i_mmap_lock and anon_vma->lock to
      mutexes, which together with the mmu_gather rework makes mmu_gather
      preemptible as well.
      
      Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
      
      This also allows for preemptible mmu_notifiers, something that XPMEM I
      think wants.
      
      Furthermore, it removes the new and universally detested unmap_mutex.
      
      This patch:
      
      Remove the first obstacle towards a fully preemptible mmu_gather.
      
      The current scheme assumes mmu_gather is always done with preemption
      disabled and uses per-cpu storage for the page batches.  Change this to
      try and allocate a page for batching and in case of failure, use a small
      on-stack array to make some progress.
      
      Preemptible mmu_gather is desired in general and usable once i_mmap_lock
      becomes a mutex.  Doing it before the mutex conversion saves us from
      having to rework the code by moving the mmu_gather bits inside the
      pte_lock.
      
      Also avoid flushing the tlb batches from under the pte lock; this is
      useful even without the i_mmap_lock conversion as it significantly reduces
      pte lock hold times.
      
      [akpm@linux-foundation.org: fix comment tpyo]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
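The batching scheme in "This patch" (try to allocate storage for a batch, and fall back to a small on-stack array if that fails) can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct gather`, its functions, and the slot counts are hypothetical and are not the kernel's struct mmu_gather API, and the flush step merely counts drains instead of flushing the TLB and freeing pages.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define LOCAL_SLOTS 8   /* size of the embedded fallback array; assumed value */
#define BATCH_SLOTS 512 /* slots in a heap-allocated batch; assumed value */

/* Hypothetical gather context; not the kernel's struct mmu_gather. */
struct gather {
	void **slots;             /* storage currently in use */
	size_t nr, max;           /* used and available slots */
	void *local[LOCAL_SLOTS]; /* small array inside the caller's object */
	size_t flushes;           /* how many times the batch was drained */
};

/* Try to get a large batch; fall back to the embedded array on failure. */
void gather_init(struct gather *g, int allow_alloc)
{
	void **big = allow_alloc ? malloc(BATCH_SLOTS * sizeof(void *)) : NULL;

	g->slots = big ? big : g->local;
	g->max = big ? BATCH_SLOTS : LOCAL_SLOTS;
	g->nr = 0;
	g->flushes = 0;
}

/* Drain the batch. A real implementation would flush and free pages here. */
void gather_flush(struct gather *g)
{
	if (g->nr == 0)
		return;
	g->nr = 0;
	g->flushes++;
}

/* Queue one page, draining first if the batch is already full. */
void gather_add(struct gather *g, void *page)
{
	assert(g->slots != NULL);
	if (g->nr == g->max)
		gather_flush(g);
	g->slots[g->nr++] = page;
}

/* Drain what is left and release the heap batch, if one was allocated. */
void gather_finish(struct gather *g)
{
	gather_flush(g);
	if (g->slots != g->local)
		free(g->slots);
	g->slots = NULL;
}
```

The point of the fallback is that an allocation failure only makes batches smaller and flushes more frequent; the caller still makes progress without per-cpu storage.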
    • M
      mm: make expand_downwards() symmetrical with expand_upwards() · d05f3169
      Michal Hocko committed on
      Currently we have expand_upwards exported while expand_downwards is
      accessible only via expand_stack or expand_stack_downwards.
      
      check_stack_guard_page is a nice example of the asymmetry.  It uses
      expand_stack for VM_GROWSDOWN while expand_upwards is called for
      VM_GROWSUP case.
      
      Let's clean this up by exporting both functions and making their names
      consistent.  Let's use expand_{upwards,downwards} because expanding
      doesn't always involve stack manipulation (an example is
      ia64_do_page_fault which uses expand_upwards for registers backing store
      expansion).  expand_downwards has to be defined for both
      CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
      version in the early process initialization phase for growsup
      configuration.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • N
      mm: nommu: sort mm->mmap list properly · 6038def0
      Namhyung Kim committed on
      When I was reading nommu code, I found that it handles the vma list/tree
      in an unusual way.  IIUC, because there can be more than one
      identical/overlapping vma in the list/tree, it sorts the tree more
      strictly and does a linear search on the tree.  But this isn't applied to
      the list (i.e.  the list could be constructed in a different order than
      the tree so that we can't use the list when finding the first vma in that
      order).
      
      Since inserting/sorting a vma in the tree and linking it into the list
      are done at the same time, we can easily construct both of them in the
      same order.  And since a linear search on the tree could be more costly
      than one on the list, the search can be converted to use the list.
      
      Also, after commit 297c5eee ("mm: make the vma list be doubly
      linked") made the list doubly linked, a couple of places in the code
      needed to be fixed to construct the list properly.
      
      Patch 1/6 is a preparation.  It keeps the list sorted the same way as the
      tree and constructs the doubly-linked list properly.  Patch 2/6 is a simple
      optimization for the vma deletion.  Patch 3/6 and 4/6 convert tree
      traversal to list traversal and the rest are simple fixes and cleanups.
      
      This patch:
      
      @vma added into @mm should be sorted by start addr, end addr and VMA
      struct addr in that order because we may get identical VMAs in the @mm.
      However this was true only for the rbtree, not for the list.
      
      This patch fixes this by remembering 'rb_prev' during the tree traversal
      like find_vma_prepare() does and linking the @vma via __vma_link_list().
      After this patch, we can iterate over all VMAs in the correct order simply
      by using the @mm->mmap list.
      
      [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
      Signed-off-by: Namhyung Kim <namhyung@gmail.com>
      Acked-by: Greg Ungerer <gerg@uclinux.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
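The three-key ordering (start address, end address, then the address of the VMA struct itself) can be written as a comparison plus a sorted insert into a doubly linked list. The sketch below is a user-space illustration with hypothetical names. It finds the insertion point by walking the list, whereas the patch remembers 'rb_prev' during the tree traversal and links via __vma_link_list().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a VMA; only the fields the ordering needs. */
struct vma {
	unsigned long vm_start, vm_end;
	struct vma *vm_next, *vm_prev;
};

/* Order by start address, then end address, then the struct's own address. */
int vma_before(const struct vma *a, const struct vma *b)
{
	if (a->vm_start != b->vm_start)
		return a->vm_start < b->vm_start;
	if (a->vm_end != b->vm_end)
		return a->vm_end < b->vm_end;
	return (uintptr_t)a < (uintptr_t)b;
}

/* Insert vma into the sorted list at *head, fixing both link directions. */
void vma_insert_sorted(struct vma **head, struct vma *vma)
{
	struct vma *prev = NULL, *cur;

	assert(head != NULL && vma != NULL);
	/* Skip every entry that sorts before the new one. */
	for (cur = *head; cur != NULL && vma_before(cur, vma); cur = cur->vm_next)
		prev = cur;

	vma->vm_prev = prev;
	vma->vm_next = cur;
	if (prev != NULL)
		prev->vm_next = vma;
	else
		*head = vma;
	if (cur != NULL)
		cur->vm_prev = vma;
}
```

Using the struct address as the final tie-breaker gives identical VMAs a stable, total order, so the list and the tree can agree on which one comes first.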
    • S
      mmap: avoid merging cloned VMAs · 965f55de
      Shaohua Li committed on
      Avoid merging a VMA with another VMA which is cloned from the parent process.
      
      The cloned VMA shares the anon_vma lock with the parent process's VMA.  If
      we do the merge, more vmas (even when the new range is only for the current
      process) use the parent process's anon_vma lock.  This introduces
      scalability issues.  find_mergeable_anon_vma() already considers this
      case.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mmap: avoid unnecessary anon_vma lock · 5f70b962
      Shaohua Li committed on
      If we only change vma->vm_end, we can avoid taking anon_vma lock even if
      'insert' isn't NULL, which is the case of split_vma.
      
      As I understand it, we needed the lock before because rmap must get the
      'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA was linked
      to anon_vma list in __insert_vm_struct before).
      
      But now this isn't true any more.  The 'insert' VMA is already linked to
      anon_vma list in __split_vma() (with anon_vma_clone()) instead of
      __insert_vm_struct.  There is no race in which rmap can't get the required
      VMAs.  So the anon_vma lock is unnecessary, and this can reduce one locking
      in brk case and improve scalability.
      
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • S
      mmap: add alignment for some variables · 34679d7e
      Shaohua Li committed on
      Make some variables have correct alignment/section to avoid cache issues.
      In a workload which heavily does mmap/munmap, the variables will be used
      frequently.
      Signed-off-by: Shaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
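One common reading of "correct alignment" for frequently used variables is giving each hot variable its own cache line, so that writes to one do not invalidate the line holding another. The commit message does not say which variables or annotations were changed, so the following C11 sketch is only an illustration: the 64-byte line size and the variable names are assumptions.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size; varies by CPU */

/* A frequently written counter that occupies a whole cache line. */
struct hot_counter {
	alignas(CACHE_LINE) long value;
};

/* The alignment requirement also pads the struct out to a full line. */
static_assert(sizeof(struct hot_counter) == CACHE_LINE,
	      "hot_counter should fill exactly one cache line");

/* Hypothetical hot variables; each one starts on its own line. */
struct hot_counter committed_pages;
struct hot_counter mapped_pages;

/* Report whether two addresses fall into the same cache line. */
int shares_cache_line(const void *a, const void *b)
{
	return (uintptr_t)a / CACHE_LINE == (uintptr_t)b / CACHE_LINE;
}
```

The cost is memory: each counter now takes a full line, so this only pays off for variables that are touched heavily, as in the mmap/munmap workload the commit mentions.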
  11. 10 May, 2011 1 commit
  12. 15 Apr, 2011 1 commit
  13. 13 Apr, 2011 1 commit
  14. 14 Jan, 2011 3 commits
  15. 16 Dec, 2010 1 commit
    • T
      install_special_mapping skips security_file_mmap check. · 462e635e
      Tavis Ormandy committed on
      The install_special_mapping routine (used, for example, to set up the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: Tavis Ormandy <taviso@google.com>
      Acked-by: Kees Cook <kees@ubuntu.com>
      Acked-by: Robert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: James Morris <jmorris@namei.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 30 Oct, 2010 1 commit
    • A
      audit mmap · 120a795d
      Al Viro committed on
      Normal syscall audit doesn't catch 5th argument of syscall.  It also
      doesn't catch the contents of userland structures pointed to by
      syscall arguments, so for both old and new mmap(2) ABI it doesn't
      record the descriptor we are mapping.  For old one it also misses
      flags.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 23 Sep, 2010 1 commit
  18. 25 Aug, 2010 1 commit
  19. 21 Aug, 2010 1 commit
  20. 10 Aug, 2010 4 commits
  21. 09 Jun, 2010 1 commit
  22. 27 Apr, 2010 1 commit
  23. 13 Apr, 2010 1 commit