1. 14 2月, 2012 1 次提交
  2. 11 1月, 2012 2 次提交
    • K
      mm: simplify find_vma_prev() · 6bd4837d
      KOSAKI Motohiro 提交于
      commit 297c5eee ("mm: make the vma list be doubly linked") added the
      vm_prev member to vm_area_struct.  We can simplify find_vma_prev() by
      using it.  Also, this change helps to improve page fault performance
      because it has stronger locality of reference.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bd4837d
    • A
      mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma() · 948f017b
      Andrea Arcangeli 提交于
      migrate was doing an rmap_walk with speculative lock-less access on
      pagetables.  That could lead it to not serializing properly against mremap
      PT locks.  But a second problem remains in the order of vmas in the
      same_anon_vma list used by the rmap_walk.
      
      If vma_merge succeeds in copy_vma, the src vma could be placed after the
      dst vma in the same_anon_vma list.  That could still lead to migrate
      missing some pte.
      
      This patch adds an anon_vma_moveto_tail() function to force the dst vma at
      the end of the list before mremap starts to solve the problem.
      
      If the mremap is very large and there are a lots of parents or childs
      sharing the anon_vma root lock, this should still scale better than taking
      the anon_vma root lock around every pte copy practically for the whole
      duration of mremap.
      
      Update: Hugh noticed special care is needed in the error path where
      move_page_tables goes in the reverse direction, a second
      anon_vma_moveto_tail() call is needed in the error path.
      
      This program exercises the anon_vma_moveto_tail:
      
      ===
      
      int main()
      {
      	static struct timeval oldstamp, newstamp;
      	long diffsec;
      	char *p, *p2, *p3, *p4;
      	if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      	if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
      		perror("memalign"), exit(1);
      
      	memset(p, 0xff, SIZE);
      	printf("%p\n", p);
      	memset(p2, 0xff, SIZE);
      	memset(p3, 0x77, 4096);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
      	if (p4 != p3)
      		perror("mremap"), exit(1);
      	p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
      	if (p4 != p+SIZE/2)
      		perror("mremap"), exit(1);
      	if (memcmp(p, p2, SIZE))
      		printf("error\n");
      	printf("ok\n");
      
      	return 0;
      }
      ===
      
      $ perf probe -a anon_vma_moveto_tail
      Add new event:
        probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
      
      You can now use it on all perf tools, such as:
      
              perf record -e probe:anon_vma_moveto_tail -aR sleep 1
      
      $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
      0x7f2ca2800000
      ok
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
      $ perf report --stdio
         100.00%  anon_vma_moveto  [kernel.kallsyms]  [k] anon_vma_moveto_tail
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NNai Xia <nai.xia@gmail.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Pawel Sikora <pluto@agmk.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      948f017b
  3. 01 11月, 2011 1 次提交
  4. 31 10月, 2011 1 次提交
  5. 26 7月, 2011 1 次提交
  6. 16 6月, 2011 1 次提交
    • L
      mm: get rid of the most spurious find_vma_prev() users · 9be34c9d
      Linus Torvalds 提交于
      We have some users of this function that date back to before the vma
      list was doubly linked, and just are silly.  These days, you can find
      the previous vma by just following the vma->vm_prev pointer.
      
      In some cases you don't need any find_vma() lookup at all, and in other
      cases you're better off with the regular "find_vma()" that uses the vma
      cache front-end lookup.
      
      Some "find_vma_prev()" users are still valid, though.  For example, in
      the case of a stack that grows up, it can be the case that we don't find
      any 'vma' at all (because we're looking up an address that is past the
      last vma), and that the stack that we want to grow is the 'prev' vma.
      
      But that kind of special case aside, we generally should prefer to use
      'find_vma()'.
      
      Noticed due to a totally unrelated POWER memory corruption bug that just
      happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
      using that function here?".
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9be34c9d
  7. 27 5月, 2011 1 次提交
  8. 25 5月, 2011 9 次提交
    • P
      mm: convert anon_vma->lock to a mutex · 2b575eb6
      Peter Zijlstra 提交于
      Straightforward conversion of anon_vma->lock to a mutex.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b575eb6
    • P
      mm: Convert i_mmap_lock to a mutex · 3d48ae45
      Peter Zijlstra 提交于
      Straightforward conversion of i_mmap_lock to a mutex.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d48ae45
    • P
      mm: Remove i_mmap_lock lockbreak · 97a89413
      Peter Zijlstra 提交于
      Hugh says:
       "The only significant loser, I think, would be page reclaim (when
        concurrent with truncation): could spin for a long time waiting for
        the i_mmap_mutex it expects would soon be dropped? "
      
      Counter points:
       - cpu contention makes the spin stop (need_resched())
       - zap pages should be freeing pages at a higher rate than reclaim
         ever can
      
      I think the simplification of the truncate code is definitely worth it.
      
      Effectively reverts: 2aa15890 ("mm: prevent concurrent
      unmap_mapping_range() on the same inode") and takes out the code that
      caused its problem.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97a89413
    • P
      mm: mmu_gather rework · d16dfc55
      Peter Zijlstra 提交于
      Rework the existing mmu_gather infrastructure.
      
      The direct purpose of these patches was to allow preemptible mmu_gather,
      but even without that I think these patches provide an improvement to the
      status quo.
      
      The first 9 patches rework the mmu_gather infrastructure.  For review
      purpose I've split them into generic and per-arch patches with the last of
      those a generic cleanup.
      
      The next patch provides generic RCU page-table freeing, and the followup
      is a patch converting s390 to use this.  I've also got 4 patches from
      DaveM lined up (not included in this series) that uses this to implement
      gup_fast() for sparc64.
      
      Then there is one patch that extends the generic mmu_gather batching.
      
      After that follow the mm preemptibility patches, these make part of the mm
      a lot more preemptible.  It converts i_mmap_lock and anon_vma->lock to
      mutexes which together with the mmu_gather rework makes mmu_gather
      preemptible as well.
      
      Making i_mmap_lock a mutex also enables a clean-up of the truncate code.
      
      This also allows for preemptible mmu_notifiers, something that XPMEM I
      think wants.
      
      Furthermore, it removes the new and universially detested unmap_mutex.
      
      This patch:
      
      Remove the first obstacle towards a fully preemptible mmu_gather.
      
      The current scheme assumes mmu_gather is always done with preemption
      disabled and uses per-cpu storage for the page batches.  Change this to
      try and allocate a page for batching and in case of failure, use a small
      on-stack array to make some progress.
      
      Preemptible mmu_gather is desired in general and usable once i_mmap_lock
      becomes a mutex.  Doing it before the mutex conversion saves us from
      having to rework the code by moving the mmu_gather bits inside the
      pte_lock.
      
      Also avoid flushing the tlb batches from under the pte lock, this is
      useful even without the i_mmap_lock conversion as it significantly reduces
      pte lock hold times.
      
      [akpm@linux-foundation.org: fix comment tpyo]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Tony Luck <tony.luck@intel.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Namhyung Kim <namhyung@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d16dfc55
    • M
      mm: make expand_downwards() symmetrical with expand_upwards() · d05f3169
      Michal Hocko 提交于
      Currently we have expand_upwards exported while expand_downwards is
      accessible only via expand_stack or expand_stack_downwards.
      
      check_stack_guard_page is a nice example of the asymmetry.  It uses
      expand_stack for VM_GROWSDOWN while expand_upwards is called for
      VM_GROWSUP case.
      
      Let's clean this up by exporting both functions and make those names
      consistent.  Let's use expand_{upwards,downwards} because expanding
      doesn't always involve stack manipulation (an example is
      ia64_do_page_fault which uses expand_upwards for registers backing store
      expansion).  expand_downwards has to be defined for both
      CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
      version in the early process initialization phase for growsup
      configuration.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d05f3169
    • N
      mm: nommu: sort mm->mmap list properly · 6038def0
      Namhyung Kim 提交于
      When I was reading nommu code, I found that it handles the vma list/tree
      in an unusual way.  IIUC, because there can be more than one
      identical/overrapped vmas in the list/tree, it sorts the tree more
      strictly and does a linear search on the tree.  But it doesn't applied to
      the list (i.e.  the list could be constructed in a different order than
      the tree so that we can't use the list when finding the first vma in that
      order).
      
      Since inserting/sorting a vma in the tree and link is done at the same
      time, we can easily construct both of them in the same order.  And linear
      searching on the tree could be more costly than doing it on the list, it
      can be converted to use the list.
      
      Also, after the commit 297c5eee ("mm: make the vma list be doubly
      linked") made the list be doubly linked, there were a couple of code need
      to be fixed to construct the list properly.
      
      Patch 1/6 is a preparation.  It maintains the list sorted same as the tree
      and construct doubly-linked list properly.  Patch 2/6 is a simple
      optimization for the vma deletion.  Patch 3/6 and 4/6 convert tree
      traversal to list traversal and the rest are simple fixes and cleanups.
      
      This patch:
      
      @vma added into @mm should be sorted by start addr, end addr and VMA
      struct addr in that order because we may get identical VMAs in the @mm.
      However this was true only for the rbtree, not for the list.
      
      This patch fixes this by remembering 'rb_prev' during the tree traversal
      like find_vma_prepare() does and linking the @vma via __vma_link_list().
      After this patch, we can iterate the whole VMAs in correct order simply by
      using @mm->mmap list.
      
      [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
      Signed-off-by: NNamhyung Kim <namhyung@gmail.com>
      Acked-by: NGreg Ungerer <gerg@uclinux.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6038def0
    • S
      mmap: avoid merging cloned VMAs · 965f55de
      Shaohua Li 提交于
      Avoid merging a VMA with another VMA which is cloned from the parent process.
      
      The cloned VMA shares the anon_vma lock with the parent process's VMA.  If
      we do the merge, more vmas (even the new range is only for current
      process) use the perent process's anon_vma lock.  This introduces
      scalability issues.  find_mergeable_anon_vma() already considers this
      case.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      965f55de
    • S
      mmap: avoid unnecessary anon_vma lock · 5f70b962
      Shaohua Li 提交于
      If we only change vma->vm_end, we can avoid taking anon_vma lock even if
      'insert' isn't NULL, which is the case of split_vma.
      
      As I understand it, we need the lock before because rmap must get the
      'insert' VMA when we adjust old VMA's vm_end (the 'insert' VMA is linked
      to anon_vma list in __insert_vm_struct before).
      
      But now this isn't true any more.  The 'insert' VMA is already linked to
      anon_vma list in __split_vma(with anon_vma_clone()) instead of
      __insert_vm_struct.  There is no race rmap can't get required VMAs.  So
      the anon_vma lock is unnecessary, and this can reduce one locking in brk
      case and improve scalability.
      
      Signed-off-by: Shaohua Li<shaohua.li@intel.com>
      Cc: Rik van Riel <riel@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f70b962
    • S
      mmap: add alignment for some variables · 34679d7e
      Shaohua Li 提交于
      Make some variables have correct alignment/section to avoid cache issue.
      In a workload which heavily does mmap/munmap, the variables will be used
      frequently.
      Signed-off-by: NShaohua Li <shaohua.li@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34679d7e
  9. 10 5月, 2011 1 次提交
  10. 15 4月, 2011 1 次提交
  11. 13 4月, 2011 1 次提交
  12. 14 1月, 2011 3 次提交
  13. 16 12月, 2010 1 次提交
    • T
      install_special_mapping skips security_file_mmap check. · 462e635e
      Tavis Ormandy 提交于
      The install_special_mapping routine (used, for example, to setup the
      vdso) skips the security check before insert_vm_struct, allowing a local
      attacker to bypass the mmap_min_addr security restriction by limiting
      the available pages for special mappings.
      
      bprm_mm_init() also skips the check, and although I don't think this can
      be used to bypass any restrictions, I don't see any reason not to have
      the security check.
      
        $ uname -m
        x86_64
        $ cat /proc/sys/vm/mmap_min_addr
        65536
        $ cat install_special_mapping.s
        section .bss
            resb BSS_SIZE
        section .text
            global _start
            _start:
                mov     eax, __NR_pause
                int     0x80
        $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
        $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
        $ ./install_special_mapping &
        [1] 14303
        $ cat /proc/14303/maps
        0000f000-00010000 r-xp 00000000 00:00 0                                  [vdso]
        00010000-00011000 r-xp 00001000 00:19 2453665                            /home/taviso/install_special_mapping
        00011000-ffffe000 rwxp 00000000 00:00 0                                  [stack]
      
      It's worth noting that Red Hat are shipping with mmap_min_addr set to
      4096.
      Signed-off-by: NTavis Ormandy <taviso@google.com>
      Acked-by: NKees Cook <kees@ubuntu.com>
      Acked-by: NRobert Swiecki <swiecki@google.com>
      [ Changed to not drop the error code - akpm ]
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      462e635e
  14. 30 10月, 2010 1 次提交
    • A
      audit mmap · 120a795d
      Al Viro 提交于
      Normal syscall audit doesn't catch 5th argument of syscall.  It also
      doesn't catch the contents of userland structures pointed to be
      syscall argument, so for both old and new mmap(2) ABI it doesn't
      record the descriptor we are mapping.  For old one it also misses
      flags.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      120a795d
  15. 23 9月, 2010 1 次提交
  16. 25 8月, 2010 1 次提交
  17. 21 8月, 2010 1 次提交
  18. 10 8月, 2010 4 次提交
  19. 09 6月, 2010 1 次提交
  20. 27 4月, 2010 1 次提交
  21. 13 4月, 2010 2 次提交
    • L
      vma_adjust: fix the copying of anon_vma chains · 287d97ac
      Linus Torvalds 提交于
      When we move the boundaries between two vma's due to things like
      mprotect, we need to make sure that the anon_vma of the pages that got
      moved from one vma to another gets properly copied around.  And that was
      not always the case, in this rather hard-to-follow code sequence.
      
      Clarify the code, and fix it so that it copies the anon_vma from the
      right source.
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "Yeah, not so much this one either" ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      287d97ac
    • L
      Simplify and comment on anon_vma re-use for anon_vma_prepare() · d0e9fe17
      Linus Torvalds 提交于
      This changes the anon_vma reuse case to require that we only reuse
      simple anon_vma's - ie the case when the vma only has a single anon_vma
      associated with it.
      
      This means that a reuse of an anon_vma from an adjacent vma will always
      guarantee that both vma's are associated not only with the same
      anon_vma, they will also have the same anon_vma chain (of just a single
      entry in this case).
      
      And since anon_vma re-use was the only case where the same anon_vma
      might be associated with different chains of anon_vma's, we now have the
      case that every vma that shares the same anon_vma will always also have
      the same chain.  That makes it much easier to think about merging vma's
      that share the same anon_vma's: you can always just drop the other
      anon_vma chain in anon_vma_merge() since you know that they are always
      identical.
      
      This also splits up the function to validate the anon_vma re-use, and
      adds a lot of commentary about the possible races.
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: Borislav Petkov <bp@alien8.de> [ "That didn't fix it" ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0e9fe17
  22. 13 3月, 2010 1 次提交
  23. 07 3月, 2010 3 次提交
    • R
      mm: remove VM_LOCK_RMAP code · fc148a5f
      Rik van Riel 提交于
      When a VMA is in an inconsistent state during setup or teardown, the worst
      that can happen is that the rmap code will not be able to find the page.
      
      The mapping is in the process of being torn down (PTEs just got
      invalidated by munmap), or set up (no PTEs have been instantiated yet).
      
      It is also impossible for the rmap code to follow a pointer to an already
      freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
      teardown code needs to take before the VMA is removed from the anon_vma
      chain.
      
      Hence, we should not need the VM_LOCK_RMAP locking at all.
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc148a5f
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel 提交于
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
    • J
      mm: use rlimit helpers · 59e99e5b
      Jiri Slaby 提交于
      Make sure compiler won't do weird things with limits.  E.g.  fetching them
      twice may return 2 different values after writable limits are implemented.
      
      I.e.  either use rlimit helpers added in
      3e10e716 ("resource: add helpers for
      fetching rlimits") or ACCESS_ONCE if not applicable.
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59e99e5b