1. 07 3月, 2010 1 次提交
    • R
      mm: change anon_vma linking to fix multi-process server scalability issue · 5beb4930
      Rik van Riel 提交于
      The old anon_vma code can lead to scalability issues with heavily forking
      workloads.  Specifically, each anon_vma will be shared between the parent
      process and all its child processes.
      
      In a workload with 1000 child processes and a VMA with 1000 anonymous
      pages per process that get COWed, this leads to a system with a million
      anonymous pages in the same anon_vma, each of which is mapped in just one
      of the 1000 processes.  However, the current rmap code needs to walk them
      all, leading to O(N) scanning complexity for each page.
      
      This can result in systems where one CPU is walking the page tables of
      1000 processes in page_referenced_one, while all other CPUs are stuck on
      the anon_vma lock.  This leads to catastrophic failure for a benchmark
      like AIM7, where the total number of processes can reach in the tens of
      thousands.  Real workloads are still a factor 10 less process intensive
      than AIM7, but they are catching up.
      
      This patch changes the way anon_vmas and VMAs are linked, which allows us
      to associate multiple anon_vmas with a VMA.  At fork time, each child
      process gets its own anon_vmas, in which its COWed pages will be
      instantiated.  The parents' anon_vma is also linked to the VMA, because
      non-COWed pages could be present in any of the children.
      
      This reduces rmap scanning complexity to O(1) for the pages of the 1000
      child processes, with O(N) complexity for at most 1/N pages in the system.
       This reduces the average scanning cost in heavily forking workloads from
      O(N) to 2.
      
      The only real complexity in this patch stems from the fact that linking a
      VMA to anon_vmas now involves memory allocations.  This means vma_adjust
      can fail, if it needs to attach a VMA to anon_vma structures.  This in
      turn means error handling needs to be added to the calling functions.
      
      A second source of complexity is that, because there can be multiple
      anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
      "the" anon_vma lock.  To prevent the rmap code from walking up an
      incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
      flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
      to make sure it is impossible to compile a kernel that needs both symbolic
      values for the same bitflag.
      
      Some test results:
      
      Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
      box with 16GB RAM and not quite enough IO), the system ends up running
      >99% in system time, with every CPU on the same anon_vma lock in the
      pageout code.
      
      With these changes, AIM7 hits the cross-over point around 29.7k users.
      This happens with ~99% IO wait time, there never seems to be any spike in
      system time.  The anon_vma lock contention appears to be resolved.
      
      [akpm@linux-foundation.org: cleanups]
      Signed-off-by: NRik van Riel <riel@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5beb4930
  2. 17 1月, 2010 4 次提交
  3. 07 1月, 2010 2 次提交
    • J
      NOMMU: Use copy_*_user_page() in access_process_vm() · 7959722b
      Jie Zhang 提交于
      The MMU code uses the copy_*_user_page() variants in access_process_vm()
      rather than copy_*_user() as the former includes an icache flush.  This
      is important when doing things like setting software breakpoints with
      gdb.  So switch the NOMMU code over to do the same.
      
      This patch makes the reasonable assumption that copy_from_user_page()
      won't fail - which is probably fine, as we've checked the VMA from which
      we're copying is usable, and the copy is not allowed to cross VMAs.  The
      one case where it might go wrong is if the VMA is a device rather than
      RAM, and that device returns an error which - in which case rubbish will
      be returned rather than EIO.
      Signed-off-by: NJie Zhang <jie.zhang@analog.com>
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NDavid McCullough <david_mccullough@mcafee.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      Acked-by: NGreg Ungerer <gerg@uclinux.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7959722b
    • M
      NOMMU: Avoiding duplicate icache flushes of shared maps · cfe79c00
      Mike Frysinger 提交于
      When working with FDPIC, there are many shared mappings of read-only
      code regions between applications (the C library, applet packages like
      busybox, etc.), but the current do_mmap_pgoff() function will issue an
      icache flush whenever a VMA is added to an MM instead of only doing it
      when the map is initially created.
      
      The flush can instead be done when a region is first mmapped PROT_EXEC.
      Note that we may not rely on the first mapping of a region being
      executable - it's possible for it to be PROT_READ only, so we have to
      remember whether we've flushed the region or not, and then flush the
      entire region when a bit of it is made executable.
      
      However, this also affects the brk area.  That will no longer be
      executable.  We can mprotect() it to PROT_EXEC on MPU-mode kernels, but
      for NOMMU mode kernels, when it increases the brk allocation, making
      sys_brk() flush the extra from the icache should suffice.  The brk area
      probably isn't used by NOMMU programs since the brk area can only use up
      the leavings from the stack allocation, where the stack allocation is
      larger than requested.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cfe79c00
  4. 31 12月, 2009 1 次提交
  5. 16 12月, 2009 1 次提交
  6. 01 11月, 2009 1 次提交
  7. 28 9月, 2009 1 次提交
  8. 25 9月, 2009 2 次提交
    • D
      NOMMU: Ignore mmap() address param as it is a hint · 06aab5a3
      David Howells 提交于
      Ignore the address parameter given to NOMMU mmap() as it is a hint, rather
      than giving an error if it's non-zero.  MAP_FIXED still gets an error.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06aab5a3
    • D
      NOMMU: Fix MAP_PRIVATE mmap() of objects where the data can be mapped directly · 645d83c5
      David Howells 提交于
      Fix MAP_PRIVATE mmap() of files and devices where the data in the backing store
      might be mapped directly.  Use the BDI_CAP_MAP_DIRECT capability flag to govern
      whether or not we should be trying to map a file directly.  This can be used to
      determine whether or not a region has been filled in at the point where we call
      do_mmap_shared() or do_mmap_private().
      
      The BDI_CAP_MAP_DIRECT capability flag is cleared by validate_mmap_request() if
      there's any reason we can't use it.  It's also cleared in do_mmap_pgoff() if
      f_op->get_unmapped_area() fails.
      
      Without this fix, attempting to run a program from a RomFS image on a
      non-mappable MTD partition results in a BUG as the kernel attempts XIP, and
      this can be caught in gdb:
      
      Program received signal SIGABRT, Aborted.
      0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
      (gdb) bt
      #0  0xc005dce8 in add_nommu_region (region=<value optimized out>) at mm/nommu.c:547
      #1  0xc005f168 in do_mmap_pgoff (file=0xc31a6620, addr=<value optimized out>, len=3808, prot=3, flags=6146, pgoff=0) at mm/nommu.c:1373
      #2  0xc00a96b8 in elf_fdpic_map_file (params=0xc33fbbec, file=0xc31a6620, mm=0xc31bef60, what=0xc0213144 "executable") at mm.h:1145
      #3  0xc00aa8b4 in load_elf_fdpic_binary (bprm=0xc316cb00, regs=<value optimized out>) at fs/binfmt_elf_fdpic.c:343
      #4  0xc006b588 in search_binary_handler (bprm=0x6, regs=0xc33fbce0) at fs/exec.c:1234
      #5  0xc006c648 in do_execve (filename=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460, regs=0xc33fbce0) at fs/exec.c:1356
      #6  0xc0008cf0 in sys_execve (name=<value optimized out>, argv=0xc3ad14cc, envp=0xc3ad1460) at arch/frv/kernel/process.c:263
      #7  0xc00075dc in __syscall_call () at arch/frv/kernel/entry.S:897
      
      Note that this fix does the following commit differently:
      
      	commit a190887b
      	Author: David Howells <dhowells@redhat.com>
      	Date:   Sat Sep 5 11:17:07 2009 -0700
      	nommu: fix error handling in do_mmap_pgoff()
      Reported-by: NGraff Yang <graff.yang@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      645d83c5
  9. 24 9月, 2009 2 次提交
  10. 22 9月, 2009 4 次提交
  11. 06 9月, 2009 1 次提交
  12. 19 8月, 2009 1 次提交
  13. 17 8月, 2009 1 次提交
    • E
      Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab
      Eric Paris 提交于
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a seperate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      788084ab
  14. 06 8月, 2009 1 次提交
    • E
      Security/SELinux: seperate lsm specific mmap_min_addr · a2551df7
      Eric Paris 提交于
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a seperate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: NEric Paris <eparis@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      a2551df7
  15. 26 6月, 2009 2 次提交
  16. 10 6月, 2009 1 次提交
  17. 08 5月, 2009 1 次提交
  18. 07 5月, 2009 1 次提交
    • D
      nommu: make the initial mmap allocation excess behaviour Kconfig configurable · fc4d5c29
      David Howells 提交于
      NOMMU mmap() has an option controlled by a sysctl variable that determines
      whether the allocations made by do_mmap_private() should have the excess
      space trimmed off and returned to the allocator.  Make the initial setting
      of this variable a Kconfig configuration option.
      
      The reason there can be excess space is that the allocator only allocates
      in power-of-2 size chunks, but mmap()'s can be made in sizes that aren't a
      power of 2.
      
      There are two alternatives:
      
       (1) Keep the excess as dead space.  The dead space then remains unused for the
           lifetime of the mapping.  Mappings of shared objects such as libc, ld.so
           or busybox's text segment may retain their dead space forever.
      
       (2) Return the excess to the allocator.  This means that the dead space is
           limited to less than a page per mapping, but it means that for a transient
           process, there's more chance of fragmentation as the excess space may be
           reused fairly quickly.
      
      During the boot process, a lot of transient processes are created, and
      this can cause a lot of fragmentation as the pagecache and various slabs
      grow greatly during this time.
      
      By turning off the trimming of excess space during boot and disabling
      batching of frees, Coldfire can manage to boot.
      
      A better way of doing things might be to have /sbin/init turn this option
      off.  By that point libc, ld.so and init - which are all long-duration
      processes - have all been loaded and trimmed.
      Reported-by: NLanttor Guo <lanttor.guo@freescale.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NLanttor Guo <lanttor.guo@freescale.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc4d5c29
  19. 03 5月, 2009 1 次提交
    • K
      mm: fix Committed_AS underflow on large NR_CPUS environment · 00a62ce9
      KOSAKI Motohiro 提交于
      The Committed_AS field can underflow in certain situations:
      
      >         # while true; do cat /proc/meminfo  | grep _AS; sleep 1; done | uniq -c
      >               1 Committed_AS: 18446744073709323392 kB
      >              11 Committed_AS: 18446744073709455488 kB
      >               6 Committed_AS:    35136 kB
      >               5 Committed_AS: 18446744073709454400 kB
      >               7 Committed_AS:    35904 kB
      >               3 Committed_AS: 18446744073709453248 kB
      >               2 Committed_AS:    34752 kB
      >               9 Committed_AS: 18446744073709453248 kB
      >               8 Committed_AS:    34752 kB
      >               3 Committed_AS: 18446744073709320960 kB
      >               7 Committed_AS: 18446744073709454080 kB
      >               3 Committed_AS: 18446744073709320960 kB
      >               5 Committed_AS: 18446744073709454080 kB
      >               6 Committed_AS: 18446744073709320960 kB
      
      Because NR_CPUS can be greater than 1000 and meminfo_proc_show() does
      not check for underflow.
      
      But NR_CPUS proportional isn't good calculation.  In general,
      possibility of lock contention is proportional to the number of online
      cpus, not theorical maximum cpus (NR_CPUS).
      
      The current kernel has generic percpu-counter stuff.  using it is right
      way.  it makes code simplify and percpu_counter_read_positive() don't
      make underflow issue.
      Reported-by: NDave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Eric B Munson <ebmunson@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: <stable@kernel.org>		[All kernel versions]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00a62ce9
  20. 03 4月, 2009 1 次提交
    • D
      nommu: fix a number of issues with the per-MM VMA patch · 33e5d769
      David Howells 提交于
      Fix a number of issues with the per-MM VMA patch:
      
       (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
           a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
           system.
      
       (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
           lest it overflow.
      
       (3) Move the allocation of the vm_area_struct slab back for fork.c.
      
       (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
      
       (5) Use BUG_ON() rather than if () BUG().
      
       (6) Make the default validate_nommu_regions() a static inline rather than a
           #define.
      
       (7) Make free_page_series()'s objection to pages with a refcount != 1 more
           informative.
      
       (8) Adjust the __put_nommu_region() banner comment to indicate that the
           semaphore must be held for writing.
      
       (9) Limit the number of warnings about munmaps of non-mmapped regions.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33e5d769
  21. 27 1月, 2009 1 次提交
  22. 21 1月, 2009 1 次提交
  23. 14 1月, 2009 2 次提交
  24. 08 1月, 2009 4 次提交
    • P
      NOMMU: Teach kobjsize() about VMA regions. · ab2e83ea
      Paul Mundt 提交于
      Now that we no longer use compound pages for all large allocations,
      kobjsize() actively breaks things like binfmt_flat by always handing
      back PAGE_SIZE for mmap'ed regions. Fix this up by looking up the
      VMA region for non-compounds.
      
      Ideally binfmt_flat wants to get rid of kobjsize() completely, but
      this is an incremental step.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      ab2e83ea
    • P
      NOMMU: Make mmap allocation page trimming behaviour configurable. · dd8632a1
      Paul Mundt 提交于
      NOMMU mmap allocates a piece of memory for an mmap that's rounded up in size to
      the nearest power-of-2 number of pages.  Currently it then discards the excess
      pages back to the page allocator, making that memory available for use by other
      things.  This can, however, cause greater amount of fragmentation.
      
      To counter this, a sysctl is added in order to fine-tune the trimming
      behaviour.  The default behaviour remains to trim pages aggressively, while
      this can either be disabled completely or set to a higher page-granular
      watermark in order to have finer-grained control.
      
      vm region vm_top bits taken from an earlier patch by David Howells.
      Signed-off-by: NPaul Mundt <lethal@linux-sh.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      dd8632a1
    • D
      NOMMU: Make VMAs per MM as for MMU-mode linux · 8feae131
      David Howells 提交于
      Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:
      
       (1) In SYSV SHM where nattch for a segment does not reflect the number of
           shmat's (and forks) done.
      
       (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
           exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
           that a VMA might be shared and already have its vm_mm assigned to another
           process or a dead process.
      
      A new struct (vm_region) is introduced to track a mapped region and to remember
      the circumstances under which it may be shared and the vm_list_struct structure
      is discarded as it's no longer required.
      
      This patch makes the following additional changes:
      
       (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
           with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
           each page has a reference on it held by the region.  Anything else that is
           interested in such a page will have to get a reference on it to retain it.
           When the pages are released due to unmapping, each page is passed to
           put_page() and will be freed when the page usage count reaches zero.
      
       (2) Excess pages are trimmed after an allocation as the allocation must be
           made as a power-of-2 quantity of pages.
      
       (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
           end up with overlapping VMAs within the tree, the VMA struct address is
           appended to the sort key.
      
       (4) Non-anonymous VMAs are now added to the backing inode's prio list.
      
       (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
           the backing region.  The VMA and region structs will be split if
           necessary.
      
       (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
           segment instead of all the attachments at that addresss.  Multiple
           shmat()'s return the same address under NOMMU-mode instead of different
           virtual addresses as under MMU-mode.
      
       (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
      
       (8) /proc/maps is now the global list of mapped regions, and may list bits
           that aren't actually mapped anywhere.
      
       (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
           of RAM currently allocated by mmap to hold mappable regions that can't be
           mapped directly.  These are copies of the backing device or file if not
           anonymous.
      
      These changes make NOMMU mode more similar to MMU mode.  The downside is that
      NOMMU mode requires some extra memory to track things over NOMMU without this
      patch (VMAs are no longer shared, and there are now region structs).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      8feae131
    • D
      NOMMU: Delete askedalloc and realalloc variables · 41836382
      David Howells 提交于
      Delete the askedalloc and realalloc variables as nothing actually uses the
      value calculated.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      41836382
  25. 06 1月, 2009 1 次提交
    • A
      inode->i_op is never NULL · acfa4380
      Al Viro 提交于
      We used to have rather schizophrenic set of checks for NULL ->i_op even
      though it had been eliminated years ago.  You'd need to go out of your
      way to set it to NULL explicitly _and_ a bunch of code would die on
      such inodes anyway.  After killing two remaining places that still
      did that bogosity, all that crap can go away.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      acfa4380
  26. 31 10月, 2008 1 次提交