1. 02 10月, 2010 1 次提交
    • T
      percpu: use percpu allocator on UP too · 9b8327bb
      Tejun Heo 提交于
      On UP, percpu allocations were redirected to kmalloc.  This has the
      following problems.
      
      * For certain amount of allocations (determined by
        PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
        allocator can be used before the usual kernel memory allocator is
        brought online.  On SMP, this is used to initialize the kernel
        memory allocator.
      
      * percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
        doesn't.  For example, workqueue makes use of larger alignments for
        cpu_workqueues.
      
      Currently, users of percpu allocators need to handle UP differently,
      which is somewhat fragile and ugly.  Other than small amount of
      memory, there isn't much to lose by enabling percpu allocator on UP.
      It can simply use kernel memory based chunk allocation which was added
      for SMP archs w/o MMUs.
      
      This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
      makes UP build use percpu-km.  As percpu addresses and kernel
      addresses are always identity mapped and static percpu variables don't
      need any special treatment, nothing is arch dependent and mm/percpu.c
      implements generic setup_per_cpu_areas() for UP.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      9b8327bb
  2. 08 9月, 2010 1 次提交
    • T
      percpu: use percpu allocator on UP too · bbddff05
      Tejun Heo 提交于
      On UP, percpu allocations were redirected to kmalloc.  This has the
      following problems.
      
      * For certain amount of allocations (determined by
        PERCPU_DYNAMIC_EARLY_SLOTS and PERCPU_DYNAMIC_EARLY_SIZE), percpu
        allocator can be used before the usual kernel memory allocator is
        brought online.  On SMP, this is used to initialize the kernel
        memory allocator.
      
      * percpu allocator honors alignment upto PAGE_SIZE but kmalloc()
        doesn't.  For example, workqueue makes use of larger alignments for
        cpu_workqueues.
      
      Currently, users of percpu allocators need to handle UP differently,
      which is somewhat fragile and ugly.  Other than small amount of
      memory, there isn't much to lose by enabling percpu allocator on UP.
      It can simply use kernel memory based chunk allocation which was added
      for SMP archs w/o MMUs.
      
      This patch removes mm/percpu_up.c, builds mm/percpu.c on UP too and
      makes UP build use percpu-km.  As percpu addresses and kernel
      addresses are always identity mapped and static percpu variables don't
      need any special treatment, nothing is arch dependent and mm/percpu.c
      implements generic setup_per_cpu_areas() for UP.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      bbddff05
  3. 14 7月, 2010 1 次提交
  4. 25 5月, 2010 1 次提交
    • M
      mm: compaction: memory compaction core · 748446bb
      Mel Gorman 提交于
      This patch is the core of a mechanism which compacts memory in a zone by
      relocating movable pages towards the end of the zone.
      
      A single compaction run involves a migration scanner and a free scanner.
      Both scanners operate on pageblock-sized areas in the zone.  The migration
      scanner starts at the bottom of the zone and searches for all movable
      pages within each area, isolating them onto a private list called
      migratelist.  The free scanner starts at the top of the zone and searches
      for suitable areas and consumes the free pages within making them
      available for the migration scanner.  The pages isolated for migration are
      then migrated to the newly isolated free pages.
      
      [aarcange@redhat.com: Fix unsafe optimisation]
      [mel@csn.ul.ie: do not schedule work on other CPUs for compaction]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      748446bb
  5. 30 3月, 2010 1 次提交
    • T
      percpu: don't implicitly include slab.h from percpu.h · de380b55
      Tejun Heo 提交于
      percpu.h has always been including slab.h to get k[mz]alloc/free() for
      UP inline implementation.  percpu.h being used by very low level
      headers including module.h and sched.h, this meant that a lot files
      unintentionally got slab.h inclusion.
      
      Lee Schermerhorn was trying to make topology.h use percpu.h and got
      bitten by this implicit inclusion.  The right thing to do is break
      this ultimately unnecessary dependency.  The previous patch added
      explicit inclusion of either gfp.h or slab.h to the source files using
      them.  This patch updates percpu.h such that slab.h is no longer
      included from percpu.h.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      de380b55
  6. 17 12月, 2009 1 次提交
  7. 02 10月, 2009 1 次提交
  8. 25 9月, 2009 1 次提交
  9. 23 9月, 2009 1 次提交
    • S
      procfs: provide stack information for threads · d899bf7b
      Stefani Seibold 提交于
      A patch to give a better overview of the userland application stack usage,
      especially for embedded linux.
      
      Currently you are only able to dump the main process/thread stack usage
      which is showed in /proc/pid/status by the "VmStk" Value.  But you get no
      information about the consumed stack memory of the the threads.
      
      There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
      the vm mapping where the thread stack pointer reside with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of stack.  This is a value
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending of the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will you give the current stack usage in kb.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: NStefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
  10. 22 9月, 2009 2 次提交
    • M
      mm: move use_mm/unuse_mm from aio.c to mm/ · 3d2d827f
      Michael S. Tsirkin 提交于
      Anyone who wants to do copy to/from user from a kernel thread, needs
      use_mm (like what fs/aio has).  Move that into mm/, to make reusing and
      exporting easier down the line, and make aio use it.  Next intended user,
      besides aio, will be vhost-net.
      Acked-by: NAndrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d2d827f
    • H
      ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins 提交于
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af4da3
  11. 16 9月, 2009 2 次提交
    • A
      HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs · cae681fc
      Andi Kleen 提交于
      Useful for some testing scenarios, although specific testing is often
      done better through MADV_POISON
      
      This can be done with the x86 level MCE injector too, but this interface
      allows it to do independently from low level x86 changes.
      
      v2: Add module license (Haicheng Li)
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      cae681fc
    • A
      HWPOISON: The high level memory error handler in the VM v7 · 6a46079c
      Andi Kleen 提交于
      Add the high level memory handler that poisons pages
      that got corrupted by hardware (typically by a two bit flip in a DIMM
      or a cache) on the Linux level. The goal is to prevent everyone
      from accessing these pages in the future.
      
      This done at the VM level by marking a page hwpoisoned
      and doing the appropriate action based on the type of page
      it is.
      
      The code that does this is portable and lives in mm/memory-failure.c
      
      To quote the overview comment:
      
      High level machine check handler. Handles pages reported by the
      hardware as being corrupted usually due to a 2bit ECC memory or cache
      failure.
      
      This focuses on pages detected as corrupted in the background.
      When the current CPU tries to consume corruption the currently
      running process can just be killed directly instead. This implies
      that if the error cannot be handled for some reason it's safe to
      just ignore it because no corruption has been consumed yet. Instead
      when that happens another machine check will happen.
      
      Handles page cache pages in various states. The tricky part
      here is that we can access any page asynchronous to other VM
      users, because memory failures could happen anytime and anywhere,
      possibly violating some of their assumptions. This is why this code
      has to be extremely careful. Generally it tries to use normal locking
      rules, as in get the standard locks, even if that means the
      error handling takes potentially a long time.
      
      Some of the operations here are somewhat inefficient and have non
      linear algorithmic complexity, because the data structures have not
      been optimized for this case. This is in particular the case
      for the mapping from a vma to a process. Since this case is expected
      to be rare we hope we can get away with this.
      
      There are in principle two strategies to kill processes on poison:
      - just unmap the data and wait for an actual reference before
      killing
      - kill as soon as corruption is detected.
      Both have advantages and disadvantages and should be used
      in different situations. Right now both are implemented and can
      be switched with a new sysctl vm.memory_failure_early_kill
      The default is early kill.
      
      The patch does some rmap data structure walking on its own to collect
      processes to kill. This is unusual because normally all rmap data structure
      knowledge is in rmap.c only. I put it here for now to keep
      everything together and rmap knowledge has been seeping out anyways
      
      Includes contributions from Johannes Weiner, Chris Mason, Fengguang Wu,
      Nick Piggin (who did a lot of great work) and others.
      
      Cc: npiggin@suse.de
      Cc: riel@redhat.com
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Reviewed-by: NHidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      6a46079c
  12. 11 9月, 2009 1 次提交
  13. 24 6月, 2009 1 次提交
    • T
      percpu: use dynamic percpu allocator as the default percpu allocator · e74e3962
      Tejun Heo 提交于
      This patch makes most !CONFIG_HAVE_SETUP_PER_CPU_AREA archs use
      dynamic percpu allocator.  The first chunk is allocated using
      embedding helper and 8k is reserved for modules.  This ensures that
      the new allocator behaves almost identically to the original allocator
      as long as static percpu variables are concerned, so it shouldn't
      introduce much breakage.
      
      s390 and alpha use custom SHIFT_PERCPU_PTR() to work around addressing
      range limit the addressing model imposes.  Unfortunately, this breaks
      if the address is specified using a variable, so for now, the two
      archs aren't converted.
      
      The following architectures are affected by this change.
      
      * sh
      * arm
      * cris
      * mips
      * sparc(32)
      * blackfin
      * avr32
      * parisc (broken, under investigation)
      * m32r
      * powerpc(32)
      
      As this change makes the dynamic allocator the default one,
      CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is replaced with its invert -
      CONFIG_HAVE_LEGACY_PER_CPU_AREA, which is added to yet-to-be converted
      archs.  These archs implement their own setup_per_cpu_areas() and the
      conversion is not trivial.
      
      * powerpc(64)
      * sparc(64)
      * ia64
      * alpha
      * s390
      
      Boot and batch alloc/free tests on x86_32 with debug code (x86_32
      doesn't use default first chunk initialization).  Compile tested on
      sparc(32), powerpc(32), arm and alpha.
      
      Kyle McMartin reported that this change breaks parisc.  The problem is
      still under investigation and he is okay with pushing this patch
      forward and fixing parisc later.
      
      [ Impact: use dynamic allocator for most archs w/o custom percpu setup ]
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Reviewed-by: NChristoph Lameter <cl@linux.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Bryan Wu <cooloney@kernel.org>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Grant Grundler <grundler@parisc-linux.org>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      e74e3962
  14. 17 6月, 2009 1 次提交
  15. 15 6月, 2009 1 次提交
    • V
      kmemcheck: add mm functions · 2dff4405
      Vegard Nossum 提交于
      With kmemcheck enabled, the slab allocator needs to do this:
      
      1. Tell kmemcheck to allocate the shadow memory which stores the status of
         each byte in the allocation proper, e.g. whether it is initialized or
         uninitialized.
      2. Tell kmemcheck which parts of memory that should be marked uninitialized.
         There are actually a few more states, such as "not yet allocated" and
         "recently freed".
      
      If a slab cache is set up using the SLAB_NOTRACK flag, it will never return
      memory that can take page faults because of kmemcheck.
      
      If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still
      request memory with the __GFP_NOTRACK flag. This does not prevent the page
      faults from occuring, however, but marks the object in question as being
      initialized so that no warnings will ever be produced for this object.
      
      In addition to (and in contrast to) __GFP_NOTRACK, the
      __GFP_NOTRACK_FALSE_POSITIVE flag indicates that the allocation should
      not be tracked _because_ it would produce a false positive. Their values
      are identical, but need not be so in the future (for example, we could now
      enable/disable false positives with a config option).
      
      Parts of this patch were contributed by Pekka Enberg but merged for
      atomicity.
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      
      [rebased for mainline inclusion]
      Signed-off-by: NVegard Nossum <vegard.nossum@gmail.com>
      2dff4405
  16. 12 6月, 2009 2 次提交
  17. 01 4月, 2009 1 次提交
  18. 20 2月, 2009 1 次提交
    • T
      percpu: implement new dynamic percpu allocator · fbf59bc9
      Tejun Heo 提交于
      Impact: new scalable dynamic percpu allocator which allows dynamic
              percpu areas to be accessed the same way as static ones
      
      Implement scalable dynamic percpu allocator which can be used for both
      static and dynamic percpu areas.  This will allow static and dynamic
      areas to share faster direct access methods.  This feature is optional
      and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
      arch.  Please read comment on top of mm/percpu.c for details.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      fbf59bc9
  19. 07 1月, 2009 1 次提交
  20. 06 1月, 2009 1 次提交
  21. 29 12月, 2008 2 次提交
  22. 20 10月, 2008 1 次提交
    • K
      memcg: allocate all page_cgroup at boot · 52d4b9ac
      KAMEZAWA Hiroyuki 提交于
      Allocate all page_cgroup at boot and remove page_cgroup poitner from
      struct page.  This patch adds an interface as
      
       struct page_cgroup *lookup_page_cgroup(struct page*)
      
      All FLATMEM/DISCONTIGMEM/SPARSEMEM  and MEMORY_HOTPLUG is supported.
      
      Remove page_cgroup pointer reduces the amount of memory by
       - 4 bytes per PAGE_SIZE.
       - 8 bytes per PAGE_SIZE
      if memory controller is disabled. (even if configured.)
      
      On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
      On my x86-64 server with 48GB of memory, this saves 96MB of memory.
      I think this reduction makes sense.
      
      By pre-allocation, kmalloc/kfree in charge/uncharge are removed.
      This means
        - we're not necessary to be afraid of kmalloc faiulre.
          (this can happen because of gfp_mask type.)
        - we can avoid calling kmalloc/kfree.
        - we can avoid allocating tons of small objects which can be fragmented.
        - we can know what amount of memory will be used for this extra-lru handling.
      
      I added printk message as
      
      	"allocated %ld bytes of page_cgroup"
              "please try cgroup_disable=memory option if you don't want"
      
      maybe enough informative for users.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Reviewed-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52d4b9ac
  23. 29 7月, 2008 1 次提交
    • A
      mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli 提交于
      With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where the a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that teardown the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establishing an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows to map any page pointed by any pte (and in turn visible in the
      primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter incraese in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows to compile a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
      is used when a driver startup, a failure can be gracefully handled.  Here
      an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver startups other allocations are required anyway and
      -ENOMEM failure paths exists already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      kernel to compile after these changes on non-x86 archs (x86 didn't need
      them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c
  24. 25 7月, 2008 2 次提交
    • N
      mm: remove mm_init compilation dependency on CONFIG_DEBUG_MEMORY_INIT · 5e9426ab
      Nishanth Aravamudan 提交于
      Towards the end of putting all core mm initialization in mm_init.c, I
      plan on putting the creation of a mm kobject in a function in that file.
      However, the file is currently only compiled if CONFIG_DEBUG_MEMORY_INIT
      is set. Remove this dependency, but put the code under an #ifdef on the
      same config option. This should result in no functional changes.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e9426ab
    • M
      mm: add a basic debugging framework for memory initialisation · 6b74ab97
      Mel Gorman 提交于
      Boot initialisation is very complex, with significant numbers of
      architecture-specific routines, hooks and code ordering.  While significant
      amounts of the initialisation is architecture-independent, it trusts the data
      received from the architecture layer.  This is a mistake, and has resulted in
      a number of difficult-to-diagnose bugs.
      
      This patchset adds some validation and tracing to memory initialisation.  It
      also introduces a few basic defensive measures.  The validation code can be
      explicitly disabled for embedded systems.
      
      This patch:
      
      Add additional debugging and verification code for memory initialisation.
      
      Once enabled, the verification checks are always run and when required
      additional debugging information may be outputted via a mminit_loglevel=
      command-line parameter.
      
      The verification code is placed in a new file mm/mm_init.c.  Ideally other mm
      initialisation code will be moved here over time.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b74ab97
  25. 18 4月, 2008 1 次提交
  26. 05 3月, 2008 1 次提交
  27. 08 2月, 2008 1 次提交
  28. 06 2月, 2008 2 次提交
  29. 04 12月, 2007 1 次提交
  30. 17 10月, 2007 2 次提交
    • K
      memory unplug: page isolation · a5d76b54
      KAMEZAWA Hiroyuki 提交于
      Implement generic chunk-of-pages isolation method by using page grouping ops.
      
      This patch add MIGRATE_ISOLATE to MIGRATE_TYPES. By this
       - MIGRATE_TYPES increases.
       - bitmap for migratetype is enlarged.
      
      pages of MIGRATE_ISOLATE migratetype will not be allocated even if it is free.
      By this, you can isolated *freed* pages from users. How-to-free pages is not
      a purpose of this patch. You may use reclaim and migrate codes to free pages.
      
      If start_isolate_page_range(start,end) is called,
       - migratetype of the range turns to be MIGRATE_ISOLATE  if
         its type is MIGRATE_MOVABLE. (*) this check can be updated if other
         memory reclaiming works make progress.
       - MIGRATE_ISOLATE is not on migratetype fallback list.
       - All free pages and will-be-freed pages are isolated.
      To check all pages in the range are isolated or not,  use test_pages_isolated(),
      To cancel isolation, use undo_isolate_page_range().
      
      Changes V6 -> V7
       - removed unnecessary #ifdef
      
      There are HOLES_IN_ZONE handling codes...I'm glad if we can remove them..
      Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5d76b54
    • C
      Generic Virtual Memmap support for SPARSEMEM · 8f6aac41
      Christoph Lameter 提交于
      SPARSEMEM is a pretty nice framework that unifies quite a bit of code over all
      the arches.  It would be great if it could be the default so that we can get
      rid of various forms of DISCONTIG and other variations on memory maps.  So far
      what has hindered this are the additional lookups that SPARSEMEM introduces
      for virt_to_page and page_address.  This goes so far that the code to do this
      has to be kept in a separate function and cannot be used inline.
      
      This patch introduces a virtual memmap mode for SPARSEMEM, in which the memmap
      is mapped into a virtually contigious area, only the active sections are
      physically backed.  This allows virt_to_page page_address and cohorts become
      simple shift/add operations.  No page flag fields, no table lookups, nothing
      involving memory is required.
      
      The two key operations pfn_to_page and page_to_page become:
      
         #define __pfn_to_page(pfn)      (vmemmap + (pfn))
         #define __page_to_pfn(page)     ((page) - vmemmap)
      
      By having a virtual mapping for the memmap we allow simple access without
      wasting physical memory.  As kernel memory is typically already mapped 1:1
      this introduces no additional overhead.  The virtual mapping must be big
      enough to allow a struct page to be allocated and mapped for all valid
      physical pages.  This vill make a virtual memmap difficult to use on 32 bit
      platforms that support 36 address bits.
      
      However, if there is enough virtual space available and the arch already maps
      its 1-1 kernel space using TLBs (f.e.  true of IA64 and x86_64) then this
      technique makes SPARSEMEM lookups even more efficient than CONFIG_FLATMEM.
      FLATMEM needs to read the contents of the mem_map variable to get the start of
      the memmap and then add the offset to the required entry.  vmemmap is a
      constant to which we can simply add the offset.
      
      This patch has the potential to allow us to make SPARSMEM the default (and
      even the only) option for most systems.  It should be optimal on UP, SMP and
      NUMA on most platforms.  Then we may even be able to remove the other memory
      models: FLATMEM, DISCONTIG etc.
      
      [apw@shadowen.org: config cleanups, resplit code etc]
      [kamezawa.hiroyu@jp.fujitsu.com: Fix sparsemem_vmemmap init]
      [apw@shadowen.org: vmemmap: remove excess debugging]
      [apw@shadowen.org: simplify initialisation code and reduce duplication]
      [apw@shadowen.org: pull out the vmemmap code into its own file]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f6aac41
  31. 18 7月, 2007 1 次提交
  32. 08 5月, 2007 2 次提交
    • C
      Quicklists for page table pages · 6225e937
      Christoph Lameter 提交于
      On x86_64 this cuts allocation overhead for page table pages down to a
      fraction (kernel compile / editing load.  TSC based measurement of times spend
      in each function):
      
      no quicklist
      
      pte_alloc               1569048 4.3s(401ns/2.7us/179.7us)
      pmd_alloc                780988 2.1s(337ns/2.7us/86.1us)
      pud_alloc                780072 2.2s(424ns/2.8us/300.6us)
      pgd_alloc                260022 1s(920ns/4us/263.1us)
      
      quicklist:
      
      pte_alloc                452436 573.4ms(8ns/1.3us/121.1us)
      pmd_alloc                196204 174.5ms(7ns/889ns/46.1us)
      pud_alloc                195688 172.4ms(7ns/881ns/151.3us)
      pgd_alloc                 65228 9.8ms(8ns/150ns/6.1us)
      
      pgd allocations are the most complex and there we see the most dramatic
      improvement (may be we can cut down the amount of pgds cached somewhat?).  But
      even the pte allocations still see a doubling of performance.
      
      1. Proven code from the IA64 arch.
      
      	The method used here has been fine tuned for years and
      	is NUMA aware. It is based on the knowledge that accesses
      	to page table pages are sparse in nature. Taking a page
      	off the freelists instead of allocating a zeroed pages
      	allows a reduction of number of cachelines touched
      	in addition to getting rid of the slab overhead. So
      	performance improves. This is particularly useful if pgds
      	contain standard mappings. We can save on the teardown
      	and setup of such a page if we have some on the quicklists.
      	This includes avoiding lists operations that are otherwise
      	necessary on alloc and free to track pgds.
      
      2. Light weight alternative to use slab to manage page size pages
      
      	Slab overhead is significant and even page allocator use
      	is pretty heavy weight. The use of a per cpu quicklist
      	means that we touch only two cachelines for an allocation.
      	There is no need to access the page_struct (unless arch code
      	needs to fiddle around with it). So the fast past just
      	means bringing in one cacheline at the beginning of the
      	page. That same cacheline may then be used to store the
      	page table entry. Or a second cacheline may be used
      	if the page table entry is not in the first cacheline of
      	the page. The current code will zero the page which means
      	touching 32 cachelines (assuming 128 byte). We get down
      	from 32 to 2 cachelines in the fast path.
      
      3. x86_64 gets lightweight page table page management.
      
      	This will allow x86_64 arch code to faster repopulate pgds
      	and other page table entries. The list operations for pgds
      	are reduced in the same way as for i386 to the point where
      	a pgd is allocated from the page allocator and when it is
      	freed back to the page allocator. A pgd can pass through
      	the quicklists without having to be reinitialized.
      
      64 Consolidation of code from multiple arches
      
      	So far arches have their own implementation of quicklist
      	management. This patch moves that feature into the core allowing
      	an easier maintenance and consistent management of quicklists.
      
      Page table pages have the characteristics that they are typically zero or in a
      known state when they are freed.  This is usually the exactly same state as
      needed after allocation.  So it makes sense to build a list of freed page
      table pages and then consume the pages already in use first.  Those pages have
      already been initialized correctly (thus no need to zero them) and are likely
      already cached in such a way that the MMU can use them most effectively.  Page
      table pages are used in a sparse way so zeroing them on allocation is not too
      useful.
      
      Such an implementation already exits for ia64.  Howver, that implementation
      did not support constructors and destructors as needed by i386 / x86_64.  It
      also only supported a single quicklist.  The implementation here has
      constructor and destructor support as well as the ability for an arch to
      specify how many quicklists are needed.
      
      Quicklists are defined by an arch defining CONFIG_QUICKLIST.  If more than one
      quicklist is necessary then we can define NR_QUICK for additional lists.  F.e.
       i386 needs two and thus has
      
      config NR_QUICK
      	int
      	default 2
      
      If an arch has requested quicklist support then pages can be allocated
      from the quicklist (or from the page allocator if the quicklist is
      empty) via:
      
      quicklist_alloc(<quicklist-nr>, <gfpflags>, <constructor>)
      
      Page table pages can be freed using:
      
      quicklist_free(<quicklist-nr>, <destructor>, <page>)
      
      Pages must have a definite state after allocation and before
      they are freed. If no constructor is specified then pages
      will be zeroed on allocation and must be zeroed before they are
      freed.
      
      If a constructor is used then the constructor will establish
      a definite page state. F.e. the i386 and x86_64 pgd constructors
      establish certain mappings.
      
      Constructors and destructors can also be used to track the pages.
      i386 and x86_64 use a list of pgds in order to be able to dynamically
      update standard mappings.
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andi Kleen <ak@suse.de>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6225e937
    • C
      SLUB core · 81819f0f
      Christoph Lameter 提交于
      This is a new slab allocator which was motivated by the complexity of the
      existing code in mm/slab.c. It attempts to address a variety of concerns
      with the existing implementation.
      
      A. Management of object queues
      
         A particular concern was the complex management of the numerous object
         queues in SLAB. SLUB has no such queues. Instead we dedicate a slab for
         each allocating CPU and use objects from a slab directly instead of
         queueing them up.
      
      B. Storage overhead of object queues
      
         SLAB Object queues exist per node, per CPU. The alien cache queue even
         has a queue array that contain a queue for each processor on each
         node. For very large systems the number of queues and the number of
         objects that may be caught in those queues grows exponentially. On our
         systems with 1k nodes / processors we have several gigabytes just tied up
         for storing references to objects for those queues  This does not include
         the objects that could be on those queues. One fears that the whole
         memory of the machine could one day be consumed by those queues.
      
      C. SLAB meta data overhead
      
         SLAB has overhead at the beginning of each slab. This means that data
         cannot be naturally aligned at the beginning of a slab block. SLUB keeps
         all meta data in the corresponding page_struct. Objects can be naturally
         aligned in the slab. F.e. a 128 byte object will be aligned at 128 byte
         boundaries and can fit tightly into a 4k page with no bytes left over.
         SLAB cannot do this.
      
      D. SLAB has a complex cache reaper
      
         SLUB does not need a cache reaper for UP systems. On SMP systems
         the per CPU slab may be pushed back into partial list but that
         operation is simple and does not require an iteration over a list
         of objects. SLAB expires per CPU, shared and alien object queues
         during cache reaping which may cause strange hold offs.
      
      E. SLAB has complex NUMA policy layer support
      
         SLUB pushes NUMA policy handling into the page allocator. This means that
         allocation is coarser (SLUB does interleave on a page level) but that
         situation was also present before 2.6.13. SLABs application of
         policies to individual slab objects allocated in SLAB is
         certainly a performance concern due to the frequent references to
         memory policies which may lead a sequence of objects to come from
         one node after another. SLUB will get a slab full of objects
         from one node and then will switch to the next.
      
      F. Reduction of the size of partial slab lists
      
         SLAB has per node partial lists. This means that over time a large
         number of partial slabs may accumulate on those lists. These can
         only be reused if allocator occur on specific nodes. SLUB has a global
         pool of partial slabs and will consume slabs from that pool to
         decrease fragmentation.
      
      G. Tunables
      
         SLAB has sophisticated tuning abilities for each slab cache. One can
         manipulate the queue sizes in detail. However, filling the queues still
         requires the uses of the spin lock to check out slabs. SLUB has a global
         parameter (min_slab_order) for tuning. Increasing the minimum slab
         order can decrease the locking overhead. The bigger the slab order the
         less motions of pages between per CPU and partial lists occur and the
         better SLUB will be scaling.
      
      G. Slab merging
      
         We often have slab caches with similar parameters. SLUB detects those
         on boot up and merges them into the corresponding general caches. This
         leads to more effective memory use. About 50% of all caches can
         be eliminated through slab merging. This will also decrease
         slab fragmentation because partial allocated slabs can be filled
         up again. Slab merging can be switched off by specifying
         slub_nomerge on boot up.
      
         Note that merging can expose heretofore unknown bugs in the kernel
         because corrupted objects may now be placed differently and corrupt
         differing neighboring objects. Enable sanity checks to find those.
      
      H. Diagnostics
      
         The current slab diagnostics are difficult to use and require a
         recompilation of the kernel. SLUB contains debugging code that
         is always available (but is kept out of the hot code paths).
         SLUB diagnostics can be enabled via the "slab_debug" option.
         Parameters can be specified to select a single or a group of
         slab caches for diagnostics. This means that the system is running
         with the usual performance and it is much more likely that
         race conditions can be reproduced.
      
      I. Resiliency
      
         If basic sanity checks are on then SLUB is capable of detecting
         common error conditions and recover as best as possible to allow the
         system to continue.
      
      J. Tracing
      
         Tracing can be enabled via the slab_debug=T,<slabcache> option
         during boot. SLUB will then protocol all actions on that slabcache
         and dump the object contents on free.
      
      K. On demand DMA cache creation.
      
         Generally DMA caches are not needed. If a kmalloc is used with
         __GFP_DMA then just create this single slabcache that is needed.
         For systems that have no ZONE_DMA requirement the support is
         completely eliminated.
      
      L. Performance increase
      
         Some benchmarks have shown speed improvements on kernbench in the
         range of 5-10%. The locking overhead of slub is based on the
         underlying base allocation size. If we can reliably allocate
         larger order pages then it is possible to increase slub
         performance much further. The anti-fragmentation patches may
         enable further performance increases.
      
      Tested on:
      i386 UP + SMP, x86_64 UP + SMP + NUMA emulation, IA64 NUMA + Simulator
      
      SLUB Boot options
      
      slub_nomerge		Disable merging of slabs
      slub_min_order=x	Require a minimum order for slab caches. This
      			increases the managed chunk size and therefore
      			reduces meta data and locking overhead.
      slub_min_objects=x	Mininum objects per slab. Default is 8.
      slub_max_order=x	Avoid generating slabs larger than order specified.
      slub_debug		Enable all diagnostics for all caches
      slub_debug=<options>	Enable selective options for all caches
      slub_debug=<o>,<cache>	Enable selective options for a certain set of
      			caches
      
      Available Debug options
      F		Double Free checking, sanity and resiliency
      R		Red zoning
      P		Object / padding poisoning
      U		Track last free / alloc
      T		Trace all allocs / frees (only use for individual slabs).
      
      To use SLUB: Apply this patch and then select SLUB as the default slab
      allocator.
      
      [hugh@veritas.com: fix an oops-causing locking error]
      [akpm@linux-foundation.org: various stupid cleanups and small fixes]
      Signed-off-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81819f0f