1. 17 10月, 2007 2 次提交
    • A
      hugetlb: Move update_and_free_page · 6af2acb6
      Adam Litke 提交于
      Dynamic huge page pool resizing.
      
      In most real-world scenarios, configuring the size of the hugetlb pool
      correctly is a difficult task.  If too few pages are allocated to the pool,
      applications using MAP_SHARED may fail to mmap() a hugepage region and
      applications using MAP_PRIVATE may receive SIGBUS.  Isolating too much memory
      in the hugetlb pool means it is not available for other uses, especially those
      programs not using huge pages.
      
      The obvious answer is to let the hugetlb pool grow and shrink in response to
      the runtime demand for huge pages.  The work Mel Gorman has been doing to
      establish a memory zone for movable memory allocations makes dynamically
      resizing the hugetlb pool reliable within the limits of that zone.  This patch
      series implements dynamic pool resizing for private and shared mappings while
      being careful to maintain existing semantics.  Please reply with your comments
      and feedback; even just to say whether it would be a useful feature to you.
      Thanks.
      
      How it works
      ============
      
      Upon depletion of the hugetlb pool, rather than reporting an error immediately,
      first try and allocate the needed huge pages directly from the buddy allocator.
      Care must be taken to avoid unbounded growth of the hugetlb pool, so the
      hugetlb filesystem quota is used to limit overall pool size.
      
      The real work begins when we decide there is a shortage of huge pages.  What
      happens next depends on whether the pages are for a private or shared mapping.
      Private mappings are straightforward.  At fault time, if alloc_huge_page()
      fails, we allocate a page from the buddy allocator and increment the source
      node's surplus_huge_pages counter.  When free_huge_page() is called for a page
      on a node with a surplus, the page is freed directly to the buddy allocator
      instead of the hugetlb pool.
      
      Because shared mappings require all of the pages to be reserved up front, some
      additional work must be done at mmap() to support them.  We determine the
      reservation shortage and allocate the required number of pages all at once.
      These pages are then added to the hugetlb pool and marked reserved.  Where that
      is not possible the mmap() will fail.  As with private mappings, the
      appropriate surplus counters are updated.  Since reserved huge pages won't
      necessarily be used by the process, we can't be sure that free_huge_page() will
      always be called to return surplus pages to the buddy allocator.  To prevent
      the huge page pool from bloating, we must free unused surplus pages when their
      reservation has ended.
      
      Controlling it
      ==============
      
      With the entire patch series applied, pool resizing is off by default so unless
      specific action is taken, the semantics are unchanged.
      
      To take advantage of the flexibility afforded by this patch series one must
      tolerate a change in semantics.  To control hugetlb pool growth, the following
      techniques can be employed:
      
       * A sysctl tunable to enable/disable the feature entirely
       * The size= mount option for hugetlbfs filesystems to limit pool size
      
      Performance
      ===========
      
      When contiguous memory is readily available, it is expected that the cost of
      dynamicly resizing the pool will be small.  This series has been performance
      tested with 'stream' to measure this cost.
      
      Stream (http://www.cs.virginia.edu/stream/) was linked with libhugetlbfs to
      enable remapping of the text and data/bss segments into huge pages.
      
      Stream with small array
      -----------------------
      Baseline: 	nr_hugepages = 0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 5, Text and data/bss remapping
      Dynamic:	nr_hugepages = 0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		4695.6266	5942.8371	5982.2287
      Scale:		4451.5776	5017.1419	5658.7843
      Add:		5815.8849	7927.7827	8119.3552
      Triad:		5949.4144	8527.6492	8110.6903
      
      Stream with large array
      -----------------------
      Baseline: 	nr_hugepages =  0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 67, Text and data/bss remapping
      Dynamic:	nr_hugepages =  0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		2227.8281	2544.2732	2546.4947
      Scale:		2136.3208	2430.7294	2421.2074
      Add:		2773.1449	4004.0021	3999.4331
      Triad:		2748.4502	3777.0109	3773.4970
      
      * All numbers are averages taken from 10 consecutive runs with a maximum
        standard deviation of 1.3 percent noted.
      
      This patch:
      
      Simply move update_and_free_page() so that it can be reused later in this
      patch series.  The implementation is not changed.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NDave McCracken <dave.mccracken@oracle.com>
      Acked-by: NWilliam Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6af2acb6
    • K
      flush icache before set_pte() on ia64: flush icache at set_pte · 954ffcb3
      KAMEZAWA Hiroyuki 提交于
      Current ia64 kernel flushes icache by lazy_mmu_prot_update() *after*
      set_pte().  This is too late.  This patch removes lazy_mmu_prot_update and
      add modfied set_pte() for flushing if necessary.
      
      This patch flush icache of a page when
      	new pte has exec bit.
      	&& new pte has present bit
      	&& new pte is user's page.
      	&& (old *ptep is not present
                  || new pte's pfn is not same to old *ptep's ptn)
      	&& new pte's page has no Pg_arch_1 bit.
      	   Pg_arch_1 is set when a page is cache consistent.
      
      I think this condition checks are much easier to understand than considering
      "Where sync_icache_dcache() should be inserted ?".
      
      pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as
      clean-up. So, I added it again.
      Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      954ffcb3
  2. 01 10月, 2007 1 次提交
  3. 20 9月, 2007 1 次提交
    • L
      Fix NUMA Memory Policy Reference Counting · 480eccf9
      Lee Schermerhorn 提交于
      This patch proposes fixes to the reference counting of memory policy in the
      page allocation paths and in show_numa_map().  Extracted from my "Memory
      Policy Cleanups and Enhancements" series as stand-alone.
      
      Shared policy lookup [shmem] has always added a reference to the policy,
      but this was never unrefed after page allocation or after formatting the
      numa map data.
      
      Default system policy should not require additional ref counting, nor
      should the current task's task policy.  However, show_numa_map() calls
      get_vma_policy() to examine what may be [likely is] another task's policy.
      The latter case needs protection against freeing of the policy.
      
      This patch adds a reference count to a mempolicy returned by
      get_vma_policy() when the policy is a vma policy or another task's
      mempolicy.  Again, shared policy is already reference counted on lookup.  A
      matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
      shared and vma policies, and in show_numa_map() for shared and another
      task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
      inexpensive inline NULL test, because we know we have a non-NULL policy.
      
      Handling policy ref counts for hugepages is a bit trickier.
      huge_zonelist() returns a zone list that might come from a shared or vma
      'BIND policy.  In this case, we should hold the reference until after the
      huge page allocation in dequeue_hugepage().  The patch modifies
      huge_zonelist() to return a pointer to the mempolicy if it needs to be
      unref'd after allocation.
      
      Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
      
      		w/o patch	w/ refcount patch
      	    Avg	  Std Devn	   Avg	  Std Devn
      Real:	 100.59	    0.38	 100.63	    0.43
      User:	1209.60	    0.37	1209.91	    0.31
      System:   81.52	    0.42	  81.64	    0.34
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: NAndi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      480eccf9
  4. 23 8月, 2007 1 次提交
  5. 25 7月, 2007 1 次提交
  6. 20 7月, 2007 5 次提交
    • A
      hugetlb: use set_compound_page_dtor · f8af0bb8
      Akinobu Mita 提交于
      Use appropriate accessor function to set compound page destructor
      function.
      
      Cc:  William Irwin <wli@holomorphy.com>
      Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af0bb8
    • H
      Remove nid_lock from alloc_fresh_huge_page · 7ed5cb2b
      Hugh Dickins 提交于
      The fix to that race in alloc_fresh_huge_page() which could give an illegal
      node ID did not need nid_lock at all: the fix was to replace static int nid
      by static int prev_nid and do the work on local int nid.  nid_lock did make
      sure that racers strictly roundrobin the nodes, but that's not something we
      need to enforce strictly.  Kill nid_lock.
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ed5cb2b
    • A
      dequeue_huge_page() warning fix · 3abf7afd
      Andrew Morton 提交于
      mm/hugetlb.c: In function `dequeue_huge_page':
      mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3abf7afd
    • N
      mm: fault feedback #2 · 83c54070
      Nick Piggin 提交于
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Acked-by: NHaavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: NRalf Baechle <ralf@linux-mips.org>
      Acked-by: NAndi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83c54070
    • N
      mm: fault feedback #1 · d0217ac0
      Nick Piggin 提交于
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way he wanted
      it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0217ac0
  7. 18 7月, 2007 2 次提交
  8. 17 7月, 2007 2 次提交
  9. 17 6月, 2007 1 次提交
  10. 10 5月, 2007 2 次提交
    • K
      pretend cpuset has some form of hugetlb page reservation · 8a630112
      Ken Chen 提交于
      When cpuset is configured, it breaks the strict hugetlb page reservation as
      the accounting is done on a global variable.  Such reservation is
      completely rubbish in the presence of cpuset because the reservation is not
      checked against page availability for the current cpuset.  Application can
      still potentially OOM'ed by kernel with lack of free htlb page in cpuset
      that the task is in.  Attempt to enforce strict accounting with cpuset is
      almost impossible (or too ugly) because cpuset is too fluid that task or
      memory node can be dynamically moved between cpusets.
      
      The change of semantics for shared hugetlb mapping with cpuset is
      undesirable.  However, in order to preserve some of the semantics, we fall
      back to check against current free page availability as a best attempt and
      hopefully to minimize the impact of changing semantics that cpuset has on
      hugetlb.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Christoph Lameter <clameter@engr.sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a630112
    • K
      fix leaky resv_huge_pages when cpuset is in use · ace4bd29
      Ken Chen 提交于
      The internal hugetlb resv_huge_pages variable can permanently leak nonzero
      value in the error path of hugetlb page fault handler when hugetlb page is
      used in combination of cpuset.  The leaked count can permanently trap N
      number of hugetlb pages in unusable "reserved" state.
      
      Steps to reproduce the bug:
      
        (1) create two cpuset, user1 and user2
        (2) reserve 50 htlb pages in cpuset user1
        (3) attempt to shmget/shmat 50 htlb page inside cpuset user2
        (4) kernel oom the user process in step 3
        (5) ipcrm the shm segment
      
      At this point resv_huge_pages will have a count of 49, even though
      there are no active hugetlbfs file nor hugetlb shared memory segment
      in the system.  The leak is permanent and there is no recovery method
      other than system reboot. The leaked count will hold up all future use
      of that many htlb pages in all cpusets.
      
      The culprit is that the error path of alloc_huge_page() did not
      properly undo the change it made to resv_huge_page, causing
      inconsistent state.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Martin Bligh <mbligh@google.com>
      Acked-by: NDavid Gibson <dwg@au1.ibm.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ace4bd29
  11. 10 2月, 2007 1 次提交
    • K
      [PATCH] hugetlb: preserve hugetlb pte dirty state · 6649a386
      Ken Chen 提交于
      __unmap_hugepage_range() is buggy that it does not preserve dirty state of
      huge_pte when unmapping hugepage range.  It causes data corruption in the
      event of dop_caches being used by sys admin.  For example, an application
      creates a hugetlb file, modify pages, then unmap it.  While leaving the
      hugetlb file alive, comes along sys admin doing a "echo 3 >
      /proc/sys/vm/drop_caches".
      
      drop_pagecache_sb() will happily free all pages that aren't marked dirty if
      there are no active mapping.  Later when application remaps the hugetlb
      file back and all data are gone, triggering catastrophic flip over on
      application.
      
      Not only that, the internal resv_huge_pages count will also get all messed
      up.  Fix it up by marking page dirty appropriately.
      Signed-off-by: NKen Chen <kenchen@google.com>
      Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: <stable@kernel.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6649a386
  12. 14 12月, 2006 2 次提交
    • A
      [PATCH] Pass vma argument to copy_user_highpage(). · 9de455b2
      Atsushi Nemoto 提交于
      To allow a more effective copy_user_highpage() on certain architectures,
      a vma argument is added to the function and cow_user_page() allowing
      the implementation of these functions to check for the VM_EXEC bit.
      
      The main part of this patch was originally written by Ralf Baechle;
      Atushi Nemoto did the the debugging.
      Signed-off-by: NAtsushi Nemoto <anemo@mba.ocn.ne.jp>
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      9de455b2
    • P
      [PATCH] cpuset: rework cpuset_zone_allowed api · 02a0e53d
      Paul Jackson 提交于
      Elaborate the API for calling cpuset_zone_allowed(), so that users have to
      explicitly choose between the two variants:
      
        cpuset_zone_allowed_hardwall()
        cpuset_zone_allowed_softwall()
      
      Until now, whether or not you got the hardwall flavor depended solely on
      whether or not you or'd in the __GFP_HARDWALL gfp flag to the gfp_mask
      argument.
      
      If you didn't specify __GFP_HARDWALL, you implicitly got the softwall
      version.
      
      Unfortunately, this meant that users would end up with the softwall version
      without thinking about it.  Since only the softwall version might sleep,
      this led to bugs with possible sleeping in interrupt context on more than
      one occassion.
      
      The hardwall version requires that the current tasks mems_allowed allows
      the node of the specified zone (or that you're in interrupt or that
      __GFP_THISNODE is set or that you're on a one cpuset system.)
      
      The softwall version, depending on the gfp_mask, might allow a node if it
      was allowed in the nearest enclusing cpuset marked mem_exclusive (which
      requires taking the cpuset lock 'callback_mutex' to evaluate.)
      
      This patch removes the cpuset_zone_allowed() call, and forces the caller to
      explicitly choose between the hardwall and the softwall case.
      
      If the caller wants the gfp_mask to determine this choice, they should (1)
      be sure they can sleep or that __GFP_HARDWALL is set, and (2) invoke the
      cpuset_zone_allowed_softwall() routine.
      
      This adds another 100 or 200 bytes to the kernel text space, due to the few
      lines of nearly duplicate code at the top of both cpuset_zone_allowed_*
      routines.  It should save a few instructions executed for the calls that
      turned into calls of cpuset_zone_allowed_hardwall, thanks to not having to
      set (before the call) then check (within the call) the __GFP_HARDWALL flag.
      
      For the most critical call, from get_page_from_freelist(), the same
      instructions are executed as before -- the old cpuset_zone_allowed()
      routine it used to call is the same code as the
      cpuset_zone_allowed_softwall() routine that it calls now.
      
      Not a perfect win, but seems worth it, to reduce this chance of hitting a
      sleeping with irq off complaint again.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      02a0e53d
  13. 08 12月, 2006 4 次提交
    • A
      [PATCH] mm: make compound page destructor handling explicit · 33f2ef89
      Andy Whitcroft 提交于
      Currently we we use the lru head link of the second page of a compound page
      to hold its destructor.  This was ok when it was purely an internal
      implmentation detail.  However, hugetlbfs overrides this destructor
      violating the layering.  Abstract this out as explicit calls, also
      introduce a type for the callback function allowing them to be type
      checked.  For each callback we pre-declare the function, causing a type
      error on definition rather than on use elsewhere.
      
      [akpm@osdl.org: cleanups]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      33f2ef89
    • C
      [PATCH] htlb forget rss with pt sharing · cace673d
      Chen, Kenneth W 提交于
      Imprecise RSS accounting is an irritating ill effect with pt sharing.  After
      consulted with several VM experts, I have tried various methods to solve that
      problem: (1) iterate through all mm_structs that share the PT and increment
      count; (2) keep RSS count in page table structure and then sum them up at
      reporting time.  None of the above methods yield any satisfactory
      implementation.
      
      Since process RSS accounting is pure information only, I propose we don't
      count them at all for hugetlb page.  rlimit has such field, though there is
      absolutely no enforcement on limiting that resource.  One other method is to
      account all RSS at hugetlb mmap time regardless they are faulted or not.  I
      opt for the simplicity of no accounting at all.
      
      Hugetlb page are special, they are reserved up front in global reservation
      pool and is not reclaimable.  From physical memory resource point of view, it
      is already consumed regardless whether there are users using them.
      
      If the concern is that RSS can be used to control resource allocation, we
      already can specify hugetlb fs size limit and sysadmin can enforce that at
      mount time.  Combined with the two points mentioned above, I fail to see if
      there is anything got affected because of this patch.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      cace673d
    • C
      [PATCH] shared page table for hugetlb page · 39dde65c
      Chen, Kenneth W 提交于
      Following up with the work on shared page table done by Dave McCracken.  This
      set of patch target shared page table for hugetlb memory only.
      
      The shared page table is particular useful in the situation of large number of
      independent processes sharing large shared memory segments.  In the normal
      page case, the amount of memory saved from process' page table is quite
      significant.  For hugetlb, the saving on page table memory is not the primary
      objective (as hugetlb itself already cuts down page table overhead
      significantly), instead, the purpose of using shared page table on hugetlb is
      to allow faster TLB refill and smaller cache pollution upon TLB miss.
      
      With PT sharing, pte entries are shared among hundreds of processes, the cache
      consumption used by all the page table is smaller and in return, application
      gets much higher cache hit ratio.  One other effect is that cache hit ratio
      with hardware page walker hitting on pte in cache will be higher and this
      helps to reduce tlb miss latency.  These two effects contribute to higher
      application performance.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NHugh Dickins <hugh@veritas.com>
      Cc: Dave McCracken <dmccr@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      39dde65c
    • C
      [PATCH] __unmap_hugepage_range(): add comment · c0a499c2
      Chen, Kenneth W 提交于
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      c0a499c2
  14. 29 10月, 2006 1 次提交
  15. 12 10月, 2006 1 次提交
  16. 04 10月, 2006 1 次提交
  17. 26 9月, 2006 2 次提交
  18. 23 6月, 2006 1 次提交
    • C
      [PATCH] tightening hugetlb strict accounting · a43a8c39
      Chen, Kenneth W 提交于
      Current hugetlb strict accounting for shared mapping always assume mapping
      starts at zero file offset and reserves pages between zero and size of the
      file.  This assumption often reserves (or lock down) a lot more pages then
      necessary if application maps at none zero file offset.  libhugetlbfs is
      one example that requires proper reservation on shared mapping starts at
      none zero offset.
      
      This patch extends the reservation and hugetlb strict accounting to support
      any arbitrary pair of (offset, len), resulting a much more robust and
      accurate scheme.  More importantly, it won't lock down any hugetlb pages
      outside file mapping.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      a43a8c39
  19. 01 4月, 2006 2 次提交
  20. 22 3月, 2006 7 次提交
    • P
      [PATCH] mm: hugetlb alloc_fresh_huge_page bogus node loop fix · fdb7cc59
      Paul Jackson 提交于
      Fix bogus node loop in hugetlb.c alloc_fresh_huge_page(), which was
      assuming that nodes are numbered contiguously from 0 to num_online_nodes().
      Once the hotplug folks get this far, that will be false.
      Signed-off-by: NPaul Jackson <pj@sgi.com>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fdb7cc59
    • C
      [PATCH] optimize follow_hugetlb_page · d5d4b0aa
      Chen, Kenneth W 提交于
      follow_hugetlb_page() walks a range of user virtual address and then fills
      in list of struct page * into an array that is passed from the argument
      list.  It also gets a reference count via get_page().  For compound page,
      get_page() actually traverse back to head page via page_private() macro and
      then adds a reference count to the head page.  Since we are doing a virt to
      pte look up, kernel already has a struct page pointer into the head page.
      So instead of traverse into the small unit page struct and then follow a
      link back to the head page, optimize that with incrementing the reference
      count directly on the head page.
      
      The benefit is that we don't take a cache miss on accessing page struct for
      the corresponding user address and more importantly, not to pollute the
      cache with a "not very useful" round trip of pointer chasing.  This adds a
      moderate performance gain on an I/O intensive database transaction
      workload.
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      d5d4b0aa
    • D
      [PATCH] hugepage: Make {alloc,free}_huge_page() local · 27a85ef1
      David Gibson 提交于
      Originally, mm/hugetlb.c just handled the hugepage physical allocation path
      and its {alloc,free}_huge_page() functions were used from the arch specific
      hugepage code.  These days those functions are only used with mm/hugetlb.c
      itself.  Therefore, this patch makes them static and removes their
      prototypes from hugetlb.h.  This requires a small rearrangement of code in
      mm/hugetlb.c to avoid a forward declaration.
      
      This patch causes no regressions on the libhugetlbfs testsuite (ppc64,
      POWER5).
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      27a85ef1
    • D
      [PATCH] hugepage: Strict page reservation for hugepage inodes · b45b5bd6
      David Gibson 提交于
      These days, hugepages are demand-allocated at first fault time.  There's a
      somewhat dubious (and racy) heuristic when making a new mmap() to check if
      there are enough available hugepages to fully satisfy that mapping.
      
      A particularly obvious case where the heuristic breaks down is where a
      process maps its hugepages not as a single chunk, but as a bunch of
      individually mmap()ed (or shmat()ed) blocks without touching and
      instantiating the pages in between allocations.  In this case the size of
      each block is compared against the total number of available hugepages.
      It's thus easy for the process to become overcommitted, because each block
      mapping will succeed, although the total number of hugepages required by
      all blocks exceeds the number available.  In particular, this defeats such
      a program which will detect a mapping failure and adjust its hugepage usage
      downward accordingly.
      
      The patch below addresses this problem, by strictly reserving a number of
      physical hugepages for hugepage inodes which have been mapped, but not
      instatiated.  MAP_SHARED mappings are thus "safe" - they will fail on
      mmap(), not later with an OOM SIGKILL.  MAP_PRIVATE mappings can still
      trigger an OOM.  (Actually SHARED mappings can technically still OOM, but
      only if the sysadmin explicitly reduces the hugepage pool between mapping
      and instantiation)
      
      This patch appears to address the problem at hand - it allows DB2 to start
      correctly, for instance, which previously suffered the failure described
      above.
      
      This patch causes no regressions on the libhugetblfs testsuite, and makes a
      test (designed to catch this problem) pass which previously failed (ppc64,
      POWER5).
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      b45b5bd6
    • D
      [PATCH] hugepage: serialize hugepage allocation and instantiation · 3935baa9
      David Gibson 提交于
      Currently, no lock or mutex is held between allocating a hugepage and
      inserting it into the pagetables / page cache.  When we do go to insert the
      page into pagetables or page cache, we recheck and may free the newly
      allocated hugepage.  However, since the number of hugepages in the system
      is strictly limited, and it's usualy to want to use all of them, this can
      still lead to spurious allocation failures.
      
      For example, suppose two processes are both mapping (MAP_SHARED) the same
      hugepage file, large enough to consume the entire available hugepage pool.
      If they race instantiating the last page in the mapping, they will both
      attempt to allocate the last available hugepage.  One will fail, of course,
      returning OOM from the fault and thus causing the process to be killed,
      despite the fact that the entire mapping can, in fact, be instantiated.
      
      The patch fixes this race by the simple method of adding a (sleeping) mutex
      to serialize the hugepage fault path between allocation and insertion into
      pagetables and/or page cache.  It would be possible to avoid the
      serialization by catching the allocation failures, waiting on some
      condition, then rechecking to see if someone else has instantiated the page
      for us.  Given the likely frequency of hugepage instantiations, it seems
      very doubtful it's worth the extra complexity.
      
      This patch causes no regression on the libhugetlbfs testsuite, and one
      test, which can trigger this race now passes where it previously failed.
      
      Actually, the test still sometimes fails, though less often and only as a
      shmat() failure, rather processes getting OOM killed by the VM.  The dodgy
      heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage
      space aren't protected by the new mutex, and would be ugly to do so, so
      there's still a race there.  Another patch to replace those tests with
      something saner for this reason as well as others coming...
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      3935baa9
    • D
      [PATCH] hugepage: Small fixes to hugepage clear/copy path · 79ac6ba4
      David Gibson 提交于
      Move the loops used in mm/hugetlb.c to clear and copy hugepages to their
      own functions for clarity.  As we do so, we add some checks of need_resched
      - we are, after all copying megabytes of memory here.  We also add
      might_sleep() accordingly.  We generally dropped locks around the clear and
      copy, already but not everyone has PREEMPT enabled, so we should still be
      checking explicitly.
      
      For this to work, we need to remove the clear_huge_page() from
      alloc_huge_page(), which is called with the page_table_lock held in the COW
      path.  We move the clear_huge_page() to just after the alloc_huge_page() in
      the hugepage no-page path.  In the COW path, the new page is about to be
      copied over, so clearing it was just a waste of time anyway.  So as a side
      effect we also fix the fact that we held the page_table_lock for far too
      long in this path by calling alloc_huge_page() under it.
      
      It causes no regressions on the libhugetlbfs testsuite (ppc64, POWER5).
      Signed-off-by: NDavid Gibson <dwg@au1.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      79ac6ba4
    • Z
      [PATCH] Enable mprotect on huge pages · 8f860591
      Zhang, Yanmin 提交于
      2.6.16-rc3 uses hugetlb on-demand paging, but it doesn_t support hugetlb
      mprotect.
      
      From: David Gibson <david@gibson.dropbear.id.au>
      
        Remove a test from the mprotect() path which checks that the mprotect()ed
        range on a hugepage VMA is hugepage aligned (yes, really, the sense of
        is_aligned_hugepage_range() is the opposite of what you'd guess :-/).
      
        In fact, we don't need this test.  If the given addresses match the
        beginning/end of a hugepage VMA they must already be suitably aligned.  If
        they don't, then mprotect_fixup() will attempt to split the VMA.  The very
        first test in split_vma() will check for a badly aligned address on a
        hugepage VMA and return -EINVAL if necessary.
      
      From: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
      
        On i386 and x86-64, pte flag _PAGE_PSE collides with _PAGE_PROTNONE.  The
        identify of hugetlb pte is lost when changing page protection via mprotect.
        A page fault occurs later will trigger a bug check in huge_pte_alloc().
      
        The fix is to always make new pte a hugetlb pte and also to clean up
        legacy code where _PAGE_PRESENT is forced on in the pre-faulting day.
      Signed-off-by: NZhang Yanmin <yanmin.zhang@intel.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: NKen Chen <kenneth.w.chen@intel.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@muc.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      8f860591