  1. 27 Mar 2008 (1 commit)
  2. 11 Mar 2008 (1 commit)
    • hugetlb: correct page count for surplus huge pages · 2668db91
      Committed by Adam Litke
      Free pages in the hugetlb pool are free and as such have a reference count of
      zero.  Regular allocations into the pool from the buddy are "freed" into the
      pool which results in their page_count dropping to zero.  However, surplus
      pages can be directly utilized by the caller without first being freed to the
      pool.  Therefore, a call to put_page_testzero() is in order so that such a
      page will be handed to the caller with a correct count.
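
      A minimal sketch of the idea (the surrounding lines are reconstructed
      context; the put_page_testzero() call is the point of the fix):

        /* allocating a surplus huge page straight from the buddy allocator */
        page = alloc_pages(htlb_alloc_mask | __GFP_COMP, HUGETLB_PAGE_ORDER);
        if (page) {
                /*
                 * Pool pages rest at refcount 0; drop the buddy
                 * allocator's reference so a page handed directly to
                 * the caller starts from the same count.
                 */
                put_page_testzero(page);
                VM_BUG_ON(page_count(page));
                set_compound_page_dtor(page, free_huge_page);
        }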
      
      This has not affected end users because the bad page count is reset before the
      page is handed off.  However, under CONFIG_DEBUG_VM this triggers a BUG when
      the page count is validated.
      
      Thanks go to Mel for first spotting this issue and providing an initial fix.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 05 Mar 2008 (2 commits)
    • hugetlb: fix pool shrinking while in restricted cpuset · 348e1e04
      Committed by Nishanth Aravamudan
      Adam Litke noticed that currently we grow the hugepage pool independent of any
      cpuset the running process may be in, but when shrinking the pool, the cpuset
      is checked.  This leads to inconsistency when shrinking the pool in a
      restricted cpuset -- an administrator may have been able to grow the pool on a
      node restricted by a containing cpuset, but they cannot shrink it there.
      
      There are two options: either prevent growing of the pool outside of the
      cpuset or allow shrinking outside of the cpuset.  From previous discussions
      on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
      should not be restricted by cpusets.  So allow shrinking the pool by removing
      pages from nodes outside of current's cpuset.
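
      A hedged sketch of the resulting asymmetry (the per-node dequeue helper
      here is hypothetical; the point is that the sysctl shrink path walks all
      online nodes while fault-time allocation still honours the cpuset):

        /* shrinking via /proc/sys/vm/nr_hugepages, sketched */
        for_each_online_node(nid) {                 /* not cpuset-filtered */
                page = dequeue_huge_page_node(nid); /* hypothetical helper */
                if (page)
                        update_and_free_page(page);
        }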
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Irwin <wli@holomorphy.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: close a difficult to trigger reservation race · ac09b3a1
      Committed by Adam Litke
      A hugetlb reservation may be inadequately backed in the event of racing
      allocations and frees when utilizing surplus huge pages.  Consider the
      following series of events in processes A and B:
      
       A) Allocates some surplus pages to satisfy a reservation
       B) Frees some huge pages
       A) A notices the extra free pages and drops hugetlb_lock to free some of
          its surplus pages back to the buddy allocator.
       B) Allocates some huge pages
       A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()
      
      Avoid this by committing the reservation after pages have been allocated but
      before dropping the lock to free excess pages.  For parity, release the
      reservation in return_unused_surplus_pages().
      
      This patch also corrects the cpuset_mems_nr() error path in
      hugetlb_acct_memory().  If the cpuset check fails, uncommit the
      reservation, but also be sure to return any surplus huge pages that may
      have been allocated to back the failed reservation.
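
      In outline (a sketch of the ordering, with hypothetical helper names,
      not the literal diff):

        /* gather_surplus_pages(), sketched; called with hugetlb_lock held */
        allocate_surplus_shortfall();       /* hypothetical step */
        resv_huge_pages += delta;           /* commit, lock still held */
        spin_unlock(&hugetlb_lock);
        /* racing frees/allocs can no longer strand the reservation */
        return_excess_pages_to_buddy();     /* hypothetical step */
        spin_lock(&hugetlb_lock);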
      
      Thanks to Andy Whitcroft for discovering this.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 24 Feb 2008 (1 commit)
  5. 14 Feb 2008 (1 commit)
  6. 09 Feb 2008 (1 commit)
  7. 06 Feb 2008 (1 commit)
    • mm: fix PageUptodate data race · 0ed361de
      Committed by Nick Piggin
      After running SetPageUptodate, preceding stores to the page contents to
      actually bring it uptodate may not be ordered with the store to set the
      page uptodate.
      
      Therefore, another CPU which checks PageUptodate is true, then reads the
      page contents can get stale data.
      
      Fix this by having an smp_wmb before SetPageUptodate, and smp_rmb after
      PageUptodate.
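
      Sketched against the page-flag helpers (simplified; the real macros are
      generated by the page-flags machinery):

        static inline void SetPageUptodate(struct page *page)
        {
                smp_wmb();  /* contents visible before the flag is set */
                set_bit(PG_uptodate, &page->flags);
        }

        static inline int PageUptodate(struct page *page)
        {
                int ret = test_bit(PG_uptodate, &page->flags);
                smp_rmb();  /* flag read before the contents are read */
                return ret;
        }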
      
      Many places that test PageUptodate do so with the page locked, and this
      would be enough to ensure memory ordering in those places if
      SetPageUptodate were only called while the page is locked.  Unfortunately
      that is not always the case for some filesystems, but it could be an idea
      for the future.
      
      Also bring the handling of anonymous page uptodateness in line with that of
      file backed page management, by marking anon pages as uptodate when they
      _are_ uptodate, rather than when our implementation requires that they be
      marked as such.  Doing so allows us to get rid of the smp_wmb's in the page
      copying functions, which were especially added for anonymous pages for an
      analogous memory ordering problem.  Both file and anonymous pages are
      handled with the same barriers.
      
      FAQ:
      Q. Why not do this in flush_dcache_page?
      A. Firstly, flush_dcache_page handles only one side (the smp_wmb side) of the
      ordering protocol; we'd still need smp_rmb somewhere. Secondly, hiding away
      memory barriers in a completely unrelated function is nasty; at least in the
      PageUptodate macros, they are located together with (half) the operations
      involved in the ordering. Thirdly, the smp_wmb is only required when first
      bringing the page uptodate, whereas flush_dcache_page should be called each time
      it is written to through the kernel mapping. It is logically the wrong place to
      put it.
      
      Q. Why does this increase my text size / reduce my performance / etc.?
      A. Because it is adding the necessary instructions to eliminate the data-race.
      
      Q. Can it be improved?
      A. Yes, eg. if you were to create a rule that all SetPageUptodate operations
      run under the page lock, we could avoid the smp_rmb places where PageUptodate
      is queried under the page lock. Requires audit of all filesystems and at least
      some would need reworking. That's great you're interested, I'm eagerly awaiting
      your patches.
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 25 Jan 2008 (1 commit)
    • fix hugepages leak due to pagetable page sharing · c5c99429
      Committed by Larry Woodman
      The shared page table code for hugetlb memory on x86 and x86_64
      is causing a leak.  When a user of hugepages exits using this code
      the system leaks some of the hugepages.
      
      -------------------------------------------------------
      Part of /proc/meminfo just before database startup:
      HugePages_Total:  5500
      HugePages_Free:   5500
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      Just before shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4475
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      
      After shutdown:
      HugePages_Total:  5500
      HugePages_Free:   4988
      HugePages_Rsvd:      0
      Hugepagesize:     2048 kB
      ----------------------------------------------------------
      
      The problem occurs during a fork, in copy_hugetlb_page_range().  It
      locates the dst_pte using huge_pte_alloc().  Since huge_pte_alloc() calls
      huge_pmd_share() it will share the pmd page if it can, yet the main loop in
      copy_hugetlb_page_range() does a get_page() on every hugepage.  This is a
      violation of the shared hugepmd pagetable protocol and creates additional
      references to the hugepages, causing a leak when the unmap of the VMA
      occurs.  We can skip the entire replication of the ptes when the hugepage
      pagetables are shared.  The attached patch skips copying the ptes and the
      get_page() calls if the hugetlbpage pagetable is shared.
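
      The essence of the fix, sketched from the description above (loop
      context abbreviated):

        /* copy_hugetlb_page_range(), per-address loop */
        src_pte = huge_pte_offset(src, addr);
        dst_pte = huge_pte_alloc(dst, addr);
        if (!dst_pte)
                goto nomem;
        /* pagetables shared: take no copies and no page references */
        if (dst_pte == src_pte)
                continue;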
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: Larry Woodman <lwoodman@redhat.com>
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 15 Jan 2008 (1 commit)
  10. 18 Dec 2007 (2 commits)
    • Revert "hugetlb: Add hugetlb_dynamic_pool sysctl" · 368d2c63
      Committed by Nishanth Aravamudan
      This reverts commit 54f9f80d ("hugetlb:
      Add hugetlb_dynamic_pool sysctl")
      
      Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
      sysctl is not needed, as its semantics can be expressed by 0 in the
      overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
      (pool enabled).
      
      (Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: introduce nr_overcommit_hugepages sysctl · d1c3fb1f
      Committed by Nishanth Aravamudan
      
      While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
      became convinced that having a boolean sysctl was insufficient:
      
      1) To support per-node control of hugepages, I have previously submitted
      patches to add a sysfs attribute related to nr_hugepages. However, with
      a boolean global value and per-mount quota enforcement constraining the
      dynamic pool, adding corresponding control of the dynamic pool on a
      per-node basis seems inconsistent to me.
      
      2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
      mount points is, arguably, more arduous than it needs to be. Each quota
      would need to be set separately, and the sum would need to be monitored.
      
      To ease the administration, and to help make the way for per-node
      control of the static & dynamic hugepage pool, I added a separate
      sysctl, nr_overcommit_hugepages. This value serves as a high watermark
      for the overall hugepage pool, while nr_hugepages serves as a low
      watermark. The boolean sysctl can then be removed, as the condition
      
      	nr_overcommit_hugepages > 0
      
      indicates the same administrative setting as
      
      	hugetlb_dynamic_pool == 1
      
      Quotas still serve as local enforcement of the size of the pool on a
      per-mount basis.
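
      A sketch of the resulting gate on the surplus allocation path
      (locking elided):

        /* alloc_buddy_huge_page(), sketched */
        if (surplus_huge_pages >= nr_overcommit_huge_pages)
                return NULL;    /* at the high watermark: no dynamic growth */
        nr_huge_pages++;
        surplus_huge_pages++;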
      
      A few caveats:
      
      1) There is a race whereby the global surplus huge page counter is
      incremented before a hugepage has been allocated. Another process could
      then try to grow the pool, fail to convert a surplus huge page to a
      normal huge page, and instead allocate a fresh huge page. I believe this
      is benign, as no memory is leaked (the actual pages are still tracked
      correctly) and the counters won't go out of sync.
      
      2) Shrinking the static pool while a surplus is in effect will allow the
      number of surplus huge pages to exceed the overcommit value. As long as
      this condition holds, however, no more surplus huge pages will be
      allowed on the system until one of the two sysctls is increased
      sufficiently, or the surplus huge pages go out of use and are freed.
      
      Successfully tested on x86_64 with the current libhugetlbfs snapshot,
      modified to use the new sysctl.
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 11 Dec 2007 (1 commit)
  12. 15 Nov 2007 (8 commits)
    • hugetlb: fix i_blocks accounting · 45c682a6
      Committed by Ken Chen
      For administrative purposes, we want to query actual block usage for a
      hugetlbfs file via fstat.  Currently, hugetlbfs always returns 0.  Fix that
      up, since the kernel already has all the information needed to track it properly.
      Signed-off-by: Ken Chen <kenchen@google.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hugetlb.c: make a function static · 8cde045c
      Committed by Adrian Bunk
      return_unused_surplus_pages() can become static.
      Signed-off-by: Adrian Bunk <bunk@kernel.org>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: enforce quotas during reservation for shared mappings · 90d8b7e6
      Committed by Adam Litke
      When a MAP_SHARED mmap of a hugetlbfs file succeeds, huge pages are reserved
      to guarantee no problems will occur later when instantiating pages.  If quotas
      are in force, page instantiation could fail due to a race with another process
      or an oversized (but approved) shared mapping.
      
      To prevent these scenarios, debit the quota for the full reservation amount up
      front and credit the unused quota when the reservation is released.
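
      In outline (error handling simplified; 'chg' is the number of huge
      pages the reservation still needs):

        /* at mmap()/reservation time, sketched */
        if (hugetlb_get_quota(inode->i_mapping, chg))
                return -ENOSPC;         /* over quota: fail the mmap */
        ret = hugetlb_acct_memory(chg);
        if (ret < 0)
                hugetlb_put_quota(inode->i_mapping, chg); /* credit back */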
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: allow bulk updating in hugetlb_*_quota() · 9a119c05
      Committed by Adam Litke
      Add a second parameter 'delta' to hugetlb_get_quota and hugetlb_put_quota to
      allow bulk updating of the sbinfo->free_blocks counter.  This will be used by
      the next patch in the series.
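
      The new signatures, per the description (the sbinfo update is sketched):

        int hugetlb_get_quota(struct address_space *mapping, long delta);
        void hugetlb_put_quota(struct address_space *mapping, long delta);

        /* inside the hugetlbfs superblock code, sketched */
        sbinfo->free_blocks -= delta;   /* get: debit delta blocks at once */
        sbinfo->free_blocks += delta;   /* put: credit them back */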
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: debit quota in alloc_huge_page · 2fc39cec
      Committed by Adam Litke
      Now that quota is credited by free_huge_page(), calls to hugetlb_get_quota()
      seem out of place.  The alloc/free API is unbalanced because we handle the
      hugetlb_put_quota() but expect the caller to open-code hugetlb_get_quota().
      Move the get inside alloc_huge_page to clean up this disparity.
      
      This patch has been kept apart from the previous patch because of the somewhat
      dodgy ERR_PTR() use herein.  Moving the quota logic means that
      alloc_huge_page() has two failure modes.  Quota failure must result in a
      SIGBUS while a standard allocation failure is OOM.  Unfortunately, ERR_PTR()
      doesn't like the small positive errnos we have in VM_FAULT_* so they must be
      negated before they are used.
      
      Does anyone take issue with the way I am using PTR_ERR?  If so, what are
      your thoughts on how to clean this up (without needing an if/else-if/else
      block at each alloc_huge_page() callsite)?
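
      The pattern in question, sketched (failure paths simplified):

        /* alloc_huge_page(), sketched: encode the failure mode */
        if (hugetlb_get_quota(mapping, 1))
                return ERR_PTR(-VM_FAULT_SIGBUS); /* negated for ERR_PTR */
        page = dequeue_huge_page(vma, addr);
        if (!page)
                return ERR_PTR(-VM_FAULT_OOM);

        /* a caller such as hugetlb_no_page(), sketched */
        page = alloc_huge_page(vma, address);
        if (IS_ERR(page))
                return -PTR_ERR(page);  /* back to a positive VM_FAULT_* */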
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix quota management for private mappings · c79fb75e
      Committed by Adam Litke
      The hugetlbfs quota management system was never taught to handle MAP_PRIVATE
      mappings when that support was added.  Currently, quota is debited at page
      instantiation and credited at file truncation.  This approach works correctly
      for shared pages but is incomplete for private pages.  In addition to
      hugetlb_no_page(), private pages can be instantiated by hugetlb_cow(); but
      this function does not respect quotas.
      
      Private huge pages are treated very much like normal, anonymous pages.  They
      are not "backed" by the hugetlbfs file and are not stored in the mapping's
      radix tree.  This means that private pages are invisible to
      truncate_hugepages() so that function will not credit the quota.
      
      This patch (based on a prototype provided by Ken Chen) moves quota crediting
      for all pages into free_huge_page().  page->private is used to store a pointer
      to the mapping to which this page belongs.  This is used to credit quota on
      the appropriate hugetlbfs instance.
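
      The mechanism, sketched (shown with the two-argument quota call this
      same series introduces):

        /* at allocation: remember which hugetlbfs instance owns the page */
        set_page_private(page, (unsigned long)vma->vm_file->f_mapping);

        /* free_huge_page(), sketched */
        struct address_space *mapping;

        mapping = (struct address_space *)page_private(page);
        set_page_private(page, 0);
        if (mapping)
                hugetlb_put_quota(mapping, 1);  /* credit one huge page */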
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: split alloc_huge_page into private and shared components · 348ea204
      Committed by Adam Litke
      Hugetlbfs implements a quota system which can limit the amount of memory that
      can be used by the filesystem.  Before allocating a new huge page for a file,
      the quota is checked and debited.  The quota is then credited when truncating
      the file.  I found a few bugs in the code for both MAP_PRIVATE and MAP_SHARED
      mappings.  Before detailing the problems and my proposed solutions, we should
      agree on a definition of quotas that properly addresses both private and
      shared pages.  Since the purpose of quotas is to limit total memory
      consumption on a per-filesystem basis, I argue that all pages allocated by the
      fs (private and shared) should be charged against quota.
      
      Private Mappings
      ================
      
      The current code will debit quota for private pages sometimes, but will never
      credit it.  At a minimum, this causes a leak in the quota accounting which
      renders the accounting essentially useless as it is.  Shared pages have a one
      to one mapping with a hugetlbfs file and are easy to account by debiting on
      allocation and crediting on truncate.  Private pages are anonymous in nature
      and have a many to one relationship with their hugetlbfs files (due to copy on
      write).  Because private pages are not indexed by the mapping's radix tree,
      their quota cannot be credited at file truncation time.  Crediting must be
      done when the page is unmapped and freed.
      
      Shared Pages
      ============
      
      I discovered an issue concerning the interaction between the MAP_SHARED
      reservation system and quotas.  Since quota is not checked until page
      instantiation, an over-quota mmap/reservation will initially succeed.  When
      instantiating the first over-quota page, the program will receive SIGBUS.
      This is inconsistent since the reservation is supposed to be a guarantee.  The
      solution is to debit the full amount of quota at reservation time and credit
      the unused portion when the reservation is released.
      
      This patch series brings quotas back in line by making the following
      modifications:
       * Private pages
         - Debit quota in alloc_huge_page()
         - Credit quota in free_huge_page()
       * Shared pages
         - Debit quota for entire reservation at mmap time
         - Credit quota for instantiated pages in free_huge_page()
         - Credit quota for unused reservation at munmap time
      
      This patch:
      
      The shared page reservation and dynamic pool resizing features have made the
      allocation of private vs.  shared huge pages quite different.  By splitting
      out the private/shared-specific portions of the process into their own
      functions, readability is greatly improved.  alloc_huge_page now calls the
      proper helper and performs common operations.
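
      The resulting shape (helper names from this patch; bodies elided):

        static struct page *alloc_huge_page(struct vm_area_struct *vma,
                                            unsigned long addr)
        {
                if (vma->vm_flags & VM_MAYSHARE)
                        return alloc_huge_page_shared(vma, addr);
                else
                        return alloc_huge_page_private(vma, addr);
        }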
      
      [akpm@linux-foundation.org: coding-style cleanups]
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: follow_hugetlb_page() for write access · 5b23dbe8
      Committed by Adam Litke
      When calling get_user_pages(), a write flag is passed in by the caller to
      indicate if write access is required on the faulted-in pages.  Currently,
      follow_hugetlb_page() ignores this flag and always faults pages for
      read-only access.  This can cause data corruption because a device driver
      that calls get_user_pages() with write set will not expect COW faults to
      occur on the returned pages.
      
      This patch passes the write flag down to follow_hugetlb_page() and makes
      sure hugetlb_fault() is called with the right write_access parameter.
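
      Sketched:

        /* follow_hugetlb_page(), inner fault loop */
        ret = hugetlb_fault(mm, vma, vaddr, write); /* propagate the intent */
        /* previously: ret = hugetlb_fault(mm, vma, vaddr, 0); */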
      
      [ezk@cs.sunysb.edu: build fix]
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Reviewed-by: Ken Chen <kenchen@google.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 20 Oct 2007 (1 commit)
  14. 19 Oct 2007 (1 commit)
  15. 17 Oct 2007 (8 commits)
    • hugetlb: fix dynamic pool resize failure case · af767cbd
      Committed by Adam Litke
      When gather_surplus_pages() fails to allocate enough huge pages to satisfy
      the requested reservation, it frees what it did allocate back to the buddy
      allocator.  put_page() should be called instead of update_and_free_page()
      to ensure that pool counters are updated as appropriate and the page's
      refcount is decremented.
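
      The shape of the fix, sketched:

        /* gather_surplus_pages() failure path, sketched */
        list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
                list_del(&page->lru);
                /* was update_and_free_page(page), which skipped the
                 * surplus accounting; put_page() routes through
                 * free_huge_page() and keeps the counters right */
                put_page(page);
        }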
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Dave Hansen <haveblue@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix hugepage allocation with memoryless nodes · 63b4613c
      Committed by Nishanth Aravamudan
      Anton found a problem with the hugetlb pool allocation when some nodes have
      no memory (http://marc.info/?l=linux-mm&m=118133042025995&w=2).  Lee worked
      on versions that tried to fix it, but none were accepted.  Christoph has
      created a set of patches which allow for GFP_THISNODE allocations to fail
      if the node has no memory.
      
      Currently, alloc_fresh_huge_page() returns NULL when it is not able to
      allocate a huge page on the current node, as specified by its custom
      interleave variable.  The callers of this function, though, assume that a
      failure in alloc_fresh_huge_page() indicates no hugepages can be allocated
      on the system period.  This might not be the case, for instance, if we have
      an uneven NUMA system, and we happen to try to allocate a hugepage on a
      node with less memory and fail, while there is still plenty of free memory
      on the other nodes.
      
      To correct this, make alloc_fresh_huge_page() search through all online
      nodes before deciding no hugepages can be allocated.  Add a helper function
      for actually allocating the hugepage.  Use a new global nid iterator to
      control which nid to allocate on.
      
      Note: we expect particular semantics for __GFP_THISNODE, which are now
      enforced even for memoryless nodes.  That is, there should be no
      fallback to other nodes.  Therefore, we rely on the nid passed into
      alloc_pages_node() to be the nid the page comes from.  If this is
      incorrect, accounting will break.
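
      The allocation loop, sketched (the helper and the hugetlb_next_nid
      iterator are named as described above; details simplified):

        static int alloc_fresh_huge_page(void)
        {
                int start_nid = hugetlb_next_nid;

                do {
                        if (alloc_fresh_huge_page_node(hugetlb_next_nid))
                                return 1;       /* this node had memory */
                        hugetlb_next_nid = next_node(hugetlb_next_nid,
                                                     node_online_map);
                        if (hugetlb_next_nid == MAX_NUMNODES)
                                hugetlb_next_nid = first_node(node_online_map);
                } while (hugetlb_next_nid != start_nid);

                return 0;       /* no online node could satisfy us */
        }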
      
      Tested on x86 !NUMA, x86 NUMA, x86_64 NUMA and ppc64 NUMA (with 2
      memoryless nodes).
      
      Before on the ppc64 box:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     25
      Node 1 HugePages_Free:     75
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:    150
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      
      After:
      Trying to clear the hugetlb pool
      Done.       0 free
      Trying to resize the pool to 100
      Node 0 HugePages_Free:     50
      Node 1 HugePages_Free:     50
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done. Initially     100 free
      Trying to resize the pool to 200
      Node 0 HugePages_Free:    100
      Node 1 HugePages_Free:    100
      Node 2 HugePages_Free:      0
      Node 3 HugePages_Free:      0
      Done.     200 free
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: Christoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: fix pool resizing corner case · 6b0c880d
      Committed by Adam Litke
      When shrinking the size of the hugetlb pool via the nr_hugepages sysctl, we
      are careful to keep enough pages around to satisfy reservations.  But the
      calculation is flawed for the following scenario:
      
      Action                          Pool Counters (Total, Free, Resv)
      ======                          =============
      Set pool to 1 page              1 1 0
      Map 1 page MAP_PRIVATE          1 1 0
      Touch the page to fault it in   1 0 0
      Set pool to 3 pages             3 2 0
      Map 2 pages MAP_SHARED          3 2 2
      Set pool to 2 pages             2 1 2 <-- Mistake, should be 3 2 2
      Touch the 2 shared pages        2 0 1 <-- Program crashes here
      
      The last touch above will terminate the process due to lack of huge pages.
      
      This patch corrects the calculation so that it factors in pages being used
      for private mappings.  Andrew, this is a standalone fix suitable for
      mainline.  It is also now corrected in my latest dynamic pool resizing
      patchset which I will send out soon.
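
      The corrected floor, sketched:

        /* set_max_huge_pages() shrink path, sketched */
        min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
        min_count = max(count, min_count);  /* pages in use count too */
        /* ... then free pages only down to min_count ... */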
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Ken Chen <kenchen@google.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: Add hugetlb_dynamic_pool sysctl · 54f9f80d
      Committed by Adam Litke
      The maximum size of the huge page pool can be controlled using the overall
      size of the hugetlb filesystem (via its 'size' mount option).  However, in
      the common case this will not be set, as the pool is traditionally fixed in
      size at boot time.  In order to maintain the expected semantics, we need to
      prevent the pool expanding by default.
      
      This patch introduces a new sysctl controlling dynamic pool resizing.  When
      this is enabled the pool will expand beyond its base size up to the size of
      the hugetlb filesystem.  It is disabled by default.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Dave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: Try to grow hugetlb pool for MAP_SHARED mappings · e4e574b7
      Committed by Adam Litke
      Shared mappings require special handling because the huge pages needed to
      fully populate the VMA must be reserved at mmap time.  If not enough pages are
      available when making the reservation, allocate all of the shortfall at once
      from the buddy allocator and add the pages directly to the hugetlb pool.  If
      they cannot be allocated, then fail the mapping.  The page surplus is
      accounted for in the same way as for private mappings; faulted surplus pages
      will be freed at unmap time.  Reserved, surplus pages that have not been used
      must be freed separately when their reservation has been released.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Dave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: Try to grow hugetlb pool for MAP_PRIVATE mappings · 7893d1d5
      Committed by Adam Litke
      Because we overcommit hugepages for MAP_PRIVATE mappings, it is possible that
      the hugetlb pool will be exhausted or completely reserved when a hugepage is
      needed to satisfy a page fault.  Before killing the process in this situation,
      try to allocate a hugepage directly from the buddy allocator.
      
      The explicitly configured pool size becomes a low watermark.  When dynamically
      grown, the allocated huge pages are accounted as a surplus over the watermark.
       As huge pages are freed on a node, surplus pages are released to the buddy
      allocator so that the pool will shrink back to the watermark.
      
      Surplus accounting also allows for friendlier explicit pool resizing.  When
      shrinking a pool that is fully in-use, increase the surplus so pages will be
      returned to the buddy allocator as soon as they are freed.  When growing a
      pool that has a surplus, consume the surplus first and then allocate new
      pages.
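
      The free path that makes the pool self-shrinking, sketched:

        static void free_huge_page(struct page *page)
        {
                int nid = page_to_nid(page);

                spin_lock(&hugetlb_lock);
                if (surplus_huge_pages_node[nid]) {
                        /* over the watermark: back to the buddy allocator */
                        update_and_free_page(page);
                        surplus_huge_pages--;
                        surplus_huge_pages_node[nid]--;
                } else {
                        enqueue_huge_page(page); /* back into the pool */
                }
                spin_unlock(&hugetlb_lock);
        }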
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Dave McCracken <dave.mccracken@oracle.com>
      Cc: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb: Move update_and_free_page · 6af2acb6
      Committed by Adam Litke
      Dynamic huge page pool resizing.
      
      In most real-world scenarios, configuring the size of the hugetlb pool
      correctly is a difficult task.  If too few pages are allocated to the pool,
      applications using MAP_SHARED may fail to mmap() a hugepage region and
      applications using MAP_PRIVATE may receive SIGBUS.  Isolating too much memory
      in the hugetlb pool means it is not available for other uses, especially those
      programs not using huge pages.
      
      The obvious answer is to let the hugetlb pool grow and shrink in response to
      the runtime demand for huge pages.  The work Mel Gorman has been doing to
      establish a memory zone for movable memory allocations makes dynamically
      resizing the hugetlb pool reliable within the limits of that zone.  This patch
      series implements dynamic pool resizing for private and shared mappings while
      being careful to maintain existing semantics.  Please reply with your comments
      and feedback; even just to say whether it would be a useful feature to you.
      Thanks.
      
      How it works
      ============
      
      Upon depletion of the hugetlb pool, rather than reporting an error immediately,
      first try and allocate the needed huge pages directly from the buddy allocator.
      Care must be taken to avoid unbounded growth of the hugetlb pool, so the
      hugetlb filesystem quota is used to limit overall pool size.
      
      The real work begins when we decide there is a shortage of huge pages.  What
      happens next depends on whether the pages are for a private or shared mapping.
      Private mappings are straightforward.  At fault time, if alloc_huge_page()
      fails, we allocate a page from the buddy allocator and increment the source
      node's surplus_huge_pages counter.  When free_huge_page() is called for a page
      on a node with a surplus, the page is freed directly to the buddy allocator
      instead of the hugetlb pool.
      
      Because shared mappings require all of the pages to be reserved up front, some
      additional work must be done at mmap() to support them.  We determine the
      reservation shortage and allocate the required number of pages all at once.
      These pages are then added to the hugetlb pool and marked reserved.  Where that
      is not possible the mmap() will fail.  As with private mappings, the
      appropriate surplus counters are updated.  Since reserved huge pages won't
      necessarily be used by the process, we can't be sure that free_huge_page() will
      always be called to return surplus pages to the buddy allocator.  To prevent
      the huge page pool from bloating, we must free unused surplus pages when their
      reservation has ended.
      
      Controlling it
      ==============
      
      With the entire patch series applied, pool resizing is off by default so unless
      specific action is taken, the semantics are unchanged.
      
      To take advantage of the flexibility afforded by this patch series one must
      tolerate a change in semantics.  To control hugetlb pool growth, the following
      techniques can be employed:
      
       * A sysctl tunable to enable/disable the feature entirely
       * The size= mount option for hugetlbfs filesystems to limit pool size
      
      Performance
      ===========
      
      When contiguous memory is readily available, it is expected that the cost of
      dynamically resizing the pool will be small.  This series has been performance
      tested with 'stream' to measure this cost.
      
      Stream (http://www.cs.virginia.edu/stream/) was linked with libhugetlbfs to
      enable remapping of the text and data/bss segments into huge pages.
      
      Stream with small array
      -----------------------
      Baseline: 	nr_hugepages = 0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 5, Text and data/bss remapping
      Dynamic:	nr_hugepages = 0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		4695.6266	5942.8371	5982.2287
      Scale:		4451.5776	5017.1419	5658.7843
      Add:		5815.8849	7927.7827	8119.3552
      Triad:		5949.4144	8527.6492	8110.6903
      
      Stream with large array
      -----------------------
      Baseline: 	nr_hugepages =  0, No libhugetlbfs segment remapping
      Preallocated:	nr_hugepages = 67, Text and data/bss remapping
      Dynamic:	nr_hugepages =  0, Text and data/bss remapping
      
      				Rate (MB/s)
      Function	Baseline	Preallocated	Dynamic
      Copy:		2227.8281	2544.2732	2546.4947
      Scale:		2136.3208	2430.7294	2421.2074
      Add:		2773.1449	4004.0021	3999.4331
      Triad:		2748.4502	3777.0109	3773.4970
      
      * All numbers are averages taken from 10 consecutive runs with a maximum
        standard deviation of 1.3 percent noted.
      
      This patch:
      
      Simply move update_and_free_page() so that it can be reused later in this
      patch series.  The implementation is not changed.
      Signed-off-by: Adam Litke <agl@us.ibm.com>
      Acked-by: Andy Whitcroft <apw@shadowen.org>
      Acked-by: Dave McCracken <dave.mccracken@oracle.com>
      Acked-by: William Irwin <bill.irwin@oracle.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Badari Pulavarty <pbadari@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • flush icache before set_pte() on ia64: flush icache at set_pte · 954ffcb3
      Committed by KAMEZAWA Hiroyuki
      The current ia64 kernel flushes the icache via lazy_mmu_prot_update() *after*
      set_pte().  This is too late.  This patch removes lazy_mmu_prot_update and
      adds a modified set_pte() that flushes when necessary.
      
      This patch flushes the icache of a page when
      	new pte has exec bit
      	&& new pte has present bit
      	&& new pte is user's page
      	&& (old *ptep is not present
                  || new pte's pfn is not the same as old *ptep's pfn)
      	&& new pte's page has no PG_arch_1 bit.
      	   PG_arch_1 is set when a page is cache consistent.
      
      I think these condition checks are much easier to understand than considering
      "Where should sync_icache_dcache() be inserted?".
      
      pte_user() for ia64 was removed by http://lkml.org/lkml/2007/6/12/67 as a
      clean-up, so I added it again.
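
      The check, sketched as a single C condition (names approximate; the
      real test lives in the ia64 set_pte() path):

        if (pte_exec(pte) && pte_present(pte) && pte_user(pte) &&
            (!pte_present(old_pte) || pte_pfn(pte) != pte_pfn(old_pte)) &&
            !test_bit(PG_arch_1, &pte_page(pte)->flags))
                /* make the new user-executable mapping cache consistent */
                flush_icache_page(vma, pte_page(pte));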
      Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 01 Oct 2007 (1 commit)
  17. 20 Sep 2007 (1 commit)
    • Fix NUMA Memory Policy Reference Counting · 480eccf9
      Committed by Lee Schermerhorn
      This patch proposes fixes to the reference counting of memory policy in the
      page allocation paths and in show_numa_map().  Extracted from my "Memory
      Policy Cleanups and Enhancements" series as stand-alone.
      
      Shared policy lookup [shmem] has always added a reference to the policy,
      but this was never unrefed after page allocation or after formatting the
      numa map data.
      
      Default system policy should not require additional ref counting, nor
      should the current task's task policy.  However, show_numa_map() calls
      get_vma_policy() to examine what may be [likely is] another task's policy.
      The latter case needs protection against freeing of the policy.
      
      This patch adds a reference count to a mempolicy returned by
      get_vma_policy() when the policy is a vma policy or another task's
      mempolicy.  Again, shared policy is already reference counted on lookup.  A
      matching "unref" [__mpol_free()] is performed in alloc_page_vma() for
      shared and vma policies, and in show_numa_map() for shared and another
      task's mempolicy.  We can call __mpol_free() directly, saving an admittedly
      inexpensive inline NULL test, because we know we have a non-NULL policy.
      
      Handling policy ref counts for hugepages is a bit trickier.
      huge_zonelist() returns a zone list that might come from a shared or vma
      MPOL_BIND policy.  In this case, we should hold the reference until after
      the huge page allocation in dequeue_huge_page().  The patch modifies
      huge_zonelist() to return a pointer to the mempolicy if it needs to be
      unref'd after allocation.
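
      The conditional-unref pattern, sketched (zonelist_policy() is the
      mempolicy-internal helper; names approximate):

        /* alloc_page_vma(), sketched */
        struct mempolicy *pol = get_vma_policy(current, vma, addr);
        struct page *page;

        page = __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
        if (unlikely(pol != &default_policy && pol != current->mempolicy))
                __mpol_free(pol);   /* known non-NULL: skip mpol_free() */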
      
      Kernel Build [16cpu, 32GB, ia64] - average of 10 runs:
      
      		w/o patch	w/ refcount patch
      	    Avg	  Std Devn	   Avg	  Std Devn
      Real:	 100.59	    0.38	 100.63	    0.43
      User:	1209.60	    0.37	1209.91	    0.31
      System:   81.52	    0.42	  81.64	    0.34
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Acked-by: Andi Kleen <ak@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Acked-by: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 23 Aug 2007 (1 commit)
  19. 25 Jul 2007 (1 commit)
  20. 20 Jul 2007 (5 commits)
    • hugetlb: use set_compound_page_dtor · f8af0bb8
      Committed by Akinobu Mita
      Use the appropriate accessor function to set the compound page
      destructor function.
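
      The change amounts to replacing the open-coded destructor store,
      sketched:

        /* before: poke the dtor into the first tail page by hand */
        page[1].lru.next = (void *)free_huge_page;

        /* after */
        set_compound_page_dtor(page, free_huge_page);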
      
      Cc: William Irwin <wli@holomorphy.com>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Acked-by: Adam Litke <agl@us.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Remove nid_lock from alloc_fresh_huge_page · 7ed5cb2b
      Committed by Hugh Dickins
      The fix to that race in alloc_fresh_huge_page() which could give an illegal
      node ID did not need nid_lock at all: the fix was to replace static int nid
      by static int prev_nid and do the work on local int nid.  nid_lock did make
      sure that racers strictly roundrobin the nodes, but that's not something we
      need to enforce strictly.  Kill nid_lock.
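
      The resulting lock-free round-robin, sketched:

        static int prev_nid;
        int nid;

        /*
         * Work on a local nid so a racer can never see the illegal
         * MAX_NUMNODES value; racy updates of prev_nid merely make
         * the round-robin approximate, which is fine.
         */
        nid = next_node(prev_nid, node_online_map);
        if (nid == MAX_NUMNODES)
                nid = first_node(node_online_map);
        prev_nid = nid;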
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • dequeue_huge_page() warning fix · 3abf7afd
      Committed by Andrew Morton
      mm/hugetlb.c: In function `dequeue_huge_page':
      mm/hugetlb.c:72: warning: 'nid' might be used uninitialized in this function
      
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: David Gibson <hermes@gibson.dropbear.id.au>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fault feedback #2 · 83c54070
      Committed by Nick Piggin
      This patch completes Linus's wish that the fault return codes be made into
      bit flags, which I agree makes everything nicer.  This requires
      all handle_mm_fault callers to be modified (possibly the modifications
      should go further and do things like fault accounting in handle_mm_fault --
      however that would be for another patch).
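
      What an arch fault handler looks like after the conversion, sketched:

        fault = handle_mm_fault(mm, vma, address, write);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }
        if (fault & VM_FAULT_MAJOR)
                tsk->maj_flt++;
        else
                tsk->min_flt++;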
      
      [akpm@linux-foundation.org: fix alpha build]
      [akpm@linux-foundation.org: fix s390 build]
      [akpm@linux-foundation.org: fix sparc build]
      [akpm@linux-foundation.org: fix sparc64 build]
      [akpm@linux-foundation.org: fix ia64 build]
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ian Molton <spyro@f2s.com>
      Cc: Bryan Wu <bryan.wu@analog.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Roman Zippel <zippel@linux-m68k.org>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Matthew Wilcox <willy@debian.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Kazumoto Kojima <kkojima@rr.iij4u.or.jp>
      Cc: Richard Curnow <rc@rc0.org.uk>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
      Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp>
      Cc: Chris Zankel <chris@zankel.net>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Haavard Skinnemoen <hskinnemoen@atmel.com>
      Acked-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Andi Kleen <ak@muc.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [ Still apparently needs some ARM and PPC loving - Linus ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fault feedback #1 · d0217ac0
      Committed by Nick Piggin
      Change ->fault prototype.  We now return an int, which contains
      VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
       FAULT_RET_ code tells the VM whether a page was found, whether it has been
      locked, and potentially other things.  This is not quite the way Linus
      wanted it yet, but that's changed in the next patch (which requires changes to
      arch code).
      
      This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
      that a page is locked which requires filemap_nopage to go away (because we
      can no longer remain backward compatible without that flag), but we were
      going to do that anyway.
      
      struct fault_data is renamed to struct vm_fault as Linus asked. address
      is now a void __user * that we should firmly encourage drivers not to use
      without really good reason.
      
      The page is now returned via a page pointer in the vm_fault struct.
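
      The new interface, sketched from the description (field comments are
      mine):

        struct vm_fault {
                unsigned int flags;             /* FAULT_FLAG_xxx */
                pgoff_t pgoff;                  /* logical page offset */
                void __user *virtual_address;   /* faulting address */
                struct page *page;              /* set by the handler */
        };

        int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);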
      Signed-off-by: Nick Piggin <npiggin@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>