1. 07 8月, 2008 1 次提交
  2. 02 8月, 2008 1 次提交
  3. 30 7月, 2008 1 次提交
  4. 29 7月, 2008 2 次提交
    • A
      mm/hugetlb.c must #include <asm/io.h> · 78a34ae2
      Adrian Bunk 提交于
      This patch fixes the following build error on sh caused by commit
      aa888a74 ("hugetlb: support larger than
      MAX_ORDER"):
      
        mm/hugetlb.c: In function 'alloc_bootmem_huge_page':
        mm/hugetlb.c:958: error: implicit declaration of function 'virt_to_phys'
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78a34ae2
    • A
      mmu-notifiers: core · cddb8a5c
      Andrea Arcangeli 提交于
      With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
       There are secondary MMUs (with secondary sptes and secondary tlbs) too.
      sptes in the kvm case are shadow pagetables, but when I say spte in
      mmu-notifier context, I mean "secondary pte".  In GRU case there's no
      actual secondary pte and there's only a secondary tlb because the GRU
      secondary MMU has no knowledge about sptes and every secondary tlb miss
      event in the MMU always generates a page fault that has to be resolved by
      the CPU (this is not the case of KVM where the a secondary tlb miss will
      walk sptes in hardware and it will refill the secondary tlb transparently
      to software if the corresponding spte is present).  The same way
      zap_page_range has to invalidate the pte before freeing the page, the spte
      (and secondary tlb) must also be invalidated before any page is freed and
      reused.
      
      Currently we take a page_count pin on every page mapped by sptes, but that
      means the pages can't be swapped whenever they're mapped by any spte
      because they're part of the guest working set.  Furthermore a spte unmap
      event can immediately lead to a page to be freed when the pin is released
      (so requiring the same complex and relatively slow tlb_gather smp safe
      logic we have in zap_page_range and that can be avoided completely if the
      spte unmap event doesn't require an unpin of the page previously mapped in
      the secondary MMU).
      
      The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
      when the VM is swapping or freeing or doing anything on the primary MMU so
      that the secondary MMU code can drop sptes before the pages are freed,
      avoiding all page pinning and allowing 100% reliable swapping of guest
      physical address space.  Furthermore it avoids the code that teardown the
      mappings of the secondary MMU, to implement a logic like tlb_gather in
      zap_page_range that would require many IPI to flush other cpu tlbs, for
      each fixed number of spte unmapped.
      
      To make an example: if what happens on the primary MMU is a protection
      downgrade (from writeable to wrprotect) the secondary MMU mappings will be
      invalidated, and the next secondary-mmu-page-fault will call
      get_user_pages and trigger a do_wp_page through get_user_pages if it
      called get_user_pages with write=1, and it'll re-establishing an updated
      spte or secondary-tlb-mapping on the copied page.  Or it will setup a
      readonly spte or readonly tlb mapping if it's a guest-read, if it calls
      get_user_pages with write=0.  This is just an example.
      
      This allows to map any page pointed by any pte (and in turn visible in the
      primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
      full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
      with kvm), or a remote DMA in software like XPMEM (hence needing of
      schedule in XPMEM code to send the invalidate to the remote node, while no
      need to schedule in kvm/gru as it's an immediate event like invalidating
      primary-mmu pte).
      
      At least for KVM without this patch it's impossible to swap guests
      reliably.  And having this feature and removing the page pin allows
      several other optimizations that simplify life considerably.
      
      Dependencies:
      
      1) mm_take_all_locks() to register the mmu notifier when the whole VM
         isn't doing anything with "mm".  This allows mmu notifier users to keep
         track if the VM is in the middle of the invalidate_range_begin/end
         critical section with an atomic counter incraese in range_begin and
         decreased in range_end.  No secondary MMU page fault is allowed to map
         any spte or secondary tlb reference, while the VM is in the middle of
         range_begin/end as any page returned by get_user_pages in that critical
         section could later immediately be freed without any further
         ->invalidate_page notification (invalidate_range_begin/end works on
         ranges and ->invalidate_page isn't called immediately before freeing
         the page).  To stop all page freeing and pagetable overwrites the
         mmap_sem must be taken in write mode and all other anon_vma/i_mmap
         locks must be taken too.
      
      2) It'd be a waste to add branches in the VM if nobody could possibly
         run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
         CONFIG_KVM=m/y.  In the current kernel kvm won't yet take advantage of
         mmu notifiers, but this already allows to compile a KVM external module
         against a kernel with mmu notifiers enabled and from the next pull from
         kvm.git we'll start using them.  And GRU/XPMEM will also be able to
         continue the development by enabling KVM=m in their config, until they
         submit all GRU/XPMEM GPLv2 code to the mainline kernel.  Then they can
         also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
         This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
         are all =n.
      
      The mmu_notifier_register call can fail because mm_take_all_locks may be
      interrupted by a signal and return -EINTR.  Because mmu_notifier_reigster
      is used when a driver startup, a failure can be gracefully handled.  Here
      an example of the change applied to kvm to register the mmu notifiers.
      Usually when a driver startups other allocations are required anyway and
      -ENOMEM failure paths exists already.
      
       struct  kvm *kvm_arch_create_vm(void)
       {
              struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
      +       int err;
      
              if (!kvm)
                      return ERR_PTR(-ENOMEM);
      
              INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
      
      +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
      +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
      +       if (err) {
      +               kfree(kvm);
      +               return ERR_PTR(err);
      +       }
      +
              return kvm;
       }
      
      mmu_notifier_unregister returns void and it's reliable.
      
      The patch also adds a few needed but missing includes that would prevent
      kernel to compile after these changes on non-x86 archs (x86 didn't need
      them by luck).
      
      [akpm@linux-foundation.org: coding-style fixes]
      [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
      [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NChristoph Lameter <cl@linux-foundation.org>
      Cc: Jack Steiner <steiner@sgi.com>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Kanoj Sarcar <kanojsarcar@yahoo.com>
      Cc: Roland Dreier <rdreier@cisco.com>
      Cc: Steve Wise <swise@opengridcomputing.com>
      Cc: Avi Kivity <avi@qumranet.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Chris Wright <chrisw@redhat.com>
      Cc: Marcelo Tosatti <marcelo@kvack.org>
      Cc: Eric Dumazet <dada1@cosmosbay.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Izik Eidus <izike@qumranet.com>
      Cc: Anthony Liguori <aliguori@us.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cddb8a5c
  5. 27 7月, 2008 1 次提交
  6. 26 7月, 2008 1 次提交
  7. 25 7月, 2008 23 次提交
    • A
      hugetlb: quota is not freed for unused reserved private huge pages · 7251ff78
      Adam Litke 提交于
      With shared reservations (and now also with private reservations), we reserve
      huge pages at mmap time.  We also account for the mapping against fs quota to
      prevent a reservation from being preempted by quota exhaustion.
      
      When testing with the libhugetlbfs test suite, I found a problem with quota
      accounting.  FS quota for allocated pages is handled correctly but we are not
      releasing quota for private pages that were reserved but never allocated.  Do
      this in hugetlb_vm_op_close() at the same time as unused page reservations are
      released.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7251ff78
    • M
      hugetlb: fix a hugepage reservation check for MAP_SHARED · 7f09ca51
      Mel Gorman 提交于
      When removing a huge page from the hugepage pool for a fault the system checks
      to see if the mapping requires additional pages to be reserved, and if it does
      whether there are any unreserved pages remaining.  If not, the allocation
      fails without even attempting to get a page.  In order to determine whether to
      apply this check we call vma_has_private_reserves() which tells us if this vma
      is MAP_PRIVATE and is the owner.  This incorrectly triggers the remaining
      reservation test for MAP_SHARED mappings which prevents allocation of the
      final page in the pool even though it is reserved for this mapping.
      
      In reality we only want to check this for MAP_PRIVATE mappings where the
      process is not the original mapper.  Replace vma_has_private_reserves() with
      vma_has_reserves() which indicates whether further reserves are required, and
      update the caller.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f09ca51
    • J
      hugetlb: allow arch overridden hugepage allocation · 53ba51d2
      Jon Tollefson 提交于
      Allow alloc_bootmem_huge_page() to be overridden by architectures that
      can't always use bootmem.  This requires huge_boot_pages to be available
      for use by this function.
      
      This is required for powerpc 16G pages, which have to be reserved prior to
      boot-time.  The location of these pages are indicated in the device tree.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NJon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53ba51d2
    • N
      hugetlb: override default huge page size · e11bfbfc
      Nick Piggin 提交于
      Allow configurations with the default huge page size which is different to
      the traditional HPAGE_SIZE size.  The default huge page size is the one
      represented in the legacy /proc ABIs, SHM, and which is defaulted to when
      mounting hugetlbfs filesystems.
      
      This is implemented with a new kernel option default_hugepagesz=, which
      defaults to HPAGE_SIZE if not specified.
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e11bfbfc
    • A
      hugetlb: introduce pud_huge · ceb86879
      Andi Kleen 提交于
      Straight forward extensions for huge pages located in the PUD instead of
      PMDs.
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ceb86879
    • A
      hugetlb: printk cleanup · 4abd32db
      Andi Kleen 提交于
      - Reword sentence to clarify meaning with multiple options
      - Add support for using GB prefixes for the page size
      - Add extra printk to delayed > MAX_ORDER allocation code
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4abd32db
    • A
      hugetlb: support boot allocate different sizes · 8faa8b07
      Andi Kleen 提交于
      Make some infrastructure changes to allow boot-time allocation of
      different hugepage page sizes.
      
      - move all basic hstate initialisation into hugetlb_add_hstate
      - create a new function hugetlb_hstate_alloc_pages() to do the
        actual initial page allocations. Call this function early in
        order to allocate giant pages from bootmem.
      - Check for multiple hugepages= parameters
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Acked-by: NAndrew Hastings <abh@cray.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8faa8b07
    • A
      hugetlb: support larger than MAX_ORDER · aa888a74
      Andi Kleen 提交于
      This is needed on x86-64 to handle GB pages in hugetlbfs, because it is
      not practical to enlarge MAX_ORDER to 1GB.
      
      Instead the 1GB pages are only allocated at boot using the bootmem
      allocator using the hugepages=...  option.
      
      These 1G bootmem pages are never freed.  In theory it would be possible to
      implement that with some complications, but since it would be a one-way
      street (>= MAX_ORDER pages cannot be allocated later) I decided not to
      currently.
      
      The >= MAX_ORDER code is not ifdef'ed per architecture.  It is not very
      big and the ifdef uglyness seemed not be worth it.
      
      Known problems: /proc/meminfo and "free" do not display the memory
      allocated for gb pages in "Total".  This is a little confusing for the
      user.
      Acked-by: NAndrew Hastings <abh@cray.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa888a74
    • A
      hugetlb: abstract numa round robin selection · 5ced66c9
      Andi Kleen 提交于
      Need this as a separate function for a future patch.
      
      No behaviour change.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5ced66c9
    • N
      hugetlb: new sysfs interface · a3437870
      Nishanth Aravamudan 提交于
      Provide new hugepages user APIs that are more suited to multiple hstates
      in sysfs.  There is a new directory, /sys/kernel/hugepages.  Underneath
      that directory there will be a directory per-supported hugepage size,
      e.g.:
      
      /sys/kernel/hugepages/hugepages-64kB
      /sys/kernel/hugepages/hugepages-16384kB
      /sys/kernel/hugepages/hugepages-16777216kB
      
      corresponding to 64k, 16m and 16g respectively.  Within each
      hugepages-size directory there are a number of files, corresponding to the
      tracked counters in the hstate, e.g.:
      
      /sys/kernel/hugepages/hugepages-64/nr_hugepages
      /sys/kernel/hugepages/hugepages-64/nr_overcommit_hugepages
      /sys/kernel/hugepages/hugepages-64/free_hugepages
      /sys/kernel/hugepages/hugepages-64/resv_hugepages
      /sys/kernel/hugepages/hugepages-64/surplus_hugepages
      
      Of these files, the first two are read-write and the latter three are
      read-only.  The size of the hugepage being manipulated is trivially
      deducible from the enclosing directory and is always expressed in kB (to
      match meminfo).
      
      [dave@linux.vnet.ibm.com: fix build]
      [nacc@us.ibm.com: hugetlb: hang off of /sys/kernel/mm rather than /sys/kernel]
      [nacc@us.ibm.com: hugetlb: remove CONFIG_SYSFS dependency]
      Acked-by: NGreg Kroah-Hartman <gregkh@suse.de>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Cc: Dave Hansen <dave@linux.vnet.ibm.com>
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a3437870
    • A
      hugetlbfs: per mount huge page sizes · a137e1cc
      Andi Kleen 提交于
      Add the ability to configure the hugetlb hstate used on a per mount basis.
      
      - Add a new pagesize= option to the hugetlbfs mount that allows setting
        the page size
      - This option causes the mount code to find the hstate corresponding to the
        specified size, and sets up a pointer to the hstate in the mount's
        superblock.
      - Change the hstate accessors to use this information rather than the
        global_hstate they were using (requires a slight change in mm/memory.c
        so we don't NULL deref in the error-unmap path -- see comments).
      
      [np: take hstate out of hugetlbfs inode and vma->vm_private_data]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a137e1cc
    • A
      hugetlb: multiple hstates for multiple page sizes · e5ff2159
      Andi Kleen 提交于
      Add basic support for more than one hstate in hugetlbfs.  This is the key
      to supporting multiple hugetlbfs page sizes at once.
      
      - Rather than a single hstate, we now have an array, with an iterator
      - default_hstate continues to be the struct hstate which we use by default
      - Add functions for architectures to register new hstates
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e5ff2159
    • A
      hugetlb: modular state for hugetlb page size · a5516438
      Andi Kleen 提交于
      The goal of this patchset is to support multiple hugetlb page sizes.  This
      is achieved by introducing a new struct hstate structure, which
      encapsulates the important hugetlb state and constants (eg.  huge page
      size, number of huge pages currently allocated, etc).
      
      The hstate structure is then passed around the code which requires these
      fields, they will do the right thing regardless of the exact hstate they
      are operating on.
      
      This patch adds the hstate structure, with a single global instance of it
      (default_hstate), and does the basic work of converting hugetlb to use the
      hstate.
      
      Future patches will add more hstate structures to allow for different
      hugetlbfs mounts to have different page sizes.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5516438
    • A
      hugetlb: factor out prep_new_huge_page · b7ba30c6
      Andi Kleen 提交于
      Needed to avoid code duplication in follow up patches.
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NAndi Kleen <ak@suse.de>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7ba30c6
    • J
      vma_page_offset() has no callees: drop it · a858f7b2
      Johannes Weiner 提交于
      Hugh adds: vma_pagecache_offset() has a dangerously misleading name, since
      it's using hugepage units: rename it to vma_hugecache_offset().
      
      [apw@shadowen.org: restack onto fixed MAP_PRIVATE reservations]
      [akpm@linux-foundation.org: vma_split conversion]
      Signed-off-by: NJohannes Weiner <hannes@saeurebad.de>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andi Kleen <ak@suse.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a858f7b2
    • A
      hugetlb reservations: fix hugetlb MAP_PRIVATE reservations across vma splits · 84afd99b
      Andy Whitcroft 提交于
      When a hugetlb mapping with a reservation is split, a new VMA is cloned
      from the original.  This new VMA is a direct copy of the original
      including the reservation count.  When this pair of VMAs are unmapped we
      will incorrect double account the unused reservation and the overall
      reservation count will be incorrect, in extreme cases it will wrap.
      
      The problem occurs when we split an existing VMA say to unmap a page in
      the middle.  split_vma() will create a new VMA copying all fields from the
      original.  As we are storing our reservation count in vm_private_data this
      is also copies, endowing the new VMA with a duplicate of the original
      VMA's reservation.  Neither of the new VMAs can exhaust these reservations
      as they are too small, but when we unmap and close these VMAs we will
      incorrect credit the remainder twice and resv_huge_pages will become out
      of sync.  This can lead to allocation failures on mappings with
      reservations and even to resv_huge_pages wrapping which prevents all
      subsequent hugepage allocations.
      
      The simple fix would be to correctly apportion the remaining reservation
      count when the split is made.  However the only hook we have vm_ops->open
      only has the new VMA we do not know the identity of the preceeding VMA.
      Also even if we did have that VMA to hand we do not know how much of the
      reservation was consumed each side of the split.
      
      This patch therefore takes a different tack.  We know that the whole of
      any private mapping (which has a reservation) has a reservation over its
      whole size.  Any present pages represent consumed reservation.  Therefore
      if we track the instantiated pages we can calculate the remaining
      reservation.
      
      This patch reuses the existing regions code to track the regions for which
      we have consumed reservation (ie.  the instantiated pages), as each page
      is faulted in we record the consumption of reservation for the new page.
      When we need to return unused reservations at unmap time we simply count
      the consumed reservation region subtracting that from the whole of the
      map.  During a VMA split the newly opened VMA will point to the same
      region map, as this map is offset oriented it remains valid for both of
      the split VMAs.  This map is referenced counted so that it is removed when
      all VMAs which are part of the mmap are gone.
      
      Thanks to Adam Litke and Mel Gorman for their review feedback.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: Jon Tollefson <kniht@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84afd99b
    • A
      hugetlb: allow huge page mappings to be created without reservations · c37f9fb1
      Andy Whitcroft 提交于
      By default all shared mappings and most private mappings now have
      reservations associated with them.  This improves semantics by providing
      allocation guarentees to the mapper.  However a small number of
      applications may attempt to make very large sparse mappings, with these
      strict reservations the system will never be able to honour the mapping.
      
      This patch set brings MAP_NORESERVE support to hugetlb files.  This allows
      new mappings to be made to hugetlbfs files without an associated
      reservation, for both shared and private mappings.  This allows
      applications which want to create very sparse mappings to opt-out of the
      reservation system.  Obviously as there is no reservation they are liable
      to fault at runtime if the huge page pool becomes exhausted; buyer beware.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c37f9fb1
    • A
      hugetlb: move reservation region support earlier · 96822904
      Andy Whitcroft 提交于
      The following patch will require use of the reservation regions support.
      Move this earlier in the file.  No changes have been made to this code.
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96822904
    • A
      huge page private reservation review cleanups · e7c4b0bf
      Andy Whitcroft 提交于
      Create some new accessors for vma private data to cut down on and contain
      the casts.  Encapsulates the huge and small page offset calculations.
      Also adds a couple of VM_BUG_ONs for consistency.
      
      [akpm@linux-foundation.org: Make things static]
      Signed-off-by: NAndy Whitcroft <apw@shadowen.org>
      Acked-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Adam Litke <agl@us.ibm.com>
      Cc: Johannes Weiner <hannes@saeurebad.de>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7c4b0bf
    • M
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE)... · 04f2cbe3
      Mel Gorman 提交于
      hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed
      
      After patch 2 in this series, a process that successfully calls mmap() for
      a MAP_PRIVATE mapping will be guaranteed to successfully fault until a
      process calls fork().  At that point, the next write fault from the parent
      could fail due to COW if the child still has a reference.
      
      We only reserve pages for the parent but a copy must be made to avoid
      leaking data from the parent to the child after fork().  Reserves could be
      taken for both parent and child at fork time to guarantee faults but if
      the mapping is large it is highly likely we will not have sufficient pages
      for the reservation, and it is common to fork only to exec() immediatly
      after.  A failure here would be very undesirable.
      
      Note that the current behaviour of mainline with MAP_PRIVATE pages is
      pretty bad.  The following situation is allowed to occur today.
      
      1. Process calls mmap(MAP_PRIVATE)
      2. Process calls mlock() to fault all pages and makes sure it succeeds
      3. Process forks()
      4. Process writes to MAP_PRIVATE mapping while child still exists
      5. If the COW fails at this point, the process gets SIGKILLed even though it
         had taken care to ensure the pages existed
      
      This patch improves the situation by guaranteeing the reliability of the
      process that successfully calls mmap().  When the parent performs COW, it
      will try to satisfy the allocation without using reserves.  If that fails
      the parent will steal the page leaving any children without a page.
      Faults from the child after that point will result in failure.  If the
      child COW happens first, an attempt will be made to allocate the page
      without reserves and the child will get SIGKILLed on failure.
      
      To summarise the new behaviour:
      
      1. If the original mapper performs COW on a private mapping with multiple
         references, it will attempt to allocate a hugepage from the pool or
         the buddy allocator without using the existing reserves. On fail, VMAs
         mapping the same area are traversed and the page being COW'd is unmapped
         where found. It will then steal the original page as the last mapper in
         the normal way.
      
      2. The VMAs the pages were unmapped from are flagged to note that pages
         with data no longer exist. Future no-page faults on those VMAs will
         terminate the process as otherwise it would appear that data was corrupted.
         A warning is printed to the console that this situation occured.
      
      2. If the child performs COW first, it will attempt to satisfy the COW
         from the pool if there are enough pages or via the buddy allocator if
         overcommit is allowed and the buddy allocator can satisfy the request. If
         it fails, the child will be killed.
      
      If the pool is large enough, existing applications will not notice that
      the reserves were a factor.  Existing applications depending on the
      no-reserves been set are unlikely to exist as for much of the history of
      hugetlbfs, pages were prefaulted at mmap(), allocating the pages at that
      point or failing the mmap().
      
      [npiggin@suse.de: fix CONFIG_HUGETLB=n build]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f2cbe3
    • M
      hugetlb: reserve huge pages for reliable MAP_PRIVATE hugetlbfs mappings until fork() · a1e78772
      Mel Gorman 提交于
      This patch reserves huge pages at mmap() time for MAP_PRIVATE mappings in
      a similar manner to the reservations taken for MAP_SHARED mappings.  The
      reserve count is accounted both globally and on a per-VMA basis for
      private mappings.  This guarantees that a process that successfully calls
      mmap() will successfully fault all pages in the future unless fork() is
      called.
      
      The characteristics of private mappings of hugetlbfs files behaviour after
      this patch are;
      
      1. The process calling mmap() is guaranteed to succeed all future faults until
         it forks().
      2. On fork(), the parent may die due to SIGKILL on writes to the private
         mapping if enough pages are not available for the COW. For reasonably
         reliable behaviour in the face of a small huge page pool, children of
         hugepage-aware processes should not reference the mappings; such as
         might occur when fork()ing to exec().
      3. On fork(), the child VMAs inherit no reserves. Reads on pages already
         faulted by the parent will succeed. Successful writes will depend on enough
         huge pages being free in the pool.
      4. Quotas of the hugetlbfs mount are checked at reserve time for the mapper
         and at fault time otherwise.
      
      Before this patch, all reads or writes in the child potentially needs page
      allocations that can later lead to the death of the parent.  This applies
      to reads and writes of uninstantiated pages as well as COW.  After the
      patch it is only a write to an instantiated page that causes problems.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a1e78772
    • M
      hugetlb: move hugetlb_acct_memory() · fc1b8a73
      Mel Gorman 提交于
      This is a patchset to give reliable behaviour to a process that
      successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file.  Currently, it
      is possible for the process to be killed due to a small hugepage pool size
      even if it calls mlock().
      
      MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time.  This
      guarantees all future faults against the mapping will succeed.  This
      allows local allocations at first use improving NUMA locality whilst
      retaining reliability.
      
      MAP_PRIVATE mappings do not reserve pages.  This can result in an
      application being SIGKILLed later if a huge page is not available at fault
      time.  This makes huge pages usage very ill-advised in some cases as the
      unexpected application failure cannot be detected and handled as it is
      immediately fatal.  Although an application may force instantiation of the
      pages using mlock(), this may lead to poor memory placement and the
      process may still be killed when performing COW.
      
      This patchset introduces a reliability guarantee for the process which
      creates a private mapping, i.e.  the process that calls mmap() on a
      hugetlbfs file successfully.  The first patch of the set is purely
      mechanical code move to make later diffs easier to read.  The second patch
      will guarantee faults up until the process calls fork().  After patch two,
      as long as the child keeps the mappings, the parent is no longer
      guaranteed to be reliable.  Patch 3 guarantees that the parent will always
      successfully COW by unmapping the pages from the child in the event there
      are insufficient pages in the hugepage pool in allocate a new page, be it
      via a static or dynamic pool.
      
      Existing hugepage-aware applications are unlikely to be affected by this
      change.  For much of hugetlbfs's history, pages were pre-faulted at mmap()
      time or mmap() failed which acts in a reserve-like manner.  If the pool is
      sized correctly already so that parent and child can fault reliably, the
      application will not even notice the reserves.  It's only when the pool is
      too small for the application to function perfectly reliably that the
      reserves come into play.
      
      Credit goes to Andy Whitcroft for cleaning up a number of mistakes during
      review before the patches were released.
      
      This patch:
      
      A later patch in this set needs to call hugetlb_acct_memory() before it is
      defined.  This patch moves the function without modification.  This makes
      later diffs easier to read.
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc1b8a73
    • A
      mm/hugetlb.c: fix duplicate variable · 75353bed
      Adrian Bunk 提交于
      It's confusing that set_max_huge_pages() contained two different
      variables named "ret", and although the code works correctly this should
      be fixed.
      
      The inner of the two variables can simply be removed.
      
      Spotted by sparse.
      Signed-off-by: NAdrian Bunk <bunk@kernel.org>
      Cc:  "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      75353bed
  8. 07 6月, 2008 1 次提交
    • N
      hugetlb: fix lockdep error · 46478758
      Nick Piggin 提交于
      =============================================
      [ INFO: possible recursive locking detected ]
      2.6.26-rc4 #30
      ---------------------------------------------
      heap-overflow/2250 is trying to acquire lock:
       (&mm->page_table_lock){--..}, at: [<c0000000000cf2e8>] .copy_hugetlb_page_range+0x108/0x280
      
      but task is already holding lock:
       (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280
      
      other info that might help us debug this:
      3 locks held by heap-overflow/2250:
       #0:  (&mm->mmap_sem){----}, at: [<c000000000050e44>] .dup_mm+0x134/0x410
       #1:  (&mm->mmap_sem/1){--..}, at: [<c000000000050e54>] .dup_mm+0x144/0x410
       #2:  (&mm->page_table_lock){--..}, at: [<c0000000000cf2dc>] .copy_hugetlb_page_range+0xfc/0x280
      
      stack backtrace:
      Call Trace:
      [c00000003b2774e0] [c000000000010ce4] .show_stack+0x74/0x1f0 (unreliable)
      [c00000003b2775a0] [c0000000003f10e0] .dump_stack+0x20/0x34
      [c00000003b277620] [c0000000000889bc] .__lock_acquire+0xaac/0x1080
      [c00000003b277740] [c000000000089000] .lock_acquire+0x70/0xb0
      [c00000003b2777d0] [c0000000003ee15c] ._spin_lock+0x4c/0x80
      [c00000003b277870] [c0000000000cf2e8] .copy_hugetlb_page_range+0x108/0x280
      [c00000003b277950] [c0000000000bcaa8] .copy_page_range+0x558/0x790
      [c00000003b277ac0] [c000000000050fe0] .dup_mm+0x2d0/0x410
      [c00000003b277ba0] [c000000000051d24] .copy_process+0xb94/0x1020
      [c00000003b277ca0] [c000000000052244] .do_fork+0x94/0x310
      [c00000003b277db0] [c000000000011240] .sys_clone+0x60/0x80
      [c00000003b277e30] [c0000000000078c4] .ppc_clone+0x8/0xc
      
      Fix is the same way that mm/memory.c copy_page_range does the
      lockdep annotation.
      Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Acked-by: NAdam Litke <agl@us.ibm.com>
      Acked-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46478758
  9. 29 4月, 2008 2 次提交
    • N
      page allocator: explicitly retry hugepage allocations · 551883ae
      Nishanth Aravamudan 提交于
      Add __GFP_REPEAT to hugepage allocations.  Do so to not necessitate userspace
      putting pressure on the VM by repeated echo's into /proc/sys/vm/nr_hugepages
      to grow the pool.  With the previous patch to allow for large-order
      __GFP_REPEAT attempts to loop for a bit (as opposed to indefinitely), this
      increases the likelihood of getting hugepages when the system experiences (or
      recently experienced) load.
      
      Mel tested the patchset on an x86_32 laptop.  With the patches, it was easier
      to use the proc interface to grow the hugepage pool.  The following is the
      output of a script that grows the pool as much as possible running on
      2.6.25-rc9.
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 57 pages Progress made with 57 pages
      Attempt 2: 73 pages Progress made with 16 pages
      Attempt 3: 74 pages Progress made with 1 pages
      Attempt 4: 75 pages Progress made with 1 pages
      Attempt 5: 77 pages Progress made with 2 pages
      
      77 pages was the most it allocated but it took 5 attempts from userspace
      to get it. With the 3 patches in this series applied,
      
      Allocating hugepages test
      -------------------------
      Disabling OOM Killer for current test process
      Starting page count: 0
      Attempt 1: 75 pages Progress made with 75 pages
      Attempt 2: 76 pages Progress made with 1 pages
      Attempt 3: 79 pages Progress made with 3 pages
      
      And 79 pages was the most it got. Your patches were able to allocate the
      bulk of possible pages on the first attempt.
      Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Tested-by: NMel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      551883ae
    • H
      mm: fix integer as NULL pointer warnings · 7b8ee84d
      Harvey Harrison 提交于
      mm/hugetlb.c:207:11: warning: Using plain integer as NULL pointer
      Signed-off-by: NHarvey Harrison <harvey.harrison@gmail.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b8ee84d
  10. 28 4月, 2008 7 次提交
    • G
      hugetlbfs: common code update for s390 · 7f2e9525
      Gerald Schaefer 提交于
      Huge ptes have a special type on s390 and cannot be handled with the standard
      pte functions in certain cases, e.g.  because of a different location of the
      invalid bit.  This patch adds some new architecture- specific functions to
      hugetlb common code, as a prerequisite for the s390 large page support.
      
      This won't affect other architectures in functionality, but I need to add some
      new dummy inline functions to the headers.
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f2e9525
    • G
      hugetlbfs: add missing TLB flush to hugetlb_cow() · 8fe627ec
      Gerald Schaefer 提交于
      A cow break on a hugetlbfs page with page_count > 1 will set a new pte with
      set_huge_pte_at(), w/o any tlb flush operation.  The old pte will remain in
      the tlb and subsequent write access to the page will result in a page fault
      loop, for as long as it may take until the tlb is flushed from somewhere else.
       This patch introduces an architecture-specific huge_ptep_clear_flush()
      function, which is called before the the set_huge_pte_at() in hugetlb_cow().
      
      ATTENTION: This is just a nop on all architectures for now, the s390
      implementation will come with our large page patch later.  Other architectures
      should define their own huge_ptep_clear_flush() if needed.
      Acked-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8fe627ec
    • L
      mempolicy: rework mempolicy Reference Counting [yet again] · 52cd3b07
      Lee Schermerhorn 提交于
      After further discussion with Christoph Lameter, it has become clear that my
      earlier attempts to clean up the mempolicy reference counting were a bit of
      overkill in some areas, resulting in superflous ref/unref in what are usually
      fast paths.  In other areas, further inspection reveals that I botched the
      unref for interleave policies.
      
      A separate patch, suitable for upstream/stable trees, fixes up the known
      errors in the previous attempt to fix reference counting.
      
      This patch reworks the memory policy referencing counting and, one hopes,
      simplifies the code.  Maybe I'll get it right this time.
      
      See the update to the numa_memory_policy.txt document for a discussion of
      memory policy reference counting that motivates this patch.
      
      Summary:
      
      Lookup of mempolicy, based on (vma, address) need only add a reference for
      shared policy, and we need only unref the policy when finished for shared
      policies.  So, this patch backs out all of the unneeded extra reference
      counting added by my previous attempt.  It then unrefs only shared policies
      when we're finished with them, using the mpol_cond_put() [conditional put]
      helper function introduced by this patch.
      
      Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
      containing just the policy.  read_swap_cache_async() can call alloc_page_vma()
      multiple times, so we can't let alloc_page_vma() unref the shared policy in
      this case.  To avoid this, we make a copy of any non-null shared policy and
      remove the MPOL_F_SHARED flag from the copy.  This copy occurs before reading
      a page [or multiple pages] from swap, so the overhead should not be an issue
      here.
      
      I introduced a new static inline function "mpol_cond_copy()" to copy the
      shared policy to an on-stack policy and remove the flags that would require a
      conditional free.  The current implementation of mpol_cond_copy() assumes that
      the struct mempolicy contains no pointers to dynamically allocated structures
      that must be duplicated or reference counted during copy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      52cd3b07
    • L
      mempolicy: rename mpol_free to mpol_put · f0be3d32
      Lee Schermerhorn 提交于
      This is a change that was requested some time ago by Mel Gorman.  Makes sense
      to me, so here it is.
      
      Note: I retain the name "mpol_free_shared_policy()" because it actually does
      free the shared_policy, which is NOT a reference counted object.  However, ...
      
      The mempolicy object[s] referenced by the shared_policy are reference counted,
      so mpol_put() is used to release the reference held by the shared_policy.  The
      mempolicy might not be freed at this time, because some task attached to the
      shared object associated with the shared policy may be in the process of
      allocating a page based on the mempolicy.  In that case, the task performing
      the allocation will hold a reference on the mempolicy, obtained via
      mpol_shared_policy_lookup().  The mempolicy will be freed when all tasks
      holding such a reference have called mpol_put() for the mempolicy.
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andi Kleen <ak@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0be3d32
    • A
      Subject: [PATCH] hugetlb: vmstat events for huge page allocations · 3b116300
      Adam Litke 提交于
      Allocating huge pages directly from the buddy allocator is not guaranteed to
      succeed.  Success depends on several factors (such as the amount of physical
      memory available and the level of fragmentation).  With the addition of
      dynamic hugetlb pool resizing, allocations can occur much more frequently.
      For these reasons it is desirable to keep track of huge page allocation
      successes and failures.
      
      Add two new vmstat entries to track huge page allocations that succeed and
      fail.  The presence of the two entries is contingent upon CONFIG_HUGETLB_PAGE
      being enabled.
      
      [akpm@linux-foundation.org: reduced ifdeffery]
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Signed-off-by: NEric Munson <ebmunson@us.ibm.com>
      Tested-by: NMel Gorman <mel@csn.ul.ie>
      Reviewed-by: NAndy Whitcroft <apw@shadowen.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b116300
    • A
      hugetlb: decrease hugetlb_lock cycling in gather_surplus_huge_pages · 19fc3f0a
      Adam Litke 提交于
      To reduce hugetlb_lock acquisitions and releases when freeing excess surplus
      pages, scan the page list in two parts.  First, transfer the needed pages to
      the hugetlb pool.  Then drop the lock and free the remaining pages back to the
      buddy allocator.
      
      In the common case there are zero excess pages and no lock operations are
      required.
      
      Thanks Mel Gorman for this improvement.
      Signed-off-by: NAdam Litke <agl@us.ibm.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: William Lee Irwin III <wli@holomorphy.com>
      Cc: Andy Whitcroft <apw@shadowen.org>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19fc3f0a
    • M
      mm: filter based on a nodemask as well as a gfp_mask · 19770b32
      Mel Gorman 提交于
      The MPOL_BIND policy creates a zonelist that is used for allocations
      controlled by that mempolicy.  As the per-node zonelist is already being
      filtered based on a zone id, this patch adds a version of __alloc_pages() that
      takes a nodemask for further filtering.  This eliminates the need for
      MPOL_BIND to create a custom zonelist.
      
      A positive benefit of this is that allocations using MPOL_BIND now use the
      local node's distance-ordered zonelist instead of a custom node-id-ordered
      zonelist.  I.e., pages will be allocated from the closest allowed node with
      available memory.
      
      [Lee.Schermerhorn@hp.com: Mempolicy: update stale documentation and comments]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask]
      [Lee.Schermerhorn@hp.com: Mempolicy: make dequeue_huge_page_vma() obey MPOL_BIND nodemask rework]
      Signed-off-by: NMel Gorman <mel@csn.ul.ie>
      Acked-by: NChristoph Lameter <clameter@sgi.com>
      Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      19770b32