1. 11 December 2012 (11 commits)
    • I
      mm/rmap: Convert the struct anon_vma::mutex to an rwsem · 5a505085
Ingo Molnar authored
      Convert the struct anon_vma::mutex to an rwsem, which will help
      in solving a page-migration scalability problem. (Addressed in
      a separate patch.)
      
      The conversion is simple and straightforward: in every case
      where we mutex_lock()ed we'll now down_write().
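
A minimal sketch of the conversion pattern, assuming a structure that embeds
the lock (the struct and helper names below are illustrative, not the actual
patch):

	#include <linux/rwsem.h>

	/* was: struct anon_vma { struct mutex mutex; ... }; */
	struct example_anon_vma {
		struct rw_semaphore rwsem;
	};

	static void example_lock_anon_vma(struct example_anon_vma *av)
	{
		down_write(&av->rwsem);		/* was: mutex_lock(&av->mutex); */
	}

	static void example_unlock_anon_vma(struct example_anon_vma *av)
	{
		up_write(&av->rwsem);		/* was: mutex_unlock(&av->mutex); */
	}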
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Turner <pjt@google.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Mel Gorman <mgorman@suse.de>
      5a505085
    • M
      mm: numa: Add THP migration for the NUMA working set scanning fault case. · b32967ff
Mel Gorman authored
      Note: This is very heavily based on a patch from Peter Zijlstra with
      	fixes from Ingo Molnar, Hugh Dickins and Johannes Weiner.  That patch
      	put a lot of migration logic into mm/huge_memory.c where it does
not belong. This version tries to share some of the migration
      	logic with migrate_misplaced_page.  However, it should be noted
      	that now migrate.c is doing more with the pagetable manipulation
      	than is preferred. The end result is barely recognisable so as
      	before, the signed-offs had to be removed but will be re-added if
      	the original authors are ok with it.
      
      Add THP migration for the NUMA working set scanning fault case.
      
      It uses the page lock to serialize. No migration pte dance is
      necessary because the pte is already unmapped when we decide
      to migrate.
      
      [dhillf@gmail.com: Fix memory leak on isolation failure]
      [dhillf@gmail.com: Fix transfer of last_nid information]
Signed-off-by: Mel Gorman <mgorman@suse.de>
      b32967ff
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
Mel Gorman authored
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
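
A sketch of that back-off (not the literal patch): it assumes the per-task
numa_scan_period field and the scan-period sysctls added by this series, and
omits the once-a-minute phase-change reset:

	#include <linux/kernel.h>
	#include <linux/sched.h>

	static void example_adjust_scan_period(struct task_struct *p, bool migrated)
	{
		if (!migrated)
			/* fault did not migrate: back off, capped at the maximum */
			p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
						  p->numa_scan_period * 2);
		else
			/* fault migrated: assume we are still converging, scan faster */
			p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
	}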
Signed-off-by: Mel Gorman <mgorman@suse.de>
      b8593bfd
    • H
      mm: numa: split_huge_page: Transfer last_nid on tail page · 5aa80374
Hillf Danton authored
      Pass last_nid from head page to tail page.
Signed-off-by: Hillf Danton <dhillf@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
      5aa80374
    • M
      mm: numa: Add pte updates, hinting and migration stats · 03c5a6e1
Mel Gorman authored
      It is tricky to quantify the basic cost of automatic NUMA placement in a
      meaningful manner. This patch adds some vmstats that can be used as part
      of a basic costing model.
      
      u    = basic unit = sizeof(void *)
      Ca   = cost of struct page access = sizeof(struct page) / u
      Cpte = Cost PTE access = Ca
      Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
      	where Cpte is incurred twice for a read and a write and Wlock
      	is a constant representing the cost of taking or releasing a
      	lock
      Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
      Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
      Ci = Cost of page isolation = Ca + Wi
      	where Wi is a constant that should reflect the approximate cost
      	of the locking operation
      Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
      	where Wnuma is the approximate NUMA factor. 1 is local. 1.2
      	would imply that remote accesses are 20% more expensive
      
      Balancing cost = Cpte * numa_pte_updates +
      		Cnumahint * numa_hint_faults +
      		Ci * numa_pages_migrated +
      		Cpagecopy * numa_pages_migrated
      
      Note that numa_pages_migrated is used as a measure of how many pages
      were isolated even though it would miss pages that failed to migrate. A
      vmstat counter could have been added for it but the isolation cost is
      pretty marginal in comparison to the overall cost so it seemed overkill.
      
      The ideal way to measure automatic placement benefit would be to count
      the number of remote accesses versus local accesses and do something like
      
	benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
      
      but the information is not readily available. As a workload converges, the
expectation would be that the number of remote numa hints would reduce to 0.
      
      	convergence = numa_hint_faults_local / numa_hint_faults
      		where this is measured for the last N number of
      		numa hints recorded. When the workload is fully
      		converged the value is 1.
      
      This can measure if the placement policy is converging and how fast it is
      doing it.
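
For illustration only (not part of the patch), the convergence ratio above can
be derived from the counters this patch exports in /proc/vmstat; note this
uses the cumulative counters rather than a window over the last N faults:

	/* numa-convergence.c: print numa_hint_faults_local / numa_hint_faults */
	#include <stdio.h>
	#include <string.h>

	static unsigned long long vmstat_read(const char *name)
	{
		char key[64];
		unsigned long long val = 0, found = 0;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 0;
		while (fscanf(f, "%63s %llu", key, &val) == 2) {
			if (!strcmp(key, name)) {
				found = val;
				break;
			}
		}
		fclose(f);
		return found;
	}

	int main(void)
	{
		unsigned long long faults = vmstat_read("numa_hint_faults");
		unsigned long long local = vmstat_read("numa_hint_faults_local");

		if (faults)
			printf("convergence = %.3f\n", (double)local / faults);
		else
			printf("no NUMA hinting faults recorded\n");
		return 0;
	}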
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
      03c5a6e1
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
Peter Zijlstra authored
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
This patch creates a bare-bones method for setting PTEs pte_numa in the
context of the scheduler; when faulted later, they will be faulted onto the
node the CPU is running on.  In itself this does nothing useful, but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
    • M
      mm: mempolicy: Implement change_prot_numa() in terms of change_protection() · 4b10e7d5
Mel Gorman authored
      This patch converts change_prot_numa() to use change_protection(). As
      pte_numa and friends check the PTE bits directly it is necessary for
      change_protection() to use pmd_mknuma(). Hence the required
      modifications to change_protection() are a little clumsy but the
      end result is that most of the numa page table helpers are just one or
      two instructions.
Signed-off-by: Mel Gorman <mgorman@suse.de>
      4b10e7d5
    • M
      mm: mempolicy: Use _PAGE_NUMA to migrate pages · 4daae3b4
Mel Gorman authored
      Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
      	sufficiently different that the signed-off-bys were dropped
      
      Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
      pieces into an effective migrate on fault scheme.
      
      Note that (on x86) we rely on PROT_NONE pages being !present and avoid
      the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
      page-migration performance.
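
A simplified sketch of the fault-side flow this enables (reference counting,
error handling and statistics omitted; the helper signatures are the
era-specific ones used by this series, so treat them as illustrative):

	/* in the _PAGE_NUMA fault handler, roughly: */
	struct page *page;
	int target_nid;

	page = vm_normal_page(vma, addr, pte);		/* page behind the NUMA pte */
	target_nid = mpol_misplaced(page, vma, addr);	/* where does policy want it? */
	if (target_nid != -1)
		migrate_misplaced_page(page, target_nid); /* move it near this CPU */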
Based-on-work-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Mel Gorman <mgorman@suse.de>
      4daae3b4
    • M
      mm: numa: Create basic numa page hinting infrastructure · d10e63f2
Mel Gorman authored
      Note: This patch started as "mm/mpol: Create special PROT_NONE
      	infrastructure" and preserves the basic idea but steals *very*
      	heavily from "autonuma: numa hinting page faults entry points" for
      	the actual fault handlers without the migration parts.	The end
      	result is barely recognisable as either patch so all Signed-off
      	and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
      	this version, I will re-add the signed-offs-by to reflect the history.
      
      In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant; we can then use the 'spurious'
protection faults it generates to drive our migrations.
      
      The meaning of PAGE_NUMA depends on the architecture but on x86 it is
      effectively PROT_NONE. Actual PROT_NONE mappings will not generate these
      NUMA faults for the reason that the page fault code checks the permission on
      the VMA (and will throw a segmentation fault on actual PROT_NONE mappings),
      before it ever calls handle_mm_fault.
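
In essence, the generic PTE fault path gains a dispatch on the new helper
before the usual present/protection handling; a simplified sketch, using the
pte_numa()/do_numa_page() helpers this series introduces:

	/* in handle_pte_fault(), after loading the pte into 'entry': */
	if (pte_numa(entry))
		return do_numa_page(mm, vma, address, entry, pte, pmd);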
      
      [dhillf@gmail.com: Fix typo]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
      d10e63f2
    • A
      mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte · 1ba6e0b5
Andrea Arcangeli authored
      When we split a transparent hugepage, transfer the NUMA type from the
      pmd to the pte if needed.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
      1ba6e0b5
    • M
      mm: Check if PTE is already allocated during page fault · 4fd01770
Mel Gorman authored
      With transparent hugepage support, handle_mm_fault() has to be careful
      that a normal PMD has been established before handling a PTE fault. To
      achieve this, it used __pte_alloc() directly instead of pte_alloc_map
      as pte_alloc_map is unsafe to run against a huge PMD. pte_offset_map()
      is called once it is known the PMD is safe.
      
      pte_alloc_map() is smart enough to check if a PTE is already present
      before calling __pte_alloc but this check was lost. As a consequence,
      PTEs may be allocated unnecessarily and the page table lock taken.
This useless PTE does get cleaned up but it's a performance hit which
      is visible in page_test from aim9.
      
      This patch simply re-adds the check normally done by pte_alloc_map to
      check if the PTE needs to be allocated before taking the page table
lock. The effect is noticeable in page_test from aim9.
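
The re-added fast-path check looks roughly like this (simplified), mirroring
what pte_alloc_map() does internally before it would call __pte_alloc():

	/* in handle_mm_fault(), before taking the page table lock: */
	if (unlikely(pmd_none(*pmd)) &&
	    unlikely(__pte_alloc(mm, vma, pmd, address)))
		return VM_FAULT_OOM;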
      
       AIM9
                       2.6.38-vanilla 2.6.38-checkptenone
       creat-clo      446.10 ( 0.00%)   424.47 (-5.10%)
       page_test       38.10 ( 0.00%)    42.04 ( 9.37%)
       brk_test        52.45 ( 0.00%)    51.57 (-1.71%)
       exec_test      382.00 ( 0.00%)   456.90 (16.39%)
       fork_test       60.11 ( 0.00%)    67.79 (11.34%)
       MMTests Statistics: duration
       Total Elapsed Time (seconds)                611.90    612.22
      
(While this affects 2.6.38, it is a performance rather than a
functional bug and normally outside the rules for -stable. While the big
performance differences are in a microbenchmark, the difference in fork
and exec performance may be significant enough that -stable wants to
consider the patch.)
Reported-by: Raz Ben Yehuda <raziebe@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Rik van Riel <riel@redhat.com>
      [ Picked this up from the AutoNUMA tree to help
        it upstream and to allow apples-to-apples
        performance comparisons. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4fd01770
2. 15 October 2012 (1 commit)
3. 09 October 2012 (26 commits)
    • D
mm: thp: Use more portable PMD clearing sequence in zap_huge_pmd(). · f5c8ad47
David Miller authored
      Invalidation sequences are handled in various ways on various
      architectures.
      
      One way, which sparc64 uses, is to let the set_*_at() functions accumulate
      pending flushes into a per-cpu array.  Then the flush_tlb_range() et al.
      calls process the pending TLB flushes.
      
      In this regime, the __tlb_remove_*tlb_entry() implementations are
      essentially NOPs.
      
      The canonical PTE zap in mm/memory.c is:
      
      			ptent = ptep_get_and_clear_full(mm, addr, pte,
      							tlb->fullmm);
      			tlb_remove_tlb_entry(tlb, pte, addr);
      
      With a subsequent tlb_flush_mmu() if needed.
      
      Mirror this in the THP PMD zapping using:
      
      		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      		page = pmd_page(orig_pmd);
      		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      
And we properly accommodate TLB flush mechanisms like the one described
      above.
Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5c8ad47
    • D
      mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
David Miller authored
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
      
[akpm@linux-foundation.org: implement update_mmu_cache_pmd() for s390]
Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b113da65
    • D
      mm, thp: fix mapped pages avoiding unevictable list on mlock · b676b293
David Rientjes authored
      When a transparent hugepage is mapped and it is included in an mlock()
      range, follow_page() incorrectly avoids setting the page's mlock bit and
      moving it to the unevictable lru.
      
      This is evident if you try to mlock(), munlock(), and then mlock() a
      range again.  Currently:
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
      		Inactive(anon):     6304 kB
      		Unevictable:     4213924 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4186252 kB
      		Unevictable:       19652 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198556 kB
      		Unevictable:       21684 kB
      
      Notice that less than 2MB was added to the unevictable list; this is
      because these pages in the range are not transparent hugepages since the
      4GB range was allocated with mmap() and has no specific alignment.  If
      posix_memalign() were used instead, unevictable would not have grown at
      all on the second mlock().
      
      The fix is to call mlock_vma_page() so that the mlock bit is set and the
      page is added to the unevictable list.  With this patch:
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4056 kB
      		Unevictable:     4213940 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198268 kB
      		Unevictable:       19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4008 kB
      		Unevictable:     4213940 kB
Signed-off-by: David Rientjes <rientjes@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b676b293
    • S
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
Sagi Grimberg authored
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
         This will allow users the same flexibility they get when swapping any
        other part of their processes address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
  which pages need to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
[akpm@linux-foundation.org: possible speed tweak in hugetlb_cow(), cleanups]
Signed-off-by: Andrea Arcangeli <andrea@qumranet.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • C
      mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c · eab1eef9
Catalin Marinas authored
update_mmu_cache() takes a pointer (to pte_t by default) as the last
argument, but huge_memory.c passes a pmd_t value.  The patch changes
the argument to a pmd_t pointer.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eab1eef9
    • X
      thp: khugepaged_prealloc_page() forgot to reset the page alloc indicator · e3b4126c
Xiao Guangrong authored
      If NUMA is enabled, the indicator is not reset if the previous page
request failed, causing us to trigger the BUG_ON() in
      khugepaged_alloc_page().
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3b4126c
    • M
      mm rmap: remove vma_address check for address inside vma · 86c2ad19
Michel Lespinasse authored
      In file and anon rmap, we use interval trees to find potentially relevant
      vmas and then call vma_address() to find the virtual address the given
      page might be found at in these vmas.  vma_address() used to include a
      check that the returned address falls within the limits of the vma, but
      this check isn't necessary now that we always use interval trees in rmap:
      the interval tree just doesn't return any vmas which this check would find
      to be irrelevant.  As a result, we can replace the use of -EFAULT error
      code (which then needed to be checked in every call site) with a
      VM_BUG_ON().
Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86c2ad19
    • M
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
Michel Lespinasse authored
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
have to walk potentially thousands of irrelevant vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
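
The typical lookup pattern after this change, simplified from the rmap
walkers (locking and the per-vma processing are omitted, and the pgoff
computation is illustrative):

	struct anon_vma_chain *avc;
	pgoff_t pgoff = page->index;	/* linear index of the page in the mapping */

	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root, pgoff, pgoff) {
		struct vm_area_struct *vma = avc->vma;
		unsigned long address = vma_address(page, vma);

		/* ... process the mapping of 'page' at 'address' in 'vma' ... */
	}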
Signed-off-by: Michel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • G
      thp: make MADV_HUGEPAGE check for mm->def_flags · 8e72033f
Gerald Schaefer authored
      This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if
      VM_NOHUGEPAGE is set in mm->def_flags.  On s390, the VM_NOHUGEPAGE flag
      will be set in mm->def_flags for kvm processes, to prevent any future thp
      mappings.  In order to also prevent MADV_HUGEPAGE on such an mm,
      hugepage_madvise() should check mm->def_flags.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e72033f
    • G
      thp: introduce pmdp_invalidate() · 46dcde73
Gerald Schaefer authored
      On s390, a valid page table entry must not be changed while it is attached
      to any CPU.  So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
      operation would be necessary there.  This patch introduces the
      pmdp_invalidate() function, to allow architecture-specific
      implementations.
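
For architectures without such constraints, the generic fallback can be as
simple as the sketch below (s390 provides its own IDTE-based version; treat
the exact prototype as illustrative):

	void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
			     pmd_t *pmdp)
	{
		set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(*pmdp));
		flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	}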
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46dcde73
    • G
      thp: remove assumptions on pgtable_t type · e3ebcf64
Gerald Schaefer authored
      The thp page table pre-allocation code currently assumes that pgtable_t is
      of type "struct page *".  This may not be true for all architectures, so
      this patch removes that assumption by replacing the functions
      prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
      can be defined architecture-specific.
      
      It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
      operating on a pgtable_t.  Apart from the VM_BUG_ON removal, there will be
      no functional change introduced by this patch.
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3ebcf64
    • X
      thp: remove unnecessary set_recommended_min_free_kbytes · 227e4047
Xiao Guangrong authored
It is already called in start_khugepaged, so the extra call is unnecessary.
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      227e4047
    • X
      thp: use khugepaged_enabled to remove duplicate code · 17c230af
Xiao Guangrong authored
      Use khugepaged_enabled to see whether thp is enabled
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17c230af
    • X
      thp: remove khugepaged_loop · b7231789
Xiao Guangrong authored
      Merge khugepaged_loop into khugepaged
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b7231789
    • X
      thp: introduce khugepaged_prealloc_page and khugepaged_alloc_page · 26234f36
Xiao Guangrong authored
      They are used to abstract the difference between NUMA enabled and NUMA
      disabled to make the code more readable
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26234f36
    • X
      thp: release page in page pre-alloc path · 420256ef
Xiao Guangrong authored
      If NUMA is enabled, we can release the page in the page pre-alloc
operation, so the CONFIG_NUMA-dependent code can be reduced
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      420256ef
    • X
      thp: merge page pre-alloc in khugepaged_loop into khugepaged_do_scan · d516904b
Xiao Guangrong authored
There are two pre-alloc operations in these two functions; the difference is:
- khugepaged_loop allows sleeping if the page allocation fails
- khugepaged_do_scan exits immediately if the page allocation fails
      
      Actually, in khugepaged_do_scan, we can allow the pre-alloc to sleep on
the first failure, so the operation in khugepaged_loop can be removed
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d516904b
    • X
      thp: remove some code depend on CONFIG_NUMA · 9817626e
Xiao Guangrong authored
      If NUMA is disabled, hpage is used as page pre-alloc, so there are two
      cases for hpage:
      
- it is non-NULL, meaning the page has not been consumed
- otherwise, the page has been consumed
      
If NUMA is enabled, hpage is just used as an allocation-failure indicator and
is not a real page; NULL means no failure has been triggered.
      
      So, we can release the page only if !IS_ERR_OR_NULL
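
The unified release then becomes a single check, roughly:

	/* covers both configurations: real page, ERR_PTR indicator, or NULL */
	if (!IS_ERR_OR_NULL(hpage))
		put_page(hpage);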
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9817626e
    • X
      thp: remove wake_up_interruptible in the exit path · 2017c0bf
Xiao Guangrong authored
Add a kthread_should_stop() check to the conditions used to wake up on
khugepaged_wait; then kthread_stop() is enough to let the thread exit
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2017c0bf
    • X
      thp: remove unnecessary khugepaged_thread check · e060f0e0
Xiao Guangrong authored
Now that khugepaged creation and cancellation are completely serialized under
the protection of khugepaged_mutex, it is impossible for multiple khugepaged
entities to be running
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e060f0e0
    • X
      thp: move khugepaged_mutex out of khugepaged · 911891af
Xiao Guangrong authored
Currently, khugepaged_mutex is used in a complex, hard-to-understand way;
in fact, it is just used to serialize start_khugepaged and
khugepaged, for these reasons:
      
      - khugepaged_thread is shared between them
      - the thp disable path (echo never > transparent_hugepage/enabled) is
        nonblocking, so we need to protect khugepaged_thread to get a stable
        running state
      
      These can be avoided by:
      
- using the lock to serialize thread creation and cancellation
- making the thp disable path wait until the thread exits
      
Then khugepaged_thread is fully controlled by start_khugepaged, and khugepaged
is happy without the lock
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      911891af
    • X
      thp: remove unnecessary check in start_khugepaged · 637e3a27
Xiao Guangrong authored
The check is unnecessary: if mm_slot_cache or mm_slots_hash
initialization failed, no sysfs interface will be created
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      637e3a27
    • X
      thp: fix the count of THP_COLLAPSE_ALLOC · 65b3c07b
Xiao Guangrong authored
      THP_COLLAPSE_ALLOC is double counted if NUMA is disabled since it has
      already been calculated in khugepaged_alloc_hugepage
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      65b3c07b
    • K
      mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
Konstantin Khlebnikov authored
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.
      
Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
VM_PFNMAP.  If a driver populates the whole VMA at mmap() time, it probably
does not expect page faults.
      
This patch removes the special check from vma_wants_writenotify() which
disables page write tracking for VMAs populated via vm_insert_page().
The BDI below the mapped file should not use dirty accounting; moreover,
do_wp_page() can handle this.
      
      vm_insert_page() still marks vma after first usage.  Usually it is called
from f_op->mmap() handler under mm->mmap_sem write-lock, so it is able to
      change vma->vm_flags.  Caller must set VM_MIXEDMAP at mmap time if it
      wants to call this function from other places, for example from page-fault
      handler.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4b6e1e37
    • K
      mm: introduce arch-specific vma flag VM_ARCH_1 · cc2383ec
Konstantin Khlebnikov authored
      Combine several arch-specific vma flags into one.
      
      before patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
      powerpc -               -               VM_SAO          -
      parisc  VM_GROWSUP      -               -               -
      ia64    VM_GROWSUP      -               -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               -               -               -
      
      after patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
      powerpc -               VM_SAO          -               -
      parisc  -               VM_GROWSUP      -               -
      ia64    -               VM_GROWSUP      -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               VM_ARCH_1       -               -
      
      And voila! One completely free bit.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc2383ec
    • K
      mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
Konstantin Khlebnikov authored
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
This patch also restores the original frustration-free is_cow_mapping() check
      in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3")
      
      is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
because it is already handled by VM_PFNMAP in the VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@openvz.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
4. 28 September 2012 (1 commit)
5. 30 May 2012 (1 commit)