1. 12 12月, 2012 4 次提交
  2. 15 10月, 2012 1 次提交
  3. 09 10月, 2012 26 次提交
    • D
      mm: thp: Use more portable PMD clearing sequenece in zap_huge_pmd(). · f5c8ad47
      David Miller 提交于
      Invalidation sequences are handled in various ways on various
      architectures.
      
      One way, which sparc64 uses, is to let the set_*_at() functions accumulate
      pending flushes into a per-cpu array.  Then the flush_tlb_range() et al.
      calls process the pending TLB flushes.
      
      In this regime, the __tlb_remove_*tlb_entry() implementations are
      essentially NOPs.
      
      The canonical PTE zap in mm/memory.c is:
      
      			ptent = ptep_get_and_clear_full(mm, addr, pte,
      							tlb->fullmm);
      			tlb_remove_tlb_entry(tlb, pte, addr);
      
      With a subsequent tlb_flush_mmu() if needed.
      
      Mirror this in the THP PMD zapping using:
      
      		orig_pmd = pmdp_get_and_clear(tlb->mm, addr, pmd);
      		page = pmd_page(orig_pmd);
      		tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
      
      And we properly accomodate TLB flush mechanims like the one described
      above.
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f5c8ad47
    • D
      mm: Add and use update_mmu_cache_pmd() in transparent huge page code. · b113da65
      David Miller 提交于
      The transparent huge page code passes a PMD pointer in as the third
      argument of update_mmu_cache(), which expects a PTE pointer.
      
      This never got noticed because X86 implements update_mmu_cache() as a
      macro and thus we don't get any type checking, and X86 is the only
      architecture which supports transparent huge pages currently.
      
      Before other architectures can support transparent huge pages properly we
      need to add a new interface which will take a PMD pointer as the third
      argument rather than a PTE pointer.
      
      [akpm@linux-foundation.org: implement update_mm_cache_pmd() for s390]
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b113da65
    • D
      mm, thp: fix mapped pages avoiding unevictable list on mlock · b676b293
      David Rientjes 提交于
      When a transparent hugepage is mapped and it is included in an mlock()
      range, follow_page() incorrectly avoids setting the page's mlock bit and
      moving it to the unevictable lru.
      
      This is evident if you try to mlock(), munlock(), and then mlock() a
      range again.  Currently:
      
      	#define MAP_SIZE	(4 << 30)	/* 4GB */
      
      	void *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
      			 MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
      	mlock(ptr, MAP_SIZE);
      
      		$ grep -E "Unevictable|Inactive\(anon" /proc/meminfo
      		Inactive(anon):     6304 kB
      		Unevictable:     4213924 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4186252 kB
      		Unevictable:       19652 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198556 kB
      		Unevictable:       21684 kB
      
      Notice that less than 2MB was added to the unevictable list; this is
      because these pages in the range are not transparent hugepages since the
      4GB range was allocated with mmap() and has no specific alignment.  If
      posix_memalign() were used instead, unevictable would not have grown at
      all on the second mlock().
      
      The fix is to call mlock_vma_page() so that the mlock bit is set and the
      page is added to the unevictable list.  With this patch:
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4056 kB
      		Unevictable:     4213940 kB
      
      	munlock(ptr, MAP_SIZE);
      
      		Inactive(anon):  4198268 kB
      		Unevictable:       19636 kB
      
      	mlock(ptr, MAP_SIZE);
      
      		Inactive(anon):     4008 kB
      		Unevictable:     4213940 kB
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michel Lespinasse <walken@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b676b293
    • S
      mm: move all mmu notifier invocations to be done outside the PT lock · 2ec74c3e
      Sagi Grimberg 提交于
      In order to allow sleeping during mmu notifier calls, we need to avoid
      invoking them under the page table spinlock.  This patch solves the
      problem by calling invalidate_page notification after releasing the lock
      (but before freeing the page itself), or by wrapping the page invalidation
      with calls to invalidate_range_begin and invalidate_range_end.
      
      To prevent accidental changes to the invalidate_range_end arguments after
      the call to invalidate_range_begin, the patch introduces a convention of
      saving the arguments in consistently named locals:
      
      	unsigned long mmun_start;	/* For mmu_notifiers */
      	unsigned long mmun_end;	/* For mmu_notifiers */
      
      	...
      
      	mmun_start = ...
      	mmun_end = ...
      	mmu_notifier_invalidate_range_start(mm, mmun_start, mmun_end);
      
      	...
      
      	mmu_notifier_invalidate_range_end(mm, mmun_start, mmun_end);
      
      The patch changes code to use this convention for all calls to
      mmu_notifier_invalidate_range_start/end, except those where the calls are
      close enough so that anyone who glances at the code can see the values
      aren't changing.
      
      This patchset is a preliminary step towards on-demand paging design to be
      added to the RDMA stack.
      
      Why do we want on-demand paging for Infiniband?
      
        Applications register memory with an RDMA adapter using system calls,
        and subsequently post IO operations that refer to the corresponding
        virtual addresses directly to HW.  Until now, this was achieved by
        pinning the memory during the registration calls.  The goal of on demand
        paging is to avoid pinning the pages of registered memory regions (MRs).
         This will allow users the same flexibility they get when swapping any
        other part of their processes address spaces.  Instead of requiring the
        entire MR to fit in physical memory, we can allow the MR to be larger,
        and only fit the current working set in physical memory.
      
      Why should anyone care?  What problems are users currently experiencing?
      
        This can make programming with RDMA much simpler.  Today, developers
        that are working with more data than their RAM can hold need either to
        deregister and reregister memory regions throughout their process's
        life, or keep a single memory region and copy the data to it.  On demand
        paging will allow these developers to register a single MR at the
        beginning of their process's life, and let the operating system manage
        which pages needs to be fetched at a given time.  In the future, we
        might be able to provide a single memory access key for each process
        that would provide the entire process's address as one large memory
        region, and the developers wouldn't need to register memory regions at
        all.
      
      Is there any prospect that any other subsystems will utilise these
      infrastructural changes?  If so, which and how, etc?
      
        As for other subsystems, I understand that XPMEM wanted to sleep in
        MMU notifiers, as Christoph Lameter wrote at
        http://lkml.indiana.edu/hypermail/linux/kernel/0802.1/0460.html and
        perhaps Andrea knows about other use cases.
      
        Scheduling in mmu notifications is required since we need to sync the
        hardware with the secondary page tables change.  A TLB flush of an IO
        device is inherently slower than a CPU TLB flush, so our design works by
        sending the invalidation request to the device, and waiting for an
        interrupt before exiting the mmu notifier handler.
      
      Avi said:
      
        kvm may be a buyer.  kvm::mmu_lock, which serializes guest page
        faults, also protects long operations such as destroying large ranges.
        It would be good to convert it into a spinlock, but as it is used inside
        mmu notifiers, this cannot be done.
      
        (there are alternatives, such as keeping the spinlock and using a
        generation counter to do the teardown in O(1), which is what the "may"
        is doing up there).
      
      [akpm@linux-foundation.orgpossible speed tweak in hugetlb_cow(), cleanups]
      Signed-off-by: NAndrea Arcangeli <andrea@qumranet.com>
      Signed-off-by: NSagi Grimberg <sagig@mellanox.com>
      Signed-off-by: NHaggai Eran <haggaie@mellanox.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Or Gerlitz <ogerlitz@mellanox.com>
      Cc: Haggai Eran <haggaie@mellanox.com>
      Cc: Shachar Raindel <raindel@mellanox.com>
      Cc: Liran Liss <liranl@mellanox.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ec74c3e
    • C
      mm: thp: fix the update_mmu_cache() last argument passing in mm/huge_memory.c · eab1eef9
      Catalin Marinas 提交于
      The update_mmu_cache() takes a pointer (to pte_t by default) as the last
      argument but the huge_memory.c passes a pmd_t value.  The patch changes
      the argument to the pmd_t * pointer.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NSteve Capper <steve.capper@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Reviewed-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eab1eef9
    • X
      thp: khugepaged_prealloc_page() forgot to reset the page alloc indicator · e3b4126c
      Xiao Guangrong 提交于
      If NUMA is enabled, the indicator is not reset if the previous page
      request failed, ausing us to trigger the BUG_ON() in
      khugepaged_alloc_page().
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3b4126c
    • M
      mm rmap: remove vma_address check for address inside vma · 86c2ad19
      Michel Lespinasse 提交于
      In file and anon rmap, we use interval trees to find potentially relevant
      vmas and then call vma_address() to find the virtual address the given
      page might be found at in these vmas.  vma_address() used to include a
      check that the returned address falls within the limits of the vma, but
      this check isn't necessary now that we always use interval trees in rmap:
      the interval tree just doesn't return any vmas which this check would find
      to be irrelevant.  As a result, we can replace the use of -EFAULT error
      code (which then needed to be checked in every call site) with a
      VM_BUG_ON().
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      86c2ad19
    • M
      mm anon rmap: replace same_anon_vma linked list with an interval tree. · bf181b9f
      Michel Lespinasse 提交于
      When a large VMA (anon or private file mapping) is first touched, which
      will populate its anon_vma field, and then split into many regions through
      the use of mprotect(), the original anon_vma ends up linking all of the
      vmas on a linked list.  This can cause rmap to become inefficient, as we
      have to walk potentially thousands of irrelevent vmas before finding the
      one a given anon page might fall into.
      
      By replacing the same_anon_vma linked list with an interval tree (where
      each avc's interval is determined by its vma's start and last pgoffs), we
      can make rmap efficient for this use case again.
      
      While the change is large, all of its pieces are fairly simple.
      
      Most places that were walking the same_anon_vma list were looking for a
      known pgoff, so they can just use the anon_vma_interval_tree_foreach()
      interval tree iterator instead.  The exception here is ksm, where the
      page's index is not known.  It would probably be possible to rework ksm so
      that the index would be known, but for now I have decided to keep things
      simple and just walk the entirety of the interval tree there.
      
      When updating vma's that already have an anon_vma assigned, we must take
      care to re-index the corresponding avc's on their interval tree.  This is
      done through the use of anon_vma_interval_tree_pre_update_vma() and
      anon_vma_interval_tree_post_update_vma(), which remove the avc's from
      their interval tree before the update and re-insert them after the update.
       The anon_vma stays locked during the update, so there is no chance that
      rmap would miss the vmas that are being updated.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Daniel Santos <daniel.santos@pobox.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf181b9f
    • G
      thp: make MADV_HUGEPAGE check for mm->def_flags · 8e72033f
      Gerald Schaefer 提交于
      This adds a check to hugepage_madvise(), to refuse MADV_HUGEPAGE if
      VM_NOHUGEPAGE is set in mm->def_flags.  On s390, the VM_NOHUGEPAGE flag
      will be set in mm->def_flags for kvm processes, to prevent any future thp
      mappings.  In order to also prevent MADV_HUGEPAGE on such an mm,
      hugepage_madvise() should check mm->def_flags.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e72033f
    • G
      thp: introduce pmdp_invalidate() · 46dcde73
      Gerald Schaefer 提交于
      On s390, a valid page table entry must not be changed while it is attached
      to any CPU.  So instead of pmd_mknotpresent() and set_pmd_at(), an IDTE
      operation would be necessary there.  This patch introduces the
      pmdp_invalidate() function, to allow architecture-specific
      implementations.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46dcde73
    • G
      thp: remove assumptions on pgtable_t type · e3ebcf64
      Gerald Schaefer 提交于
      The thp page table pre-allocation code currently assumes that pgtable_t is
      of type "struct page *".  This may not be true for all architectures, so
      this patch removes that assumption by replacing the functions
      prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
      can be defined architecture-specific.
      
      It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
      operating on a pgtable_t.  Apart from the VM_BUG_ON removal, there will be
      no functional change introduced by this patch.
      Signed-off-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e3ebcf64
    • X
      thp: remove unnecessary set_recommended_min_free_kbytes · 227e4047
      Xiao Guangrong 提交于
      Since it is called in start_khugepaged
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      227e4047
    • X
      thp: use khugepaged_enabled to remove duplicate code · 17c230af
      Xiao Guangrong 提交于
      Use khugepaged_enabled to see whether thp is enabled
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17c230af
    • X
      thp: remove khugepaged_loop · b7231789
      Xiao Guangrong 提交于
      Merge khugepaged_loop into khugepaged
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7231789
    • X
      thp: introduce khugepaged_prealloc_page and khugepaged_alloc_page · 26234f36
      Xiao Guangrong 提交于
      They are used to abstract the difference between NUMA enabled and NUMA
      disabled to make the code more readable
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26234f36
    • X
      thp: release page in page pre-alloc path · 420256ef
      Xiao Guangrong 提交于
      If NUMA is enabled, we can release the page in the page pre-alloc
      operation, then the CONFIG_NUMA dependent code can be reduced
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      420256ef
    • X
      thp: merge page pre-alloc in khugepaged_loop into khugepaged_do_scan · d516904b
      Xiao Guangrong 提交于
      There are two pre-alloc operations in these two function, the different is:
      - it allows to sleep if page alloc fail in khugepaged_loop
      - it exits immediately if page alloc fail in khugepaged_do_scan
      
      Actually, in khugepaged_do_scan, we can allow the pre-alloc to sleep on
      the first failure, then the operation in khugepaged_loop can be removed
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d516904b
    • X
      thp: remove some code depend on CONFIG_NUMA · 9817626e
      Xiao Guangrong 提交于
      If NUMA is disabled, hpage is used as page pre-alloc, so there are two
      cases for hpage:
      
      - it is !NULL, means the page is not consumed otherwise,
      - the page has been consumed
      
      If NUMA is enabled, hpage is just used as alloc-fail indicator which is
      not a real page, NULL means not fail triggered.
      
      So, we can release the page only if !IS_ERR_OR_NULL
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9817626e
    • X
      thp: remove wake_up_interruptible in the exit path · 2017c0bf
      Xiao Guangrong 提交于
      Add the check of kthread_should_stop() to the conditions which are used to
      wakeup on khugepaged_wait, then kthread_stop is enough to let the thread
      exit
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2017c0bf
    • X
      thp: remove unnecessary khugepaged_thread check · e060f0e0
      Xiao Guangrong 提交于
      Now, khugepaged creation and cancel are completely serial under the
      protection of khugepaged_mutex, it is impossible that many khugepaged
      entities are running
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e060f0e0
    • X
      thp: move khugepaged_mutex out of khugepaged · 911891af
      Xiao Guangrong 提交于
      Currently, hugepaged_mutex is used really complexly and hard to
      understand, actually, it is just used to serialize start_khugepaged and
      khugepaged for these reasons:
      
      - khugepaged_thread is shared between them
      - the thp disable path (echo never > transparent_hugepage/enabled) is
        nonblocking, so we need to protect khugepaged_thread to get a stable
        running state
      
      These can be avoided by:
      
      - use the lock to serialize the thread creation and cancel
      - thp disable path can not finised until the thread exits
      
      Then khugepaged_thread is fully controlled by start_khugepaged, khugepaged
      will be happy without the lock
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      911891af
    • X
      thp: remove unnecessary check in start_khugepaged · 637e3a27
      Xiao Guangrong 提交于
      The check is unnecessary since if mm_slot_cache or mm_slots_hash
      initialize failed, no sysfs interface will be created
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      637e3a27
    • X
      thp: fix the count of THP_COLLAPSE_ALLOC · 65b3c07b
      Xiao Guangrong 提交于
      THP_COLLAPSE_ALLOC is double counted if NUMA is disabled since it has
      already been calculated in khugepaged_alloc_hugepage
      Signed-off-by: NXiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65b3c07b
    • K
      mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
      Konstantin Khlebnikov 提交于
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.
      
      Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like
      VM_PFNMAP.  If driver populates whole VMA at mmap() it probably not
      expects page-faults.
      
      This patch removes special check from vma_wants_writenotify() which
      disables pages write tracking for VMA populated via vm_instert_page().
      BDI below mapped file should not use dirty-accounting, moreover
      do_wp_page() can handle this.
      
      vm_insert_page() still marks vma after first usage.  Usually it is called
      from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to
      change vma->vm_flags.  Caller must set VM_MIXEDMAP at mmap time if it
      wants to call this function from other places, for example from page-fault
      handler.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b6e1e37
    • K
      mm: introduce arch-specific vma flag VM_ARCH_1 · cc2383ec
      Konstantin Khlebnikov 提交于
      Combine several arch-specific vma flags into one.
      
      before patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
      powerpc -               -               VM_SAO          -
      parisc  VM_GROWSUP      -               -               -
      ia64    VM_GROWSUP      -               -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               -               -               -
      
      after patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
      powerpc -               VM_SAO          -               -
      parisc  -               VM_GROWSUP      -               -
      ia64    -               VM_GROWSUP      -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               VM_ARCH_1       -               -
      
      And voila! One completely free bit.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc2383ec
    • K
      mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Konstantin Khlebnikov 提交于
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
      This patch also restores orignal frustration-free is_cow_mapping() check
      in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3")
      
      is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because it already handled by VM_PFNMAP in VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
  4. 28 9月, 2012 1 次提交
  5. 30 5月, 2012 5 次提交
  6. 22 3月, 2012 1 次提交
  7. 06 3月, 2012 1 次提交
    • A
      mm: thp: fix BUG on mm->nr_ptes · 1c641e84
      Andrea Arcangeli 提交于
      Dave Jones reports a few Fedora users hitting the BUG_ON(mm->nr_ptes...)
      in exit_mmap() recently.
      
      Quoting Hugh's discovery and explanation of the SMP race condition:
      
        "mm->nr_ptes had unusual locking: down_read mmap_sem plus
         page_table_lock when incrementing, down_write mmap_sem (or mm_users
         0) when decrementing; whereas THP is careful to increment and
         decrement it under page_table_lock.
      
         Now most of those paths in THP also hold mmap_sem for read or write
         (with appropriate checks on mm_users), but two do not: when
         split_huge_page() is called by hwpoison_user_mappings(), and when
         called by add_to_swap().
      
         It's conceivable that the latter case is responsible for the
         exit_mmap() BUG_ON mm->nr_ptes that has been reported on Fedora."
      
      The simplest way to fix it without having to alter the locking is to make
      split_huge_page() a noop in nr_ptes terms, so by counting the preallocated
      pagetables that exists for every mapped hugepage.  It was an arbitrary
      choice not to count them and either way is not wrong or right, because
      they are not used but they're still allocated.
      Reported-by: NDave Jones <davej@redhat.com>
      Reported-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Cc: <stable@vger.kernel.org>	[3.0.x, 3.1.x, 3.2.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c641e84
  8. 09 2月, 2012 1 次提交