1. 16 1月, 2016 40 次提交
    • N
      mm: soft-offline: exit with failure for non anonymous thp · 98fd1ef4
      Naoya Horiguchi 提交于
      Currently memory_failure() doesn't handle non anonymous thp case,
      because we can hardly expect the error handling to be successful, and it
      can just hit some corner case which results in BUG_ON or something
      severe like that.  This is also the case for soft offline code, so let's
      make it in the same way.
      
      Orignal code has a MF_COUNT_INCREASED check before put_hwpoison_page(),
      but it's unnecessary because get_any_page() is already called when
      running on this code, which takes a refcount of the target page
      regardress of the flag.  So this patch also removes it.
      
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98fd1ef4
    • N
      mm: soft-offline: clean up soft_offline_page() · acc14dc4
      Naoya Horiguchi 提交于
      soft_offline_page() has some deeply indented code, that's the sign of
      demand for cleanup.  So let's do this.  No functionality change.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      acc14dc4
    • H
      mm: make swapoff more robust against soft dirty · 9f8bdb3f
      Hugh Dickins 提交于
      Both s390 and powerpc have hit the issue of swapoff hanging, when
      CONFIG_HAVE_ARCH_SOFT_DIRTY and CONFIG_MEM_SOFT_DIRTY ifdefs were not
      quite as x86_64 had them.  I think it would be much clearer if
      HAVE_ARCH_SOFT_DIRTY was just a Kconfig option set by architectures to
      determine whether the MEM_SOFT_DIRTY option should be offered, and the
      actual code depend upon CONFIG_MEM_SOFT_DIRTY alone.
      
      But won't embark on that change myself: instead make swapoff more
      robust, by using pte_swp_clear_soft_dirty() on each pte it encounters,
      without an explicit #ifdef CONFIG_MEM_SOFT_DIRTY.  That being a no-op,
      whether the bit in question is defined as 0 or the asm-generic fallback
      is used, unless soft dirty is fully turned on.
      
      Why "maybe" in maybe_same_pte()? Rename it pte_same_as_swp().
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f8bdb3f
    • K
      mm: fix locking order in mm_take_all_locks() · 88f306b6
      Kirill A. Shutemov 提交于
      Dmitry Vyukov has reported[1] possible deadlock (triggered by his
      syzkaller fuzzer):
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&hugetlbfs_i_mmap_rwsem_key);
                                     lock(&mapping->i_mmap_rwsem);
                                     lock(&hugetlbfs_i_mmap_rwsem_key);
        lock(&mapping->i_mmap_rwsem);
      
      Both traces points to mm_take_all_locks() as a source of the problem.
      It doesn't take care about ordering or hugetlbfs_i_mmap_rwsem_key (aka
      mapping->i_mmap_rwsem for hugetlb mapping) vs.  i_mmap_rwsem.
      
      huge_pmd_share() does memory allocation under hugetlbfs_i_mmap_rwsem_key
      and allocator can take i_mmap_rwsem if it hit reclaim.  So we need to
      take i_mmap_rwsem from all hugetlb VMAs before taking i_mmap_rwsem from
      rest of VMAs.
      
      The patch also documents locking order for hugetlbfs_i_mmap_rwsem_key.
      
      [1] http://lkml.kernel.org/r/CACT4Y+Zu95tBs-0EvdiAKzUOsb4tczRRfCRTpLr4bg_OP9HuVg@mail.gmail.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      88f306b6
    • L
      mm: mempolicy: skip non-migratable VMAs when setting MPOL_MF_LAZY · d645fc0e
      Liang Chen 提交于
      MPOL_MF_LAZY is not visible from userspace since a720094d ("mm:
      mempolicy: Hide MPOL_NOOP and MPOL_MF_LAZY from userspace for now"), but
      it should still skip non-migratable VMAs such as VM_IO, VM_PFNMAP, and
      VM_HUGETLB VMAs, and avoid useless overhead of minor faults.
      Signed-off-by: NLiang Chen <liangchen.linux@gmail.com>
      Signed-off-by: NGavin Guo <gavin.guo@canonical.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d645fc0e
    • A
      mm/page_alloc.c: remove unused struct zone *z variable · f16f091b
      Alexander Kuleshov 提交于
      Remove unused struct zone *z variable which appeared in 86051ca5
      ("mm: fix usemap initialization").
      Signed-off-by: NAlexander Kuleshov <kuleshovmail@gmail.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f16f091b
    • W
      mm/mlock.c: change can_do_mlock return value type to boolean · 7f43add4
      Wang Xiaoqiang 提交于
      Since can_do_mlock only return 1 or 0, so make it boolean.
      
      No functional change.
      
      [akpm@linux-foundation.org: update declaration in mm.h]
      Signed-off-by: NWang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7f43add4
    • W
      mm/vmalloc.c: use macro IS_ALIGNED to judge the aligment · 61e16557
      Wang Xiaoqiang 提交于
      Just cleanup, no functional change.
      Signed-off-by: NWang Xiaoqiang <wangxq10@lzu.edu.cn>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61e16557
    • T
      cgroup, memcg, writeback: drop spurious rcu locking around mem_cgroup_css_from_page() · 654a0dd0
      Tejun Heo 提交于
      In earlier versions, mem_cgroup_css_from_page() could return non-root
      css on a legacy hierarchy which can go away and required rcu locking;
      however, the eventual version simply returns the root cgroup if memcg is
      on a legacy hierarchy and thus doesn't need rcu locking around or in it.
      Remove spurious rcu lockings.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      654a0dd0
    • W
      mm/page_isolation: do some cleanup in "undo_isolate_page_range" · 6f8d2b8a
      Wang Xiaoqiang 提交于
      Use "IS_ALIGNED" to judge the alignment, rather than directly judging.
      Signed-off-by: NWang Xiaoqiang <wang_xiaoq@126.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f8d2b8a
    • D
      mm: bring in additional flag for fixup_user_fault to signal unlock · 4a9e1cda
      Dominik Dingel 提交于
      During Jason's work with postcopy migration support for s390 a problem
      regarding gmap faults was discovered.
      
      The gmap code will call fixup_user_fault which will end up always in
      handle_mm_fault.  Till now we never cared about retries, but as the
      userfaultfd code kind of relies on it.  this needs some fix.
      
      This patchset does not take care of the futex code.  I will now look
      closer at this.
      
      This patch (of 2):
      
      With the introduction of userfaultfd, kvm on s390 needs fixup_user_fault
      to pass in FAULT_FLAG_ALLOW_RETRY and give feedback if during the
      faulting we ever unlocked mmap_sem.
      
      This patch brings in the logic to handle retries as well as it cleans up
      the current documentation.  fixup_user_fault was not having the same
      semantics as filemap_fault.  It never indicated if a retry happened and
      so a caller wasn't able to handle that case.  So we now changed the
      behaviour to always retry a locked mmap_sem.
      Signed-off-by: NDominik Dingel <dingel@linux.vnet.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: "Jason J. Herne" <jjherne@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Eric B Munson <emunson@akamai.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Dominik Dingel <dingel@linux.vnet.ibm.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a9e1cda
    • D
      mm, x86: get_user_pages() for dax mappings · 3565fce3
      Dan Williams 提交于
      A dax mapping establishes a pte with _PAGE_DEVMAP set when the driver
      has established a devm_memremap_pages() mapping, i.e.  when the pfn_t
      return from ->direct_access() has PFN_DEV and PFN_MAP set.  Later, when
      encountering _PAGE_DEVMAP during a page table walk we lookup and pin a
      struct dev_pagemap instance to keep the result of pfn_to_page() valid
      until put_page().
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NLogan Gunthorpe <logang@deltatee.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3565fce3
    • D
      mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd · 5c7fb56e
      Dan Williams 提交于
      A dax-huge-page mapping while it uses some thp helpers is ultimately not
      a transparent huge page.  The distinction is especially important in the
      get_user_pages() path.  pmd_devmap() is used to distinguish dax-pmds
      from pmd_huge() and pmd_trans_huge() which have slightly different
      semantics.
      
      Explicitly mark the pmd_trans_huge() helpers that dax needs by adding
      pmd_devmap() checks.
      
      [kirill.shutemov@linux.intel.com: fix regression in handling mlocked pages in  __split_huge_pmd()]
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c7fb56e
    • D
      mm, dax: convert vmf_insert_pfn_pmd() to pfn_t · f25748e3
      Dan Williams 提交于
      Similar to the conversion of vm_insert_mixed() use pfn_t in the
      vmf_insert_pfn_pmd() to tag the resulting pte with _PAGE_DEVICE when the
      pfn is backed by a devm_memremap_pages() mapping.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: Matthew Wilcox <willy@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f25748e3
    • D
      mm, dax, gpu: convert vm_insert_mixed to pfn_t · 01c8f1c4
      Dan Williams 提交于
      Convert the raw unsigned long 'pfn' argument to pfn_t for the purpose of
      evaluating the PFN_MAP and PFN_DEV flags.  When both are set it triggers
      _PAGE_DEVMAP to be set in the resulting pte.
      
      There are no functional changes to the gpu drivers as a result of this
      conversion.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave@sr71.net>
      Cc: David Airlie <airlied@linux.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01c8f1c4
    • D
      x86, mm: introduce vmem_altmap to augment vmemmap_populate() · 4b94ffdc
      Dan Williams 提交于
      In support of providing struct page for large persistent memory
      capacities, use struct vmem_altmap to change the default policy for
      allocating memory for the memmap array.  The default vmemmap_populate()
      allocates page table storage area from the page allocator.  Given
      persistent memory capacities relative to DRAM it may not be feasible to
      store the memmap in 'System Memory'.  Instead vmem_altmap represents
      pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
      requests.
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b94ffdc
    • R
      mm, dax: fix livelock, allow dax pmd mappings to become writeable · 01871e59
      Ross Zwisler 提交于
      Prior to this change DAX PMD mappings that were made read-only were
      never able to be made writable again.  This is because the code in
      insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip
      these calls if the PMD already existed in the page table.
      
      Instead, if we are doing a write always mark the PMD entry as dirty and
      writeable.  Without this code we can get into a condition where we mark
      the PMD as read-only, and then on a subsequent write fault we get into
      an infinite loop of PMD faults where we try unsuccessfully to make the
      PMD writeable.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NJeff Moyer <jmoyer@redhat.com>
      Reported-by: NToshi Kani <toshi.kani@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01871e59
    • K
      thp: fix split_huge_page() after mremap() of THP · bd56086f
      Kirill A. Shutemov 提交于
      Sasha Levin has reported KASAN out-of-bounds bug[1].  It points to "if
      (!is_swap_pte(pte[i]))" in unfreeze_page_vma() as a problematic access.
      
      The cause is that split_huge_page() doesn't handle THP correctly if it's
      not allingned to PMD boundary.  It can happen after mremap().
      
      Test-case (not always triggers the bug):
      
      	#define _GNU_SOURCE
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <sys/mman.h>
      
      	#define MB (1024UL*1024)
      	#define SIZE (2*MB)
      	#define BASE ((void *)0x400000000000)
      
      	int main()
      	{
      		char *p;
      
      		p = mmap(BASE, SIZE, PROT_READ | PROT_WRITE,
      				MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE,
      				-1, 0);
      		if (p == MAP_FAILED)
      			perror("mmap"), exit(1);
      		p = mremap(BASE, SIZE, SIZE, MREMAP_FIXED | MREMAP_MAYMOVE,
      				BASE + SIZE + 8192);
      		if (p == MAP_FAILED)
      			perror("mremap"), exit(1);
      		system("echo 1 > /sys/kernel/debug/split_huge_pages");
      		return 0;
      	}
      
      The patch fixes freeze and unfreeze paths to handle page table boundary
      crossing.
      
      It also makes mapcount vs count check in split_huge_page_to_list()
      stricter:
       - after freeze we don't expect any subpage mapped as we remove them
         from rmap when setting up migration entries;
       - count must be 1, meaning only caller has reference to the page;
      
      [1] https://gist.github.com/sashalevin/c67fbea55e7c0576972aSigned-off-by: NKirill A.  Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bd56086f
    • M
      mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called · b8d3c4c3
      Minchan Kim 提交于
      We don't need to split THP page when MADV_FREE syscall is called if
      [start, len] is aligned with THP size.  The split could be done when VM
      decide to free it in reclaim path if memory pressure is heavy.  With
      that, we could avoid unnecessary THP split.
      
      For the feature, this patch changes pte dirtness marking logic of THP.
      Now, it marks every ptes of pages dirty unconditionally in splitting,
      which makes MADV_FREE void.  So, instead, this patch propagates pmd
      dirtiness to all pages via PG_dirty and restores pte dirtiness from
      PG_dirty.  With this, if pmd is clean(ie, MADV_FREEed) when split
      happens(e,g, shrink_page_list), all of pages are clean too so we could
      discard them.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8d3c4c3
    • M
      mm/ksm.c: mark stable page dirty · 337ed7eb
      Minchan Kim 提交于
      The MADV_FREE patchset changes page reclaim to simply free a clean
      anonymous page with no dirty ptes, instead of swapping it out; but KSM
      uses clean write-protected ptes to reference the stable ksm page.  So be
      sure to mark that page dirty, so it's never mistakenly discarded.
      
      [hughd@google.com: adjusted comments]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      337ed7eb
    • M
      mm: move lazily freed pages to inactive list · 10853a03
      Minchan Kim 提交于
      MADV_FREE is a hint that it's okay to discard pages if there is memory
      pressure and we use reclaimers(ie, kswapd and direct reclaim) to free
      them so there is no value keeping them in the active anonymous LRU so
      this patch moves them to inactive LRU list's head.
      
      This means that MADV_FREE-ed pages which were living on the inactive
      list are reclaimed first because they are more likely to be cold rather
      than recently active pages.
      
      An arguable issue for the approach would be whether we should put the
      page to the head or tail of the inactive list.  I chose head because the
      kernel cannot make sure it's really cold or warm for every MADV_FREE
      usecase but at least we know it's not *hot*, so landing of inactive head
      would be a comprimise for various usecases.
      
      This fixes suboptimal behavior of MADV_FREE when pages living on the
      active list will sit there for a long time even under memory pressure
      while the inactive list is reclaimed heavily.  This basically breaks the
      whole purpose of using MADV_FREE to help the system to free memory which
      is might not be used.
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NHugh Dickins <hughd@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      10853a03
    • M
      mm/madvise.c: free swp_entry in madvise_free · 64b42bc1
      Minchan Kim 提交于
      When I test below piece of code with 12 processes(ie, 512M * 12 = 6G
      consume) on my (3G ram + 12 cpu + 8G swap, the madvise_free is
      siginficat slower (ie, 2x times) than madvise_dontneed.
      
           loop = 5;
           mmap(512M);
           while (loop--) {
                   memset(512M);
                   madvise(MADV_FREE or MADV_DONTNEED);
           }
      
      The reason is lots of swapin.
      
      1) dontneed: 1,612 swapin
      2) madvfree: 879,585 swapin
      
      If we find hinted pages were already swapped out when syscall is called,
      it's pointless to keep the swapped-out pages in pte.  Instead, let's
      free the cold page because swapin is more expensive than (alloc page +
      zeroing).
      
      With this patch, it reduced swapin from 879,585 to 1,878 so elapsed time
      
      1) dontneed: 6.10user 233.50system 0:50.44elapsed
      2) madvfree: 6.03user 401.17system 1:30.67elapsed
      2) madvfree + below patch: 6.70user 339.14system 1:04.45elapsed
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Jason Evans <je@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64b42bc1
    • M
      mm: support madvise(MADV_FREE) · 854e9ed0
      Minchan Kim 提交于
      Linux doesn't have an ability to free pages lazy while other OS already
      have been supported that named by madvise(MADV_FREE).
      
      The gain is clear that kernel can discard freed pages rather than
      swapping out or OOM if memory pressure happens.
      
      Without memory pressure, freed pages would be reused by userspace
      without another additional overhead(ex, page fault + allocation +
      zeroing).
      
      Jason Evans said:
      
      : Facebook has been using MAP_UNINITIALIZED
      : (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
      : several years, but there are operational costs to maintaining this
      : out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
      : in favor of MADV_FREE.  When we first enabled MAP_UNINITIALIZED it
      : increased throughput for much of our workload by ~5%, and although the
      : benefit has decreased using newer hardware and kernels, there is still
      : enough benefit that we cannot reasonably retire it without a replacement.
      :
      : Aside from Facebook operations, there are numerous broadly used
      : applications that would benefit from MADV_FREE.  The ones that immediately
      : come to mind are redis, varnish, and MariaDB.  I don't have much insight
      : into Android internals and development process, but I would hope to see
      : MADV_FREE support eventually end up there as well to benefit applications
      : linked with the integrated jemalloc.
      :
      : jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
      : In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
      : available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
      : (and AIX, but I'm not sure it even compiles on AIX).  The lack of
      : MADV_FREE on Linux forced me down a long series of increasingly
      : sophisticated heuristics for madvise() volume reduction, and even so this
      : remains a common performance issue for people using jemalloc on Linux.
      : Please integrate MADV_FREE; many people will benefit substantially.
      
      How it works:
      
      When madvise syscall is called, VM clears dirty bit of ptes of the
      range.  If memory pressure happens, VM checks dirty bit of page table
      and if it found still "clean", it means it's a "lazyfree pages" so VM
      could discard the page instead of swapping out.  Once there was store
      operation for the page before VM peek a page to reclaim, dirty bit is
      set so VM can swap out the page instead of discarding.
      
      One thing we should notice is that basically, MADV_FREE relies on dirty
      bit in page table entry to decide whether VM allows to discard the page
      or not.  IOW, if page table entry includes marked dirty bit, VM
      shouldn't discard the page.
      
      However, as a example, if swap-in by read fault happens, page table
      entry doesn't have dirty bit so MADV_FREE could discard the page
      wrongly.
      
      For avoiding the problem, MADV_FREE did more checks with PageDirty and
      PageSwapCache.  It worked out because swapped-in page lives on swap
      cache and since it is evicted from the swap cache, the page has PG_dirty
      flag.  So both page flags check effectively prevent wrong discarding by
      MADV_FREE.
      
      However, a problem in above logic is that swapped-in page has PG_dirty
      still after they are removed from swap cache so VM cannot consider the
      page as freeable any more even if madvise_free is called in future.
      
      Look at below example for detail.
      
          ptr = malloc();
          memset(ptr);
          ..
          ..
          .. heavy memory pressure so all of pages are swapped out
          ..
          ..
          var = *ptr; -> a page swapped-in and could be removed from
                         swapcache. Then, page table doesn't mark
                         dirty bit and page descriptor includes PG_dirty
          ..
          ..
          madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
          ..
          ..
          ..
          .. heavy memory pressure again.
          .. In this time, VM cannot discard the page because the page
          .. has *PG_dirty*
      
      To solve the problem, this patch clears PG_dirty if only the page is
      owned exclusively by current process when madvise is called because
      PG_dirty represents ptes's dirtiness in several processes so we could
      clear it only if we own it exclusively.
      
      Firstly, heavy users would be general allocators(ex, jemalloc, tcmalloc
      and hope glibc supports it) and jemalloc/tcmalloc already have supported
      the feature for other OS(ex, FreeBSD)
      
        barrios@blaptop:~/benchmark/ebizzy$ lscpu
        Architecture:          x86_64
        CPU op-mode(s):        32-bit, 64-bit
        Byte Order:            Little Endian
        CPU(s):                12
        On-line CPU(s) list:   0-11
        Thread(s) per core:    1
        Core(s) per socket:    1
        Socket(s):             12
        NUMA node(s):          1
        Vendor ID:             GenuineIntel
        CPU family:            6
        Model:                 2
        Stepping:              3
        CPU MHz:               3200.185
        BogoMIPS:              6400.53
        Virtualization:        VT-x
        Hypervisor vendor:     KVM
        Virtualization type:   full
        L1d cache:             32K
        L1i cache:             32K
        L2 cache:              4096K
        NUMA node0 CPU(s):     0-11
        ebizzy benchmark(./ebizzy -S 10 -n 512)
      
        Higher avg is better.
      
         vanilla-jemalloc             MADV_free-jemalloc
      
        1 thread
        records: 10                   records: 10
        avg:   2961.90                avg:  12069.70
        std:     71.96(2.43%)         std:    186.68(1.55%)
        max:   3070.00                max:  12385.00
        min:   2796.00                min:  11746.00
      
        2 thread
        records: 10                   records: 10
        avg:   5020.00                avg:  17827.00
        std:    264.87(5.28%)         std:    358.52(2.01%)
        max:   5244.00                max:  18760.00
        min:   4251.00                min:  17382.00
      
        4 thread
        records: 10                   records: 10
        avg:   8988.80                avg:  27930.80
        std:   1175.33(13.08%)        std:   3317.33(11.88%)
        max:   9508.00                max:  30879.00
        min:   5477.00                min:  21024.00
      
        8 thread
        records: 10                   records: 10
        avg:  13036.50                avg:  33739.40
        std:    170.67(1.31%)         std:   5146.22(15.25%)
        max:  13371.00                max:  40572.00
        min:  12785.00                min:  24088.00
      
        16 thread
        records: 10                   records: 10
        avg:  11092.40                avg:  31424.20
        std:    710.60(6.41%)         std:   3763.89(11.98%)
        max:  12446.00                max:  36635.00
        min:   9949.00                min:  25669.00
      
        32 thread
        records: 10                   records: 10
        avg:  11067.00                avg:  34495.80
        std:    971.06(8.77%)         std:   2721.36(7.89%)
        max:  12010.00                max:  38598.00
        min:   9002.00                min:  30636.00
      
      In summary, MADV_FREE is about much faster than MADV_DONTNEED.
      
      This patch (of 12):
      
      Add core MADV_FREE implementation.
      
      [akpm@linux-foundation.org: small cleanups]
      Signed-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Mika Penttil <mika.penttila@nextfour.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Jason Evans <je@fb.com>
      Cc: Daniel Micay <danielmicay@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: <yalin.wang2010@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: "Shaohua Li" <shli@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Gang <gang.chen.5i5j@gmail.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Roland Dreier <roland@kernel.org>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      854e9ed0
    • V
      mm: add page_check_address_transhuge() helper · 8749cfea
      Vladimir Davydov 提交于
      page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
      code for looking up pte of a (possibly transhuge) page.  Move this code
      to a new helper function, page_check_address_transhuge(), and make the
      above mentioned functions use it.
      
      This is just a cleanup, no functional changes are intended.
      Signed-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Reviewed-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8749cfea
    • K
      thp: increase split_huge_page() success rate · d9654322
      Kirill A. Shutemov 提交于
      During freeze_page(), we remove the page from rmap.  It munlocks the
      page if it was mlocked.  clear_page_mlock() uses thelru cache, which
      temporary pins the page.
      
      Let's drain the lru cache before checking page's count vs.  mapcount.
      The change makes mlocked page split on first attempt, if it was not
      pinned by somebody else.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d9654322
    • K
      thp: add debugfs handle to split all huge pages · 49071d43
      Kirill A. Shutemov 提交于
      Writing 1 into 'split_huge_pages' will try to find and split all huge
      pages in the system.  This is useful for debuging.
      
      [akpm@linux-foundation.org: fix printk text, per Vlastimil]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      49071d43
    • K
      mm: prepare page_referenced() and page_idle to new THP refcounting · b20ce5e0
      Kirill A. Shutemov 提交于
      Both page_referenced() and page_idle_clear_pte_refs_one() assume that
      THP can only be mapped with PMD, so there's no reason to look on PTEs
      for PageTransHuge() pages.  That's no true anymore: THP can be mapped
      with PTEs too.
      
      The patch removes PageTransHuge() test from the functions and opencode
      page table check.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b20ce5e0
    • K
      thp: allow mlocked THP again · e90309c9
      Kirill A. Shutemov 提交于
      Before THP refcounting rework, THP was not allowed to cross VMA
      boundary.  So, if we have THP and we split it, PG_mlocked can be safely
      transferred to small pages.
      
      With new THP refcounting and naive approach to mlocking we can end up
      with this scenario:
       1. we have a mlocked THP, which belong to one VM_LOCKED VMA.
       2. the process does munlock() on the *part* of the THP:
            - the VMA is split into two, one of them VM_LOCKED;
            - huge PMD split into PTE table;
            - THP is still mlocked;
       3. split_huge_page():
            - it transfers PG_mlocked to *all* small pages regrardless if it
      	blong to any VM_LOCKED VMA.
      
      We probably could munlock() all small pages on split_huge_page(), but I
      think we have accounting issue already on step two.
      
      Instead of forbidding mlocked pages altogether, we just avoid mlocking
      PTE-mapped THPs and munlock THPs on split_huge_pmd().
      
      This means PTE-mapped THPs will be on normal lru lists and will be split
      under memory pressure by vmscan.  After the split vmscan will detect
      unevictable small pages and mlock them.
      
      With this approach we shouldn't hit situation like described above.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e90309c9
    • K
      mm: re-enable THP · 61f5d698
      Kirill A. Shutemov 提交于
      All parts of THP with new refcounting are now in place.  We can now
      allow to enable THP.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61f5d698
    • K
      thp: introduce deferred_split_huge_page() · 9a982250
      Kirill A. Shutemov 提交于
      Currently we don't split huge page on partial unmap.  It's not an ideal
      situation.  It can lead to memory overhead.
      
      Furtunately, we can detect partial unmap on page_remove_rmap().  But we
      cannot call split_huge_page() from there due to locking context.
      
      It's also counterproductive to do directly from munmap() codepath: in
      many cases we will hit this from exit(2) and splitting the huge page
      just to free it up in small pages is not what we really want.
      
      The patch introduce deferred_split_huge_page() which put the huge page
      into queue for splitting.  The splitting itself will happen when we get
      memory pressure via shrinker interface.  The page will be dropped from
      list on freeing through compound page destructor.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a982250
    • K
      migrate_pages: try to split pages on queuing · 248db92d
      Kirill A. Shutemov 提交于
      We are not able to migrate THPs.  It means it's not enough to split only
      PMD on migration -- we need to split compound page under it too.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      248db92d
    • K
      thp: reintroduce split_huge_page() · e9b61f19
      Kirill A. Shutemov 提交于
      This patch adds implementation of split_huge_page() for new
      refcountings.
      
      Unlike previous implementation, new split_huge_page() can fail if
      somebody holds GUP pin on the page.  It also means that pin on page
      would prevent it from bening split under you.  It makes situation in
      many places much cleaner.
      
      The basic scheme of split_huge_page():
      
        - Check that sum of mapcounts of all subpage is equal to page_count()
          plus one (caller pin). Foll off with -EBUSY. This way we can avoid
          useless PMD-splits.
      
        - Freeze the page counters by splitting all PMD and setup migration
          PTEs.
      
        - Re-check sum of mapcounts against page_count(). Page's counts are
          stable now. -EBUSY if page is pinned.
      
        - Split compound page.
      
        - Unfreeze the page by removing migration entries.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9b61f19
    • N
      mm: hwpoison: adjust for new thp refcounting · 4e41a30c
      Naoya Horiguchi 提交于
      Some mm-related BUG_ON()s could trigger from hwpoison code due to recent
      changes in thp refcounting rule.  This patch fixes them up.
      
      In the new refcounting, we no longer use tail->_mapcount to keep tail's
      refcount, and thereby we can simplify get/put_hwpoison_page().
      
      And another change is that tail's refcount is not transferred to the raw
      page during thp split (more precisely, in new rule we don't take
      refcount on tail page any more.) So when we need thp split, we have to
      transfer the refcount properly to the 4kB soft-offlined page before
      migration.
      
      thp split code goes into core code only when precheck
      (total_mapcount(head) == page_count(head) - 1) passes to avoid useless
      split, where we assume that one refcount is held by the caller of thp
      split and the others are taken via mapping.  To meet this assumption,
      this patch moves thp split part in soft_offline_page() after
      get_any_page().
      
      [akpm@linux-foundation.org: remove unneeded #define, per Kirill]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NKirill A.  Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e41a30c
    • N
      mm: soft-offline: check return value in second __get_any_page() call · d96b339f
      Naoya Horiguchi 提交于
      I saw the following BUG_ON triggered in a testcase where a process calls
      madvise(MADV_SOFT_OFFLINE) on thps, along with a background process that
      calls migratepages command repeatedly (doing ping-pong among different
      NUMA nodes) for the first process:
      
         Soft offlining page 0x60000 at 0x700000600000
         __get_any_page: 0x60000 free buddy page
         page:ffffea0001800000 count:0 mapcount:-127 mapping:          (null) index:0x1
         flags: 0x1fffc0000000000()
         page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
         ------------[ cut here ]------------
         kernel BUG at /src/linux-dev/include/linux/mm.h:342!
         invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
         Modules linked in: cfg80211 rfkill crc32c_intel serio_raw virtio_balloon i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
         CPU: 3 PID: 3035 Comm: test_alloc_gene Tainted: G           O    4.4.0-rc8-v4.4-rc8-160107-1501-00000-rc8+ #74
         Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         task: ffff88007c63d5c0 ti: ffff88007c210000 task.ti: ffff88007c210000
         RIP: 0010:[<ffffffff8118998c>]  [<ffffffff8118998c>] put_page+0x5c/0x60
         RSP: 0018:ffff88007c213e00  EFLAGS: 00010246
         Call Trace:
           put_hwpoison_page+0x4e/0x80
           soft_offline_page+0x501/0x520
           SyS_madvise+0x6bc/0x6f0
           entry_SYSCALL_64_fastpath+0x12/0x6a
         Code: 8b fc ff ff 5b 5d c3 48 89 df e8 b0 fa ff ff 48 89 df 31 f6 e8 c6 7d ff ff 5b 5d c3 48 c7 c6 08 54 a2 81 48 89 df e8 a4 c5 01 00 <0f> 0b 66 90 66 66 66 66 90 55 48 89 e5 41 55 41 54 53 48 8b 47
         RIP  [<ffffffff8118998c>] put_page+0x5c/0x60
          RSP <ffff88007c213e00>
      
      The root cause resides in get_any_page() which retries to get a refcount
      of the page to be soft-offlined.  This function calls
      put_hwpoison_page(), expecting that the target page is putback to LRU
      list.  But it can be also freed to buddy.  So the second check need to
      care about such case.
      
      Fixes: af8fae7c ("mm/memory-failure.c: clean up soft_offline_page()")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[3.9+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d96b339f
    • K
      thp, mm: split_huge_page(): caller need to lock page · 4d2fa965
      Kirill A. Shutemov 提交于
      We're going to use migration entries instead of compound_lock() to
      stabilize page refcounts.  Setup and remove migration entries require
      page to be locked.
      
      Some of split_huge_page() callers already have the page locked.  Let's
      require everybody to lock the page before calling split_huge_page().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d2fa965
    • K
      thp: add option to setup migration entries during PMD split · ba988280
      Kirill A. Shutemov 提交于
      We are going to use migration PTE entries to stabilize page counts.  If
      the page is mapped with PMDs we need to split the PMD and setup
      migration entries.  It's reasonable to combine these operations to avoid
      double-scanning over the page table.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba988280
    • K
      thp: implement split_huge_pmd() · eef1b3ba
      Kirill A. Shutemov 提交于
      Original split_huge_page() combined two operations: splitting PMDs into
      tables of PTEs and splitting underlying compound page.  This patch
      implements split_huge_pmd() which split given PMD without splitting
      other PMDs this page mapped with or underlying compound page.
      
      Without tail page refcounting, implementation of split_huge_pmd() is
      pretty straight-forward.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eef1b3ba
    • K
      mm, numa: skip PTE-mapped THP on numa fault · e81c4802
      Kirill A. Shutemov 提交于
      We're going to have THP mapped with PTEs.  It will confuse
      numabalancing.  Let's skip them for now.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e81c4802
    • K
      mm: differentiate page_mapped() from page_mapcount() for compound pages · e1534ae9
      Kirill A. Shutemov 提交于
      Let's define page_mapped() to be true for compound pages if any
      sub-pages of the compound page is mapped (with PMD or PTE).
      
      On other hand page_mapcount() return mapcount for this particular small
      page.
      
      This will make cases like page_get_anon_vma() behave correctly once we
      allow huge pages to be mapped with PTE.
      
      Most users outside core-mm should use page_mapcount() instead of
      page_mapped().
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1534ae9
    • K
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b