1. 09 Sep, 2015 40 commits
    • J
      mm/compaction: correct to flush migrated pages if pageblock skip happens · 1a16718c
      Joonsoo Kim authored
      We cache isolate_start_pfn before entering isolate_migratepages().  If
      pageblock is skipped in isolate_migratepages() due to whatever reason,
      cc->migrate_pfn can be far from isolate_start_pfn hence we flush pages
      that were freed.  For example, the following scenario can be possible:
      
      - assume order-9 compaction, pageblock order is 9
      - start_isolate_pfn is 0x200
      - isolate_migratepages()
        - skip a number of pageblocks
        - start to isolate from pfn 0x600
        - cc->migrate_pfn = 0x620
        - return
      - last_migrated_pfn is set to 0x200
      - check flushing condition
        - current_block_start is set to 0x600
        - last_migrated_pfn < current_block_start then do useless flush
      
      This wrong flush would not help the performance and success rate so this
      patch tries to fix it.  One simple way to know the exact position where
      we start to isolate migratable pages is that we cache it in
      isolate_migratepages() before entering actual isolation.  This patch
      implements that and fixes the problem.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1a16718c
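The scenario in this commit message can be traced with a small userspace model. This is only a sketch: `block_start_pfn()` and `needs_flush()` are hypothetical helpers rather than kernel functions, the alignment is fixed at 2^9 pages as in the example, and the flush condition is reduced to the single comparison quoted in the message.

```c
#include <stdbool.h>

/* Assumption: order-9 compaction with pageblock order 9, as in the scenario. */
#define PAGEBLOCK_ORDER 9UL
#define PAGEBLOCK_NR_PAGES (1UL << PAGEBLOCK_ORDER)

/* Round a pfn down to the first pfn of the block that contains it. */
static unsigned long block_start_pfn(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}

/*
 * Simplified flush check: flush once the migration scanner has moved into a
 * block beyond the one where pages were last migrated.
 */
static bool needs_flush(unsigned long last_migrated_pfn,
			unsigned long migrate_pfn)
{
	return last_migrated_pfn < block_start_pfn(migrate_pfn);
}
```

With the stale starting position, `needs_flush(0x200, 0x620)` is true even though nothing was migrated near 0x200. If the position recorded is where isolation really began (0x600), `needs_flush(0x600, 0x620)` is false and the useless flush is avoided.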
    • V
      mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Vlastimil Babka authored
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has led to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Robin Holt <robinmholt@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96db800f
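The difference between the two node checks described above can be illustrated with a userspace model. This is a sketch under assumptions, not the kernel implementation: `preferred_node_with_fallback()` and `node_id_is_valid()` are hypothetical names, and `MODEL_MAX_NODES` is an arbitrary node count chosen for the example.

```c
#include <stdbool.h>

#define NUMA_NO_NODE (-1)   /* same value the kernel uses for "no node" */
#define MODEL_MAX_NODES 4   /* hypothetical node count for this model */

/*
 * Models the 'nid < 0' check in alloc_pages_node(): every negative value,
 * not only NUMA_NO_NODE, silently falls back to the current node.
 */
static int preferred_node_with_fallback(int nid, int current_node)
{
	if (nid < 0)
		nid = current_node;
	return nid;
}

/*
 * Models the precondition of __alloc_pages_node(): the caller must already
 * know that the node id is valid, so no fallback is performed.
 */
static bool node_id_is_valid(int nid)
{
	return nid >= 0 && nid < MODEL_MAX_NODES;
}
```

A bogus value such as -5 is accepted by the fallback variant and turned into the current node, which is how a 'nid < 0' comparison can hide a buggy caller, whereas the stricter validity check rejects it.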
    • H
      mm, vmscan: unlock page while waiting on writeback · 7fadc820
      Hugh Dickins authored
      This is merely a politeness: I've not found that shrink_page_list()
      leads to deadlock with the page it holds locked across
      wait_on_page_writeback(); but nevertheless, why hold others off by
      keeping the page locked there?
      
      And while we're at it: remove the mistaken "not " from the commentary on
      this Case 3 (and a distracting blank line from Case 2, if I may).
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7fadc820
    • J
      list_lru: don't call list_lru_from_kmem if the list_head is empty · 26f5d760
      Jeff Layton authored
      If the list_head is empty then we'll have called list_lru_from_kmem for
      nothing.  Move that call inside of the list_empty if block.
      Signed-off-by: Jeff Layton <jeff.layton@primarydata.com>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26f5d760
    • W
      kmemleak: record accurate early log buffer count and report when exceeded · 21cd3a60
      Wang Kai authored
      In log_early function, crt_early_log should also count once when
      'crt_early_log >= ARRAY_SIZE(early_log)'.  Otherwise the reported count
      from kmemleak_init is one less than 'actual number'.
      
      Then, in kmemleak_init, if the early_log buffer size equals the actual
      number, kmemleak will initialize successfully, so change the warning
      condition to 'crt_early_log > ARRAY_SIZE(early_log)'.
      Signed-off-by: Wang Kai <morgan.wang@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21cd3a60
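The counting rule described in this commit can be sketched in userspace as follows. The buffer size, the `int` payload and the helper `early_log_exceeded()` are assumptions made for illustration; only the names `log_early`, `crt_early_log` and `early_log` come from the message.

```c
#include <stdbool.h>

#define EARLY_LOG_SIZE 4  /* hypothetical; kept small for illustration */

static int early_log[EARLY_LOG_SIZE];
static int crt_early_log;

/*
 * Record one early operation. When the buffer is already full the entry is
 * dropped, but it is still counted so that the total reflects the actual
 * number of operations seen.
 */
static bool log_early(int op)
{
	if (crt_early_log >= EARLY_LOG_SIZE) {
		crt_early_log++;	/* count the dropped entry too */
		return false;
	}
	early_log[crt_early_log++] = op;
	return true;
}

/* The buffer was exceeded only if more entries arrived than it can hold. */
static bool early_log_exceeded(void)
{
	return crt_early_log > EARLY_LOG_SIZE;
}
```

Filling the buffer exactly is not an error under this rule: after four calls the count is 4 and `early_log_exceeded()` is false. A fifth call is dropped, the count becomes 5, and the check reports the overflow with an accurate total.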
    • C
      mm/mmap.c: simplify the failure return working flow · e3975891
      Chen Gang authored
      __split_vma() doesn't need the out_err label, nor does err need initializing.
      
      copy_vma() can return NULL directly when kmem_cache_alloc() fails.
      Signed-off-by: Chen Gang <gang.chen.5i5j@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3975891
    • Y
      shmem: recalculate file inode when fstat · 44a30220
      Yu Zhao authored
      Shmem uses shmem_recalc_inode to update i_blocks when it allocates a page,
      undoes a range or swaps.  But mm can drop a clean page without notifying
      shmem.  This makes fstat sometimes return an out-of-date block size.
      
      The problem can be partially solved when we add
      inode_operations->getattr which calls shmem_recalc_inode to update
      i_blocks for fstat.
      
      shmem_recalc_inode also updates counter used by statfs and
      vm_committed_as.  For them the situation is not changed.  They still
      suffer from the discrepancy after dropping clean page and before the
      function is called by aforementioned triggers.
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44a30220
    • A
      mm/memblock.c: rename local variable of memblock_type to 'type' · 567d117b
      Alexander Kuleshov authored
      Since commit e3239ff9 ("memblock: Rename memblock_region to
      memblock_type and memblock_property to memblock_region"), all local
      variables of the memblock_type type were renamed to 'type'.  This commit
      renames all remaining local variables with the memblock_type type to the
      same view.
      Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      567d117b
    • N
      mm/hwpoison: don't try to unpoison containment-failed pages · 230ac719
      Naoya Horiguchi authored
      memory_failure() can be called at any page at any time, which means that
      we can't eliminate the possibility of containment failure.  In such case
      the best option is to leak the page intentionally (and never touch it
      later.)
      
      We have an unpoison function for testing, and it cannot handle such
      containment-failed pages, which results in kernel panic (visible with
      various calltraces.) So this patch suggests that we limit the
      unpoisonable pages to properly contained pages and ignore any other
      ones.
      
      Testers are recommended to keep in mind that there're un-unpoisonable
      pages when writing test programs.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      230ac719
    • W
      mm/hwpoison: fix race between soft_offline_page and unpoison_memory · da1b13cc
      Wanpeng Li authored
      Wanpeng Li reported a race between soft_offline_page() and
      unpoison_memory(), which causes the following kernel panic:
      
         BUG: Bad page state in process bash  pfn:97000
         page:ffffea00025c0000 count:0 mapcount:1 mapping:          (null) index:0x7f4fdbe00
         flags: 0x1fffff80080048(uptodate|active|swapbacked)
         page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
         bad because of flags:
         flags: 0x40(active)
         Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
         CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
         Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
         Call Trace:
           dump_stack+0x48/0x5c
           bad_page+0xe6/0x140
           free_pages_prepare+0x2f9/0x320
           ? uncharge_list+0xdd/0x100
           free_hot_cold_page+0x40/0x170
           __put_single_page+0x20/0x30
           put_page+0x25/0x40
           unmap_and_move+0x1a6/0x1f0
           migrate_pages+0x100/0x1d0
           ? kill_procs+0x100/0x100
           ? unlock_page+0x6f/0x90
           __soft_offline_page+0x127/0x2a0
           soft_offline_page+0xa6/0x200
      
      This race is explained like below:
      
        CPU0                    CPU1
      
        soft_offline_page
        __soft_offline_page
        TestSetPageHWPoison
                              unpoison_memory
                              PageHWPoison check (true)
                              TestClearPageHWPoison
                              put_page    -> release refcount held by get_hwpoison_page in unpoison_memory
                              put_page    -> release refcount held by isolate_lru_page in __soft_offline_page
        migrate_pages
      
      The second put_page() releases the refcount held by isolate_lru_page(),
      which leads to unmap_and_move() releasing the last refcount of the page
      with mapcount still 1, since try_to_unmap() is not called if only one
      user maps the page.  In any case, the page refcount and mapcount will
      still be inconsistent if the page is mapped by multiple users.
      
      This race was introduced by commit 4491f712 ("mm/memory-failure: set
      PageHWPoison before migrate_pages()"), which focuses on preventing the
      reuse of successfully migrated page.  Before this commit we prevent the
      reuse by changing the migratetype to MIGRATE_ISOLATE during soft
      offlining, which has the following problems, so simply reverting the
      commit is not the best option:
      
        1) it doesn't eliminate the reuse completely, because
           set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
           target page if the pageblock of the page contains one or more
           unmovable pages (i.e.  has_unmovable_pages() returns true).
      
        2) the original code changes migratetype to MIGRATE_ISOLATE
           forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
           regardless of the original migratetype state, which could impact
           other subsystems like memory hotplug or compaction.
      
      This patch moves SetPageHWPoison just after put_page() in
      unmap_and_move(), which closes the reported race window and minimizes
      another race window between SetPageHWPoison and reallocation (which causes
      the reuse of a soft-offlined page).  The latter race window still exists,
      but keeping it open is acceptable because it is rare and, even if it
      happens, effectively the same as an ordinary "containment failure" case.
      
      Fixes: 4491f712 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      da1b13cc
    • N
      mm/hwpoison: introduce num_poisoned_pages wrappers · 8e30456b
      Naoya Horiguchi authored
      num_poisoned_pages counter will be changed outside mm/memory-failure.c
      by a subsequent patch, so this patch prepares wrappers to manipulate it.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8e30456b
    • W
      mm/hwpoison: replace most of put_page in memory error handling by put_hwpoison_page · 665d9da7
      Wanpeng Li authored
      Replace most instances of put_page() in memory error handling with
      put_hwpoison_page().
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      665d9da7
    • W
      mm/hwpoison: fix refcount of THP head page in no-injection case · be91748f
      Wanpeng Li authored
      Hwpoison injection takes a refcount on the target page and another
      refcount on the THP head page if the target page is a tail page of a THP.
      However, the current code doesn't release the head page refcount if the
      hwpoison filter rejects the THP for injection.
      
      Fix it by dropping the refcount of the head page if the target page is a
      tail page of a THP and the THP is rejected for injection.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be91748f
    • W
      mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error handling · 94bf4ec8
      Wanpeng Li authored
      Introduce put_hwpoison_page to put refcount for memory error handling.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94bf4ec8
    • W
      mm/hwpoison: fix PageHWPoison test/set race · 1e0e635b
      Wanpeng Li authored
      There is a race between madvise_hwpoison path and memory_failure:
      
       CPU0					CPU1
      
      madvise_hwpoison
      get_user_pages_fast
      PageHWPoison check (false)
      					memory_failure
      					TestSetPageHWPoison
      soft_offline_page
      PageHWPoison check (true)
      return -EBUSY (without put_page)
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1e0e635b
    • W
      mm/hwpoison: fix failure to split thp w/ refcount held · 7d1900c7
      Wanpeng Li authored
      THP pages get a refcount in madvise_hwpoison() with the
      MF_COUNT_INCREASED flag; however, the refcount is still held when
      splitting the THP fails.
      
      Fix it by dropping the refcount of THP pages when the split fails.
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d1900c7
    • M
      mm: add utility for early copy from unmapped ram · 6b0f68e3
      Mark Salter authored
      When booting an arm64 kernel w/initrd using UEFI/grub, use of mem= will
      likely cut off part or all of the initrd.  This leaves it outside the
      kernel linear map which leads to failure when unpacking.  The x86 code
      has a similar need to relocate an initrd outside of mapped memory in
      some cases.
      
      The current x86 code uses early_memremap() to copy the original initrd
      from unmapped to mapped RAM.  This patchset creates a generic
      copy_from_early_mem() utility based on that x86 code and has arm64 and
      x86 share it in their respective initrd relocation code.
      
      This patch (of 3):
      
      In some early boot circumstances, it may be necessary to copy from RAM
      outside the kernel linear mapping to mapped RAM.  The need to relocate
      an initrd is one example in the x86 code.  This patch creates a helper
      function based on current x86 code.
      Signed-off-by: Mark Salter <msalter@redhat.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6b0f68e3
    • V
      mm, compaction: skip compound pages by order in free scanner · 9fcd6d2e
      Vlastimil Babka authored
      The compaction free scanner is looking for PageBuddy() pages and
      skipping all others.  For large compound pages such as THP or hugetlbfs,
      we can save a lot of iterations if we skip them at once using their
      compound_order().  This is generally unsafe and we can read a bogus
      value of order due to a race, but if we are careful, the only danger is
      skipping too much.
      
      When tested with stress-highalloc from mmtests on 4GB system with 1GB
      hugetlbfs pages, the vmstat compact_free_scanned count decreased by at
      least 15%.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fcd6d2e
    • V
      mm, compaction: always skip all compound pages by order in migrate scanner · 29c0dde8
      Vlastimil Babka authored
      The compaction migrate scanner tries to skip THP pages by their order,
      to reduce number of iterations for pages it cannot isolate.  The check
      is only done if PageLRU() is true, which means it applies to THP pages,
      but not e.g.  hugetlbfs pages or any other non-LRU compound pages, which
      we have to iterate by base pages.
      
      This limitation comes from the assumption that it's only safe to read
      compound_order() when we have the zone's lru_lock and THP cannot be
      split under us.  But the only danger (after filtering out order values
      that are not below MAX_ORDER, to prevent overflows) is that we skip too
      much or too little after reading a bogus compound_order() due to a rare
      race.  This is the same reasoning as patch 99c0fd5e ("mm,
      compaction: skip buddy pages by their order in the migrate scanner")
      introduced for unsafely reading PageBuddy() order.
      
      After this patch, all pages are tested for PageCompound() and we skip
      them by compound_order().  The test is done after the test for
      balloon_page_movable() as we don't want to assume if balloon pages (or
      other pages with own isolation and migration implementation if a generic
      API gets implemented) are compound or not.
      
      When tested with stress-highalloc from mmtests on 4GB system with 1GB
      hugetlbfs pages, the vmstat compact_migrate_scanned count decreased by
      15%.
      
      [kirill.shutemov@linux.intel.com: change PageTransHuge checks to PageCompound for different series was squashed here]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29c0dde8
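The skipping rule described in the two compound-page commits above can be sketched as a small helper. This is a userspace model under assumptions: `skip_compound()` is a hypothetical name, the order is passed in as a plain integer instead of being read from a page, and `MODEL_MAX_ORDER` is set to 11 purely for illustration.

```c
/* Assumption: orders at or above this bound are treated as bogus. */
#define MODEL_MAX_ORDER 11U

/*
 * Given the pfn of a compound head page and an order that may have been read
 * racily, return the last pfn covered by the compound page so that the
 * scanner's normal "pfn++" step lands on the first pfn after it. An order
 * that is not below the bound is ignored, which avoids shifting by a garbage
 * value and falls back to advancing one base page at a time.
 */
static unsigned long skip_compound(unsigned long pfn, unsigned int order)
{
	if (order < MODEL_MAX_ORDER)
		pfn += (1UL << order) - 1;
	return pfn;
}
```

For an order-9 page starting at pfn 0x400 the helper returns 0x5ff, so 512 base pages are covered in one step. A nonsensical order such as 40 leaves the pfn unchanged, and a plausible but stale order only makes the scanner skip too much or too little, which is the risk the commit messages accept.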
    • V
      mm, compaction: encapsulate resetting cached scanner positions · 02333641
      Vlastimil Babka authored
      Resetting the cached compaction scanner positions is now open-coded in
      __reset_isolation_suitable() and compact_finished().  Encapsulate the
      functionality in a new function reset_cached_positions().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      02333641
    • V
      mm, compaction: simplify handling restart position in free pages scanner · f5f61a32
      Vlastimil Babka authored
      Handling the position where compaction free scanner should restart
      (stored in cc->free_pfn) got more complex with commit e14c720e ("mm,
      compaction: remember position within pageblock in free pages scanner").
      Currently the position is updated in each loop iteration of
      isolate_freepages(), although it should be enough to update it only when
      breaking from the loop.  There's also an extra check outside the loop
      that updates the position in case we have met the migration scanner.
      
      This can be simplified if we move the test for having isolated enough
      from the for-loop header next to the test for contention, and
      determining the restart position only in these cases.  We can reuse the
      isolate_start_pfn variable for this instead of setting cc->free_pfn
      directly.  Outside the loop, we can simply set cc->free_pfn to current
      value of isolate_start_pfn without any extra check.
      
      Also add a VM_BUG_ON to catch possible mistake in the future, in case we
      later add a new condition that terminates isolate_freepages_block()
      prematurely without also considering the condition in
      isolate_freepages().
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5f61a32
    • V
      mm, compaction: more robust check for scanners meeting · f2849aa0
      Vlastimil Babka authored
      Assorted compaction cleanups and optimizations.  The interesting patches
      are 4 and 5.  In 4, skipping of compound pages in single iteration is
      improved for migration scanner, so it works also for !PageLRU compound
      pages such as hugetlbfs, slab etc.  Patch 5 introduces this kind of
      skipping in the free scanner.  The trick is that we can read
      compound_order() without any protection, if we are careful to filter out
      values larger than MAX_ORDER.  The only danger is that we skip too much.
      The same trick was already used for reading the freepage order in the
      migrate scanner.
      
      To demonstrate improvements of Patches 4 and 5 I've run stress-highalloc
      from mmtests, set to simulate THP allocations (including __GFP_COMP) on
      a 4GB system where 1GB was occupied by hugetlbfs pages.  I'll include
      just the relevant stats:
      
                                     Patch 3     Patch 4     Patch 5
      
      Compaction stalls                 7523        7529        7515
      Compaction success                 323         304         322
      Compaction failures               7200        7224        7192
      Page migrate success            247778      264395      240737
      Page migrate failure             15358       33184       21621
      Compaction pages isolated       906928      980192      909983
      Compaction migrate scanned     2005277     1692805     1498800
      Compaction free scanned       13255284    11539986     9011276
      Compaction cost                    288         305         277
      
      With 5 iterations per patch, the results are still noisy, but we can see
      that Patch 4 does reduce migrate_scanned by 15% thanks to skipping the
      hugetlbfs pages at once.  Interestingly, free_scanned is also reduced
      and I have no idea why.  Patch 5 further reduces free_scanned as
      expected, by 15%.  Other stats are unaffected modulo noise.
      
      [1] https://lkml.org/lkml/2015/1/19/158
      
      This patch (of 5):
      
      Compaction should finish when the migration and free scanner meet, i.e.
      they reach the same pageblock.  Currently however, the test in
      compact_finished() simply just compares the exact pfns, which may yield
      a false negative when the free scanner position is in the middle of a
      pageblock and the migration scanner reaches the beginning of the same
      pageblock.
      
      This hasn't been a problem until commit e14c720e ("mm, compaction:
      remember position within pageblock in free pages scanner") allowed the
      free scanner position to be in the middle of a pageblock between
      invocations.  The hot-fix 1d5bfe1f ("mm, compaction: prevent
      infinite loop in compact_zone") prevented the issue by adding a special
      check in the migration scanner to satisfy the current detection of
      scanners meeting.
      
      However, the proper fix is to make the detection more robust.  This
      patch introduces the compact_scanners_met() function that returns true
      when the free scanner position is in the same or lower pageblock than
      the migration scanner.  The special case in isolate_migratepages()
      introduced by 1d5bfe1f is removed.
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f2849aa0
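The false negative described above is easy to see when the two checks are placed side by side. The following is a userspace sketch, not the kernel source: both function names are hypothetical, the scanner positions are passed as plain arguments, and the pageblock order is assumed to be 9.

```c
#include <stdbool.h>

#define PAGEBLOCK_ORDER 9  /* assumption for this model */

/* Old-style detection: compare the exact pfns of the two scanners. */
static bool scanners_met_exact(unsigned long free_pfn,
			       unsigned long migrate_pfn)
{
	return free_pfn <= migrate_pfn;
}

/*
 * More robust detection: the scanners have met when the free scanner is in
 * the same pageblock as the migration scanner, or in a lower one.
 */
static bool scanners_met_by_block(unsigned long free_pfn,
				  unsigned long migrate_pfn)
{
	return (free_pfn >> PAGEBLOCK_ORDER) <= (migrate_pfn >> PAGEBLOCK_ORDER);
}
```

With the free scanner in the middle of a pageblock (0x650) and the migration scanner at the start of the same pageblock (0x600), the exact comparison says the scanners have not met, while the per-pageblock comparison correctly reports that they have.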
    • S
      mm: add support for __GFP_ZERO flag to dma_pool_alloc() · fa23f56d
      Sean O. Stalley authored
      Currently a call to dma_pool_alloc() with a ___GFP_ZERO flag returns a
      non-zeroed memory region.
      
      This patchset adds support for the __GFP_ZERO flag to dma_pool_alloc(),
      adds 2 wrapper functions for allocing zeroed memory from a pool, and
      provides a coccinelle script for finding & replacing instances of
      dma_pool_alloc() followed by memset(0) with a single dma_pool_zalloc()
      call.
      
      There was some concern that this always calls memset() to zero, instead
      of passing __GFP_ZERO into the page allocator.
      [https://lkml.org/lkml/2015/7/15/881]
      
      I ran a test on my system to get an idea of how often dma_pool_alloc()
      calls into pool_alloc_page().
      
      After Boot:	[   30.119863] alloc_calls:541, page_allocs:7
      After an hour:	[ 3600.951031] alloc_calls:9566, page_allocs:12
      After copying 1GB file onto a USB drive:
      		[ 4260.657148] alloc_calls:17225, page_allocs:12
      
      It doesn't look like dma_pool_alloc() calls down to the page allocator
      very often (at least on my system).
      
      This patch (of 4):
      
      Currently the __GFP_ZERO flag is ignored by dma_pool_alloc().
      Make dma_pool_alloc() zero the memory if this flag is set.
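
A minimal user-space sketch of the idea, assuming a toy pool in place of the real DMA pool.  SKETCH_GFP_ZERO, struct toy_pool, toy_pool_alloc() and toy_pool_zalloc() are hypothetical names used only for illustration; they are not the kernel API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flag bit standing in for __GFP_ZERO. */
#define SKETCH_GFP_ZERO 0x1u

/* Toy pool: hands out fixed-size blocks from a caller-supplied buffer. */
struct toy_pool {
	unsigned char *base;
	size_t block_size;
	size_t nr_blocks;
	size_t next;		/* index of the next unused block */
};

static void *toy_pool_alloc(struct toy_pool *pool, unsigned int flags)
{
	void *block;

	if (pool->next >= pool->nr_blocks)
		return NULL;
	block = pool->base + pool->next++ * pool->block_size;

	/* Honour the zero flag here instead of silently ignoring it. */
	if (flags & SKETCH_GFP_ZERO)
		memset(block, 0, pool->block_size);
	return block;
}

/* Wrapper in the spirit of dma_pool_zalloc(): always ask for zeroed memory. */
static void *toy_pool_zalloc(struct toy_pool *pool, unsigned int flags)
{
	return toy_pool_alloc(pool, flags | SKETCH_GFP_ZERO);
}
```

Zeroing inside the pool allocator covers blocks that are reused from an existing page, which is the common case according to the counters quoted above.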
      Signed-off-by: Sean O. Stalley <sean.stalley@intel.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Gilles Muller <Gilles.Muller@lip6.fr>
      Cc: Nicolas Palix <nicolas.palix@imag.fr>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa23f56d
    • J
      vmscan: fix increasing nr_isolated incurred by putback unevictable pages · c54839a7
      Jaewon Kim committed
      reclaim_clean_pages_from_list() assumes that shrink_page_list() returns
      the number of pages removed from the candidate list.  But
      shrink_page_list() puts back mlocked pages without passing them to the
      caller and without counting them as nr_reclaimed.  This increases
      nr_isolated.
      
      To fix this, this patch changes shrink_page_list() to pass unevictable
      pages back to the caller, which then takes care of those pages.
      
      Minchan said:
      
      It fixes two issues.
      
      1. With unevictable pages, cma_alloc will be successful.
      
      Strictly speaking, cma_alloc in the current kernel fails due to
      unevictable pages.
      
      2. Fix the leak of the NR_ISOLATED counter in vmstat.
      
      With it, too_many_isolated works.  Otherwise, it could hang until
      the process gets SIGKILL.
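
The counter leak can be illustrated with a toy accounting model.  This is a simplified sketch, not the kernel code: enum page_kind, struct shrink_result, toy_shrink() and isolated_left() are hypothetical names, and the model assumes the caller can only un-account pages that were reclaimed or that remain on its own list.

```c
#include <stdbool.h>
#include <stddef.h>

enum page_kind { PAGE_CLEAN, PAGE_MLOCKED };

struct shrink_result {
	size_t nr_reclaimed;	/* pages actually freed */
	size_t nr_returned;	/* pages left on the caller's list */
	size_t nr_hidden;	/* pages put back without the caller knowing */
};

/*
 * Toy shrink pass: clean pages are reclaimed.  Mlocked pages are either
 * handed back to the caller (behaviour after the fix) or put back
 * internally where the caller never sees them (behaviour before it).
 */
static struct shrink_result toy_shrink(const enum page_kind *pages, size_t n,
				       bool return_unevictable)
{
	struct shrink_result r = { 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		if (pages[i] == PAGE_CLEAN)
			r.nr_reclaimed++;
		else if (return_unevictable)
			r.nr_returned++;
		else
			r.nr_hidden++;
	}
	return r;
}

/* Isolated-page counter after the caller has un-accounted what it knows about. */
static long isolated_left(size_t nr_isolated, struct shrink_result r)
{
	return (long)nr_isolated - (long)r.nr_reclaimed - (long)r.nr_returned;
}
```

In this model, isolating two clean pages and one mlocked page leaves the counter at 1 when the mlocked page is hidden, and at 0 when it is passed back.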
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c54839a7
    • V
      mm: vmscan: never isolate more pages than necessary · 0b802f10
      Vladimir Davydov committed
      If transparent huge pages are enabled, we can isolate many more pages
      than we actually need to scan, because we count both single and huge
      pages equally in isolate_lru_pages().
      
      Since commit 5bc7b8ac ("mm: thp: add split tail pages to shrink
      page list in page reclaim"), we scan all the tail pages immediately
      after a huge page split (see shrink_page_list()).  As a result, we can
      reclaim up to SWAP_CLUSTER_MAX * HPAGE_PMD_NR (512 MB) in one run!
      
      This is easy to catch on memcg reclaim with zswap enabled.  The latter
      makes swapout instant so that if we happen to scan an unreferenced huge
      page we will evict both its head and tail pages immediately, which is
      likely to result in excessive reclaim.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b802f10
    • C
      bootmem: avoid freeing to bootmem after bootmem is done · 1b4ace41
      Chris Metcalf committed
      Bootmem isn't popular any more, but some architectures still use it, and
      freeing to bootmem after calling free_all_bootmem_core() can end up
      scribbling over random memory.  Instead, make sure the kernel generates
      a warning in this case by ensuring the node_bootmem_map field is
      non-NULL when we are freeing or marking bootmem.
      
      An instance of this bug was just fixed in the tile architecture ("tile:
      use free_bootmem_late() for initrd") and catching this case more widely
      seems like a good thing.
      Signed-off-by: Chris Metcalf <cmetcalf@ezchip.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Paul McQuade <paulmcquad@gmail.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b4ace41
    • N
      mm, page_isolation: make set/unset_migratetype_isolate() file-local · c5b4e1b0
      Naoya Horiguchi committed
      Nowadays, set/unset_migratetype_isolate() is defined and used only in
      mm/page_isolation.c, so let's limit the scope to that file.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5b4e1b0
    • A
      mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range() · acda0c33
      Aristeu Rozanski committed
      This check was introduced as part of
         6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()")
      
      which got duplicated by
         48684a65 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)")
      
      by reintroducing it earlier in queue_pages_test_walk().
      Signed-off-by: Aristeu Rozanski <aris@redhat.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      acda0c33
    • T
      mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Tang Chen committed
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But this only works if the nodes are contiguous.  Consider the
      following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only nodes 0, 1, 2 and 3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      numa_cleanup_meminfo() will then merge 1 and 2, and remove 8 and 9
      because their end addresses are over max_pfn, which is a0800000000.
      But 4 and 5 are not removed because their end addresses are less than
      max_pfn, even though nodes 4 and 5 don't actually exist.
      
      In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since the memory ranges of nodes 4 and 5 are in numa_meminfo,
      numa_register_memblks() mistakenly sets nodes 4 and 5 online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memory_block.  Since memory_block
      contains all available memory at boot time, if they overlap, it means the
      ranges exist.  If not, then remove them from numa_meminfo.
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
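
The filtering step can be sketched in user space like this.  It is only an approximation of the approach described above: struct mem_range, ranges_overlap() and drop_absent_blocks() are hypothetical names, and the "present" array stands in for the memory known to be available at boot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct mem_range {
	uint64_t start;	/* inclusive */
	uint64_t end;	/* exclusive */
	int nid;	/* NUMA node id, unused for the "present" ranges */
};

static bool ranges_overlap(const struct mem_range *a, const struct mem_range *b)
{
	return a->start < b->end && b->start < a->end;
}

/*
 * Keep only the blocks in blks[] that overlap some range in present[].
 * Compacts blks[] in place and returns the new number of blocks.
 */
static size_t drop_absent_blocks(struct mem_range *blks, size_t nr_blks,
				 const struct mem_range *present,
				 size_t nr_present)
{
	size_t kept = 0;

	for (size_t i = 0; i < nr_blks; i++) {
		bool found = false;

		for (size_t j = 0; j < nr_present && !found; j++)
			found = ranges_overlap(&blks[i], &present[j]);
		if (found)
			blks[kept++] = blks[i];
	}
	return kept;
}
```

A block for a node that sits in a hole between populated nodes overlaps nothing in present[] and is dropped, regardless of how its end address compares to max_pfn.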
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95cf82ec
    • T
      mm/memblock.c: make memblock_overlaps_region() return bool. · c5c5c9d1
      Tang Chen committed
      memblock_overlaps_region() checks if the given memblock region
      intersects a region in memblock.  If so, it returns the index of the
      intersected region.
      
      But its only caller is memblock_is_region_reserved(), and it returns 0
      if false, non-zero if true.
      
      Both of these should return bool.
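
A small sketch of why a bool return is the more natural interface.  The names struct region, overlaps_region_index() and overlaps_region() are made up for this example and do not mirror the kernel's signatures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct region {
	uint64_t base;
	uint64_t size;
};

/* Index-style result: position of the first intersecting region, or -1. */
static long overlaps_region_index(const struct region *r, size_t cnt,
				  uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < cnt; i++)
		if (base < r[i].base + r[i].size && r[i].base < base + size)
			return (long)i;
	return -1;
}

/* Bool-style result: callers only need a yes/no answer. */
static bool overlaps_region(const struct region *r, size_t cnt,
			    uint64_t base, uint64_t size)
{
	return overlaps_region_index(r, cnt, base, size) >= 0;
}
```

With the index-style result, a hit on the first region yields 0, so a caller that treats the value as a truth value would misread it; the bool variant removes that trap.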
      Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c5c5c9d1
    • M
      mm: madvise allow remove operation for hugetlbfs · 72079ba0
      Mike Kravetz committed
      Now that we have hole punching support for hugetlbfs, we can also
      support the MADV_REMOVE interface to it.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72079ba0
    • M
      hugetlbfs: add hugetlbfs_fallocate() · 70c3547e
      Mike Kravetz committed
      This is based on the shmem version, but it has diverged quite a bit.  We
      have no swap to worry about, nor the new file sealing.  Add
      synchronization via the fault mutex table to coordinate page faults,
      fallocate allocation and fallocate hole punch.
      
      What this allows us to do is move physical memory in and out of a
      hugetlbfs file without having it mapped.  This also gives us the ability
      to support MADV_REMOVE since it is currently implemented using
      fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
      a hugetlbfs file, which wasn't possible before.
      
      hugetlbfs fallocate only operates on whole huge pages.
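
One way to picture the whole-huge-page restriction for a hole punch is to shrink the requested byte range to the huge pages it fully covers.  The sketch below assumes a 2 MiB huge page and an inward-rounding policy; both are assumptions for illustration, and struct hole and whole_hpage_hole() are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed huge page size for the example: 2 MiB. */
#define HPAGE_SIZE (2ULL << 20)

struct hole {
	uint64_t start;	/* inclusive, in bytes */
	uint64_t end;	/* exclusive, in bytes */
};

/*
 * Shrink the byte range [offset, offset + len) to the whole huge pages
 * it fully covers.  Returns false if no whole huge page is covered.
 */
static bool whole_hpage_hole(uint64_t offset, uint64_t len, struct hole *out)
{
	/* Round the start up and the end down to huge page boundaries. */
	uint64_t start = (offset + HPAGE_SIZE - 1) / HPAGE_SIZE * HPAGE_SIZE;
	uint64_t end = (offset + len) / HPAGE_SIZE * HPAGE_SIZE;

	if (end <= start)
		return false;
	out->start = start;
	out->end = end;
	return true;
}
```

Under these assumptions, a request covering bytes 1 MiB to 5 MiB only affects the single huge page spanning 2 MiB to 4 MiB.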
      
      Based on code by Dave Hansen.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70c3547e
    • M
      hugetlbfs: New huge_add_to_page_cache helper routine · ab76ad54
      Mike Kravetz committed
      Currently, there is only a single place where hugetlbfs pages are added
      to the page cache.  The new fallocate code will be adding a second one, so
      break the functionality out into its own helper.
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab76ad54
    • M
      mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate · d85f69b0
      Mike Kravetz committed
      Areas hole punched by fallocate will not have entries in the
      region/reserve map.  However, shared mappings with min_size subpool
      reservations may still have reserved pages.  alloc_huge_page needs to
      handle this special case and do the proper accounting.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d85f69b0
    • M
      mm/hugetlb: vma_has_reserves() needs to handle fallocate hole punch · 1fb1b0e9
      Mike Kravetz committed
      In vma_has_reserves(), the current assumption is that reserves are
      always present for shared mappings.  However, this will not be the case
      with fallocate hole punch.  When punching a hole, the present page will
      be deleted as well as the region/reserve map entry (and hence any
      reservation).  vma_has_reserves is passed "chg" which indicates whether
      or not a region/reserve map is present.  Use this to determine if
      reserves are actually present or were removed via hole punch.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1fb1b0e9
    • M
      hugetlbfs: truncate_hugepages() takes a range of pages · b5cec28d
      Mike Kravetz committed
      Modify truncate_hugepages() to take a range of pages (start, end)
      instead of simply start.  If an end value of LLONG_MAX is passed, the
      current "truncate" functionality is maintained.  Existing callers are
      modified to pass LLONG_MAX as end of range.  By keying off end ==
      LLONG_MAX, the routine behaves differently for truncate and hole punch.
      Page removal is now synchronized with page allocation via faults by
      using the fault mutex table.  The hole punch case can experience the
      rare region_del error and must handle accordingly.
      
      Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
      the case where region_del returns an error.
      
      Since the routine handles more than just the truncate case, it is
      renamed to remove_inode_hugepages().  To be consistent, the routine
      truncate_huge_page() is renamed remove_huge_page().
      
      Downstream of remove_inode_hugepages(), the routine
      hugetlb_unreserve_pages() is also modified to take a range of pages.
      hugetlb_unreserve_pages is modified to detect an error from region_del and
      pass it back to the caller.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5cec28d
    • M
      mm/hugetlb: expose hugetlb fault mutex for use by fallocate · c672c7f2
      Mike Kravetz committed
      hugetlb page faults are currently synchronized by the table of mutexes
      (htlb_fault_mutex_table).  fallocate code will need to synchronize with
      the page fault code when it allocates or deletes pages.  Expose
      interfaces so that fallocate operations can be synchronized with page
      faults.  Minor name changes to be more consistent with other global
      hugetlb symbols.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c672c7f2
    • M
      mm/hugetlb: add region_del() to delete a specific range of entries · feba16e2
      Mike Kravetz committed
      fallocate hole punch will want to remove a specific range of pages.  The
      existing region_truncate() routine deletes all region/reserve map
      entries after a specified offset.  region_del() will provide this same
      functionality if the end of region is specified as LONG_MAX.  Hence,
      region_del() can replace region_truncate().
      
      Unlike region_truncate(), region_del() can return an error in the rare
      case where it can not allocate memory for a region descriptor.  This
      ONLY happens in the case where an existing region must be split.
      Current callers passing LONG_MAX as end of range will never experience
      this error and do not need to deal with error handling.  Future callers
      of region_del() (such as fallocate hole punch) will need to handle this
      error.
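
The split case can be illustrated with a toy region map.  This is a sketch under stated assumptions, not the kernel implementation: the map is a small fixed array rather than a list, a full array stands in for a failed descriptor allocation, and struct resv_map_sketch and region_del_sketch() are hypothetical names.

```c
#include <errno.h>
#include <limits.h>
#include <stddef.h>

#define MAX_REGIONS 4	/* toy capacity; a full table models a failed allocation */

struct region {
	long from;	/* inclusive */
	long to;	/* exclusive */
};

struct resv_map_sketch {
	struct region rg[MAX_REGIONS];	/* sorted, non-overlapping */
	size_t nr;
};

/*
 * Delete [f, t) from the map.  Returns the number of pages removed, or
 * -ENOMEM if a region must be split and no descriptor is available.
 */
static long region_del_sketch(struct resv_map_sketch *map, long f, long t)
{
	long removed = 0;
	size_t i = 0;

	while (i < map->nr) {
		struct region *rg = &map->rg[i];

		if (rg->to <= f) {		/* entirely before the range */
			i++;
			continue;
		}
		if (rg->from >= t)		/* entirely after: done */
			break;

		if (f > rg->from && t < rg->to) {
			/* Range lies strictly inside: split needs a new descriptor. */
			if (map->nr == MAX_REGIONS)
				return -ENOMEM;
			for (size_t j = map->nr; j > i + 1; j--)
				map->rg[j] = map->rg[j - 1];
			map->rg[i + 1].from = t;
			map->rg[i + 1].to = rg->to;
			rg->to = f;
			map->nr++;
			return removed + (t - f);
		}
		if (f <= rg->from && t >= rg->to) {
			/* Fully covered: drop the descriptor. */
			removed += rg->to - rg->from;
			for (size_t j = i; j + 1 < map->nr; j++)
				map->rg[j] = map->rg[j + 1];
			map->nr--;
			continue;		/* re-examine the new rg[i] */
		}
		if (f <= rg->from) {		/* trim the front */
			removed += t - rg->from;
			rg->from = t;
		} else {			/* trim the tail */
			removed += rg->to - f;
			rg->to = f;
		}
		i++;
	}
	return removed;
}
```

When t is LONG_MAX, the "strictly inside" test can never be true, so the split path and its error are unreachable, which matches the statement that truncate-style callers need no error handling.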
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      feba16e2
    • M
      mm/hugetlb: add cache of descriptors to resv_map for region_add · 5e911373
      Mike Kravetz committed
      hugetlbfs is used today by applications that want a high degree of
      control over huge page usage.  Often, large hugetlbfs files are used to
      map a large number of huge pages into the application processes.  The
      applications know when page ranges within these large files will no
      longer be used, and ideally would like to release them back to the
      subpool or global pools for other uses.  The fallocate() system call
      provides an interface for preallocation and hole punching within files.
      This patch set adds fallocate functionality to hugetlbfs.
      
      fallocate hole punch will want to remove a specific range of pages.
      When pages are removed, their associated entries in the region/reserve
      map will also be removed.  This will break an assumption in the
      region_chg/region_add calling sequence.  If a new region descriptor must
      be allocated, it is done as part of the region_chg processing.  In this
      way, region_add can not fail because it does not need to attempt an
      allocation.
      
      To prepare for fallocate hole punch, create a "cache" of descriptors
      that can be used by region_add if necessary.  region_chg will ensure
      there are sufficient entries in the cache.  It will be necessary to
      track the number of in progress add operations to know a sufficient
      number of descriptors reside in the cache.  A new routine region_abort
      is added to adjust this in progress count when add operations are
      aborted.  vma_abort_reservation is also added for callers creating
      reservations with vma_needs_reservation/vma_commit_reservation.
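
The bookkeeping behind the descriptor cache can be modelled with two counters.  This is a deliberately reduced sketch: struct resv_sketch and the *_sketch() functions are hypothetical, the alloc_ok parameter simulates whether an allocation would succeed, and real descriptors are replaced by a simple count.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct resv_sketch {
	int cache_count;	/* spare descriptors available to add */
	int adds_in_progress;	/* chg calls not yet followed by add or abort */
};

/* Reserve step: make sure the cache can cover every add in progress. */
static int region_chg_sketch(struct resv_sketch *m, bool alloc_ok)
{
	m->adds_in_progress++;
	if (m->adds_in_progress > m->cache_count) {
		if (!alloc_ok) {
			m->adds_in_progress--;	/* undo on failure */
			return -ENOMEM;
		}
		m->cache_count++;
	}
	return 0;
}

/* Commit step: cannot fail, because any descriptor it needs is cached. */
static void region_add_sketch(struct resv_sketch *m, bool needs_descriptor)
{
	assert(m->adds_in_progress > 0);
	if (needs_descriptor) {
		assert(m->cache_count > 0);
		m->cache_count--;
	}
	m->adds_in_progress--;
}

/* Abort step: the reserved add will not happen after all. */
static void region_abort_sketch(struct resv_sketch *m)
{
	assert(m->adds_in_progress > 0);
	m->adds_in_progress--;
}
```

The invariant is that cache_count is at least adds_in_progress after every successful reserve step, so the commit step never has to allocate.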
      
      [akpm@linux-foundation.org: fix typo in comment, use more cols]
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e911373
    • V
      mm: rename and move get/set_freepage_migratetype · bb14c2c7
      Vlastimil Babka committed
      The pair of get/set_freepage_migratetype() functions are used to cache
      pageblock migratetype for a page put on a pcplist, so that it does not
      have to be retrieved again when the page is put on a free list (e.g.
      when pcplists become full).  Historically it was also assumed that the
      value is accurate for pages on freelists (as the functions' names
      unfortunately suggest), but that cannot be guaranteed without affecting
      various allocator fast paths.  It is in fact not needed and all such
      uses have been removed.
      
      The last remaining (but pointless) usage related to pages on freelists
      is in move_freepages(), which this patch removes.
      
      To prevent further confusion, rename the functions to
      get/set_pcppage_migratetype() and expand their description.  Since all
      the users are now in mm/page_alloc.c, move the functions there from the
      shared header.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Seungho Park <seungho1.park@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb14c2c7