1. 06 May 2021, 10 commits
    • mm: make alloc_contig_range handle in-use hugetlb pages · ae37c7ff
      Committed by Oscar Salvador
      alloc_contig_range() will fail if it finds a HugeTLB page within the
      range, without a chance to handle it.  Since HugeTLB pages can be
      migrated like any other LRU or Movable page, it does not make sense to
      bail out without trying.  Enable the interface to recognize in-use HugeTLB
      pages so we can migrate them, giving the call a much better chance of
      succeeding.
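      
      A rough caller-side sketch of the idea (hypothetical fragment, not the
      verbatim kernel code; it assumes in-use hugetlb pages are isolated onto
      the caller's migration list while free ones are dissolved or replaced):
      
        /* Hypothetical sketch: a range scan for alloc_contig_range() hits a
         * hugetlb page. Instead of failing the whole range, isolate it. */
        if (PageHuge(page)) {
                ret = isolate_or_dissolve_huge_page(page, &cc->migratepages);
                if (ret)
                        return ret;     /* could neither dissolve nor isolate */
                /* an in-use hugetlb page now sits on the list, ready to migrate */
        }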
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae37c7ff
    • mm: make alloc_contig_range handle free hugetlb pages · 369fa227
      Committed by Oscar Salvador
      alloc_contig_range will fail if it ever sees a HugeTLB page within the
      range we are trying to allocate, even when that page is free and can be
      easily reallocated.
      
      This has proved to be problematic for some users of alloc_contig_range,
      e.g. CMA and virtio-mem, which would see the call fail even when those
      pages lie in ZONE_MOVABLE and are free.
      
      We can do better by trying to replace such a page.
      
      Free hugepages are tricky to handle: so that no userspace application
      notices any disruption, we need to replace the current free hugepage with
      a new one.
      
      In order to do that, a new function called alloc_and_dissolve_huge_page is
      introduced.  This function will first try to get a new fresh hugepage, and
      if it succeeds, it will replace the old one in the free hugepage pool.
      
      The free page replacement is done under hugetlb_lock, so no external users
      of hugetlb will notice the change.  To allocate the new huge page, we use
      alloc_buddy_huge_page(), so we do not have to deal with any counters, and
      prep_new_huge_page() is not called.  This is valuable because in case we
      need to free the new page, we only need to call __free_pages().
      
      Once we know that the page to be replaced is a genuine 0-refcounted huge
      page, we remove the old page from the freelist by remove_hugetlb_page().
      Then, we can call __prep_new_huge_page() and
      __prep_account_new_huge_page() for the new huge page to properly
      initialize it and increment the hstate->nr_huge_pages counter (previously
      decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
      enqueue_huge_page() and it is ready to be used.
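      
      A simplified sketch of that replacement flow (helper names follow the
      description above; error handling and the retry case described below are
      omitted):
      
        static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
        {
                int nid = page_to_nid(old_page);
                struct page *new_page;
        
                /* Fresh page from buddy; no hstate counters touched, no prep yet. */
                new_page = alloc_buddy_huge_page(h, htlb_alloc_mask(h) | __GFP_THISNODE,
                                                 nid, NULL, NULL);
                if (!new_page)
                        return -ENOMEM;
        
                spin_lock_irq(&hugetlb_lock);
                /* old_page is a genuine free (0-refcount) hugetlb page at this point. */
                remove_hugetlb_page(h, old_page, false);
        
                __prep_new_huge_page(new_page);
                __prep_account_new_huge_page(h, nid);   /* re-increments nr_huge_pages */
                enqueue_huge_page(h, new_page);
                spin_unlock_irq(&hugetlb_lock);
        
                /* Hand the old page back to the buddy allocator. */
                update_and_free_page(h, old_page);
                return 0;
        }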
      
      There is one tricky case when page's refcount is 0 because it is in the
      process of being released.  A missing PageHugeFreed bit will tell us that
      freeing is in flight so we retry after dropping the hugetlb_lock.  The
      race window should be small and the next retry should make forward
      progress.
      
      E.g:
      
      CPU0				CPU1
      free_huge_page()		isolate_or_dissolve_huge_page
      				  PageHuge() == T
      				  alloc_and_dissolve_huge_page
      				    alloc_buddy_huge_page()
      				    spin_lock_irq(hugetlb_lock)
      				    // PageHuge() && !PageHugeFreed &&
      				    // !PageCount()
      				    spin_unlock_irq(hugetlb_lock)
        spin_lock_irq(hugetlb_lock)
        1) update_and_free_page
             PageHuge() == F
             __free_pages()
        2) enqueue_huge_page
             SetPageHugeFreed()
        spin_unlock_irq(&hugetlb_lock)
      				  spin_lock_irq(hugetlb_lock)
                                         1) PageHuge() == F (freed by case#1 from CPU0)
      				   2) PageHuge() == T
                                             PageHugeFreed() == T
                                             - proceed with replacing the page
      
      In the case above we retry, as the race window is quite small and we have
      a good chance of succeeding next time.
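      
      In code, that retry amounts to something like the following sketch (flag
      and helper names approximate the description above):
      
        retry:
                spin_lock_irq(&hugetlb_lock);
                if (PageHuge(old_page) && !page_count(old_page) && !HPageFreed(old_page)) {
                        /*
                         * free_huge_page() is still in flight for this page:
                         * drop the lock, let it finish, and try again.
                         */
                        spin_unlock_irq(&hugetlb_lock);
                        cond_resched();
                        goto retry;
                }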
      
      With regard to the allocation, we restrict it to the node the page belongs
      to with __GFP_THISNODE, meaning we do not fall back to other nodes' zones.
      
      Note that gigantic hugetlb pages are fenced off since there is a cyclic
      dependency between them and alloc_contig_range.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      369fa227
    • hugetlb: add per-hstate mutex to synchronize user adjustments · 29383967
      Committed by Mike Kravetz
      The helper routine hstate_next_node_to_alloc accesses and modifies the
      hstate variable next_nid_to_alloc.  The helper is used by the routines
      alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
      called with hugetlb_lock held.  However, alloc_pool_huge_page cannot be
      called with the hugetlb lock held as it will call the page allocator.
      Two instances of alloc_pool_huge_page could run in parallel, or
      alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
      which may result in the variable next_nid_to_alloc becoming invalid for
      the caller and pages being allocated on the wrong node.
      
      Both alloc_pool_huge_page and adjust_pool_surplus are only called from
      the routine set_max_huge_pages after boot.  set_max_huge_pages is only
      called as the result of a user writing to the proc/sysfs nr_hugepages,
      or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
      
      It makes little sense to allow multiple adjustments to the number of
      hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
      allow one hugetlb page adjustment at a time.  This will synchronize
      modifications to the next_nid_to_alloc variable.
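      
      A minimal sketch of the idea (field and function names as described above;
      the pool-adjustment details are elided):
      
        struct hstate {
                struct mutex resize_lock;       /* serializes pool size changes */
                int next_nid_to_alloc;
                /* ... */
        };
      
        static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
                                      nodemask_t *nodes_allowed)
        {
                mutex_lock(&h->resize_lock);    /* only one adjustment at a time */
                /*
                 * Grow or shrink the pool here; next_nid_to_alloc can no longer
                 * be raced on by a concurrent nr_hugepages write.
                 */
                mutex_unlock(&h->resize_lock);
                return 0;
        }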
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29383967
    • mm/huge_memory.c: remove unused macro TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG · d4afd60c
      Committed by Miaohe Lin
      Commit 4958e4d8 ("mm: thp: remove debug_cow switch") forgot to
      remove the TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG macro.  Remove it here.
      
      Link: https://lkml.kernel.org/r/20210318122722.13135-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: yuleixzhang <yulei.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4afd60c
    • hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp · 6dfeaff9
      Committed by Peter Xu
      Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
      userfaultfd-wp is always based on pgtable entries, so they cannot be
      shared.
      
      Walk the hugetlb range and unshare all such mappings if any exist, right
      before UFFDIO_REGISTER succeeds and returns to userspace.
      
      This will pair with want_pmd_share() in hugetlb code so that huge pmd
      sharing is completely disabled for userfaultfd-wp registered range.
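      
      A sketch of that walk (hypothetical helper name; the i_mmap locking and
      mmu-notifier handling of the real code are omitted):
      
        static void hugetlb_unshare_pmds_in_range(struct vm_area_struct *vma,
                                                  unsigned long start, unsigned long end)
        {
                struct hstate *h = hstate_vma(vma);
                unsigned long sz = huge_page_size(h);
                unsigned long addr;
                pte_t *ptep;
        
                for (addr = start; addr < end; addr += sz) {
                        ptep = huge_pte_offset(vma->vm_mm, addr, sz);
                        if (ptep)
                                /* Drop any shared pmd so wp bits stay per process. */
                                huge_pmd_unshare(vma->vm_mm, vma, &addr, ptep);
                }
        }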
      
      Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dfeaff9
    • mm/hugetlb: move flush_hugetlb_tlb_range() into hugetlb.h · 537cf30b
      Committed by Peter Xu
      Prepare for it to be called outside of mm/hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20210218231204.15474-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      537cf30b
    • hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled · c1991e07
      Committed by Peter Xu
      Huge pmd sharing could bring problems to userfaultfd.  The thing is that
      userfaultfd runs its logic based on the special bits on page table
      entries, whereas huge pmd sharing could potentially share page table
      entries across different address ranges.  That could cause issues in
      either of these cases:
      
       - When sharing huge pmd page tables for an uffd write protected range,
         the newly mapped huge pmd range will also be write protected
         unexpectedly, or,
      
       - When we try to write protect a range of huge pmd shared range, we'll
         first do huge_pmd_unshare() in hugetlb_change_protection(), however
         that also means the UFFDIO_WRITEPROTECT could be silently skipped for
         the shared region, which could lead to data loss.
      
      While at it, a few other things are done altogether:
      
       - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
         that's definitely something that arch code would like to use too
      
       - ARM64 currently checks directly against
         CONFIG_ARCH_WANT_HUGE_PMD_SHARE when trying to share huge pmds. Switch
         to the want_pmd_share() helper (see the sketch after this list).
      
       - Move vma_shareable() from huge_pmd_share() into want_pmd_share().
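      
      A sketch of the resulting helper (simplified; the uffd check is the new
      part, the sharing conditions themselves are unchanged):
      
        bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
        {
                /* Never share page tables for a uffd-wp registered VMA. */
                if (uffd_disable_huge_pmd_share(vma))
                        return false;
        #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
                return vma_shareable(vma, addr);
        #else
                return false;
        #endif
        }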
      
      [peterx@redhat.com: fix build with !ARCH_WANT_HUGE_PMD_SHARE]
        Link: https://lkml.kernel.org/r/20210310185359.88297-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210218231202.15426-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1991e07
    • hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() · aec44e0f
      Committed by Peter Xu
      Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
      
      This series tries to disable huge pmd unshare of hugetlbfs backed memory
      for uffd-wp.  Although uffd-wp for hugetlbfs is still at the RFC stage,
      the idea of this series may be needed for multiple tasks (Axel's uffd
      minor fault series, and Mike's soft dirty series), so I picked it out
      from the larger series.
      
      This patch (of 4):
      
      This is preparatory work to allow the per-architecture huge_pte_alloc()
      to behave differently according to different VMA attributes.
      
      Pass the vma deeper into huge_pmd_share() so that we can avoid the
      find_vma() call.
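      
      A sketch of the signature change this implies (the second form passes the
      vma down, which is what lets huge_pmd_share() drop its find_vma() call):
      
        /* before */
        pte_t *huge_pte_alloc(struct mm_struct *mm,
                              unsigned long addr, unsigned long sz);
      
        /* after */
        pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                              unsigned long addr, unsigned long sz);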
      
      [peterx@redhat.com: build fix]
        Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1
      Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aec44e0f
    • mm: remove nrexceptional from inode · 8bc3c481
      Committed by Matthew Wilcox (Oracle)
      We no longer track anything in nrexceptional, so remove it, saving 8 bytes
      per inode.
      
      Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Vishal Verma <vishal.l.verma@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bc3c481
    • mm: introduce and use mapping_empty() · 7716506a
      Committed by Matthew Wilcox (Oracle)
      Patch series "Remove nrexceptional tracking", v2.
      
      We actually use nrexceptional for very little these days.  It's a minor
      pain to keep in sync with nrpages, but the pain becomes much bigger with
      the THP patches because we don't know how many indices a shadow entry
      occupies.  It's easier to just remove it than keep it accurate.
      
      Also, we save 8 bytes per inode which is nothing to sneeze at; on my
      laptop, it would improve shmem_inode_cache from 22 to 23 objects per
      16kB, and inode_cache from 26 to 27 objects.  Combined, that saves
      a megabyte of memory from a combined usage of 25MB for both caches.
      Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
      any memory for ext4.
      
      This patch (of 4):
      
      Instead of checking the two counters (nrpages and nrexceptional), we can
      just check whether i_pages is empty.
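      
      The helper, as described, is essentially a one-liner (sketch, assuming
      i_pages is the XArray that holds both pages and shadow entries):
      
        static inline bool mapping_empty(struct address_space *mapping)
        {
                /* An empty xarray means no pages and no exceptional entries. */
                return xa_empty(&mapping->i_pages);
        }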
      
      Link: https://lkml.kernel.org/r/20201026151849.24232-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20201026151849.24232-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Vishal Verma <vishal.l.verma@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7716506a
  2. 01 May 2021, 30 commits