1. 18 July 2022 (3 commits)
    • mm, hugetlb: skip irrelevant nodes in show_free_areas() · dcadcf1c
      Committed by Gang Li
      show_free_areas() allows filtering out node-specific data that is
      irrelevant to the allocation request, but hugetlb_show_meminfo() still
      shows hugetlb info for all nodes, which is redundant and unnecessary.
      
      Use show_mem_node_skip() to skip irrelevant nodes.  And replace
      hugetlb_show_meminfo() with hugetlb_show_meminfo_node(nid).
      
      before-and-after sample output of OOM:
      
      before:
      ```
      [  214.362453] Node 1 active_anon:148kB inactive_anon:4050920kB active_file:112kB inactive_file:100kB
      [  214.375429] Node 1 Normal free:45100kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
      [  214.388334] lowmem_reserve[]: 0 0 0 0 0
      [  214.390251] Node 1 Normal: 423*4kB (UE) 320*8kB (UME) 187*16kB (UE) 117*32kB (UE) 57*64kB (UME) 20
      [  214.397626] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      [  214.401518] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      ```
      
      after:
      ```
      [  145.069705] Node 1 active_anon:128kB inactive_anon:4049412kB active_file:56kB inactive_file:84kB u
      [  145.110319] Node 1 Normal free:45424kB boost:0kB min:45576kB low:56968kB high:68360kB reserved_hig
      [  145.152315] lowmem_reserve[]: 0 0 0 0 0
      [  145.155244] Node 1 Normal: 470*4kB (UME) 373*8kB (UME) 247*16kB (UME) 168*32kB (UE) 86*64kB (UME)
      [  145.164119] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      ```
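
      As a rough illustration of the replacement helper, a minimal sketch of
      hugetlb_show_meminfo_node() is shown below; it assumes the per-node
      hstate counters used by the existing dump and is not the literal patch:

      ```
      /* Minimal sketch: dump hugetlb state for a single node only. */
      void hugetlb_show_meminfo_node(int nid)
      {
              struct hstate *h;

              if (!hugepages_supported())
                      return;

              for_each_hstate(h)
                      pr_info("Node %d hugepages_total=%u hugepages_free=%u "
                              "hugepages_surp=%u hugepages_size=%lukB\n",
                              nid,
                              h->nr_huge_pages_node[nid],
                              h->free_huge_pages_node[nid],
                              h->surplus_huge_pages_node[nid],
                              huge_page_size(h) / SZ_1K);
      }
      ```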
      
      Link: https://lkml.kernel.org/r/20220706034655.1834-1-ligang.bdlg@bytedance.com
      Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      dcadcf1c
    • hugetlb: do not update address in huge_pmd_unshare · 4ddb4d91
      Committed by Mike Kravetz
      As an optimization for loops sequentially processing hugetlb address
      ranges, huge_pmd_unshare would update a passed address if it unshared a
      pmd.  Updating a loop control variable outside the loop like this is
      generally a bad idea.  These loops are now using hugetlb_mask_last_page to
      optimize scanning when non-present ptes are discovered.  The same can be
      done when huge_pmd_unshare returns 1 indicating a pmd was unshared.
      
      Remove address update from huge_pmd_unshare.  Change the passed argument
      type and update all callers.  In loops sequentially processing addresses
      use hugetlb_mask_last_page to update address if pmd is unshared.
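
      A hedged sketch of the resulting loop pattern (not the literal caller
      code): the caller, rather than huge_pmd_unshare(), advances the scan
      address past the page table page when an unshare happens or when no
      page table page exists.

      ```
      /* Illustrative only: linear scan of a hugetlb range. */
      static void scan_hugetlb_range(struct vm_area_struct *vma,
                                     unsigned long start, unsigned long end)
      {
              struct hstate *h = hstate_vma(vma);
              unsigned long sz = huge_page_size(h);
              unsigned long last_addr_mask = hugetlb_mask_last_page(h);
              unsigned long address;
              pte_t *ptep;

              for (address = start; address < end; address += sz) {
                      ptep = huge_pte_offset(vma->vm_mm, address, sz);
                      if (!ptep) {
                              /* no page table page: skip to its last huge page */
                              address |= last_addr_mask;
                              continue;
                      }
                      if (huge_pmd_unshare(vma->vm_mm, vma, address, ptep)) {
                              /* pmd unshared: the whole PT page is gone, skip it */
                              address |= last_addr_mask;
                              continue;
                      }
                      /* ... process the pte at 'address' ... */
              }
      }
      ```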
      
      [sfr@canb.auug.org.au: fix an unused variable warning/error]
        Link: https://lkml.kernel.org/r/20220622171117.70850960@canb.auug.org.au
      Link: https://lkml.kernel.org/r/20220621235620.291305-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4ddb4d91
    • hugetlb: skip to end of PT page mapping when pte not present · e95a9851
      Committed by Mike Kravetz
      Patch series "hugetlb: speed up linear address scanning", v2.
      
      At unmap, fork and remap time hugetlb address ranges are linearly scanned.
      We can optimize these scans if the ranges are sparsely populated.
      
      Also, enable page table "Lazy copy" for hugetlb at fork.
      
      NOTE: Architectures not defining CONFIG_ARCH_WANT_GENERAL_HUGETLB need to
      add an arch-specific version of hugetlb_mask_last_page() to take advantage
      of the sparse address scanning improvements.  Baolin Wang added the routine
      for arm64.  Other architectures which could be optimized are: ia64, mips,
      parisc, powerpc, s390, sh and sparc.
      
      
      This patch (of 4):
      
      HugeTLB address ranges are linearly scanned during fork, unmap and remap
      operations.  If a non-present entry is encountered, the code currently
      continues to the next huge page aligned address.  However, a non-present
      entry implies that the page table page for that entry is not present. 
      Therefore, the linear scan can skip to the end of range mapped by the page
      table page.  This can speed operations on large sparsely populated hugetlb
      mappings.
      
      Create a new routine hugetlb_mask_last_page() that will return an address
      mask.  When the mask is ORed with an address, the result will be the
      address of the last huge page mapped by the associated page table page. 
      Use this mask to update addresses in routines which linearly scan hugetlb
      address ranges when a non-present pte is encountered.
      
      hugetlb_mask_last_page is related to the implementation of huge_pte_offset
      as hugetlb_mask_last_page is called when huge_pte_offset returns NULL. 
      This patch only provides a complete hugetlb_mask_last_page implementation
      when CONFIG_ARCH_WANT_GENERAL_HUGETLB is defined.  Architectures which
      provide their own versions of huge_pte_offset can also provide their own
      version of hugetlb_mask_last_page.
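
      For reference, a sketch of the generic (CONFIG_ARCH_WANT_GENERAL_HUGETLB)
      helper described above; architecture-specific versions can differ:

      ```
      /*
       * Sketch: return a mask which, ORed with an address, gives the address
       * of the last huge page mapped by the same page table page.
       */
      unsigned long hugetlb_mask_last_page(struct hstate *h)
      {
              unsigned long hp_size = huge_page_size(h);

              switch (hp_size) {
              case PUD_SIZE:          /* PUD entries share a P4D page */
                      return P4D_SIZE - PUD_SIZE;
              case PMD_SIZE:          /* PMD entries share a PUD page */
                      return PUD_SIZE - PMD_SIZE;
              default:
                      break;          /* no optimization for other sizes */
              }
              return 0UL;
      }
      ```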
      
      Link: https://lkml.kernel.org/r/20220621235620.291305-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220621235620.291305-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Rolf Eike Beer <eike-kernel@sf-tec.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e95a9851
  2. 04 July 2022 (2 commits)
    • mm: hugetlb: kill set_huge_swap_pte_at() · 18f39629
      Committed by Qi Zheng
      Commit e5251fd4 ("mm/hugetlb: introduce set_huge_swap_pte_at()
      helper") added set_huge_swap_pte_at() to handle swap entries on
      architectures that support hugepages consisting of contiguous ptes.
      Currently set_huge_swap_pte_at() is only overridden by arm64.
      
      set_huge_swap_pte_at() provides a sz parameter to help determine the
      number of entries to be updated.  But in fact all hugetlb swap entries
      contain pfn information, so we can find the corresponding folio through
      the pfn recorded in the swap entry, and folio_size() then gives the
      number of entries that need to be updated.
      
      Also, users can easily introduce bugs by ignoring the difference between
      set_huge_swap_pte_at() and set_huge_pte_at().  So handle swap entries in
      set_huge_pte_at() and remove set_huge_swap_pte_at(); then
      set_huge_pte_at() can be called anywhere, which simplifies the code.
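
      A hedged sketch of the idea, loosely modeled on the arm64 contiguous-pte
      case; num_contig_ptes() is an arm64-internal helper, and the exact folio
      lookup shown here is an assumption rather than the literal patch:

      ```
      /*
       * Illustrative only: when the huge pte is a swap/migration entry,
       * derive the mapping size from the folio behind the entry instead of
       * taking a separate 'sz' argument, then fill every contiguous slot.
       */
      if (!pte_present(pte)) {
              struct folio *folio;
              size_t pgsize;
              int i, ncontig;

              folio = page_folio(pfn_swap_entry_to_page(pte_to_swp_entry(pte)));
              ncontig = num_contig_ptes(folio_size(folio), &pgsize);

              for (i = 0; i < ncontig; i++, ptep++)
                      set_pte_at(mm, addr, ptep, pte);
              return;
      }
      ```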
      
      Link: https://lkml.kernel.org/r/20220626145717.53572-1-zhengqi.arch@bytedance.com
      Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      18f39629
    • mm/migration: return errno when isolate_huge_page failed · 7ce82f4c
      Committed by Miaohe Lin
      We might fail to isolate a huge page because, e.g., the page is under
      migration, which clears HPageMigratable.  We should return an errno in
      this case rather than always returning 1, which could confuse the user,
      i.e. the caller might think all of the memory was migrated while the
      hugetlb page was left behind.  Make the prototype of isolate_huge_page
      consistent with isolate_lru_page as suggested by Huang Ying, and rename
      isolate_huge_page to isolate_hugetlb as suggested by Muchun to improve
      readability.
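
      A hedged caller-side sketch of the new convention; the variable names
      mirror a typical move_pages()-style caller and are illustrative only:

      ```
      /* Illustrative only: propagate the isolation error instead of '1'. */
      if (PageHuge(page)) {
              err = isolate_hugetlb(page, &pagelist);
              if (err)
                      goto out_putpage;       /* e.g. -EBUSY: page is migrating */
      } else {
              err = isolate_lru_page(page);
              if (err)
                      goto out_putpage;
              list_add_tail(&page->lru, &pagelist);
      }
      ```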
      
      Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
      Fixes: e8db67eb ("mm: migrate: move_pages() supports thp migration")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Suggested-by: Huang Ying <ying.huang@intel.com>
      Reported-by: kernel test robot <lkp@intel.com> (build error)
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7ce82f4c
  3. 14 May 2022 (1 commit)
    • mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migration · 5d4af619
      Committed by Baolin Wang
      Some architectures (like ARM64) can support CONT-PTE/PMD size hugetlb,
      which means they support not only PMD/PUD size hugetlb (2M and 1G), but
      also CONT-PTE/PMD sizes (64K and 32M with a 4K base page size).
      
      When migrating a hugetlb page, we will get the relevant page table entry
      by huge_pte_offset() only once to nuke it and remap it with a migration
      pte entry.  This is correct for PMD or PUD size hugetlb, since they always
      contain only one pmd entry or pud entry in the page table.
      
      However, this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
      they can contain several contiguous pte or pmd entries with the same page
      table attributes.  So we will nuke or remap only one pte or pmd entry of
      such a CONT-PTE/PMD size hugetlb page, which is not what hugetlb
      migration expects.  The problem is that the subpages' data of a hugetlb
      page can still be modified while the page is being migrated, which can
      cause a serious data consistency issue, since we did not nuke the page
      table entries and set migration ptes for the subpages of the hugetlb page.
      
      To fix this issue, use huge_ptep_clear_flush() to nuke a hugetlb page
      table, and remap it with set_huge_pte_at() and set_huge_swap_pte_at()
      when migrating a hugetlb page; these already take the CONT-PTE and
      CONT-PMD size hugetlb cases into account.
      
      [akpm@linux-foundation.org: fix nommu build]
      [baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU]
        Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5d4af619
  4. 13 May 2022 (4 commits)
    • mm/hugetlb: handle uffd-wp during fork() · bc70fbf2
      Committed by Peter Xu
      Firstly, we'll need to pass dst_vma into copy_hugetlb_page_range(),
      because for uffd-wp it's the dst vma that matters when deciding how we
      should treat uffd-wp protected ptes.
      
      We should recognize pte markers during fork and do the pte copy if needed.
      
      [lkp@intel.com: vma_needs_copy can be static]
        Link: https://lkml.kernel.org/r/Ylb0CGeFJlc4EzLk@7ec4ff11d4ae
      Link: https://lkml.kernel.org/r/20220405014918.14932-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bc70fbf2
    • mm/hugetlb: only drop uffd-wp special pte if required · 05e90bd0
      Committed by Peter Xu
      As with shmem uffd-wp special ptes, only drop the uffd-wp special swap pte
      if unmapping an entire vma or synchronized such that faults can not race
      with the unmap operation.  This requires passing zap_flags all the way to
      the lowest level hugetlb unmap routine: __unmap_hugepage_range.
      
      In general, unmap calls originated in hugetlbfs code will pass the
      ZAP_FLAG_DROP_MARKER flag as synchronization is in place to prevent
      faults.  The exception is hole punch which will first unmap without any
      synchronization.  Later when hole punch actually removes the page from the
      file, it will check to see if there was a subsequent fault and if so take
      the hugetlb fault mutex while unmapping again.  This second unmap will
      pass in ZAP_FLAG_DROP_MARKER.
      
      The justification for "whether to apply the ZAP_FLAG_DROP_MARKER flag when
      unmapping a hugetlb range" is (IMHO): we should never reach a state where a
      page fault could erroneously fault in a write-protected page-cache page as
      writable, even for an extremely short period.  That
      could happen if e.g.  we pass ZAP_FLAG_DROP_MARKER when
      hugetlbfs_punch_hole() calls hugetlb_vmdelete_list(), because if a page
      faults after that call and before remove_inode_hugepages() is executed,
      the page cache can be mapped writable again in the small racy window, that
      can cause unexpected data overwritten.
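
      A hedged sketch of the resulting gate in the low-level unmap path;
      is_uffd_wp_marker() is a hypothetical stand-in for the real pte-marker
      test:

      ```
      /*
       * Illustrative only: called for a non-present entry found while zapping
       * a hugetlb range.  Returns true if the entry was handled here.
       */
      static bool zap_huge_uffd_wp_marker(struct mm_struct *mm, unsigned long addr,
                                          pte_t *ptep, pte_t pte, unsigned long sz,
                                          zap_flags_t zap_flags)
      {
              if (!is_uffd_wp_marker(pte))            /* hypothetical helper */
                      return false;
              if (zap_flags & ZAP_FLAG_DROP_MARKER)
                      huge_pte_clear(mm, addr, ptep, sz);
              /* otherwise keep the marker so a later fault still sees uffd-wp */
              return true;
      }
      ```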
      
      [peterx@redhat.com: fix sparse warning]
        Link: https://lkml.kernel.org/r/Ylcdw8I1L5iAoWhb@xz-m1.local
      [akpm@linux-foundation.org: move zap_flags_t from mm.h to mm_types.h to fix build issues]
      Link: https://lkml.kernel.org/r/20220405014915.14873-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      05e90bd0
    • mm/hugetlb: handle UFFDIO_WRITEPROTECT · 5a90d5a1
      Committed by Peter Xu
      This starts from passing cp_flags into hugetlb_change_protection() so
      hugetlb will be able to handle MM_CP_UFFD_WP[_RESOLVE] requests.
      
      huge_pte_clear_uffd_wp() is introduced to handle the case where the
      UFFDIO_WRITEPROTECT is requested upon migrating huge page entries.
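
      A hedged excerpt of the flag handling inside hugetlb_change_protection();
      the surrounding loop, locking and migration-entry handling are omitted:

      ```
      /* Illustrative only: apply or resolve uffd-wp on a present huge pte. */
      old_pte = huge_ptep_modify_prot_start(vma, address, ptep);
      pte = huge_pte_modify(old_pte, newprot);
      if (cp_flags & MM_CP_UFFD_WP)
              pte = huge_pte_mkuffd_wp(huge_pte_wrprotect(pte));
      else if (cp_flags & MM_CP_UFFD_WP_RESOLVE)
              pte = huge_pte_clear_uffd_wp(pte);
      huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte);
      ```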
      
      Link: https://lkml.kernel.org/r/20220405014906.14708-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5a90d5a1
    • mm/hugetlb: take care of UFFDIO_COPY_MODE_WP · 6041c691
      Committed by Peter Xu
      Pass the wp_copy variable into hugetlb_mcopy_atomic_pte() throughout the
      stack.  Apply the UFFD_WP bit if UFFDIO_COPY_MODE_WP is used with UFFDIO_COPY.
      
      Hugetlb pages are only managed by hugetlbfs, so we're safe even without
      setting dirty bit in the huge pte if the page is installed as read-only. 
      However we'd better still keep the dirty bit set for a read-only
      UFFDIO_COPY pte (when UFFDIO_COPY_MODE_WP bit is set), not only to match
      what we do with shmem, but also because the page does contain dirty data
      that the kernel just copied from the userspace.
      
      Link: https://lkml.kernel.org/r/20220405014904.14643-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6041c691
  5. 05 May 2022 (1 commit)
  6. 29 April 2022 (2 commits)
  7. 22 April 2022 (1 commit)
  8. 23 March 2022 (2 commits)
  9. 22 March 2022 (1 commit)
  10. 15 January 2022 (1 commit)
  11. 10 November 2021 (1 commit)
  12. 07 November 2021 (4 commits)
    • hugetlbfs: extend the definition of hugepages parameter to support node allocation · b5389086
      Committed by Zhenguo Yao
      We can specify the number of hugepages to allocate at boot, but the
      hugepages are currently balanced across all nodes.  In some scenarios we
      only need hugepages on one node.  For example: DPDK needs hugepages
      which are on the same node as the NIC.
      
      If DPDK needs four hugepages of 1G size on node1 and the system has 16
      NUMA nodes, we must reserve 64 hugepages on the kernel cmdline, but only
      four hugepages are used; the others should be freed after boot.  If
      system memory is low (for example, 64G), this becomes an impossible task.
      
      So extend the hugepages parameter to support specifying hugepages on a
      specific node.  For example, add the following parameter:
      
        hugepagesz=1G hugepages=0:1,1:3
      
      This allocates 1 hugepage on node0 and 3 hugepages on node1.
      
      Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
      Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5389086
    • mm, hugepages: add mremap() support for hugepage backed vma · 550a7d60
      Committed by Mina Almasry
      Support mremap() for hugepage backed vma segment by simply repositioning
      page table entries.  The page table entries are repositioned to the new
      virtual address on mremap().
      
      Hugetlb mremap() support is of course generic; my motivating use case is
      a library (hugepage_text), which reloads the ELF text of executables in
      hugepages.  This significantly increases the execution performance of
      said executables.
      
      Restrict the mremap operation on hugepages to up to the size of the
      original mapping as the underlying hugetlb reservation is not yet
      capable of handling remapping to a larger size.
      
      During the mremap() operation we detect pmd_share'd mappings and we
      unshare those during the mremap().  On access and fault the sharing is
      established again.
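
      For context, a small userspace sketch of the newly supported operation
      (error handling trimmed; it assumes huge pages have been preallocated,
      e.g. via vm.nr_hugepages):

      ```
      #define _GNU_SOURCE
      #include <stdio.h>
      #include <sys/mman.h>

      #define LEN (2UL * 1024 * 1024)         /* one 2MB huge page */

      int main(void)
      {
              void *old = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
              if (old == MAP_FAILED)
                      return 1;

              /*
               * Move the mapping: only page table entries are repositioned,
               * the huge page itself is not copied.  The kernel may return
               * the same address if no move is needed; MREMAP_FIXED can be
               * used to request a specific (huge-page aligned) destination.
               */
              void *new = mremap(old, LEN, LEN, MREMAP_MAYMOVE);
              if (new == MAP_FAILED)
                      return 1;

              printf("mapping now at %p (was %p)\n", new, old);
              munmap(new, LEN);
              return 0;
      }
      ```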
      
      Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      550a7d60
    • hugetlb: add demote hugetlb page sysfs interfaces · 79dfc695
      Committed by Mike Kravetz
      Patch series "hugetlb: add demote/split page functionality", v4.
      
      The concurrent use of multiple hugetlb page sizes on a single system is
      becoming more common.  One of the reasons is better TLB support for
      gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
      being used to back VMs in hosting environments.
      
      When using hugetlb pages to back VMs, it is often desirable to
      preallocate hugetlb pools.  This avoids the delay and uncertainty of
      allocating hugetlb pages at VM startup.  In addition, preallocating huge
      pages minimizes the issue of memory fragmentation that increases the
      longer the system is up and running.
      
      In such environments, a combination of larger and smaller hugetlb pages
      are preallocated in anticipation of backing VMs of various sizes.  Over
      time, the preallocated pool of smaller hugetlb pages may become depleted
      while larger hugetlb pages still remain.  In such situations, it is
      desirable to convert larger hugetlb pages to smaller hugetlb pages.
      
      Converting larger to smaller hugetlb pages can be accomplished today by
      first freeing the larger page to the buddy allocator and then allocating
      the smaller pages.  For example, to convert 50 GB pages on x86:
      
        gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
        m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
        echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
        echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
      
      On an idle system this operation is fairly reliable and results are as
      expected.  The number of 2MB pages is increased as expected and the time
      of the operation is a second or two.
      
      However, when there is activity on the system the following issues
      arise:
      
      1) This process can take quite some time, especially if allocation of
         the smaller pages is not immediate and requires migration/compaction.
      
      2) There is no guarantee that the total size of smaller pages allocated
         will match the size of the larger page which was freed. This is
         because the area freed by the larger page could quickly be
         fragmented.
      
      In a test environment with a load that continually fills the page cache
      with clean pages, results such as the following can be observed:
      
        Unexpected number of 2MB pages allocated: Expected 25600, have 19944
        real    0m42.092s
        user    0m0.008s
        sys     0m41.467s
      
      To address these issues, introduce the concept of hugetlb page demotion.
      Demotion provides a means of 'in place' splitting of a hugetlb page to
      pages of a smaller size.  This avoids freeing pages to buddy and then
      trying to allocate from buddy.
      
      Page demotion is controlled via sysfs files that reside in the per-hugetlb
      page size and per node directories.
      
       - demote_size
              Target page size for demotion, a smaller huge page size. File
              can be written to choose a smaller huge page size if multiple are
              available.
      
       - demote
              Writable number of hugetlb pages to be demoted
      
      To demote 50 GB huge pages, one would:
      
        cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
        cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
        echo 50 > .../hugepages-1048576kB/demote
      
      Only hugetlb pages which are free at the time of the request can be
      demoted.  Demotion does not add to the complexity of surplus pages and
      honors reserved huge pages.  Therefore, when a value is written to the
      sysfs demote file, that value is only the maximum number of pages which
      will be demoted.  It is possible fewer will actually be demoted.  The
      recently introduced per-hstate mutex is used to synchronize demote
      operations with other operations that modify hugetlb pools.
      
      Real world use cases
      --------------------
      The above scenario describes a real world use case where hugetlb pages
      are used to back VMs on x86.  Both issues of long allocation times and
      not necessarily getting the expected number of smaller huge pages after
      a free and allocate cycle have been experienced.  The occurrence of
      these issues is dependent on other activity within the host and can not
      be predicted.
      
      This patch (of 5):
      
      Two new sysfs files are added to demote hugetlb pages.  These files are
      both per-hugetlb page size and per node.  Files are:
      
        demote_size - The size in Kb that pages are demoted to. (read-write)
        demote - The number of huge pages to demote. (write-only)
      
      By default, demote_size is the next smallest huge page size.  Valid huge
      page sizes less than the current huge page size may be written to this
      file.  When
      huge pages are demoted, they are demoted to this size.
      
      Writing a value to demote will result in an attempt to demote that
      number of hugetlb pages to an appropriate number of demote_size pages.
      
      NOTE: Demote interfaces are only provided for huge page sizes if there
      is a smaller target demote huge page size.  For example, on x86 1GB huge
      pages will have demote interfaces.  2MB huge pages will not have demote
      interfaces.
      
      This patch does not provide full demote functionality.  It only provides
      the sysfs interfaces.
      
      It also provides documentation for the new interfaces.
      
      [mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
        Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79dfc695
    • mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h · 73c54763
      Committed by Peter Xu
      Remove __unmap_hugepage_range() from the header file, because it is only
      used in hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73c54763
  13. 09 September 2021 (1 commit)
  14. 09 July 2021 (1 commit)
  15. 01 July 2021 (7 commits)
  16. 25 June 2021 (1 commit)
    • mm, futex: fix shared futex pgoff on shmem huge page · fe19bd3d
      Committed by Hugh Dickins
      If more than one futex is placed on a shmem huge page, it can happen
      that waking the second wakes the first instead, and leaves the second
      waiting: the key's shared.pgoff is wrong.
      
      When 3.11 commit 13d60f4b ("futex: Take hugepages into account when
      generating futex_key") was added, the only shared huge pages came from
      hugetlbfs, and the code added to deal with its exceptional page->index
      was put into hugetlb source.  Then that was missed when 4.8 added shmem
      huge pages.
      
      page_to_pgoff() is what others use for this nowadays: except that, as
      currently written, it gives the right answer on hugetlbfs head, but
      nonsense on hugetlbfs tails.  Fix that by calling hugetlbfs-specific
      hugetlb_basepage_index() on PageHuge tails as well as on head.
      
      Yes, it's unconventional to declare hugetlb_basepage_index() there in
      pagemap.h, rather than in hugetlb.h; but I do not expect anything but
      page_to_pgoff() ever to need it.
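
      A hedged sketch of the resulting helper in pagemap.h; the fallback to
      page_to_index() reflects the existing generic path:

      ```
      /* Sketch: hugetlbfs pages (head or tail) get their base-page index. */
      static inline pgoff_t page_to_pgoff(struct page *page)
      {
              if (unlikely(PageHuge(page)))
                      return hugetlb_basepage_index(page);
              return page_to_index(page);
      }
      ```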
      
      [akpm@linux-foundation.org: give hugetlb_basepage_index() prototype the correct scope]
      
      Link: https://lkml.kernel.org/r/b17d946b-d09-326e-b42a-52884c36df32@google.com
      Fixes: 800d8c63 ("shmem: add huge pages support")
      Reported-by: Neel Natu <neelnatu@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Zhang Yi <wetpzy@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <dvhart@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fe19bd3d
  17. 17 June 2021 (2 commits)
    • mm/hugetlb: expand restore_reserve_on_error functionality · 846be085
      Committed by Mike Kravetz
      The routine restore_reserve_on_error is called to restore reservation
      information when an error occurs after page allocation.  The routine
      alloc_huge_page modifies the mapping reserve map and potentially the
      reserve count during allocation.  If code calling alloc_huge_page
      encounters an error after allocation and needs to free the page, the
      reservation information needs to be adjusted.
      
      Currently, restore_reserve_on_error only takes action on pages for which
      the reserve count was adjusted (HPageRestoreReserve flag).  There is
      nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
      modifies the reserve map during allocation even if the reserve count is
      not adjusted.  This can cause issues as observed during development of
      this patch [1].
      
      One specific series of operations causing an issue is:
      
       - Create a shared hugetlb mapping
         Reservations for all pages created by default
      
       - Fault in a page in the mapping
         Reservation exists so reservation count is decremented
      
       - Punch a hole in the file/mapping at index previously faulted
         Reservation and any associated pages will be removed
      
       - Allocate a page to fill the hole
         No reservation entry, so reserve count unmodified
         Reservation entry added to map by alloc_huge_page
      
       - Error after allocation and before instantiating the page
         Reservation entry remains in map
      
       - Allocate a page to fill the hole
         Reservation entry exists, so decrement reservation count
      
      This will cause a reservation count underflow as the reservation count
      was decremented twice for the same index.
      
      A user would observe a very large number for HugePages_Rsvd in
      /proc/meminfo.  This would also likely cause subsequent allocations of
      hugetlb pages to fail as it would 'appear' that all pages are reserved.
      
      This sequence of operations is unlikely to happen, however they were
      easily reproduced and observed using hacked up code as described in [1].
      
      Address the issue by having the routine restore_reserve_on_error take
      action on pages where HPageRestoreReserve is not set.  In this case, we
      need to remove any reserve map entry created by alloc_huge_page.  A new
      helper routine vma_del_reservation assists with this operation.
      
      There are three callers of alloc_huge_page which do not currently call
      restore_reserve_on_error before freeing a page on error paths.  Add
      those missing calls.
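
      An illustrative error-path shape for such callers of alloc_huge_page();
      the failure condition and return values are assumptions standing in for
      the real call sites:

      ```
      /* Illustrative only: undo reserve-map changes before freeing the page. */
      struct hstate *h = hstate_vma(vma);
      struct page *page;

      page = alloc_huge_page(vma, haddr, 0);
      if (IS_ERR(page))
              return VM_FAULT_OOM;

      if (setup_failed) {                     /* hypothetical error condition */
              restore_reserve_on_error(h, vma, haddr, page);
              put_page(page);
              return VM_FAULT_SIGBUS;
      }
      ```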
      
      [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
      
      Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
      Fixes: 96b96a96 ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mina Almasry <almasrymina@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      846be085
    • mm,hwpoison: fix race with hugetlb page allocation · 25182f05
      Committed by Naoya Horiguchi
      When hugetlb page fault (under overcommitting situation) and
      memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following
      race:
      
          CPU0:                           CPU1:
      
                                          gather_surplus_pages()
                                            page = alloc_surplus_huge_page()
          memory_failure_hugetlb()
            get_hwpoison_page(page)
              __get_hwpoison_page(page)
                get_page_unless_zero(page)
                                            zero = put_page_testzero(page)
                                            VM_BUG_ON_PAGE(!zero, page)
                                            enqueue_huge_page(h, page)
            put_page(page)
      
      __get_hwpoison_page() only checks the page refcount before taking an
      additional one for memory error handling, which is not enough because
      there's a time window where compound pages have non-zero refcount during
      hugetlb page initialization.
      
      So make __get_hwpoison_page() check page status a bit more for hugetlb
      pages with get_hwpoison_huge_page().  Checking hugetlb-specific flags
      under hugetlb_lock makes sure that the hugetlb page is not transitive.
      It's notable that another new function, HWPoisonHandlable(), is helpful
      to prevent a race against other transitive page states (like a generic
      compound page just before PageHuge becomes true).
      
      Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
      Fixes: ead07f6a ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reported-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: <stable@vger.kernel.org>	[5.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25182f05
  18. 06 May 2021 (5 commits)
    • userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Committed by Axel Rasmussen
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
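
      For orientation, a hedged userspace fragment showing how a minor fault
      would typically be resolved with the new ioctl; the uffd descriptor and
      fault address come from the usual userfaultfd setup, which is not shown:

      ```
      #include <linux/userfaultfd.h>
      #include <sys/ioctl.h>

      /*
       * Illustrative only: after fixing up the page contents through the
       * second (non-UFFD) mapping, tell the kernel to finish installing the
       * page table mapping for the faulting range.
       */
      static int resolve_minor_fault(int uffd, unsigned long addr, unsigned long len)
      {
              struct uffdio_continue cont = {
                      .range = { .start = addr, .len = len },
                      .mode  = 0,
              };

              return ioctl(uffd, UFFDIO_CONTINUE, &cont);
      }
      ```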
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6191471
    • userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled · 714c1891
      Committed by Axel Rasmussen
      For background, mm/userfaultfd.c provides a general mcopy_atomic
      implementation.  But some types of memory (i.e., hugetlb and shmem) need
      a slightly different implementation, so they provide their own helpers
      for this.  In other words, userfaultfd is the only caller of these
      functions.
      
      This patch achieves two things:
      
      1. Don't spend time compiling code which will end up never being
         referenced anyway (a small build time optimization).
      
      2. In patches later in this series, we extend the signature of these
         helpers with UFFD-specific state (a mode enumeration).  Once this
         happens, we *have to* either not compile the helpers, or
         unconditionally define the UFFD-only state (which seems messier to me).
         This includes the declarations in the headers, as otherwise they'd
         yield warnings about implicitly defining the type of those arguments.
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      714c1891
    • mm: make alloc_contig_range handle in-use hugetlb pages · ae37c7ff
      Committed by Oscar Salvador
      alloc_contig_range() will fail if it finds a HugeTLB page within the
      range, without a chance to handle it.  Since HugeTLB pages can be
      migrated like any LRU or Movable page, it does not make sense to bail
      out without trying.  Enable the interface to recognize in-use HugeTLB
      pages so we can migrate them, giving the call much better chances to
      succeed.
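
      A hedged sketch of the call-site shape; isolate_or_dissolve_huge_page()
      is the helper added by this series, and the migratepages list here
      stands in for the scanner's own list:

      ```
      /* Illustrative only: handle a hugetlb page found while scanning. */
      if (PageHuge(page)) {
              if (isolate_or_dissolve_huge_page(page, &cc->migratepages))
                      return -EBUSY;  /* could neither isolate nor dissolve */
              /* page is now queued for migration, or was free and dissolved */
      }
      ```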
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae37c7ff
    • mm: make alloc_contig_range handle free hugetlb pages · 369fa227
      Committed by Oscar Salvador
      alloc_contig_range will fail if it ever sees a HugeTLB page within the
      range we are trying to allocate, even when that page is free and can be
      easily reallocated.
      
      This has proved to be problematic for some users of alloc_contig_range,
      e.g.: CMA and virtio-mem, which would fail the call even when those
      pages lie in ZONE_MOVABLE and are free.
      
      We can do better by trying to replace such page.
      
      Free hugepages are tricky to handle: so that no userspace application
      notices disruption, we need to replace the current free hugepage with a
      new one.
      
      In order to do that, a new function called alloc_and_dissolve_huge_page is
      introduced.  This function will first try to get a new fresh hugepage, and
      if it succeeds, it will replace the old one in the free hugepage pool.
      
      The free page replacement is done under hugetlb_lock, so no external users
      of hugetlb will notice the change.  To allocate the new huge page, we use
      alloc_buddy_huge_page(), so we do not have to deal with any counters, and
      prep_new_huge_page() is not called.  This is valuable because in case we
      need to free the new page, we only need to call __free_pages().
      
      Once we know that the page to be replaced is a genuine 0-refcounted huge
      page, we remove the old page from the freelist by remove_hugetlb_page().
      Then, we can call __prep_new_huge_page() and
      __prep_account_new_huge_page() for the new huge page to properly
      initialize it and increment the hstate->nr_huge_pages counter (previously
      decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
      enqueue_huge_page() and it is ready to be used.
      
      There is one tricky case when page's refcount is 0 because it is in the
      process of being released.  A missing PageHugeFreed bit will tell us that
      freeing is in flight so we retry after dropping the hugetlb_lock.  The
      race window should be small and the next retry should make a forward
      progress.
      
      E.g:
      
      CPU0				CPU1
      free_huge_page()		isolate_or_dissolve_huge_page
      				  PageHuge() == T
      				  alloc_and_dissolve_huge_page
      				    alloc_buddy_huge_page()
      				    spin_lock_irq(hugetlb_lock)
      				    // PageHuge() && !PageHugeFreed &&
      				    // !PageCount()
      				    spin_unlock_irq(hugetlb_lock)
        spin_lock_irq(hugetlb_lock)
        1) update_and_free_page
             PageHuge() == F
             __free_pages()
        2) enqueue_huge_page
             SetPageHugeFreed()
        spin_unlock_irq(&hugetlb_lock)
      				  spin_lock_irq(hugetlb_lock)
                                         1) PageHuge() == F (freed by case#1 from CPU0)
      				   2) PageHuge() == T
                                             PageHugeFreed() == T
                                             - proceed with replacing the page
      
      In the case above we retry as the window race is quite small and we have
      high chances to succeed next time.
      
      With regard to the allocation, we restrict it to the node the page belongs
      to with __GFP_THISNODE, meaning we do not fallback on other node's zones.
      
      Note that gigantic hugetlb pages are fenced off since there is a cyclic
      dependency between them and alloc_contig_range.
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      369fa227
    • hugetlb: add per-hstate mutex to synchronize user adjustments · 29383967
      Committed by Mike Kravetz
      The helper routine hstate_next_node_to_alloc accesses and modifies the
      hstate variable next_nid_to_alloc.  The helper is used by the routines
      alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
      called with hugetlb_lock held.  However, alloc_pool_huge_page can not be
      called with the hugetlb lock held as it will call the page allocator.
      Two instances of alloc_pool_huge_page could be run in parallel or
      alloc_pool_huge_page could run in parallel with adjust_pool_surplus
      which may result in the variable next_nid_to_alloc becoming invalid for
      the caller and pages being allocated on the wrong node.
      
      Both alloc_pool_huge_page and adjust_pool_surplus are only called from
      the routine set_max_huge_pages after boot.  set_max_huge_pages is only
      called as the result of a user writing to the proc/sysfs nr_hugepages or
      nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
      
      It makes little sense to allow multiple adjustment to the number of
      hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
      allow one hugetlb page adjustment at a time.  This will synchronize
      modifications to the next_nid_to_alloc variable.
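
      A minimal sketch of the synchronization this adds; the field name
      resize_lock is an assumption here, the point is that set_max_huge_pages()
      serializes on a per-hstate mutex:

      ```
      /* Sketch only: one mutex per hstate serializes pool adjustments. */
      struct hstate {
              struct mutex resize_lock;       /* assumed field name */
              int next_nid_to_alloc;
              /* ... existing fields ... */
      };

      static int set_max_huge_pages(struct hstate *h, unsigned long count,
                                    int nid, nodemask_t *nodes_allowed)
      {
              mutex_lock(&h->resize_lock);
              /* grow or shrink the pool; next_nid_to_alloc is now stable */
              mutex_unlock(&h->resize_lock);
              return 0;
      }
      ```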
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29383967