1. 01 Dec, 2022 (9 commits)
  2. 23 Nov, 2022 (1 commit)
  3. 09 Nov, 2022 (5 commits)
  4. 21 Oct, 2022 (2 commits)
    • hugetlb: fix memory leak associated with vma_lock structure · 612b8a31
      Mike Kravetz authored
      The hugetlb vma_lock structure hangs off the vm_private_data pointer of
      sharable hugetlb vmas.  The structure is vma specific and can not be
      shared between vmas.  At fork and various other times, vmas are duplicated
      via vm_area_dup().  When this happens, the pointer in the newly created
      vma must be cleared and the structure reallocated.  Two hugetlb specific
      routines deal with this: hugetlb_dup_vma_private and hugetlb_vm_op_open.
      Both routines are called for newly created vmas.  hugetlb_dup_vma_private
      would always clear the pointer and hugetlb_vm_op_open would allocate the
      new vma_lock structure.  This did not work in the case of the calling
      sequence pointed out in [1].
      
        move_vma
          copy_vma
            new_vma = vm_area_dup(vma);
            new_vma->vm_ops->open(new_vma); --> new_vma has its own vma lock.
          is_vm_hugetlb_page(vma)
            clear_vma_resv_huge_pages
              hugetlb_dup_vma_private --> vma->vm_private_data is set to NULL
      
      When hugetlb_dup_vma_private clears the pointer in this sequence, the
      associated vma_lock structure is leaked.
      
      The vma_lock structure contains a pointer to the associated vma.  This
      information can be used in hugetlb_dup_vma_private and hugetlb_vm_op_open
      to ensure we only clear the vm_private_data of newly created (copied)
      vmas.  In such cases, the vma->vma_lock->vma field will not point to the
      vma.
      
      Update hugetlb_dup_vma_private and hugetlb_vm_op_open to not clear
      vm_private_data if vma->vma_lock->vma == vma.  Also, log a warning if
      hugetlb_vm_op_open ever encounters the case where vma_lock has already
      been correctly allocated for the vma.
      
      [1] https://lore.kernel.org/linux-mm/5154292a-4c55-28cd-0935-82441e512fc3@huawei.com/
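      
      A minimal sketch of the check described above; the structure and field
      names follow the commit text (vma->vm_private_data, vma_lock->vma) and
      may not match the final code exactly:
      
          /*
           * Sketch only: clear vm_private_data just for pointers that were
           * copied from another vma, i.e. locks that do not point back at us.
           */
          void hugetlb_dup_vma_private(struct vm_area_struct *vma)
          {
                  if (vma->vm_flags & VM_MAYSHARE) {
                          struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
      
                          /* Only a copied pointer has vma_lock->vma != vma. */
                          if (vma_lock && vma_lock->vma != vma)
                                  vma->vm_private_data = NULL;
                  } else {
                          vma->vm_private_data = NULL;
                  }
          }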
      
      Link: https://lkml.kernel.org/r/20221019201957.34607-1-mike.kravetz@oracle.com
      Fixes: 131a79b4 ("hugetlb: fix vma lock handling during split vma and range unmapping")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      612b8a31
    • mm,hugetlb: take hugetlb_lock before decrementing h->resv_huge_pages · 12df140f
      Rik van Riel authored
      The h->*_huge_pages counters are protected by the hugetlb_lock, but
      alloc_huge_page has a corner case where it can decrement the counter
      outside of the lock.
      
      This could lead to a corrupted value of h->resv_huge_pages, which we have
      observed on our systems.
      
      Take the hugetlb_lock before decrementing h->resv_huge_pages to avoid a
      potential race.
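      
      A sketch of the ordering change described above; the helper and flag
      names are taken from a reading of alloc_huge_page() around this time and
      are illustrative rather than authoritative:
      
          /* Surplus-allocation fallback path in alloc_huge_page() (sketch). */
          page = alloc_buddy_huge_page_with_mpol(h, vma, addr);
          if (!page)
                  goto out_uncharge_cgroup;
      
          spin_lock_irq(&hugetlb_lock);           /* take the lock first ... */
          if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
                  SetHPageRestoreReserve(page);
                  h->resv_huge_pages--;           /* ... so this is protected */
          }
          list_add(&page->lru, &h->hugepage_activelist);
          spin_unlock_irq(&hugetlb_lock);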
      
      Link: https://lkml.kernel.org/r/20221017202505.0e6a4fcd@imladris.surriel.com
      Fixes: a88c7695 ("mm: hugetlb: fix hugepage memory leak caused by wrong reserve count")
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Glen McCready <gkmccready@meta.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      12df140f
  5. 13 Oct, 2022 (4 commits)
    • mm/hugetlb: use hugetlb_pte_stable in migration race check · f9bf6c03
      Peter Xu authored
      Now that hugetlb_pte_stable() has been introduced, the migration race
      check against page allocation can also be rewritten to use the new
      helper.
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f9bf6c03
    • mm/hugetlb: fix race condition of uffd missing/minor handling · 2ea7ff1e
      Peter Xu authored
      Patch series "mm/hugetlb: Fix selftest failures with write check", v3.
      
      Currently, akpm's mm-unstable tree randomly fails the uffd hugetlb
      private mapping test on a write check.
      
      The initial bisection pointed to the recent pmd unshare series, but it
      turns out there is no direct relationship with that series; it only
      changed timing enough for the race to start triggering.
      
      The race should be fixed in patch 1.  Patch 2 is a trivial cleanup of
      the similar race with hugetlb migrations, and patch 3 comments on the
      write check so that when anyone reads it again it will be clear why it
      is there.
      
      
      This patch (of 3):
      
      After the recent rework patchset of hugetlb locking on pmd sharing,
      kselftest for userfaultfd sometimes fails on hugetlb private tests with
      unexpected write fault checks.
      
      It turns out there's nothing wrong within the locking series regarding
      this matter, but it could have changed the timing of threads so it can
      trigger an old bug.
      
      The real bug is that when we call hugetlb_no_page() we are not holding
      the pgtable lock, which means we are reading the pte values locklessly.
      That is perfectly fine in most cases, because before doing normal page
      allocations we will take the lock and check pte_same() again.  However,
      before that there are two paths in the userfaultfd missing/minor
      handling that may directly move on with the fault process without
      rechecking the pte values.
      
      It means for these two paths we may be generating an uffd message based on
      an unstable pte, while an unstable pte can legally be anything as long as
      the modifier holds the pgtable lock.
      
      One example, which is also what happened in the failing kselftest and
      caused the test failure, is that for private mappings wr-protection
      changes can happen on one page.  Since hugetlb_change_protection()
      generally requires the pte to be cleared before being changed, there can
      be a race condition like:
      
              thread 1                              thread 2
              --------                              --------
      
            UFFDIO_WRITEPROTECT                     hugetlb_fault
              hugetlb_change_protection
                pgtable_lock()
                huge_ptep_modify_prot_start
                                                    pte==NULL
                                                    hugetlb_no_page
                                                      generate uffd missing event
                                                      even if page existed!!
                huge_ptep_modify_prot_commit
                pgtable_unlock()
      
      Fix this by rechecking the pte after taking the pgtable lock, for both
      the userfaultfd missing and minor fault paths.
      
      This bug should have been around since uffd hugetlb support was
      introduced, so attach a Fixes tag to that commit.  Also attach another
      Fixes tag to the minor fault support commit for easier tracking.
      
      Note that userfaultfd is actually fine with false positives (e.g. caused
      by a pte change), but not with wrong logical events (e.g. caused by
      reading a pte while it is being changed).  The latter can confuse
      userspace, so the strictness is very much preferred.  E.g., a MISSING
      event should never happen on a page after UFFDIO_COPY has correctly
      installed the page and returned.
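      
      The recheck described above is essentially "take the page table lock and
      compare against the pte we sampled"; a sketch of such a
      hugetlb_pte_stable()-style helper (treat the exact signature as
      illustrative):
      
          static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
                                         pte_t *ptep, pte_t old_pte)
          {
                  spinlock_t *ptl;
                  bool same;
      
                  ptl = huge_pte_lock(h, mm, ptep);
                  same = pte_same(huge_ptep_get(ptep), old_pte);
                  spin_unlock(ptl);
      
                  return same;
          }
      
      The missing/minor paths then call such a helper just before generating a
      uffd event and retry the fault if the pte changed under them.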
      
      Link: https://lkml.kernel.org/r/20221004193400.110155-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20221004193400.110155-2-peterx@redhat.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Fixes: 7677f7fd ("userfaultfd: add minor fault registration mode")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2ea7ff1e
    • mm/uffd: fix warning without PTE_MARKER_UFFD_WP compiled in · 515778e2
      Peter Xu authored
      When PTE_MARKER_UFFD_WP is not configured, it is still possible to reach
      the pte marker code and trigger a warning.  Add a few
      CONFIG_PTE_MARKER_UFFD_WP ifdefs to make sure that code is not reached
      when the option is not compiled in.
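      
      A schematic example of the kind of guard described above; the
      surrounding condition and local variable are hypothetical, only the
      config symbol and marker helpers come from the kernel:
      
          #ifdef CONFIG_PTE_MARKER_UFFD_WP
                  /* Only reachable when pte markers for uffd-wp exist. */
                  if (uffd_wp)
                          set_huge_pte_at(mm, addr, ptep,
                                          make_pte_marker(PTE_MARKER_UFFD_WP));
          #endif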
      
      Link: https://lkml.kernel.org/r/YzeR+R6b4bwBlBHh@x1n
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reported-by: <syzbot+2b9b4f0895be09a6dec3@syzkaller.appspotmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Edward Liaw <edliaw@google.com>
      Cc: Liu Shixin <liushixin2@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      515778e2
    • mm/hugetlb.c: make __hugetlb_vma_unlock_write_put() static · acfac378
      Andrew Morton authored
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      acfac378
  6. 12 Oct, 2022 (1 commit)
    • mm/hugetlb: fix races when looking up a CONT-PTE/PMD size hugetlb page · fac35ba7
      Baolin Wang authored
      Some architectures (like ARM64) can support CONT-PTE/PMD size hugetlb,
      which means they can support not only PMD/PUD size hugetlb (2M and 1G),
      but also CONT-PTE/PMD sizes (64K and 32M) when a 4K base page size is
      used.
      
      So when looking up a CONT-PTE size hugetlb page by follow_page(), it will
      use pte_offset_map_lock() to get the pte entry lock for the CONT-PTE size
      hugetlb in follow_page_pte().  However this pte entry lock is incorrect
      for the CONT-PTE size hugetlb, since we should use huge_pte_lock() to get
      the correct lock, which is mm->page_table_lock.
      
      That means the pte entry of the CONT-PTE size hugetlb is unstable under
      the current pte lock in follow_page_pte(): other paths can still migrate
      or poison the pte entry of the CONT-PTE size hugetlb, which can cause
      potential race issues even though they are nominally under the 'pte
      lock'.
      
      For example, suppose thread A is trying to look up a CONT-PTE size
      hugetlb page by the move_pages() syscall under the lock, while another
      thread B migrates the CONT-PTE hugetlb page at the same time.  This
      causes thread A to get an incorrect page, and if thread A also wants to
      do page migration, a data inconsistency error occurs.
      
      Moreover we have the same issue for CONT-PMD size hugetlb in
      follow_huge_pmd().
      
      To fix above issues, rename the follow_huge_pmd() as follow_huge_pmd_pte()
      to handle PMD and PTE level size hugetlb, which uses huge_pte_lock() to
      get the correct pte entry lock to make the pte entry stable.
      
      Mike said:
      
      Support for CONT_PMD/_PTE was added with bb9dd3df ("arm64: hugetlb:
      refactor find_num_contig()"), part of the patch series "Support for
      contiguous pte hugepages", v4.  However, I do not believe these code
      paths were executed until migration support was added with 5480280d
      ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages"),
      so I would go with 5480280d for the Fixes: target.
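      
      A sketch of the locking difference the patch describes, in the shape of
      the renamed follow_huge_pmd_pte(); the body is abbreviated and
      illustrative:
      
          struct page *follow_huge_pmd_pte(struct vm_area_struct *vma,
                                           unsigned long address, int flags)
          {
                  struct hstate *h = hstate_vma(vma);
                  struct mm_struct *mm = vma->vm_mm;
                  spinlock_t *ptl;
                  pte_t *ptep, pte;
      
                  ptep = huge_pte_offset(mm, address, huge_page_size(h));
                  if (!ptep)
                          return NULL;
      
                  /* huge_pte_lock() picks the correct lock for this hugetlb
                   * size, e.g. mm->page_table_lock for CONT-PTE/PMD sizes. */
                  ptl = huge_pte_lock(h, mm, ptep);
                  pte = huge_ptep_get(ptep);
                  /* ... pte is now stable: look up and reference the page ... */
                  spin_unlock(ptl);
                  return NULL;    /* placeholder */
          }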
      
      Link: https://lkml.kernel.org/r/635f43bdd85ac2615a58405da82b4d33c6e5eb05.1662017562.git.baolin.wang@linux.alibaba.com
      Fixes: 5480280d ("arm64/mm: enable HugeTLB migration for contiguous bit HugeTLB pages")
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fac35ba7
  7. 08 Oct, 2022 (3 commits)
    • hugetlb: allocate vma lock for all sharable vmas · bbff39cc
      Mike Kravetz authored
      The hugetlb vma lock was originally designed to synchronize pmd sharing. 
      As such, it was only necessary to allocate the lock for vmas that were
      capable of pmd sharing.  Later in the development cycle, it was discovered
      that it could also be used to simplify fault/truncation races as described
      in [1].  However, a subsequent change to allocate the lock for all vmas
      that use the page cache was never made.  A fault/truncation race could
      leave pages in a file past i_size until the file is removed.
      
      Remove the previous restriction and allocate the lock for all
      VM_MAYSHARE vmas.  Warn in the unlikely event of allocation failure.
      
      [1] https://lore.kernel.org/lkml/Yxiv0SkMkZ0JWGGp@monkey/#t
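      
      A sketch of the allocation rule described above; names mirror the vma
      lock series but should be read as illustrative:
      
          static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma)
          {
                  struct hugetlb_vma_lock *vma_lock;
      
                  /* Previously gated on pmd-shareable ranges; now keyed only
                   * on VM_MAYSHARE (and not already having a lock). */
                  if (!(vma->vm_flags & VM_MAYSHARE) || vma->vm_private_data)
                          return;
      
                  vma_lock = kmalloc(sizeof(*vma_lock), GFP_KERNEL);
                  if (!vma_lock) {
                          pr_warn_once("HugeTLB: unable to allocate vma lock\n");
                          return;
                  }
      
                  kref_init(&vma_lock->refs);
                  init_rwsem(&vma_lock->rw_sema);
                  vma_lock->vma = vma;
                  vma->vm_private_data = vma_lock;
          }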
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-4-mike.kravetz@oracle.com
      Fixes: "hugetlb: clean up code checking for fault/truncation races"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bbff39cc
    • hugetlb: take hugetlb vma_lock when clearing vma_lock->vma pointer · ecfbd733
      Mike Kravetz authored
      hugetlb file truncation/hole punch code may need to back out and re-take
      locks in the correct order in the routine hugetlb_unmap_file_folio().
      This code could race with vma freeing as pointed out in [1] and end up
      accessing a stale vma pointer.  To address this, take the vma_lock when
      clearing the vma_lock->vma pointer.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
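      
      A sketch of the ordering described above, using the
      __hugetlb_vma_unlock_write_put() helper mentioned elsewhere in this log
      (details illustrative):
      
          /* Caller already holds vma_lock->rw_sema in write mode. */
          static void __hugetlb_vma_unlock_write_put(struct hugetlb_vma_lock *vma_lock)
          {
                  struct vm_area_struct *vma = vma_lock->vma;
      
                  /* Detach while holding the lock so a racing
                   * hugetlb_unmap_file_folio() cannot see a stale vma. */
                  vma_lock->vma = NULL;
                  vma->vm_private_data = NULL;
                  up_write(&vma_lock->rw_sema);
                  kref_put(&vma_lock->refs, hugetlb_vma_lock_release);
          }
      
          /* e.g. when freeing the lock: */
          down_write(&vma_lock->rw_sema);
          __hugetlb_vma_unlock_write_put(vma_lock);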
      
      [mike.kravetz@oracle.com: address build issues]
        Link: https://lkml.kernel.org/r/Yz5L1uxQYR1VqFtJ@monkey
      Link: https://lkml.kernel.org/r/20221005011707.514612-3-mike.kravetz@oracle.com
      Fixes: "hugetlb: use new vma_lock for pmd sharing synchronization"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ecfbd733
    • hugetlb: fix vma lock handling during split vma and range unmapping · 131a79b4
      Mike Kravetz authored
      Patch series "hugetlb: fixes for new vma lock series".
      
      In review of the series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", Miaohe Lin pointed out two key issues:
      
      1) There is a race in the routine hugetlb_unmap_file_folio when locks
         are dropped and reacquired in the correct order [1].
      
      2) With the switch to using vma lock for fault/truncate synchronization,
         we need to make sure lock exists for all VM_MAYSHARE vmas, not just
         vmas capable of pmd sharing.
      
      These two issues are addressed here.  In addition, having a vma lock
      present in all VM_MAYSHARE vmas, uncovered some issues around vma
      splitting.  Those are also addressed.
      
      [1] https://lore.kernel.org/linux-mm/01f10195-7088-4462-6def-909549c75ef4@huawei.com/
      
      
      This patch (of 3):
      
      The hugetlb vma lock hangs off the vm_private_data field and is specific
      to the vma.  When vm_area_dup() is called as part of vma splitting, the
      vma lock pointer is copied to the new vma.  This will result in issues
      such as double freeing of the structure.  Update the hugetlb open vm_ops
      to allocate a new vma lock for the new vma.
      
      The routine __unmap_hugepage_range_final unconditionally unsets
      VM_MAYSHARE to prevent subsequent pmd sharing.  hugetlb_vma_lock_free
      attempted to anticipate this by checking both VM_MAYSHARE and VM_SHARED.
      However, if only VM_MAYSHARE was set we would miss the free.  With the
      introduction of the vma lock, a vma can not participate in pmd sharing
      if vm_private_data is NULL.  Instead of clearing VM_MAYSHARE in
      __unmap_hugepage_range_final, free the vma lock to prevent sharing.
      Also, update the sharing code to make sure the vma lock is indeed a
      condition for pmd sharing.  hugetlb_vma_lock_free can then key off
      VM_MAYSHARE and not miss any vmas.
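      
      A sketch of the tail of __unmap_hugepage_range_final() after this
      change; the exact call sequence is illustrative:
      
          __unmap_hugepage_range(tlb, vma, start, end, ref_page, zap_flags);
      
          /*
           * Free the vma lock instead of clearing VM_MAYSHARE: with
           * vm_private_data now NULL this vma can no longer participate in
           * pmd sharing, and hugetlb_vma_lock_free() can key off VM_MAYSHARE
           * without missing any vmas.
           */
          hugetlb_vma_unlock_write(vma);
          hugetlb_vma_lock_free(vma);
          i_mmap_unlock_write(vma->vm_file->f_mapping);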
      
      Link: https://lkml.kernel.org/r/20221005011707.514612-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221005011707.514612-2-mike.kravetz@oracle.com
      Fixes: "hugetlb: add vma based lock for pmd sharing"
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      131a79b4
  8. 04 Oct, 2022 (15 commits)
    • mm/hugetlb: add available_huge_pages() func · 8346d69d
      Xin Hao authored
      In hugetlb.c there are several places which compare the values of
      'h->free_huge_pages' and 'h->resv_huge_pages'.  It looks a bit messy, so
      add a new available_huge_pages() function to do the comparison.
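      
      The helper is essentially a one-line comparison; a sketch:
      
          /* Free pages that are not already spoken for by reservations. */
          static inline bool available_huge_pages(struct hstate *h)
          {
                  return h->free_huge_pages - h->resv_huge_pages > 0;
          }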
      
      Link: https://lkml.kernel.org/r/20220922021929.98961-1-xhao@linux.alibaba.com
      Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8346d69d
    • mm: hugetlb: fix UAF in hugetlb_handle_userfault · 958f32ce
      Liu Shixin authored
      The vma_lock and hugetlb_fault_mutex are dropped before handling the
      userfault and reacquired again after handle_userfault(), but reacquiring
      the vma_lock could lead to a UAF [1,2] due to the following race:
      
      hugetlb_fault
        hugetlb_no_page
          /*unlock vma_lock */
          hugetlb_handle_userfault
            handle_userfault
              /* unlock mm->mmap_lock*/
                                                 vm_mmap_pgoff
                                                   do_mmap
                                                     mmap_region
                                                       munmap_vma_range
                                                         /* clean old vma */
              /* lock vma_lock again  <--- UAF */
          /* unlock vma_lock */
      
      Since the vma_lock would be unlocked immediately after
      hugetlb_handle_userfault() anyway, let's drop the unneeded lock and
      unlock in hugetlb_handle_userfault() to fix the issue.
      
      [1] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@google.com/
      [2] https://lore.kernel.org/linux-mm/20220921014457.1668-1-liuzixian4@huawei.com/
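      
      A sketch of hugetlb_handle_userfault() after this change: both locks are
      released here and never re-taken, since the vma may be gone once
      handle_userfault() drops mmap_lock (parameter list abbreviated and
      illustrative):
      
          static vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
                                                     struct address_space *mapping,
                                                     pgoff_t idx, unsigned int flags,
                                                     unsigned long haddr,
                                                     unsigned long addr,
                                                     unsigned long reason)
          {
                  u32 hash = hugetlb_fault_mutex_hash(mapping, idx);
                  struct vm_fault vmf = {
                          .vma = vma,
                          .address = haddr,
                          .real_address = addr,
                          .flags = flags,
                  };
      
                  /* Drop vma_lock and hugetlb_fault_mutex; do not touch the
                   * vma again after handle_userfault() returns. */
                  hugetlb_vma_unlock_read(vma);
                  mutex_unlock(&hugetlb_fault_mutex_table[hash]);
      
                  return handle_userfault(&vmf, reason);
          }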
      Link: https://lkml.kernel.org/r/20220923042113.137273-1-liushixin2@huawei.com
      Fixes: 1a1aad8a ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reported-by: syzbot+193f9cee8638750b23cf@syzkaller.appspotmail.com
      Reported-by: Liu Zixian <liuzixian4@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      958f32ce
    • hugetlb: freeze allocated pages before creating hugetlb pages · 2b21624f
      Mike Kravetz authored
      When creating hugetlb pages, the hugetlb code must first allocate
      contiguous pages from a low level allocator such as buddy, cma or
      memblock.  The pages returned from these low level allocators are ref
      counted.  This creates potential issues with other code taking speculative
      references on these pages before they can be transformed to a hugetlb
      page.  This issue has been addressed with methods and code such as that
      provided in [1].
      
      Recent discussions about vmemmap freeing [2] have indicated that it would
      be beneficial to freeze all sub pages, including the head page of pages
      returned from low level allocators before converting to a hugetlb page. 
      This helps avoid races if we want to replace the page containing vmemmap
      for the head page.
      
      There have been proposals to change at least the buddy allocator to return
      frozen pages as described at [3].  If such a change is made, it can be
      employed by the hugetlb code.  However, as mentioned above hugetlb uses
      several low level allocators so each would need to be modified to return
      frozen pages.  For now, we can manually freeze the returned pages.  This
      is done in two places:
      
      1) alloc_buddy_huge_page, only the returned head page is ref counted.
         We freeze the head page, retrying once in the VERY rare case where
         there may be an inflated ref count.
      2) prep_compound_gigantic_page, for gigantic pages the current code
         freezes all pages except the head page.  New code will simply freeze
         the head page as well.
      
      In a few other places, code checks for inflated ref counts on newly
      allocated hugetlb pages.  With the modifications to freeze after
      allocating, this code can be removed.
      
      After hugetlb pages are freshly allocated, they are often added to the
      hugetlb free lists.  Since these pages were previously ref counted, this
      was done via put_page() which would end up calling the hugetlb destructor:
      free_huge_page.  With changes to freeze pages, we simply call
      free_huge_page directly to add the pages to the free list.
      
      In a few other places, freshly allocated hugetlb pages were immediately
      put into use, and the expectation was they were already ref counted.  In
      these cases, we must manually ref count the page.
      
      [1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
      [2] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20220809171854.3725722-1-willy@infradead.org/
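      
      A sketch of the "freeze the head page, retry once" behavior described in
      (1) above; the allocation call is schematic:
      
          struct page *page;
          bool retried = false;
      
      retry:
          page = __alloc_pages(gfp_mask, order, nid, nmask);   /* illustrative */
          if (page && !page_ref_freeze(page, 1)) {
                  /* VERY rare: a speculative ref was taken; retry once. */
                  __free_pages(page, order);
                  if (!retried) {
                          retried = true;
                          goto retry;
                  }
                  page = NULL;
          }
          /* On success the frozen page (refcount 0) is handed on to the
           * hugetlb prep code and, eventually, free_huge_page(). */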
      
      [mike.kravetz@oracle.com: fix NULL pointer dereference]
        Link: https://lkml.kernel.org/r/20220921202702.106069-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220916214638.155744-1-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2b21624f
    • hugetlb: clean up code checking for fault/truncation races · fa27759a
      Mike Kravetz authored
      With the new hugetlb vma lock in place, it can also be used to handle
      page fault races with file truncation.  The lock is taken in read mode
      at the beginning of the fault path.  During truncation, it is taken in
      write mode for each vma which has the file mapped.  The file's size
      (i_size) is modified before taking the vma lock to unmap.
      
      How are races handled?
      
      The page fault code checks i_size early in processing after taking the vma
      lock.  If the fault is beyond i_size, the fault is aborted.  If the fault
      is not beyond i_size the fault will continue and a new page will be added
      to the file.  It could be that truncation code modifies i_size after the
      check in fault code.  That is OK, as truncation code will soon remove the
      page.  The truncation code will wait until the fault is finished, as it
      must obtain the vma lock in write mode.
      
      This patch cleans up/removes late checks in the fault paths that try to
      back out pages racing with truncation.  As noted above, we just let the
      truncation code remove the pages.
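      
      A sketch of the early check described above, as seen from the fault
      path (helper names illustrative):
      
          hugetlb_vma_lock_read(vma);
          size = i_size_read(mapping->host) >> huge_page_shift(h);
          if (idx >= size) {
                  /* Racing truncation already shrank the file: abort. */
                  hugetlb_vma_unlock_read(vma);
                  return VM_FAULT_SIGBUS;
          }
          /* Otherwise continue; if truncation changes i_size after this
           * point it will remove the new page once it gets the vma lock
           * in write mode. */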
      
      [mike.kravetz@oracle.com: fix reserve_alloc set but not used compiler warning]
        Link: https://lkml.kernel.org/r/Yyj7HsJWfHDoU24U@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-10-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fa27759a
    • hugetlb: use new vma_lock for pmd sharing synchronization · 40549ba8
      Mike Kravetz authored
      The new hugetlb vma lock is used to address this race:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      
      The vma_lock is used as follows:
      - During fault processing. The lock is acquired in read mode before
        doing a page table lock and allocation (huge_pte_alloc).  The lock is
        held until code is finished with the page table entry (ptep).
      - The lock must be held in write mode whenever huge_pmd_unshare is
        called.
      
      Lock ordering issues come into play when unmapping a page from all
      vmas mapping the page.  The i_mmap_rwsem must be held to search for the
      vmas, and the vma lock must be held before calling unmap which will
      call huge_pmd_unshare.  This is done today in:
      - try_to_migrate_one and try_to_unmap_one for page migration and memory
        error handling.  In these routines we 'try' to obtain the vma lock and
        fail to unmap if unsuccessful.  Calling routines already deal with the
        failure of unmapping.
      - hugetlb_vmdelete_list for truncation and hole punch.  This routine
        also tries to acquire the vma lock.  If it fails, it skips the
        unmapping.  However, we can not have file truncation or hole punch
        fail because of contention.  After hugetlb_vmdelete_list, truncation
        and hole punch call remove_inode_hugepages.  remove_inode_hugepages
        checks for mapped pages and calls hugetlb_unmap_file_folio to unmap
        them.  hugetlb_unmap_file_folio is designed to drop locks and
        reacquire them in the correct order to guarantee unmap success.  A
        sketch of this read/write locking discipline is shown below.
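      
      A sketch of the read/write discipline (function names follow the
      series; the bodies are elided):
      
          /* Fault path: read mode around page table setup and use of ptep. */
          hugetlb_vma_lock_read(vma);
          ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
          /* ... handle the fault using ptep ... */
          hugetlb_vma_unlock_read(vma);
      
          /* Unmap path: write mode whenever huge_pmd_unshare() may run. */
          hugetlb_vma_lock_write(vma);
          i_mmap_lock_write(vma->vm_file->f_mapping);
          /* ... __unmap_hugepage_range() may call huge_pmd_unshare() ... */
          i_mmap_unlock_write(vma->vm_file->f_mapping);
          hugetlb_vma_unlock_write(vma);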
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-9-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40549ba8
    • hugetlb: add vma based lock for pmd sharing · 8d9bfb26
      Mike Kravetz authored
      Allocate a new hugetlb_vma_lock structure and hang off vm_private_data for
      synchronization use by vmas that could be involved in pmd sharing.  This
      data structure contains a rw semaphore that is the primary tool used for
      synchronization.
      
      This new structure is ref counted, so that it can exist when NOT attached
      to a vma.  This is only helpful in resolving lock ordering issues where
      code may need to obtain the vma_lock while there is no guarantee that the
      vma will not go away.  By obtaining a ref on the structure, it can be
      guaranteed that at least the rw semaphore will not go away.
      
      Only add infrastructure for the new lock here.  Actual use will be added
      in subsequent patches.
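      
      The structure described above is essentially a ref-counted rw semaphore
      with a back-pointer to its vma; a sketch:
      
          struct hugetlb_vma_lock {
                  struct kref refs;               /* can outlive the vma */
                  struct rw_semaphore rw_sema;    /* the actual lock */
                  struct vm_area_struct *vma;     /* back-pointer to owner */
          };
      
          static void hugetlb_vma_lock_release(struct kref *kref)
          {
                  struct hugetlb_vma_lock *vma_lock =
                          container_of(kref, struct hugetlb_vma_lock, refs);
      
                  kfree(vma_lock);
          }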
      
      [mike.kravetz@oracle.com: fix build issue for missing hugetlb_vma_lock_release]
        Link: https://lkml.kernel.org/r/YyNUtA1vRASOE4+M@monkey
      Link: https://lkml.kernel.org/r/20220914221810.95771-7-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8d9bfb26
    • hugetlb: rename vma_shareable() and refactor code · 12710fd6
      Mike Kravetz authored
      Rename the routine vma_shareable to vma_addr_pmd_shareable as it is
      checking a specific address within the vma.  Refactor code to check if an
      aligned range is shareable as this will be needed in a subsequent patch.
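      
      A sketch of the refactor described above (names follow the commit
      subject; treat the helpers as illustrative):
      
          static bool __vma_aligned_range_pmd_shareable(struct vm_area_struct *vma,
                                                        unsigned long start,
                                                        unsigned long end)
          {
                  /* A candidate range must lie entirely within the vma. */
                  return start >= vma->vm_start && end <= vma->vm_end;
          }
      
          static bool vma_addr_pmd_shareable(struct vm_area_struct *vma,
                                             unsigned long addr)
          {
                  unsigned long start = addr & PUD_MASK;
                  unsigned long end = start + PUD_SIZE;
      
                  return __vma_aligned_range_pmd_shareable(vma, start, end);
          }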
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-6-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      12710fd6
    • hugetlb: rename remove_huge_page to hugetlb_delete_from_page_cache · 7e1813d4
      Mike Kravetz authored
      remove_huge_page removes a hugetlb page from the page cache.  Change to
      hugetlb_delete_from_page_cache as it is a more descriptive name. 
      huge_add_to_page_cache is global in scope, but only deals with hugetlb
      pages.  For consistency and clarity, rename to hugetlb_add_to_page_cache.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7e1813d4
    • hugetlbfs: revert use i_mmap_rwsem for more pmd sharing synchronization · 3a47c54f
      Mike Kravetz authored
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  However, this has been shown to cause
      performance/scaling issues.  Revert the code and go back to only taking
      the semaphore in huge_pmd_share during the fault path.
      
      Keep the code that takes i_mmap_rwsem in write mode before calling
      try_to_unmap as this is required if huge_pmd_unshare is called.
      
      NOTE: Reverting this code does expose the following race condition.
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      It is unknown if the above race was ever experienced by a user.  It was
      discovered via code inspection when initially addressed.
      
      In subsequent patches, a new synchronization mechanism will be added to
      coordinate pmd sharing and eliminate this race.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-3-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3a47c54f
    • hugetlbfs: revert use i_mmap_rwsem to address page fault/truncate race · 188a3972
      Mike Kravetz authored
      Patch series "hugetlb: Use new vma lock for huge pmd sharing
      synchronization", v2.
      
      hugetlb fault scalability regressions have recently been reported [1]. 
      This is not the first such report, as regressions were also noted when
      commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") was added [2] in v5.7.  At that time, a proposal to
      address the regression was suggested [3] but went nowhere.
      
      The regression and benefit of this patch series is not evident when
      using the vm_scalability benchmark reported in [2] on a recent kernel.
      Results from running,
      "./usemem -n 48 --prealloc --prefault -O -U 3448054972"
      
      			48 sample Avg
      next-20220913		next-20220913			next-20220913
      unmodified	revert i_mmap_sema locking	vma sema locking, this series
      -----------------------------------------------------------------------------
      498150 KB/s		501934 KB/s			504793 KB/s
      
      The recent regression report [1] notes page fault and fork latency of
      shared hugetlb mappings.  To measure this, I created two simple programs:
      1) map a shared hugetlb area, write fault all pages, unmap area
         Do this in a continuous loop to measure faults per second
      2) map a shared hugetlb area, write fault a few pages, fork and exit
         Do this in a continuous loop to measure forks per second
      These programs were run on a 48 CPU VM with 320GB memory.  The shared
      mapping size was 250GB.  For comparison, a single instance of the program
      was run.  Then, multiple instances were run in parallel to introduce
      lock contention.  Changing the locking scheme results in a significant
      performance benefit.
      
      test		instances	unmodified	revert		vma
      --------------------------------------------------------------------------
      faults per sec	1		393043		395680		389932
      faults per sec  24		 71405		 81191		 79048
      forks per sec   1		  2802		  2747		  2725
      forks per sec   24		   439		   536		   500
      Combined faults 24		  1621		 68070		 53662
      Combined forks  24		   358		    67		   142
      
      Combined test is when running both faulting program and forking program
      simultaneously.
      
      Patches 1 and 2 of this series revert c0d0381a and 87bf91d3 which
      depends on c0d0381a.  Acquisition of i_mmap_rwsem is still required in
      the fault path to establish pmd sharing, so this is moved back to
      huge_pmd_share.  With c0d0381a reverted, this race is exposed:
      
      Faulting thread                                 Unsharing thread
      ...                                                  ...
      ptep = huge_pte_offset()
            or
      ptep = huge_pte_alloc()
      ...
                                                      i_mmap_lock_write
                                                      lock page table
      ptep invalid   <------------------------        huge_pmd_unshare()
      Could be in a previously                        unlock_page_table
      sharing process or worse                        i_mmap_unlock_write
      ...
      ptl = huge_pte_lock(ptep)
      get/update pte
      set_pte_at(pte, ptep)
      
      Reverting 87bf91d3 exposes races in page fault/file truncation.  When
      the new vma lock is put to use in patch 8, this will handle the fault/file
      truncation races.  This is explained in patch 9 where code associated with
      these races is cleaned up.
      
      Patches 3 - 5 restructure existing code in preparation for using the new
      vma lock (rw semaphore) for pmd sharing synchronization.  The idea is that
      this semaphore will be held in read mode for the duration of fault
      processing, and held in write mode for unmap operations which may call
      huge_pmd_unshare.  Acquiring i_mmap_rwsem is also still required to
      synchronize huge pmd sharing.  However it is only required in the fault
      path when setting up sharing, and will be acquired in huge_pmd_share().
      
      Patch 6 adds the new vma lock and all supporting routines, but does not
      actually change code to use the new lock.
      
      Patch 7 refactors code in preparation for using the new lock.  And, patch
      8 finally adds code to make use of this new vma lock.  Unfortunately, the
      fault code and truncate/hole punch code would naturally take locks in the
      opposite order which could lead to deadlock.  Since the performance of
      page faults is more important, the truncation/hole punch code is modified
      to back out and take locks in the correct order if necessary.
      
      [1] https://lore.kernel.org/linux-mm/43faf292-245b-5db5-cce9-369d8fb6bd21@infradead.org/
      [2] https://lore.kernel.org/lkml/20200622005551.GK5535@shao2-debian/
      [3] https://lore.kernel.org/linux-mm/20200706202615.32111-1-mike.kravetz@oracle.com/
      
      
      This patch (of 9):
      
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") added code to take i_mmap_rwsem in read mode for the
      duration of fault processing.  The use of i_mmap_rwsem to prevent
      fault/truncate races depends on this.  However, this has been shown to
      cause performance/scaling issues.  As a result, that code will be
      reverted.  Since the use of i_mmap_rwsem to address page fault/truncate
      races depends on this, it must also be reverted.
      
      In a subsequent patch, code will be added to detect the fault/truncate
      race and back out operations as required.
      
      Link: https://lkml.kernel.org/r/20220914221810.95771-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20220914221810.95771-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: James Houghton <jthoughton@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      188a3972
    • mm/hugetlb: remove unnecessary 'NULL' values from pointer · 3259914f
      XU pengfei authored
      These pointer variables are assigned the result of an allocation before
      they are checked, so there is no need to initialize them to NULL first.
      
      Link: https://lkml.kernel.org/r/20220914012113.6271-1-xupengfei@nfschina.com
      Signed-off-by: XU pengfei <xupengfei@nfschina.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3259914f
    • mm: hugetlb: eliminate memory-less nodes handling · a4a00b45
      Muchun Song authored
      The memory-notify-based approach aims to handle memory-less nodes;
      however, it just adds complexity to the code, as pointed out by David in
      thread [1].  The handling of memory-less nodes was introduced by commit
      4faf8d95 ("hugetlb: handle memory hot-plug events").  From its commit
      message, we cannot find any necessity for handling this case.  So, we
      can simply register/unregister sysfs entries in
      register_node/unregister_node to simplify the code.
      
      BTW, the hotplug callback was added because in
      hugetlb_register_all_nodes() we register sysfs nodes only for N_MEMORY
      nodes, see commit 9b5e5d0f, which said it was a preparation for handling
      memory-less nodes via memory hotplug.  Since we want to remove the
      memory hotplug handling, make sure we only register per-node sysfs for
      online (N_ONLINE) nodes in hugetlb_register_all_nodes().
      
      https://lore.kernel.org/linux-mm/60933ffc-b850-976c-78a0-0ee6e0ea9ef0@redhat.com/ [1]
      Link: https://lkml.kernel.org/r/20220914072603.60293-3-songmuchun@bytedance.com
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a4a00b45
    • mm: hugetlb: simplify per-node sysfs creation and removal · b958d4d0
      Muchun Song authored
      Patch series "simplify handling of per-node sysfs creation and removal",
      v4.
      
      
      This patch (of 2):
      
      The following commit offloaded per-node sysfs creation and removal to a
      kworker and did not say why it is needed.  It also said "I don't know
      that this is absolutely required", so it seems the author was not sure
      either.  Since it only complicates the code, this patch reverts the
      changes to simplify the code.
      
        39da08cb ("hugetlb: offload per node attribute registrations")
      
      We can use the memory hotplug notifier to do per-node sysfs creation and
      removal instead of inserting those operations into node registration and
      unregistration.  This reduces the code coupling between node.c and
      hugetlb.c and also simplifies the code.
      
      Link: https://lkml.kernel.org/r/20220914072603.60293-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220914072603.60293-2-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b958d4d0
    • mm: use nth_page instead of mem_map_offset mem_map_next · 14455eab
      Cheng Li authored
      To handle the discontiguous case, mem_map_next() has a parameter named
      `offset`.  As a caller, one would be confused why "get next entry" needs
      a parameter named "offset".  The other drawback of mem_map_next() is
      that the callers must take care of the mapping between the "iter" and
      "offset" parameters, otherwise we may get a hole or duplication during
      iteration.  So use nth_page() instead of mem_map_next().
      
      And replace mem_map_offset with nth_page() per Matthew's comments.
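      
      A sketch of the replacement pattern (pages_per_huge_page() stands in for
      whatever bound the caller uses):
      
          struct page *subpage;
          unsigned long i;
      
          /* nth_page() handles compound pages whose struct pages may not be
           * contiguous in the memmap (order higher than MAX_ORDER). */
          for (i = 0; i < pages_per_huge_page(h); i++) {
                  subpage = nth_page(page, i);
                  /* ... operate on subpage ... */
          }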
      
      Link: https://lkml.kernel.org/r/1662708669-9395-1-git-send-email-lic121@chinatelecom.cn
      Signed-off-by: Cheng Li <lic121@chinatelecom.cn>
      Fixes: 69d177c2 ("hugetlbfs: handle pages higher order than MAX_ORDER")
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14455eab
    • mm/hugetlb.c: remove unnecessary initialization of local `err' · 8eeda55f
      Li zeming authored
      Link: https://lkml.kernel.org/r/20220905020918.3552-1-zeming@nfschina.com
      Signed-off-by: Li zeming <zeming@nfschina.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      8eeda55f