1. 01 7月, 2021 5 次提交
    • Y
      mm: thp: refactor NUMA fault handling · c5b5a3dd
      Yang Shi 提交于
      When the THP NUMA fault support was added THP migration was not supported
      yet.  So the ad hoc THP migration was implemented in NUMA fault handling.
      Since v4.14 THP migration has been supported so it doesn't make too much
      sense to still keep another THP migration implementation rather than using
      the generic migration code.
      
      This patch reworks the NUMA fault handling to use generic migration
      implementation to migrate misplaced page.  There is no functional change.
      
      After the refactor the flow of NUMA fault handling looks just like its
      PTE counterpart:
        Acquire ptl
        Prepare for migration (elevate page refcount)
        Release ptl
        Isolate page from lru and elevate page refcount
        Migrate the misplaced THP
      
      If migration fails just restore the old normal PMD.
      
      In the old code anon_vma lock was needed to serialize THP migration
      against THP split, but since then the THP code has been reworked a lot, it
      seems anon_vma lock is not required anymore to avoid the race.
      
      The page refcount elevation when holding ptl should prevent from THP
      split.
      
      Use migrate_misplaced_page() for both base page and THP NUMA hinting fault
      and remove all the dead and duplicate code.
      
      [dan.carpenter@oracle.com: fix a double unlock bug]
        Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda
      
      Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.comSigned-off-by: NYang Shi <shy828301@gmail.com>
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5b5a3dd
    • M
      mm: migrate: fix missing update page_private to hugetlb_page_subpool · 6acfb5ba
      Muchun Song 提交于
      Since commit d6995da3 ("hugetlb: use page.private for hugetlb specific
      page flags") converts page.private for hugetlb specific page flags.  We
      should use hugetlb_page_subpool() to get the subpool pointer instead of
      page_private().
      
      This 'could' prevent the migration of hugetlb pages.  page_private(hpage)
      is now used for hugetlb page specific flags.  At migration time, the only
      flag which could be set is HPageVmemmapOptimized.  This flag will only be
      set if the new vmemmap reduction feature is enabled.  In addition,
      !page_mapping() implies an anonymous mapping.  So, this will prevent
      migration of hugetb pages in anonymous mappings if the vmemmap reduction
      feature is enabled.
      
      In addition, that if statement checked for the rare race condition of a
      page being migrated while in the process of being freed.  Since that check
      is now wrong, we could leak hugetlb subpool usage counts.
      
      The commit forgot to update it in the page migration routine.  So fix it.
      
      [songmuchun@bytedance.com: fix compiler error when !CONFIG_HUGETLB_PAGE reported by Randy]
        Link: https://lkml.kernel.org/r/20210521022747.35736-1-songmuchun@bytedance.com
      
      Link: https://lkml.kernel.org/r/20210520025949.1866-1-songmuchun@bytedance.com
      Fixes: d6995da3 ("hugetlb: use page.private for hugetlb specific page flags")
      Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
      Reported-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: Anshuman Khandual <anshuman.khandual@arm.com>	[arm64]
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6acfb5ba
    • M
      mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
      Mina Almasry 提交于
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller does the copy outside
      of the lock, we again check the cache, and allocate a page consuming the
      reservation, and copy over the contents.
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10
      	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
      Both tests succeed and produce no warnings. After the
      test runs number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
      Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.comSigned-off-by: NMina Almasry <almasrymina@google.com>
      Signed-off-by: NYueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8cc5fcbb
    • C
      mm/hugetlb: change parameters of arch_make_huge_pte() · 79c1c594
      Christophe Leroy 提交于
      Patch series "Subject: [PATCH v2 0/5] Implement huge VMAP and VMALLOC on powerpc 8xx", v2.
      
      This series implements huge VMAP and VMALLOC on powerpc 8xx.
      
      Powerpc 8xx has 4 page sizes:
      - 4k
      - 16k
      - 512k
      - 8M
      
      At the time being, vmalloc and vmap only support huge pages which are
      leaf at PMD level.
      
      Here the PMD level is 4M, it doesn't correspond to any supported
      page size.
      
      For now, implement use of 16k and 512k pages which is done
      at PTE level.
      
      Support of 8M pages will be implemented later, it requires use of
      hugepd tables.
      
      To allow this, the architecture provides two functions:
      - arch_vmap_pte_range_map_size() which tells vmap_pte_range() what
      page size to use. A stub returning PAGE_SIZE is provided when the
      architecture doesn't provide this function.
      - arch_vmap_pte_supported_shift() which tells __vmalloc_node_range()
      what page shift to use for a given area size. A stub returning
      PAGE_SHIFT is provided when the architecture doesn't provide this
      function.
      
      This patch (of 5):
      
      At the time being, arch_make_huge_pte() has the following prototype:
      
        pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
      			   struct page *page, int writable);
      
      vma is used to get the pages shift or size.
      vma is also used on Sparc to get vm_flags.
      page is not used.
      writable is not used.
      
      In order to use this function without a vma, replace vma by shift and
      flags.  Also remove the used parameters.
      
      Link: https://lkml.kernel.org/r/cover.1620795204.git.christophe.leroy@csgroup.eu
      Link: https://lkml.kernel.org/r/f4633ac6a7da2f22f31a04a89e0a7026bb78b15b.1620795204.git.christophe.leroy@csgroup.euSigned-off-by: NChristophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Uladzislau Rezki <uladzislau.rezki@sony.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79c1c594
    • M
      mm: hugetlb: alloc the vmemmap pages associated with each HugeTLB page · ad2fa371
      Muchun Song 提交于
      When we free a HugeTLB page to the buddy allocator, we need to allocate
      the vmemmap pages associated with it.  However, we may not be able to
      allocate the vmemmap pages when the system is under memory pressure.  In
      this case, we just refuse to free the HugeTLB page.  This changes behavior
      in some corner cases as listed below:
      
       1) Failing to free a huge page triggered by the user (decrease nr_pages).
      
          User needs to try again later.
      
       2) Failing to free a surplus huge page when freed by the application.
      
          Try again later when freeing a huge page next time.
      
       3) Failing to dissolve a free huge page on ZONE_MOVABLE via
          offline_pages().
      
          This can happen when we have plenty of ZONE_MOVABLE memory, but
          not enough kernel memory to allocate vmemmmap pages.  We may even
          be able to migrate huge page contents, but will not be able to
          dissolve the source huge page.  This will prevent an offline
          operation and is unfortunate as memory offlining is expected to
          succeed on movable zones.  Users that depend on memory hotplug
          to succeed for movable zones should carefully consider whether the
          memory savings gained from this feature are worth the risk of
          possibly not being able to offline memory in certain situations.
      
       4) Failing to dissolve a huge page on CMA/ZONE_MOVABLE via
          alloc_contig_range() - once we have that handling in place. Mainly
          affects CMA and virtio-mem.
      
          Similar to 3). virito-mem will handle migration errors gracefully.
          CMA might be able to fallback on other free areas within the CMA
          region.
      
      Vmemmap pages are allocated from the page freeing context.  In order for
      those allocations to be not disruptive (e.g.  trigger oom killer)
      __GFP_NORETRY is used.  hugetlb_lock is dropped for the allocation because
      a non sleeping allocation would be too fragile and it could fail too
      easily under memory pressure.  GFP_ATOMIC or other modes to access memory
      reserves is not used because we want to prevent consuming reserves under
      heavy hugetlb freeing.
      
      [mike.kravetz@oracle.com: fix dissolve_free_huge_page use of tail/head page]
        Link: https://lkml.kernel.org/r/20210527231225.226987-1-mike.kravetz@oracle.com
      [willy@infradead.org: fix alloc_vmemmap_page_list documentation warning]
        Link: https://lkml.kernel.org/r/20210615200242.1716568-6-willy@infradead.org
      
      Link: https://lkml.kernel.org/r/20210510030027.56044-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com>
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chen Huang <chenhuang5@huawei.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joao Martins <joao.m.martins@oracle.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Neukum <oneukum@suse.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2fa371
  2. 30 6月, 2021 1 次提交
  3. 17 6月, 2021 1 次提交
  4. 07 5月, 2021 1 次提交
  5. 06 5月, 2021 8 次提交
  6. 01 5月, 2021 1 次提交
  7. 25 2月, 2021 2 次提交
  8. 06 2月, 2021 1 次提交
  9. 25 1月, 2021 2 次提交
  10. 16 12月, 2020 10 次提交
  11. 15 11月, 2020 1 次提交
    • M
      hugetlbfs: fix anon huge page migration race · 336bf30e
      Mike Kravetz 提交于
      Qian Cai reported the following BUG in [1]
      
        LTP: starting move_pages12
        BUG: unable to handle page fault for address: ffffffffffffffe0
        ...
        RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
        Call Trace:
          rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
          try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
          migrate_pages+0x1005/0x1fb0
          move_pages_and_store_status.isra.47+0xd7/0x1a0
          __x64_sys_move_pages+0xa5c/0x1100
          do_syscall_64+0x5f/0x310
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Hugh Dickins diagnosed this as a migration bug caused by code introduced
      to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
      routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
      flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
      anon pages as the anon_vma_lock should be held in this case.  Further
      analysis suggested that i_mmap_rwsem was not required to he held at all
      when calling try_to_unmap for anon pages as an anon page could never be
      part of a shared pmd mapping.
      
      Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
      to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
      keep mapping valid while dropping page lock.
      
      This patch does the following:
      
       - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
         calling try_to_unmap.
      
       - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
         will now simply do a 'trylock' while still holding the page lock. If
         the trylock fails, it will return NULL. This could impact the
         callers:
      
          - migration calling code will receive -EAGAIN and retry up to the
            hard coded limit (10).
      
          - memory error code will treat the page as BUSY. This will force
            killing (SIGKILL) instead of SIGBUS any mapping tasks.
      
         Do note that this change in behavior only happens when there is a
         race. None of the standard kernel testing suites actually hit this
         race, but it is possible.
      
      [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
      [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: NQian Cai <cai@lca.pw>
      Suggested-by: NHugh Dickins <hughd@google.com>
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      336bf30e
  12. 19 10月, 2020 1 次提交
  13. 17 10月, 2020 1 次提交
    • O
      mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Oscar Salvador 提交于
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safe net and
      would skip that page.
      
      This has proved to be wrong, as we got some pfn walkers out there, like
      compaction, that all they care is the page to be in a buddy freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalited)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          then by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, do not call put_page directly from those paths.  Instead, we
      keep the page and send it to page_handle_poison to perform the right
      handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we  want to take the page off the buddy, the
      page has been already allocated so we cannot soft offline it.
      But the user could always retry it.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page,
      and we set HWPoison on the page if we succeed.
      Hugetlb has a slightly different handling though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook in free_huge_page and some other
      places).
      So I decided to not make the code overly complicated and just fail
      normally if the page we allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we handle now in-use pages, we no longer
      need the put-as-isolation-migratetype dance, that was guarding for poisoned
      pages to end up in pcplists.
      Signed-off-by: NOscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NNaoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.deSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      79f5f8fa
  14. 14 10月, 2020 2 次提交
  15. 27 9月, 2020 1 次提交
  16. 25 9月, 2020 1 次提交
  17. 20 9月, 2020 1 次提交