    mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
    Committed by David Hildenbrand
    Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
    exclusive, and use that information to make GUP pins reliable and stay
    consistent with the page mapped into the page table even if the page table
    entry gets write-protected.
    
    With that information at hand, we can extend our COW logic to always reuse
    anonymous pages that are exclusive.  For anonymous pages that might be
    shared, the existing logic applies.
    
    As already documented, PG_anon_exclusive is usually only expressive in
    combination with a page table entry.  Especially PTE vs.  PMD-mapped
    anonymous pages require more thought, some examples: due to mremap() we
    can easily have a single compound page PTE-mapped into multiple page
    tables exclusively in a single process -- multiple page table locks apply.
    Further, due to MADV_WIPEONFORK we might not necessarily write-protect
    all PTEs, and only some subpages might be pinned.  Long story short: once
    PTE-mapped, we have to track information about exclusivity per sub-page,
    but until then, we can just track it for the compound page in the head
    page and do not have to update a whole bunch of subpages all of the time
    for a simple PMD mapping of a THP.
    
    For simplicity, this commit mostly talks about "anonymous pages", while
    it's for THP actually "the part of an anonymous folio referenced via a
    page table entry".
    
    To not spill PG_anon_exclusive code all over the mm code-base, we let the
    anon rmap code handle all PG_anon_exclusive logic it can easily handle.
    
    If a writable, present page table entry points at an anonymous (sub)page,
    that (sub)page must be PG_anon_exclusive.  If GUP wants to take a reliable
    pin (FOLL_PIN) on an anonymous page referenced via a present page table
    entry, it must only pin if PG_anon_exclusive is set for the mapped
    (sub)page.
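    
    As a rough illustration, the two rules above can be written down as
    predicates.  This is a userspace toy model, not kernel code: struct
    toy_page, struct toy_pte and both helpers are made-up names, and every
    page in the model is assumed to be anonymous.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins, not kernel types. Every page here is anonymous. */
struct toy_page {
    bool anon_exclusive;   /* models PG_anon_exclusive */
};

struct toy_pte {
    bool present;
    bool writable;
    struct toy_page *page; /* mapped (sub)page, NULL if not present */
};

/* Rule 1: a writable, present PTE must map an exclusive (sub)page. */
bool toy_pte_is_consistent(const struct toy_pte *pte)
{
    if (!pte->present || !pte->writable)
        return true;       /* the rule only constrains writable PTEs */
    assert(pte->page != NULL);
    return pte->page->anon_exclusive;
}

/* Rule 2: a reliable pin may only be taken on an exclusive (sub)page. */
bool toy_may_pin(const struct toy_pte *pte)
{
    return pte->present && pte->page != NULL && pte->page->anon_exclusive;
}
```

    In this model a read-only PTE that maps a possibly shared page is
    consistent, but toy_may_pin() refuses it; a real GUP caller would have
    to trigger unsharing first.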
    
    This commit doesn't adjust GUP, so this is only implicitly handled for
    FOLL_WRITE; follow-up commits will teach GUP to also respect it for
    FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
    reliable.
    
    Whenever an anonymous page is to be shared (fork(), KSM), or when
    temporarily unmapping an anonymous page (swap, migration), the relevant
    PG_anon_exclusive bit has to be cleared to mark the anonymous page
    possibly shared.  Clearing will fail if there are GUP pins on the page:
    
    * For fork(), this means having to copy the page and not being able to
      share it.  fork() protects against concurrent GUP using the PT lock and
      the src_mm->write_protect_seq.
    
    * For KSM, this means sharing will fail.  For swap, this means unmapping
      will fail.  For migration, this means migration will fail early.  All
      three cases protect against concurrent GUP using the PT lock and a
      proper clear/invalidate+flush of the relevant page table entry.
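    
    The "clearing fails while pinned" rule and its consequence for fork()
    can be sketched as follows.  This is a simplified userspace model with
    made-up names (toy_try_share(), toy_fork_page()); locking and the
    synchronization against concurrent GUP are deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for an anonymous (sub)page. */
struct toy_page {
    bool anon_exclusive;   /* models PG_anon_exclusive */
    int  pincount;         /* models outstanding GUP pins */
};

/* Try to mark the page "possibly shared"; refuse while pins exist. */
int toy_try_share(struct toy_page *page)
{
    assert(page->pincount >= 0);
    if (page->anon_exclusive && page->pincount > 0)
        return -EBUSY;     /* a pinned exclusive page must stay exclusive */
    page->anon_exclusive = false;
    return 0;
}

enum toy_fork_action { TOY_FORK_SHARE, TOY_FORK_COPY };

/* fork(): share the page if possible, otherwise copy it for the child. */
enum toy_fork_action toy_fork_page(struct toy_page *page)
{
    return toy_try_share(page) ? TOY_FORK_COPY : TOY_FORK_SHARE;
}
```

    KSM, swapout and migration would use the same toy_try_share() result,
    but give up on the operation instead of copying.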
    
    This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
    pinned page gets mapped R/O and the successive write fault ends up
    replacing the page instead of reusing it.  It improves the situation for
    O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
    fork() is *not* involved; however, swapout and fork() are still
    problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
    users will fix the issue for them.
    
    I. Details about basic handling
    
    I.1. Fresh anonymous pages
    
    page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
    given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
    the mechanism fresh anonymous pages come into life (besides migration code
    where we copy the page->mapping), all fresh anonymous pages will start out
    as exclusive.
    
    I.2. COW reuse handling of anonymous pages
    
    When a COW handler stumbles over a (sub)page that's marked exclusive, it
    simply reuses it.  Otherwise, the handler tries harder under page lock to
    detect if the (sub)page is exclusive and can be reused.  If exclusive,
    page_move_anon_rmap() will mark the given (sub)page exclusive.
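    
    The decision a COW handler makes can be sketched like this.  It is a
    userspace toy model with made-up names; in particular, "refcount == 1"
    is only an assumed stand-in for the "tries harder under page lock"
    check, whose exact form is not spelled out here.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_cow_action { TOY_COW_REUSE, TOY_COW_COPY };

/* Toy stand-in for an anonymous (sub)page. */
struct toy_page {
    bool anon_exclusive;   /* models PG_anon_exclusive */
    int  refcount;         /* simplified: all references, pins included */
};

/* Write fault on a read-only PTE that maps an anonymous page. */
enum toy_cow_action toy_wp_fault(struct toy_page *page)
{
    assert(page->refcount >= 1);

    /* Fast path: an exclusive page is always reused, even if pinned. */
    if (page->anon_exclusive)
        return TOY_COW_REUSE;

    /* Slow path (under page lock in the kernel): are we the only user? */
    if (page->refcount == 1) {
        page->anon_exclusive = true;   /* models page_move_anon_rmap() */
        return TOY_COW_REUSE;
    }
    return TOY_COW_COPY;
}
```

    Note how an exclusive page with extra references (for example, GUP
    pins) is still reused: this is what keeps a pin consistent with the
    page mapped into the page table.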
    
    Note that hugetlb code does not yet check for PageAnonExclusive(), as it
    still uses the old COW logic that is prone to the COW security issue
    because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
    pages are a scarce resource.
    
    I.3. Migration handling
    
    try_to_migrate() has to try marking an exclusive anonymous page shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  migrate_vma_collect_pmd() and
    __split_huge_pmd_locked() are handled similarly.
    
    Writable migration entries implicitly point at exclusive anonymous pages.
    For readable migration entries that information is stored via a new
    "readable-exclusive" migration entry, specific to anonymous pages.
    
    When restoring a migration entry in remove_migration_pte(), information
    about exclusivity is detected via the migration entry type, and
    RMAP_EXCLUSIVE is set accordingly for
    page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
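    
    How exclusivity travels through a migration entry can be sketched as a
    small encode/decode pair.  This is a userspace toy model with made-up
    names (enum toy_mig_type, TOY_RMAP_EXCLUSIVE); writable entries are
    treated as exclusive here, in line with the rule that a writable PTE
    only ever maps an exclusive (sub)page.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the three migration entry types of anon pages. */
enum toy_mig_type {
    TOY_MIG_READ,             /* possibly shared */
    TOY_MIG_READ_EXCLUSIVE,   /* read-only, but exclusive */
    TOY_MIG_WRITE,            /* writable, hence exclusive */
};

#define TOY_RMAP_EXCLUSIVE 0x1u   /* models RMAP_EXCLUSIVE */

/* Unmap side: pick the entry type from the PTE and the page state. */
enum toy_mig_type toy_make_migration_entry(bool pte_writable,
                                           bool anon_exclusive)
{
    assert(!pte_writable || anon_exclusive);
    if (pte_writable)
        return TOY_MIG_WRITE;
    return anon_exclusive ? TOY_MIG_READ_EXCLUSIVE : TOY_MIG_READ;
}

/* Restore side: derive the rmap flags from the entry type alone. */
unsigned int toy_restore_rmap_flags(enum toy_mig_type type)
{
    return type == TOY_MIG_READ ? 0u : TOY_RMAP_EXCLUSIVE;
}
```

    Only plain readable entries restore a possibly shared page; both other
    types bring the exclusive marker back.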
    
    I.4. Swapout handling
    
    try_to_unmap() has to try marking the mapped page possibly shared via
    page_try_share_anon_rmap().  If it fails because there are GUP pins on the
    page, unmap fails.  For now, information about exclusivity is lost.  In
    the future, we might want to remember that information in the swap entry
    in some cases.  However, that requires more thought, care, and a way to
    store that information in swap entries.
    
    I.5. Swapin handling
    
    do_swap_page() will never stumble over exclusive anonymous pages in the
    swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
    to detect manually if an anonymous page is exclusive and has to set
    RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
    
    I.6. THP handling
    
    __split_huge_pmd_locked() has to move the information about exclusivity
    from the PMD to the PTEs.
    
    a) In case we have a readable-exclusive PMD migration entry, simply
       insert readable-exclusive PTE migration entries.
    
    b) In case we have a present PMD entry and we don't want to freeze
       ("convert to migration entries"), simply forward PG_anon_exclusive to
       all sub-pages, no need to temporarily clear the bit.
    
    c) In case we have a present PMD entry and want to freeze, handle it
       similar to try_to_migrate(): try marking the page shared first.  In
       case we fail, we ignore the "freeze" instruction and simply split
       ordinarily.  try_to_migrate() will properly fail because the THP is
       still mapped via PTEs.
    
    When splitting a compound anonymous folio (THP), the information about
    exclusivity is implicitly handled via the migration entries: no need to
    replicate PG_anon_exclusive manually.
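    
    Cases a) to c) of the PMD split can be sketched as one decision
    function.  This is a userspace toy model with made-up names; it only
    tracks whether the resulting PTEs are present or migration entries and
    whether they carry exclusivity, nothing else.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a PMD-mapped anonymous THP. */
struct toy_thp {
    bool anon_exclusive;   /* tracked once, in the head page */
    int  pincount;         /* models outstanding GUP pins */
};

enum toy_pmd_kind { TOY_PMD_PRESENT, TOY_PMD_MIG_READ_EXCLUSIVE };
enum toy_pte_kind { TOY_PTE_PRESENT, TOY_PTE_MIGRATION };

struct toy_split {
    enum toy_pte_kind kind;   /* what every resulting PTE looks like */
    bool exclusive;           /* exclusivity carried by each PTE/subpage */
};

struct toy_split toy_split_pmd(struct toy_thp *thp, enum toy_pmd_kind pmd,
                               bool freeze)
{
    struct toy_split out;

    assert(thp->pincount >= 0);
    if (pmd == TOY_PMD_MIG_READ_EXCLUSIVE) {        /* case a) */
        out.kind = TOY_PTE_MIGRATION;
        out.exclusive = true;
        return out;
    }
    /* case c): sharing fails while pinned, so ignore "freeze" */
    if (freeze && thp->anon_exclusive && thp->pincount > 0)
        freeze = false;
    out.exclusive = thp->anon_exclusive;
    if (freeze) {
        /* exclusivity moves from the page into the migration entries */
        out.kind = TOY_PTE_MIGRATION;
        thp->anon_exclusive = false;
    } else {
        out.kind = TOY_PTE_PRESENT;                 /* case b) */
    }
    return out;
}
```

    When the freeze is ignored, the caller still sees present PTEs
    afterwards, which is why a later migration attempt fails properly.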
    
    I.7. fork() handling

    fork() handling is relatively easy, because PG_anon_exclusive is only
    expressive for some page table entry types.
    
    a) Present anonymous pages
    
    page_try_dup_anon_rmap() will mark the given subpage shared -- which will
    fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
    PMD to handle it on the PTE level).
    
    Note that device exclusive entries are just a pointer at a PageAnon()
    page.  fork() will first convert a device exclusive entry to a present
    page table and handle it just like present anonymous pages.
    
    b) Device private entry
    
    Device private entries point at PageAnon() pages that cannot be mapped
    directly and, therefore, cannot get pinned.
    
    page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
    fail because they cannot get pinned.
    
    c) HW poison entries
    
    PG_anon_exclusive will remain untouched and is stale -- the page table
    entry is just a placeholder after all.
    
    d) Migration entries
    
    Writable and readable-exclusive entries are converted to readable entries:
    possibly shared.
    
    I.8. mprotect() handling
    
    mprotect() only has to properly handle the new readable-exclusive
    migration entry:
    
    When write-protecting a migration entry that points at an anonymous page,
    remember the information about exclusivity via the "readable-exclusive"
    migration entry type.
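    
    The two ways a migration entry changes type without being restored,
    fork() (I.7 d) and mprotect() (I.8), can be sketched as transition
    functions.  This is a userspace toy model with made-up names; the
    handling of non-anonymous pages in toy_mprotect_wrprotect() is an
    assumption added to make the function total.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the three migration entry types. */
enum toy_mig_type {
    TOY_MIG_READ,             /* possibly shared */
    TOY_MIG_READ_EXCLUSIVE,   /* read-only, but exclusive */
    TOY_MIG_WRITE,            /* writable, hence exclusive */
};

static bool toy_mig_type_valid(enum toy_mig_type type)
{
    return type == TOY_MIG_READ || type == TOY_MIG_READ_EXCLUSIVE ||
           type == TOY_MIG_WRITE;
}

/* fork(): whatever the entry was, it ends up readable (possibly shared). */
enum toy_mig_type toy_fork_convert(enum toy_mig_type type)
{
    assert(toy_mig_type_valid(type));
    return TOY_MIG_READ;
}

/* mprotect() write-protect: keep exclusivity for anonymous pages. */
enum toy_mig_type toy_mprotect_wrprotect(enum toy_mig_type type,
                                         bool page_is_anon)
{
    assert(toy_mig_type_valid(type));
    if (type != TOY_MIG_WRITE)
        return type;          /* already read-only, nothing to do */
    /* assumption: non-anon pages have no exclusivity to remember */
    return page_is_anon ? TOY_MIG_READ_EXCLUSIVE : TOY_MIG_READ;
}
```

    The asymmetry is the point: fork() drops exclusivity because the page
    is about to be shared, while mprotect() only removes write permission
    and therefore has to preserve it.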
    
    II. Migration and GUP-fast
    
    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a migration entry, we have to mark the page possibly
    shared and synchronize against GUP-fast by a proper clear/invalidate+flush
    to make the following scenario impossible:
    
    1. try_to_migrate() places a migration entry after checking for GUP pins
       and marks the page possibly shared.
    
    2. GUP-fast pins the page due to lack of synchronization
    
    3. fork() converts the "writable/readable-exclusive" migration entry into a
       readable migration entry
    
    4. Migration fails due to the GUP pin (failing to freeze the refcount)
    
    5. Migration entries are restored. PG_anon_exclusive is lost
    
    -> We have a pinned page that is not marked exclusive anymore.
    
    Note that we move information about exclusivity from the page to the
    migration entry as it otherwise highly overcomplicates fork() and
    PTE-mapping a THP.
    
    III. Swapout and GUP-fast
    
    Whenever replacing a present page table entry that maps an exclusive
    anonymous page by a swap entry, we have to mark the page possibly shared
    and synchronize against GUP-fast by a proper clear/invalidate+flush to
    make the following scenario impossible:
    
    1. try_to_unmap() places a swap entry after checking for GUP pins and
       clears exclusivity information on the page.
    
    2. GUP-fast pins the page due to lack of synchronization.
    
    -> We have a pinned page that is not marked exclusive anymore.
    
    If we'd ever store information about exclusivity in the swap entry,
    similar to migration handling, the same considerations as in II would
    apply.  This is future work.
    
    Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Don Dutile <ddutile@redhat.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jann Horn <jannh@google.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Khalid Aziz <khalid.aziz@oracle.com>
    Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
    Cc: Liang Zhang <zhangliang5@huawei.com>
    Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
    Cc: Michal Hocko <mhocko@kernel.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mike Rapoport <rppt@linux.ibm.com>
    Cc: Nadav Amit <namit@vmware.com>
    Cc: Oded Gabbay <oded.gabbay@gmail.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Rik van Riel <riel@surriel.com>
    Cc: Roman Gushchin <guro@fb.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Yang Shi <shy828301@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>