1. 28 Jun 2022, 1 commit
  2. 20 May 2022, 1 commit
    • mm: don't be stuck to rmap lock on reclaim path · 6d4675e6
      Authored by Minchan Kim
      The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can become
      contended under memory pressure if processes keep working on their VMAs
      (e.g., fork, mmap, munmap), which stalls the reclaim path.  In our real
      workload traces, we see kswapd waiting on the lock for 300ms+ (a second
      in the worst case), which pushes other processes into direct reclaim,
      where they also get stuck on the lock.
      
      This patch makes the LRU aging path use try_lock mode, like
      shrink_page_list, so the reclaim context keeps working on the next LRU
      pages instead of getting stuck.  If it finds the rmap lock contended, it
      rotates the page back to the head of the LRU; this is done in both the
      active and inactive LRUs for consistent behavior, which is a reasonable
      starting point rather than adding more heuristics.
      
      Since this patch introduces a new "contended" field as an out-parameter
      alongside the try_lock in-parameter in rmap_walk_control, the structure
      is no longer immutable when try_lock is set, so the const qualifiers are
      removed from the rmap-related functions.  Since rmap walking is already
      an expensive operation, I doubt the const provided a sizable benefit
      (and we did not have it until 5.17).
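      A minimal sketch of how an aging-path caller might consume the two new
      fields; the field names follow the description above, while the
      surrounding call site (the rmap_one callback, the anon_lock helper and
      the private "pra" argument) is illustrative rather than the literal
      kernel code:
      
        struct rmap_walk_control rwc = {
                .rmap_one  = folio_referenced_one,
                .arg       = &pra,                     /* walker-private argument */
                .try_lock  = true,                     /* in-param: don't sleep on rmap locks */
                .anon_lock = folio_lock_anon_vma_read,
        };
        
        rmap_walk(folio, &rwc);
        
        if (rwc.contended) {
                /*
                 * out-param: the rmap lock was contended, so the walk was
                 * skipped; rotate the folio back to the head of its LRU and
                 * move on instead of blocking the reclaim context.
                 */
        }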
      
      In a heavy app workload on Android, the trace shows the following
      statistics; the patch almost entirely removes rmap lock contention from
      the reclaim path.
      
      Martin Liu reported:
      
      Before:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
               1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
                601            0             601   145.544681        28817    198  rmap_walk_file
      
      After:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
                NaN          NaN              NaN          NaN          NaN    0.0             NaN
                  0            0                0     0.127645            1     12  rmap_walk_file
      
      [minchan@kernel.org: add comment, per Matthew]
        Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
      Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d4675e6
  3. 14 May 2022, 2 commits
    • mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmapping · a00a8759
      Authored by Baolin Wang
      Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb pages,
      meaning they support not only PMD/PUD size hugetlb (2M and 1G) but also
      CONT-PTE/PMD sizes (64K and 32M with a 4K base page size).
      
      When unmapping a hugetlb page, we will get the relevant page table entry
      by huge_pte_offset() only once to nuke it.  This is correct for PMD or PUD
      size hugetlb, since they always contain only one pmd entry or pud entry in
      the page table.
      
      However, this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
      they can contain several contiguous pte or pmd entries with the same page
      table attributes; we would nuke only one pte or pmd entry for such a
      CONT-PTE/PMD size hugetlb page.
      
      Currently, try_to_unmap() is only passed a hugetlb page when that page is
      poisoned, which means we would unmap only one pte entry for a CONT-PTE or
      CONT-PMD size poisoned hugetlb page.  The other subpages of the poisoned
      page would remain accessible, which can cause serious issues.
      
      To fix this, switch to huge_ptep_clear_flush() to nuke the hugetlb page
      table entries, since it already handles CONT-PTE and CONT-PMD size
      hugetlb.
      
      We already use set_huge_swap_pte_at() to set a poisoned swap entry for a
      poisoned hugetlb page.  Also add a VM_BUG_ON() in try_to_unmap() to make
      sure the hugetlb page passed in is indeed poisoned.
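      A condensed sketch of the unmap-side change described above
      (illustrative only, not the literal patch; the surrounding page walk and
      error handling are omitted):
      
        if (PageHuge(page)) {
                /* Only poisoned hugetlb pages reach try_to_unmap() today. */
                VM_BUG_ON_PAGE(!PageHWPoison(subpage), subpage);
                /*
                 * huge_ptep_clear_flush() understands CONT-PTE/CONT-PMD
                 * mappings and nukes all contiguous entries, not just one.
                 */
                pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
        } else {
                pteval = ptep_clear_flush(vma, address, pvmw.pte);
        }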
      
      Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a00a8759
    • mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migration · 5d4af619
      Authored by Baolin Wang
      Some architectures (like ARM64) support CONT-PTE/PMD size hugetlb pages,
      meaning they support not only PMD/PUD size hugetlb (2M and 1G) but also
      CONT-PTE/PMD sizes (64K and 32M with a 4K base page size).
      
      When migrating a hugetlb page, we will get the relevant page table entry
      by huge_pte_offset() only once to nuke it and remap it with a migration
      pte entry.  This is correct for PMD or PUD size hugetlb, since they always
      contain only one pmd entry or pud entry in the page table.
      
      However, this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
      they can contain several contiguous pte or pmd entries with the same page
      table attributes.  We would therefore nuke or remap only one pte or pmd
      entry for such a CONT-PTE/PMD size hugetlb page, which is not what
      hugetlb migration expects.  As a result, the subpages' data of a hugetlb
      page can still be modified while the page is being migrated, causing a
      serious data consistency issue, because we did not nuke the page table
      entries and set a migration pte for those subpages.
      
      To fix this, switch to huge_ptep_clear_flush() to nuke the hugetlb page
      table entries, and remap them with set_huge_pte_at() and
      set_huge_swap_pte_at() when migrating a hugetlb page; these helpers
      already handle CONT-PTE and CONT-PMD size hugetlb.
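      A condensed sketch of the migration side described above (illustrative
      only; building the migration entry and the error handling are omitted,
      and the helper names follow the description above):
      
        if (PageHuge(page)) {
                /* Nuke all contiguous entries of a CONT-PTE/PMD mapping. */
                pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
                /* ... swp_pte is the migration entry built for the page ... */
                set_huge_swap_pte_at(mm, address, pvmw.pte, swp_pte,
                                     vma_mmu_pagesize(vma));
        } else {
                pteval = ptep_clear_flush(vma, address, pvmw.pte);
                set_pte_at(mm, address, pvmw.pte, swp_pte);
        }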
      
      [akpm@linux-foundation.org: fix nommu build]
      [baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU]
        Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com
      Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com
      Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5d4af619
  4. 13 May 2022, 4 commits
  5. 10 May 2022, 8 commits
    • mm/swap: remember PG_anon_exclusive via a swp pte bit · 1493a191
      Authored by David Hildenbrand
      Patch series "mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages", v2.
      
      This series fixes memory corruptions when a GUP R/W reference (FOLL_WRITE
      | FOLL_GET) was taken on an anonymous page and the COW logic failed to
      detect exclusivity of the page and replaced the anonymous page with a
      copy in the page table: the GUP reference lost synchronicity with the
      page mapped into the page tables.  This series focuses on x86, arm64,
      s390x and ppc64/book3s -- other architectures are fairly easy to support
      by implementing __HAVE_ARCH_PTE_SWP_EXCLUSIVE.
      
      This primarily fixes the O_DIRECT memory corruptions that can happen on
      concurrent swapout, whereby we lose DMA reads to a page (modifying the
      user page by writing to it).
      
      O_DIRECT currently uses FOLL_GET for short-term (!FOLL_LONGTERM) DMA
      from/to a user page.  In the long run, we want to convert it to properly
      use FOLL_PIN, and John is working on it, but that might take a while and
      might not be easy to backport.  In the meantime, let's restore what used
      to work before we started modifying our COW logic: make R/W FOLL_GET
      references reliable as long as there is no fork() after GUP involved.
      
      This is just the natural follow-up of part 2, and it will also further
      reduce "wrong COW" on the swapin path, for example, when we cannot remove
      a page from the swapcache due to concurrent writeback, or when two
      threads fault on the same swapped-out page.  Fixing O_DIRECT is just a
      nice side-product.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 2):
      "
        2. Intra Process Memory Corruptions due to Wrong COW (FOLL_GET)
      
        It was discovered that we can create a memory corruption by reading a
        file via O_DIRECT to a part (e.g., first 512 bytes) of a page,
        concurrently writing to an unrelated part (e.g., last byte) of the same
        page, and concurrently write-protecting the page via clear_refs
        SOFTDIRTY tracking [6].
      
        For the reproducer, the issue is that O_DIRECT grabs a reference of the
        target page (via FOLL_GET) and clear_refs write-protects the relevant
        page table entry. On successive write access to the page from the
        process itself, we wrongly COW the page when resolving the write fault,
        resulting in a loss of synchronicity and consequently a memory corruption.
      
        While some people might think that using clear_refs in this combination
        is a corner case, it turns out to be a more generic problem unfortunately.
      
        For example, it was just recently discovered that we can similarly
        create a memory corruption without clear_refs, simply by concurrently
        swapping out the buffer pages [7]. Note that we nowadays even use the
        swap infrastructure in Linux without an actual swap disk/partition: the
        prime example is zram which is enabled as default under Fedora [10].
      
        The root issue is that a write-fault on a page that has additional
        references results in a COW and thereby a loss of synchronicity
        and consequently a memory corruption if two parties believe they are
        referencing the same page.
      "
      
      We don't particularly care about R/O FOLL_GET references: they were never
      reliable and O_DIRECT doesn't expect to observe modifications from a page
      after DMA was started.
      
      Note that:
      * this only fixes the issue on x86, arm64, s390x and ppc64/book3s
        ("enterprise architectures"). Other architectures have to implement
        __HAVE_ARCH_PTE_SWP_EXCLUSIVE to achieve the same.
      * this does *not* consider any kind of fork() after taking the reference:
        fork() after GUP never worked reliably with FOLL_GET.
      * Not losing PG_anon_exclusive during swapout was the last remaining
        piece. KSM already makes sure that there are no other references on
        a page before considering it for sharing. Page migration maintains
        PG_anon_exclusive and simply fails when there are additional references
        (freezing the refcount fails). Only swapout code dropped the
        PG_anon_exclusive flag because it requires more work to remember +
        restore it.
      
      With this series in place, most COW issues of [3] are fixed on said
      architectures. Other architectures can implement
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE fairly easily.
      
      [1] https://lkml.kernel.org/r/20220329160440.193848-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      
      This patch (of 8):
      
      Currently, we clear PG_anon_exclusive in try_to_unmap() and forget about
      it.  We do this, to keep fork() logic on swap entries easy and efficient:
      for example, if we wouldn't clear it when unmapping, we'd have to lookup
      the page in the swapcache for each and every swap entry during fork() and
      clear PG_anon_exclusive if set.
      
      Instead, we want to store that information directly in the swap pte,
      protected by the page table lock, similarly to how we handle
      SWP_MIGRATION_READ_EXCLUSIVE for migration entries.  However, for actual
      swap entries, we don't want to mess with the swap type (e.g., still one
      bit) because it overcomplicates swap code.
      
      In try_to_unmap(), we already refuse to unmap if the page might be
      pinned, because we must never lose PG_anon_exclusive on pinned pages.
      Reliably checking for other unexpected references *before* completely
      unmapping a page is unfortunately not really possible: THPs heavily
      overcomplicate the situation.  Once fully unmapped it's easier -- we,
      for example, make sure that there are no unexpected references *after*
      unmapping a page before starting writeback on that page.
      
      So, we currently might end up unmapping a page and clearing
      PG_anon_exclusive if that page has additional references, for example, due
      to a FOLL_GET.
      
      do_swap_page() has to re-determine if a page is exclusive, which will
      easily fail if there are other references on a page, most prominently GUP
      references via FOLL_GET.  This can currently result in memory corruptions
      when taking a FOLL_GET | FOLL_WRITE reference on a page even when fork()
      is never involved: try_to_unmap() will succeed, and when refaulting the
      page, it cannot be marked exclusive and will get replaced by a copy in the
      page tables on the next write access, resulting in writes via the GUP
      reference to the page being lost.
      
      In an ideal world, everybody that uses GUP and wants to modify page
      content, such as O_DIRECT, would properly use FOLL_PIN.  However, that
      conversion will take a while.  It's easier to fix what used to work in the
      past (FOLL_GET | FOLL_WRITE) remembering PG_anon_exclusive.  In addition,
      by remembering PG_anon_exclusive we can further reduce unnecessary COW in
      some cases, so it's the natural thing to do.
      
      So let's transfer the PG_anon_exclusive information to the swap pte and
      store it via an architecture-dependent pte bit; use that information when
      restoring the swap pte in do_swap_page() and unuse_pte().  During fork(),
      we simply have to clear the pte bit and are done.
      
      Of course, there is one corner case to handle: swap backends that don't
      support concurrent page modifications while the page is under writeback. 
      Special case these, and drop the exclusive marker.  Add a comment why that
      is just fine (also, reuse_swap_page() would have done the same in the
      past).
      
      In the future, we'll hopefully have all architectures support
      __HAVE_ARCH_PTE_SWP_EXCLUSIVE, such that we can get rid of the empty stubs
      and the define completely.  Then, we can also convert
      SWP_MIGRATION_READ_EXCLUSIVE.  For architectures it's fairly easy to
      support: either simply use a yet unused pte bit that can be used for swap
      entries, steal one from the arch type bits if they exceed 5, or steal one
      from the offset bits.
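      The following is a small userspace model (illustrative only, not kernel
      code) of the idea of parking the exclusive marker in a spare bit of a
      software-defined swap pte encoding; the encoding and helper names are
      made up for the example:
      
        #include <assert.h>
        #include <stdbool.h>
        #include <stdint.h>
        
        #define SWP_TYPE_BITS  5
        #define SWP_EXCL_BIT   (1ull << SWP_TYPE_BITS)   /* one spare software bit */
        
        typedef uint64_t swp_pte_t;
        
        /* swap type in the low bits, offset above the exclusive bit */
        static swp_pte_t make_swp_pte(unsigned type, uint64_t offset)
        {
                return (offset << (SWP_TYPE_BITS + 1)) |
                       (type & ((1u << SWP_TYPE_BITS) - 1));
        }
        
        static swp_pte_t swp_pte_mkexclusive(swp_pte_t pte)     { return pte | SWP_EXCL_BIT; }
        static bool swp_pte_exclusive(swp_pte_t pte)            { return pte & SWP_EXCL_BIT; }
        static swp_pte_t swp_pte_clear_exclusive(swp_pte_t pte) { return pte & ~SWP_EXCL_BIT; }
        
        int main(void)
        {
                swp_pte_t pte = make_swp_pte(2, 0x1234);
        
                /* swapout of a PG_anon_exclusive page: remember the marker */
                pte = swp_pte_mkexclusive(pte);
                assert(swp_pte_exclusive(pte));
        
                /* fork(): the child's entry must not be considered exclusive */
                pte = swp_pte_clear_exclusive(pte);
                assert(!swp_pte_exclusive(pte));
                return 0;
        }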
      
      Note: R/O FOLL_GET references were never really reliable, especially when
      taking one on a shared page and then writing to the page (e.g., GUP after
      fork()).  FOLL_GET, including R/W references, were never really reliable
      once fork was involved (e.g., GUP before fork(), GUP during fork()).  KSM
      steps back in case it stumbles over unexpected references and is,
      therefore, fine.
      
      [david@redhat.com: fix SWP_STABLE_WRITES test]
        Link: https://lkml.kernel.org/r/ac725bcb-313a-4fff-250a-68ba9a8f85fb@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220329164329.208407-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1493a191
    • mm/rmap: fail try_to_migrate() early when setting a PMD migration entry fails · 7f5abe60
      Authored by David Hildenbrand
      Let's fail right away in case we cannot clear PG_anon_exclusive because
      the anon THP may be pinned.  Right now, we continue trying to install
      migration entries and the caller of try_to_migrate() has to realize that
      the page is still mapped and restore the migration entries.  Let's fail
      fast, just like for PTE migration entries.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-14-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f5abe60
    • mm: remember exclusively mapped anonymous pages with PG_anon_exclusive · 6c287605
      Authored by David Hildenbrand
      Let's mark exclusively mapped anonymous pages with PG_anon_exclusive as
      exclusive, and use that information to make GUP pins reliable and stay
      consistent with the page mapped into the page table even if the page table
      entry gets write-protected.
      
      With that information at hand, we can extend our COW logic to always reuse
      anonymous pages that are exclusive.  For anonymous pages that might be
      shared, the existing logic applies.
      
      As already documented, PG_anon_exclusive is usually only expressive in
      combination with a page table entry.  Especially PTE vs.  PMD-mapped
      anonymous pages require more thought, some examples: due to mremap() we
      can easily have a single compound page PTE-mapped into multiple page
      tables exclusively in a single process -- multiple page table locks apply.
      Further, due to MADV_WIPEONFORK we might not necessarily write-protect
      all PTEs, and only some subpages might be pinned.  Long story short: once
      PTE-mapped, we have to track information about exclusivity per sub-page,
      but until then, we can just track it for the compound page in the head
      page and avoid updating a whole bunch of subpages all of the time for a
      simple PMD mapping of a THP.
      
      For simplicity, this commit mostly talks about "anonymous pages", while
      it's for THP actually "the part of an anonymous folio referenced via a
      page table entry".
      
      To avoid spilling PG_anon_exclusive code all over the mm code-base, we
      let the anon rmap code handle all the PG_anon_exclusive logic it can
      easily handle.
      
      If a writable, present page table entry points at an anonymous (sub)page,
      that (sub)page must be PG_anon_exclusive.  If GUP wants to take a
      reliable pin (FOLL_PIN) on an anonymous page referenced via a present
      page table entry, it must only pin if PG_anon_exclusive is set for the
      mapped (sub)page.
      
      This commit doesn't adjust GUP, so this is only implicitly handled for
      FOLL_WRITE, follow-up commits will teach GUP to also respect it for
      FOLL_PIN without FOLL_WRITE, to make all GUP pins of anonymous pages fully
      reliable.
      
      Whenever an anonymous page is to be shared (fork(), KSM), or when
      temporarily unmapping an anonymous page (swap, migration), the relevant
      PG_anon_exclusive bit has to be cleared to mark the anonymous page
      possibly shared.  Clearing will fail if there are GUP pins on the page:
      
      * For fork(), this means having to copy the page and not being able to
        share it.  fork() protects against concurrent GUP using the PT lock and
        the src_mm->write_protect_seq.
      
      * For KSM, this means sharing will fail.  For swap, this means unmapping
        will fail; for migration, this means migration will fail early.  All
        three cases protect against concurrent GUP using the PT lock and a
        proper clear/invalidate+flush of the relevant page table entry.
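      A minimal sketch of that clearing step, named after the
      page_try_share_anon_rmap() helper mentioned further below (simplified,
      not the exact kernel implementation):
      
        static inline int page_try_share_anon_rmap_sketch(struct page *page)
        {
                VM_BUG_ON_PAGE(!PageAnon(page) || !PageAnonExclusive(page), page);
        
                /* A GUP pin means the page must stay exclusive: refuse. */
                if (unlikely(page_maybe_dma_pinned(page)))
                        return -EBUSY;
        
                /* Mark the page possibly shared. */
                ClearPageAnonExclusive(page);
                return 0;
        }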
      
      This fixes memory corruptions reported for FOLL_PIN | FOLL_WRITE, when a
      pinned page gets mapped R/O and the successive write fault ends up
      replacing the page instead of reusing it.  It improves the situation for
      O_DIRECT/vmsplice/...  that still use FOLL_GET instead of FOLL_PIN, if
      fork() is *not* involved, however swapout and fork() are still
      problematic.  Properly using FOLL_PIN instead of FOLL_GET for these GUP
      users will fix the issue for them.
      
      I. Details about basic handling
      
      I.1. Fresh anonymous pages
      
      page_add_new_anon_rmap() and hugepage_add_new_anon_rmap() will mark the
      given page exclusive via __page_set_anon_rmap(exclusive=1).  As that is
      the mechanism fresh anonymous pages come into life (besides migration code
      where we copy the page->mapping), all fresh anonymous pages will start out
      as exclusive.
      
      I.2. COW reuse handling of anonymous pages
      
      When a COW handler stumbles over a (sub)page that's marked exclusive, it
      simply reuses it.  Otherwise, the handler tries harder under page lock to
      detect if the (sub)page is exclusive and can be reused.  If exclusive,
      page_move_anon_rmap() will mark the given (sub)page exclusive.
      
      Note that hugetlb code does not yet check for PageAnonExclusive(), as it
      still uses the old COW logic that is prone to the COW security issue
      because hugetlb code cannot really tolerate unnecessary/wrong COW as huge
      pages are a scarce resource.
      
      I.3. Migration handling
      
      try_to_migrate() has to try marking an exclusive anonymous page shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  migrate_vma_collect_pmd() and
      __split_huge_pmd_locked() are handled similarly.
      
      Writable migration entries implicitly point at shared anonymous pages. 
      For readable migration entries that information is stored via a new
      "readable-exclusive" migration entry, specific to anonymous pages.
      
      When restoring a migration entry in remove_migration_pte(), information
      about exclusivity is detected via the migration entry type, and
      RMAP_EXCLUSIVE is set accordingly for
      page_add_anon_rmap()/hugepage_add_anon_rmap() to restore that information.
      
      I.4. Swapout handling
      
      try_to_unmap() has to try marking the mapped page possibly shared via
      page_try_share_anon_rmap().  If it fails because there are GUP pins on the
      page, unmap fails.  For now, information about exclusivity is lost.  In
      the future, we might want to remember that information in the swap entry
      in some cases, however, it requires more thought, care, and a way to store
      that information in swap entries.
      
      I.5. Swapin handling
      
      do_swap_page() will never stumble over exclusive anonymous pages in the
      swap cache, as try_to_migrate() prohibits that.  do_swap_page() always has
      to detect manually if an anonymous page is exclusive and has to set
      RMAP_EXCLUSIVE for page_add_anon_rmap() accordingly.
      
      I.6. THP handling
      
      __split_huge_pmd_locked() has to move the information about exclusivity
      from the PMD to the PTEs.
      
      a) In case we have a readable-exclusive PMD migration entry, simply
         insert readable-exclusive PTE migration entries.
      
      b) In case we have a present PMD entry and we don't want to freeze
         ("convert to migration entries"), simply forward PG_anon_exclusive to
         all sub-pages, no need to temporarily clear the bit.
      
      c) In case we have a present PMD entry and want to freeze, handle it
         similar to try_to_migrate(): try marking the page shared first.  In
         case we fail, we ignore the "freeze" instruction and simply split
         ordinarily.  try_to_migrate() will properly fail because the THP is
         still mapped via PTEs.
      
      When splitting a compound anonymous folio (THP), the information about
      exclusivity is implicitly handled via the migration entries: no need to
      replicate PG_anon_exclusive manually.
      
      I.7. fork() handling
      
      fork() handling is relatively easy, because PG_anon_exclusive is only
      expressive for some page table entry types.
      
      a) Present anonymous pages
      
      page_try_dup_anon_rmap() will mark the given subpage shared -- which will
      fail if the page is pinned.  If it failed, we have to copy (or PTE-map a
      PMD to handle it on the PTE level).
      
      Note that device exclusive entries are just a pointer at a PageAnon()
      page.  fork() will first convert a device exclusive entry to a present
      page table and handle it just like present anonymous pages.
      
      b) Device private entry
      
      Device private entries point at PageAnon() pages that cannot be mapped
      directly and, therefore, cannot get pinned.
      
      page_try_dup_anon_rmap() will mark the given subpage shared, which cannot
      fail because they cannot get pinned.
      
      c) HW poison entries
      
      PG_anon_exclusive will remain untouched and is stale -- the page table
      entry is just a placeholder after all.
      
      d) Migration entries
      
      Writable and readable-exclusive entries are converted to readable entries:
      possibly shared.
      
      I.8. mprotect() handling
      
      mprotect() only has to properly handle the new readable-exclusive
      migration entry:
      
      When write-protecting a migration entry that points at an anonymous page,
      remember the information about exclusivity via the "readable-exclusive"
      migration entry type.
      
      II. Migration and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a migration entry, we have to mark the page possibly
      shared and synchronize against GUP-fast by a proper clear/invalidate+flush
      to make the following scenario impossible:
      
      1. try_to_migrate() places a migration entry after checking for GUP pins
         and marks the page possibly shared.
      
      2. GUP-fast pins the page due to lack of synchronization
      
      3. fork() converts the "writable/readable-exclusive" migration entry into a
         readable migration entry
      
      4. Migration fails due to the GUP pin (failing to freeze the refcount)
      
      5. Migration entries are restored. PG_anon_exclusive is lost
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      Note that we move information about exclusivity from the page to the
      migration entry as it otherwise highly overcomplicates fork() and
      PTE-mapping a THP.
      
      III. Swapout and GUP-fast
      
      Whenever replacing a present page table entry that maps an exclusive
      anonymous page by a swap entry, we have to mark the page possibly shared
      and synchronize against GUP-fast by a proper clear/invalidate+flush to
      make the following scenario impossible:
      
      1. try_to_unmap() places a swap entry after checking for GUP pins and
         clears exclusivity information on the page.
      
      2. GUP-fast pins the page due to lack of synchronization.
      
      -> We have a pinned page that is not marked exclusive anymore.
      
      If we'd ever store information about exclusivity in the swap entry,
      similar to migration handling, the same considerations as in II would
      apply.  This is future work.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-13-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6c287605
    • mm/rmap: drop "compound" parameter from page_add_new_anon_rmap() · 40f2bbf7
      Authored by David Hildenbrand
      New anonymous pages are always mapped natively: only THP/khugepaged code
      maps a new compound anonymous page and passes "true".  Otherwise, we're
      just dealing with simple, non-compound pages.
      
      Let's give the interface clearer semantics and document these.  Remove the
      PageTransCompound() sanity check from page_add_new_anon_rmap().
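      A hedged before/after sketch of the call-site change described above
      (the call-site arguments are illustrative, and the exact internal
      handling of compound pages may differ):
      
        /* before: callers had to spell out the "compound" bool */
        page_add_new_anon_rmap(page, vma, address, false);
        
        /* after: whether the page is compound is derived from the page itself */
        page_add_new_anon_rmap(page, vma, address);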
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-9-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      40f2bbf7
    • mm/rmap: pass rmap flags to hugepage_add_anon_rmap() · 28c5209d
      Authored by David Hildenbrand
      Let's prepare for passing RMAP_EXCLUSIVE, similarly to what we do for
      page_add_anon_rmap() now.  RMAP_COMPOUND is implicit for hugetlb pages
      and is ignored.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-8-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      28c5209d
    • mm/rmap: remove do_page_add_anon_rmap() · f1e2db12
      Authored by David Hildenbrand
      ... and instead convert page_add_anon_rmap() to accept flags.
      
      Passing flags instead of bools is usually nicer either way, and we want
      to also pass RMAP_EXCLUSIVE more often in follow-up patches, when
      detecting that an anonymous page is exclusive: for example, when
      restoring an anonymous page from a writable migration entry.
      
      This is a preparation for marking an anonymous page inside
      page_add_anon_rmap() as exclusive when RMAP_EXCLUSIVE is passed.
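      A hedged before/after sketch of the interface change described above
      (call-site arguments are illustrative):
      
        /* before: a separate helper carried the flags */
        do_page_add_anon_rmap(page, vma, address, RMAP_EXCLUSIVE);
        
        /* after: page_add_anon_rmap() accepts the flags directly */
        page_add_anon_rmap(page, vma, address, RMAP_EXCLUSIVE);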
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f1e2db12
    • mm/rmap: convert RMAP flags to a proper distinct rmap_t type · 14f9135d
      Authored by David Hildenbrand
      We want to pass the flags to more than one anon rmap function, getting rid
      of special "do_page_add_anon_rmap()".  So let's pass around a distinct
      __bitwise type and refine documentation.
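      A hedged sketch of what such a distinct __bitwise flag type looks like,
      modeled on the description above (the exact definitions and comments in
      the patch may differ):
      
        /* rmap flags passed to the anon rmap functions */
        typedef int __bitwise rmap_t;
        
        /* No special request. */
        #define RMAP_NONE       ((__force rmap_t)0)
        
        /* The anonymous (sub)page is exclusive to a single process. */
        #define RMAP_EXCLUSIVE  ((__force rmap_t)BIT(0))
        
        /* The compound page is mapped via a PMD (or larger) entry. */
        #define RMAP_COMPOUND   ((__force rmap_t)BIT(1))
      
      With a __bitwise type, sparse flags accidental mixing with plain
      integers at call sites such as
      page_add_anon_rmap(page, vma, address, RMAP_EXCLUSIVE).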
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      14f9135d
    • mm/rmap: fix missing swap_free() in try_to_unmap() after arch_unmap_one() failed · 322842ea
      Authored by David Hildenbrand
      Patch series "mm: COW fixes part 2: reliable GUP pins of anonymous pages", v4.
      
      This series is the result of the discussion on the previous approach [2]. 
      More information on the general COW issues can be found there.  It is
      based on latest linus/master (post v5.17, with relevant core-MM changes
      for v5.18-rc1).
      
      This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
      on an anonymous page and the COW logic failed to detect exclusivity of
      the page and replaced the anonymous page with a copy in the page table:
      the GUP pin lost synchronicity with the pages mapped into the page
      tables.
      
      This issue, including other related COW issues, has been summarized in [3]
      under 3):
      "
        3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
      
        page_maybe_dma_pinned() is used to check if a page may be pinned for
        DMA (using FOLL_PIN instead of FOLL_GET).  While false positives are
        tolerable, false negatives are problematic: pages that are pinned for
        DMA must not be added to the swapcache.  If it happens, the (now pinned)
        page could be faulted back from the swapcache into page tables
        read-only.  Future write-access would detect the pinning and COW the
        page, losing synchronicity.  For the interested reader, this is nicely
        documented in feb889fb ("mm: don't put pinned pages into the swap
        cache").
      
        Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
        cases and can result in a violation of the documented semantics: giving
        false negatives because of the race.
      
        There are cases where we call it without properly taking a per-process
        sequence lock, turning the usage of page_maybe_dma_pinned() racy.  While
        one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy to
        handle, there is especially one rmap case (shrink_page_list) that's hard
        to fix: in the rmap world, we're not limited to a single process.
      
        The shrink_page_list() issue is really subtle.  If we race with
        someone pinning a page, we can trigger the same issue as in the FOLL_GET
        case.  See the detail section at the end of this mail on a discussion
        how bad this can bite us with VFIO or other FOLL_PIN user.
      
        It's harder to reproduce, but I managed to modify the O_DIRECT
        reproducer to use io_uring fixed buffers [15] instead, which ends up
        using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
        similarly trigger a loss of synchronicity and consequently a memory
        corruption.
      
        Again, the root issue is that a write-fault on a page that has
        additional references results in a COW and thereby a loss of
        synchronicity and consequently a memory corruption if two parties
        believe they are referencing the same page.
      "
      
      This series makes GUP pins (R/O and R/W) on anonymous pages fully
      reliable, especially also taking care of concurrent pinning via GUP-fast,
      for example, also fully fixing an issue reported regarding NUMA balancing
      [4] recently.  While doing that, it further reduces "unnecessary COWs",
      especially when we don't fork()/KSM and don't swapout, and fixes the COW
      security for hugetlb for FOLL_PIN.
      
      In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
      anonymous page is exclusive.  Exclusive anonymous pages that are mapped
      R/O can directly be mapped R/W by the COW logic in the write fault
      handler.  Exclusive anonymous pages that want to be shared (fork(), KSM)
      first have to be marked shared -- which will fail if there are GUP pins on
      the page.  GUP is only allowed to take a pin on anonymous pages that are
      exclusive.  The PT lock is the primary mechanism to synchronize
      modifications of PG_anon_exclusive.  We synchronize against GUP-fast
      either via the src_mm->write_protect_seq (during fork()) or via
      clear/invalidate+flush of the relevant page table entry.
      
      Special care has to be taken about swap, migration, and THPs (whereby a
      PMD-mapping can be converted to a PTE mapping and we have to track
      information for subpages).  Besides these, we let the rmap code handle
      most magic.  For reliable R/O pins of anonymous pages, we need
      FAULT_FLAG_UNSHARE logic as part of our previous approach [2], however,
      it's now 100% mapcount free and I further simplified it a bit.
      
        #1 is a fix
        #3-#10 are mostly rmap preparations for PG_anon_exclusive handling
        #11 introduces PG_anon_exclusive
        #12 uses PG_anon_exclusive and makes R/W pins of anonymous pages
         reliable
        #13 is a preparation for reliable R/O pins
        #14 and #15 reuse/modify GUP-triggered unsharing for R/O GUP pins to
         make R/O pins of anonymous pages reliable
        #16 adds a sanity check when (un)pinning anonymous pages
      
      [1] https://lkml.kernel.org/r/20220131162940.210846-1-david@redhat.com
      [2] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [3] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      [4] https://bugzilla.kernel.org/show_bug.cgi?id=215616
      
      
      This patch (of 17):
      
      In case arch_unmap_one() fails, we have already done a swap_duplicate().
      Let's undo that properly via swap_free().
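      A condensed sketch of the fix as described above, simplified from the
      try_to_unmap_one() error path (surrounding code omitted):
      
        if (swap_duplicate(entry) < 0) {
                set_pte_at(mm, address, pvmw.pte, pteval);
                ret = false;
                page_vma_mapped_walk_done(&pvmw);
                break;
        }
        if (arch_unmap_one(mm, vma, address, pteval) < 0) {
                swap_free(entry);       /* the fix: undo the earlier swap_duplicate() */
                set_pte_at(mm, address, pvmw.pte, pteval);
                ret = false;
                page_vma_mapped_walk_done(&pvmw);
                break;
        }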
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20220428083441.37290-2-david@redhat.com
      Fixes: ca827d55 ("mm, swap: Add infrastructure for saving page metadata on swap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      322842ea
  6. 29 Apr 2022, 2 commits
    • mm: rmap: introduce pfn_mkclean_range() to cleans PTEs · 6a8e0596
      Authored by Muchun Song
      page_mkclean_one() is supposed to be used with a pfn that has an
      associated struct page, but not all pfns (e.g. DAX) have one.  Introduce
      a new function, pfn_mkclean_range(), to clean the PTEs (including PMDs)
      that map a range of pfns with no struct page associated with them.  This
      helper will be used by the DAX code in the next patch to make pfns clean.
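      A hedged sketch of the shape of the new helper (the exact prototype and
      parameter order in the patch may differ):
      
        /*
         * Walk vma's page tables over the given pfn range, which has no
         * struct page, and write-protect/clean the PTEs or PMDs mapping it;
         * returns the number of entries cleaned.
         */
        int pfn_mkclean_range(unsigned long pfn, unsigned long nr_pages,
                              pgoff_t pgoff, struct vm_area_struct *vma);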
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-4-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6a8e0596
    • mm: rmap: fix cache flush on THP pages · 7f9c9b60
      Authored by Muchun Song
      Patch series "Fix some bugs related to ramp and dax", v7.
      
      Patches 1-2 fix a cache flush bug; because subsequent patches depend on
      those changes, they are placed in this series.  Patches 3-4 are
      preparation for fixing a DAX bug in patch 5.  Patch 6 is a code cleanup,
      since the previous patch removes the usage of follow_invalidate_pte().
      
      
      This patch (of 6):
      
      flush_cache_page() only removes a PAGE_SIZE-sized range from the cache,
      so for a THP it covers only the head page, not the full set of subpages.
      Replace it with flush_cache_range() to fix this issue.  No problems have
      been observed from this so far, probably because few architectures have
      virtually indexed caches.
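      A hedged before/after sketch for the PMD-mapped case (the actual patch
      derives the flush range from the page table walk state):
      
        /* before: only the first PAGE_SIZE of the THP was flushed */
        flush_cache_page(vma, address, page_to_pfn(page));
        
        /* after: flush the whole PMD-mapped range */
        flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);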
      
      Link: https://lkml.kernel.org/r/20220403053957.10770-1-songmuchun@bytedance.com
      Link: https://lkml.kernel.org/r/20220403053957.10770-2-songmuchun@bytedance.com
      Fixes: f27176cf ("mm: convert page_mkclean_one() to use page_vma_mapped_walk()")
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Xiyu Yang <xiyuyang19@fudan.edu.cn>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7f9c9b60
  7. 02 Apr 2022, 1 commit
  8. 25 Mar 2022, 3 commits
    • mm: fix race between MADV_FREE reclaim and blkdev direct IO read · 6c8e2a25
      Authored by Mauricio Faria de Oliveira
      Problem:
      =======
      
      Userspace might read the zero-page instead of actual data from a direct
      IO read on a block device if madvise(MADV_FREE) was called on the buffers
      earlier (this is discussed below), due to a race between page reclaim on
      MADV_FREE and blkdev direct IO read.
      
      - Race condition:
        ==============
      
      During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks
      if the page is not dirty, then discards its rmap PTE(s) (vs.  remap back
      if the page is dirty).
      
      However, after try_to_unmap_one() returns to shrink_page_list(), it might
      keep the page _anyway_ if page_ref_freeze() fails (it expects exactly
      _one_ page reference, from the isolation for page reclaim).
      
      Well, blkdev_direct_IO() gets references for all pages, and on READ
      operations it only sets them dirty _later_.
      
      So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct
      IO read from block devices, and page reclaim happens during
      __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages()
      returns, but BEFORE the pages are set dirty, the situation happens.
      
      The direct IO read eventually completes.  Now, when userspace reads the
      buffers, the PTE is no longer there and the page fault handler
      do_anonymous_page() services that with the zero-page, NOT the data!
      
      A synthetic reproducer is provided.
      
      - Page faults:
        ===========
      
      If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't
      happen, because that faults-in all pages as writeable, so
      do_anonymous_page() sets up a new page/rmap/PTE, and that is used by
      direct IO.  The userspace reads don't fault as the PTE is there (thus
      zero-page is not used/setup).
      
      But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE
      is no longer there; the subsequent page faults can't help:
      
      The data-read from the block device probably won't generate faults due to
      DMA (no MMU) but even in the case it wouldn't use DMA, that happens on
      different virtual addresses (not user-mapped addresses) because `struct
      bio_vec` stores `struct page` to figure addresses out (which are different
      from user-mapped addresses) for the read.
      
      Thus userspace reads (to user-mapped addresses) still fault, then
      do_anonymous_page() gets another `struct page` that would address/ map to
      other memory than the `struct page` used by `struct bio_vec` for the read.
      (The original `struct page` is not available, since it wasn't freed, as
      page_ref_freeze() failed due to more page refs.  And even if it were
      available, its data cannot be trusted anymore.)
      
      Solution:
      ========
      
      One solution is to check for the expected page reference count in
      try_to_unmap_one().
      
      There should be one reference from the isolation (that is also checked in
      shrink_page_list() with page_ref_freeze()) plus one or more references
      from page mapping(s) (put in discard: label).  Further references mean
      that rmap/PTE cannot be unmapped/nuked.
      
      (Note: there might be more than one reference from mapping due to
      fork()/clone() without CLONE_VM, which use the same `struct page` for
      references, until the copy-on-write page gets copied.)
      
      So, additional page references (e.g., from a direct IO read) now prevent
      the rmap/PTE from being unmapped/dropped, similarly to how the page is
      not freed by shrink_page_list()/page_ref_freeze().
      
      - Races and Barriers:
        ==================
      
      The new check in try_to_unmap_one() should be safe in races with
      bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's
      done under the PTE lock.
      
      The fast path doesn't take the lock, but it checks if the PTE has changed
      and if so, it drops the reference and leaves the page for the slow path
      (which does take that lock).
      
      The fast path requires synchronization with a full memory barrier: it
      first writes the page reference count and then reads the PTE, while
      try_to_unmap() first writes the PTE and then reads the page refcount.
      
      And a second barrier is needed, as the page dirty flag should not be read
      before the page reference count (as in __remove_mapping()).  (This can be
      a load memory barrier only; no writes are involved.)
      
      Call stack/comments:
      
      - try_to_unmap_one()
        - page_vma_mapped_walk()
          - map_pte()			# see pte_offset_map_lock():
              pte_offset_map()
              spin_lock()
      
        - ptep_get_and_clear()	# write PTE
        - smp_mb()			# (new barrier) GUP fast path
        - page_ref_count()		# (new check) read refcount
      
        - page_vma_mapped_walk_done()	# see pte_unmap_unlock():
            pte_unmap()
            spin_unlock()
      
      - bio_iov_iter_get_pages()
        - __bio_iov_iter_get_pages()
          - iov_iter_get_pages()
            - get_user_pages_fast()
              - internal_get_user_pages_fast()
      
                # fast path
                - lockless_pages_from_mm()
                  - gup_{pgd,p4d,pud,pmd,pte}_range()
                      ptep = pte_offset_map()		# not _lock()
                      pte = ptep_get_lockless(ptep)
      
                      page = pte_page(pte)
                      try_grab_compound_head(page)	# inc refcount
                                                  	# (RMW/barrier
                                                   	#  on success)
      
                      if (pte_val(pte) != pte_val(*ptep)) # read PTE
                              put_compound_head(page) # dec refcount
                              			# go slow path
      
                # slow path
                - __gup_longterm_unlocked()
                  - get_user_pages_unlocked()
                    - __get_user_pages_locked()
                      - __get_user_pages()
                        - follow_{page,p4d,pud,pmd}_mask()
                          - follow_page_pte()
                              ptep = pte_offset_map_lock()
                              pte = *ptep
                              page = vm_normal_page(pte)
                              try_grab_page(page)	# inc refcount
                              pte_unmap_unlock()
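
      Putting the pieces together, here is a minimal sketch of the shape of
      the new check in try_to_unmap_one(), right after the PTE is cleared
      under the PTE lock.  This is simplified and hedged: it uses page-based
      helpers such as page_ref_count()/total_mapcount() and is illustrative
      only, not the exact upstream hunk.

      	if (PageAnon(page) && !PageSwapBacked(page)) {	/* lazyfree page */
      		int ref_count, map_count;

      		smp_mb();	/* pairs with GUP fast: inc refcount, re-read PTE */
      		ref_count = page_ref_count(page);
      		map_count = total_mapcount(page);
      		smp_rmb();	/* read refcount before dirty flag, as in
      				   __remove_mapping() */

      		if (ref_count != 1 + map_count || PageDirty(page)) {
      			/* extra refs (e.g. direct IO) or redirtied: don't discard */
      			set_pte_at(mm, address, pvmw.pte, pteval);  /* restore PTE */
      			SetPageDirty(page);
      			ret = false;
      			page_vma_mapped_walk_done(&pvmw);
      			break;
      		}
      		/* otherwise fall through to the existing discard of the rmap */
      	}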
      
      - Huge Pages:
        ==========
      
      Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE
      (aka lazyfree) pages are PageAnon() && !PageSwapBacked()
      (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn())
      thus should reach shrink_page_list() -> split_huge_page_to_list() before
      try_to_unmap[_one](), so it deals with normal pages only.
      
      (And in the unlikely case that TTU_SPLIT_HUGE_PMD /
      split_huge_pmd_address() is hit, which should not happen or should be
      rare, the page refcount should be greater than the mapcount: the head
      page is referenced by the tail pages.  That also prevents checking the
      head `page` and then incorrectly calling page_remove_rmap(subpage) for a
      tail page that isn't even in shrink_page_list()'s page_list (an effect
      of the split huge pmd/pvmw), as might happen today in this unlikely
      scenario.)
      
      MADV_FREE'd buffers:
      ===================
      
      So, back to the "if MADV_FREE pages are used as buffers" note.  The case
      is arguable, and subject to multiple interpretations.
      
      The madvise(2) manual page on the MADV_FREE advice value says:
      
      1) 'After a successful MADV_FREE ... data will be lost when
         the kernel frees the pages.'
      2) 'the free operation will be canceled if the caller writes
         into the page' / 'subsequent writes ... will succeed and
         then [the] kernel cannot free those dirtied pages'
      3) 'If there is no subsequent write, the kernel can free the
         pages at any time.'
      
      Thoughts, questions, considerations... respectively:
      
      1) Since the kernel didn't actually free the page (page_ref_freeze()
         failed), should the data not have been lost? (on userspace read.)
      2) Should writes performed by the direct IO read be able to cancel
         the free operation?
         - Should the direct IO read be considered as 'the caller' too,
           as it's been requested by 'the caller'?
         - Should the bio technique to dirty pages on return to userspace
           (bio_check_pages_dirty() is called/used by __blkdev_direct_IO())
           be considered in another/special way here?
      3) Should an upcoming write from a previously requested direct IO
         read be considered as a subsequent write, so the kernel should
         not free the pages? (as it's known at the time of page reclaim.)
      
      And lastly:
      
      Technically, the last point seems a reasonable consideration and
      balance, as the madvise(2) manual page apparently (and fairly) assumes
      that 'writes' are memory accesses from the userspace process (not
      explicitly considering writes from the kernel or its corner cases;
      again, fairly).  Plus, the kernel fix for the corner case of the largely
      'non-atomic write' encompassed by a direct IO read operation is
      relatively simple; and it helps.
      
      Reproducer:
      ==========
      
      @ test.c (simplified, but works)
      
      	#define _GNU_SOURCE
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      
      	int main() {
      		int fd, i;
      		char *buf;
      
      		fd = open(DEV, O_RDONLY | O_DIRECT);
      
      		buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                      	   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			buf[i] = 1; // init to non-zero
      
      		madvise(buf, BUF_SIZE, MADV_FREE);
      
      		read(fd, buf, BUF_SIZE);
      
      		for (i = 0; i < BUF_SIZE; i += PAGE_SIZE)
      			printf("%p: 0x%x\n", &buf[i], buf[i]);
      
      		return 0;
      	}
      
      @ block/fops.c (formerly fs/block_dev.c)
      
      	+#include <linux/swap.h>
      	...
      	... __blkdev_direct_IO[_simple](...)
      	{
      	...
      	+	if (!strcmp(current->comm, "good"))
      	+		shrink_all_memory(ULONG_MAX);
      	+
               	ret = bio_iov_iter_get_pages(...);
      	+
      	+	if (!strcmp(current->comm, "bad"))
      	+		shrink_all_memory(ULONG_MAX);
      	...
      	}
      
      @ shell
      
              # NUM_PAGES=4
              # PAGE_SIZE=$(getconf PAGE_SIZE)
      
              # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES}
              # DEV=$(losetup -f --show test.img)
      
              # gcc -DDEV=\"$DEV\" \
                    -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \
                    -DPAGE_SIZE=${PAGE_SIZE} \
                     test.c -o test
      
              # od -tx1 $DEV
              0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a
              *
              0040000
      
              # mv test good
              # ./good
              0x7f7c10418000: 0x79
              0x7f7c10419000: 0x79
              0x7f7c1041a000: 0x79
              0x7f7c1041b000: 0x79
      
              # mv good bad
              # ./bad
              0x7fa1b8050000: 0x0
              0x7fa1b8051000: 0x0
              0x7fa1b8052000: 0x0
              0x7fa1b8053000: 0x0
      
      Note: the issue reproduces consistently on v5.17-rc3, but only
      intermittently with the initial MADV_FREE support in v4.5 (60%-70% error
      rate; needs swap).  [There, wrap do_direct_IO() in
      do_blockdev_direct_IO() @ fs/direct-io.c.]
      
      - v5.17-rc3:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x0
      
              # free | grep Swap
              Swap:             0           0           0
      
      - v4.5:
      
              # for i in {1..1000}; do ./good; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
              # mv good bad
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 2702  0x0
                 1298  0x79
      
              # swapoff -av
              swapoff /swap
      
              # for i in {1..1000}; do ./bad; done \
                  | cut -d: -f2 | sort | uniq -c
                 4000  0x79
      
      Ceph/TCMalloc:
      =============
      
      For documentation purposes, the use case driving the analysis/fix is
      Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE
      to release unused memory to the system from the mmap'ed page heap (the
      memory may be committed back and used again; it is not munmap'ed):
      - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise()
      - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing.
      
      Note: TCMalloc switched back to MADV_DONTNEED a few commits after the
      release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue
      just 'disappeared' on Ceph on later Ubuntu releases but is still present
      in the kernel, and can be hit by other use cases.
      
      The observed issue seems to be the old Ceph bug #22464 [1], where checksum
      mismatches are observed (and instrumentation with buffer dumps shows
      zero-pages read from mmap'ed/MADV_FREE'd page ranges).
      
      The issue in Ceph was reasonably deemed a kernel bug (comment #50) and
      mostly worked around with a retry mechanism, but other parts of Ceph could
      still hit that (rocksdb).  Anyway, it's less likely to be hit again as
      TCMalloc switched out of MADV_FREE by default.
      
      (Some kernel versions/reports from the Ceph bug, and relation with
      the MADV_FREE introduction/changes; TCMalloc versions not checked.)
      - 4.4 good
      - 4.5 (madv_free: introduction)
      - 4.9 bad
      - 4.10 good? maybe a swapless system
      - 4.12 (madv_free: no longer free instantly on swapless systems)
      - 4.13 bad
      
      [1] https://tracker.ceph.com/issues/22464
      
      Thanks:
      ======
      
      Several people contributed to analysis/discussions/tests/reproducers in
      the first stages when drilling down on ceph/tcmalloc/linux kernel:
      
      - Dan Hill
      - Dan Streetman
      - Dongdong Tao
      - Gavin Guo
      - Gerald Yang
      - Heitor Alves de Siqueira
      - Ioanna Alifieraki
      - Jay Vosburgh
      - Matthew Ruffell
      - Ponnuvel Palaniyappan
      
      Reviews, suggestions, corrections, comments:
      
      - Minchan Kim
      - Yu Zhao
      - Huang, Ying
      - John Hubbard
      - Christoph Hellwig
      
      [mfo@canonical.com: v4]
        Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.com
      Link: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com
      
      Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages")
      Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Dan Hill <daniel.hill@canonical.com>
      Cc: Dan Streetman <dan.streetman@canonical.com>
      Cc: Dongdong Tao <dongdong.tao@canonical.com>
      Cc: Gavin Guo <gavin.guo@canonical.com>
      Cc: Gerald Yang <gerald.yang@canonical.com>
      Cc: Heitor Alves de Siqueira <halves@canonical.com>
      Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com>
      Cc: Jay Vosburgh <jay.vosburgh@canonical.com>
      Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
      Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com>
      Cc: <stable@vger.kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6c8e2a25
    • A
      mm/migration: add trace events for base page and HugeTLB migrations · 4cc79b33
      Committed by Anshuman Khandual
      This adds two trace events for base page and HugeTLB page migrations.
      These events closely follow implementation details such as the setting
      and removal of PTE migration entries, which are essential operations for
      migration.  The new CREATE_TRACE_POINTS in <mm/rmap.c> covers both
      <events/migration.h> and <events/tlb.h> based trace events.  Hence drop
      the redundant CREATE_TRACE_POINTS from other places, which could
      otherwise have conflicted during build.
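
      For illustration, the convention that last sentence refers to, as it
      would look in mm/rmap.c.  This is a generic sketch of the tracepoint
      pattern with the usual trace/events/ header paths assumed, not the
      commit's actual hunk.

      	/* Exactly one compilation unit defines CREATE_TRACE_POINTS before
      	 * including a trace header, so the event bodies are emitted only
      	 * once; all other users include the headers without the define. */
      	#define CREATE_TRACE_POINTS
      	#include <trace/events/migrate.h>
      	#include <trace/events/tlb.h>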
      
      Link: https://lkml.kernel.org/r/1643368182-9588-3-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cc79b33
    • H
      mm/thp: fix NR_FILE_MAPPED accounting in page_*_file_rmap() · 5d543f13
      Committed by Hugh Dickins
      NR_FILE_MAPPED accounting in mm/rmap.c (for /proc/meminfo "Mapped" and
      /proc/vmstat "nr_mapped" and the memcg's memory.stat "mapped_file") is
      slightly flawed for file or shmem huge pages.
      
      It is well thought out, and looks convincing, but there's a racy case when
      the careful counting in page_remove_file_rmap() (without page lock) gets
      discarded.  So that in a workload like two "make -j20" kernel builds under
      memory pressure, with cc1 on hugepage text, "Mapped" can easily grow by a
      spurious 5MB or more on each iteration, ending up implausibly bigger than
      most other numbers in /proc/meminfo.  And, hypothetically, might grow to
      the point of seriously interfering in mm/vmscan.c's heuristics, which do
      take NR_FILE_MAPPED into some consideration.
      
      Fixed by moving the __mod_lruvec_page_state() down to where it will not be
      missed before return (and I've grown a bit tired of that oft-repeated
      but-not-everywhere comment on the __ness: it gets lost in the move here).
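
      A hedged sketch of the resulting shape of page_remove_file_rmap()
      (simplified, with some branches elided; names follow the then-current
      rmap code, and this is illustrative rather than the actual diff): the
      per-subpage mapcount bookkeeping accumulates nr first, and the lruvec
      stat is adjusted at a single exit point, so an early return can no
      longer discard the careful counting.

      	static void page_remove_file_rmap(struct page *page, bool compound)
      	{
      		int i, nr = 0;

      		/* (hugetlb pages are handled and returned early, elided) */

      		if (compound && PageTransHuge(page)) {
      			int nr_pages = thp_nr_pages(page);

      			for (i = 0; i < nr_pages; i++)
      				if (atomic_add_negative(-1, &page[i]._mapcount))
      					nr++;	/* these subpages are now unmapped */
      			if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
      				goto out;	/* was: return, losing nr */
      			/* ... NR_{SHMEM,FILE}_PMDMAPPED adjustment elided ... */
      		} else {
      			if (atomic_add_negative(-1, &page->_mapcount))
      				nr++;
      		}
      	out:
      		if (nr)
      			__mod_lruvec_page_state(page, NR_FILE_MAPPED, -nr);
      	}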
      
      Does page_add_file_rmap() need the same change?  I suspect not, because
      page lock is held in all relevant cases, and its skipping case looks safe;
      but it's much easier to be sure, if we do make the same change.
      
      Link: https://lkml.kernel.org/r/e02e52a1-8550-a57c-ed29-f51191ea2375@google.com
      Fixes: dd78fedd ("rmap: support file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d543f13
  9. 23 Mar, 2022 2 commits
  10. 22 Mar, 2022 13 commits
  11. 17 Mar, 2022 1 commit
  12. 18 Feb, 2022 2 commits
    • H
      mm/thp: shrink_page_list() avoid splitting VM_LOCKED THP · 47d4f3ee
      Committed by Hugh Dickins
      4.8 commit 7751b2da ("vmscan: split file huge pages before paging
      them out") inserted a split_huge_page_to_list() into shrink_page_list()
      without considering the mlock case: no problem if the page has already
      been marked as Mlocked (the !page_evictable check much higher up will
      have skipped all this), but it has always been the case that races or
      omissions in setting Mlocked can rely on page reclaim to detect this
      and correct it before actually reclaiming - and that remains so, but
      what a shame if a hugepage is needlessly split before discovering it.
      
      It is surprising that page_check_references() returns PAGEREF_RECLAIM
      when VM_LOCKED, but there was a good reason for that: try_to_unmap_one()
      is where the condition is detected and corrected; and until now it could
      not be done in page_referenced_one(), because that does not always have
      the page locked.  Now that mlock's requirement for page lock has gone,
      copy try_to_unmap_one()'s mlock restoration into page_referenced_one(),
      and let page_check_references() return PAGEREF_ACTIVATE in this case.
      
      But page_referenced_one() may find a pte mapping one part of a hugepage:
      what hold should a pte mapped in a VM_LOCKED area exert over the entire
      huge page?  That's debatable.  The approach taken here is to treat that
      pte mapping in page_referenced_one() as if not VM_LOCKED; and if no
      VM_LOCKED pmd mapping is found later in the walk, and lack of reference
      permits, then PAGEREF_RECLAIM takes it to attempted splitting as before.
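
      A hedged sketch of that mlock restoration as copied into
      page_referenced_one() (simplified; the helper name mlock_vma_page() and
      the exact compound-page condition are assumptions about the
      then-current mlock API, not a copy of the patch):

      	/* inside page_referenced_one()'s page_vma_mapped_walk() loop: */
      	if ((vma->vm_flags & VM_LOCKED) &&
      	    (!PageTransCompound(page) || !pvmw.pte)) {
      		/* Restore the mlock which got missed */
      		mlock_vma_page(page, vma, !pvmw.pte);
      		page_vma_mapped_walk_done(&pvmw);
      		pra->vm_flags |= VM_LOCKED;
      		return false;	/* page_check_references() -> PAGEREF_ACTIVATE */
      	}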
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      47d4f3ee
    • H
      mm/munlock: page migration needs mlock pagevec drained · b7435507
      Committed by Hugh Dickins
      Page migration of a VM_LOCKED page tends to fail, because when the old
      page is unmapped, it is put on the mlock pagevec with raised refcount,
      which then fails the freeze.
      
      At first I thought this would be fixed by a local mlock_page_drain() at
      the upper rmap_walk() level - which would have nicely batched all the
      munlocks of that page; but tests show that the task can too easily move
      to another cpu, leaving pagevec residue behind which fails the migration.
      
      So make try_to_migrate_one() drain the local pagevec after
      page_remove_rmap() from a VM_LOCKED vma; do the same in
      try_to_unmap_one(), whose TTU_IGNORE_MLOCK users would want the same
      treatment; and do the same in remove_migration_pte() - not important
      when successfully inserting a new page, but necessary when hoping to
      retry after failure.
      
      Any new pagevec runs the risk of adding a new way of stranding, and we
      might discover other corners where mlock_page_drain() or lru_add_drain()
      would now help.
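
      A hedged sketch of that pattern (simplified and illustrative only;
      mlock_page_drain()'s exact name/signature is an assumption about the
      then-current mlock API, not the actual diff):

      	/* e.g. in try_to_migrate_one(), after the migration entry is set: */
      	page_remove_rmap(subpage, vma, PageHuge(page));
      	if (vma->vm_flags & VM_LOCKED)
      		mlock_page_drain(smp_processor_id());	/* flush this CPU's
      							   mlock pagevec */
      	put_page(page);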
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      b7435507