1. 01 December 2022, 11 commits
    • mm: anonymous shared memory naming · d09e8ca6
      Committed by Pasha Tatashin
      Since commit 9a10064f ("mm: add a field to store names for private
      anonymous memory"), a name can be set for private anonymous memory, but
      not for shared anonymous memory.  However, naming shared anonymous memory
      is just as useful for tracking purposes.
      
      Extend the functionality to be able to set names for shared anon.
      
      There are two ways to create anonymous shared memory, using memfd or
      directly via mmap():
      1. fd = memfd_create(...)
         mem = mmap(..., MAP_SHARED, fd, ...)
      2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
      
      In both cases the anonymous shared memory is created the same way by
      mapping an unlinked file on tmpfs.
      
      The memfd way allows giving a name to anonymous shared memory, but it is
      not useful when different parts of the shared memory need distinct names.
      
      Example use case: The VMM maps VM memory as anonymous shared memory (not
      private because VMM is sandboxed and drivers are running in their own
      processes).  However, the VM tells back to the VMM how parts of the memory
      are actually used by the guest, how each of the segments should be backed
      (i.e.  4K pages, 2M pages), and some other information about the segments.
      The naming allows us to monitor the effective memory footprint for each
      of these segments from the host without looking inside the guest.
      
      Sample output:
        /* Create shared anonymous segment */
        anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        /* Name the segment: "MY-NAME" */
        rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                   anon_shmem, SIZE, "MY-NAME");
      
      cat /proc/<pid>/maps (and smaps):
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
      
      If the segment is not named, the output is:
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
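
      A complete, compilable sketch of the flow above (the fallback prctl
      constant definitions are an assumption for older userspace headers; the
      rest follows the snippet and output shown):

        /*
         * Hedged sketch: create an anonymous shared mapping and name it via
         * prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ...) as described above.
         * Requires a kernel that supports naming shared anonymous memory.
         */
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>
        #include <unistd.h>

        #ifndef PR_SET_VMA
        #define PR_SET_VMA              0x53564d41
        #endif
        #ifndef PR_SET_VMA_ANON_NAME
        #define PR_SET_VMA_ANON_NAME    0
        #endif

        #define SIZE (64UL << 20)

        int main(void)
        {
                void *mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
                char cmd[64];

                if (mem == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }

                /* Shows up as [anon_shmem:MY-NAME] in /proc/<pid>/maps */
                if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                          (unsigned long)mem, SIZE, "MY-NAME"))
                        perror("prctl(PR_SET_VMA_ANON_NAME)");

                /* Print the relevant maps line so the name can be observed. */
                snprintf(cmd, sizeof(cmd), "grep MY-NAME /proc/%d/maps", getpid());
                system(cmd);

                munmap(mem, SIZE);
                return 0;
        }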
      
      Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: xu xin <cgel.zte@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: teach release_pages() to take an array of encoded page pointers too · 449c7967
      Committed by Linus Torvalds
      release_pages() could already take either an array of page pointers or an
      array of folio pointers.  Expand it to also accept an array of encoded
      page pointers, which is what both the existing mlock() use and the
      upcoming mmu_gather use of encoded page pointers want.
      
      Note that release_pages() won't actually use, or react to, any extra
      encoded bits.  Instead, this is very much a case of "I have walked the
      array of encoded pages and done everything the extra bits tell me to do,
      now release it all".
      
      Also, while the "either page or folio pointers" dual use was handled with
      a cast of the pointer in "release_folios()", this takes a slightly
      different approach and uses the "transparent union" attribute to describe
      the set of arguments to the function:
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html
      
      which has been supported by gcc forever, but the kernel hasn't used
      before.
      
      That allows us to avoid using various wrappers with casts, and just use
      the same function regardless of use.
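
      For illustration, here is a small user-space sketch of the
      transparent-union idea; the type layout and names are modeled on what
      the commit describes, but they are simplified stand-ins, not the
      kernel's actual definitions:

        #include <stdio.h>

        struct page { int id; };
        struct folio { struct page page; };     /* a folio starts with a page */
        struct encoded_page;                    /* opaque; low bits carry flags */

        /*
         * A transparent union: callers may pass any member type directly,
         * without casts, and the callee sees a single pointer value.
         */
        typedef union {
                struct page **pages;
                struct folio **folios;
                struct encoded_page **encoded_pages;
        } release_pages_arg __attribute__((__transparent_union__));

        static void release_pages(release_pages_arg arg, int nr)
        {
                /* All members share one representation; pick one view. */
                struct page **pages = arg.pages;
                int i;

                for (i = 0; i < nr; i++)
                        printf("releasing page %d\n", pages[i]->id);
        }

        int main(void)
        {
                struct page p0 = { 0 }, p1 = { 1 };
                struct folio f2 = { { 2 } };
                struct page *pages[] = { &p0, &p1 };
                struct folio *folios[] = { &f2 };

                release_pages(pages, 2);        /* no wrapper, no cast */
                release_pages(folios, 1);       /* also accepted as-is */
                return 0;
        }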
      
      Link: https://lkml.kernel.org/r/20221109203051.1835763-2-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite · 6a56ccbc
      Committed by David Hildenbrand
      commit b191f9b1 ("mm: numa: preserve PTE write permissions across a
      NUMA hinting fault") added remembering write permissions using ordinary
      pte_write() for PROT_NONE mapped pages to avoid write faults when
      remapping the page !PROT_NONE on NUMA hinting faults.
      
      That commit noted:
      
          The patch looks hacky but the alternatives looked worse. The tidiest was
          to rewalk the page tables after a hinting fault but it was more complex
          than this approach and the performance was worse. It's not generally
          safe to just mark the page writable during the fault if it's a write
          fault as it may have been read-only for COW so that approach was
          discarded.
      
      Later, commit 288bc549 ("mm/autonuma: let architecture override how
      the write bit should be stashed in a protnone pte.") introduced a family
      of savedwrite PTE functions that didn't necessarily improve the whole
      situation.
      
      One confusing thing is that nowadays, if a page is pte_protnone()
      and pte_savedwrite(), then pte_write() is also true. Another source of
      confusion is that there is only a single pte_mk_savedwrite() call in the
      kernel. All other write-protection code seems to silently rely on
      pte_wrprotect().
      
      Ever since PageAnonExclusive was introduced and we started using it in
      mprotect context via commit 64fe24a3 ("mm/mprotect: try avoiding write
      faults for exclusive anonymous pages when changing protection"), we do
      have machinery in place to avoid write faults when changing protection,
      which is exactly what we want to do here.
      
      Let's similarly do what ordinary mprotect() does nowadays when upgrading
      write permissions and reuse can_change_pte_writable() and
      can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
      writable.
      
      For anonymous pages there should be absolutely no change: if an
      anonymous page is not exclusive, it could not have been mapped writable --
      because only exclusive anonymous pages can be mapped writable.
      
      However, there *might* be a change for writable shared mappings that
      require writenotify: if they are not dirty, we cannot map them writable.
      While it might not matter in practice, we'd need a different way to
      identify whether writenotify is actually required -- and ordinary mprotect
      would benefit from that as well.
      
      Note that we don't optimize for the actual migration case:
      (1) When migration succeeds the new PTE will not be writable because the
          source PTE was not writable (protnone); in the future we
          might just optimize that case similarly by reusing
          can_change_pte_writable()/can_change_pmd_writable() when removing
          migration PTEs.
      (2) When migration fails, we'd have to recalculate the "writable" flag
          because we temporarily dropped the PT lock; for now keep it simple and
          set "writable=false".
      
      We'll remove all savedwrite leftovers next.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: factor out check whether manual PTE write upgrades are required · eb309ec8
      Committed by David Hildenbrand
      Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be
      reused in NUMA hinting fault context soon.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped · 4b51634c
      Committed by Hugh Dickins
      Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? 
      Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
      but if we slightly abuse subpages_mapcount by additionally demanding that
      one bit be set there when the compound page is PMD-mapped, then a cascade
      of two atomic ops is able to maintain the stats without bit_spin_lock.
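
      A rough user-space illustration of that bit trick (the bit position and
      helper names here are made up for the sketch; the kernel's actual
      COMPOUND_MAPPED accounting is more involved):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        /* Illustrative value only: one high bit means "PMD-mapped",
         * the low bits count PTE-mapped subpages. */
        #define COMPOUND_MAPPED_BIT     (1u << 30)

        static atomic_uint subpages_mapcount = 0;

        static void add_pmd_mapping(void)
        {
                atomic_fetch_add(&subpages_mapcount, COMPOUND_MAPPED_BIT);
        }

        static void add_pte_mapping(void)
        {
                atomic_fetch_add(&subpages_mapcount, 1u);
        }

        static bool page_mapped_sketch(void)
        {
                /* Non-zero means "PMD-mapped or some subpage PTE-mapped". */
                return atomic_load(&subpages_mapcount) != 0;
        }

        int main(void)
        {
                add_pmd_mapping();
                add_pte_mapping();
                printf("mapped: %d, pte-mapped subpages: %u\n",
                       page_mapped_sketch(),
                       atomic_load(&subpages_mapcount) & (COMPOUND_MAPPED_BIT - 1));
                return 0;
        }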
      
      This is harder to reason about than when bit_spin_locked, but I believe
      safe; and no drift in stats detected when testing.  When there are racing
      removes and adds, of course the sequence of operations is less well-
      defined; but each operation on subpages_mapcount is atomically good.  What
      might be disastrous, is if subpages_mapcount could ever fleetingly appear
      negative: but the pte lock (or pmd lock) these rmap functions are called
      under, ensures that a last remove cannot race ahead of a first add.
      
      Continue to make an exception for hugetlb (PageHuge) pages, though that
      exception can be easily removed by a further commit if necessary: leave
      subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
      carry on checking compound_mapcount too in folio_mapped(), page_mapped().
      
      Evidence is that this way goes slightly faster than the previous
      implementation in all cases (pmds after ptes now taking around 103ms); and
      relieves us of worrying about contention on the bit_spin_lock.
      
      Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: subpages_mapcount of PTE-mapped subpages · be5ef2d9
      Committed by Hugh Dickins
      Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.
      
      
      This patch (of 3):
      
      Following suggestion from Linus, instead of counting every PTE map of a
      compound page in subpages_mapcount, just count how many of its subpages
      are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
      NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
      requires updating the count less often.
      
      This does then revert total_mapcount() and folio_mapcount() to needing a
      scan of subpages; but they are inherently racy, and need no locking, so
      Linus is right that the scans are much better done there.  Plus (unlike in
      6.1 and previous) subpages_mapcount lets us avoid the scan in the common
      case of no PTE maps.  And page_mapped() and folio_mapped() remain scanless
      and just as efficient with the new meaning of subpages_mapcount: those are
      the functions which I most wanted to remove the scan from.
      
      The updated page_dup_compound_rmap() is no longer suitable for use by anon
      THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
      that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
      
      Evidence is that this way goes slightly faster than the previous
      implementation for most cases; but significantly faster in the (now
      scanless) pmds after ptes case, which started out at 870ms and was brought
      down to 495ms by the previous series, now takes around 105ms.
      
      Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
      Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: simplify compound page mapcount handling · cb67f428
      Committed by Hugh Dickins
      Compound page (folio) mapcount calculations have been different for anon
      and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. 
      And each huge mapping and unmapping of a file (or shmem) THP involved
      atomically incrementing and decrementing the mapcount of every subpage of
      that huge page, dirtying many struct page cachelines.
      
      Add subpages_mapcount field to the struct folio and first tail page, so
      that the total of subpage mapcounts is available in one place near the
      head: then page_mapcount() and total_mapcount() and page_mapped(), and
      their folio equivalents, are so quick that anon and file and hugetlb don't
      need to be optimized differently.  Delete the unloved PageDoubleMap.
      
      page_add and page_remove rmap functions must now maintain the
      subpages_mapcount as well as the subpage _mapcount, when dealing with pte
      mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
      NR_FILE_MAPPED statistics still needs reading through the subpages, using
      nr_subpages_unmapped() - but only when first or last pmd mapping finds
      subpages_mapcount raised (double-map case, not the common case).
      
      But are those counts (used to decide when to split an anon THP, and in
      vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
      quite: since page_remove_rmap() (and also split_huge_pmd()) is often
      called without page lock, there can be races when a subpage pte mapcount
      0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
      previous implementation had prevented.  The statistics might become
      inaccurate, and even drift down until they underflow through 0.  That is
      not good enough, but is better dealt with in a followup patch.
      
      Update a few comments on first and second tail page overlaid fields. 
      hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
      subpages_mapcount and compound_pincount are already correctly at 0, so
      delete its reinitialization of compound_pincount.
      
      A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
      18 seconds on small pages, and used to take 1 second on huge pages, but
      now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
      used to take 860ms and now takes 92ms; mapping by pmds after mapping by
      ptes (when the scan is needed) used to take 870ms and now takes 495ms. 
      But there might be some benchmarks which would show a slowdown, because
      tail struct pages now fall out of cache until final freeing checks them.
      
      Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, hwpoison: when copy-on-write hits poison, take page offline · d302c239
      Committed by Tony Luck
      memory_failure() cannot be called directly from the fault handler because
      mmap_lock (and other locks) are held.
      
      It is important, but not urgent, to mark the source page as h/w poisoned
      and unmap it from other tasks.
      
      Use memory_failure_queue() to request a call to memory_failure() for the
      page with the error.
      
      Also provide a stub version for CONFIG_MEMORY_FAILURE=n
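
      The stub follows the usual kernel pattern; a hedged sketch of the idea
      (not the verbatim hunk from the patch):

        /*
         * Callers such as the copy-on-write path can call
         * memory_failure_queue() unconditionally; with CONFIG_MEMORY_FAILURE=n
         * it compiles away to a no-op.
         */
        #ifdef CONFIG_MEMORY_FAILURE
        extern void memory_failure_queue(unsigned long pfn, int flags);
        #else
        static inline void memory_failure_queue(unsigned long pfn, int flags)
        {
        }
        #endif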
      
      Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Shuai Xue <xueshuai@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert mm's rss stats into percpu_counter · f1a79412
      Committed by Shakeel Butt
      Currently mm_struct maintains rss_stats which are updated on the page
      fault and unmapping codepaths.  For the page fault codepath the updates
      are cached per thread, with a batch size of TASK_RSS_EVENTS_THRESH (64).
      The reason for caching is performance for multithreaded applications;
      otherwise the rss_stats updates may become a hotspot for such
      applications.
      
      However this optimization comes with the cost of error margin in the rss
      stats.  The rss_stats for applications with large number of threads can be
      very skewed.  At worst the error margin is (nr_threads * 64) and we have a
      lot of applications with 100s of threads, so the error margin can be very
      high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
      
      Recently we started seeing unbounded errors in rss_stats for specific
      applications which use TCP receive zerocopy (rx0cp).  It seems like the
      vm_insert_pages()
      codepath does not sync rss_stats at all.
      
      This patch converts the rss_stats into percpu_counter, which changes the
      error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).
      However, this conversion enables us to get accurate stats in situations
      where accuracy is more important than the CPU cost.
      
      This patch does not make such tradeoffs - we can just use
      percpu_counter_add_local() for the updates and percpu_counter_sum() (or
      percpu_counter_sync() + percpu_counter_read()) for the readers.  At the
      moment the readers are the procfs interface, the OOM killer and memory
      reclaim, which I think are not performance critical and should be ok with
      a slow read.  However, I think we can make that change in a separate
      patch.
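
      A hedged kernel-style sketch of the split described above (the wrapper
      names are illustrative, not the real accessors):

        /*
         * Updaters on the fault/unmap paths use the cheap, batched per-CPU
         * add; readers that need an accurate value (procfs, OOM killer,
         * reclaim) pay for the cross-CPU sum.
         */
        #include <linux/percpu_counter.h>

        static void rss_add_sketch(struct percpu_counter *counter, long value)
        {
                percpu_counter_add_local(counter, value);
        }

        static long rss_read_sketch(struct percpu_counter *counter)
        {
                return percpu_counter_sum(counter);     /* walks all CPUs */
        }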
      
      Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Committed by Mike Kravetz
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused:
      VM_MAYSHARE indicates a sharable vma, but it was no longer set, so
      page_mapping was not set in new pages added to the page table.  This
      resulted in pages that appeared anonymous in a VM_SHARED vma and
      triggered the BUG.
      
      Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
      call from unmap_vmas().  This is used to indicate the 'final' unmapping of
      a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
      the vma_lock is not deleted.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Committed by Mike Kravetz
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications".  Since there are no user
      visible effects, this third patch is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 09 November 2022, 6 commits
  3. 04 October 2022, 2 commits
  4. 27 September 2022, 12 commits
  5. 12 September 2022, 3 commits
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      Committed by David Hildenbrand
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, the PTE is
      cleared/invalidated and the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: kill find_min_pfn_with_active_regions() · fb70c487
      Committed by Kefeng Wang
      find_min_pfn_with_active_regions() is only called from free_area_init(). 
      Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
      and kill find_min_pfn_with_active_regions().
      
      Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tiering: hot page selection with hint page fault latency · 33024536
      Committed by Huang Ying
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory nodes need to be identified. 
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm for identifying hot pages, because pages with quite low access
      frequency may eventually be accessed anyway, given that the NUMA
      balancing page table scanning period can be quite long (e.g. 60 seconds).
      So in this patchset, we implement a new hot page identification algorithm
      based on the latency between NUMA balancing page table scanning and the
      hint page fault, which is a kind of most frequently accessed (MFU)
      algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in slow
      memory node and cold pages in fast memory node, we need to promote/demote
      hot/cold pages between the fast and cold memory nodes.
      
      One choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by a high promoting/demoting throughput
      will hurt the latency of some workloads, because of access latency
      inflation and contention for slow memory bandwidth.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results are
      as follows,
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With threshold auto adjustment patch [3/3], pmbench score increases about
      4.7%, and promote rate decrease about 17.1%, compared with rate limit
      patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The TPS can be improved by about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory node need to be identified.  Essentially,
      the original NUMA balancing implementation selects the most recently
      accessed (MRU) pages to promote.  But this isn't a perfect algorithm for
      identifying hot pages, because pages with quite low access frequency may
      eventually be accessed anyway, given that the NUMA balancing page table
      scanning period can be quite long (e.g. 60 seconds).  The most frequently
      accessed (MFU) algorithm is better.
      
      So, in this patch we implemented a better hot page selection algorithm,
      which is based on NUMA balancing page table scanning and hint page faults
      as follows,
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault
        latency is defined as
      
          hint page fault time - scan time
      
      The shorter the hint page fault latency of a page, the more likely its
      access frequency is high.  So the hint page fault latency is a better
      estimate of how hot or cold a page is.
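
      In code form, the test amounts to something like this hedged sketch
      (names and the millisecond units are illustrative, not the kernel's
      internal representation):

        /*
         * A page is considered hot when the latency between the NUMA
         * balancing scan (which made it PROT_NONE and recorded the scan
         * time) and the hint page fault is below the hot threshold
         * (1 second by default).
         */
        static int page_is_hot(unsigned long scan_time_ms,
                               unsigned long fault_time_ms,
                               unsigned long hot_threshold_ms)
        {
                unsigned long latency_ms = fault_time_ms - scan_time_ms;

                return latency_ms < hot_threshold_ms;
        }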
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the page accessing
      CPU and PID (referring to page_cpupid_xchg_last()).  These bits are used
      by the multi-stage node selection algorithm to avoid migrating pages that
      are accessed by multiple NUMA nodes back and forth.  But for pages in the
      slow memory node, even if they are accessed by multiple NUMA nodes, as
      long as the pages are hot, they need to be promoted to the fast memory
      node.  So the accessing CPU and PID information is unnecessary for the
      slow memory pages.  We can reuse these bits in struct page to record the
      scan time.  For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to workload
      hot spot changes may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold will not be used, all pages will be promoted upon the
      hint page fault for fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug when
      disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. 29 August 2022, 1 commit
  7. 21 August 2022, 1 commit
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      Committed by David Hildenbrand
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
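
      Putting those rules together, the decision for a read-only PTE looks
      roughly like the following hedged sketch (the logic paraphrases the
      description above; it is not the literal helper added by the patch):

        /*
         * A read-only PTE may only satisfy a FOLL_FORCE write if COW was
         * already properly broken (exclusive anonymous page) and neither
         * softdirty tracking nor uffd-wp would be silently bypassed.
         */
        static bool can_force_write_sketch(struct vm_area_struct *vma, pte_t pte,
                                           struct page *page, unsigned int flags)
        {
                if (pte_write(pte))
                        return true;            /* already writable */
                if (!(flags & FOLL_FORCE))
                        return false;           /* nobody asked to override */
                if (vma->vm_flags & VM_WRITE)
                        return false;           /* writable mapping: take a real write fault */
                if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                        return false;           /* COW not (or no longer) broken */
                if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                        return false;           /* would bypass softdirty tracking */
                return !userfaultfd_pte_wp(vma, pte);   /* don't bypass uffd-wp */
        }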
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. 09 August 2022, 3 commits
  9. 19 July 2022, 1 commit