1. 01 December 2022, 7 commits
    • mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped · 4b51634c
      Authored by Hugh Dickins
      Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? 
      Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
      but if we slightly abuse subpages_mapcount by additionally demanding that
      one bit be set there when the compound page is PMD-mapped, then a cascade
      of two atomic ops is able to maintain the stats without bit_spin_lock.
      
      This is harder to reason about than when bit_spin_locked, but I believe
      safe; and no drift in stats detected when testing.  When there are racing
      removes and adds, of course the sequence of operations is less well-
      defined; but each operation on subpages_mapcount is atomically good.  What
      might be disastrous, is if subpages_mapcount could ever fleetingly appear
      negative: but the pte lock (or pmd lock) these rmap functions are called
      under, ensures that a last remove cannot race ahead of a first add.
      
      Continue to make an exception for hugetlb (PageHuge) pages, though that
      exception can be easily removed by a further commit if necessary: leave
      subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
      carry on checking compound_mapcount too in folio_mapped(), page_mapped().
      
      Evidence is that this way goes slightly faster than the previous
      implementation in all cases (pmds after ptes now taking around 103ms); and
      relieves us of worrying about contention on the bit_spin_lock.
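      The flag-in-counter scheme can be sketched as a userspace model (the constant value and helper names here are illustrative, not the kernel's actual definitions):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative constants; the kernel's COMPOUND_MAPPED may differ. */
#define COMPOUND_MAPPED  0x800000
#define SUBPAGES_MAPPED  (COMPOUND_MAPPED - 1)

static atomic_int subpages_mapcount;   /* low bits: # PTE-mapped subpages */

static void pte_map_subpage(void)   { atomic_fetch_add(&subpages_mapcount, 1); }
static void pte_unmap_subpage(void) { atomic_fetch_sub(&subpages_mapcount, 1); }

/* First PMD map: one atomic op both sets the flag bit and reports how
 * many subpages were already PTE-mapped, so the stats can be maintained
 * without a bit_spin_lock. */
static int first_pmd_map(void)
{
    return atomic_fetch_add(&subpages_mapcount, COMPOUND_MAPPED)
           & SUBPAGES_MAPPED;
}

static void last_pmd_unmap(void)
{
    atomic_fetch_sub(&subpages_mapcount, COMPOUND_MAPPED);
}

/* page_mapped()/folio_mapped() for non-hugetlb pages: one field read. */
static int model_page_mapped(void)
{
    return atomic_load(&subpages_mapcount) != 0;
}
```

      Because the flag lives in the same atomic word as the count, the mapped check is a single read, and each add/remove is one atomic op, matching the "cascade of two atomic ops" described above.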
      
      Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: subpages_mapcount of PTE-mapped subpages · be5ef2d9
      Authored by Hugh Dickins
      Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.
      
      
      This patch (of 3):
      
      Following suggestion from Linus, instead of counting every PTE map of a
      compound page in subpages_mapcount, just count how many of its subpages
      are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
      NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
      requires updating the count less often.
      
      This does then revert total_mapcount() and folio_mapcount() to needing a
      scan of subpages; but they are inherently racy, and need no locking, so
      Linus is right that the scans are much better done there.  Plus (unlike in
      6.1 and previous) subpages_mapcount lets us avoid the scan in the common
      case of no PTE maps.  And page_mapped() and folio_mapped() remain scanless
      and just as efficient with the new meaning of subpages_mapcount: those are
      the functions which I most wanted to remove the scan from.
      
      The updated page_dup_compound_rmap() is no longer suitable for use by anon
      THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
      that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
      
      Evidence is that this way goes slightly faster than the previous
      implementation for most cases; but significantly faster in the (now
      scanless) pmds after ptes case, which started out at 870ms and was brought
      down to 495ms by the previous series, now takes around 105ms.
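      The new division of labor can be sketched as a simplified userspace mock (a model of the idea, not the kernel's struct layout or exact mapcount arithmetic):

```c
#include <assert.h>
#include <string.h>

#define NR_SUBPAGES 512   /* e.g. a 2MB THP of 4KB base pages */

/* Simplified mock of the relevant folio fields. */
struct folio_model {
    int compound_mapcount;        /* # PMD (huge) mappings */
    int subpages_mapcount;        /* # subpages that have a PTE map */
    int _mapcount[NR_SUBPAGES];   /* per-subpage PTE map counts */
};

/* Scanless folio_mapped(): two field reads, no loop over subpages. */
static int folio_mapped_model(const struct folio_model *f)
{
    return f->compound_mapcount > 0 || f->subpages_mapcount > 0;
}

/* total_mapcount() is again a scan, but only in the uncommon case
 * that some subpage is PTE-mapped at all. */
static int total_mapcount_model(const struct folio_model *f)
{
    int total = f->compound_mapcount;

    if (f->subpages_mapcount)
        for (int i = 0; i < NR_SUBPAGES; i++)
            total += f->_mapcount[i];
    return total;
}
```

      The common no-PTE-maps case skips the scan entirely, which is why the scanning can safely move into the inherently racy, lock-free total_mapcount()/folio_mapcount() readers.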
      
      Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
      Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: simplify compound page mapcount handling · cb67f428
      Authored by Hugh Dickins
      Compound page (folio) mapcount calculations have been different for anon
      and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. 
      And each huge mapping and unmapping of a file (or shmem) THP involved
      atomically incrementing and decrementing the mapcount of every subpage of
      that huge page, dirtying many struct page cachelines.
      
      Add subpages_mapcount field to the struct folio and first tail page, so
      that the total of subpage mapcounts is available in one place near the
      head: then page_mapcount() and total_mapcount() and page_mapped(), and
      their folio equivalents, are so quick that anon and file and hugetlb don't
      need to be optimized differently.  Delete the unloved PageDoubleMap.
      
      page_add and page_remove rmap functions must now maintain the
      subpages_mapcount as well as the subpage _mapcount, when dealing with pte
      mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
      NR_FILE_MAPPED statistics still needs reading through the subpages, using
      nr_subpages_unmapped() - but only when first or last pmd mapping finds
      subpages_mapcount raised (double-map case, not the common case).
      
      But are those counts (used to decide when to split an anon THP, and in
      vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
      quite: since page_remove_rmap() (and also split_huge_pmd()) is often
      called without page lock, there can be races when a subpage pte mapcount
      transitions 0<->1 while a compound pmd mapcount 0<->1 transition is
      scanning the subpages - races which the
      previous implementation had prevented.  The statistics might become
      inaccurate, and even drift down until they underflow through 0.  That is
      not good enough, but is better dealt with in a followup patch.
      
      Update a few comments on first and second tail page overlaid fields. 
      hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
      subpages_mapcount and compound_pincount are already correctly at 0, so
      delete its reinitialization of compound_pincount.
      
      A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
      18 seconds on small pages, and used to take 1 second on huge pages, but
      now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
      used to take 860ms and now takes 92ms; mapping by pmds after mapping by
      ptes (when the scan is needed) used to take 870ms and now takes 495ms. 
      But there might be some benchmarks which would show a slowdown, because
      tail struct pages now fall out of cache until final freeing checks them.
      
      Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, hwpoison: when copy-on-write hits poison, take page offline · d302c239
      Authored by Tony Luck
      memory_failure() cannot be called directly from the fault handler because
      mmap_lock (and others) are held.
      
      It is important, but not urgent, to mark the source page as h/w poisoned
      and unmap it from other tasks.
      
      Use memory_failure_queue() to request a call to memory_failure() for the
      page with the error.
      
      Also provide a stub version for CONFIG_MEMORY_FAILURE=n.
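      The deferral pattern can be modeled in userspace (names and return values here are illustrative stand-ins for the kernel's fault-path code):

```c
#include <assert.h>
#include <stdbool.h>

/* Model: the COW fault path runs with mmap_lock held, so instead of
 * calling the heavyweight handler it queues the bad pfn for deferred
 * processing once locks are dropped. */
#define QUEUE_MAX 16
static unsigned long poison_queue[QUEUE_MAX];
static int poison_queued;

static void memory_failure_queue_model(unsigned long pfn)
{
    if (poison_queued < QUEUE_MAX)
        poison_queue[poison_queued++] = pfn;   /* handled later */
}

/* COW copy step: on a machine check while reading the source page,
 * queue the pfn and fail the fault rather than calling the full
 * handler under mmap_lock. */
static int cow_copy_model(char *dst, const char *src,
                          unsigned long src_pfn, bool mc_during_copy)
{
    if (mc_during_copy) {
        memory_failure_queue_model(src_pfn);
        return -1;    /* caller would return a hwpoison fault */
    }
    *dst = *src;
    return 0;
}
```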
      
      Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Shuai Xue <xueshuai@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert mm's rss stats into percpu_counter · f1a79412
      Authored by Shakeel Butt
      Currently mm_struct maintains rss_stats which are updated on page fault
      and the unmapping codepaths.  For the page fault codepath the updates are
      cached per thread, in batches of TASK_RSS_EVENTS_THRESH, which is 64.
      The reason for caching is performance for multithreaded applications:
      otherwise the rss_stats updates may become a hotspot for such applications.
      
      However, this optimization comes at the cost of an error margin in the rss
      stats.  The rss_stats for applications with a large number of threads can
      be very skewed.  At worst the error margin is (nr_threads * 64), and we
      have a lot of applications with 100s of threads, so the error margin can
      be very high.  Internally we had to reduce TASK_RSS_EVENTS_THRESH to 32.
      
      Recently we started seeing unbounded errors in rss_stats for specific
      applications which use TCP rx zerocopy.  It seems like the
      vm_insert_pages() codepath does not sync rss_stats at all.
      
      This patch converts the rss_stats into percpu_counter, reducing the error
      margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).  Moreover,
      this conversion enables us to get accurate stats in situations where
      accuracy is more important than the cpu cost.
      
      This patch does not make such tradeoffs - we can just use
      percpu_counter_add_local() for the updates and percpu_counter_sum() (or
      percpu_counter_sync() + percpu_counter_read()) for the readers.  At the
      moment the readers are the procfs interface, the oom_killer and memory
      reclaim, which I think are not performance critical and should be ok with
      a slow read.  However I think we can make that change in a separate patch.
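      The tradeoff can be sketched as a userspace model (this is the general per-CPU-counter idea, not the kernel's percpu_counter implementation):

```c
#include <assert.h>

/* Model: updates touch only a per-CPU slot; an accurate reader sums
 * all slots, so a plain read of the shared total is off by at most the
 * in-flight per-CPU deltas. */
#define NR_CPUS 4

struct pcpu_counter {
    long count;            /* shared, possibly stale total */
    long pcpu[NR_CPUS];    /* per-CPU local deltas */
};

static void counter_add_local(struct pcpu_counter *c, int cpu, long amount)
{
    c->pcpu[cpu] += amount;   /* fast path: no cross-CPU cacheline traffic */
}

static long counter_sum(const struct pcpu_counter *c)
{
    long sum = c->count;      /* slow but accurate read */

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += c->pcpu[cpu];
    return sum;
}
```

      This is why the error margin becomes a function of nr_cpus rather than nr_threads: only CPUs, not threads, hold unflushed deltas.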
      
      Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Authored by Mike Kravetz
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused as
      VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
      not set in new pages added to the page table.  This resulted in pages that
      appeared anonymous in a VM_SHARED vma and triggered the BUG.
      
      Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
      call from unmap_vmas().  This is used to indicate the 'final' unmapping of
      a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
      the vma_lock is not deleted.
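      The fix can be sketched as a userspace model (the flag value and field names are illustrative, not the kernel's definitions):

```c
#include <assert.h>
#include <stdbool.h>

#define ZAP_FLAG_UNMAP  0x1   /* set only on the unmap_vmas() path */

struct hugetlb_vma_model {
    bool has_vma_lock;        /* prevents pmd-sharing races while present */
};

/* Model of __unmap_hugepage_range_final: only delete the vma_lock
 * when this really is the final unmap of the vma. */
static void unmap_hugepage_range_final_model(struct hugetlb_vma_model *vma,
                                             unsigned long zap_flags)
{
    /* ... clear the page tables for the range ... */

    if (zap_flags & ZAP_FLAG_UNMAP)
        vma->has_vma_lock = false;  /* vma is really going away */
    /* MADV_DONTNEED: flag not set, vma survives, keep the vma_lock */
}
```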
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Authored by Mike Kravetz
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications".  Since there are no user
      visible effects, this third patch is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
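      The pmd-sharing adjustment can be sketched in userspace: shared hugetlb pmds cover a whole PUD-sized region, so the mmu notifier range has to be widened to PUD boundaries before zapping (the constants below are illustrative, x86-64-like values):

```c
#include <assert.h>

#define PUD_SIZE (1UL << 30)
#define PUD_MASK (~(PUD_SIZE - 1))

struct mmu_range { unsigned long start, end; };

/* Model of the range widening: unsharing a shared pmd can affect
 * mappings outside the zapped range, so cover the full PUD span. */
static void adjust_range_for_pmd_sharing_model(struct mmu_range *r)
{
    r->start &= PUD_MASK;                         /* round down */
    r->end = (r->end + PUD_SIZE - 1) & PUD_MASK;  /* round up */
}
```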
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 09 November 2022, 6 commits
  3. 04 October 2022, 2 commits
  4. 27 September 2022, 12 commits
  5. 12 September 2022, 3 commits
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      Authored by David Hildenbrand
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, that the PTE is
      cleared/invalidated and that the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
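      The GUP-fast (thread B) side of the sequence above can be modeled single-threaded in userspace with C11 fences standing in for the kernel barriers (names and the PTE encoding are illustrative):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static _Atomic long ptep = 0x1000;                /* mapped, R/O PTE */
static _Atomic int  page_pins;
static _Atomic bool page_anon_exclusive = true;

static bool gup_fast_pin_model(void)
{
    long pte = atomic_load(&ptep);                /* (B1) read the PTE */

    atomic_fetch_add(&page_pins, 1);              /* (B3) pin the page */
    atomic_thread_fence(memory_order_seq_cst);    /* smp_mb(): pairs with A's */
    if (atomic_load(&ptep) != pte)                /* (B4) did the PTE change? */
        goto back_off;
    atomic_thread_fence(memory_order_acquire);    /* smp_rmb(): (B4) before (B5) */
    if (!atomic_load(&page_anon_exclusive))       /* (B5) page may be shared */
        goto back_off;
    return true;

back_off:
    atomic_fetch_sub(&page_pins, 1);
    return false;
}
```

      The seq_cst fence is the smp_mb() that pairs with the clearing side's barrier between "clear PTE" and "check pinned"; the acquire fence is the smp_rmb() that orders the PTE re-check before the PageAnonExclusive check.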
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: kill find_min_pfn_with_active_regions() · fb70c487
      Authored by Kefeng Wang
      find_min_pfn_with_active_regions() is only called from free_area_init(). 
      Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
      and kill find_min_pfn_with_active_regions().
      
      Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • memory tiering: hot page selection with hint page fault latency · 33024536
      Authored by Huang Ying
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory nodes need to be identified. 
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm to identify the hot pages, because pages with quite low access
      frequency may still be accessed eventually, given that the NUMA balancing
      page table scanning period can be quite long (e.g.  60 seconds).  So in
      this patchset, we implement a new hot page identification algorithm based
      on the latency between NUMA balancing page table scanning and the hint
      page fault, which is a kind of most frequently accessed (MFU) algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in the slow
      memory node and cold pages in the fast memory node, we need to
      promote/demote hot/cold pages between the fast and slow memory nodes.
      
      A choice is to promote/demote as fast as possible.  But the CPU cycles and
      memory bandwidth consumed by a high promoting/demoting throughput will
      hurt the latency of some workloads, because of access inflation and slow
      memory bandwidth contention.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results are
      as follows:
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With threshold auto adjustment patch [3/3], pmbench score increases about
      4.7%, and promote rate decrease about 17.1%, compared with rate limit
      patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The tps is improved by about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA balancing,
      the hot pages in the slow memory node need to be identified.  Essentially,
      the original NUMA balancing implementation selects the most recently
      accessed (MRU) pages to promote.  But this isn't a perfect algorithm to
      identify the hot pages, because pages with quite low access frequency may
      still be accessed eventually, given that the NUMA balancing page table
      scanning period can be quite long (e.g.  60 seconds).  The most frequently
      accessed (MFU) algorithm is better.
      
      So, in this patch we implemented a better hot page selection algorithm,
      which is based on NUMA balancing page table scanning and the hint page
      fault, as follows:
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault
        latency is defined as
      
          hint page fault time - scan time
      
      The shorter the hint page fault latency of a page is, the more likely its
      access frequency is to be high.  So the hint page fault latency is a
      better estimation of whether a page is hot or cold.
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the page accessing
      CPU and PID (see page_cpupid_xchg_last()).  These are used by the
      multi-stage node selection algorithm to avoid migrating pages that are
      share-accessed by multiple NUMA nodes back and forth.  But for pages in the slow
      memory node, even if they are shared accessed by multiple NUMA nodes, as
      long as the pages are hot, they need to be promoted to the fast memory
      node.  So the accessing CPU and PID information are unnecessary for the
      slow memory pages.  We can reuse these bits in struct page to record the
      scan time.  For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
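      The selection rule reduces to a single comparison; a minimal sketch (times in milliseconds, using the default threshold mentioned above):

```c
#include <assert.h>
#include <stdbool.h>

#define HOT_THRESHOLD_MS 1000L   /* default hot threshold: 1 second */

/* A page is "hot" when the hint page fault arrives soon after the
 * scan that made its PTE/PMD PROT_NONE. */
static bool page_is_hot_model(long scan_time_ms, long fault_time_ms)
{
    /* hint page fault latency = fault time - scan time */
    return fault_time_ms - scan_time_ms < HOT_THRESHOLD_MS;
}
```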
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to the workload
      hot spot changing may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold is not used: all pages are promoted upon the hint page
      fault, for fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug that
      occurred when disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  6. 29 August 2022, 1 commit
  7. 21 August 2022, 1 commit
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      Authored by David Hildenbrand
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5535be30
  8. 09 August 2022 (3 commits)
  9. 19 July 2022 (1 commit)
  10. 18 July 2022 (4 commits)
    • A
      mm/mmap: drop ARCH_HAS_VM_GET_PAGE_PROT · 3d923c5f
      Committed by Anshuman Khandual
      Now all the platforms enable ARCH_HAS_VM_GET_PAGE_PROT.  They each
      define and export their own vm_get_page_prot(), whether custom or via
      the standard DECLARE_VM_GET_PAGE_PROT.  Hence there is no need for a
      default generic fallback for vm_get_page_prot().  Just drop this
      fallback along with the ARCH_HAS_VM_GET_PAGE_PROT mechanism itself.
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-27-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      3d923c5f
    • A
      mm/mmap: build protect protection_map[] with ARCH_HAS_VM_GET_PAGE_PROT · 09095f74
      Committed by Anshuman Khandual
      Now that protection_map[] has been moved inside those platforms that
      enable ARCH_HAS_VM_GET_PAGE_PROT, the generic protection_map[] array
      can be build-protected with CONFIG_ARCH_HAS_VM_GET_PAGE_PROT instead
      of __P000.
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-8-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      09095f74
    • A
      mm/mmap: build protect protection_map[] with __P000 · 84053271
      Committed by Anshuman Khandual
      Patch series "mm/mmap: Drop __SXXX/__PXXX macros from across platforms",
      v7.
      
      The __SXXX/__PXXX macros are an unnecessary abstraction layer in
      creating the generic protection_map[] array which is used for
      vm_get_page_prot().  This abstraction layer can be avoided if the
      platforms just define the protection_map[] array for all possible
      vm_flags access permission combinations and also export a
      vm_get_page_prot() implementation.
      
      This series drops the __SXXX/__PXXX macros from across platforms in
      the tree.  First it build-protects the generic protection_map[] array
      with '#ifdef __P000' and moves it inside the platforms which enable
      ARCH_HAS_VM_GET_PAGE_PROT.  Later it build-protects the same array
      with '#ifdef ARCH_HAS_VM_GET_PAGE_PROT' and moves it inside the
      remaining platforms while enabling ARCH_HAS_VM_GET_PAGE_PROT.  This
      adds a new macro DECLARE_VM_GET_PAGE_PROT defining the current
      generic vm_get_page_prot(), so that it can be reused on platforms
      that do not require a custom implementation.  Finally,
      ARCH_HAS_VM_GET_PAGE_PROT can just be dropped, as all platforms now
      define and export vm_get_page_prot() by looking up a private and
      static protection_map[] array.  The protection_map[] data type has
      been changed to 'static const' on all platforms that do not change
      it during boot.
      
      
      This patch (of 26):
      
      Build-protect the generic protection_map[] array with __P000, so that
      it can be moved inside all the platforms one after the other.
      Otherwise there would be build failures during this process.
      CONFIG_ARCH_HAS_VM_GET_PAGE_PROT cannot be used for this purpose, as
      only certain platforms enable this config at this point.
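      For reference, a userspace sketch of the lookup that the series'
      DECLARE_VM_GET_PAGE_PROT ultimately boils down to.  The vm_flags bit
      values and pgprot tokens here are illustrative stand-ins, not the
      real arch definitions:

```c
#include <assert.h>

/* Illustrative stand-ins for the real arch definitions. */
#define VM_READ    0x1UL
#define VM_WRITE   0x2UL
#define VM_EXEC    0x4UL
#define VM_SHARED  0x8UL

typedef unsigned long pgprot_t;
enum { PAGE_NONE, PAGE_READONLY, PAGE_COPY, PAGE_SHARED };

/* 16 entries, one per combination of the four permission bits (unlisted
 * designated-initializer slots default to 0 == PAGE_NONE).  In a private
 * (COW) mapping, requesting write still yields a copy-on-write
 * protection; only VM_SHARED mappings get a truly writable one. */
static const pgprot_t protection_map[16] = {
	[0]                              = PAGE_NONE,
	[VM_READ]                        = PAGE_READONLY,
	[VM_WRITE]                       = PAGE_COPY,
	[VM_WRITE | VM_READ]             = PAGE_COPY,
	[VM_SHARED]                      = PAGE_NONE,
	[VM_SHARED | VM_READ]            = PAGE_READONLY,
	[VM_SHARED | VM_WRITE]           = PAGE_SHARED,
	[VM_SHARED | VM_WRITE | VM_READ] = PAGE_SHARED,
	/* exec combinations elided; they follow the same pattern */
};

static pgprot_t vm_get_page_prot(unsigned long vm_flags)
{
	return protection_map[vm_flags &
			      (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED)];
}
```

      The real macro expands to essentially this vm_get_page_prot()
      definition, with each platform supplying its own table values.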
      
      Link: https://lkml.kernel.org/r/20220711070600.2378316-1-anshuman.khandual@arm.com
      Link: https://lkml.kernel.org/r/20220711070600.2378316-2-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sam Ravnborg <sam@ravnborg.org>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Cc: WANG Xuerui <kernel@xen0n.name>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      84053271
    • S
      mm: introduce mf_dax_kill_procs() for fsdax case · c36e2024
      Committed by Shiyang Ruan
      This new function is a variant of mf_generic_kill_procs() that
      accepts a (file, offset) pair instead of a struct page, in order to
      support multiple files sharing a DAX mapping.  It is intended to be
      called by the filesystem as part of the memory_failure handler, after
      the filesystem has performed a reverse mapping from the storage
      address to the file and file offset.
      
      Link: https://lkml.kernel.org/r/20220603053738.1218681-6-ruansy.fnst@fujitsu.com
      Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Dan Williams <dan.j.wiliams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Ritesh Harjani <riteshh@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c36e2024