  1. 12 Jan, 2023: 1 commit
  2. 12 Dec, 2022: 4 commits
  3. 01 Dec, 2022: 14 commits
    • s390/mm: use pmd_pgtable_page() helper in __gmap_segment_gaddr() · 7e25de77
      Submitted by Anshuman Khandual
      In __gmap_segment_gaddr() the pmd-level page table page is extracted
      from the pmd pointer, duplicating what pmd_pgtable_page() already does.
      Reduce that redundancy by using pmd_pgtable_page() directly, after first
      making the helper generally available.
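
      As a rough sketch of the simplification (variable names here are
      illustrative, not the exact s390 code):

        /* Before: open-code the masking that finds the page table page */
        page = virt_to_page((void *)((unsigned long)entry &
                                     ~(PTRS_PER_PMD * sizeof(pmd_t) - 1)));

        /* After: reuse the generic helper instead */
        page = pmd_pgtable_page((pmd_t *)entry);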
      
      Link: https://lkml.kernel.org/r/20221125034502.1559986-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/thp: rename pmd_to_page() as pmd_pgtable_page() · 373dfda2
      Submitted by Anshuman Khandual
      The current pmd_to_page(), which derives the page table page containing
      the pmd address, has a very misleading name: it sounds similar to
      pmd_page(), which derives the page embedded in a given pmd entry, either
      the next-level page table page or a mapped huge page.  Rename it to
      pmd_pgtable_page() instead.
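
      For contrast, a hedged sketch of the renamed helper (essentially what
      pmd_to_page() did; the real definition may differ in detail):

        /*
         * pmd_page(pmd)         - page a pmd entry points to (next-level
         *                         table or a mapped huge page)
         * pmd_pgtable_page(pmd) - page table page containing the pmd entry
         */
        static inline struct page *pmd_pgtable_page(pmd_t *pmd)
        {
                unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

                return virt_to_page((void *)((unsigned long)pmd & mask));
        }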
      
      Link: https://lkml.kernel.org/r/20221124131641.1523772-1-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: reliable R/O long-term pinning in COW mappings · 84209e87
      Submitted by David Hildenbrand
      We already support reliable R/O pinning of anonymous memory. However,
      assume we end up pinning (R/O long-term) a pagecache page or the shared
      zeropage inside a writable private ("COW") mapping. The next write access
      will trigger a write-fault and replace the pinned page by an exclusive
      anonymous page in the process page tables to break COW: the pinned page no
      longer corresponds to the page mapped into the process' page table.
      
      Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
      COW mapping, let's properly break COW first before R/O long-term
      pinning something that's not an exclusive anon page inside a COW
      mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive anon page
      instead that can get pinned safely.
      
      With this change, we can stop using FOLL_FORCE|FOLL_WRITE for reliable
      R/O long-term pinning in COW mappings.
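
      As a hedged illustration of the caller-visible effect
      (pin_user_pages_fast() and the FOLL_* flags are the existing GUP
      interfaces; "addr" and the error handling are placeholders):

        struct page *page;
        int ret;

        /* Old workaround: force a writable pin to get a reliable page */
        ret = pin_user_pages_fast(addr, 1,
                                  FOLL_LONGTERM | FOLL_FORCE | FOLL_WRITE, &page);

        /* Now sufficient: GUP breaks COW via FAULT_FLAG_UNSHARE before pinning */
        ret = pin_user_pages_fast(addr, 1, FOLL_LONGTERM, &page);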
      
      With this change, the new R/O long-term pinning tests for non-anonymous
      memory succeed:
        # [RUN] R/O longterm GUP pin ... with shared zeropage
        ok 151 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd
        ok 152 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with tmpfile
        ok 153 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with huge zeropage
        ok 154 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd hugetlb (2048 kB)
        ok 155 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP pin ... with memfd hugetlb (1048576 kB)
        ok 156 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with shared zeropage
        ok 157 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd
        ok 158 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with tmpfile
        ok 159 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with huge zeropage
        ok 160 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (2048 kB)
        ok 161 Longterm R/O pin is reliable
        # [RUN] R/O longterm GUP-fast pin ... with memfd hugetlb (1048576 kB)
        ok 162 Longterm R/O pin is reliable
      
      Note 1: We don't care about short-term R/O pins, because they have
      snapshot semantics: they are not supposed to observe modifications that
      happen after pinning.
      
      As one example, assume we start direct I/O to read from a page and store
      page content into a file: modifications to page content after starting
      direct I/O are not guaranteed to end up in the file. So even if we'd pin
      the shared zeropage, the end result would be as expected -- getting zeroes
      stored to the file.
      
      Note 2: For shared mappings we'll now always fall back to the slow path to
      look up the VMA when R/O long-term pinning. While that's the necessary price
      we have to pay right now, it's actually not that bad in practice: most
      FOLL_LONGTERM users already specify FOLL_WRITE, for example, along with
      FOLL_FORCE because they tried dealing with COW mappings correctly ...
      
      Note 3: For users that use FOLL_LONGTERM right now without FOLL_WRITE,
      such as VFIO, we'd now no longer pin the shared zeropage. Instead, we'd
      populate exclusive anon pages that we can pin. There was a concern that
      this could affect the memlock limit of existing setups.
      
      For example, a VM running with VFIO could run into the memlock limit and
      fail to run. However, we essentially had the same behavior already in
      commit 17839856 ("gup: document and work around "COW can break either
      way" issue") which got merged into some enterprise distros, and there were
      not any such complaints. So most probably, we're fine.
      
      Link: https://lkml.kernel.org/r/20221116102659.70287-10-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: anonymous shared memory naming · d09e8ca6
      Submitted by Pasha Tatashin
      Since commit 9a10064f ("mm: add a field to store names for private
      anonymous memory"), a name can be set for private anonymous memory, but
      not for shared anonymous memory.  However, naming shared anonymous memory
      is just as useful for tracking purposes.
      
      Extend the functionality to be able to set names for shared anon.
      
      There are two ways to create anonymous shared memory, using memfd or
      directly via mmap():
      1. fd = memfd_create(...)
         mem = mmap(..., MAP_SHARED, fd, ...)
      2. mem = mmap(..., MAP_SHARED | MAP_ANONYMOUS, -1, ...)
      
      In both cases the anonymous shared memory is created the same way by
      mapping an unlinked file on tmpfs.
      
      The memfd way allows giving a name to anonymous shared memory, but it is
      not useful when parts of the shared memory need distinct names.
      
      Example use case: The VMM maps VM memory as anonymous shared memory (not
      private because VMM is sandboxed and drivers are running in their own
      processes).  However, the VM tells back to the VMM how parts of the memory
      are actually used by the guest, how each of the segments should be backed
      (i.e.  4K pages, 2M pages), and some other information about the segments.
      The naming allows us to monitor the effective memory footprint for each
      of these segments from the host without looking inside the guest.
      
      Example:
        /* Create a shared anonymous segment */
        anon_shmem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                          MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        /* Name the segment: "MY-NAME" */
        rv = prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME,
                   anon_shmem, SIZE, "MY-NAME");
      
      cat /proc/<pid>/maps (and smaps):
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 [anon_shmem:MY-NAME]
      
      If the segment is not named, the output is:
      7fc8e2b4c000-7fc8f2b4c000 rw-s 00000000 00:01 1024 /dev/zero (deleted)
      
      Link: https://lkml.kernel.org/r/20221115020602.804224-1-pasha.tatashin@soleen.com
      Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Colin Cross <ccross@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sean Christopherson <seanjc@google.com>
      Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: xu xin <cgel.zte@gmail.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: teach release_pages() to take an array of encoded page pointers too · 449c7967
      Submitted by Linus Torvalds
      release_pages() already could take either an array of page pointers, or an
      array of folio pointers.  Expand it to also accept an array of encoded
      page pointers, which is what both the existing mlock() use and the
      upcoming mmu_gather use of encoded page pointers wants.
      
      Note that release_pages() won't actually use, or react to, any extra
      encoded bits.  Instead, this is very much a case of "I have walked the
      array of encoded pages and done everything the extra bits tell me to do,
      now release it all".
      
      Also, while the "either page or folio pointers" dual use was handled with
      a cast of the pointer in "release_folios()", this takes a slightly
      different approach and uses the "transparent union" attribute to describe
      the set of arguments to the function:
      
        https://gcc.gnu.org/onlinedocs/gcc/Common-Type-Attributes.html
      
      which has been supported by gcc forever, but which the kernel hasn't used
      before.
      
      That allows us to avoid using various wrappers with casts, and just use
      the same function regardless of use.
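
      The argument type is roughly a union of this shape (sketch based on the
      description above):

        typedef union {
                struct page **pages;
                struct folio **folios;
                struct encoded_page **encoded_pages;
        } release_pages_arg __attribute__ ((__transparent_union__));

        void release_pages(release_pages_arg arg, int nr);

      A caller can then pass any of the three pointer types directly, and the
      compiler treats the call as if the matching union member had been passed.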
      
      Link: https://lkml.kernel.org/r/20221109203051.1835763-2-torvalds@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/autonuma: use can_change_(pte|pmd)_writable() to replace savedwrite · 6a56ccbc
      Submitted by David Hildenbrand
      commit b191f9b1 ("mm: numa: preserve PTE write permissions across a
      NUMA hinting fault") added remembering write permissions using ordinary
      pte_write() for PROT_NONE mapped pages to avoid write faults when
      remapping the page !PROT_NONE on NUMA hinting faults.
      
      That commit noted:
      
          The patch looks hacky but the alternatives looked worse. The tidest was
          to rewalk the page tables after a hinting fault but it was more complex
          than this approach and the performance was worse. It's not generally
          safe to just mark the page writable during the fault if it's a write
          fault as it may have been read-only for COW so that approach was
          discarded.
      
      Later, commit 288bc549 ("mm/autonuma: let architecture override how
      the write bit should be stashed in a protnone pte.") introduced a family
      of savedwrite PTE functions that didn't necessarily improve the whole
      situation.
      
      One confusing thing is that nowadays, if a pte is pte_protnone() and
      pte_savedwrite(), then pte_write() is also true. Another source of
      confusion is that there is only a single pte_mk_savedwrite() call in the
      kernel. All other write-protection code seems to silently rely on
      pte_wrprotect().
      
      Ever since PageAnonExclusive was introduced and we started using it in
      mprotect context via commit 64fe24a3 ("mm/mprotect: try avoiding write
      faults for exclusive anonymous pages when changing protection"), we do
      have machinery in place to avoid write faults when changing protection,
      which is exactly what we want to do here.
      
      Let's similarly do what ordinary mprotect() does nowadays when upgrading
      write permissions and reuse can_change_pte_writable() and
      can_change_pmd_writable() to detect if we can upgrade PTE permissions to be
      writable.
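
      A hedged sketch of the resulting NUMA-hinting fault logic (locals
      simplified; the real code in do_numa_page()/do_huge_pmd_numa_page() may
      differ in detail):

        /* Restore the original protection first ... */
        pte = pte_modify(old_pte, vma->vm_page_prot);
        pte = pte_mkyoung(pte);

        /* ... then upgrade to writable only if mprotect() could do so too */
        if (!pte_write(pte) && vma_wants_manual_pte_write_upgrade(vma) &&
            can_change_pte_writable(vma, addr, pte))
                pte = pte_mkwrite(pte);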
      
      For anonymous pages there should be absolutely no change: if an
      anonymous page is not exclusive, it could not have been mapped writable --
      because only exclusive anonymous pages can be mapped writable.
      
      However, there *might* be a change for writable shared mappings that
      require writenotify: if they are not dirty, we cannot map them writable.
      While it might not matter in practice, we'd need a different way to
      identify whether writenotify is actually required -- and ordinary mprotect
      would benefit from that as well.
      
      Note that we don't optimize for the actual migration case:
      (1) When migration succeeds the new PTE will not be writable because the
          source PTE was not writable (protnone); in the future we
          might just optimize that case similarly by reusing
          can_change_pte_writable()/can_change_pmd_writable() when removing
          migration PTEs.
      (2) When migration fails, we'd have to recalculate the "writable" flag
          because we temporarily dropped the PT lock; for now keep it simple and
          set "writable=false".
      
      We'll remove all savedwrite leftovers next.
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/mprotect: factor out check whether manual PTE write upgrades are required · eb309ec8
      Submitted by David Hildenbrand
      Let's factor the check out into vma_wants_manual_pte_write_upgrade(), to be
      reused in NUMA hinting fault context soon.
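
      A hedged sketch of the factored-out helper (the exact body may differ):

        static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma)
        {
                /*
                 * Individual PTEs may need a manual writable upgrade when the
                 * whole range cannot simply be mapped writable up front:
                 * shared mappings when write-notify is in play, private
                 * writable mappings always (COW is handled per page).
                 */
                if (vma->vm_flags & VM_SHARED)
                        return vma_wants_writenotify(vma, vma->vm_page_prot);
                return !!(vma->vm_flags & VM_WRITE);
        }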
      
      Link: https://lkml.kernel.org/r/20221108174652.198904-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped · 4b51634c
      Submitted by Hugh Dickins
      Can the lock_compound_mapcount() bit_spin_lock apparatus be removed now? 
      Yes.  Not by atomic64_t or cmpxchg games, those get difficult on 32-bit;
      but if we slightly abuse subpages_mapcount by additionally demanding that
      one bit be set there when the compound page is PMD-mapped, then a cascade
      of two atomic ops is able to maintain the stats without bit_spin_lock.
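
      A hedged, conceptual sketch of the scheme (constants and exact accounting
      are assumed from this changelog, not copied from the patch;
      "compound_mapcount" and "subpages_mapcount" stand for pointers to the
      corresponding atomic_t fields):

        #define COMPOUND_MAPPED  0x800000              /* stolen bit: "PMD-mapped" */
        #define SUBPAGES_MAPPED  (COMPOUND_MAPPED - 1) /* nr of PTE-mapped subpages */

        /* First PMD mapping: flag subpages_mapcount, account the whole THP
         * minus whatever racing PTE maps have already accounted. */
        if (atomic_inc_and_test(compound_mapcount)) {
                val = atomic_add_return_relaxed(COMPOUND_MAPPED, subpages_mapcount);
                nr_accounted = thp_nr_pages(page) - (val & SUBPAGES_MAPPED);
        }

        /* First PTE mapping of a subpage: a cascade of two atomic ops, no
         * bit_spin_lock; account it only if not already PMD-accounted. */
        if (atomic_inc_and_test(&subpage->_mapcount)) {
                val = atomic_inc_return_relaxed(subpages_mapcount);
                nr_accounted = (val < COMPOUND_MAPPED) ? 1 : 0;
        }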
      
      This is harder to reason about than when bit_spin_locked, but I believe
      safe; and no drift in stats detected when testing.  When there are racing
      removes and adds, of course the sequence of operations is less well-
      defined; but each operation on subpages_mapcount is atomically good.  What
      might be disastrous, is if subpages_mapcount could ever fleetingly appear
      negative: but the pte lock (or pmd lock) these rmap functions are called
      under, ensures that a last remove cannot race ahead of a first add.
      
      Continue to make an exception for hugetlb (PageHuge) pages, though that
      exception can be easily removed by a further commit if necessary: leave
      subpages_mapcount 0, don't bother with COMPOUND_MAPPED in its case, just
      carry on checking compound_mapcount too in folio_mapped(), page_mapped().
      
      Evidence is that this way goes slightly faster than the previous
      implementation in all cases (pmds after ptes now taking around 103ms); and
      relieves us of worrying about contention on the bit_spin_lock.
      
      Link: https://lkml.kernel.org/r/3978f3ca-5473-55a7-4e14-efea5968d892@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: subpages_mapcount of PTE-mapped subpages · be5ef2d9
      Submitted by Hugh Dickins
      Patch series "mm,thp,rmap: rework the use of subpages_mapcount", v2.
      
      
      This patch (of 3):
      
      Following suggestion from Linus, instead of counting every PTE map of a
      compound page in subpages_mapcount, just count how many of its subpages
      are PTE-mapped: this yields the exact number needed for NR_ANON_MAPPED and
      NR_FILE_MAPPED stats, without any need for a locked scan of subpages; and
      requires updating the count less often.
      
      This does then revert total_mapcount() and folio_mapcount() to needing a
      scan of subpages; but they are inherently racy, and need no locking, so
      Linus is right that the scans are much better done there.  Plus (unlike in
      6.1 and previous) subpages_mapcount lets us avoid the scan in the common
      case of no PTE maps.  And page_mapped() and folio_mapped() remain scanless
      and just as efficient with the new meaning of subpages_mapcount: those are
      the functions which I most wanted to remove the scan from.
      
      The updated page_dup_compound_rmap() is no longer suitable for use by anon
      THP's __split_huge_pmd_locked(); but page_add_anon_rmap() can be used for
      that, so long as its VM_BUG_ON_PAGE(!PageLocked) is deleted.
      
      Evidence is that this way goes slightly faster than the previous
      implementation for most cases; but significantly faster in the (now
      scanless) pmds after ptes case, which started out at 870ms and was brought
      down to 495ms by the previous series, now takes around 105ms.
      
      Link: https://lkml.kernel.org/r/a5849eca-22f1-3517-bf29-95d982242742@google.com
      Link: https://lkml.kernel.org/r/eec17e16-4e1-7c59-f1bc-5bca90dac919@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dan Carpenter <error27@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm,thp,rmap: simplify compound page mapcount handling · cb67f428
      Submitted by Hugh Dickins
      Compound page (folio) mapcount calculations have been different for anon
      and file (or shmem) THPs, and involved the obscure PageDoubleMap flag. 
      And each huge mapping and unmapping of a file (or shmem) THP involved
      atomically incrementing and decrementing the mapcount of every subpage of
      that huge page, dirtying many struct page cachelines.
      
      Add subpages_mapcount field to the struct folio and first tail page, so
      that the total of subpage mapcounts is available in one place near the
      head: then page_mapcount() and total_mapcount() and page_mapped(), and
      their folio equivalents, are so quick that anon and file and hugetlb don't
      need to be optimized differently.  Delete the unloved PageDoubleMap.
      
      page_add and page_remove rmap functions must now maintain the
      subpages_mapcount as well as the subpage _mapcount, when dealing with pte
      mappings of huge pages; and correct maintenance of NR_ANON_MAPPED and
      NR_FILE_MAPPED statistics still needs reading through the subpages, using
      nr_subpages_unmapped() - but only when first or last pmd mapping finds
      subpages_mapcount raised (double-map case, not the common case).
      
      But are those counts (used to decide when to split an anon THP, and in
      vmscan's pagecache_reclaimable heuristic) correctly maintained?  Not
      quite: since page_remove_rmap() (and also split_huge_pmd()) is often
      called without page lock, there can be races when a subpage pte mapcount
      0<->1 while compound pmd mapcount 0<->1 is scanning - races which the
      previous implementation had prevented.  The statistics might become
      inaccurate, and even drift down until they underflow through 0.  That is
      not good enough, but is better dealt with in a followup patch.
      
      Update a few comments on first and second tail page overlaid fields. 
      hugepage_add_new_anon_rmap() has to "increment" compound_mapcount, but
      subpages_mapcount and compound_pincount are already correctly at 0, so
      delete its reinitialization of compound_pincount.
      
      A simple 100 X munmap(mmap(2GB, MAP_SHARED|MAP_POPULATE, tmpfs), 2GB) took
      18 seconds on small pages, and used to take 1 second on huge pages, but
      now takes 119 milliseconds on huge pages.  Mapping by pmds a second time
      used to take 860ms and now takes 92ms; mapping by pmds after mapping by
      ptes (when the scan is needed) used to take 870ms and now takes 495ms. 
      But there might be some benchmarks which would show a slowdown, because
      tail struct pages now fall out of cache until final freeing checks them.
      
      Link: https://lkml.kernel.org/r/47ad693-717-79c8-e1ba-46c3a6602e48@google.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: James Houghton <jthoughton@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm, hwpoison: when copy-on-write hits poison, take page offline · d302c239
      Submitted by Tony Luck
      We cannot call memory_failure() directly from the fault handler because
      mmap_lock (and other locks) are held.
      
      It is important, but not urgent, to mark the source page as h/w poisoned
      and unmap it from other tasks.
      
      Use memory_failure_queue() to request a call to memory_failure() for the
      page with the error.
      
      Also provide a stub version for CONFIG_MEMORY_FAILURE=n.
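
      A hedged sketch of the fault-path side (local names are illustrative; the
      machine-check-safe copy helper comes from the companion patch in this
      series):

        /* In the COW copy path, with mmap_lock held: */
        if (copy_mc_user_highpage(dst, src, addr, vma)) {
                /* The copy hit poison in the source page.  memory_failure()
                 * cannot be called here, so queue the pfn for later handling
                 * (a no-op stub when CONFIG_MEMORY_FAILURE=n). */
                memory_failure_queue(page_to_pfn(src), 0);
                return -EHWPOISON;
        }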
      
      Link: https://lkml.kernel.org/r/20221021200120.175753-3-tony.luck@intel.com
      Signed-off-by: Tony Luck <tony.luck@intel.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Shuai Xue <xueshuai@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: convert mm's rss stats into percpu_counter · f1a79412
      Submitted by Shakeel Butt
      Currently mm_struct maintains rss_stats, which are updated on the page
      fault and unmapping codepaths.  For the page fault codepath the updates
      are cached per thread, with a batch size of TASK_RSS_EVENTS_THRESH (64).
      The caching is done for performance in multithreaded applications;
      otherwise the rss_stats updates may become a hotspot for such applications.
      
      However, this optimization comes at the cost of an error margin in the
      rss stats.  The rss_stats for applications with a large number of threads
      can be very skewed.  At worst the error margin is (nr_threads * 64), and
      we have a lot of applications with hundreds of threads, so the error
      margin can be very high.  Internally we had to reduce
      TASK_RSS_EVENTS_THRESH to 32.
      
      Recently we started seeing unbounded errors in rss_stats for specific
      applications which use TCP receive zerocopy (rx0cp).  It seems the
      vm_insert_pages() codepath does not sync rss_stats at all.
      
      This patch converts the rss_stats into percpu_counter, which changes the
      error margin from (nr_threads * 64) to approximately (nr_cpus ^ 2).  In
      addition, the conversion enables us to get accurate stats in situations
      where accuracy is more important than the cpu cost.
      
      This patch does not make such tradeoffs - we can just use
      percpu_counter_add_local() for the updates and percpu_counter_sum() (or
      percpu_counter_sync() + percpu_counter_read) for the readers.  At the
      moment the readers are the procfs interface, the oom_killer and memory
      reclaim, which I think are not performance critical and should be fine
      with a slow read.  However, I think we can make that change in a separate
      patch.
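
      A hedged sketch of what the counter accessors look like after the
      conversion (assuming mm->rss_stat becomes an array of percpu_counter):

        static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
        {
                percpu_counter_add(&mm->rss_stat[member], value);
        }

        static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
        {
                return percpu_counter_read_positive(&mm->rss_stat[member]);
        }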
      
      Link: https://lkml.kernel.org/r/20221024052841.3291983-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing · 04ada095
      Submitted by Mike Kravetz
      madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear page
      tables associated with the address range.  For hugetlb vmas,
      zap_page_range will call __unmap_hugepage_range_final.  However,
      __unmap_hugepage_range_final assumes the passed vma is about to be removed
      and deletes the vma_lock to prevent pmd sharing as the vma is on the way
      out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
      missing vma_lock prevents pmd sharing and could potentially lead to issues
      with truncation/fault races.
      
      This issue was originally reported here [1] as a BUG triggered in
      page_try_dup_anon_rmap.  Prior to the introduction of the hugetlb
      vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
      prevent pmd sharing.  Subsequent faults on this vma were confused as
      VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping was
      not set in new pages added to the page table.  This resulted in pages that
      appeared anonymous in a VM_SHARED vma and triggered the BUG.
      
      Address issue by adding a new zap flag ZAP_FLAG_UNMAP to indicate an unmap
      call from unmap_vmas().  This is used to indicate the 'final' unmapping of
      a hugetlb vma.  When called via MADV_DONTNEED, this flag is not set and
      the vma_lock is not deleted.
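
      A hedged sketch of the resulting decision in the final-unmap path (helper
      names assumed from the hugetlb vma_lock code; details simplified):

        if (zap_flags & ZAP_FLAG_UNMAP) {
                /* unmap_vmas(): the vma is going away, so free the vma_lock
                 * to block pmd sharing while the vma is torn down */
                __hugetlb_vma_unlock_write_free(vma);
        } else {
                /* MADV_DONTNEED: the vma survives, keep its vma_lock */
                hugetlb_vma_unlock_write(vma);
        }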
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-3-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • madvise: use zap_page_range_single for madvise dontneed · 21b85b09
      Submitted by Mike Kravetz
      This series addresses the issue first reported in [1], and fully described
      in patch 2.  Patches 1 and 2 address the user visible issue and are tagged
      for stable backports.
      
      While exploring solutions to this issue, related problems with mmu
      notification calls were discovered.  This is addressed in the patch
      "hugetlb: remove duplicate mmu notifications:".  Since there are no user
      visible effects, this third patch is not tagged for stable backports.
      
      Previous discussions suggested further cleanup by removing the
      routine zap_page_range.  This is possible because zap_page_range_single
      is now exported, and all callers of zap_page_range pass ranges entirely
      within a single vma.  This work will be done in a later patch so as not
      to distract from this bug fix.
      
      [1] https://lore.kernel.org/lkml/CAO4mrfdLMXsao9RF4fUE8-Wfde8xmjsKrTNMNC9wjUb6JudD0g@mail.gmail.com/
      
      
      This patch (of 2):
      
      Expose the routine zap_page_range_single to zap a range within a single
      vma.  The madvise routine madvise_dontneed_single_vma can use this routine
      as it explicitly operates on a single vma.  Also, update the mmu
      notification range in zap_page_range_single to take hugetlb pmd sharing
      into account.  This is required as MADV_DONTNEED supports hugetlb vmas.
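
      A hedged sketch of the madvise side after this change (argument checking
      trimmed):

        static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
                                                unsigned long start, unsigned long end)
        {
                /* A single vma, so the single-vma zap interface suffices. */
                zap_page_range_single(vma, start, end - start, NULL);
                return 0;
        }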
      
      Link: https://lkml.kernel.org/r/20221114235507.294320-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20221114235507.294320-2-mike.kravetz@oracle.com
      Fixes: 90e7e7f5 ("mm: enable MADV_DONTNEED for hugetlb mappings")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Wei Chen <harperchen1110@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 10 Nov, 2022: 3 commits
  5. 09 Nov, 2022: 6 commits
  6. 04 Oct, 2022: 2 commits
  7. 27 Sep, 2022: 10 commits