1. 27 Sep 2022, 2 commits
  2. 12 Sep 2022, 4 commits
    • mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      Authored by David Hildenbrand
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that, when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, the PTE is
      cleared/invalidated and the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine; however, we have to enforce a certain memory ordering
      and are currently missing the required memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, in which GUP-fast takes a R/O pin while KSM wants to
      convert an exclusive anonymous page into a KSM page that is already
      mapped write-protected (-> no PTE change), would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
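
      For illustration, a minimal C sketch of the two sides follows.  It mirrors
      the (A*)/(B*) steps of the diagram above; the helper names and exact call
      sites are hypothetical and simplified, this is not the actual patch.

        /* Thread 0 (e.g. KSM-style temporary unmapping), sketch only. */
        static bool try_unshare_anon_page(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep,
                                          struct page *page)
        {
                pte_t orig = ptep_get_and_clear(vma->vm_mm, addr, ptep); /* (A1) */

                smp_mb();       /* order the PTE clear against reading the pin count */
                if (page_maybe_dma_pinned(page)) {                       /* (A2) */
                        set_pte_at(vma->vm_mm, addr, ptep, orig);
                        return false;   /* concurrent pin detected: back off */
                }

                ClearPageAnonExclusive(page);                            /* (A3) */
                smp_wmb();      /* flag clear must be visible before the PTE is restored */
                set_pte_at(vma->vm_mm, addr, ptep, orig);                /* (A4) */
                return true;
        }

        /* Thread 1 (RCU GUP-fast), called after the page was pinned (B3). */
        static bool gup_fast_pin_still_valid(pte_t *ptep, pte_t orig_pte,
                                             struct page *page)
        {
                smp_mb();       /* pairs with the smp_mb() after (A1) */
                if (!pte_same(ptep_get(ptep), orig_pte))                 /* (B4) */
                        return false;
                smp_rmb();      /* pairs with the smp_wmb() before (A4) */
                return PageAnonExclusive(page);                          /* (B5) */
        }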
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup.c: refactor check_and_migrate_movable_pages() · 67e139b0
      Authored by Alistair Popple
      When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
      called to migrate pages out of zones which should not contain any longterm
      pinned pages.
      
      When migration succeeds all pages will have been unpinned so pinning needs
      to be retried.  Migration can also fail, in which case the pages will also
      have been unpinned but the operation should not be retried.  If all pages
      are in the correct zone nothing will be unpinned and no retry is required.
      
      The logic in check_and_migrate_movable_pages() tracks unnecessary state
      and the return codes for each case are difficult to follow.  Refactor the
      code to clean this up.  No behaviour change is intended.
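
      The refactored shape can be sketched roughly as below; the two helper
      names are hypothetical stand-ins for the split the patch performs, and
      details such as flags and error handling are omitted.

        static long check_and_migrate_movable_pages(unsigned long nr_pages,
                                                    struct page **pages)
        {
                unsigned long collected;
                LIST_HEAD(movable_page_list);

                /* Step 1: collect the pages that must not stay long-term pinned. */
                collected = collect_movable_pages(&movable_page_list,    /* hypothetical */
                                                  nr_pages, pages);
                if (!collected)
                        return 0;       /* nothing to migrate: keep the pins */

                /* Step 2: unpin everything and migrate the collected pages. */
                return migrate_movable_pages(&movable_page_list,         /* hypothetical */
                                             nr_pages, pages);
        }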
      
      [akpm@linux-foundation.org: fix unused var warning]
      Link: https://lkml.kernel.org/r/19583d1df07fdcb99cfa05c265588a3fa58d1902.1661317396.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shigeru Yoshida <syoshida@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup.c: don't pass gup_flags to check_and_migrate_movable_pages() · f6d299ec
      Authored by Alistair Popple
      gup_flags is passed to check_and_migrate_movable_pages() so that it can
      call either put_page() or unpin_user_page() to drop the page reference. 
      However, check_and_migrate_movable_pages() is only called for
      FOLL_LONGTERM, which implies FOLL_PIN, so there is no need to pass
      gup_flags.
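
      In other words (sketch only, not the literal diff), the flag-dependent
      cleanup can collapse into the unconditional pinned-page variant:

        /* Before: the caller's gup_flags decided how to drop the reference. */
        if (gup_flags & FOLL_PIN)
                unpin_user_page(pages[i]);
        else
                put_page(pages[i]);

        /* After: FOLL_LONGTERM implies FOLL_PIN, so only one case is possible. */
        unpin_user_page(pages[i]);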
      
      Link: https://lkml.kernel.org/r/d611c65a9008ff55887307df457c6c2220ad6163.1661317396.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shigeru Yoshida <syoshida@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup.c: simplify and fix check_and_migrate_movable_pages() return codes · 24a95998
      Authored by Alistair Popple
      When pinning pages with FOLL_LONGTERM check_and_migrate_movable_pages() is
      called to migrate pages out of zones which should not contain any longterm
      pinned pages.
      
      When migration succeeds all pages will have been unpinned so pinning needs
      to be retried.  This is indicated by returning zero.  When all pages are
      in the correct zone the number of pinned pages is returned.
      
      However, migration can also fail, in which case the pages are unpinned and
      -ENOMEM is returned.  But if the failure was due to not being able to
      isolate a page, zero is returned, which leads to indefinite looping in
      __gup_longterm_locked().
      
      Fix this by simplifying the return codes: zero indicates that all pages
      were successfully pinned in the correct zone, while an error indicates
      either that the pages were migrated and pinning should be retried, or
      that migration has failed and the pinning operation should therefore fail.
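
      A sketch of how a caller can drive the new convention is shown below;
      pin_pages_once() is a hypothetical stand-in for the actual pinning step
      and the retry errno is illustrative.

        static long longterm_pin_sketch(struct page **pages, unsigned long nr)
        {
                long pinned, rc;

                do {
                        pinned = pin_pages_once(pages, nr);     /* hypothetical helper */
                        if (pinned <= 0)
                                return pinned;
                        /*
                         * 0          -> every page is long-term pinnable: done.
                         * -EAGAIN    -> pages were unpinned and migrated: retry.
                         * other < 0  -> pages were unpinned, migration failed: fail.
                         */
                        rc = check_and_migrate_movable_pages(pinned, pages);
                } while (rc == -EAGAIN);

                return rc ? rc : pinned;
        }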
      
      [syoshida@redhat.com: fix return value for __gup_longterm_locked()]
        Link: https://lkml.kernel.org/r/20220821183547.950370-1-syoshida@redhat.com
      [akpm@linux-foundation.org: fix code layout, per John]
      [yshigeru@gmail.com: fix uninitialized return value on __gup_longterm_locked()]
        Link: https://lkml.kernel.org/r/20220827230037.78876-1-syoshida@redhat.com
      Link: https://lkml.kernel.org/r/20220729024645.764366-1-apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  3. 21 Aug 2022, 1 commit
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      Authored by David Hildenbrand
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
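
      The resulting decision can be sketched as below.  This is a simplified
      illustration of the logic described above, not the exact hunk from the
      patch:

        /* May FOLL_FORCE "write" through a currently non-writable PTE? (sketch) */
        static bool can_follow_write_pte_sketch(pte_t pte, struct page *page,
                                                struct vm_area_struct *vma,
                                                unsigned int flags)
        {
                if (pte_write(pte))
                        return true;
                if (!(flags & FOLL_FORCE))
                        return false;

                /* FOLL_FORCE only applies to COW mappings ... */
                if (!is_cow_mapping(vma->vm_flags))
                        return false;
                /* ... and only once COW was broken: an exclusive anonymous page. */
                if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                        return false;
                /* Never bypass soft-dirty tracking or uffd-wp protection. */
                if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                        return false;
                if (userfaultfd_pte_wp(vma, pte))
                        return false;
                return true;
        }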
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  4. 30 Jul 2022, 1 commit
  5. 19 Jul 2022, 1 commit
  6. 18 Jul 2022, 3 commits
    • mm: gup: pass a pointer to virt_to_page() · 396a400b
      Authored by Linus Walleij
      Functions that work on a pointer to virtual memory such as virt_to_pfn()
      and users of that function such as virt_to_page() are supposed to pass a
      pointer to virtual memory, ideally a (void *) or other pointer.  However
      since many architectures implement virt_to_pfn() as a macro, this function
      becomes polymorphic and accepts both a (unsigned long) and a (void *).
      
      If we instead implement a proper virt_to_pfn(void *addr) function the
      following happens (occurred on arch/arm):
      
        mm/gup.c: In function '__get_user_pages_locked':
        mm/gup.c:1599:49: warning: passing argument 1 of 'virt_to_pfn'
          makes pointer from integer without a cast [-Wint-conversion]
          pages[i] = virt_to_page(start);
      
      Fix this with an explicit cast.
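
      Concretely, the shape of the fix at the reported call site is a one-line
      cast (sketch):

        pages[i] = virt_to_page((void *)start);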
      
      Link: https://lkml.kernel.org/r/20220630084124.691207-5-linus.walleij@linaro.org
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: migrate device coherent pages when pinning instead of failing · b05a79d4
      Authored by Alistair Popple
      Currently any attempts to pin a device coherent page will fail.  This is
      because device coherent pages need to be managed by a device driver, and
      pinning them would prevent a driver from migrating them off the device.
      
      However, this is no reason to fail pinning of these pages.  They are
      coherent and accessible from the CPU, so they can be migrated just as
      ZONE_MOVABLE pages are when they get pinned.  So instead of failing all
      attempts to pin them, first try migrating them out of ZONE_DEVICE.
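
      A rough sketch of the idea follows; both helpers named below are
      hypothetical stand-ins, not the real symbol names:

        static int maybe_migrate_device_coherent(struct page *page)
        {
                if (!page_is_device_coherent(page))     /* hypothetical predicate */
                        return 0;                       /* nothing to do */

                /* A pinned page cannot be migrated, so drop the pin first. */
                unpin_user_page(page);

                /* Move the contents into ordinary system RAM. */
                if (migrate_device_coherent_page(page)) /* hypothetical */
                        return -EBUSY;                  /* migration failed: fail the pin */

                return -EAGAIN;                         /* migrated: retry the pin */
        }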
      
      [hch@lst.de: rebased to the split device memory checks, moved migrate_device_page to migrate_device.c]
      Link: https://lkml.kernel.org/r/20220715150521.18165-7-alex.sierra@amd.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm: rename is_pinnable_page() to is_longterm_pinnable_page() · 6077c943
      Authored by Alex Sierra
      Patch series "Add MEMORY_DEVICE_COHERENT for coherent device memory
      mapping", v9.
      
      This patch series introduces MEMORY_DEVICE_COHERENT, a type of memory
      owned by a device that can be mapped into CPU page tables like
      MEMORY_DEVICE_GENERIC and can also be migrated like MEMORY_DEVICE_PRIVATE.
      
      This patch series is mostly self-contained except for a few places where
      it needs to update other subsystems to handle the new memory type.
      
      System stability and performance are not affected according to our ongoing
      testing, including xfstests.
      
      How it works: The system BIOS advertises the GPU device memory (aka VRAM)
      as SPM (special purpose memory) in the UEFI system address map.
      
      The amdgpu driver registers the memory with devmap as
      MEMORY_DEVICE_COHERENT using devm_memremap_pages.  The initial user for
      this hardware page migration capability is the Frontier supercomputer
      project.  This functionality is not AMD-specific.  We expect other GPU
      vendors to find this functionality useful, and possibly other hardware
      types in the future.
      
      Our test nodes in the lab are similar to the Frontier configuration, with
      0.5 TB of system memory plus 256 GB of device memory split across 4 GPUs,
      all in a single coherent address space.  Page migration is expected to
      improve application efficiency significantly.  We will report empirical
      results as they become available.
      
      Coherent device type pages at gup are now migrated back to system memory
      if they are being pinned long-term (FOLL_LONGTERM).  The reason is that
      long-term pinning would interfere with the device memory manager owning
      the device-coherent pages (e.g.  evictions in TTM).  This series
      incorporates Alistair Popple's patches to do this migration from
      pin_user_pages() calls.  hmm_gup_test has been added to hmm-test to test
      the different get_user_pages calls.
      
      This series includes handling of device-managed anonymous pages returned
      by vm_normal_pages.  Although they behave like normal pages for purposes
      of mapping in CPU page tables and for COW, they do not support LRU lists,
      NUMA migration or THP.
      
      We also introduced a FOLL_LRU flag that adds the same behaviour to
      follow_page and related APIs, to allow callers to specify that they expect
      to put pages on an LRU list.
      
      
      This patch (of 14):
      
      is_pinnable_page() and folio_is_pinnable() are renamed to
      is_longterm_pinnable_page() and folio_is_longterm_pinnable() respectively.
      These functions are used in the FOLL_LONGTERM flag context.
      
      Link: https://lkml.kernel.org/r/20220715150521.18165-1-alex.sierra@amd.com
      Link: https://lkml.kernel.org/r/20220715150521.18165-2-alex.sierra@amd.com
      Signed-off-by: Alex Sierra <alex.sierra@amd.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  7. 04 Jul 2022, 1 commit
    • mm/migration: return errno when isolate_huge_page failed · 7ce82f4c
      Authored by Miaohe Lin
      We might fail to isolate a huge page due to, e.g., the page being under
      migration, which cleared HPageMigratable.  We should return an errno in
      this case rather than always returning 1, which could confuse the user:
      the caller might think all of the memory was migrated while the hugetlb
      page is in fact left behind.  We make the prototype of isolate_huge_page
      consistent with isolate_lru_page, as suggested by Huang Ying, and rename
      isolate_huge_page to isolate_hugetlb, as suggested by Muchun, to improve
      readability.
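
      Assuming the new isolate_hugetlb() follows the same convention as
      isolate_lru_page() (0 on success, a negative errno such as -EBUSY on
      failure), a call-site sketch looks like this (simplified, illustrative
      helper name):

        static int isolate_page_sketch(struct page *page, struct list_head *pagelist)
        {
                int err;

                if (PageHuge(page)) {
                        /* New convention: 0 on success, negative errno on failure. */
                        err = isolate_hugetlb(page, pagelist);
                } else {
                        err = isolate_lru_page(page);
                        if (!err)
                                list_add_tail(&page->lru, pagelist);
                }
                return err;     /* callers propagate the errno instead of "1" */
        }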
      
      Link: https://lkml.kernel.org/r/20220530113016.16663-4-linmiaohe@huawei.com
      Fixes: e8db67eb ("mm: migrate: move_pages() supports thp migration")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Suggested-by: Huang Ying <ying.huang@intel.com>
      Reported-by: kernel test robot <lkp@intel.com> (build error)
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  8. 17 Jun 2022, 1 commit
    • mm: avoid unnecessary page fault retires on shared memory types · d9272525
      Authored by Peter Xu
      I observed that for shared file-backed page faults, we're very likely to
      retry one more time for the 1st write fault when no page is present yet.
      That is because we need to release the mmap lock for dirty rate limiting
      purposes via balance_dirty_pages_ratelimited() (in fault_dirty_shared_page()).
      
      Then after that throttling we return VM_FAULT_RETRY.
      
      We did that probably because VM_FAULT_RETRY is the only way we can return
      to the fault handler at that time telling it we've released the mmap lock.
      
      However that's not ideal, because it's very likely the fault does not need
      to be retried at all, since the pgtable was already properly installed
      before the throttling, so the subsequent fault (including taking the mmap
      read lock, walking the pgtable, etc.) is in most cases unnecessary.
      
      This not only slows down page faults for shared file-backed mappings, but
      also adds mmap lock contention which is in most cases not needed at all.
      
      To observe this, one could try to write to some shmem page and look at the
      "pgfault" value in /proc/vmstat: we should then expect 2 counts for each
      shmem write simply because we retried, and the "pgfault" vm event captures
      that.
      
      To make it more efficient, add a new VM_FAULT_COMPLETED return code just to
      show that we've completed the whole fault and released the lock.  It's also
      a hint that we should very possibly not need another fault immediately on
      this page because we've just completed it.
      
      This patch provides a ~12% perf boost on my aarch64 test VM with a simple
      program sequentially dirtying a 400MB mmap()ed shmem file; these are the
      times it needs:
      
        Before: 650.980 ms (+-1.94%)
        After:  569.396 ms (+-1.38%)
      
      I believe it could help more than that.
      
      We need some special care on GUP and the s390 pgfault handler (for gmap
      code before returning from pgfault); the rest of the changes in the page
      fault handlers should be relatively straightforward.
      
      Another thing to mention is that mm_account_fault() does take this new
      fault as a generic fault to be accounted, unlike VM_FAULT_RETRY.
      
      I explicitly didn't touch hmm_vma_fault() and break_ksm() because they do
      not handle VM_FAULT_RETRY even with existing code, so I'm literally keeping
      them as-is.
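
      The per-architecture change then has roughly the following shape (a
      heavily simplified sketch of a generic fault handler, not any specific
      architecture's code):

        static void do_page_fault_sketch(struct pt_regs *regs, struct mm_struct *mm,
                                         struct vm_area_struct *vma,
                                         unsigned long addr, unsigned int flags)
        {
                vm_fault_t fault;

        retry:
                fault = handle_mm_fault(vma, addr, flags, regs);

                /* The fault was fully handled and the mmap lock is already
                 * released: do not unlock again and do not retry. */
                if (fault & VM_FAULT_COMPLETED)
                        return;

                if (fault & VM_FAULT_RETRY) {
                        flags |= FAULT_FLAG_TRIED;
                        mmap_read_lock(mm);     /* re-take the lock (vma re-lookup omitted) */
                        goto retry;
                }

                mmap_read_unlock(mm);
        }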
      
      Link: https://lkml.kernel.org/r/20220530183450.42886-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vineet Gupta <vgupta@kernel.org>
      Acked-by: Guo Ren <guoren@kernel.org>
      Acked-by: Max Filippov <jcmvbkbc@gmail.com>
      Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	[arm part]
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Johannes Berg <johannes@sipsolutions.net>
      Cc: Brian Cain <bcain@quicinc.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Alexander Gordeev <agordeev@linux.ibm.com>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Will Deacon <will@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Yoshinori Sato <ysato@users.osdn.me>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  9. 10 May 2022, 4 commits
    • mm/gup: fix comments to pin_user_pages_*() · 0768c8de
      Authored by Yury Norov
      pin_user_pages API forces FOLL_PIN in gup_flags, which means that the API
      requires struct page **pages to be provided (not NULL).  However, the
      comment to pin_user_pages() clearly allows for passing in a NULL @pages
      argument.
      
      Remove the incorrect comments, and add WARN_ON_ONCE(!pages) calls to
      enforce the API.
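
      A sketch of the resulting entry point; the internal call is simplified:

        long pin_user_pages(unsigned long start, unsigned long nr_pages,
                            unsigned int gup_flags, struct page **pages,
                            struct vm_area_struct **vmas)
        {
                /* FOLL_PIN is set unconditionally, so @pages must not be NULL. */
                if (WARN_ON_ONCE(!pages))
                        return -EINVAL;

                gup_flags |= FOLL_PIN;
                return __gup_longterm_locked(current->mm, start, nr_pages,
                                             pages, vmas, gup_flags);   /* simplified */
        }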
      
      It has been independently spotted by Minchan Kim and confirmed with
      John Hubbard:
      
      https://lore.kernel.org/all/YgWA0ghrrzHONehH@google.com/
      
      Link: https://lkml.kernel.org/r/20220422015839.1274328-1-yury.norov@gmail.com
      Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: sanity-check with CONFIG_DEBUG_VM that anonymous pages are exclusive when (un)pinning · b6a2619c
      Authored by David Hildenbrand
      Let's verify when (un)pinning anonymous pages that we always deal with
      exclusive anonymous pages, which guarantees that we'll have a reliable
      PIN, meaning that we cannot end up with the GUP pin being inconsistent
      with the pages mapped into the page tables due to a COW triggered by a
      write fault.
      
      When pinning pages, after conditionally triggering GUP unsharing of
      possibly shared anonymous pages, we should always only see exclusive
      anonymous pages.  Note that anonymous pages that are mapped writable must
      be marked exclusive, otherwise we'd have a BUG.
      
      When pinning during ordinary GUP, simply add a check after our conditional
      GUP-triggered unsharing checks.  As we know exactly how the page is
      mapped, we know exactly in which page we have to check for
      PageAnonExclusive().
      
      When pinning via GUP-fast we have to be careful, because we can race with
      fork(): verify only after we made sure via the seqcount that we didn't
      race with concurrent fork() that we didn't end up pinning a possibly
      shared anonymous page.
      
      Similarly, when unpinning, verify that the pages are still marked as
      exclusive: otherwise something turned the pages possibly shared, which can
      result in random memory corruptions, which we really want to catch.
      
      With only the pinned pages at hand and not the actual page table entries
      we have to be a bit careful: hugetlb pages are always mapped via a single
      logical page table entry referencing the head page and PG_anon_exclusive
      of the head page applies.  Anon THP are a bit more complicated, because we
      might have obtained the page reference either via a PMD or a PTE --
      depending on the mapping type, PageAnonExclusive of either the head page
      (PMD-mapped THP) or the tail page (PTE-mapped THP) applies: as we don't
      know which, and to make our life easier, check that either is set.
      
      Take care to not verify in case we're unpinning during GUP-fast because we
      detected concurrent fork(): we might stumble over an anonymous page that
      is now shared.
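
      The kind of check this adds can be sketched as follows (simplified: the
      real code iterates over all pages and handles hugetlb and PTE-/PMD-mapped
      THP explicitly, as described above):

        static void sanity_check_pinned_page(struct page *page)
        {
                if (!IS_ENABLED(CONFIG_DEBUG_VM) || !PageAnon(page))
                        return;

                /* A pinned anonymous page must (still) be exclusive: either the
                 * head page or the subpage has PG_anon_exclusive set. */
                VM_BUG_ON_PAGE(!PageAnonExclusive(compound_head(page)) &&
                               !PageAnonExclusive(page), page);
        }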
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-18-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: trigger FAULT_FLAG_UNSHARE when R/O-pinning a possibly shared anonymous page · a7f22660
      Authored by David Hildenbrand
      Whenever GUP currently ends up taking a R/O pin on an anonymous page that
      might be shared -- mapped R/O and !PageAnonExclusive() -- any write fault
      on the page table entry will end up replacing the mapped anonymous page
      due to COW, resulting in the GUP pin no longer being consistent with the
      page actually mapped into the page table.
      
      The possible ways to deal with this situation are:
       (1) Ignore and pin -- what we do right now.
       (2) Fail to pin -- which would be rather surprising to callers and
           could break user space.
       (3) Trigger unsharing and pin the now exclusive page -- reliable R/O
           pins.
      
      Let's implement 3) because it provides the clearest semantics and allows
      for checking in unpin_user_pages() and friends for possible BUGs: when
      trying to unpin a page that's no longer exclusive, clearly something went
      very wrong and might result in memory corruptions that might be hard to
      debug.  So we better have a nice way to spot such issues.
      
      This change implies that whenever user space *wrote* to a private mapping
      (IOW, we have an anonymous page mapped), that GUP pins will always remain
      consistent: reliable R/O GUP pins of anonymous pages.
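
      The decision GUP now has to make before taking a R/O pin can be sketched
      like this (simplified; the helper name is illustrative, not the real one):

        static bool must_unshare_before_pin(unsigned int flags, struct page *page)
        {
                if (!(flags & FOLL_PIN) || (flags & FOLL_WRITE))
                        return false;   /* only R/O pins are affected */
                if (!PageAnon(page))
                        return false;   /* only anonymous pages can be COW-shared */
                /* Possibly shared: fault with FAULT_FLAG_UNSHARE first, then retry. */
                return !PageAnonExclusive(page);
        }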
      
      As a side note, this commit fixes the COW security issue for hugetlb with
      FOLL_PIN as documented in:
        https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      The vmsplice reproducer still applies, because vmsplice uses FOLL_GET
      instead of FOLL_PIN.
      
      Note that follow_huge_pmd() doesn't apply because we cannot end up in
      there with FOLL_PIN.
      
      This commit is heavily based on prototype patches by Andrea.
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-17-david@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • mm/gup: disallow follow_page(FOLL_PIN) · 8909691b
      Authored by David Hildenbrand
      We want to change the way we handle R/O pins on anonymous pages that might
      be shared: if we detect a possibly shared anonymous page -- mapped R/O and
      !PageAnonExclusive() -- we want to trigger unsharing via a page fault,
      resulting in an exclusive anonymous page that can be pinned reliably
      without getting replaced via COW on the next write fault.
      
      However, the required page fault will be problematic for follow_page(): in
      contrast to ordinary GUP, follow_page() doesn't trigger faults internally.
      So we would have to end up failing a R/O pin via follow_page(), although
      there is something mapped R/O into the page table, which might be rather
      surprising.
      
      We don't seem to have follow_page(FOLL_PIN) users, and it's a purely
      internal MM function.  Let's just make our life easier and the semantics
      of follow_page() clearer by just disallowing FOLL_PIN for follow_page()
      completely.
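
      The guard itself is tiny; a sketch of its placement follows (the
      surrounding code is simplified):

        struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
                                 unsigned int foll_flags)
        {
                struct follow_page_context ctx = { NULL };
                struct page *page;

                /* follow_page() never faults, so it could never unshare: no FOLL_PIN. */
                if (WARN_ON_ONCE(foll_flags & FOLL_PIN))
                        return NULL;

                page = follow_page_mask(vma, address, foll_flags, &ctx);
                if (ctx.pgmap)
                        put_dev_pagemap(ctx.pgmap);
                return page;
        }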
      
      Link: https://lkml.kernel.org/r/20220428083441.37290-15-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Oded Gabbay <oded.gabbay@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  10. 25 Apr 2022, 1 commit
    • mm: Add fault_in_subpage_writeable() to probe at sub-page granularity · da32b581
      Authored by Catalin Marinas
      On hardware with features like arm64 MTE or SPARC ADI, an access fault
      can be triggered at sub-page granularity. Depending on how the
      fault_in_writeable() function is used, the caller can get into a
      live-lock by continuously retrying the fault-in on an address different
      from the one where the uaccess failed.
      
      In the majority of cases progress is ensured by the following
      conditions:
      
      1. copy_to_user_nofault() guarantees at least one byte access if the
         user address is not faulting.
      
      2. The fault_in_writeable() loop is resumed from the first address that
         could not be accessed by copy_to_user_nofault().
      
      If the loop iteration is restarted from an earlier (initial) point, the
      loop is repeated with the same conditions and it would live-lock.
      
      Introduce an arch-specific probe_subpage_writeable() and call it from
      the newly added fault_in_subpage_writeable() function. The arch code
      with sub-page faults will have to implement the specific probing
      functionality.
      
      Note that no other fault_in_subpage_*() functions are added since they
      have no callers currently susceptible to a live-lock.
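
      A sketch of the new helper, assuming (like the other fault_in_*() helpers)
      that the return value is the number of bytes not faulted in and that
      probe_subpage_writeable() reports how many of the faulted-in bytes still
      fail at sub-page granularity:

        size_t fault_in_subpage_writeable(char __user *uaddr, size_t size)
        {
                size_t faulted_in;

                /* Page-granular fault-in first, to populate the page tables. */
                faulted_in = size - fault_in_writeable(uaddr, size);

                /* Let the architecture (e.g. arm64 MTE) veto sub-page ranges. */
                if (faulted_in)
                        faulted_in -= probe_subpage_writeable(uaddr, faulted_in);

                return size - faulted_in;
        }
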
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lore.kernel.org/r/20220423100751.1870771-2-catalin.marinas@arm.com
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  11. 02 Apr 2022, 1 commit
    • mm/munlock: add lru_add_drain() to fix memcg_stat_test · ece369c7
      Authored by Hugh Dickins
      Mike reports that LTP memcg_stat_test usually leads to
      
        memcg_stat_test 3 TINFO: Test unevictable with MAP_LOCKED
        memcg_stat_test 3 TINFO: Running memcg_process --mmap-lock1 -s 135168
        memcg_stat_test 3 TINFO: Warming up pid: 3460
        memcg_stat_test 3 TINFO: Process is still here after warm up: 3460
        memcg_stat_test 3 TFAIL: unevictable is 122880, 135168 expected
      
      but may also lead to
      
        memcg_stat_test 4 TINFO: Test unevictable with mlock
        memcg_stat_test 4 TINFO: Running memcg_process --mmap-lock2 -s 135168
        memcg_stat_test 4 TINFO: Warming up pid: 4271
        memcg_stat_test 4 TINFO: Process is still here after warm up: 4271
        memcg_stat_test 4 TFAIL: unevictable is 122880, 135168 expected
      
      or both.  A wee bit flaky.
      
      follow_page_pte() used to have an lru_add_drain() per each page mlocked,
      and the test came to rely on accurate stats.  The pagevec to be drained
      is different now, but still covered by lru_add_drain(); and, never mind
      the test, I believe it's in everyone's interest that a bulk faulting
      interface like populate_vma_page_range() or faultin_vma_page_range()
      should drain its local pagevecs at the end, to save others sometimes
      needing the much more expensive lru_add_drain_all().
      
      This does not absolutely guarantee exact stats - the mlocking task can
      be migrated between CPUs as it proceeds - but it's good enough and the
      tests pass.
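
      The shape of the change is a single drain at the end of the bulk faulting
      path; the sketch below uses variables as they appear in code like
      populate_vma_page_range(), and the surrounding call is simplified:

        ret = __get_user_pages(mm, start, nr_pages, gup_flags, NULL, NULL, locked);
        lru_add_drain();        /* flush this CPU's pagevecs so the pages reach the LRU/stats */
        return ret;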
      
      Link: https://lkml.kernel.org/r/47f6d39c-a075-50cb-1cfb-26dd957a48af@google.com
      Fixes: b67bf49c ("mm/munlock: delete FOLL_MLOCK and FOLL_POPULATE")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  12. 23 Mar 2022, 4 commits
  13. 22 Mar 2022, 16 commits