1. 04 Sep, 2021: 1 commit
    • userfaultfd: change mmap_changing to atomic · a759a909
      Authored by Nadav Amit
      Patch series "userfaultfd: minor bug fixes".
      
      Three unrelated bug fixes.  The first two address possible issues (not
      purely theoretical ones), though I did not encounter them in practice.
      
      The third patch addresses a test bug that causes the test to fail on my
      system. It has been sent before as part of a bigger RFC.
      
      This patch (of 3):
      
      mmap_changing is currently a boolean variable, which is set and cleared
      without any lock that protects against concurrent modifications.
      
      mmap_changing is supposed to mark whether userfaultfd page-fault handling
      should be retried since mappings are undergoing a change.  However,
      concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
      mmap_changing to be false, although the remove event was still not read
      (hence acknowledged) by the user.
      
      Change mmap_changing to an atomic_t and increment/decrement it
      appropriately.  Add a debug assertion that catches mmap_changing going
      negative.
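
      A rough sketch of the resulting pattern (a sketch only; the struct and
      field names follow the userfaultfd context described above, and the exact
      call sites are simplified):

        struct userfaultfd_ctx {
                /* ... */
                atomic_t mmap_changing; /* > 0 while unread events are pending */
        };

        /* when a non-cooperative event (fork, madvise, mremap, ...) is queued */
        atomic_inc(&ctx->mmap_changing);

        /* when the event has been read (acknowledged) by userspace */
        atomic_dec(&ctx->mmap_changing);
        VM_WARN_ON_ONCE(atomic_read(&ctx->mmap_changing) < 0);

        /* UFFDIO_COPY and friends bail out while a change is in flight */
        if (atomic_read(&ctx->mmap_changing))
                return -EAGAIN;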
      
      Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
      Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
      Fixes: df2cc96e ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a759a909
  2. 01 Jul, 2021: 4 commits
    • userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() · 7d64ae3a
      Authored by Axel Rasmussen
      In a previous commit, we added the mfill_atomic_install_pte() helper.
      This helper does the job of setting up PTEs for an existing page, to map
      it into a given VMA.  It deals with both the anon and shmem cases, as well
      as the shared and private cases.
      
      In other words, shmem_mfill_atomic_pte() duplicates a case that
      mfill_atomic_install_pte() already handles.  So, expose the helper, and
      let shmem_mfill_atomic_pte() call it directly, to reduce code duplication.
      
      This requires that we refactor shmem_mfill_atomic_pte() a bit:
      
      Instead of doing accounting (shmem_recalc_inode() et al) part-way through
      the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
      from having to care about this accounting, and means we don't need to e.g.
      shmem_uncharge() in the error path.
      
      A side effect is this switches shmem_mfill_atomic_pte() to use
      lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
      This wrapper does some extra accounting in an exceptional case, if
      appropriate, so it's actually the more correct thing to use.
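
      A hedged sketch of the refactored flow inside shmem_mfill_atomic_pte()
      (the parameter list of mfill_atomic_install_pte() is abbreviated and may
      differ slightly from the final code):

        /* page already allocated, copied into, and added to the page cache */
        ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
                                       page, true /* newly_allocated */, false);
        if (ret)
                goto out_release;       /* no shmem_uncharge() needed any more */

        /* inode accounting now happens after the PTE is in place */
        spin_lock_irq(&info->lock);
        info->alloced++;
        inode->i_blocks += BLOCKS_PER_PAGE;
        shmem_recalc_inode(inode);
        spin_unlock_irq(&info->lock);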
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d64ae3a
    • userfaultfd/shmem: support UFFDIO_CONTINUE for shmem · 15313257
      Authored by Axel Rasmussen
      With this change, userspace can resolve a minor fault within a
      shmem-backed area with a UFFDIO_CONTINUE ioctl.  The semantics for this
      match those for hugetlbfs - we look up the existing page in the page
      cache, and install a PTE for it.
      
      This commit introduces a new helper: mfill_atomic_install_pte.
      
      Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
      shmem.c?  The existing userfault implementation only relies on shmem.c for
      VM_SHARED VMAs.  However, minor fault handling / CONTINUE work just fine
      for !VM_SHARED VMAs as well.  We'd prefer to handle CONTINUE for shmem in
      one place, regardless of shared/private (to reduce code duplication).
      
      Why add a new mfill_atomic_install_pte helper?  A problem we have with
      continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are
      *close* to what we want, but not exactly.  We do want to set up the PTEs in
      a CONTINUE operation, but we don't want to e.g.  allocate a new page,
      charge it (e.g.  to the shmem inode), manipulate various flags, etc.  Also
      we have the problem stated above: shmem_mfill_atomic_pte() and
      mcopy_atomic_pte() both handle one-half of the problem (shared / private)
      continue cares about.  So, introduce mcontinue_atomic_pte(), to handle all
      of the shmem continue cases.  Introduce the helper so it doesn't duplicate
      code with mcopy_atomic_pte().
      
      In a future commit, shmem_mfill_atomic_pte() will also be modified to use
      this new helper.  However, since this is a bigger refactor, it seems most
      clear to do it as a separate change.
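
      A rough sketch of the new CONTINUE path (mcontinue_atomic_pte() and
      mfill_atomic_install_pte() are named by this series; the exact parameter
      lists are simplified):

        /* resolve a minor fault on shmem by mapping the page that is already
         * in the page cache: no allocation, no charging */
        static int mcontinue_atomic_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
                                        struct vm_area_struct *dst_vma,
                                        unsigned long dst_addr, bool wp_copy)
        {
                struct inode *inode = file_inode(dst_vma->vm_file);
                pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
                struct page *page;
                int ret;

                ret = shmem_getpage(inode, pgoff, &page, SGP_READ);
                if (ret)
                        return ret;
                if (!page)
                        return -EFAULT; /* CONTINUE requires an existing page */

                ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
                                               page, false /* newly_allocated */,
                                               wp_copy);
                unlock_page(page);
                if (ret)
                        put_page(page); /* drop the page-cache ref on error */
                return ret;
        }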
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15313257
    • userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte · 3460f6e5
      Authored by Axel Rasmussen
      Patch series "userfaultfd: add minor fault handling for shmem", v6.
      
      Overview
      ========
      
      See the series which added minor faults for hugetlbfs [3] for a detailed
      overview of minor fault handling in general.  This series adds the same
      support for shmem-backed areas.
      
      This series is structured as follows:
      
      - Commits 1 and 2 are cleanups.
      - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
      - Commit 5 advertises that the feature is now available since at this point it's
        fully implemented.
      - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
        helper we've introduced.
      - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.
      
      Use Case
      ========
      
      In some cases it is useful to have VM memory backed by tmpfs instead of
      hugetlbfs.  So, this feature will be used to support the same VM live
      migration use case described in my original series.
      
      Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope
      to optimize the Android Runtime garbage collector using this feature:
      
      "The plan is to use userfaultfd for concurrently compacting the heap.
      With this feature, the heap can be shared-mapped at another location where
      the GC-thread(s) could continue the compaction operation without the need
      to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
      threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
      execution.  Furthermore, this feature enables updating references in the
      'non-moving' portion of the heap efficiently.  Without this feature,
      unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."
      
      [1] https://lore.kernel.org/patchwork/cover/1388144/
      [2] https://lore.kernel.org/patchwork/patch/1408161/
      [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t
      
      This patch (of 9):
      
      Previously, we did a dance where we had one calling path in userfaultfd.c
      (mfill_atomic_pte), but then we split it into two in shmem_fs.h
      (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
      shared function in shmem.c (shmem_mfill_atomic_pte).
      
      This is all a bit overly complex.  Just call the single combined shmem
      function directly, allowing us to clean up various branches, boilerplate,
      etc.
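
      For illustration, the caller in mm/userfaultfd.c ends up with roughly one
      branch per VMA type instead of two shmem entry points (a sketch; the real
      parameter list differs slightly):

        if (!(dst_vma->vm_flags & VM_SHARED)) {
                if (!zeropage)
                        err = mcopy_atomic_pte(dst_mm, dst_pmd, dst_vma,
                                               dst_addr, src_addr, page, wp_copy);
                else
                        err = mfill_zeropage_pte(dst_mm, dst_pmd, dst_vma, dst_addr);
        } else {
                /* one combined shmem helper handles both copy and zeropage */
                err = shmem_mfill_atomic_pte(dst_mm, dst_pmd, dst_vma,
                                             dst_addr, src_addr, zeropage, page);
        }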
      
      While we're touching this function, two other small cleanup changes:
      - offset is equivalent to pgoff, so we can get rid of offset entirely.
      - Split two VM_BUG_ON cases into two statements. This means the line
        number reported when the BUG is hit specifies exactly which condition
        was true.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3460f6e5
    • mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY · 8cc5fcbb
      Authored by Mina Almasry
      On UFFDIO_COPY, if we fail to copy the page contents while holding the
      hugetlb_fault_mutex, we will drop the mutex and return to the caller after
      allocating a page that consumed a reservation.  In this case there may be
      a fault that double consumes the reservation.  To handle this, we free the
      allocated page, fix the reservations, and allocate a temporary hugetlb
      page and return that to the caller.  When the caller retries the copy
      outside of the lock, we check the page cache again, allocate a page that
      consumes the reservation, and copy over the contents.
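
      A hedged sketch of the described recovery inside hugetlb_mcopy_atomic_pte()
      (the helper names and exact error handling here are illustrative only):

        ret = copy_huge_page_from_user(page, (const void __user *)src_addr,
                                       pages_per_huge_page(h), false);
        if (unlikely(ret)) {
                /* failed under hugetlb_fault_mutex: don't keep a page that
                 * consumed a reservation across the unlocked retry */
                restore_reserve_on_error(h, dst_vma, dst_addr, page);
                put_page(page);

                /* hand the caller a temporary page with no reservation
                 * attached; it finishes the copy outside the lock and retries */
                page = alloc_huge_page_vma(h, dst_vma, dst_addr);
                if (!page)
                        return -ENOMEM;
                *pagep = page;
                return -ENOENT;
        }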
      
      Test:
      Hacked the code locally such that resv_huge_pages underflows produce
      a warning and the copy_huge_page_from_user() always fails, then:
      
      ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
              2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      ./tools/testing/selftests/vm/userfaultfd hugetlb 10
        2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
      
      Both tests succeed and produce no warnings.  After the tests run, the
      number of free/resv hugepages is correct.
      
      [yuehaibing@huawei.com: remove set but not used variable 'vm_alloc_shared']
        Link: https://lkml.kernel.org/r/20210601141610.28332-1-yuehaibing@huawei.com
      [almasrymina@google.com: fix allocation error check and copy func name]
        Link: https://lkml.kernel.org/r/20210605010626.1459873-1-almasrymina@google.com
      
      Link: https://lkml.kernel.org/r/20210528005029.88088-1-almasrymina@google.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: YueHaibing <yuehaibing@huawei.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cc5fcbb
  3. 23 May, 2021: 1 commit
  4. 06 May, 2021: 2 commits
    • userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Authored by Axel Rasmussen
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
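
      From userspace, resolving a minor fault then looks roughly like this (a
      sketch; uffd, addr and len come from the fault event, and error handling
      is trimmed):

        #include <linux/userfaultfd.h>
        #include <sys/ioctl.h>
        #include <stdio.h>

        /* after reading a UFFD_PAGEFAULT_FLAG_MINOR event for [addr, addr + len) */
        static int resolve_minor_fault(int uffd, unsigned long addr, unsigned long len)
        {
                struct uffdio_continue cont = {
                        .range = { .start = addr, .len = len },
                        .mode  = 0,     /* or UFFDIO_CONTINUE_MODE_DONTWAKE */
                };

                /* "page contents are correct, carry on setting up the mapping" */
                if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1) {
                        perror("UFFDIO_CONTINUE");
                        return -1;
                }
                return 0;
        }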
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6191471
    • hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() · aec44e0f
      Authored by Peter Xu
      Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
      
      This series tries to disable huge pmd unsharing of hugetlbfs-backed memory
      for uffd-wp.  Although uffd-wp for hugetlbfs is still at the RFC stage, the
      idea of this series may be needed for multiple tasks (Axel's uffd minor
      fault series, and Mike's soft dirty series), so I picked it out from the
      larger series.
      
      This patch (of 4):
      
      This is preparatory work to let the per-architecture huge_pte_alloc()
      behave differently according to the VMA's attributes.

      Pass the VMA deeper into huge_pmd_share() so that we can avoid the
      find_vma() call.
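
      Concretely, the prototypes change roughly as follows (a sketch of the new
      parameter):

        /* before */
        pte_t *huge_pte_alloc(struct mm_struct *mm,
                              unsigned long addr, unsigned long sz);

        /* after: the VMA is passed down so huge_pmd_share() no longer needs
         * find_vma() */
        pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                              unsigned long addr, unsigned long sz);
        pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
                              unsigned long addr, pud_t *pud);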
      
      [peterx@redhat.com: build fix]
        Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1
      Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aec44e0f
  5. 13 Aug, 2020: 1 commit
    • mm/vmscan: protect the workingset on anonymous LRU · b518154e
      Authored by Joonsoo Kim
      In the current implementation, a newly created or swapped-in anonymous
      page starts on the active list.  Growing the active list results in
      rebalancing the active/inactive lists, so old pages on the active list are
      demoted to the inactive list.  Hence, pages on the active list aren't
      protected at all.
      
      Following is an example of this situation.
      
      Assume there are 50 hot pages on the active list.  Numbers denote the
      number of pages on the active/inactive lists (active | inactive).
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(uo) | 50(h)
      
      3. workload: another 50 newly created (used-once) pages
      50(uo) | 50(uo), swap-out 50(h)
      
      This patch tries to fix this issue.  As with the file LRU, newly created
      or swapped-in anonymous pages will be inserted on the inactive list.  They
      are promoted to the active list if enough references happen.  This simple
      modification changes the above example as follows.
      
      1. 50 hot pages on active list
      50(h) | 0
      
      2. workload: 50 newly created (used-once) pages
      50(h) | 50(uo)
      
      3. workload: another 50 newly created (used-once) pages
      50(h) | 50(uo), swap-out 50(uo)
      
      As you can see, the hot pages on the active list are now protected.

      Note that this implementation has a drawback: a page cannot be promoted
      and will be swapped out if its re-access interval is greater than the size
      of the inactive list but less than the total (active + inactive) size.  To
      solve this potential issue, a following patch will apply workingset
      detection similar to the one already applied to the file LRU.
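
      The visible change at the anonymous-fault call sites is roughly (a sketch):

        /* before: newly faulted-in anonymous pages went straight to the active list */
        lru_cache_add_active_or_unevictable(page, vma);

        /* after: start them on the inactive list, like the file LRU, and let
         * real references promote them */
        lru_cache_add_inactive_or_unevictable(page, vma);
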
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b518154e
  6. 10 Jun, 2020: 2 commits
  7. 04 Jun, 2020: 4 commits
  8. 08 Apr, 2020: 3 commits
    • userfaultfd: wp: support write protection for userfault vma range · ffd05793
      Authored by Shaohua Li
      Add an API to enable/disable write protection for a VMA range.  Unlike
      mprotect, this doesn't split/merge VMAs.
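
      A hedged sketch of how the new helper is used (the exact prototype may
      differ):

        /* toggle uffd write protection on [start, start + len) without
         * splitting or merging VMAs */
        err = mwriteprotect_range(dst_mm, start, len, true,  &ctx->mmap_changing);
        /* ... later, when userspace resolves the write fault ... */
        err = mwriteprotect_range(dst_mm, start, len, false, &ctx->mmap_changing);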
      
      [peterx@redhat.com:
       - use the helper to find VMA;
       - return -ENOENT if not found to match mcopy case;
       - use the new MM_CP_UFFD_WP* flags for change_protection
       - check against mmap_changing for failures
       - replace find_dst_vma with vma_find_uffd]
      Signed-off-by: Shaohua Li <shli@fb.com>
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ffd05793
    • userfaultfd: wp: apply _PAGE_UFFD_WP bit · 292924b2
      Authored by Peter Xu
      First, introduce two new flags, MM_CP_UFFD_WP[_RESOLVE], for
      change_protection() when used with uffd-wp, and make sure the two new
      flags are mutually exclusive.  Then,
      
        - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
          when a range of memory is write protected by uffd
      
        - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
          _PAGE_RW when write protection is resolved from userspace
      
      And use this new interface in mwriteprotect_range() to replace the old
      MM_CP_DIRTY_ACCT.
      
      Do this change for both PTEs and huge PMDs.  Then we can start to identify
      which PTE/PMD is write protected for general reasons (e.g., COW or
      soft-dirty tracking), and which is write protected for userfaultfd-wp.
      
      Since we should keep _PAGE_UFFD_WP across pte_modify(), add it into
      _PAGE_CHG_MASK as well.  Meanwhile, since we have this new bit, we can be
      even more strict when detecting uffd-wp page faults in either do_wp_page()
      or wp_huge_pmd().
      
      Once we have _PAGE_UFFD_WP, a special case is when a page is both
      protected by the general COW logic and also by userfault-wp.  Here the
      userfault-wp will have higher priority and will be handled first.  Only
      after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
      the general COW.  These are the steps of what happens with such a
      page:
      
        1. CPU accesses write protected shared page (so both protected by
           general COW and uffd-wp), blocked by uffd-wp first because in
           do_wp_page we'll handle uffd-wp first, so it has higher priority
           than general COW.
      
        2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
           to remove the uffd-wp bit upon the PTE/PMD.  However here we
           still keep the write bit cleared.  Notify the blocked CPU.
      
        3. The blocked CPU resumes the page fault process with a fault
           retry.  During the retry it will notice that the uffd-wp bit is no
           longer set, but the page is still write protected by general COW, so
           it will go through the COW path in the fault handler, copy the page,
           apply the write bit where necessary, and retry again.
      
        4. The CPU will be able to access this page with write bit set.
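
      Internally, mwriteprotect_range() then drives change_protection() with the
      new flags, roughly as follows (a sketch; the newprot handling is
      simplified):

        /* one flag applies the uffd-wp bit, the other resolves it */
        unsigned long cp_flags = enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE;
        pgprot_t newprot = enable_wp ?
                        vm_get_page_prot(dst_vma->vm_flags & ~VM_WRITE) :
                        vm_get_page_prot(dst_vma->vm_flags);

        change_protection(dst_vma, start, start + len, newprot, cp_flags);
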
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      292924b2
    • userfaultfd: wp: add UFFDIO_COPY_MODE_WP · 72981e0e
      Authored by Andrea Arcangeli
      This allows UFFDIO_COPY to map pages write-protected.
      
      [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
       around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
       commit messages]
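
      From userspace this looks roughly like the following (a sketch; uffd,
      dst_addr, src_buf and len are assumed to be set up by the caller):

        #include <sys/ioctl.h>
        #include <stdio.h>
        #include <linux/userfaultfd.h>

        struct uffdio_copy copy = {
                .dst  = dst_addr,               /* page-aligned destination */
                .src  = (unsigned long)src_buf, /* source buffer */
                .len  = len,
                .mode = UFFDIO_COPY_MODE_WP,    /* map the pages write-protected */
        };

        if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
                perror("UFFDIO_COPY");
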
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      72981e0e
  9. 03 Apr, 2020: 1 commit
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Authored by Mike Kravetz
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive, as hugetlbfs
      processing is done separately from core mm in many places.  However, I
      don't really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
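
      A hedged sketch of the resulting pattern in the fault and unshare paths
      (simplified; see the patches for the real call sites):

        /* page fault path: hold i_mmap_rwsem in read mode across
         * huge_pte_alloc() and for as long as the returned ptep is used */
        mapping = vma->vm_file->f_mapping;
        i_mmap_lock_read(mapping);
        ptep = huge_pte_alloc(mm, address, huge_page_size(h));
        /* ... fault handling that dereferences ptep ... */
        i_mmap_unlock_read(mapping);

        /* truncate/unmap path: huge_pmd_unshare() callers hold it in write mode */
        i_mmap_lock_write(mapping);
        huge_pmd_unshare(mm, &address, ptep);
        i_mmap_unlock_write(mapping);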
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d0381a
  10. 02 Dec, 2019: 6 commits
  11. 19 Jun, 2019: 1 commit
  12. 15 May, 2019: 1 commit
    • hugetlb: use same fault hash key for shared and private mappings · 1b426bac
      Authored by Mike Kravetz
      hugetlb uses a fault mutex hash table to prevent concurrent page faults on
      the same page.  The key for shared and private mappings is different:
      shared mappings key off the address_space and file index, while private
      mappings key off the mm and virtual address.  Consider a private mapping
      of a populated hugetlbfs file.  A fault will map the page from the file
      and, if needed, do a COW to map a writable page.
      
      Hugetlbfs hole punch uses the fault mutex to prevent mappings of file
      pages.  It uses the address_space file index key.  However, private
      mappings will use a different key and could race with this code to map
      the file page.  This causes problems (BUG) for the page cache remove
      code as it expects the page to be unmapped.  A sample stack is:
      
      page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
      kernel BUG at mm/filemap.c:169!
      ...
      RIP: 0010:unaccount_page_cache_page+0x1b8/0x200
      ...
      Call Trace:
      __delete_from_page_cache+0x39/0x220
      delete_from_page_cache+0x45/0x70
      remove_inode_hugepages+0x13c/0x380
      ? __add_to_page_cache_locked+0x162/0x380
      hugetlbfs_fallocate+0x403/0x540
      ? _cond_resched+0x15/0x30
      ? __inode_security_revalidate+0x5d/0x70
      ? selinux_file_permission+0x100/0x130
      vfs_fallocate+0x13f/0x270
      ksys_fallocate+0x3c/0x80
      __x64_sys_fallocate+0x1a/0x20
      do_syscall_64+0x5b/0x180
      entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      There seems to be another potential COW issue/race with this approach
      of different private and shared keys as noted in commit 8382d914
      ("mm, hugetlb: improve page-fault scalability").
      
      Since every hugetlb mapping (even anon and private) is actually a file
      mapping, just use the address_space index key for all mappings.  This
      results in potentially more hash collisions.  However, this should not
      be the common case.
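
      After the change, both shared and private faults derive the key from the
      file, roughly as follows (a sketch; the hash function's parameter list is
      abbreviated):

        /* key the fault mutex off (mapping, file index) for every mapping */
        mapping = vma->vm_file->f_mapping;
        idx = vma_hugecache_offset(h, vma, address);
        hash = hugetlb_fault_mutex_hash(h, mapping, idx);
        mutex_lock(&hugetlb_fault_mutex_table[hash]);
        /* ... fault or hole-punch work ... */
        mutex_unlock(&hugetlb_fault_mutex_table[hash]);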
      
      Link: http://lkml.kernel.org/r/20190328234704.27083-3-mike.kravetz@oracle.com
      Link: http://lkml.kernel.org/r/20190412165235.t4sscoujczfhuiyt@linux-r8p5
      Fixes: b5cec28d ("hugetlbfs: truncate_hugepages() takes a range of pages")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b426bac
  13. 09 Jan, 2019: 1 commit
    • hugetlbfs: revert "use i_mmap_rwsem for more pmd sharing synchronization" · ddeaab32
      Authored by Mike Kravetz
      This reverts b43a9990
      
      The reverted commit caused issues with migration and poisoning of anon
      huge pages.  The LTP move_pages12 test triggers an "unable to handle
      kernel NULL pointer" BUG with a stack similar to:
      
        RIP: 0010:down_write+0x1b/0x40
        Call Trace:
          migrate_pages+0x81f/0xb90
          __ia32_compat_sys_migrate_pages+0x190/0x190
          do_move_pages_to_node.isra.53.part.54+0x2a/0x50
          kernel_move_pages+0x566/0x7b0
          __x64_sys_move_pages+0x24/0x30
          do_syscall_64+0x5b/0x180
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The purpose of the reverted patch was to fix some long existing races
      with huge pmd sharing.  It used i_mmap_rwsem for this purpose with the
      idea that this could also be used to address truncate/page fault races
      with another patch.  Further analysis has determined that i_mmap_rwsem
      cannot be used to address all of these hugetlbfs synchronization issues.
      Therefore, revert this patch while working on another approach to the
      underlying issues.
      
      Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddeaab32
  14. 05 Jan, 2019: 1 commit
    • mm: treewide: remove unused address argument from pte_alloc functions · 4cf58924
      Authored by Joel Fernandes (Google)
      Patch series "Add support for fast mremap".
      
      This series speeds up the mremap(2) syscall by copying page tables at
      the PMD level even for non-THP systems.  There is concern that the extra
      'address' argument that mremap passes to pte_alloc may do something
      subtly architecture-related in the future that could make the scheme not
      work.  We also find that there is no point in passing the 'address' to
      pte_alloc since it is unused.  This patch therefore removes this argument
      tree-wide, resulting in a nice negative diff as well.  Along the way we
      ensure that the enabled architectures do not do anything funky with the
      'address' argument that goes unnoticed by the optimization.
      
      Build and boot tested on x86-64.  Build tested on arm64.  The config
      enablement patch for arm64 will be posted in the future after more
      testing.
      
      The changes were obtained by applying the following Coccinelle script
      (thanks to Julia for answering all Coccinelle questions!).  The following
      fix-ups were done manually:
      * Removal of address argument from  pte_fragment_alloc
      * Removal of pte_alloc_one_fast definitions from m68k and microblaze.
      
      // Options: --include-headers --no-includes
      // Note: I split the 'identifier fn' line, so if you are manually
      // running it, please unsplit it so it runs for you.
      
      virtual patch
      
      @pte_alloc_func_def depends on patch exists@
      identifier E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      type T2;
      @@
      
       fn(...
      - , T2 E2
       )
       { ... }
      
      @pte_alloc_func_proto_noarg depends on patch exists@
      type T1, T2, T3, T4;
      identifier fn =~ "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1, T2);
      + T3 fn(T1);
      |
      - T3 fn(T1, T2, T4);
      + T3 fn(T1, T2);
      )
      
      @pte_alloc_func_proto depends on patch exists@
      identifier E1, E2, E4;
      type T1, T2, T3, T4;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
      (
      - T3 fn(T1 E1, T2 E2);
      + T3 fn(T1 E1);
      |
      - T3 fn(T1 E1, T2 E2, T4 E4);
      + T3 fn(T1 E1, T2 E2);
      )
      
      @pte_alloc_func_call depends on patch exists@
      expression E2;
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      @@
      
       fn(...
      -,  E2
       )
      
      @pte_alloc_macro depends on patch exists@
      identifier fn =~
      "^(__pte_alloc|pte_alloc_one|pte_alloc|__pte_alloc_kernel|pte_alloc_one_kernel)$";
      identifier a, b, c;
      expression e;
      position p;
      @@
      
      (
      - #define fn(a, b, c) e
      + #define fn(a, b) e
      |
      - #define fn(a, b) e
      + #define fn(a) e
      )
      
      Link: http://lkml.kernel.org/r/20181108181201.88826-2-joelaf@google.com
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4cf58924
  15. 29 Dec, 2018: 1 commit
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · b43a9990
      Authored by Mike Kravetz
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with the
        ptep.
      
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
        called.
      
      [mike.kravetz@oracle.com: add explicit check for mapping != null]
      Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b43a9990
  16. 01 Dec, 2018: 4 commits
  17. 08 Jun, 2018: 1 commit
    • userfaultfd: prevent non-cooperative events vs mcopy_atomic races · df2cc96e
      Authored by Mike Rapoport
      If a process monitored with userfaultfd changes its memory mappings or
      forks() at the same time as the uffd monitor fills the process memory with
      UFFDIO_COPY, the actual creation of page table entries and copying of
      the data in mcopy_atomic may happen either before or after the memory
      mapping modifications, and there is no way for the uffd monitor to
      maintain a consistent view of the process memory layout.
      
      For instance, let's consider fork() running in parallel with
      userfaultfd_copy():
      
      process        		         |	uffd monitor
      ---------------------------------+------------------------------
      fork()        		         | userfaultfd_copy()
      ...        		         | ...
          dup_mmap()        	         |     down_read(mmap_sem)
          down_write(mmap_sem)         |     /* create PTEs, copy data */
              dup_uffd()               |     up_read(mmap_sem)
              copy_page_range()        |
              up_write(mmap_sem)       |
              dup_uffd_complete()      |
                  /* notify monitor */ |
      
      If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
      be present by the time copy_page_range() is called and they will appear
      in the child's memory mappings.  However, if the fork() is the first to
      take the mmap_sem, the new pages won't be mapped in the child's address
      space.
      
      If the pages are not present and the child tries to access them, the
      monitor will get a page fault notification and everything is fine.
      However, if the pages *are present*, the child can access them without
      uffd noticing.  And if we copy them into the child, it will see the wrong
      data.  Since we are talking about a background copy, we would need to
      decide whether the pages should be copied or not regardless of #PF
      notifications.
      
      Since the userfaultfd monitor has no way to determine the order, let's
      disallow userfaultfd_copy in parallel with the non-cooperative events.  In
      that case we return -EAGAIN, and the uffd monitor can understand that
      userfaultfd_copy() clashed with a non-cooperative event and take an
      appropriate action.
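
      From the monitor's side, the resulting contract looks roughly like this
      (a sketch; handle_pending_events() is a hypothetical helper that drains
      the userfaultfd event queue):

        #include <err.h>
        #include <errno.h>
        #include <sys/ioctl.h>
        #include <linux/userfaultfd.h>

        /* background copy that tolerates -EAGAIN from a racing event */
        static void copy_with_retry(int uffd, unsigned long dst,
                                    unsigned long src, unsigned long len)
        {
                struct uffdio_copy copy = {
                        .dst = dst, .src = src, .len = len, .mode = 0,
                };

                while (ioctl(uffd, UFFDIO_COPY, &copy) == -1) {
                        if (errno != EAGAIN)
                                err(1, "UFFDIO_COPY");
                        /* the copy raced with fork()/mremap()/madvise(): drain
                         * the pending non-cooperative events, then retry */
                        handle_pending_events(uffd);    /* hypothetical helper */
                }
        }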
      
      Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df2cc96e
  18. 07 Feb, 2018: 1 commit
  19. 07 Sep, 2017: 2 commits
  20. 10 Mar, 2017: 1 commit
  21. 02 Mar, 2017: 1 commit