1. 04 9月, 2021 1 次提交
    • N
      userfaultfd: change mmap_changing to atomic · a759a909
      Nadav Amit 提交于
      Patch series "userfaultfd: minor bug fixes".
      
      Three unrelated bug fixes. The first two addresses possible issues (not
      too theoretical ones), but I did not encounter them in practice.
      
      The third patch addresses a test bug that causes the test to fail on my
      system. It has been sent before as part of a bigger RFC.
      
      This patch (of 3):
      
      mmap_changing is currently a boolean variable, which is set and cleared
      without any lock that protects against concurrent modifications.
      
      mmap_changing is supposed to mark whether userfaultfd page-faults handling
      should be retried since mappings are undergoing a change.  However,
      concurrent calls, for instance to madvise(MADV_DONTNEED), might cause
      mmap_changing to be false, although the remove event was still not read
      (hence acknowledged) by the user.
      
      Change mmap_changing to atomic_t and increase/decrease appropriately.  Add
      a debug assertion to see whether mmap_changing is negative.
      
      Link: https://lkml.kernel.org/r/20210808020724.1022515-1-namit@vmware.com
      Link: https://lkml.kernel.org/r/20210808020724.1022515-2-namit@vmware.com
      Fixes: df2cc96e ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races")
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a759a909
  2. 24 7月, 2021 1 次提交
    • P
      userfaultfd: do not untag user pointers · e71e2ace
      Peter Collingbourne 提交于
      Patch series "userfaultfd: do not untag user pointers", v5.
      
      If a user program uses userfaultfd on ranges of heap memory, it may end
      up passing a tagged pointer to the kernel in the range.start field of
      the UFFDIO_REGISTER ioctl.  This can happen when using an MTE-capable
      allocator, or on Android if using the Tagged Pointers feature for MTE
      readiness [1].
      
      When a fault subsequently occurs, the tag is stripped from the fault
      address returned to the application in the fault.address field of struct
      uffd_msg.  However, from the application's perspective, the tagged
      address *is* the memory address, so if the application is unaware of
      memory tags, it may get confused by receiving an address that is, from
      its point of view, outside of the bounds of the allocation.  We observed
      this behavior in the kselftest for userfaultfd [2] but other
      applications could have the same problem.
      
      Address this by not untagging pointers passed to the userfaultfd ioctls.
      Instead, let the system call fail.  Also change the kselftest to use
      mmap so that it doesn't encounter this problem.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      This patch (of 2):
      
      Do not untag pointers passed to the userfaultfd ioctls.  Instead, let
      the system call fail.  This will provide an early indication of problems
      with tag-unaware userspace code instead of letting the code get confused
      later, and is consistent with how we decided to handle brk/mmap/mremap
      in commit dcde2373 ("mm: Avoid creating virtual address aliases in
      brk()/mmap()/mremap()"), as well as being consistent with the existing
      tagged address ABI documentation relating to how ioctl arguments are
      handled.
      
      The code change is a revert of commit 7d032574 ("userfaultfd: untag
      user pointers") plus some fixups to some additional calls to
      validate_range that have appeared since then.
      
      [1] https://source.android.com/devices/tech/debug/tagged-pointers
      [2] tools/testing/selftests/vm/userfaultfd.c
      
      Link: https://lkml.kernel.org/r/20210714195437.118982-1-pcc@google.com
      Link: https://lkml.kernel.org/r/20210714195437.118982-2-pcc@google.com
      Link: https://linux-review.googlesource.com/id/I761aa9f0344454c482b83fcfcce547db0a25501b
      Fixes: 63f0c603 ("arm64: Introduce prctl() options to control the tagged user addresses ABI")
      Signed-off-by: NPeter Collingbourne <pcc@google.com>
      Reviewed-by: NAndrey Konovalov <andreyknvl@gmail.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Alistair Delva <adelva@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Martin <Dave.Martin@arm.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mitch Phillips <mitchp@google.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: William McVicker <willmcvicker@google.com>
      Cc: <stable@vger.kernel.org>	[5.4]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e71e2ace
  3. 01 7月, 2021 3 次提交
  4. 18 6月, 2021 1 次提交
  5. 06 5月, 2021 3 次提交
    • A
      userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Axel Rasmussen 提交于
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6191471
    • A
      userfaultfd: add minor fault registration mode · 7677f7fd
      Axel Rasmussen 提交于
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with UFFDIO_REGISTER_MODE_MISSING will *also*
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because this is a feature, a registered VMA could potentially receive both
      missing and minor faults.  I spent some time thinking through how the
      existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this was requires we allocate a VM_* flag for the new
      registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7677f7fd
    • P
      hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp · 6dfeaff9
      Peter Xu 提交于
      Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
      userfaultfd-wp is always based on pgtable entries, so they cannot be
      shared.
      
      Walk the hugetlb range and unshare all such mappings if there is, right
      before UFFDIO_REGISTER will succeed and return to userspace.
      
      This will pair with want_pmd_share() in hugetlb code so that huge pmd
      sharing is completely disabled for userfaultfd-wp registered range.
      
      Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6dfeaff9
  6. 15 1月, 2021 1 次提交
  7. 16 12月, 2020 2 次提交
    • L
      userfaultfd: add user-mode only option to unprivileged_userfaultfd sysctl knob · d0d4730a
      Lokesh Gidra 提交于
      With this change, when the knob is set to 0, it allows unprivileged users
      to call userfaultfd, like when it is set to 1, but with the restriction
      that page faults from only user-mode can be handled.  In this mode, an
      unprivileged user (without SYS_CAP_PTRACE capability) must pass
      UFFD_USER_MODE_ONLY to userfaultd or the API will fail with EPERM.
      
      This enables administrators to reduce the likelihood that an attacker with
      access to userfaultfd can delay faulting kernel code to widen timing
      windows for other exploits.
      
      The default value of this knob is changed to 0.  This is required for
      correct functioning of pipe mutex.  However, this will fail postcopy live
      migration, which will be unnoticeable to the VM guests.  To avoid this,
      set 'vm.userfault = 1' in /sys/sysctl.conf.
      
      The main reason this change is desirable as in the short term is that the
      Android userland will behave as with the sysctl set to zero.  So without
      this commit, any Linux binary using userfaultfd to manage its memory would
      behave differently if run within the Android userland.  For more details,
      refer to Andrea's reply [1].
      
      [1] https://lore.kernel.org/lkml/20200904033438.GI9411@redhat.com/
      
      Link: https://lkml.kernel.org/r/20201120030411.2690816-3-lokeshgidra@google.comSigned-off-by: NLokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Daniel Colascione <dancol@dancol.org>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: <calin@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Daniel Colascione <dancol@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d0d4730a
    • L
      userfaultfd: add UFFD_USER_MODE_ONLY · 37cd0575
      Lokesh Gidra 提交于
      Patch series "Control over userfaultfd kernel-fault handling", v6.
      
      This patch series is split from [1].  The other series enables SELinux
      support for userfaultfd file descriptors so that its creation and movement
      can be controlled.
      
      It has been demonstrated on various occasions that suspending kernel code
      execution for an arbitrary amount of time at any access to userspace
      memory (copy_from_user()/copy_to_user()/...) can be exploited to change
      the intended behavior of the kernel.  For instance, handling page faults
      in kernel-mode using userfaultfd has been exploited in [2, 3].  Likewise,
      FUSE, which is similar to userfaultfd in this respect, has been exploited
      in [4, 5] for similar outcome.
      
      This small patch series adds a new flag to userfaultfd(2) that allows
      callers to give up the ability to handle kernel-mode faults with the
      resulting UFFD file object.  It then adds a 'user-mode only' option to the
      unprivileged_userfaultfd sysctl knob to require unprivileged callers to
      use this new flag.
      
      The purpose of this new interface is to decrease the chance of an
      unprivileged userfaultfd user taking advantage of userfaultfd to enhance
      security vulnerabilities by lengthening the race window in kernel code.
      
      [1] https://lore.kernel.org/lkml/20200211225547.235083-1-dancol@google.com/
      [2] https://duasynt.com/blog/linux-kernel-heap-spray
      [3] https://duasynt.com/blog/cve-2016-6187-heap-off-by-one-exploit
      [4] https://googleprojectzero.blogspot.com/2016/06/exploiting-recursion-in-linux-kernel_20.html
      [5] https://bugs.chromium.org/p/project-zero/issues/detail?id=808
      
      This patch (of 2):
      
      userfaultfd handles page faults from both user and kernel code.  Add a new
      UFFD_USER_MODE_ONLY flag for userfaultfd(2) that makes the resulting
      userfaultfd object refuse to handle faults from kernel mode, treating
      these faults as if SIGBUS were always raised, causing the kernel code to
      fail with EFAULT.
      
      A future patch adds a knob allowing administrators to give some processes
      the ability to create userfaultfd file objects only if they pass
      UFFD_USER_MODE_ONLY, reducing the likelihood that these processes will
      exploit userfaultfd's ability to delay kernel page faults to open timing
      windows for future exploits.
      
      Link: https://lkml.kernel.org/r/20201120030411.2690816-1-lokeshgidra@google.com
      Link: https://lkml.kernel.org/r/20201120030411.2690816-2-lokeshgidra@google.comSigned-off-by: NDaniel Colascione <dancol@google.com>
      Signed-off-by: NLokesh Gidra <lokeshgidra@google.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: <calin@google.com>
      Cc: Daniel Colascione <dancol@dancol.org>
      Cc: Eric Biggers <ebiggers@kernel.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Jeff Vander Stoep <jeffv@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kalesh Singh <kaleshsingh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nitin Gupta <nigupta@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37cd0575
  8. 17 10月, 2020 1 次提交
  9. 04 8月, 2020 1 次提交
    • L
      userfaultfd: simplify fault handling · f9bf3522
      Linus Torvalds 提交于
      Instead of waiting in a loop for the userfaultfd condition to become
      true, just wait once and return VM_FAULT_RETRY.
      
      We've already dropped the mmap lock, we know we can't really
      successfully handle the fault at this point and the caller will have to
      retry anyway.  So there's no point in making the wait any more
      complicated than it needs to be - just schedule away.
      
      And once you don't have that complexity with explicit looping, you can
      also just lose all the 'userfaultfd_signal_pending()' complexity,
      because once we've set the correct process sleeping state, and don't
      loop, the act of scheduling itself will be checking if there are any
      pending signals before going to sleep.
      
      We can also drop the VM_FAULT_MAJOR games, since we'll be treating all
      retried faults as major soon anyway (series to regularize and share more
      of fault handling across architectures in a separate series by Peter Xu,
      and in the meantime we won't worry about the possible minor - I'll be
      here all week, try the veal - accounting difference).
      
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9bf3522
  10. 29 7月, 2020 1 次提交
  11. 10 6月, 2020 4 次提交
  12. 08 4月, 2020 4 次提交
  13. 03 4月, 2020 3 次提交
    • P
      mm/userfaultfd: honor FAULT_FLAG_KILLABLE in fault path · 3e69ad08
      Peter Xu 提交于
      Userfaultfd fault path was by default killable even if the caller does not
      have FAULT_FLAG_KILLABLE.  That makes sense before in that when with gup
      we don't have FAULT_FLAG_KILLABLE properly set before.  Now after previous
      patch we've got FAULT_FLAG_KILLABLE applied even for gup code so it should
      also make sense to let userfaultfd to honor the FAULT_FLAG_KILLABLE.
      
      Because we're unconditionally setting FAULT_FLAG_KILLABLE in gup code
      right now, this patch should have no functional change.  It also cleaned
      the code a little bit by introducing some helpers.
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NBrian Geffon <bgeffon@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160300.9941-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3e69ad08
    • P
      mm: introduce FAULT_FLAG_INTERRUPTIBLE · c270a7ee
      Peter Xu 提交于
      handle_userfaultfd() is currently the only one place in the kernel page
      fault procedures that can respond to non-fatal userspace signals.  It was
      trying to detect such an allowance by checking against USER & KILLABLE
      flags, which was "un-official".
      
      In this patch, we introduced a new flag (FAULT_FLAG_INTERRUPTIBLE) to show
      that the fault handler allows the fault procedure to respond even to
      non-fatal signals.  Meanwhile, add this new flag to the default fault
      flags so that all the page fault handlers can benefit from the new flag.
      With that, replacing the userfault check to this one.
      
      Since the line is getting even longer, clean up the fault flags a bit too
      to ease TTY users.
      
      Although we've got a new flag and applied it, we shouldn't have any
      functional change with this patch so far.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NBrian Geffon <bgeffon@google.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220195348.16302-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c270a7ee
    • P
      userfaultfd: don't retake mmap_sem to emulate NOPAGE · ef429ee7
      Peter Xu 提交于
      This patch removes the risk path in handle_userfault() then we will be
      sure that the callers of handle_mm_fault() will know that the VMAs might
      have changed.  Meanwhile with previous patch we don't lose responsiveness
      as well since the core mm code now can handle the nonfatal userspace
      signals even if we return VM_FAULT_RETRY.
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Xu <peterx@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NBrian Geffon <bgeffon@google.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Link: http://lkml.kernel.org/r/20200220160234.9646-1-peterx@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef429ee7
  14. 02 12月, 2019 2 次提交
  15. 23 10月, 2019 1 次提交
  16. 26 9月, 2019 1 次提交
  17. 25 8月, 2019 1 次提交
  18. 05 7月, 2019 1 次提交
    • E
      fs/userfaultfd.c: disable irqs for fault_pending and event locks · cbcfa130
      Eric Biggers 提交于
      When IOCB_CMD_POLL is used on a userfaultfd, aio_poll() disables IRQs
      and takes kioctx::ctx_lock, then userfaultfd_ctx::fd_wqh.lock.
      
      This may have to wait for userfaultfd_ctx::fd_wqh.lock to be released by
      userfaultfd_ctx_read(), which in turn can be waiting for
      userfaultfd_ctx::fault_pending_wqh.lock or
      userfaultfd_ctx::event_wqh.lock.
      
      But elsewhere the fault_pending_wqh and event_wqh locks are taken with
      IRQs enabled.  Since the IRQ handler may take kioctx::ctx_lock, lockdep
      reports that a deadlock is possible.
      
      Fix it by always disabling IRQs when taking the fault_pending_wqh and
      event_wqh locks.
      
      Commit ae62c16e ("userfaultfd: disable irqs when taking the
      waitqueue lock") didn't fix this because it only accounted for the
      fd_wqh lock, not the other locks nested inside it.
      
      Link: http://lkml.kernel.org/r/20190627075004.21259-1-ebiggers@kernel.org
      Fixes: bfe4037e ("aio: implement IOCB_CMD_POLL")
      Signed-off-by: NEric Biggers <ebiggers@google.com>
      Reported-by: syzbot+fab6de82892b6b9c6191@syzkaller.appspotmail.com
      Reported-by: syzbot+53c0b767f7ca0dc0c451@syzkaller.appspotmail.com
      Reported-by: syzbot+a3accb352f9c22041cfa@syzkaller.appspotmail.com
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cbcfa130
  19. 19 6月, 2019 1 次提交
  20. 15 5月, 2019 1 次提交
    • P
      userfaultfd/sysctl: add vm.unprivileged_userfaultfd · cefdca0a
      Peter Xu 提交于
      Userfaultfd can be misued to make it easier to exploit existing
      use-after-free (and similar) bugs that might otherwise only make a
      short window or race condition available.  By using userfaultfd to
      stall a kernel thread, a malicious program can keep some state that it
      wrote, stable for an extended period, which it can then access using an
      existing exploit.  While it doesn't cause the exploit itself, and while
      it's not the only thing that can stall a kernel thread when accessing a
      memory location, it's one of the few that never needs privilege.
      
      We can add a flag, allowing userfaultfd to be restricted, so that in
      general it won't be useable by arbitrary user programs, but in
      environments that require userfaultfd it can be turned back on.
      
      Add a global sysctl knob "vm.unprivileged_userfaultfd" to control
      whether userfaultfd is allowed by unprivileged users.  When this is
      set to zero, only privileged users (root user, or users with the
      CAP_SYS_PTRACE capability) will be able to use the userfaultfd
      syscalls.
      
      Andrea said:
      
      : The only difference between the bpf sysctl and the userfaultfd sysctl
      : this way is that the bpf sysctl adds the CAP_SYS_ADMIN capability
      : requirement, while userfaultfd adds the CAP_SYS_PTRACE requirement,
      : because the userfaultfd monitor is more likely to need CAP_SYS_PTRACE
      : already if it's doing other kind of tracking on processes runtime, in
      : addition of userfaultfd.  In other words both syscalls works only for
      : root, when the two sysctl are opt-in set to 1.
      
      [dgilbert@redhat.com: changelog additions]
      [akpm@linux-foundation.org: documentation tweak, per Mike]
      Link: http://lkml.kernel.org/r/20190319030722.12441-2-peterx@redhat.comSigned-off-by: NPeter Xu <peterx@redhat.com>
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Suggested-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cefdca0a
  21. 20 4月, 2019 1 次提交
    • A
      coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping · 04f5866e
      Andrea Arcangeli 提交于
      The core dumping code has always run without holding the mmap_sem for
      writing, despite that is the only way to ensure that the entire vma
      layout will not change from under it.  Only using some signal
      serialization on the processes belonging to the mm is not nearly enough.
      This was pointed out earlier.  For example in Hugh's post from Jul 2017:
      
        https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils
      
        "Not strictly relevant here, but a related note: I was very surprised
         to discover, only quite recently, how handle_mm_fault() may be called
         without down_read(mmap_sem) - when core dumping. That seems a
         misguided optimization to me, which would also be nice to correct"
      
      In particular because the growsdown and growsup can move the
      vm_start/vm_end the various loops the core dump does around the vma will
      not be consistent if page faults can happen concurrently.
      
      Pretty much all users calling mmget_not_zero()/get_task_mm() and then
      taking the mmap_sem had the potential to introduce unexpected side
      effects in the core dumping code.
      
      Adding mmap_sem for writing around the ->core_dump invocation is a
      viable long term fix, but it requires removing all copy user and page
      faults and to replace them with get_dump_page() for all binary formats
      which is not suitable as a short term fix.
      
      For the time being this solution manually covers the places that can
      confuse the core dump either by altering the vma layout or the vma flags
      while it runs.  Once ->core_dump runs under mmap_sem for writing the
      function mmget_still_valid() can be dropped.
      
      Allowing mmap_sem protected sections to run in parallel with the
      coredump provides some minor parallelism advantage to the swapoff code
      (which seems to be safe enough by never mangling any vma field and can
      keep doing swapins in parallel to the core dumping) and to some other
      corner case.
      
      In order to facilitate the backporting I added "Fixes: 86039bd3"
      however the side effect of this same race condition in /proc/pid/mem
      should be reproducible since before 2.6.12-rc2 so I couldn't add any
      other "Fixes:" because there's no hash beyond the git genesis commit.
      
      Because find_extend_vma() is the only location outside of the process
      context that could modify the "mm" structures under mmap_sem for
      reading, by adding the mmget_still_valid() check to it, all other cases
      that take the mmap_sem for reading don't need the new check after
      mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
      context also doesn't need the new check, because all tasks under core
      dumping are frozen.
      
      Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
      Fixes: 86039bd3 ("userfaultfd: add new syscall to provide memory externalization")
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NJann Horn <jannh@google.com>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NPeter Xu <peterx@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NJann Horn <jannh@google.com>
      Acked-by: NJason Gunthorpe <jgg@mellanox.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      04f5866e
  22. 29 12月, 2018 2 次提交
  23. 15 12月, 2018 1 次提交
  24. 01 12月, 2018 1 次提交
  25. 13 11月, 2018 1 次提交