1. 10 Mar, 2017 1 commit
  2. 02 Mar, 2017 2 commits
  3. 28 Feb, 2017 2 commits
  4. 25 Feb, 2017 4 commits
  5. 23 Feb, 2017 16 commits
  6. 25 Jan, 2017 1 commit
    • A
      userfaultfd: fix SIGBUS resulting from false rwsem wakeups · 15a77c6f
      Andrea Arcangeli committed
      With >=32 CPUs the userfaultfd selftest triggered a graceful but
      unexpected SIGBUS because VM_FAULT_RETRY was returned by
      handle_userfault() even though the UFFDIO_COPY wasn't completed.
      
      This seems to be caused by rwsem waking the thread blocked in
      handle_userfault(), and we can't run up_read() before the wait_event
      sequence is complete.
      
      Keeping the wait_event sequence identical to the first one would require
      running userfaultfd_must_wait() again to know if the loop should be
      repeated, and it would also require retaking the rwsem and revalidating
      the whole vma status.
      
      It seems simpler to wait for the targeted wakeup so that if false wakeups
      materialize we still wait for our specific wakeup event, unless of
      course there are signals or the uffd was released.
      
      Debug code collecting the stack trace of the wakeup showed this:
      
        $ ./userfaultfd 100 99999
        nr_pages: 25600, nr_pages_per_cpu: 800
        bounces: 99998, mode: racing ver poll, userfaults: 32 35 90 232 30 138 69 82 34 30 139 40 40 31 20 19 43 13 15 28 27 38 21 43 56 22 1 17 31 8 4 2
        bounces: 99997, mode: rnd ver poll, Bus error (core dumped)
      
          save_stack_trace+0x2b/0x50
          try_to_wake_up+0x2a6/0x580
          wake_up_q+0x32/0x70
          rwsem_wake+0xe0/0x120
          call_rwsem_wake+0x1b/0x30
          up_write+0x3b/0x40
          vm_mmap_pgoff+0x9c/0xc0
          SyS_mmap_pgoff+0x1a9/0x240
          SyS_mmap+0x22/0x30
          entry_SYSCALL_64_fastpath+0x1f/0xbd
          0xffffffffffffffff
          FAULT_FLAG_ALLOW_RETRY missing 70
        CPU: 24 PID: 1054 Comm: userfaultfd Tainted: G        W       4.8.0+ #30
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-0-ge2fc41e-prebuilt.qemu-project.org 04/01/2014
        Call Trace:
          dump_stack+0xb8/0x112
          handle_userfault+0x572/0x650
          handle_mm_fault+0x12cb/0x1520
          __do_page_fault+0x175/0x500
          trace_do_page_fault+0x61/0x270
          do_async_page_fault+0x19/0x90
          async_page_fault+0x25/0x30
      
      This always happens when the main userfault selftest thread is running
      clone() while glibc runs either mprotect or mmap (both taking mmap_sem
      down_write()) to allocate the thread stack of the background threads,
      while locking/userfault threads already run at full throttle and are
      susceptible to false wakeups that may cause handle_userfault() to return
      earlier than expected (which results in a graceful SIGBUS at the next
      attempt).
      
      This was reproduced only with >=32 CPUs because the loop that starts the
      threads with clone() is too quick with fewer CPUs, while with 32 CPUs
      there's already significant activity on ~32 locking and userfault
      threads when the last background threads are started with clone().
      
      This >=32 CPUs SMP race condition is likely reproducible only with the
      selftest because of the much heavier userfault load it generates if
      compared to real apps.
      
      We'll have to allow "one more" VM_FAULT_RETRY for the WP support, and a
      patch floating around that provides it also hid this problem, but in
      reality it only succeeds at hiding the problem.
      
      False wakeups could still happen again the second time
      handle_userfault() is invoked, even if it's such a rare race condition
      that getting false wakeups twice in a row is impossible to reproduce.
      This full fix is needed for correctness; the only alternative would be
      to allow VM_FAULT_RETRY to be returned infinitely.  With this fix the WP
      support can stick to a strict "one more" VM_FAULT_RETRY logic (no need
      to return it an infinite number of times to avoid the SIGBUS).
      
      Link: http://lkml.kernel.org/r/20170111005535.13832-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Shubham Kumar Sharma <shubham.kumar.sharma@oracle.com>
      Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Michael Rapoport <RAPOPORT@il.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15a77c6f
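The difference between returning on any wakeup and waiting for the targeted wakeup can be pictured with a small userland model. This is only a sketch and not the kernel code: the enum values and the functions `wait_any_wakeup` and `wait_targeted_wakeup` are hypothetical names, and an array of events stands in for whatever the scheduler delivers to the sleeping thread.

```c
#include <stddef.h>

/* Hypothetical event kinds that can end a sleep in this model. */
enum wake_event {
    EV_FALSE_WAKEUP,    /* e.g. an unrelated rwsem wakeup */
    EV_TARGETED_WAKEUP, /* the fault was resolved and this waiter was woken */
    EV_SIGNAL,          /* a signal became pending */
    EV_RELEASED         /* the userfaultfd was released */
};

enum wait_result {
    WAIT_RESOLVED,
    WAIT_INTERRUPTED,
    WAIT_RELEASED,
    WAIT_RETURNED_UNRESOLVED, /* woke up although nothing resolved the fault */
    WAIT_STILL_BLOCKED        /* the model ran out of events */
};

/* Map one wakeup event to what the waiter observes when it re-checks. */
static enum wait_result classify(enum wake_event ev)
{
    switch (ev) {
    case EV_TARGETED_WAKEUP: return WAIT_RESOLVED;
    case EV_SIGNAL:          return WAIT_INTERRUPTED;
    case EV_RELEASED:        return WAIT_RELEASED;
    default:                 return WAIT_RETURNED_UNRESOLVED;
    }
}

/* Model of the problematic behaviour: sleep once, return on the first wakeup. */
enum wait_result wait_any_wakeup(const enum wake_event *events, size_t n)
{
    if (n == 0)
        return WAIT_STILL_BLOCKED;
    return classify(events[0]);
}

/*
 * Model of the fix: keep sleeping until the targeted wakeup, a signal,
 * or a release is observed; false wakeups only cause another re-check.
 */
enum wait_result wait_targeted_wakeup(const enum wake_event *events, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        enum wait_result r = classify(events[i]);
        if (r != WAIT_RETURNED_UNRESOLVED)
            return r;
        /* False wakeup: nothing changed for this waiter, sleep again. */
    }
    return WAIT_STILL_BLOCKED;
}
```

With the event sequence false, false, targeted, the first function returns `WAIT_RETURNED_UNRESOLVED` (the situation that led to the SIGBUS on the next attempt), while the second returns `WAIT_RESOLVED`.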
  7. 15 Dec, 2016 1 commit
  8. 27 Jul, 2016 1 commit
  9. 21 May, 2016 1 commit
  10. 03 Mar, 2016 1 commit
    • L
      userfaultfd: don't block on the last VM updates at exit time · 39680f50
      Linus Torvalds committed
      The exit path will do some final updates to the VM of an exiting process
      to inform others of the fact that the process is going away.
      
      That happens, for example, for robust futex state cleanup, but also if
      the parent has asked for a TID update when the process exits (we clear
      the child tid field in user space).
      
      However, at the time we do those final VM accesses, we've already
      stopped accepting signals, so the usual "stop waiting for userfaults on
      signal" code in fs/userfaultfd.c no longer works, and the process can
      become an unkillable zombie waiting for something that will never
      happen.
      
      To solve this, just make handle_userfault() abort any user fault
      handling if we're already in the exit path past the signal handling
      state being dead (marked by PF_EXITING).
      
      This VM special case is pretty ugly, and it is possible that we should
      look at finalizing signals later (or move the VM final accesses
      earlier).  But in the meantime this is a fairly minimally intrusive fix.
      Reported-and-tested-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      39680f50
  11. 23 Sep, 2015 1 commit
  12. 18 Sep, 2015 1 commit
  13. 05 Sep, 2015 8 commits
    • A
      userfaultfd: avoid missing wakeups during refile in userfaultfd_read · 2c5b7e1b
      Andrea Arcangeli committed
      During the refile in userfaultfd_read both waitqueues could look empty to
      the lockless wake_userfault().  Use a seqcount to prevent this false
      negative that could leave a userfault blocked.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c5b7e1b
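The seqcount idea can be sketched in userland with C11 atomics. This is a simplified model under stated assumptions, not the kernel's seqcount API: the struct and function names are hypothetical, two counters stand in for the two waitqueues, and a single writer is assumed (something else must serialize writers).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal sequence counter: odd means a write is in progress. */
struct seqcount {
    atomic_uint sequence;
};

/* Two counters standing in for the "pending" and "already read" waitqueues. */
struct uffd_queues {
    struct seqcount refile_seq;
    atomic_int pending;
    atomic_int read_out;
};

void queues_init(struct uffd_queues *q, int pending)
{
    atomic_init(&q->refile_seq.sequence, 0u);
    atomic_init(&q->pending, pending);
    atomic_init(&q->read_out, 0);
}

unsigned read_seq_begin(struct seqcount *s)
{
    unsigned seq;

    /* Wait until no write is in progress (sequence is even). */
    while ((seq = atomic_load(&s->sequence)) & 1u)
        ;
    return seq;
}

bool read_seq_retry(struct seqcount *s, unsigned start)
{
    /* A changed sequence means a write overlapped the read section. */
    return atomic_load(&s->sequence) != start;
}

/* Move one waiter from "pending" to "read_out" inside a write section. */
void refile_one(struct uffd_queues *q)
{
    atomic_fetch_add(&q->refile_seq.sequence, 1u); /* begin: becomes odd */
    atomic_fetch_sub(&q->pending, 1);
    /* Between these two updates both counters can look empty. */
    atomic_fetch_add(&q->read_out, 1);
    atomic_fetch_add(&q->refile_seq.sequence, 1u); /* end: becomes even */
}

/* Lockless check that retries if a refile raced with it. */
bool any_waiters(struct uffd_queues *q)
{
    unsigned seq;
    bool found;

    do {
        seq = read_seq_begin(&q->refile_seq);
        found = atomic_load(&q->pending) > 0 ||
                atomic_load(&q->read_out) > 0;
    } while (read_seq_retry(&q->refile_seq, seq));
    return found;
}
```

The reader never takes a lock; it only repeats the check when the sequence shows that a refile overlapped it, which is what rules out the false "both empty" observation.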
    • A
      userfaultfd: allow signals to interrupt a userfault · dfa37dc3
      Andrea Arcangeli committed
      This is only simple to achieve if the userfault is going to return to
      userland (not to the kernel) because we can avoid returning VM_FAULT_RETRY
      even though we temporarily released the mmap_sem.  The fault would just be
      retried by userland then.  This is safe at least on x86 and powerpc (the
      two archs with the syscall implemented so far).
      
      Hint to verify for which archs this is safe: after handle_mm_fault
      returns, no access to data structures protected by the mmap_sem must be
      done by the fault code in arch/*/mm/fault.c until up_read(&mm->mmap_sem)
      is called.
      
      This has two main benefits: signals can run with lower latency in
      production (signals aren't blocked by userfaults and userfaults are
      immediately repeated after signal processing) and gdb can then trivially
      debug the threads blocked in this kind of userfaults coming directly from
      userland.
      
      On a side note: while gdb needs signals to be processed, coredumps
      always worked perfectly with userfaults, no matter if the userfault is
      triggered by GUP, a kernel copy_user, or directly from userland.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dfa37dc3
    • A
      userfaultfd: require UFFDIO_API before other ioctls · e6485a47
      Andrea Arcangeli committed
      UFFDIO_API was already forced before read/poll could work.  This makes the
      code stricter by also forcing it for all other ioctls.
      
      All users would already have been required to call UFFDIO_API before
      invoking other ioctls but this makes it more explicit.
      
      This will ensure we can change all ioctls (all but UFFDIO_API/struct
      uffdio_api) with a bump of uffdio_api.api.
      
      There's no actual plan or need to change the API or the ioctl; the current
      API should already cover even the non-cooperative usage. This is
      just for the longer-term future, just in case.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e6485a47
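The rule that UFFDIO_API must come first can be pictured as a small state check. The sketch below is a userland model and not the kernel source: the state and command names are illustrative, and returning `-EINVAL` for a rejected call is an assumption of the model.

```c
#include <errno.h>

/* Illustrative states: before and after the API handshake. */
enum uffd_state { UFFD_STATE_WAIT_API, UFFD_STATE_RUNNING };

/* Hypothetical command identifiers standing in for the real ioctl numbers. */
enum uffd_cmd { CMD_API, CMD_REGISTER, CMD_COPY, CMD_WAKE };

struct uffd_ctx {
    enum uffd_state state;
};

/*
 * Model of the ioctl entry point: every command other than the API
 * handshake is rejected until the handshake has completed.
 */
int uffd_ioctl_model(struct uffd_ctx *ctx, enum uffd_cmd cmd)
{
    if (cmd != CMD_API && ctx->state == UFFD_STATE_WAIT_API)
        return -EINVAL; /* assumed error code */

    if (cmd == CMD_API) {
        if (ctx->state != UFFD_STATE_WAIT_API)
            return -EINVAL; /* assumption: a second handshake is refused */
        ctx->state = UFFD_STATE_RUNNING;
        return 0;
    }

    /* The real work of the other commands is outside this model. */
    return 0;
}
```

Because every command passes through the same gate, the version negotiated in the handshake is known before any other ioctl is interpreted, which is what leaves room for changing the other ioctls with an API bump.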
    • A
      userfaultfd: UFFDIO_COPY and UFFDIO_ZEROPAGE · ad465cae
      Andrea Arcangeli committed
      These two ioctls allow either atomically copying pages or mapping zeropages
      into the virtual address space. This is used by the thread that opened
      the userfaultfd to resolve the userfaults.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad465cae
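From userland, resolving a fault with these two ioctls might look roughly like the sketch below. It assumes a Linux system that ships `<linux/userfaultfd.h>`, and that `uffd` is a userfaultfd descriptor on which the UFFDIO_API handshake and a range registration have already succeeded. The helper names `uffd_copy_page` and `uffd_zero_page` are hypothetical, and the addresses and length are expected to be page aligned.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Atomically fill the range at dst with len bytes taken from src.
 * Returns 0 on success or a negative errno value.
 */
int uffd_copy_page(int uffd, void *dst, const void *src, size_t len)
{
    struct uffdio_copy copy;

    memset(&copy, 0, sizeof(copy));
    copy.dst = (uint64_t)(uintptr_t)dst;
    copy.src = (uint64_t)(uintptr_t)src;
    copy.len = len;
    copy.mode = 0; /* default mode: wake the faulting thread afterwards */

    if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
        return -errno;
    return 0;
}

/*
 * Map zero pages over the range starting at dst.
 * Returns 0 on success or a negative errno value.
 */
int uffd_zero_page(int uffd, void *dst, size_t len)
{
    struct uffdio_zeropage zp;

    memset(&zp, 0, sizeof(zp));
    zp.range.start = (uint64_t)(uintptr_t)dst;
    zp.range.len = len;
    zp.mode = 0; /* default mode: wake the faulting thread afterwards */

    if (ioctl(uffd, UFFDIO_ZEROPAGE, &zp) == -1)
        return -errno;
    return 0;
}
```

A fault-handling thread would typically call one of these with the page-aligned faulting address it obtained from reading the userfaultfd. The sketch ignores the partial-progress fields that the kernel writes back into the structures.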
    • A
      userfaultfd: solve the race between UFFDIO_COPY|ZEROPAGE and read · 8d2afd96
      Andrea Arcangeli committed
      Solve in-kernel the race between UFFDIO_COPY|ZEROPAGE and
      userfaultfd_read if they are run on different threads simultaneously.
      
      Until now qemu solved the race in userland: the race was explicitly
      and intentionally left for userland to solve. However we can also
      solve it in kernel.
      
      Requiring all users to solve this race if they use two threads (one
      for the background transfer and one for the userfault reads) isn't
      very attractive from an API perspective. Furthermore, this allows
      removing a whole bunch of mutex and bitmap code from qemu, making it
      faster. The cost of __get_user_pages_fast should be insignificant
      considering it scales perfectly and the pagetables are already hot in
      the CPU cache, compared to the overhead in userland to maintain those
      structures.
      
      Applying this patch is backwards compatible with respect to the
      userfaultfd userland API, however reverting this change wouldn't be
      backwards compatible anymore.
      
      Without this patch, qemu in the background transfer thread has to read
      the old state and do UFFDIO_WAKE if old_state is MISSING but it
      became REQUESTED by the time it tries to set it to RECEIVED (signaling
      that the other side received a userfault).
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
                                              postcopy_pmi_change_state(MISSING, REQUESTED) -> REQUESTED
      
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> REQUESTED
                              /* check that no userfault raced with UFFDIO_COPY */
                              if (old_state == MISSING && tmp_state == REQUESTED)
                                      UFFDIO_WAKE from background thread
      
      And a second case where a UFFDIO_WAKE would be needed is in the userfault thread:
      
          vcpu                background_thr userfault_thr
          -----               -----          -----
          vcpu0 handle_mm_fault()
      
                              postcopy_place_page
                              read old_state -> MISSING
                              UFFDIO_COPY 0x7fb76a139000 (no wakeup, still pending)
                              tmp_state = postcopy_pmi_change_state(old_state, RECEIVED) -> RECEIVED
      
          vcpu0 fault at 0x7fb76a139000 enters handle_userfault
          poll() is kicked
      
                                              poll() -> POLLIN
                                              read() -> 0x7fb76a139000
      
                                              if (postcopy_pmi_change_state(MISSING, REQUESTED) == RECEIVED)
                                                      UFFDIO_WAKE from userfault thread
      
      This patch removes the need for both UFFDIO_WAKE and the associated
      per-page tristate as well.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d2afd96
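The per-page tristate that userland had to maintain before this patch can be sketched with a compare-and-swap. This is a hypothetical reconstruction based only on the two diagrams in the commit message: `change_state` is assumed to try an old-to-new transition and to return the state the page is in after the attempt, and the two predicate helpers are invented names for the checks shown in the diagrams.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* The three per-page states named in the commit message. */
enum page_state { MISSING, REQUESTED, RECEIVED };

/*
 * Assumed semantics: attempt the transition expected -> desired and
 * return the state after the attempt (desired on success, otherwise
 * whatever state the page was actually in).
 */
int change_state(atomic_int *state, int expected, int desired)
{
    int current = expected;

    if (atomic_compare_exchange_strong(state, &current, desired))
        return desired;
    return current;
}

/* First diagram: a userfault raced with UFFDIO_COPY in the background thread. */
bool background_must_wake(int old_state, int tmp_state)
{
    return old_state == MISSING && tmp_state == REQUESTED;
}

/* Second diagram: the page was already received when the fault was read. */
bool userfault_thread_must_wake(int result)
{
    return result == RECEIVED;
}
```

In both interleavings exactly one thread notices the race and issues the extra UFFDIO_WAKE; this bookkeeping is what resolving the race in the kernel makes unnecessary.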
    • A
      userfaultfd: allocate the userfaultfd_ctx cacheline aligned · 3004ec9c
      Andrea Arcangeli committed
      Use a proper slab cache to guarantee alignment.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3004ec9c
    • A
      userfaultfd: optimize read() and poll() to be O(1) · 15b726ef
      Andrea Arcangeli committed
      This makes read O(1), and poll, which was already O(1), becomes lockless.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15b726ef
    • A
      userfaultfd: wake pending userfaults · ba85c702
      Andrea Arcangeli committed
      This is an optimization but it's a userland visible one and it affects
      the API.
      
      The downside of this optimization is that if you call poll() and you
      get POLLIN, read(ufd) may still return -EAGAIN. The blocked userfault
      may be woken by a different thread before read(ufd) comes
      around. In short, this means that poll() isn't really usable if the
      userfaultfd is opened in blocking mode.
      
      Userfaults won't wait in "pending" state to be read anymore, and any
      UFFDIO_WAKE or similar operation that has the objective of waking
      userfaults after their resolution will wake all blocked userfaults
      for the resolved range, including those that haven't been read() by
      userland yet.
      
      The behavior of poll() becomes non-standard, but this obviates the
      need for "spurious" UFFDIO_WAKE and it lets the userland threads
      restart immediately without requiring a UFFDIO_WAKE. This is even
      more significant in case of repeated faults on the same address from
      multiple threads.
      
      This optimization is justified by the measurement that the number of
      spurious UFFDIO_WAKE accounts for between 5% and 10% of the total
      userfaults for heavy workloads, so it's worth optimizing those away.
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Acked-by: Pavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba85c702
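Since POLLIN may be followed by a read that finds nothing, an event loop would open the descriptor in non-blocking mode and treat EAGAIN as "nothing to do". The sketch below shows that pattern on a generic file descriptor so it can be tried on a pipe; with a real userfaultfd the buffer would be a `struct uffd_msg`. The helper names `set_nonblocking` and `try_read_event` are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Switch fd to non-blocking mode. Returns 0 or a negative errno value. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -errno;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -errno;
    return 0;
}

/*
 * Try to read one fixed-size message after poll() reported POLLIN.
 * Returns 1 if a full message was read, 0 if there was nothing to read
 * (for example because another thread already resolved the fault), or a
 * negative errno value on error.
 */
int try_read_event(int fd, void *msg, size_t msg_size)
{
    ssize_t n = read(fd, msg, msg_size);

    if (n == (ssize_t)msg_size)
        return 1;
    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return 0; /* spurious POLLIN: go back to poll() */
    if (n == -1)
        return -errno;
    return -EIO; /* short read or end of file */
}
```

A caller would loop on poll(), call `try_read_event()` on POLLIN, and go back to poll() whenever it returns 0.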