1. 04 Jul, 2018 1 commit
  2. 08 Jun, 2018 1 commit
    • userfaultfd: prevent non-cooperative events vs mcopy_atomic races · df2cc96e
      Committed by Mike Rapoport
      If a process monitored with userfaultfd changes its memory mappings or
      forks() at the same time as the uffd monitor fills the process memory with
      UFFDIO_COPY, the actual creation of page table entries and copying of
      the data in mcopy_atomic may happen either before or after the memory
      mapping modifications and there is no way for the uffd monitor to
      maintain a consistent view of the process memory layout.
      
      For instance, let's consider fork() running in parallel with
      userfaultfd_copy():
      
      process                          | uffd monitor
      ---------------------------------+------------------------------
      fork()                           | userfaultfd_copy()
      ...                              | ...
          dup_mmap()                   |     down_read(mmap_sem)
          down_write(mmap_sem)         |     /* create PTEs, copy data */
              dup_uffd()               |     up_read(mmap_sem)
              copy_page_range()        |
              up_write(mmap_sem)       |
              dup_uffd_complete()      |
                  /* notify monitor */ |
      
      If the userfaultfd_copy() takes the mmap_sem first, the new page(s) will
      be present by the time copy_page_range() is called and they will appear
      in the child's memory mappings.  However, if the fork() is the first to
      take the mmap_sem, the new pages won't be mapped in the child's address
      space.
      
      If the pages are not present and the child tries to access them, the monitor
      will get a page fault notification and everything is fine.  However, if
      the pages *are present*, the child can access them without uffd
      noticing.  And if we copy them into the child it'll see the wrong data.
      Since we are talking about background copy, we'd need to decide whether
      the pages should be copied or not regardless of #PF notifications.
      
      Since the userfaultfd monitor has no way to determine what the order was,
      let's disallow userfaultfd_copy in parallel with the non-cooperative
      events.  In such a case we return -EAGAIN and the uffd monitor can
      understand that userfaultfd_copy() clashed with a non-cooperative event
      and take an appropriate action.
      
      Link: http://lkml.kernel.org/r/1527061324-19949-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrei Vagin <avagin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
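The commit message above says that a copy which clashes with a non-cooperative event now fails with -EAGAIN. A minimal sketch of how a monitor might classify that result follows. It assumes a hypothetical `copy_fn` callback standing in for the real `ioctl(uffd, UFFDIO_COPY, ...)` call; the names `copy_outcome`, `classify_copy` and `fake_copy` are illustrative and are not taken from the kernel or from any existing monitor.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for ioctl(uffd, UFFDIO_COPY, &args):
 * returns 0 on success, or -1 with errno set on failure. */
typedef int (*copy_fn)(void *ctx);

enum copy_outcome {
    COPY_DONE,   /* pages were installed */
    COPY_RACED,  /* EAGAIN: clashed with a non-cooperative event */
    COPY_FAILED  /* any other error */
};

/* Issue one background copy and classify the result. On COPY_RACED the
 * caller is expected to read the pending event(s) from the uffd, update
 * its view of the memory layout, and only then decide whether to retry. */
static enum copy_outcome classify_copy(copy_fn do_copy, void *ctx)
{
    assert(do_copy != NULL);

    if (do_copy(ctx) == 0)
        return COPY_DONE;
    return errno == EAGAIN ? COPY_RACED : COPY_FAILED;
}

/* Fake copy used only for demonstration: fails with EAGAIN while
 * *pending_events > 0, then succeeds. */
static int fake_copy(void *ctx)
{
    int *pending_events = ctx;

    if (*pending_events > 0) {
        errno = EAGAIN;
        return -1;
    }
    return 0;
}
```

Keeping the classification separate from the retry policy leaves the "appropriate action" mentioned in the commit message up to the monitor: it may retry, skip the range, or re-read the layout first.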
  3. 12 Feb, 2018 1 commit
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
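The note about POLL* and EPOLL* values being "almost" identical can be checked from userspace. The sketch below compares the constants exported by the C library headers `<poll.h>` and `<sys/epoll.h>`. This is an assumption-laden illustration: these are the libc definitions, not the kernel-internal ones the commit rewrites, and the helper names `flag_pair` and `count_mismatches` are hypothetical.

```c
#include <poll.h>
#include <stddef.h>
#include <sys/epoll.h>

/* One POLL* / EPOLL* pair as seen through the C library headers. */
struct flag_pair {
    const char *name;
    unsigned int poll_value;
    unsigned int epoll_value;
};

static const struct flag_pair flag_pairs[] = {
    { "IN",     POLLIN,     EPOLLIN     },
    { "OUT",    POLLOUT,    EPOLLOUT    },
    { "PRI",    POLLPRI,    EPOLLPRI    },
    { "ERR",    POLLERR,    EPOLLERR    },
    { "HUP",    POLLHUP,    EPOLLHUP    },
    /* The remaining POLL* names depend on feature-test macros, so they
     * are only compared when the headers actually expose them. */
#ifdef POLLRDNORM
    { "RDNORM", POLLRDNORM, EPOLLRDNORM },
#endif
#ifdef POLLRDBAND
    { "RDBAND", POLLRDBAND, EPOLLRDBAND },
#endif
#ifdef POLLWRNORM
    { "WRNORM", POLLWRNORM, EPOLLWRNORM },
#endif
#ifdef POLLWRBAND
    { "WRBAND", POLLWRBAND, EPOLLWRBAND },
#endif
};

/* Count the pairs whose two values differ on the build architecture. */
static size_t count_mismatches(void)
{
    size_t n = sizeof(flag_pairs) / sizeof(flag_pairs[0]);
    size_t mismatches = 0;

    for (size_t i = 0; i < n; i++)
        if (flag_pairs[i].poll_value != flag_pairs[i].epoll_value)
            mismatches++;
    return mismatches;
}
```

The result depends on the architecture the code is built for, which is exactly the point the commit message makes.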
  4. 01 Feb, 2018 2 commits
  5. 05 Jan, 2018 1 commit
  6. 28 Nov, 2017 1 commit
  7. 16 Nov, 2017 1 commit
  8. 25 Oct, 2017 1 commit
    • locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns... · 6aa7de05
      Committed by Mark Rutland
      locking/atomics: COCCINELLE/treewide: Convert trivial ACCESS_ONCE() patterns to READ_ONCE()/WRITE_ONCE()
      
      Please do not apply this to mainline directly, instead please re-run the
      coccinelle script shown below and apply its output.
      
      For several reasons, it is desirable to use {READ,WRITE}_ONCE() in
      preference to ACCESS_ONCE(), and new code is expected to use one of the
      former. So far, there's been no reason to change most existing uses of
      ACCESS_ONCE(), as these aren't harmful, and changing them results in
      churn.
      
      However, for some features, the read/write distinction is critical to
      correct operation. To distinguish these cases, separate read/write
      accessors must be used. This patch migrates (most) remaining
      ACCESS_ONCE() instances to {READ,WRITE}_ONCE(), using the following
      coccinelle script:
      
      ----
      // Convert trivial ACCESS_ONCE() uses to equivalent READ_ONCE() and
      // WRITE_ONCE()
      
      // $ make coccicheck COCCI=/home/mark/once.cocci SPFLAGS="--include-headers" MODE=patch
      
      virtual patch
      
      @ depends on patch @
      expression E1, E2;
      @@
      
      - ACCESS_ONCE(E1) = E2
      + WRITE_ONCE(E1, E2)
      
      @ depends on patch @
      expression E;
      @@
      
      - ACCESS_ONCE(E)
      + READ_ONCE(E)
      ----
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: davem@davemloft.net
      Cc: linux-arch@vger.kernel.org
      Cc: mpe@ellerman.id.au
      Cc: shuah@kernel.org
      Cc: snitzer@redhat.com
      Cc: thor.thayer@linux.intel.com
      Cc: tj@kernel.org
      Cc: viro@zeniv.linux.org.uk
      Cc: will.deacon@arm.com
      Link: http://lkml.kernel.org/r/1508792849-3115-19-git-send-email-paulmck@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
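The two coccinelle rules above rewrite a store through ACCESS_ONCE() into WRITE_ONCE() and every remaining use into READ_ONCE(). A rough before/after sketch in userspace C follows. The three macros are simplified stand-ins built on a volatile cast (using the GCC/Clang `__typeof__` extension) and are not the kernel definitions; `shared_flag`, `old_style` and `new_style` are hypothetical names.

```c
/* Simplified userspace stand-ins, NOT the kernel definitions. */
#define ACCESS_ONCE(x)   (*(volatile __typeof__(x) *)&(x))
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int shared_flag;

/* Before the conversion: one macro serves both loads and stores, so the
 * direction of the access is only visible from the surrounding syntax. */
static int old_style(void)
{
    ACCESS_ONCE(shared_flag) = 1;     /* store */
    return ACCESS_ONCE(shared_flag);  /* load  */
}

/* After the conversion: the first rule turns the assignment into
 * WRITE_ONCE(), the second turns the remaining use into READ_ONCE(). */
static int new_style(void)
{
    WRITE_ONCE(shared_flag, 2);       /* store */
    return READ_ONCE(shared_flag);    /* load  */
}
```

Because the read stand-in goes through a `const volatile` pointer, accidentally assigning through READ_ONCE() fails to compile, which hints at why separate accessors make the read/write distinction checkable.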
  9. 04 Oct, 2017 1 commit
    • userfaultfd: non-cooperative: fix fork use after free · 384632e6
      Committed by Andrea Arcangeli
      When reading the event from the uffd, we put it on a temporary
      fork_event list to detect if we can still access it after releasing and
      retaking the event_wqh.lock.
      
      If fork aborts and removes the event from the fork_event all is fine as
      long as we're still in the userfault read context and fork_event head is
      still alive.
      
      We have to put the event, allocated in the fork kernel stack, back from
      fork_event list-head to the event_wqh head, before returning from
      userfaultfd_ctx_read, because the fork_event head lifetime is limited to
      the userfaultfd_ctx_read stack lifetime.
      
      Forgetting to move the event back to its event_wqh place then results in
      __remove_wait_queue(&ctx->event_wqh, &ewq->wq); in
      userfaultfd_event_wait_completion to remove it from a head that has been
      already freed from the reader stack.
      
      This could only happen if resolve_userfault_fork failed (for example if
      there are no file descriptors available to allocate the fork uffd).  If
      it succeeded it was put back correctly.
      
      Furthermore, after find_userfault_evt receives a fork event, the forked
      userfault context in fork_nctx and uwq->msg.arg.reserved.reserved1 can
      be released by the fork thread as soon as the event_wqh.lock is
      released.  Taking a reference on the fork_nctx before dropping the lock
      prevents a use after free in resolve_userfault_fork().
      
      If the fork side aborted and it already released everything, we still
      try to succeed resolve_userfault_fork(), if possible.
      
      Fixes: 893e26e6 ("userfaultfd: non-cooperative: Add fork() event")
      Link: http://lkml.kernel.org/r/20170920180413.26713-1-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Mark Rutland <mark.rutland@arm.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
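The second half of the fix above, taking a reference on the forked context before dropping the lock, is a general pattern. The userspace sketch below illustrates it with a pthread mutex and a C11 atomic counter. Everything here (`struct ctx`, `event_lock`, `pending`, `reader_take_pending`, `owner_abort`) is a hypothetical model of the idea, not the kernel code.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for the forked userfault context. */
struct ctx {
    atomic_int refcount;
};

static pthread_mutex_t event_lock = PTHREAD_MUTEX_INITIALIZER;
static struct ctx *pending;  /* published and unpublished under event_lock */

static struct ctx *ctx_new(void)
{
    struct ctx *c = malloc(sizeof(*c));

    if (c)
        atomic_init(&c->refcount, 1);  /* the owner's reference */
    return c;
}

static void ctx_get(struct ctx *c)
{
    atomic_fetch_add(&c->refcount, 1);
}

/* Returns 1 if this call dropped the last reference and freed the object. */
static int ctx_put(struct ctx *c)
{
    if (atomic_fetch_sub(&c->refcount, 1) == 1) {
        free(c);
        return 1;
    }
    return 0;
}

/* Reader side: pin the object while the lock is still held, so it stays
 * valid after the lock is released. */
static struct ctx *reader_take_pending(void)
{
    struct ctx *c;

    pthread_mutex_lock(&event_lock);
    c = pending;
    if (c)
        ctx_get(c);
    pthread_mutex_unlock(&event_lock);
    return c;
}

/* Owner side (models an aborted fork): unpublish the object and drop the
 * owner's reference. */
static int owner_abort(void)
{
    struct ctx *c;

    pthread_mutex_lock(&event_lock);
    c = pending;
    pending = NULL;
    pthread_mutex_unlock(&event_lock);
    return c ? ctx_put(c) : 0;
}
```

If the reader looked up `pending` under the lock but took its reference only after unlocking, `owner_abort()` could free the object in between, which is the kind of use after free the commit describes.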
  10. 09 Sep, 2017 1 commit
  11. 07 Sep, 2017 4 commits
  12. 11 Aug, 2017 1 commit
  13. 10 Aug, 2017 1 commit
  14. 03 Aug, 2017 2 commits
  15. 07 Jul, 2017 2 commits
    • mm/hugetlb: add size parameter to huge_pte_offset() · 7868a208
      Committed by Punit Agrawal
      A poisoned or migrated hugepage is stored as a swap entry in the page
      tables.  On architectures that support hugepages consisting of
      contiguous page table entries (such as on arm64) this leads to ambiguity
      in determining the page table entry to return in huge_pte_offset() when
      a poisoned entry is encountered.
      
      Let's remove the ambiguity by adding a size parameter to convey
      additional information about the requested address.  Also fixup the
      definition/usage of huge_pte_offset() throughout the tree.
      
      Link: http://lkml.kernel.org/r/20170522133604.11392-4-punit.agrawal@arm.com
      Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
      Acked-by: Steve Capper <steve.capper@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: James Hogan <james.hogan@imgtec.com> (odd fixer:METAG ARCHITECTURE)
      Cc: Ralf Baechle <ralf@linux-mips.org> (supporter:MIPS)
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • fs/userfaultfd.c: drop dead code · f93ae364
      Committed by Mike Rapoport
      The calculations of start and end in the __wake_userfault function are
      not used and can be removed.
      
      Link: http://lkml.kernel.org/r/1494930917-3134-1-git-send-email-rppt@linux.vnet.ibm.com
      Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 20 Jun, 2017 2 commits
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Committed by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there's a number of wait-queue users where the lists are
      not for 'tasks' but other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... it was pretty unclear (to me) what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
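To make the renamed fields from the commit above concrete, here is a small userspace sketch of a wait queue whose head embeds a `head` list node and whose entries embed an `entry` list node. The list helpers and the struct layouts are simplified stand-ins written for this example, not the kernel's `<linux/list.h>` or `<linux/wait.h>`; `count_waiters`, `is_last_waiter` and `entry_of` are hypothetical names.

```c
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel idea. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h->prev = h;
}

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Simplified stand-ins using the field names introduced by the commit. */
struct wait_queue_head {
    struct list_head head;   /* was: task_list */
};

struct wait_queue_entry {
    int id;
    struct list_head entry;  /* was: task_list */
};

/* Recover the entry from its embedded list node. */
#define entry_of(ptr) \
    ((struct wait_queue_entry *)((char *)(ptr) - \
        offsetof(struct wait_queue_entry, entry)))

/* Walk from the head until the head is reached again. */
static int count_waiters(struct wait_queue_head *wq)
{
    int n = 0;

    for (struct list_head *pos = wq->head.next; pos != &wq->head; pos = pos->next)
        n++;
    return n;
}

/* True when nothing follows this entry before the head comes around. */
static int is_last_waiter(struct wait_queue_head *wq, struct wait_queue_entry *w)
{
    return w->entry.next == &wq->head;
}
```

With distinct names, an expression such as `w->entry.next == &wq->head` states directly which side is the queue and which is the waiter.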
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Committed by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 17 Jun, 2017 1 commit
  18. 08 Apr, 2017 1 commit
  19. 10 Mar, 2017 8 commits
  20. 02 Mar, 2017 2 commits
  21. 28 Feb, 2017 2 commits
  22. 25 Feb, 2017 3 commits