1. 14 October 2020 (25 commits)
  2. 12 October 2020 (2 commits)
  3. 11 October 2020 (1 commit)
    • mm/khugepaged: fix filemap page_to_pgoff(page) != offset · 033b5d77
      Committed by Hugh Dickins
      There have been elusive reports of filemap_fault() hitting its
      VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
      with CONFIG_READ_ONLY_THP_FOR_FS=y.
      
      Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
      CONFIG_NUMA not set, and he has analyzed it down to how khugepaged
      without NUMA reuses the same huge page after collapse_file() failed
      (whereas with NUMA it targets its allocation to the respective node
      each time); most of us were usually testing with CONFIG_NUMA=y kernels.
      
      collapse_file(old start)
        new_page = khugepaged_alloc_page(hpage)
        __SetPageLocked(new_page)
        new_page->index = start // hpage->index=old offset
        new_page->mapping = mapping
        xas_store(&xas, new_page)
      
                                filemap_fault
                                  page = find_get_page(mapping, offset)
                                  // if offset falls inside hpage then
                                  // compound_head(page) == hpage
                                  lock_page_maybe_drop_mmap()
                                    __lock_page(page)
      
        // collapse fails
        xas_store(&xas, old page)
        new_page->mapping = NULL
        unlock_page(new_page)
      
      collapse_file(new start)
        new_page = khugepaged_alloc_page(hpage)
        __SetPageLocked(new_page)
        new_page->index = start // hpage->index=new offset
        new_page->mapping = mapping // mapping becomes valid again
      
                                  // since compound_head(page) == hpage
                                  // page_to_pgoff(page) got changed
                                  VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)
      
      An initial patch replaced __SetPageLocked() by lock_page(), which did
      fix the race Suren illustrates above.  But testing showed that it's
      not good enough: if the racing task's __lock_page() gets delayed long
      after its find_get_page(), then it may follow collapse_file(new start)'s
      successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.
      
      It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
      check and retry (as is done for mapping), with similar relaxations in
      find_lock_entry() and pagecache_get_page(): but it's not obvious what
      else might get caught out; and khugepaged non-NUMA appears to be unique
      in exposing a page to page cache, then revoking, without going through
      a full cycle of freeing before reuse.
      
      Instead, have non-NUMA khugepaged_prealloc_page() release the old page
      if anyone else has a reference to it (1% of cases when I tested).
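      
      A minimal sketch of that idea in the !CONFIG_NUMA variant of
      khugepaged_prealloc_page() (illustrative, not the verbatim patch):
      
        /* !CONFIG_NUMA: don't reuse the preallocated hpage if anyone
         * else (e.g. a racing filemap_fault()) still holds a reference
         * left over from the failed collapse_file().
         */
        if (*hpage && page_count(*hpage) > 1) {
                put_page(*hpage);
                *hpage = NULL;  /* allocate a fresh huge page next round */
        }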
      
      Although never reported on huge tmpfs, I believe its find_lock_entry()
      has been at similar risk; but huge tmpfs does not rely on khugepaged
      for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.
      Reported-by: Denis Lisov <dennis.lissov@gmail.com>
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
      Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
      Reported-by: Qian Cai <cai@lca.pw>
      Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
      Reported-and-analyzed-by: Suren Baghdasaryan <surenb@google.com>
      Fixes: 87c460a0 ("mm/khugepaged: collapse_shmem() without freezing new_page")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: stable@vger.kernel.org # v4.9+
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 09 October 2020 (1 commit)
    • mm: avoid early COW write protect games during fork() · f3c64eda
      Committed by Linus Torvalds
      In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
      for ptes") we write-protected the PTE before doing the page pinning
      check, in order to avoid a race with concurrent fast-GUP pinning (which
      doesn't take the mm semaphore or the page table lock).
      
      That trick doesn't actually work - it doesn't handle memory ordering
      properly, and doing so would be prohibitively expensive.
      
      It also isn't really needed.  While we're moving in the direction of
      allowing and supporting page pinning without marking the pinned area
      with MADV_DONTFORK, the fact is that we've never really supported this
      kind of odd "concurrent fork() and page pinning", and doing the
      serialization on a pte level is just wrong.
      
      We can add serialization with a per-mm sequence counter, so we know how
      to solve that race properly, but we'll do that at a more appropriate
      time.  Right now this just removes the write protect games.
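      
      A hedged sketch of what the pin check in copy_present_page() looks
      like once the write-protect games are gone (heavily simplified from
      mm/memory.c; surrounding plumbing omitted):
      
        static inline int
        copy_present_page(struct mm_struct *src_mm, struct vm_area_struct *vma,
                          pte_t pte, struct page *page /* , ... */)
        {
                if (!is_cow_mapping(vma->vm_flags))
                        return 1;       /* not COW: share the page as usual */
                /*
                 * Plain checks, no prior ptep_set_wrprotect(): if the page
                 * may be pinned, fall through to the early-copy path.
                 */
                if (likely(!atomic_read(&src_mm->has_pinned)))
                        return 1;
                if (likely(!page_maybe_dma_pinned(page)))
                        return 1;
                /* ... copy the page into the child instead of sharing ... */
                return 0;
        }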
      
      It also turns out that the write protect games actually break on Power,
      as reported by Aneesh Kumar:
      
       "Architecture like ppc64 expects set_pte_at to be not used for updating
        a valid pte. This is further explained in commit 56eecdb9 ("mm:
        Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
      
      and the code triggered a warning there:
      
        WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
        Call Trace:
          copy_present_page mm/memory.c:857 [inline]
          copy_present_pte mm/memory.c:899 [inline]
          copy_pte_range mm/memory.c:1014 [inline]
          copy_pmd_range mm/memory.c:1092 [inline]
          copy_pud_range mm/memory.c:1127 [inline]
          copy_p4d_range mm/memory.c:1150 [inline]
          copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
          dup_mmap kernel/fork.c:592 [inline]
          dup_mm+0x77c/0xab0 kernel/fork.c:1355
          copy_mm kernel/fork.c:1411 [inline]
          copy_process+0x1f00/0x2740 kernel/fork.c:2070
          _do_fork+0xc4/0x10b0 kernel/fork.c:2429
      
      Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
      Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Leon Romanovsky <leonro@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 04 October 2020 (2 commits)
  6. 03 October 2020 (3 commits)
    • mm: remove compat_process_vm_{readv,writev} · c3973b40
      Committed by Christoph Hellwig
      Now that import_iovec handles compat iovecs, the native syscalls
      can be used for the compat case as well.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • iov_iter: transparently handle compat iovecs in import_iovec · 89cd35c5
      Committed by Christoph Hellwig
      Use in_compat_syscall() to import either native or compat iovecs, and
      remove the now-superfluous compat_import_iovec.
      
      This removes the need for special compat logic in most callers, and
      the remaining ones can still be simplified by using __import_iovec
      with a bool compat parameter.
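      
      A minimal sketch of the resulting dispatch (close to what this commit
      adds in lib/iov_iter.c, but treat the details as illustrative):
      
        ssize_t import_iovec(int type, const struct iovec __user *uvec,
                             unsigned nr_segs, unsigned fast_segs,
                             struct iovec **iovp, struct iov_iter *i)
        {
                /* One entry point for native and compat callers alike. */
                return __import_iovec(type, uvec, nr_segs, fast_segs,
                                      iovp, i, in_compat_syscall());
        }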
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • iov_iter: refactor rw_copy_check_uvector and import_iovec · bfdc5970
      Committed by Christoph Hellwig
      Split rw_copy_check_uvector into two new helpers with more sensible
      calling conventions:
      
       - iovec_from_user copies an iovec from userspace, either into the
         provided stack buffer if it fits or into a newly allocated buffer.
         Returns the actually used iovec.  It also verifies that each
         iov_len fits in a signed type, and handles compat iovecs if the
         compat flag is set.
       - __import_iovec consolidates the native and compat versions of
         import_iovec.  It calls iovec_from_user, then validates that each
         iovec actually points to user addresses, and ensures the total
         length doesn't overflow.
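      
      A sketch of the two helpers' calling conventions as described above
      (the exact types are best-effort approximations):
      
        /* Copy an iovec array in from userspace, native or compat layout. */
        struct iovec *iovec_from_user(const struct iovec __user *uvec,
                        unsigned long nr_segs, unsigned long fast_segs,
                        struct iovec *fast_iov, bool compat);
      
        /* Common core shared by the native and compat import_iovec(). */
        ssize_t __import_iovec(int type, const struct iovec __user *uvec,
                        unsigned nr_segs, unsigned fast_segs,
                        struct iovec **iovp, struct iov_iter *i, bool compat);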
      
      This has two major implications:
      
       - the access_process_vm case loses the total length checking, which
         wasn't required anyway, given that each call receives two iovecs
         for the local and remote side of the operation, and it verifies
         the total length on the local side already.
       - instead of a single loop there are now two loops over the iovecs.
         Given that the iovecs are cache-hot, this doesn't make a major
         difference.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  7. 29 September 2020 (2 commits)
    • io_uring: fix async buffered reads when readahead is disabled · c8d317aa
      Committed by Hao Xu
      The async buffered reads feature does not work when readahead is
      turned off.  There are two problems:
      
      - when retrying in io_read(), not only the IOCB_WAITQ flag but also
        the IOCB_NOWAIT flag is still set, which makes the read take the
        would_block path in generic_file_buffered_read() and return -EAGAIN.
        After that, the io-wq thread work is queued, and the read is later
        done asynchronously the old way.
      
      - even if we remove IOCB_NOWAIT when retrying, the feature still does
        not run properly, since generic_file_buffered_read() goes to
        lock_page_killable() after calling mapping->a_ops->readpage() to do
        the IO, thus causing the process to sleep.
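      
      A hedged sketch of the two corresponding fixes (abridged from the
      description above; not the exact diff):
      
        /* io_read() retry (fs/io_uring.c): arm the waitqueue callback and
         * clear IOCB_NOWAIT so the retry doesn't bail out with -EAGAIN. */
        kiocb->ki_flags |= IOCB_WAITQ;
        kiocb->ki_flags &= ~IOCB_NOWAIT;
      
        /* generic_file_buffered_read() (mm/filemap.c): after ->readpage(),
         * wait for the page asynchronously instead of sleeping. */
        if (iocb->ki_flags & IOCB_WAITQ)
                error = lock_page_async(page, iocb->ki_waitq);
        else
                error = lock_page_killable(page);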
      
      Fixes: 1a0a7853 ("mm: support async buffered reads in generic_file_buffered_read()")
      Fixes: 3b2a4439 ("io_uring: get rid of kiocb_wait_page_queue_init()")
      Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • mm: do not rely on mm == current->mm in __get_user_pages_locked · a4d63c37
      Committed by Jason A. Donenfeld
      It seems likely this block was pasted from internal_get_user_pages_fast,
      which is not passed an mm struct and therefore uses current's.  But
      __get_user_pages_locked is passed an explicit mm, and current->mm is not
      always valid.  This was hit when called from i915, which uses:
      
        pin_user_pages_remote->
          __get_user_pages_remote->
            __gup_longterm_locked->
              __get_user_pages_locked
      
      Before, this would lead to an OOPS:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000064
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        CPU: 10 PID: 1431 Comm: kworker/u33:1 Tainted: P S   U     O      5.9.0-rc7+ #140
        Hardware name: LENOVO 20QTCTO1WW/20QTCTO1WW, BIOS N2OET47W (1.34 ) 08/06/2020
        Workqueue: i915-userptr-acquire __i915_gem_userptr_get_pages_worker [i915]
        RIP: 0010:__get_user_pages_remote+0xd7/0x310
        Call Trace:
         __i915_gem_userptr_get_pages_worker+0xc8/0x260 [i915]
         process_one_work+0x1ca/0x390
         worker_thread+0x48/0x3c0
         kthread+0x114/0x130
         ret_from_fork+0x1f/0x30
        CR2: 0000000000000064
      
      This commit fixes the problem by using the mm pointer passed to the
      function rather than the bogus one in current.
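      
      The fix amounts to one line in __get_user_pages_locked() (mm/gup.c),
      sketched here from the description above:
      
        if (flags & FOLL_PIN)
                atomic_set(&mm->has_pinned, 1); /* was: &current->mm->has_pinned */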
      
      Fixes: 008cfe44 ("mm: Introduce mm_struct.has_pinned")
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Reported-by: Harald Arnesen <harald@skogtun.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 28 September 2020 (4 commits)
    • mm/thp: Split huge pmds/puds if they're pinned when fork() · d042035e
      Committed by Peter Xu
      Pinned pages shouldn't be write-protected when fork() happens, because
      follow-up copy-on-write on these pages could cause the pinned pages to
      be replaced by random newly allocated pages.
      
      For huge PMDs, we split the huge pmd if pinning is detected, so that
      future handling will be done at the PTE level (with our latest changes,
      each of the small pages will be copied).  We achieve this by letting
      copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall
      through in copy_pmd_range() and finally land in the next
      copy_pte_range() call.
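      
      A hedged sketch of that path in copy_huge_pmd() (simplified; the lock
      dropping is elided):
      
        if (unlikely(is_cow_mapping(vma->vm_flags) &&
                     atomic_read(&src_mm->has_pinned) &&
                     page_maybe_dma_pinned(src_page))) {
                /* Split and tell the caller to retry at the PTE level,
                 * where each small page gets copied for the child. */
                __split_huge_pmd(vma, src_pmd, addr, false, NULL);
                return -EAGAIN;
        }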
      
      Huge PUDs are even more special: so far they do not support anonymous
      pages.  But the same approach as for huge PMDs can be used, even though
      splitting a huge PUD means erasing the PUD entries.  It guarantees
      that the follow-up fault-ins will remap the same pages in either
      parent or child later.
      
      This might not be the most efficient way, but it should be easy and
      clean enough.  It should be fine, since we're tackling a very rare
      case, just to make sure userspace that pinned some THPs will still
      work without MADV_DONTFORK even after fork().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Do early cow for pinned pages during fork() for ptes · 70e806e4
      Committed by Peter Xu
      This allows copy_pte_range() to do early cow if the pages were pinned on
      the source mm.
      
      Currently we don't have an accurate way to know whether a page is
      pinned or not.  The only thing we have is page_maybe_dma_pinned().
      However, that's good enough for now, especially with the newly added
      mm->has_pinned flag, which makes sure we won't affect processes that
      never pinned any pages.
      
      It would be easier if we could do GFP_KERNEL allocations within
      copy_one_pte().  Unluckily, we can't, because we hold the page table
      locks for both the parent and child processes.  So the page
      allocation needs to be done outside copy_one_pte().
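      
      A hedged sketch of the resulting retry pattern in copy_pte_range()
      (heavily simplified; page_copy_prealloc() is the helper this series
      adds for the GFP_KERNEL allocation):
      
        again:
                /* ... walk the ptes with both page table locks held ... */
                ret = copy_present_pte(/* ..., */ &prealloc);
                if (ret == -EAGAIN)
                        break;  /* need a page, but can't sleep under ptl */
                /* ... unlock both page table locks ... */
                if (ret == -EAGAIN) {
                        prealloc = page_copy_prealloc(src_mm, vma, addr);
                        if (!prealloc)
                                return -ENOMEM;
                        goto again;     /* retry with the page in hand */
                }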
      
      There is some trickery in copy_present_pte(), mainly the wrprotect
      trick to block concurrent fast-GUP.  The comments in the function
      explain it better in place.
      
      Oleg Nesterov reported a (probably harmless) bug during review: we
      didn't reset entry.val properly in copy_pte_range(), so there is
      potentially a chance to call add_swap_count_continuation() multiple
      times on the same swp entry.  However, that should be harmless, since
      even if it happens, the same function (add_swap_count_continuation())
      will return directly, noticing that there is enough space for the swp
      counter.  So instead of a standalone stable patch, it is touched up
      in this patch directly.
      
      Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3
      Committed by Peter Xu
      This prepares for the future work to trigger early cow on pinned pages
      during fork().
      
      No functional change intended.
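      
      The plumbing is just a signature change, roughly:
      
        /* before */
        int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                            struct vm_area_struct *vma);
        /* after: also hand down the child's new vma */
        int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                            struct vm_area_struct *vma, struct vm_area_struct *new);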
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: Introduce mm_struct.has_pinned · 008cfe44
      Committed by Peter Xu
      (Commit message largely collected from Jason Gunthorpe)
      
      Reduce the chance of false positives from page_maybe_dma_pinned() by
      keeping track of whether the mm_struct has ever been used with
      pin_user_pages().  This allows cases that might drive up the page
      ref_count to avoid any penalty from handling dma_pinned pages.
      
      Future work is planned, to provide a more sophisticated solution, likely
      to turn it into a real counter.  For now, make it atomic_t but use it as
      a boolean for simplicity.
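      
      A minimal sketch of the flag and its two touch points (abridged from
      the description; treat as illustrative):
      
        struct mm_struct {
                /* ... */
                atomic_t has_pinned;    /* ever used with pin_user_pages()? */
        };
      
        /* Pin side (mm/gup.c): mark the mm once. */
        if ((flags & FOLL_PIN) && !atomic_read(&mm->has_pinned))
                atomic_set(&mm->has_pinned, 1);
      
        /* Fork side: only then pay for the page_maybe_dma_pinned() check. */
        if (atomic_read(&src_mm->has_pinned) && page_maybe_dma_pinned(page))
                /* treat as pinned: copy early instead of COW-sharing */;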
      Suggested-by: Jason Gunthorpe <jgg@ziepe.ca>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>