1. 05 Jun 2021, 1 commit
  2. 07 May 2021, 2 commits
  3. 01 May 2021, 5 commits
  4. 31 Mar 2021, 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · e720e7d0
      Authored by Ilya Lipnitskiy
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as the one described above, run init_zero_pfn()
      at early_initcall level.  Depending on the architecture, ZERO_PAGE
      is either constant or gets initialized even earlier, at paging_init,
      so there is no issue with initializing zero_pfn earlier.
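
      The change itself is tiny; a sketch of the resulting code in
      mm/memory.c (simplified, the fix being only the initcall level):

        static int __init init_zero_pfn(void)
        {
                zero_pfn = page_to_pfn(ZERO_PAGE(0));
                return 0;
        }
        early_initcall(init_zero_pfn);  /* was: core_initcall(init_zero_pfn) */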
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 14 Mar 2021, 2 commits
    • mm/userfaultfd: fix memory corruption due to writeprotect · 6ce64428
      Authored by Nadav Amit
      The userfaultfd self-test fails occasionally, indicating memory corruption.
      
      Analyzing this problem reveals a real bug: mmap_lock is only taken
      for read in mwriteprotect_range(), which defers TLB flushes, and
      wp_page_copy() gives insufficient consideration to concurrent
      deferred TLB flushes.  Although the PTE is flushed from the TLBs in
      wp_page_copy(), this flush takes place after the copy has already
      been performed, so the page can change between the time of the copy
      and the time at which the PTE is flushed.
      
      To make matters worse, memory-unprotection using userfaultfd also
      poses a problem.  Although memory unprotection is logically a
      promotion of PTE permissions, and therefore should not require a TLB
      flush, the current userfaultfd code might actually cause a demotion
      of the architectural PTE permission: when userfaultfd_writeprotect()
      unprotects a memory region, it unintentionally *clears* the RW-bit
      if it was already set.  Note that unprotecting a PTE that is not
      write-protected is a valid use case: the userfaultfd monitor might
      ask to unprotect a region that holds both write-protected and
      write-unprotected PTEs.
      
      The scenario that happens in selftests/vm/userfaultfd is as follows:
      
      cpu0				cpu1			cpu2
      ----				----			----
      							[ Writable PTE
      							  cached in TLB ]
      userfaultfd_writeprotect()
      [ write-*unprotect* ]
      mwriteprotect_range()
      mmap_read_lock()
      change_protection()
      
      change_protection_range()
      ...
      change_pte_range()
      [ *clear* “write”-bit ]
      [ defer TLB flushes ]
      				[ page-fault ]
      				...
      				wp_page_copy()
      				 cow_user_page()
      				  [ copy page ]
      							[ write to old
      							  page ]
      				...
      				 set_pte_at_notify()
      
      A similar scenario can happen:
      
      cpu0		cpu1		cpu2		cpu3
      ----		----		----		----
      						[ Writable PTE
      				  		  cached in TLB ]
      userfaultfd_writeprotect()
      [ write-protect ]
      [ deferred TLB flush ]
      		userfaultfd_writeprotect()
      		[ write-unprotect ]
      		[ deferred TLB flush]
      				[ page-fault ]
      				wp_page_copy()
      				 cow_user_page()
      				 [ copy page ]
      				 ...		[ write to page ]
      				set_pte_at_notify()
      
      This race has existed since commit 292924b2 ("userfaultfd: wp: apply
      _PAGE_UFFD_WP bit").  Yet, as Yu Zhao pointed out, these races only
      became apparent after commit 09854ba9 ("mm: do_wp_page()
      simplification"), which made wp_page_copy() more likely to take
      place, specifically when page_count(page) > 1.
      
      To resolve the aforementioned races, check whether there are pending
      flushes on uffd-write-protected VMAs, and if there are, perform a flush
      before doing the COW.
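
      In sketch form, the fix amounts to a check at the top of
      do_wp_page() in mm/memory.c (simplified from the patch):

        static vm_fault_t do_wp_page(struct vm_fault *vmf)
        {
                struct vm_area_struct *vma = vmf->vma;

                /*
                 * Userfaultfd write-protect can defer flushes.  Make
                 * sure the TLB is flushed in that case before copying.
                 */
                if (unlikely(userfaultfd_wp(vma) &&
                             mm_tlb_flush_pending(vma->vm_mm)))
                        flush_tlb_page(vma, vmf->address);

                /* ... existing reuse/copy logic ... */
        }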
      
      Further optimizations will follow to avoid unnecessary PTE
      write-protection and TLB flushes during uffd-write-unprotect.
      
      Link: https://lkml.kernel.org/r/20210304095423.3825684-1-namit@vmware.com
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Suggested-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Tested-by: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: <stable@vger.kernel.org>	[5.9+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: introduce page_needs_cow_for_dma() for deciding whether cow · 97a7e473
      Authored by Peter Xu
      We've got quite a few places (pte, pmd, pud) that explicitly check
      whether we should break the cow right now during fork().  It's
      easier to provide a helper, especially before we do the same thing
      for hugetlbfs.

      Since we'll reference is_cow_mapping() in mm.h, move it there too.
      It actually suits mm.h better, since internal.h is mm/-only while
      mm.h is exported to the whole kernel.  With that, we should expect
      another patch to use is_cow_mapping() wherever possible across the
      kernel, since we use it quite a lot but always as raw code against
      the VM_* flags.
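
      A sketch of the new helper (simplified from the patch; it lives in
      include/linux/mm.h):

        static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
                                                  struct page *page)
        {
                if (!is_cow_mapping(vma->vm_flags))
                        return false;
                if (!atomic_read(&vma->vm_mm->has_pinned))
                        return false;
                /* Best effort: may false-positive on high refcounts,
                 * which only costs an unnecessary early copy. */
                return page_maybe_dma_pinned(page);
        }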
      
      Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Vetter <daniel@ffwll.ch>
      Cc: David Airlie <airlied@linux.ie>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Gal Pressman <galpress@amazon.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Roland Scheidegger <sroland@vmware.com>
      Cc: VMware Graphics <linux-graphics-maintainer@vmware.com>
      Cc: Wei Zhang <wzam@amazon.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 27 Feb 2021, 1 commit
  7. 25 Feb 2021, 3 commits
  8. 09 Feb 2021, 1 commit
    • mm: provide a saner PTE walking API for modules · 9fd6dad1
      Authored by Paolo Bonzini
      Currently, the follow_pfn function is exported for modules but
      follow_pte is not.  However, follow_pfn is very easy to misuse,
      because it does not provide protections (so most of its callers
      assume the page is writable!) and because it returns after having
      already unlocked the page table lock.
      
      Provide instead a simplified version of follow_pte that does
      not have the pmdpp and range arguments.  The older version
      survives as follow_invalidate_pte() for use by fs/dax.c.
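
      A sketch of the two resulting signatures (as assumed from the
      description; both return with *ptlp held on success):

        int follow_pte(struct mm_struct *mm, unsigned long address,
                       pte_t **ptepp, spinlock_t **ptlp);

        int follow_invalidate_pte(struct mm_struct *mm,
                                  unsigned long address,
                                  struct mmu_notifier_range *range,
                                  pte_t **ptepp, pmd_t **pmdpp,
                                  spinlock_t **ptlp);
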
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
  9. 30 Jan 2021, 2 commits
  10. 21 Jan 2021, 1 commit
    • mm: Pass 'address' to map to do_set_pte() and drop FAULT_FLAG_PREFAULT · 9d3af4b4
      Authored by Will Deacon
      Rather than modifying the 'address' field of the 'struct vm_fault'
      passed to do_set_pte(), leave that to identify the real faulting address
      and pass in the virtual address to be mapped by the new pte as a
      separate argument.
      
      This makes FAULT_FLAG_PREFAULT redundant, as a prefault entry can be
      identified simply by comparing the new address parameter with the
      faulting address, so remove the redundant flag at the same time.
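
      A minimal sketch of the new call shape (assumed from the
      description above):

        void do_set_pte(struct vm_fault *vmf, struct page *page,
                        unsigned long addr);

        /* Inside do_set_pte(), a prefault entry is now identified by
         * address comparison instead of FAULT_FLAG_PREFAULT: */
        bool prefault = vmf->address != addr;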
      
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Will Deacon <will@kernel.org>
  11. 20 Jan 2021, 2 commits
    • mm: Allow architectures to request 'old' entries when prefaulting · 46bdb427
      Authored by Will Deacon
      Commit 5c0a85fa ("mm: make faultaround produce old ptes") changed
      the "faultaround" behaviour to initialise prefaulted PTEs as 'old',
      since this avoids vmscan wrongly assuming that they are hot, despite
      having never been explicitly accessed by userspace. The change has been
      shown to benefit numerous arm64 micro-architectures (with hardware
      access flag) running Android, where both application launch latency and
      direct reclaim time are significantly reduced (by 10%+ and ~80%
      respectively).
      
      Unfortunately, commit 315d09bf ("Revert "mm: make faultaround
      produce old ptes"") reverted the change due to it being identified as
      the cause of a ~6% regression in unixbench on x86. Experiments on a
      variety of recent arm64 micro-architectures indicate that unixbench is
      not affected by the original commit, which appears to yield a 0-1%
      performance improvement.
      
      Since one size does not fit all for the initial state of prefaulted
      PTEs, introduce arch_wants_old_prefaulted_pte(), which allows an
      architecture to opt-in to 'old' prefaulted PTEs at runtime based on
      whatever criteria it may have.
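
      A sketch of the hook (assumed shape; the generic version defaults
      to young PTEs, and an architecture such as arm64 can override it
      based on hardware access-flag support):

        #ifndef arch_wants_old_prefaulted_pte
        static inline bool arch_wants_old_prefaulted_pte(void)
        {
                /* Default: prefaulted PTEs are mapped 'young'. */
                return false;
        }
        #endif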
      
      Cc: Jan Kara <jack@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Reported-by: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Will Deacon <will@kernel.org>
    • mm: Cleanup faultaround and finish_fault() codepaths · f9ce0be7
      Authored by Kirill A. Shutemov
      alloc_set_pte() has two users with different requirements: in the
      faultaround code it is called from an atomic context and the PTE
      page table has to be preallocated, while finish_fault() can sleep
      and allocate the page table as needed.

      The PTL locking rules are also strange, hard to follow, and
      overkill for finish_fault().
      
      Let's untangle the mess.  alloc_set_pte() is gone now and all
      locking is explicit.

      The price is some code duplication to handle huge pages in the
      faultaround path, but that should be fine given the overall
      improvement in readability.
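
      A rough sketch of the untangled finish_fault() (simplified and
      approximate; the point is that allocation and locking are explicit
      here rather than hidden inside the removed alloc_set_pte()):

        vm_fault_t finish_fault(struct vm_fault *vmf)
        {
                struct page *page = (vmf->flags & FAULT_FLAG_WRITE) ?
                                vmf->cow_page : vmf->page;
                vm_fault_t ret = 0;

                /* finish_fault() may sleep, so the PTE page table can
                 * simply be allocated here if it is still missing. */
                if (pmd_none(*vmf->pmd) &&
                    pte_alloc(vmf->vma->vm_mm, vmf->pmd))
                        return VM_FAULT_OOM;

                vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
                                               vmf->address, &vmf->ptl);
                /* Re-check under the PTL before installing the entry. */
                if (likely(pte_none(*vmf->pte)))
                        do_set_pte(vmf, page);
                else
                        ret = VM_FAULT_NOPAGE;
                pte_unmap_unlock(vmf->pte, vmf->ptl);
                return ret;
        }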
      
      Link: https://lore.kernel.org/r/20201229132819.najtavneutnf7ajp@box
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      [will: s/from from/from/ in comment; spotted by willy]
      Signed-off-by: Will Deacon <will@kernel.org>
  12. 12 Jan 2021, 1 commit
    • mm: Close race in generic_access_phys · 96667f8a
      Authored by Daniel Vetter
      Way back, it was a reasonable assumption that iomem mappings never
      change the pfn range they point at.  But this has changed:
      
      - gpu drivers dynamically manage their memory nowadays, invalidating
        ptes with unmap_mapping_range when buffers get moved
      
      - contiguous dma allocations have moved from dedicated carveouts to
        cma regions.  This means that if we miss the unmap, the pfn might
        contain pagecache or anon memory (well, anything allocated with
        __GFP_MOVABLE)
      
      - even /dev/mem now invalidates mappings when the kernel requests that
        iomem region when CONFIG_IO_STRICT_DEVMEM is set, see 3234ac66
        ("/dev/mem: Revoke mappings when a driver claims the region")
      
      Accessing pfns obtained from ptes without holding all the locks is
      therefore no longer a good idea. Fix this.
      
      Since ioremap might need to manipulate page tables too, we need to
      drop the pt lock and have a retry loop if we raced.
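
      A sketch of the resulting pattern (close to the shape of the fixed
      function, with details simplified): re-walk the page table after
      the ioremap and retry if the mapping changed while unlocked:

        int generic_access_phys(struct vm_area_struct *vma,
                                unsigned long addr, void *buf,
                                int len, int write)
        {
                resource_size_t phys_addr;
                int offset = offset_in_page(addr);
                void __iomem *maddr;
                pte_t *ptep, pte;
                spinlock_t *ptl;

        retry:
                if (follow_pte(vma->vm_mm, addr, &ptep, &ptl))
                        return -EINVAL;
                pte = *ptep;
                /* ioremap may itself touch page tables: drop the PTL. */
                pte_unmap_unlock(ptep, ptl);

                phys_addr = (resource_size_t)pte_pfn(pte) << PAGE_SHIFT;
                maddr = ioremap_prot(phys_addr, PAGE_ALIGN(len + offset),
                                     pgprot_val(pte_pgprot(pte)));
                if (!maddr)
                        return -ENOMEM;

                /* Re-walk: did the mapping change while unlocked? */
                if (follow_pte(vma->vm_mm, addr, &ptep, &ptl)) {
                        iounmap(maddr);
                        return -EINVAL;
                }
                if (!pte_same(pte, *ptep)) {
                        pte_unmap_unlock(ptep, ptl);
                        iounmap(maddr);
                        goto retry;
                }

                if (write)
                        memcpy_toio(maddr + offset, buf, len);
                else
                        memcpy_fromio(buf, maddr + offset, len);
                pte_unmap_unlock(ptep, ptl);
                iounmap(maddr);
                return len;
        }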
      
      While at it, also add kerneldoc and improve the comment for the
      vm_ops->access function. It's for accessing, not for moving the
      memory from iomem to system memory, as the old comment seemed to
      suggest.
      
      References: 28b2ee20 ("access_process_vm device memory infrastructure")
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dave Airlie <airlied@linux.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: linux-mm@kvack.org
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: linux-samsung-soc@vger.kernel.org
      Cc: linux-media@vger.kernel.org
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
      Link: https://patchwork.freedesktop.org/patch/msgid/20201127164131.2244124-8-daniel.vetter@ffwll.ch
  13. 30 Dec 2020, 1 commit
    • mm: generalise COW SMC TLB flushing race comment · 111fe718
      Authored by Nicholas Piggin
      I'm not sure if I'm completely missing something here, but AFAIKS the
      reference to the mysterious "COW SMC race" confuses the issue.  The
      original changelog and mailing list thread didn't help me either.
      
      This SMC race is where the problem was detected, but isn't the general
      problem bigger and more obvious: that the new PTE could be picked up at
      any time by any TLB while entries for the old PTE exist in other TLBs
      before the TLB flush takes effect?
      
      The case where the iTLB and dTLB of a CPU are pointing at different pages
      is an interesting one but follows from the general problem.
      
      The other (minor) thing with the comment: I think it is a bit
      clearer to say what the old code was doing (i.e., how it avoids the
      race, as opposed to merely that it does).
      
      References: 4ce072f1 ("mm: fix a race condition under SMC + COW")
      Link: https://lkml.kernel.org/r/20201215121119.351650-1-npiggin@gmail.com
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Suresh Siddha <sbsiddha@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 16 Dec 2020, 4 commits
  15. 19 Oct 2020, 1 commit
  16. 17 Oct 2020, 1 commit
  17. 14 Oct 2020, 4 commits
  18. 09 Oct 2020, 1 commit
    • mm: avoid early COW write protect games during fork() · f3c64eda
      Authored by Linus Torvalds
      In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
      for ptes") we write-protected the PTE before doing the page pinning
      check, in order to avoid a race with concurrent fast-GUP pinning (which
      doesn't take the mm semaphore or the page table lock).
      
      That trick doesn't actually work - it doesn't handle memory ordering
      properly, and doing so would be prohibitively expensive.
      
      It also isn't really needed.  While we're moving in the direction of
      allowing and supporting page pinning without marking the pinned area
      with MADV_DONTFORK, the fact is that we've never really supported this
      kind of odd "concurrent fork() and page pinning", and doing the
      serialization on a pte level is just wrong.
      
      We can add serialization with a per-mm sequence counter, so we know how
      to solve that race properly, but we'll do that at a more appropriate
      time.  Right now this just removes the write protect games.
      
      It also turns out that the write protect games actually break on Power,
      as reported by Aneesh Kumar:
      
       "Architecture like ppc64 expects set_pte_at to be not used for updating
        a valid pte. This is further explained in commit 56eecdb9 ("mm:
        Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
      
      and the code triggered a warning there:
      
        WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
        Call Trace:
          copy_present_page mm/memory.c:857 [inline]
          copy_present_pte mm/memory.c:899 [inline]
          copy_pte_range mm/memory.c:1014 [inline]
          copy_pmd_range mm/memory.c:1092 [inline]
          copy_pud_range mm/memory.c:1127 [inline]
          copy_p4d_range mm/memory.c:1150 [inline]
          copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
          dup_mmap kernel/fork.c:592 [inline]
          dup_mm+0x77c/0xab0 kernel/fork.c:1355
          copy_mm kernel/fork.c:1411 [inline]
          copy_process+0x1f00/0x2740 kernel/fork.c:2070
          _do_fork+0xc4/0x10b0 kernel/fork.c:2429
      
      Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
      Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Leon Romanovsky <leonro@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 06 Oct 2020, 1 commit
  20. 28 Sep 2020, 2 commits
    • mm: Do early cow for pinned pages during fork() for ptes · 70e806e4
      Authored by Peter Xu
      This allows copy_pte_range() to do early cow if the pages were pinned on
      the source mm.
      
      Currently we don't have an accurate way to know whether a page is
      pinned or not.  The only thing we have is page_maybe_dma_pinned().
      However, that's good enough for now, especially with the newly added
      mm->has_pinned flag, which makes sure we won't affect processes that
      never pinned any pages.
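
      In sketch form, the early-cow decision looks like this (simplified;
      the helper name here is hypothetical, and the real checks live
      inline in the copy path):

        static inline bool fork_needs_early_cow(struct mm_struct *src_mm,
                                                struct page *page)
        {
                /* Cheap check first: skip mms that never pinned pages. */
                if (likely(!atomic_read(&src_mm->has_pinned)))
                        return false;
                /* page_maybe_dma_pinned() may false-positive on high
                 * refcounts; the worst case is an unnecessary copy. */
                return page_maybe_dma_pinned(page);
        }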
      
      It would be easier if we could do GFP_KERNEL allocations within
      copy_one_pte().  Unfortunately we can't, because we hold the page
      table locks for both the parent and the child processes, so the page
      allocation needs to be done outside copy_one_pte().
      
      There is some trickery in copy_present_pte(), mainly the wrprotect
      trick to block concurrent fast-gup; the comments in the function
      explain it in place.
      
      Oleg Nesterov reported a (probably harmless) bug during review: we
      didn't reset entry.val properly in copy_pte_range(), so there was
      potentially a chance of calling add_swap_count_continuation()
      multiple times on the same swp entry.  However, that should be
      harmless: even if it happens, add_swap_count_continuation() will
      return directly after noticing that there is enough space for the
      swp counter.  So instead of a standalone stable patch, it is touched
      up in this patch directly.
      
      Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3
      Authored by Peter Xu
      This prepares for the future work to trigger early cow on pinned pages
      during fork().
      
      No functional change intended.
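
      A sketch of the new signature (assumed from the description; the
      extra argument is the new, destination-side vma):

        int copy_page_range(struct mm_struct *dst_mm,
                            struct mm_struct *src_mm,
                            struct vm_area_struct *vma,
                            struct vm_area_struct *new);
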
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 24 Sep 2020, 3 commits
    • mm: fix misplaced unlock_page in do_wp_page() · be068f29
      Authored by Linus Torvalds
      Commit 09854ba9 ("mm: do_wp_page() simplification") reorganized all
      the code around the page re-use vs copy, but in the process also moved
      the final unlock_page() around to after the wp_page_reuse() call.
      
      That normally doesn't matter - but it means that the unlock_page() is
      now done after releasing the page table lock.  Again, not a big deal,
      you'd think.
      
      But it turns out that it's very wrong indeed, because once we've
      released the page table lock, we've basically lost our only reference to
      the page - the page tables - and it could now be free'd at any time.  We
      do hold the mmap_sem, so no actual unmap() can happen, but madvise can
      come in and a MADV_DONTNEED will zap the page range - and free the page.
      
      So now the page may be free'd just as we're unlocking it, which in turn
      will usually trigger a "Bad page state" error in the freeing path.  To
      make matters more confusing, by the time the debug code prints out the
      page state, the unlock has typically completed and everything looks fine
      again.
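
      In sketch form (simplified from do_wp_page(); wp_page_reuse() drops
      the page table lock internally):

        /* before the fix: */
        wp_page_reuse(vmf);       /* releases the PTL ... */
        unlock_page(page);        /* ... so the page may be gone by now */

        /* after the fix: unlock while the PTL still keeps the page */
        unlock_page(page);
        wp_page_reuse(vmf);
        return VM_FAULT_WRITE;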
      
      This all doesn't happen in any normal situations, but it does trigger
      with the dirtyc0w_child LTP test.  And it seems to trigger much more
      easily (but not exclusively) on s390 than elsewhere, probably because
      s390 doesn't do the "batch pages up for freeing after the TLB flush"
      that gives the unlock_page() more time to complete and makes the race
      harder to hit.
      
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
      Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
      Reported-by: Qian Cai <cai@redhat.com>
      Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
      Bisected-and-analyzed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move the copy_one_pte() pte_present check into the caller · 79a1971c
      Authored by Linus Torvalds
      This completes the split of the non-present and present pte cases by
      moving the check for the source pte being present into the single
      caller, which also means that we clearly separate out the very different
      return value case for a non-present pte.
      
      The present pte case currently always succeeds.
      
      This is a pure code re-organization with no semantic change: the intent
      is to make it much easier to add a new return case to the present pte
      case for when we do early COW at page table copy time.
      
      This was split out from the previous commit simply to make it easy to
      visually see that there were no semantic changes from this code
      re-organization.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: split out the non-present case from copy_one_pte() · df3a57d1
      Authored by Linus Torvalds
      This is a purely mechanical split of the copy_one_pte() function.  It's
      not immediately obvious when looking at the diff because of the
      indentation change, but the way to see what is going on in this commit
      is to use the "-w" flag to not show pure whitespace changes, and you see
      how the first part of copy_one_pte() is simply lifted out into a
      separate function.
      
      And since the non-present case is marked unlikely, don't mark the
      new function inline.  Not that gcc really seems to care, since it looks
      like it will inline it anyway due to the whole "single callsite for
      static function" logic.  In fact, code generation with the function
      split is almost identical to before.  But not marking it inline is the
      right thing to do.
      
      This is pure prep-work and cleanup for subsequent changes.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>