1. 22 March 2023, 2 commits
    • mm: optimize do_wp_page() for fresh pages in local LRU pagevecs · eefceffc
      Authored by David Hildenbrand
      mainline inclusion
      from mainline-v5.18-rc1
      commit d4c47097
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4c470970d45c863fafc757521a82be2f80b1232
      
      --------------------------------
      
      For example, if a page just got swapped in via a read fault, the LRU
      pagevecs might still hold a reference to the page.  If we trigger a write
      fault on such a page, the additional reference from the LRU pagevecs will
      prohibit reusing the page.
      
      Let's conditionally drain the local LRU pagevecs when we stumble over a
      !PageLRU() page.  We cannot easily drain remote LRU pagevecs and it might
      not be desirable performance-wise.  Consequently, this will only avoid
      copying in some cases.
      
      Add a simple "page_count(page) > 3" check first but keep the
      "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to
      minimize cases where we remove a page from the swapcache but won't be able
      to reuse it, for example, because another process has it mapped R/O, to
      not affect reclaim.
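      A minimal sketch of the resulting reuse checks, assuming the
      do_wp_page() structure described above (not the verbatim upstream
      diff):
      
          if (PageAnon(page)) {
                  /*
                   * Cheap early checks first; exclusivity is verified
                   * again under the page lock before actually reusing.
                   */
                  if (PageKsm(page) || page_count(page) > 3)
                          goto copy;
                  if (!PageLRU(page))
                          /*
                           * The page may still sit in a local LRU pagevec;
                           * drain it so that reference goes away. Remote
                           * pagevecs cannot be drained cheaply.
                           */
                          lru_add_drain();
                  if (page_count(page) > 1 + PageSwapCache(page))
                          goto copy;
                  /* ... trylock_page(), try_to_free_swap(), reuse ... */
          }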
      
      We cannot easily handle the following cases and we will always have to
      copy:
      
      (1) The page is referenced in the LRU pagevecs of other CPUs. We really
          would have to drain the LRU pagevecs of all CPUs -- most probably
          copying is much cheaper.
      
      (2) The page is already PageLRU() but is getting moved between LRU
          lists, for example, for activation (e.g., mark_page_accessed()),
          deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
          drain mostly unconditionally, which might be bad performance-wise.
          Most probably this won't happen too often in practice.
      
      Note that there are other reasons why an anon page might temporarily not
      be PageLRU(): for example, compaction and migration have to isolate LRU
      pages from the LRU lists first (isolate_lru_page()), moving them to
      temporary local lists and clearing PageLRU() and holding an additional
      reference on the page.  In that case, we'll always copy.
      
      This change seems to be fairly effective with the reproducer [1] shared by
      Nadav, as long as writeback is done synchronously, for example, using
      zram.  However, with asynchronous writeback, we'll usually fail to free
      the swapcache because the page is still under writeback: something we
      cannot easily optimize for, and maybe it's not really relevant in
      practice.
      
      [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      eefceffc
    • mm: optimize do_wp_page() for exclusive pages in the swapcache · 8c97cec0
      Authored by David Hildenbrand
      mainline inclusion
      from mainline-v5.18-rc1
      commit 53a05ad9
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53a05ad9f21d858d24f76d12b3e990405f2036d1
      
      --------------------------------
      
      Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.
      
      This series attempts to optimize and streamline the COW logic for ordinary
      anon pages and THP anon pages, fixing two remaining instances of
      CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
      can leak from a parent process to a child process via anonymous pages
      shared during fork().
      
      This issue, including other related COW issues, has been summarized in [2]:
      
       "1. Observing Memory Modifications of Private Pages From A Child Process
      
        Long story short: process-private memory might not be as private as you
        think once you fork(): successive modifications of private memory
        regions in the parent process can still be observed by the child
        process, for example, by smart use of vmsplice()+munmap().
      
        The core problem is that pinning pages readable in a child process, such
        as done via the vmsplice system call, can result in a child process
        observing memory modifications done in the parent process the child is
        not supposed to observe. [1] contains an excellent summary and [2]
        contains further details. This issue was assigned CVE-2020-29374 [9].
      
        For this to trigger, it's required to use a fork() without subsequent
        exec(), for example, as used under Android zygote. Without further
        details about an application that forks less-privileged child processes,
        one cannot really say what's actually affected and what's not -- see the
        details section at the end of this mail for a short sshd/openssh analysis.
      
        While commit 17839856 ("gup: document and work around "COW can break
        either way" issue") fixed this issue and resulted in other problems
        (e.g., ptrace on pmem), commit 09854ba9 ("mm: do_wp_page()
        simplification") re-introduced part of the problem unfortunately.
      
        The original reproducer can be modified quite easily to use THP [3] and
        make the issue appear again on upstream kernels. I modified it to use
        hugetlb [4] and it triggers as well. The problem is certainly less
        severe with hugetlb than with THP; it merely highlights that we still
        have plenty of open holes we should be closing/fixing.
      
        Regarding vmsplice(), the only known workaround is to disallow the
        vmsplice() system call ... or disable THP and hugetlb. But who knows
        what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
        the end, it's a more generic issue"
      
      This security issue was first reported by Jann Horn on 27 May 2020 and it
      currently affects anonymous pages during swapin, anonymous THP and hugetlb.
      This series tackles anonymous pages during swapin and anonymous THP:
      
       - do_swap_page() for handling COW on PTEs during swapin directly
      
       - do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
         faults
      
      With this series, we'll apply the same COW logic we have in do_wp_page()
      to all swappable anon pages: don't reuse (map writable) the page in
      case there are additional references (page_count() != 1). All users of
      reuse_swap_page() are removed, and consequently reuse_swap_page() is
      removed.
      
      In general, we're struggling with the following COW-related issues:
      
      (1) "missed COW": we miss to copy on write and reuse the page (map it
          writable) although we must copy because there are pending references
          from another process to this page. The result is a security issue.
      
      (2) "wrong COW": we copy on write although we wouldn't have to and
          shouldn't: if there are valid GUP references, they will become out
          of sync with the pages mapped into the page table. We fail to detect
          that such a page can be reused safely, especially if never more than
          a single process mapped the page. The result is an intra process
          memory corruption.
      
      (3) "unnecessary COW": we copy on write although we wouldn't have to:
          performance degradation and temporary increases swap+memory
          consumption can be the result.
      
      While this series fixes (1) for swappable anon pages, it first tries to
      reduce reported cases of (3) as far as reasonably possible, to limit the
      impact of the streamlining.  The individual patches try to describe in
      which cases we will run into (3).
      
      This series certainly makes (2) worse for THP, because a THP will now
      get PTE-mapped on write faults if there are additional references, even
      if there was only ever a single process involved: once PTE-mapped, we'll
      copy each and every subpage and won't reuse any subpage as long as the
      underlying compound page wasn't split.
      
      I'm working on an approach to fix (2) and improve (3): PageAnonExclusive
      to mark anon pages that are exclusive to a single process, allow GUP
      pins only on such exclusive pages, and allow turning exclusive pages
      shared (clearing PageAnonExclusive) only if there are no GUP pins.  Anon
      pages with PageAnonExclusive set never have to be copied during write
      faults, but eventually during fork() if they cannot be turned shared.
      The improved reuse logic in this series will essentially also be the
      logic to reset PageAnonExclusive.  This work will certainly take a
      while, but I'm planning on sharing details before having code fully
      ready.
      
      The remaining patches of the series are cleanups related to
      reuse_swap_page().
      
      Notes:
      * For now, I'll leave hugetlb code untouched: "unnecessary COW" might
        easily break existing setups because hugetlb pages are a scarce resource
        and we could just end up having to crash the application when we run out
        of hugetlb pages. We have to be very careful and the security aspect with
        hugetlb is most certainly less relevant than for unprivileged anon pages.
      * Instead of lru_add_drain() we might actually just drain the lru_add list
        or even just remove the single page of interest from the lru_add list.
        This would require a new helper function, and could be added if the
        conditional lru_add_drain() turns out to be a problem.
      * I extended the test case already included in [1] to also test for the
        newly found do_swap_page() case. I'll send that out separately once/if
        this part is merged.
      
      [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      This patch (of 9):
      
      Liang Zhang reported [1] that the current COW logic in do_wp_page() is
      sub-optimal when it comes to swap+read fault+write fault of anonymous
      pages that have a single user, visible via a performance degradation in
      the redis benchmark.  Something similar was previously reported [2] by
      Nadav with a simple reproducer.
      
      After we put an anon page into the swapcache and unmapped it from a single
      process, that process might read that page again and refault it read-only.
      If that process then writes to that page, the process is actually the
      exclusive user of the page; however, the COW logic in do_wp_page() won't
      be able to reuse it due to the additional reference from the swapcache.
      
      Let's optimize for pages that have been added to the swapcache but only
      have an exclusive user.  Try removing the swapcache reference if there is
      hope that we're the exclusive user.
      
      We will fail removing the swapcache reference in two scenarios:
      (1) There are additional swap entries referencing the page: copying
          instead of reusing is the right thing to do.
      (2) The page is under writeback: theoretically we might be able to reuse
          in some cases, however, we cannot remove the additional reference
          and will have to copy.
      
      Note that we'll only try removing the page from the swapcache when it's
      highly likely that we'll be the exclusive owner after removing the page
      from the swapcache.  As we're about to map that page writable and redirty
      it, that should not affect reclaim but is rather the right thing to do.
      
      Further, we might have additional references from the LRU pagevecs, which
      will force us to copy instead of being able to reuse.  We'll try handling
      such references for some scenarios next.  Concurrent writeback cannot be
      handled easily and we'll always have to copy.
      
      While at it, remove the superfluous page_mapcount() check: it's
      implicitly covered by the page_count() for ordinary anon pages.
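      A sketch of the attempted swapcache release in do_wp_page(), assuming
      the naming above (simplified from the actual patch):
      
          if (PageKsm(page) || page_count(page) > 1 + PageSwapCache(page))
                  goto copy;              /* no hope of being exclusive */
          if (!trylock_page(page))
                  goto copy;
          if (PageSwapCache(page))
                  try_to_free_swap(page); /* fails under writeback or if
                                             other swap entries remain */
          if (page_count(page) != 1) {
                  unlock_page(page);      /* still shared: must copy */
                  goto copy;
          }
          /* exclusive user: reuse the page and map it writable */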
      
      [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
      [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Liang Zhang <zhangliang5@huawei.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      8c97cec0
  2. 31 January 2023, 1 commit
  3. 08 December 2022, 1 commit
  4. 30 November 2022, 1 commit
    • mm/memory: add non-anonymous page check in the copy_present_page() · 0a37c960
      Authored by Yuanzheng Song
      stable inclusion
      from stable-v5.10.153
      commit 935a8b6202101d7f58fe9cd11287f9cec0d8dd32
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5XS4G
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=935a8b6202101d7f58fe9cd11287f9cec0d8dd32
      
      --------------------------------
      
      The vma->anon_vma of the child process may be NULL because
      the entire vma does not contain anonymous pages. In this
      case, a BUG will occur when copy_present_page() passes
      a copy of a non-anonymous page of that vma to
      page_add_new_anon_rmap() to set up a new anonymous rmap.
      
      ------------[ cut here ]------------
      kernel BUG at mm/rmap.c:1044!
      Internal error: Oops - BUG: 0 [#1] SMP
      Modules linked in:
      CPU: 2 PID: 3617 Comm: test Not tainted 5.10.149 #1
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
      pc : __page_set_anon_rmap+0xbc/0xf8
      lr : __page_set_anon_rmap+0xbc/0xf8
      sp : ffff800014c1b870
      x29: ffff800014c1b870 x28: 0000000000000001
      x27: 0000000010100073 x26: ffff1d65c517baa8
      x25: ffff1d65cab0f000 x24: ffff1d65c416d800
      x23: ffff1d65cab5f248 x22: 0000000020000000
      x21: 0000000000000001 x20: 0000000000000000
      x19: fffffe75970023c0 x18: 0000000000000000
      x17: 0000000000000000 x16: 0000000000000000
      x15: 0000000000000000 x14: 0000000000000000
      x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: 0000000000000000
      x9 : ffffc3096d5fb858 x8 : 0000000000000000
      x7 : 0000000000000011 x6 : ffff5a5c9089c000
      x5 : 0000000000020000 x4 : ffff5a5c9089c000
      x3 : ffffc3096d200000 x2 : ffffc3096e8d0000
      x1 : ffff1d65ca3da740 x0 : 0000000000000000
      Call trace:
       __page_set_anon_rmap+0xbc/0xf8
       page_add_new_anon_rmap+0x1e0/0x390
       copy_pte_range+0xd00/0x1248
       copy_page_range+0x39c/0x620
       dup_mmap+0x2e0/0x5a8
       dup_mm+0x78/0x140
       copy_process+0x918/0x1a20
       kernel_clone+0xac/0x638
       __do_sys_clone+0x78/0xb0
       __arm64_sys_clone+0x30/0x40
       el0_svc_common.constprop.0+0xb0/0x308
       do_el0_svc+0x48/0xb8
       el0_svc+0x24/0x38
       el0_sync_handler+0x160/0x168
       el0_sync+0x180/0x1c0
      Code: 97f8ff85 f9400294 17ffffeb 97f8ff82 (d4210000)
      ---[ end trace a972347688dc9bd4 ]---
      Kernel panic - not syncing: Oops - BUG: Fatal exception
      SMP: stopping secondary CPUs
      Kernel Offset: 0x43095d200000 from 0xffff800010000000
      PHYS_OFFSET: 0xffffe29a80000000
      CPU features: 0x08200022,61806082
      Memory Limit: none
      ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception ]---
      
      This problem was fixed by commit fb3d824d
      ("mm/rmap: split page_dup_rmap() into page_dup_file_rmap()
      and page_try_dup_anon_rmap()"), but it still exists in the
      linux-5.10.y branch.
      
      That patch is not applicable to this version because of
      the large differences between versions. Therefore, fix it by
      adding a non-anonymous page check in copy_present_page().
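      A sketch of the added check, assuming it sits near the top of
      copy_present_page() in mm/memory.c of linux-5.10.y:
      
          /*
           * The child's vma->anon_vma may be NULL because the vma contains
           * no anonymous pages; a copy of a non-anonymous page must not be
           * handed to page_add_new_anon_rmap(). Returning 1 tells the
           * caller to fall back to copying the pte the normal way.
           */
          if (!PageAnon(page))
                  return 1;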
      
      Cc: stable@vger.kernel.org
      Fixes: 70e806e4 ("mm: Do early cow for pinned pages during fork() for ptes")
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0a37c960
  5. 21 November 2022, 1 commit
    • arm64: add cow to machine check safe · b32f46c2
      Authored by Tong Tiangen
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5GB28
      CVE: NA
      
      -------------------------------
      
      During COW (copy-on-write) processing, the data of the user process is
      copied. When a hardware memory error is encountered during the copy,
      only the relevant processes are affected, so killing the user process
      and isolating the user page with the hardware memory error is a more
      reasonable choice than a kernel panic.
      
      Add a new helper, copy_page_mc(), which provides a machine-check-safe
      page copy implementation. At present, it is only used in COW; in the
      future, we can expand it to more scenarios. As long as the consequences
      of a page copy failure are not fatal (e.g., only a user process is
      affected), we can use this helper.
      
      The copy_page_mc() in copy_page_mc.S largely borrows from copy_page()
      in copy_page.S; the main difference is that copy_page_mc() adds an
      extable entry to every load/store instruction to support machine check
      safety. This keeps the patch simple; if needed, optimizations can be
      folded in later.
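      A hedged sketch of the intended usage (the helper's exact signature
      and the COW wiring are assumptions based on the description above,
      not the verbatim patch):
      
          /* arch/arm64 assembly helper: machine-check-safe page copy;
           * assumed to report whether the whole page was copied, i.e.
           * false if a hardware error was consumed by the extable fixup. */
          bool copy_page_mc(void *to, const void *from);
      
          /* in cow_user_page()-like code: */
          if (!copy_page_mc(page_address(dst), page_address(src))) {
                  /* Copy hit a hardware memory error: do not panic.
                   * Fail the fault, kill the affected process, and let
                   * memory_failure() isolate the poisoned source page. */
                  return false;
          }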
      Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
      b32f46c2
  6. 11 November 2022, 1 commit
  7. 03 November 2022, 1 commit
    • mm/memory.c: fix race when faulting a device private page · 66c1e596
      Authored by Alistair Popple
      mainline inclusion
      from mainline-v6.1-rc1
      commit 16ce101d
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VZ0L
      CVE: CVE-2022-3523
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16ce101db85db694a91380aa4c89b25530871d33
      
      --------------------------------
      
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace.
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting.
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code.
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests though so I doubt they will cause problems.
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
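      A sketch of the fixed fault path in do_swap_page(), per the
      description above (simplified; mainline uses pfn_swap_entry_to_page(),
      older branches use device_private_entry_to_page(), and this backport
      resolves conflicts in the files listed under "Conflicts" below):
      
          } else if (is_device_private_entry(entry)) {
                  vmf->page = pfn_swap_entry_to_page(entry);
                  vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                                 vmf->address, &vmf->ptl);
                  if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
                          goto unlock;
                  /*
                   * Take a reference while holding the ptl so a concurrent
                   * migration cannot free the page (and possibly the
                   * underlying pgmap) before migrate_to_ram() runs.
                   */
                  get_page(vmf->page);
                  pte_unmap_unlock(vmf->pte, vmf->ptl);
                  ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
                  put_page(vmf->page);
          }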
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	arch/powerpc/kvm/book3s_hv_uvmem.c
      	include/linux/migrate.h
      	lib/test_hmm.c
      	mm/migrate.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      66c1e596
  8. 02 November 2022, 1 commit
  9. 09 August 2022, 1 commit
  10. 18 July 2022, 1 commit
    • mm: don't skip swap entry even if zap_details specified · b9be9610
      Authored by Peter Xu
      stable inclusion
      from stable-v5.10.111
      commit f089471d1b754cdd386f081f6c62eec414e8e188
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f089471d1b754cdd386f081f6c62eec414e8e188
      
      --------------------------------
      
      commit 5abfd71d upstream.
      
      Patch series "mm: Rework zap ptes on swap entries", v5.
      
      Patch 1 should fix a long-standing bug in zap_pte_range()'s
      zap_details usage.  The risk is that some swap entries could be skipped
      while we should have zapped them.
      
      Migration entries are not the major concern because file backed memory
      always zap in the pattern that "first time without page lock, then
      re-zap with page lock" hence the 2nd zap will always make sure all
      migration entries are already recovered.
      
      However, there can be issues with real swap entries getting skipped
      erroneously.  There's a reproducer provided in the commit message of
      patch 1 for that.
      
      Patch 2-4 are cleanups that are based on patch 1.  After the whole
      patchset applied, we should have a very clean view of zap_pte_range().
      
      Only patch 1 needs to be backported to stable if necessary.
      
      This patch (of 4):
      
      The "details" pointer shouldn't be the token to decide whether we should
      skip swap entries.
      
      For example, when the callers specified details->zap_mapping==NULL, it
      means the user wants to zap all the pages (including COWed pages), then
      we need to look into swap entries because there can be private COWed
      pages that was swapped out.
      
      Skipping some swap entries when details is non-NULL may lead to wrongly
      leaving some of the swap entries while we should have zapped them.
      
      A reproducer of the problem:
      Reviewed-by: Wei Li <liwei391@huawei.com>
      
      ===8<===
              #define _GNU_SOURCE         /* See feature_test_macros(7) */
              #include <stdio.h>
              #include <assert.h>
              #include <unistd.h>
              #include <sys/mman.h>
              #include <sys/types.h>
      
              int page_size;
              int shmem_fd;
              char *buffer;
      
        int main(void)
              {
                      int ret;
                      char val;
      
                      page_size = getpagesize();
                      shmem_fd = memfd_create("test", 0);
                      assert(shmem_fd >= 0);
      
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE, shmem_fd, 0);
                      assert(buffer != MAP_FAILED);
      
                      /* Write private page, swap it out */
                      buffer[page_size] = 1;
                      madvise(buffer, page_size * 2, MADV_PAGEOUT);
      
                      /* This should drop private buffer[page_size] already */
                      ret = ftruncate(shmem_fd, page_size);
                      assert(ret == 0);
                      /* Recover the size */
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      /* Re-read the data, it should be all zero */
                      val = buffer[page_size];
                      if (val == 0)
                              printf("Good\n");
                      else
                              printf("BUG\n");
              }
      ===8<===
      
      We don't need to touch up the pmd path, because pmd never had an issue with
      swap entries.  For example, shmem pmd migration will always be split into
      pte level, and the same applies to swapping on anonymous memory.
      
      Add another helper should_zap_cows() so that we can also check whether we
      should zap private mappings when there's no page pointer specified.
      
      This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
      we should do the same check upon migration entry, hwpoison entry and
      genuine swap entries too.
      
      To be explicit, we should still remember to keep the private entries if
      even_cows==false, and always zap them when even_cows==true.
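      A sketch of the new helper, following the semantics spelled out above
      (field naming as in the commit; later kernels renamed things):
      
          /* Whether we should zap all COWed (private) pages too */
          static inline bool should_zap_cows(struct zap_details *details)
          {
                  /* By default, zap all pages */
                  if (!details)
                          return true;
      
                  /* Or, we zap COWed pages only if the caller wants to */
                  return !details->zap_mapping;
          }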
      
      The issue seems to exist starting from the initial commit of git.
      
      [peterx@redhat.com: comment tweaks]
        Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      b9be9610
  11. 06 July 2022, 2 commits
  12. 15 October 2021, 2 commits
  13. 13 October 2021, 1 commit
  14. 12 October 2021, 1 commit
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 1bfa3cc3
      Authored by Hugh Dickins
      stable inclusion
      from stable-5.10.47
      commit 0010275ca243e6260893207d41843bb8dc3846e4
      bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0010275ca243e6260893207d41843bb8dc3846e4
      
      --------------------------------
      
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but here we apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
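      A sketch of the resulting helper, close to the upstream shape (struct
      zap_details field names vary between stable branches):
      
          void unmap_mapping_page(struct page *page)
          {
                  struct address_space *mapping = page->mapping;
                  struct zap_details details = { };
      
                  VM_BUG_ON(!PageLocked(page));
                  VM_BUG_ON(PageTail(page));
      
                  details.check_mapping = mapping;
                  details.single_page = page;
      
                  i_mmap_lock_write(mapping);
                  if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                          unmap_mapping_range_tree(&mapping->i_mmap, &details);
                  i_mmap_unlock_write(mapping);
          }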
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      
      Conflict:
      	mm/truncate.c
      [Backport from mainline 22061a1f]
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      1bfa3cc3
  15. 02 September 2021, 1 commit
    • userfaultfd: fix BUG_ON() in userfaultfd_release() · 84ff5e27
      Authored by Xiongfeng Wang
      hulk inclusion
      category: bugfix
      bugzilla: 175146
      CVE: NA
      
      ------------------------------------
      
      Syzkaller caught the following BUG_ON:
      
      ------------[ cut here ]------------
      kernel BUG at fs/userfaultfd.c:909!
      Internal error: Oops - BUG: 0 [#1] SMP
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      Process syz-executor.2 (pid: 1994, stack limit = 0x0000000048da525b)
      CPU: 0 PID: 1994 Comm: syz-executor.2 Not tainted 4.19.90+ #6
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO)
      pc : userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
      lr : userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
      sp : ffff80017d247c80
      x29: ffff80017d247c90 x28: ffff80019b25f720
      x27: 2000000000100077 x26: ffff80017c28fe40
      x25: ffff80019b25f770 x24: ffff80019b25f7e0
      x23: ffff80019b25e378 x22: 1ffff0002fa48fa6
      x21: ffff80017f103200 x20: dfff200000000000
      x19: ffff80017c28fe40 x18: 0000000000000000
      x17: ffffffff00000001 x16: 0000000000000000
      x15: 0000000000000000 x14: 0000000000000000
      x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: 0000000000000000
      x9 : 1ffff0002fa48fa6 x8 : ffff10002fa48fa6
      x7 : ffff20000add39f0 x6 : 00000000f2000000
      x5 : 0000000000000000 x4 : ffff10002fa48f76
      x3 : ffff200008000000 x2 : ffff20000a61d000
      x1 : ffff800160aa9000 x0 : 0000000000000000
      Call trace:
       userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
       __fput+0x20c/0x688 fs/file_table.c:278
       ____fput+0x24/0x30 fs/file_table.c:309
       task_work_run+0x13c/0x2f8 kernel/task_work.c:135
       tracehook_notify_resume include/linux/tracehook.h:193 [inline]
       do_notify_resume+0x380/0x628 arch/arm64/kernel/signal.c:728
       work_pending+0x8/0x10
      Code: 97ecb0e4 d4210000 17ffffc7 97ecb0e1 (d4210000)
      ---[ end trace de790a3f637d9e60 ]---
      
      In userfaultfd_release(), we check that 'vm_userfaultfd_ctx' and
      'vm_flags & (VM_UFFD_MISSING|VM_UFFD_WP)' are either both set or both
      clear; anything else is a bug. But the flag mask lacks VM_USWAP, so
      add it to avoid the false BUG_ON(). This patch also fixes several
      other issues.
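      A sketch of the corrected invariant check in userfaultfd_release()
      (VM_USWAP is the openEuler-specific flag; the exact surrounding code
      may differ):
      
          /* ctx and the uffd vm_flags must be set or cleared together */
          BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
                 !!(vma->vm_flags &
                    (VM_UFFD_MISSING | VM_UFFD_WP | VM_USWAP)));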
      
      Fixes: c3e6287f ("userswap: support userswap via userfaultfd")
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      
       Conflicts:
      	fs/userfaultfd.c
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      84ff5e27
  16. 31 August 2021, 1 commit
    • mm: add pin memory method for checkpoint add restore · 7dc4c73d
      Authored by Jingxian He
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: N/A
      
      ------------------------------
      
      We can use the checkpoint and restore in userspace (criu) method to dump
      and restore tasks when updating the kernel.
      Currently, criu needs to dump all memory data of tasks to files.
      When the memory size is very large (larger than 1G),
      dumping the data takes a very long time (more than 1 min).
      
      By pinning the memory data of tasks and collecting the corresponding
      physical page mapping info in the checkpoint process, we can remap the
      physical pages to the restored tasks after upgrading the kernel. This
      pin memory method can restore the task data within one second.
      
      The pin memory area info is saved in a reserved memblock,
      which stays usable across the kernel update process.
      
      The pin memory driver provides the following ioctl command for criu:
      1) SET_PIN_MEM_AREA:
      Set pin memory area, which can be remap to the restore task.
      2) CLEAR_PIN_MEM_AREA:
      Clear the pin memory area info,
      which enables the user to reset the pin data.
      3) REMAP_PIN_MEM_AREA:
      Remap the pages of the pin memory to the restore task.
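      A hypothetical userspace sketch of the checkpoint/restore flow (the
      device node name, struct layout, and ioctl request encoding below are
      assumptions for illustration, not taken from the driver):
      
          #include <fcntl.h>
          #include <sys/ioctl.h>
      
          struct pin_mem_area {                     /* hypothetical layout */
                  int           pid;
                  unsigned long virt_start;
                  unsigned long virt_end;
          };
      
          int fd = open("/dev/pinmem", O_RDWR);     /* assumed node name */
          struct pin_mem_area pma = { pid, start, end };
      
          ioctl(fd, SET_PIN_MEM_AREA, &pma);        /* checkpoint: pin + record */
          /* ... kernel update; criu re-creates the task ... */
          ioctl(fd, REMAP_PIN_MEM_AREA, &new_pid);  /* restore: remap the pages */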
      Signed-off-by: Jingxian He <hejingxian@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7dc4c73d
  17. 19 July 2021, 1 commit
  18. 14 July 2021, 1 commit
  19. 03 June 2021, 1 commit
    • mm: notify remote TLBs when dirtying a PTE · d30a2a28
      Authored by Jean-Philippe Brucker
      maillist inclusion
      category: feature
      bugzilla: 51855
      CVE: NA
      
      Reference: https://jpbrucker.net/git/linux/commit/?h=sva/2021-03-01&id=d32d8baaf293aaefef8a1c9b8a4508ab2ec46c61
      
      ---------------------------------------------
      
      The ptep_set_access_flags path in handle_pte_fault can change the
      pte's permissions on some architectures: a Read-Only,
      writeable-clean entry becomes Read-Write and dirty. This requires us to
      call the MMU notifier to invalidate the entry in remote TLBs, for instance
      in a PCIe Address Translation Cache (ATC).
      
      Here is a scenario where the lack of a notifier call ends up locking up
      a device:
      
      1) A shared anonymous buffer is mapped with READ|WRITE prot, at VA.
      
      2) A PCIe device with ATS/PRI/PASID capabilities wants to read the buffer,
         using its virtual address.
      
         a) Device asks for translation of VA for reading (NW=1)
      
         b) The IOMMU cannot fulfill the request, so the device does a Page
            Request for VA. The fault is handled with do_read_fault, after which
            the PTE has flags young, write and rdonly.
      
         c) Device retries the translation; IOMMU sends a Translation Completion
            with the PA and Read-Only permission.
      
         d) The VA->PA translation is stored in the ATC, with Read-Only
            permission. From the device's point of view, the page may or may not
            be writeable. It didn't ask for writeability, so it doesn't get a
            definite answer on that point.
      
      3) The same device now wants to write the buffer. It needs to restart
         the AT-PR-AT dance for writing this time.
      
         a) Device could ask for translation of VA for reading and writing
            (NW=0). The IOMMU would reply with the same Read-Only mapping, so
            this time the device is certain that the page isn't writeable. Some
            implementations might update their ATC entry to store that
            information. The ATS specification is pretty fuzzy on the behaviour
            to adopt.
      
         b) The entry is Read-Only, so we fault again. The PTE exists and is
            valid, all we need to do is mark it dirty. TLBs are invalidated, but
            not the ATC since there is no notifier.
      
         c) Now the behaviour depends on the device implementation. If 3a)
            didn't update the ATC entry, the device is still uncertain on the
            writeability of the page, goto 3a) - repeat the Translation Request
            and get Read-Write permissions.
      
            But if 3a) updated the ATC entry, the device is certain of the
            PTE's permissions, and will goto 3b) instead - repeat the page
            fault, again and again. This time we take the "spurious fault" path
            in the same function, which invalidates the TLB but doesn't call an
            MMU notifier either.
      
      To avoid this page request loop, call mmu_notifier_change_pte after
      dirtying the PTE.
      
      Note: if the IOMMU supports hardware update of the access/dirty bits, 3a)
      dirties the PTE, and the IOMMU returns RW permission to the device, so
      there is no need to do a Page Request.
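      A sketch of the change at the spot described above (handle_pte_fault()
      style; the referenced patch may touch additional call sites):
      
          if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry,
                                    vmf->flags & FAULT_FLAG_WRITE)) {
                  update_mmu_cache(vma, vmf->address, vmf->pte);
                  /* Tell remote TLBs (e.g. a PCIe ATC, via the IOMMU's
                   * MMU notifier) that the entry's permissions changed. */
                  mmu_notifier_change_pte(vma->vm_mm, vmf->address, entry);
          }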
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      d30a2a28
  20. 22 April 2021, 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · a202e050
      Authored by Ilya Lipnitskiy
      stable inclusion
      from stable-5.10.28
      commit ec3e06e06f763d8ef5ba3919e2c4e59366b5b92a
      bugzilla: 51779
      
      --------------------------------
      
      commit e720e7d0 upstream.
      
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as described above, initialize init_zero_pfn at
      early_initcall level.  Depending on the architecture, ZERO_PAGE is
      either constant or gets initialized even earlier, at paging_init, so
      there is no issue with initializing zero_pfn earlier.
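      The fix itself is a one-liner in mm/memory.c (sketch of the upstream
      change):
      
          --- a/mm/memory.c
          +++ b/mm/memory.c
           static int __init init_zero_pfn(void)
           {
                   zero_pfn = page_to_pfn(ZERO_PAGE(0));
                   return 0;
           }
          -core_initcall(init_zero_pfn);
          +early_initcall(init_zero_pfn);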
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      a202e050
  21. 09 April 2021, 6 commits
  22. 12 January 2021, 1 commit
  23. 19 October 2020, 1 commit
  24. 17 October 2020, 1 commit
  25. 14 October 2020, 4 commits
  26. 09 October 2020, 1 commit
    • mm: avoid early COW write protect games during fork() · f3c64eda
      Authored by Linus Torvalds
      In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
      for ptes") we write-protected the PTE before doing the page pinning
      check, in order to avoid a race with concurrent fast-GUP pinning (which
      doesn't take the mm semaphore or the page table lock).
      
      That trick doesn't actually work - it doesn't handle memory ordering
      properly, and doing so would be prohibitively expensive.
      
      It also isn't really needed.  While we're moving in the direction of
      allowing and supporting page pinning without marking the pinned area
      with MADV_DONTFORK, the fact is that we've never really supported this
      kind of odd "concurrent fork() and page pinning", and doing the
      serialization on a pte level is just wrong.
      
      We can add serialization with a per-mm sequence counter, so we know how
      to solve that race properly, but we'll do that at a more appropriate
      time.  Right now this just removes the write protect games.
      
      It also turns out that the write protect games actually break on Power,
      as reported by Aneesh Kumar:
      
       "Architecture like ppc64 expects set_pte_at to be not used for updating
        a valid pte. This is further explained in commit 56eecdb9 ("mm:
        Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
      
      and the code triggered a warning there:
      
        WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
        Call Trace:
          copy_present_page mm/memory.c:857 [inline]
          copy_present_pte mm/memory.c:899 [inline]
          copy_pte_range mm/memory.c:1014 [inline]
          copy_pmd_range mm/memory.c:1092 [inline]
          copy_pud_range mm/memory.c:1127 [inline]
          copy_p4d_range mm/memory.c:1150 [inline]
          copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
          dup_mmap kernel/fork.c:592 [inline]
          dup_mm+0x77c/0xab0 kernel/fork.c:1355
          copy_mm kernel/fork.c:1411 [inline]
          copy_process+0x1f00/0x2740 kernel/fork.c:2070
          _do_fork+0xc4/0x10b0 kernel/fork.c:2429
      
      Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
      Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Leon Romanovsky <leonro@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3c64eda
  27. 06 October 2020, 1 commit
  28. 28 September 2020, 2 commits
    • mm: Do early cow for pinned pages during fork() for ptes · 70e806e4
      Authored by Peter Xu
      This allows copy_pte_range() to do early cow if the pages were pinned on
      the source mm.
      
      Currently we don't have an accurate way to know whether a page is pinned
      or not.  The only thing we have is page_maybe_dma_pinned().  However
      that's good enough for now.  Especially, with the newly added
      mm->has_pinned flag to make sure we won't affect processes that never
      pinned any pages.
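      A sketch of that gate inside copy_present_page(), per the commit's
      description (simplified; the real function carries more context):
      
          /* Never pinned anything: the common case, take the usual
           * share-and-write-protect path. */
          if (likely(!atomic_read(&src_mm->has_pinned)))
                  return 1;
          /* False positives are fine here; false negatives are not. */
          if (likely(!page_maybe_dma_pinned(page)))
                  return 1;
          /* The page may be DMA-pinned: copy it for the child right now,
           * using a page preallocated outside the page table locks. */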
      
      It would be easier if we can do GFP_KERNEL allocation within
      copy_one_pte().  Unluckily, we can't because we're with the page table
      locks held for both the parent and child processes.  So the page
      allocation needs to be done outside copy_one_pte().
      
      Some trick is there in copy_present_pte(), majorly the wrprotect trick
      to block concurrent fast-gup.  Comments in the function should explain
      better in place.
      
      Oleg Nesterov reported a (probably harmless) bug during review that we
      didn't reset entry.val properly in copy_pte_range() so that potentially
      there's chance to call add_swap_count_continuation() multiple times on
      the same swp entry.  However that should be harmless since even if it
      happens, the same function (add_swap_count_continuation()) will return
      directly noticing that there're enough space for the swp counter.  So
      instead of a standalone stable patch, it is touched up in this patch
      directly.
      
      Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70e806e4
    • mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3
      Authored by Peter Xu
      This prepares for the future work to trigger early cow on pinned pages
      during fork().
      
      No functional change intended.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7a4830c3