1. 03 Nov 2022, 1 commit
    • mm/memory.c: fix race when faulting a device private page · 66c1e596
      Alistair Popple authored
      mainline inclusion
      from mainline-v6.1-rc1
      commit 16ce101d
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5VZ0L
      CVE: CVE-2022-3523
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=16ce101db85db694a91380aa4c89b25530871d33
      
      --------------------------------
      
      Patch series "Fix several device private page reference counting issues",
      v2
      
      This series aims to fix a number of page reference counting issues in
      drivers dealing with device private ZONE_DEVICE pages.  These result in
      use-after-free type bugs, either from accessing a struct page which no
      longer exists because it has been removed or accessing fields within the
      struct page which are no longer valid because the page has been freed.
      
      During normal usage it is unlikely these will cause any problems.  However
      without these fixes it is possible to crash the kernel from userspace.
      These crashes can be triggered either by unloading the kernel module or
      unbinding the device from the driver prior to a userspace task exiting.
      In modules such as Nouveau it is also possible to trigger some of these
      issues by explicitly closing the device file-descriptor prior to the task
      exiting and then accessing device private memory.
      
      This involves some minor changes to both PowerPC and AMD GPU code.
      Unfortunately I lack hardware to test either of those so any help there
      would be appreciated.  The changes mimic what is done for both Nouveau
      and hmm-tests, though, so I doubt they will cause problems.
      
      This patch (of 8):
      
      When the CPU tries to access a device private page the migrate_to_ram()
      callback associated with the pgmap for the page is called.  However no
      reference is taken on the faulting page.  Therefore a concurrent migration
      of the device private page can free the page and possibly the underlying
      pgmap.  This results in a race which can crash the kernel due to the
      migrate_to_ram() function pointer becoming invalid.  It also means drivers
      can't reliably read the zone_device_data field because the page may have
      been freed with memunmap_pages().
      
      Close the race by getting a reference on the page while holding the ptl to
      ensure it has not been freed.  Unfortunately the elevated reference count
      will cause the migration required to handle the fault to fail.  To avoid
      this failure pass the faulting page into the migrate_vma functions so that
      if an elevated reference count is found it can be checked to see if it's
      expected or not.
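
      A sketch of the resulting device-private branch in do_swap_page(), based
      on the description above and the mainline commit.  Helper names follow
      recent mainline (an older tree would use device_private_entry_to_page()
      and friends), so treat this as illustrative rather than the exact hunk:

              } else if (is_device_private_entry(entry)) {
                      vmf->page = pfn_swap_entry_to_page(entry);
                      vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
                                                     vmf->address, &vmf->ptl);
                      if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte)))
                              goto unlock;

                      /*
                       * Take a reference while the PTL still pins the page, so
                       * a concurrent migration cannot free the page (and
                       * possibly its pgmap) before migrate_to_ram() is called.
                       */
                      get_page(vmf->page);
                      pte_unmap_unlock(vmf->pte, vmf->ptl);
                      ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
                      put_page(vmf->page);
              }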
      
      [mpe@ellerman.id.au: fix build]
        Link: https://lkml.kernel.org/r/87fsgbf3gh.fsf@mpe.ellerman.id.au
      Link: https://lkml.kernel.org/r/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com
      Link: https://lkml.kernel.org/r/d3e813178a59e565e8d78d9b9a4e2562f6494f90.1664366292.git-series.apopple@nvidia.com
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Lyude Paul <lyude@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: Alex Sierra <alex.sierra@amd.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	arch/powerpc/kvm/book3s_hv_uvmem.c
      	include/linux/migrate.h
      	lib/test_hmm.c
      	mm/migrate.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 02 Nov 2022, 1 commit
  3. 09 Aug 2022, 1 commit
  4. 18 Jul 2022, 1 commit
    • mm: don't skip swap entry even if zap_details specified · b9be9610
      Peter Xu authored
      stable inclusion
      from stable-v5.10.111
      commit f089471d1b754cdd386f081f6c62eec414e8e188
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f089471d1b754cdd386f081f6c62eec414e8e188
      
      --------------------------------
      
      commit 5abfd71d upstream.
      
      Patch series "mm: Rework zap ptes on swap entries", v5.
      
      Patch 1 should fix a long standing bug for zap_pte_range() on
      zap_details usage.  The risk is we could have some swap entries skipped
      while we should have zapped them.
      
      Migration entries are not the major concern because file-backed memory
      is always zapped in the pattern "first time without page lock, then
      re-zap with page lock", hence the 2nd zap will always make sure all
      migration entries are already recovered.
      
      However, real swap entries can get skipped erroneously.  There's a
      reproducer provided in the commit message of patch 1 for that.
      
      Patch 2-4 are cleanups that are based on patch 1.  After the whole
      patchset applied, we should have a very clean view of zap_pte_range().
      
      Only patch 1 needs to be backported to stable if necessary.
      
      This patch (of 4):
      
      The "details" pointer shouldn't be the token to decide whether we should
      skip swap entries.
      
      For example, when the caller specifies details->zap_mapping==NULL, it
      means the user wants to zap all the pages (including COWed pages), so we
      need to look into swap entries because there can be private COWed pages
      that were swapped out.
      
      Skipping some swap entries when details is non-NULL may wrongly leave
      behind swap entries that should have been zapped.
      
      A reproducer of the problem:
      Reviewed-by: Wei Li <liwei391@huawei.com>
      
      ===8<===
              #define _GNU_SOURCE         /* See feature_test_macros(7) */
              #include <stdio.h>
              #include <assert.h>
              #include <unistd.h>
              #include <sys/mman.h>
              #include <sys/types.h>
      
              int page_size;
              int shmem_fd;
              char *buffer;
      
              void main(void)
              {
                      int ret;
                      char val;
      
                      page_size = getpagesize();
                      shmem_fd = memfd_create("test", 0);
                      assert(shmem_fd >= 0);
      
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE,
                                      MAP_PRIVATE, shmem_fd, 0);
                      assert(buffer != MAP_FAILED);
      
                      /* Write private page, swap it out */
                      buffer[page_size] = 1;
                      madvise(buffer, page_size * 2, MADV_PAGEOUT);
      
                      /* This should drop private buffer[page_size] already */
                      ret = ftruncate(shmem_fd, page_size);
                      assert(ret == 0);
                      /* Recover the size */
                      ret = ftruncate(shmem_fd, page_size * 2);
                      assert(ret == 0);
      
                      /* Re-read the data, it should be all zero */
                      val = buffer[page_size];
                      if (val == 0)
                              printf("Good\n");
                      else
                              printf("BUG\n");
              }
      ===8<===
      
      We don't need to touch up the pmd path, because the pmd path never had
      an issue with swap entries.  For example, shmem pmd migration will
      always be split to pte level, and the same applies to swapping on
      anonymous memory.
      
      Add another helper should_zap_cows() so that we can also check whether we
      should zap private mappings when there's no page pointer specified.
      
      This patch drops that trick, so we handle swap ptes coherently.  Meanwhile
      we should do the same check upon migration entry, hwpoison entry and
      genuine swap entries too.
      
      To be explicit, we should still remember to keep the private entries if
      even_cows==false, and always zap them when even_cows==true.
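
      A minimal sketch of the new helper, matching the description above; the
      swap, migration and hwpoison branches of zap_pte_range() then consult it
      before deciding whether an entry may be kept:

              static inline bool should_zap_cows(struct zap_details *details)
              {
                      /* By default, zap all pages, including private COWed ones. */
                      if (!details)
                              return true;

                      /*
                       * Otherwise only zap COWed pages when no specific mapping
                       * was requested, i.e. the even_cows==true case above.
                       */
                      return !details->zap_mapping;
              }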
      
      The issue seems to exist starting from the initial commit of git.
      
      [peterx@redhat.com: comment tweaks]
        Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  5. 06 Jul 2022, 2 commits
  6. 15 Oct 2021, 2 commits
  7. 13 Oct 2021, 1 commit
  8. 12 Oct 2021, 1 commit
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 1bfa3cc3
      Hugh Dickins authored
      stable inclusion
      from stable-5.10.47
      commit 0010275ca243e6260893207d41843bb8dc3846e4
      bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0010275ca243e6260893207d41843bb8dc3846e4
      
      --------------------------------
      
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
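
      A condensed sketch of the new helper as described above; the zap_details
      field names are taken from the mainline patch and may differ slightly in
      a backported tree:

              void unmap_mapping_page(struct page *page)
              {
                      struct address_space *mapping = page->mapping;
                      struct zap_details details = { };

                      /* Only the page lock keeps page->mapping stable here. */
                      VM_BUG_ON(!PageLocked(page));
                      VM_BUG_ON(PageTail(page));

                      details.check_mapping = mapping;
                      details.first_index = page->index;
                      details.last_index = page->index + thp_nr_pages(page) - 1;
                      details.single_page = page; /* lets zap_pmd_range() spot this case */

                      i_mmap_lock_write(mapping);
                      if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                              unmap_mapping_range_tree(&mapping->i_mmap, &details);
                      i_mmap_unlock_write(mapping);
              }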
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      
      Conflict:
      	mm/truncate.c
      [Backport from mainline 22061a1f]
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 02 Sep 2021, 1 commit
    • userfaultfd: fix BUG_ON() in userfaultfd_release() · 84ff5e27
      Xiongfeng Wang authored
      hulk inclusion
      category: bugfix
      bugzilla: 175146
      CVE: NA
      
      ------------------------------------
      
      Syzkaller caught the following BUG_ON:
      
      ------------[ cut here ]------------
      kernel BUG at fs/userfaultfd.c:909!
      Internal error: Oops - BUG: 0 [#1] SMP
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      Process syz-executor.2 (pid: 1994, stack limit = 0x0000000048da525b)
      CPU: 0 PID: 1994 Comm: syz-executor.2 Not tainted 4.19.90+ #6
      Hardware name: linux,dummy-virt (DT)
      pstate: 80000005 (Nzcv daif -PAN -UAO)
      pc : userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
      lr : userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
      sp : ffff80017d247c80
      x29: ffff80017d247c90 x28: ffff80019b25f720
      x27: 2000000000100077 x26: ffff80017c28fe40
      x25: ffff80019b25f770 x24: ffff80019b25f7e0
      x23: ffff80019b25e378 x22: 1ffff0002fa48fa6
      x21: ffff80017f103200 x20: dfff200000000000
      x19: ffff80017c28fe40 x18: 0000000000000000
      x17: ffffffff00000001 x16: 0000000000000000
      x15: 0000000000000000 x14: 0000000000000000
      x13: 0000000000000000 x12: 0000000000000000
      x11: 0000000000000000 x10: 0000000000000000
      x9 : 1ffff0002fa48fa6 x8 : ffff10002fa48fa6
      x7 : ffff20000add39f0 x6 : 00000000f2000000
      x5 : 0000000000000000 x4 : ffff10002fa48f76
      x3 : ffff200008000000 x2 : ffff20000a61d000
      x1 : ffff800160aa9000 x0 : 0000000000000000
      Call trace:
       userfaultfd_release+0x4f0/0x6a0 fs/userfaultfd.c:908
       __fput+0x20c/0x688 fs/file_table.c:278
       ____fput+0x24/0x30 fs/file_table.c:309
       task_work_run+0x13c/0x2f8 kernel/task_work.c:135
       tracehook_notify_resume include/linux/tracehook.h:193 [inline]
       do_notify_resume+0x380/0x628 arch/arm64/kernel/signal.c:728
       work_pending+0x8/0x10
      Code: 97ecb0e4 d4210000 17ffffc7 97ecb0e1 (d4210000)
      ---[ end trace de790a3f637d9e60 ]---
      
      In userfaultfd_release(), we check that 'vm_userfaultfd_ctx' and
      'vm_flags & (VM_UFFD_MISSING|VM_UFFD_WP)' are either both set or both
      clear; anything else is a bug.  But the check does not account for the
      VM_USWAP flag, so add it to avoid the false BUG_ON().  This patch also
      fixes several other issues.
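
      A sketch of the adjusted check in userfaultfd_release(), assuming this
      tree's VM_USWAP flag; the surrounding loop over the mm's vmas is
      unchanged:

              /*
               * A vma must have a userfaultfd context if and only if it carries
               * one of the uffd-related flags; with userswap, VM_USWAP counts
               * as such a flag too, so include it in the mask.
               */
              BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
                     !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP | VM_USWAP)));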
      
      Fixes: c3e6287f ("userswap: support userswap via userfaultfd")
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      
       Conflicts:
      	fs/userfaultfd.c
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  10. 31 Aug 2021, 1 commit
    • mm: add pin memory method for checkpoint add restore · 7dc4c73d
      Jingxian He authored
      hulk inclusion
      category: feature
      bugzilla: 48159
      CVE: N/A
      
      ------------------------------
      
      We can use the checkpoint and restore in userspace (criu) method to dump
      and restore tasks when updating the kernel.  Currently, criu needs to
      dump all memory data of the tasks to files.  When the memory size is
      very large (larger than 1G), dumping the data takes a long time (more
      than 1 min).

      By pinning the memory data of the tasks and collecting the corresponding
      physical page mapping info during the checkpoint, we can remap the
      physical pages to the restored tasks after upgrading the kernel.  This
      pin memory method can restore the task data within one second.

      The pin memory area info is saved in a reserved memblock, which stays
      usable across the kernel update.
      
      The pin memory driver provides the following ioctl commands for criu
      (a usage sketch follows the list):
      1) SET_PIN_MEM_AREA:
      Set a pin memory area, which can be remapped to the restored task.
      2) CLEAR_PIN_MEM_AREA:
      Clear the pin memory area info, which lets the user reset the pinned
      data.
      3) REMAP_PIN_MEM_AREA:
      Remap the pages of the pin memory to the restored task.
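
      A hypothetical userspace sketch of how criu might drive these commands.
      The device node path, the argument struct and its fields are placeholders
      only; the real definitions come from the driver's uapi header and may
      differ:

              #include <fcntl.h>
              #include <stdio.h>
              #include <sys/ioctl.h>
              #include <unistd.h>
              #include "pin_mem_uapi.h" /* placeholder: provides SET_PIN_MEM_AREA,
                                           REMAP_PIN_MEM_AREA, struct pin_mem_area */

              int main(void)
              {
                      /* Placeholder layout: pid plus the virtual range to pin. */
                      struct pin_mem_area area = {
                              .pid = 1234,
                              .virt_start = 0x400000,
                              .virt_end   = 0x500000,
                      };
                      int fd = open("/dev/pinmem", O_RDWR); /* placeholder node */

                      if (fd < 0) {
                              perror("open");
                              return 1;
                      }
                      /* Checkpoint: pin the pages so they survive the kernel update. */
                      if (ioctl(fd, SET_PIN_MEM_AREA, &area) < 0)
                              perror("SET_PIN_MEM_AREA");
                      /* Restore (in the new kernel): map the pinned pages back. */
                      if (ioctl(fd, REMAP_PIN_MEM_AREA, &area) < 0)
                              perror("REMAP_PIN_MEM_AREA");
                      close(fd);
                      return 0;
              }
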
      Signed-off-by: Jingxian He <hejingxian@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  11. 19 Jul 2021, 1 commit
  12. 14 Jul 2021, 1 commit
  13. 03 Jun 2021, 1 commit
    • mm: notify remote TLBs when dirtying a PTE · d30a2a28
      Jean-Philippe Brucker authored
      maillist inclusion
      category: feature
      bugzilla: 51855
      CVE: NA
      
      Reference: https://jpbrucker.net/git/linux/commit/?h=sva/2021-03-01&id=d32d8baaf293aaefef8a1c9b8a4508ab2ec46c61
      
      ---------------------------------------------
      
      The ptep_set_access_flags path in handle_pte_fault can cause a change of
      the pte's permissions on some architectures: a Read-Only,
      writeable-clean entry becomes Read-Write and dirty. This requires us to
      call the MMU notifier to invalidate the entry in remote TLBs, for instance
      in a PCIe Address Translation Cache (ATC).
      
      Here is a scenario where the lack of a notifier call ends up locking up
      a device:
      
      1) A shared anonymous buffer is mapped with READ|WRITE prot, at VA.
      
      2) A PCIe device with ATS/PRI/PASID capabilities wants to read the buffer,
         using its virtual address.
      
         a) Device asks for translation of VA for reading (NW=1)
      
         b) The IOMMU cannot fulfill the request, so the device does a Page
            Request for VA. The fault is handled with do_read_fault, after which
            the PTE has flags young, write and rdonly.
      
         c) Device retries the translation; IOMMU sends a Translation Completion
            with the PA and Read-Only permission.
      
         d) The VA->PA translation is stored in the ATC, with Read-Only
            permission. From the device's point of view, the page may or may not
            be writeable. It didn't ask for writeability, so it doesn't get a
            definite answer on that point.
      
      3) The same device now wants to write the buffer. It needs to restart
         the AT-PR-AT dance for writing this time.
      
         a) Device could ask for translation of VA for reading and writing
            (NW=0). The IOMMU would reply with the same Read-Only mapping, so
            this time the device is certain that the page isn't writeable. Some
            implementations might update their ATC entry to store that
            information. The ATS specification is pretty fuzzy on the behaviour
            to adopt.
      
         b) The entry is Read-Only, so we fault again. The PTE exists and is
            valid, all we need to do is mark it dirty. TLBs are invalidated, but
            not the ATC since there is no notifier.
      
         c) Now the behaviour depends on the device implementation. If 3a)
            didn't update the ATC entry, the device is still uncertain on the
            writeability of the page, goto 3a) - repeat the Translation Request
            and get Read-Write permissions.
      
            But if 3a) updated the ATC entry, the device is certain of the
            PTE's permissions, and will goto 3b) instead - repeat the page
            fault, again and again. This time we take the "spurious fault" path
            in the same function, which invalidates the TLB but doesn't call an
            MMU notifier either.
      
      To avoid this page request loop, call mmu_notifier_change_pte after
      dirtying the PTE.
      
      Note: if the IOMMU supports hardware update of the access/dirty bits, 3a)
      dirties the PTE, and the IOMMU returns RW permission to the device, so
      there is no need to do a Page Request.
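
      A sketch of where the call lands, at the access/dirty-bit update near the
      end of handle_pte_fault(); simplified from the referenced branch, so the
      exact placement may differ:

              entry = pte_mkyoung(vmf->orig_pte);
              if (vmf->flags & FAULT_FLAG_WRITE)
                      entry = pte_mkdirty(entry);

              if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                                        vmf->flags & FAULT_FLAG_WRITE)) {
                      update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
                      /*
                       * New: tell MMU-notifier users (e.g. an IOMMU holding an
                       * ATC entry) that the PTE changed, so they drop the stale
                       * Read-Only translation instead of re-faulting forever.
                       */
                      mmu_notifier_change_pte(vmf->vma->vm_mm, vmf->address, entry);
              }
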
      Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  14. 22 Apr 2021, 1 commit
    • mm: fix race by making init_zero_pfn() early_initcall · a202e050
      Ilya Lipnitskiy authored
      stable inclusion
      from stable-5.10.28
      commit ec3e06e06f763d8ef5ba3919e2c4e59366b5b92a
      bugzilla: 51779
      
      --------------------------------
      
      commit e720e7d0 upstream.
      
      There are code paths that rely on zero_pfn to be fully initialized
      before core_initcall.  For example, wq_sysfs_init() is a core_initcall
      function that eventually results in a call to kernel_execve, which
      causes a page fault with a subsequent mmput.  If zero_pfn is not
      initialized by then it may not get cleaned up properly and result in an
      error:
      
        BUG: Bad rss-counter state mm:(ptrval) type:MM_ANONPAGES val:1
      
      Here is an analysis of the race as seen on a MIPS device. On this
      particular MT7621 device (Ubiquiti ER-X), zero_pfn is PFN 0 until
      initialized, at which point it becomes PFN 5120:
      
        1. wq_sysfs_init calls into kobject_uevent_env at core_initcall:
             kobject_uevent_env+0x7e4/0x7ec
             kset_register+0x68/0x88
             bus_register+0xdc/0x34c
             subsys_virtual_register+0x34/0x78
             wq_sysfs_init+0x1c/0x4c
             do_one_initcall+0x50/0x1a8
             kernel_init_freeable+0x230/0x2c8
             kernel_init+0x10/0x100
             ret_from_kernel_thread+0x14/0x1c
      
        2. kobject_uevent_env() calls call_usermodehelper_exec() which executes
           kernel_execve asynchronously.
      
        3. Memory allocations in kernel_execve cause a page fault, bumping the
           MM reference counter:
             add_mm_counter_fast+0xb4/0xc0
             handle_mm_fault+0x6e4/0xea0
             __get_user_pages.part.78+0x190/0x37c
             __get_user_pages_remote+0x128/0x360
             get_arg_page+0x34/0xa0
             copy_string_kernel+0x194/0x2a4
             kernel_execve+0x11c/0x298
             call_usermodehelper_exec_async+0x114/0x194
      
        4. In case zero_pfn has not been initialized yet, zap_pte_range does
           not decrement the MM_ANONPAGES RSS counter and the BUG message is
           triggered shortly afterwards when __mmdrop checks the ref counters:
             __mmdrop+0x98/0x1d0
             free_bprm+0x44/0x118
             kernel_execve+0x160/0x1d8
             call_usermodehelper_exec_async+0x114/0x194
             ret_from_kernel_thread+0x14/0x1c
      
      To avoid races such as described above, initialize init_zero_pfn at
      early_initcall level.  Depending on the architecture, ZERO_PAGE is
      either constant or gets initialized even earlier, at paging_init, so
      there is no issue with initializing zero_pfn earlier.
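
      The fix itself is a one-line change of the initcall level in mm/memory.c,
      sketched here from the description above:

              static int __init init_zero_pfn(void)
              {
                      zero_pfn = page_to_pfn(ZERO_PAGE(0));
                      return 0;
              }
              early_initcall(init_zero_pfn); /* was: core_initcall(init_zero_pfn); */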
      
      Link: https://lkml.kernel.org/r/CALCv0x2YqOXEAy2Q=hafjhHCtTHVodChv1qpM=niAXOpqEbt7w@mail.gmail.com
      Signed-off-by: Ilya Lipnitskiy <ilya.lipnitskiy@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: stable@vger.kernel.org
      Tested-by: 周琰杰 (Zhou Yanjie) <zhouyanjie@wanyeetech.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  15. 09 Apr 2021, 6 commits
  16. 12 Jan 2021, 1 commit
  17. 19 Oct 2020, 1 commit
  18. 17 Oct 2020, 1 commit
  19. 14 Oct 2020, 4 commits
  20. 09 Oct 2020, 1 commit
    • mm: avoid early COW write protect games during fork() · f3c64eda
      Linus Torvalds authored
      In commit 70e806e4 ("mm: Do early cow for pinned pages during fork()
      for ptes") we write-protected the PTE before doing the page pinning
      check, in order to avoid a race with concurrent fast-GUP pinning (which
      doesn't take the mm semaphore or the page table lock).
      
      That trick doesn't actually work - it doesn't handle memory ordering
      properly, and doing so would be prohibitively expensive.
      
      It also isn't really needed.  While we're moving in the direction of
      allowing and supporting page pinning without marking the pinned area
      with MADV_DONTFORK, the fact is that we've never really supported this
      kind of odd "concurrent fork() and page pinning", and doing the
      serialization on a pte level is just wrong.
      
      We can add serialization with a per-mm sequence counter, so we know how
      to solve that race properly, but we'll do that at a more appropriate
      time.  Right now this just removes the write protect games.
      
      It also turns out that the write protect games actually break on Power,
      as reported by Aneesh Kumar:
      
       "Architecture like ppc64 expects set_pte_at to be not used for updating
        a valid pte. This is further explained in commit 56eecdb9 ("mm:
        Use ptep/pmdp_set_numa() for updating _PAGE_NUMA bit")"
      
      and the code triggered a warning there:
      
        WARNING: CPU: 0 PID: 30613 at arch/powerpc/mm/pgtable.c:185 set_pte_at+0x2a8/0x3a0 arch/powerpc/mm/pgtable.c:185
        Call Trace:
          copy_present_page mm/memory.c:857 [inline]
          copy_present_pte mm/memory.c:899 [inline]
          copy_pte_range mm/memory.c:1014 [inline]
          copy_pmd_range mm/memory.c:1092 [inline]
          copy_pud_range mm/memory.c:1127 [inline]
          copy_p4d_range mm/memory.c:1150 [inline]
          copy_page_range+0x1f6c/0x2cc0 mm/memory.c:1212
          dup_mmap kernel/fork.c:592 [inline]
          dup_mm+0x77c/0xab0 kernel/fork.c:1355
          copy_mm kernel/fork.c:1411 [inline]
          copy_process+0x1f00/0x2740 kernel/fork.c:2070
          _do_fork+0xc4/0x10b0 kernel/fork.c:2429
      
      Link: https://lore.kernel.org/lkml/CAHk-=wiWr+gO0Ro4LvnJBMs90OiePNyrE3E+pJvc9PzdBShdmw@mail.gmail.com/
      Link: https://lore.kernel.org/linuxppc-dev/20201008092541.398079-1-aneesh.kumar@linux.ibm.com/
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Tested-by: Leon Romanovsky <leonro@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Shutemov <kirill@shutemov.name>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 06 Oct 2020, 1 commit
  22. 28 Sep 2020, 2 commits
    • mm: Do early cow for pinned pages during fork() for ptes · 70e806e4
      Peter Xu authored
      This allows copy_pte_range() to do early cow if the pages were pinned on
      the source mm.
      
      Currently we don't have an accurate way to know whether a page is pinned
      or not.  The only thing we have is page_maybe_dma_pinned().  However
      that's good enough for now.  Especially, with the newly added
      mm->has_pinned flag to make sure we won't affect processes that never
      pinned any pages.
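
      A rough sketch of the decision this adds to the PTE copy path; the helper
      name here is invented for illustration, the real logic lives inline in
      copy_present_pte():

              /* Should fork() copy this page now rather than share it via COW? */
              static bool need_early_cow(struct vm_area_struct *src_vma,
                                         struct page *page)
              {
                      if (!is_cow_mapping(src_vma->vm_flags))
                              return false;
                      /* Fast exit for processes that never pinned anything. */
                      if (!READ_ONCE(src_vma->vm_mm->has_pinned))
                              return false;
                      /*
                       * May over-report, since it is only a refcount heuristic,
                       * but that errs on the safe side: copy instead of share.
                       */
                      return page_maybe_dma_pinned(page);
              }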
      
      It would be easier if we could do GFP_KERNEL allocation within
      copy_one_pte().  Unluckily, we can't, because we hold the page table
      locks for both the parent and child processes.  So the page allocation
      needs to be done outside copy_one_pte().
      
      There is some trickery in copy_present_pte(), mainly the wrprotect
      trick to block concurrent fast-gup.  The comments in the function
      explain it better in place.
      
      Oleg Nesterov reported a (probably harmless) bug during review: we
      didn't reset entry.val properly in copy_pte_range(), so there is
      potentially a chance to call add_swap_count_continuation() multiple
      times on the same swp entry.  However, that should be harmless since
      even if it happens, the same function (add_swap_count_continuation())
      will return directly after noticing that there's enough space for the
      swp counter.  So instead of a standalone stable patch, it is touched up
      in this patch directly.
      
      Link: https://lore.kernel.org/lkml/20200914143829.GA1424636@nvidia.com/
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/fork: Pass new vma pointer into copy_page_range() · 7a4830c3
      Peter Xu authored
      This prepares for the future work to trigger early cow on pinned pages
      during fork().
      
      No functional change intended.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  23. 24 Sep 2020, 3 commits
    • mm: fix misplaced unlock_page in do_wp_page() · be068f29
      Linus Torvalds authored
      Commit 09854ba9 ("mm: do_wp_page() simplification") reorganized all
      the code around the page re-use vs copy, but in the process also moved
      the final unlock_page() around to after the wp_page_reuse() call.
      
      That normally doesn't matter - but it means that the unlock_page() is
      now done after releasing the page table lock.  Again, not a big deal,
      you'd think.
      
      But it turns out that it's very wrong indeed, because once we've
      released the page table lock, we've basically lost our only reference to
      the page - the page tables - and it could now be free'd at any time.  We
      do hold the mmap_sem, so no actual unmap() can happen, but madvise can
      come in and a MADV_DONTNEED will zap the page range - and free the page.
      
      So now the page may be free'd just as we're unlocking it, which in turn
      will usually trigger a "Bad page state" error in the freeing path.  To
      make matters more confusing, by the time the debug code prints out the
      page state, the unlock has typically completed and everything looks fine
      again.
      
      This all doesn't happen in any normal situations, but it does trigger
      with the dirtyc0w_child LTP test.  And it seems to trigger much more
      easily (but not exclusively) on s390 than elsewhere, probably because
      s390 doesn't do the "batch pages up for freeing after the TLB flush"
      that gives the unlock_page() more time to complete and makes the race
      harder to hit.
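
      A schematic of the reuse tail of do_wp_page() after the fix; only the
      ordering of unlock_page() versus wp_page_reuse() matters here:

              if (PageKsm(page) || page_mapcount(page) != 1 ||
                  page_count(page) != 1) {
                      unlock_page(page);
                      goto copy;
              }
              /*
               * We hold the only map and page references plus the page lock.
               * Unlock before wp_page_reuse(): it drops the page table lock,
               * after which the page tables are our only reference and a
               * concurrent MADV_DONTNEED could free the page under us.
               */
              unlock_page(page); /* moved: used to run after wp_page_reuse() */
              wp_page_reuse(vmf);
              return VM_FAULT_WRITE;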
      
      Fixes: 09854ba9 ("mm: do_wp_page() simplification")
      Link: https://lore.kernel.org/lkml/a46e9bbef2ed4e17778f5615e818526ef848d791.camel@redhat.com/
      Link: https://lore.kernel.org/linux-mm/c41149a8-211e-390b-af1d-d5eee690fecb@linux.alibaba.com/
      Reported-by: Qian Cai <cai@redhat.com>
      Reported-by: Alex Shi <alex.shi@linux.alibaba.com>
      Bisected-and-analyzed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Tested-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: move the copy_one_pte() pte_present check into the caller · 79a1971c
      Linus Torvalds authored
      This completes the split of the non-present and present pte cases by
      moving the check for the source pte being present into the single
      caller, which also means that we clearly separate out the very different
      return value case for a non-present pte.
      
      The present pte case currently always succeeds.
      
      This is a pure code re-organization with no semantic change: the intent
      is to make it much easier to add a new return case to the present pte
      case for when we do early COW at page table copy time.
      
      This was split out from the previous commit simply to make it easy to
      visually see that there were no semantic changes from this code
      re-organization.
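
      A schematic of the resulting call site in copy_pte_range(); names follow
      the mainline split, with the surrounding loop bookkeeping trimmed:

              if (unlikely(!pte_present(*src_pte))) {
                      /*
                       * Non-present case: may need to allocate a swap-count
                       * continuation outside the page table locks.
                       */
                      entry.val = copy_nonpresent_pte(dst_mm, src_mm, dst_pte,
                                                      src_pte, vma, addr, rss);
                      if (entry.val)
                              break;
                      progress += 8;
                      continue;
              }
              /* Present case: currently always succeeds. */
              copy_present_pte(dst_mm, src_mm, dst_pte, src_pte, vma, addr, rss);
              progress += 8;
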
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: split out the non-present case from copy_one_pte() · df3a57d1
      Linus Torvalds authored
      This is a purely mechanical split of the copy_one_pte() function.  It's
      not immediately obvious when looking at the diff because of the
      indentation change, but the way to see what is going on in this commit
      is to use the "-w" flag to not show pure whitespace changes, and you see
      how the first part of copy_one_pte() is simply lifted out into a
      separate function.
      
      And since the non-present case is marked unlikely, don't make the new
      function be inlined.  Not that gcc really seems to care, since it looks
      like it will inline it anyway due to the whole "single callsite for
      static function" logic.  In fact, code generation with the function
      split is almost identical to before.  But not marking it inline is the
      right thing to do.
      
      This is pure prep-work and cleanup for subsequent changes.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  24. 06 Sep 2020, 1 commit
    • mm: track page table modifications in __apply_to_page_range() · e80d3909
      Joerg Roedel authored
      __apply_to_page_range() is also used to change and/or allocate
      page-table pages in the vmalloc area of the address space.  Make sure
      these changes get synchronized to other page-tables in the system by
      calling arch_sync_kernel_mappings() when necessary.
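
      A condensed sketch of the resulting walker: the lower levels thread a
      pgtbl_mod_mask back up, and vmalloc-range modifications are then
      synchronized where the architecture requires it (details and error
      handling trimmed):

              static int __apply_to_page_range(struct mm_struct *mm, unsigned long addr,
                                               unsigned long size, pte_fn_t fn,
                                               void *data, bool create)
              {
                      pgd_t *pgd;
                      unsigned long start = addr, next, end = addr + size;
                      pgtbl_mod_mask mask = 0;
                      int err = 0;

                      if (WARN_ON(addr >= end))
                              return -EINVAL;

                      pgd = pgd_offset(mm, addr);
                      do {
                              next = pgd_addr_end(addr, end);
                              if (!create && pgd_none_or_clear_bad(pgd))
                                      continue;
                              /* Each level records what it touched in 'mask'. */
                              err = apply_to_p4d_range(mm, pgd, addr, next,
                                                       fn, data, create, &mask);
                              if (err)
                                      break;
                      } while (pgd++, addr = next, addr != end);

                      /* Propagate vmalloc-area page-table changes if required. */
                      if (mask & ARCH_PAGE_TABLE_SYNC_MASK)
                              arch_sync_kernel_mappings(start, start + size);

                      return err;
              }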
      
      The impact appears limited to x86-32, where apply_to_page_range may miss
      updating the PMD.  That leads to explosions in drivers like
      
        BUG: unable to handle page fault for address: fe036000
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        *pde = 00000000
        Oops: 0002 [#1] SMP
        CPU: 3 PID: 1300 Comm: gem_concurrent_ Not tainted 5.9.0-rc1+ #16
        Hardware name:  /NUC6i3SYB, BIOS SYSKLi35.86A.0024.2015.1027.2142 10/27/2015
        EIP: __execlists_context_alloc+0x132/0x2d0 [i915]
        Code: 31 d2 89 f0 e8 2f 55 02 00 89 45 e8 3d 00 f0 ff ff 0f 87 11 01 00 00 8b 4d e8 03 4b 30 b8 5a 5a 5a 5a ba 01 00 00 00 8d 79 04 <c7> 01 5a 5a 5a 5a c7 81 fc 0f 00 00 5a 5a 5a 5a 83 e7 fc 29 f9 81
        EAX: 5a5a5a5a EBX: f60ca000 ECX: fe036000 EDX: 00000001
        ESI: f43b7340 EDI: fe036004 EBP: f6389cb8 ESP: f6389c9c
        DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010286
        CR0: 80050033 CR2: fe036000 CR3: 2d361000 CR4: 001506d0
        DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
        DR6: fffe0ff0 DR7: 00000400
        Call Trace:
          execlists_context_alloc+0x10/0x20 [i915]
          intel_context_alloc_state+0x3f/0x70 [i915]
          __intel_context_do_pin+0x117/0x170 [i915]
          i915_gem_do_execbuffer+0xcc7/0x2500 [i915]
          i915_gem_execbuffer2_ioctl+0xcd/0x1f0 [i915]
          drm_ioctl_kernel+0x8f/0xd0
          drm_ioctl+0x223/0x3d0
          __ia32_sys_ioctl+0x1ab/0x760
          __do_fast_syscall_32+0x3f/0x70
          do_fast_syscall_32+0x29/0x60
          do_SYSENTER_32+0x15/0x20
          entry_SYSENTER_32+0x9f/0xf2
        EIP: 0xb7f28559
        Code: 03 74 c0 01 10 05 03 74 b8 01 10 06 03 74 b4 01 10 07 03 74 b0 01 10 08 03 74 d8 01 00 00 00 00 00 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 8d 76 00 58 b8 77 00 00 00 cd 80 90 8d 76
        EAX: ffffffda EBX: 00000005 ECX: c0406469 EDX: bf95556c
        ESI: b7e68000 EDI: c0406469 EBP: 00000005 ESP: bf9554d8
        DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000296
        Modules linked in: i915 x86_pkg_temp_thermal intel_powerclamp crc32_pclmul crc32c_intel intel_cstate intel_uncore intel_gtt drm_kms_helper intel_pch_thermal video button autofs4 i2c_i801 i2c_smbus fan
        CR2: 00000000fe036000
      
      It looks like kasan, xen and i915 are vulnerable.
      
      Actual impact is "on thinkpad X60 in 5.9-rc1, screen starts blinking
      after 30-or-so minutes, and machine is unusable"
      
      [sfr@canb.auug.org.au: ARCH_PAGE_TABLE_SYNC_MASK needs vmalloc.h]
        Link: https://lkml.kernel.org/r/20200825172508.16800a4f@canb.auug.org.au
      [chris@chris-wilson.co.uk: changelog addition]
      [pavel@ucw.cz: changelog addition]
      
      Fixes: 2ba3e694 ("mm/vmalloc: track which page-table levels were modified")
      Fixes: 86cf69f1 ("x86/mm/32: implement arch_sync_kernel_mappings()")
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Chris Wilson <chris@chris-wilson.co.uk>	[x86-32]
      Tested-by: Pavel Machek <pavel@ucw.cz>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@vger.kernel.org>	[5.8+]
      Link: https://lkml.kernel.org/r/20200821123746.16904-1-joro@8bytes.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  25. 05 Sep 2020, 2 commits
  26. 19 Aug 2020, 1 commit