1. 17 June 2021, 3 commits
    • mm/thp: try_to_unmap() use TTU_SYNC for safe splitting · 732ed558
      Committed by Hugh Dickins
      Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
      (!unmap_success): with dump_page() showing mapcount:1, but then its raw
      struct page output showing _mapcount ffffffff i.e.  mapcount 0.
      
      And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
      it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
      and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
      all indicative of some mapcount difficulty in development here perhaps.
      But the !CONFIG_DEBUG_VM path handles the failures correctly and
      silently.
      
      I believe the problem is that once a racing unmap has cleared pte or
      pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
      from try_to_unmap() before the racing task has reached decrementing
      mapcount.
      
      Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
      follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
      TTU_SYNC to the options, and passing that from unmap_page().
      
      Should this be done only when CONFIG_DEBUG_VM, or for non-debug builds
      too?  Consensus is to do the same for both: the slight overhead added
      should rarely matter, except perhaps
      if splitting sparsely-populated multiply-mapped shmem.  Once confident
      that bugs are fixed, TTU_SYNC here can be removed, and the race
      tolerated.
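
      The shape of that change can be sketched as below; this is an
      illustrative fragment in the style of mm/huge_memory.c and mm/rmap.c,
      not the verbatim upstream diff:

        static void unmap_page(struct page *page)
        {
                /* TTU_SYNC keeps the sanity checks below reliable */
                enum ttu_flags ttu_flags = TTU_IGNORE_MLOCK | TTU_SYNC |
                                           TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD;
                bool unmap_success;

                VM_BUG_ON_PAGE(!PageHead(page), page);
                unmap_success = try_to_unmap(page, ttu_flags);
                VM_BUG_ON_PAGE(!unmap_success, page);
        }

        /* ... and in try_to_unmap_one(): force a synchronous walk, so that
         * page_vma_mapped_walk() waits on the page table lock rather than
         * skipping an entry a racing unmapper has just cleared. */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;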
      
      Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
      Fixes: fec89c10 ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: make is_huge_zero_pmd() safe and quicker · 3b77e8c8
      Committed by Hugh Dickins
      Most callers of is_huge_zero_pmd() supply a pmd already verified
      present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
      migration entry, in which the pfn is encoded differently from a present
      pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
      since L1TF forced us to protect against that); or perhaps even crash in
      pmd_page() applied to a swap-like entry.
      
      Make it safe by adding pmd_present() check into is_huge_zero_pmd()
      itself; and make it quicker by saving huge_zero_pfn, so that
      is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.
      
      __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
      but is unnecessary now that is_huge_zero_pmd() checks present.
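
      A minimal sketch of what that implies for the helper (illustrative;
      the exact upstream code may differ in detail):

        /* Cached when the huge zero page is first allocated. */
        extern unsigned long huge_zero_pfn;

        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                /* pmd_present() first: never apply pmd_pfn()/pmd_page()
                 * to a pmd migration (swap-like) entry. */
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }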
      
      Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/thp: fix __split_huge_pmd_locked() on shmem migration entry · 99fa8a48
      Committed by Hugh Dickins
      Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.
      
      Here is v2 batch of long-standing THP bug fixes that I had not got
      around to sending before, but prompted now by Wang Yugui's report
      https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/
      
      Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
      they have done no harm, but have *not* fixed that issue: something more
      is needed and I have no idea of what.
      
      This patch (of 7):
      
      Stressing huge tmpfs page migration racing hole punch often crashed on
      the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush() with a DEBUG_VM=y
      kernel, or shortly afterwards on a bad dereference in
      __split_huge_pmd_locked() when DEBUG_VM=n.  The code forgot to allow for
      pmd migration entries in the non-anonymous case.
      
      Full disclosure: those particular experiments were on a kernel with more
      relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
      vanilla kernel: it is conceivable that stricter locking happens to avoid
      those cases, or makes them less likely; but __split_huge_pmd_locked()
      already allowed for pmd migration entries when handling anonymous THPs,
      so this commit brings the shmem and file THP handling into line.
      
      And while there: use old_pmd rather than _pmd, as in the following
      blocks; and make it clearer to the eye that the !vma_is_anonymous()
      block is self-contained, making an early return after accounting for
      unmapping.
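
      Condensed, the !vma_is_anonymous() block described above looks roughly
      like this (an illustrative sketch, not the exact diff):

        if (!vma_is_anonymous(vma)) {
                old_pmd = pmdp_huge_clear_flush_notify(vma, haddr, pmd);
                if (arch_needs_pgtable_deposit())
                        zap_deposited_table(mm, pmd);
                if (vma_is_special_huge(vma))
                        return;
                if (unlikely(is_pmd_migration_entry(old_pmd))) {
                        /* Racing migration: the pfn lives in a swap entry. */
                        swp_entry_t entry = pmd_to_swp_entry(old_pmd);

                        page = migration_entry_to_page(entry);
                } else {
                        page = pmd_page(old_pmd);
                        if (!PageDirty(page) && pmd_dirty(old_pmd))
                                set_page_dirty(page);
                        if (!PageReferenced(page) && pmd_young(old_pmd))
                                SetPageReferenced(page);
                        page_remove_rmap(page, true);
                        put_page(page);
                }
                /* Account for the unmapped huge page, then return early. */
                add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
                return;
        }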
      
      Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
      Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
      Fixes: e71769ae ("mm: enable thp migration for shmem thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Jue Wang <juew@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 07 May 2021, 1 commit
  3. 06 May 2021, 9 commits
  4. 14 March 2021, 2 commits
  5. 27 February 2021, 1 commit
    • mm,thp,shmem: limit shmem THP alloc gfp_mask · 164cc4fe
      Committed by Rik van Riel
      Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.
      
      The allocation flags of anonymous transparent huge pages can be
      controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag,
      which can help keep the system from getting bogged down in the page
      reclaim and compaction code when many THPs are being allocated
      simultaneously.
      
      However, the gfp_mask for shmem THP allocations was not limited by those
      configuration settings, and some workloads ended up with all CPUs stuck
      on the LRU lock in the page reclaim code, trying to allocate dozens of
      THPs simultaneously.
      
      This patch applies the same configured limitation of THPs to shmem
      hugepage allocations, to prevent that from happening.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
      
      This patch (of 4):
      
      The allocation flags of anonymous transparent huge pages can be
      controlled through the files in /sys/kernel/mm/transparent_hugepage/defrag,
      which can help keep the system from getting bogged down in the page
      reclaim and compaction code when many THPs are being allocated
      simultaneously.
      
      However, the gfp_mask for shmem THP allocations was not limited by those
      configuration settings, and some workloads ended up with all CPUs stuck
      on the LRU lock in the page reclaim code, trying to allocate dozens of
      THPs simultaneously.
      
      This patch applies the same configured limitation of THPs to shmem
      hugepage allocations, to prevent that from happening.
      
      Controlling the gfp_mask of THP allocations through the knobs in sysfs
      allows users to determine the balance between how aggressively the system
      tries to allocate THPs at fault time, and how much the application may end
      up stalling attempting those allocations.
      
      This way a THP defrag setting of "never" or "defer+madvise" will result in
      quick allocation failures without direct reclaim when no 2MB free pages
      are available.
      
      With this patch applied, THP allocations for tmpfs will be a little more
      aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
      less aggressive for files that are not mmapped or mapped without that
      flag.
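
      As a simplified illustration only (thp_limited_gfp() is a hypothetical
      helper name, and the real series consults the full sysfs defrag policy
      rather than just MADV_HUGEPAGE):

        /* Hypothetical, simplified sketch: madvised mappings may pay for
         * direct reclaim/compaction; everything else gets the light mask,
         * which fails quickly when no 2MB pages are free. */
        static gfp_t thp_limited_gfp(struct vm_area_struct *vma)
        {
                if (vma && (vma->vm_flags & VM_HUGEPAGE))
                        return GFP_TRANSHUGE;
                return GFP_TRANSHUGE_LIGHT;
        }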
      
      Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
      Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
      Signed-off-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xu Yu <xuyu@linux.alibaba.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 25 February 2021, 7 commits
  7. 06 February 2021, 1 commit
  8. 16 December 2020, 10 commits
  9. 03 December 2020, 1 commit
    • mm: memcontrol: Use helpers to read page's memcg data · bcfe06bf
      Committed by Roman Gushchin
      Patch series "mm: allow mapping accounted kernel pages to userspace", v6.
      
      Currently a non-slab kernel page which has been charged to a memory cgroup
      can't be mapped to userspace.  The underlying reason is simple: PageKmemcg
      flag is defined as a page type (like buddy, offline, etc), so it takes a
      bit from a page->mapped counter.  Pages with a type set can't be mapped to
      userspace.
      
      But in general the kmemcg flag has nothing to do with mapping to
      userspace.  It only means that the page has been accounted by the page
      allocator, so it has to be properly uncharged on release.
      
      Some bpf maps are mapping the vmalloc-based memory to userspace, and their
      memory can't be accounted because of this implementation detail.
      
      This patchset removes this limitation by moving the PageKmemcg flag into
      one of the free bits of the page->mem_cgroup pointer.  Also it formalizes
      accesses to the page->mem_cgroup and page->obj_cgroups using new helpers,
      adds several checks and removes a couple of obsolete functions.  As the
      result the code became more robust with fewer open-coded bit tricks.
      
      This patch (of 4):
      
      Currently there are many open-coded reads of the page->mem_cgroup pointer,
      as well as a couple of read helpers, which are barely used.
      
      This creates an obstacle to reusing some bits of the pointer for storing
      additional information.  In fact, we already do this for slab pages,
      where the last bit indicates that the pointer has an attached vector of
      objcg pointers instead of a regular memcg pointer.
      
      This commit uses two existing helpers and introduces a new one,
      converting all read sides to calls of these helpers:
        struct mem_cgroup *page_memcg(struct page *page);
        struct mem_cgroup *page_memcg_rcu(struct page *page);
        struct mem_cgroup *page_memcg_check(struct page *page);
      
      page_memcg_check() is intended to be used in cases when the page can be
      a slab page and have a memcg pointer pointing at an objcg vector.  It
      checks the lowest bit and, if that bit is set, returns NULL.
      page_memcg() contains a VM_BUG_ON_PAGE() check that the page is not a
      slab page.
      
      To make sure nobody uses a direct access, struct page's
      mem_cgroup/obj_cgroups is converted to unsigned long memcg_data.
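
      A sketch of the read-side helpers, as described above (simplified;
      page_memcg_rcu() and its RCU/lockdep checks are omitted):

        static inline struct mem_cgroup *page_memcg(struct page *page)
        {
                /* Slab pages carry an objcg vector, not a plain memcg. */
                VM_BUG_ON_PAGE(PageSlab(page), page);
                return (struct mem_cgroup *)page->memcg_data;
        }

        static inline struct mem_cgroup *page_memcg_check(struct page *page)
        {
                unsigned long memcg_data = READ_ONCE(page->memcg_data);

                /* Lowest bit set: memcg_data points at an objcg vector. */
                if (memcg_data & 0x1UL)
                        return NULL;

                return (struct mem_cgroup *)memcg_data;
        }
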
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Link: https://lkml.kernel.org/r/20201027001657.3398190-1-guro@fb.com
      Link: https://lkml.kernel.org/r/20201027001657.3398190-2-guro@fb.com
      Link: https://lore.kernel.org/bpf/20201201215900.3569844-2-guro@fb.com
  10. 23 November 2020, 1 commit
    • mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() · bfe8cc1d
      Committed by Gerald Schaefer
      Alexander reported a syzkaller / KASAN finding on s390; see below for
      the complete output.
      
      In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
      freed in some cases.  In the case of userfaultfd_missing(), this will
      happen after calling handle_userfault(), which might have released the
      mmap_lock.  Therefore, the following pte_free(vma->vm_mm, pgtable) will
      access an unstable vma->vm_mm, which could have been freed or re-used
      already.
      
      For all architectures other than s390 this will go without any negative
      impact, because pte_free() simply frees the page and ignores the
      passed-in mm.  The implementation for SPARC32 would also access
      mm->page_table_lock in pte_free(), but there is no THP support on
      SPARC32, so the buggy code path will not be used there.
      
      For s390, the mm->context.pgtable_list is being used to maintain the 2K
      pagetable fragments, and operating on an already freed or even re-used
      mm could result in various more or less subtle bugs due to list /
      pagetable corruption.
      
      Fix this by calling pte_free() before handle_userfault(), similar to how
      it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
      non-huge_zero_page case.
      
      Commit 6b251fc9 ("userfaultfd: call handle_userfault() for
      userfaultfd_missing() faults") introduced both the
      do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
      changes with respect to calling handle_userfault(), but only in the
      latter case did it put the pte_free() before the handle_userfault()
      call.
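
      The fix amounts to reordering two calls; roughly (a simplified sketch
      of the userfaultfd_missing() branch, error paths omitted):

        if (userfaultfd_missing(vma)) {
                spin_unlock(vmf->ptl);
                /* Free the pre-allocated pagetable while vma->vm_mm is
                 * still guaranteed valid, i.e. before handle_userfault()
                 * may drop mmap_lock and let the vma/mm be freed. */
                pte_free(vma->vm_mm, pgtable);
                ret = handle_userfault(vmf, VM_UFFD_MISSING);
                VM_BUG_ON(ret & VM_FAULT_FALLBACK);
                return ret;
        }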
      
        BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
        Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
      
        CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
        Hardware name: IBM 3906 M04 701 (KVM/Linux)
        Call Trace:
          do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
          create_huge_pmd mm/memory.c:4256 [inline]
          __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
          handle_mm_fault+0x288/0x748 mm/memory.c:4607
          do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
          do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
          pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
          copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
          raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
          _copy_from_user+0x48/0xa8 lib/usercopy.c:16
          copy_from_user include/linux/uaccess.h:192 [inline]
          __do_sys_sigaltstack kernel/signal.c:4064 [inline]
          __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Allocated by task 9334:
          slab_alloc_node mm/slub.c:2891 [inline]
          slab_alloc mm/slub.c:2899 [inline]
          kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
          vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
          __split_vma+0xba/0x560 mm/mmap.c:2742
          split_vma+0xca/0x108 mm/mmap.c:2800
          mlock_fixup+0x4ae/0x600 mm/mlock.c:550
          apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
          do_mlock+0x1aa/0x718 mm/mlock.c:711
          __do_sys_mlock2 mm/mlock.c:738 [inline]
          __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Freed by task 9333:
          slab_free mm/slub.c:3142 [inline]
          kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
          __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
          vma_merge+0x87e/0xce0 mm/mmap.c:1209
          userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
          __fput+0x22c/0x7a8 fs/file_table.c:281
          task_work_run+0x200/0x320 kernel/task_work.c:151
          tracehook_notify_resume include/linux/tracehook.h:188 [inline]
          do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
          system_call+0xe6/0x28c arch/s390/kernel/entry.S:416
      
        The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
        The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
        The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
        raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
        raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
        page dumped because: kasan: bad access detected
        page->mem_cgroup:0000000096959501
      
        Memory state around the buggy address:
         00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
        >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
         00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
         00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Fixes: 6b251fc9 ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 17 October 2020, 4 commits