1. 16 December 2020 (9 commits)
    • mm/lru: replace pgdat lru_lock with lruvec lock · 6168d0da
      Committed by Alex Shi
      This patch moves the per-node lru_lock into the lruvec, giving each
      memcg a lru_lock on each node.  On a large machine, memcgs no longer
      have to contend for a single per-node pgdat->lru_lock; each can go
      fast under its own lru_lock.
      
      Now that the memcg charge is done before lru insertion, page isolation
      serializes the page's memcg, so the per-memcg lruvec lock is stable
      and can replace the per-node lru lock.
      
      In isolate_migratepages_block(), compact_unlock_should_abort() and
      lock_page_lruvec_irqsave() are open-coded to work with compact_control.
      A debug function is also added to the locking path; it may give some
      clues if something gets out of hand.
      
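      A minimal sketch of the resulting locking pattern, using the helpers
      this series introduces (illustrative only, not the exact upstream code):

        struct lruvec *lruvec;
        unsigned long flags;

        /* lock the lruvec the page currently belongs to, not pgdat->lru_lock */
        lruvec = lock_page_lruvec_irqsave(page, &flags);
        /* ... manipulate the page on its per-memcg, per-node lru list ... */
        unlock_page_lruvec_irqrestore(lruvec, flags);
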
      Daniel Jordan's testing shows a 62% improvement on a modified readtwice
      case on his 2P * 10 core * 2 HT Broadwell box.
      https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Hugh Dickins helped polish the patch, thanks!
      
      [alex.shi@linux.alibaba.com: fix comment typo]
        Link: https://lkml.kernel.org/r/5b085715-292a-4b43-50b3-d73dc90d1de5@linux.alibaba.com
      [alex.shi@linux.alibaba.com: use page_memcg()]
        Link: https://lkml.kernel.org/r/5a4c2b72-7ee8-2478-fc0e-85eb83aafec4@linux.alibaba.com
      
      Link: https://lkml.kernel.org/r/1604566549-62481-18-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rong Chen <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6168d0da
    • mm/thp: narrow lru locking · b6769834
      Committed by Alex Shi
      lru_lock and page cache xa_lock have no obvious reason to be taken one
      way round or the other: until now, lru_lock has been taken before page
      cache xa_lock, when splitting a THP; but nothing else takes them
      together.  Reverse that ordering: let's narrow the lru locking - but
      leave local_irq_disable to block interrupts throughout, like before.
      
      Hugh Dickins' point: split_huge_page_to_list() was already silly to be
      using the _irqsave variant: it has just been taking sleeping locks, so
      it would already be broken if entered with interrupts disabled.  So we
      can stop passing the irq flags argument down to __split_huge_page().
      
      Why change the lock ordering here? That was hard to decide.  One reason:
      when this series reaches per-memcg lru locking, it relies on the THP's
      memcg to be stable when taking the lru_lock: that is now done after the
      THP's refcount has been frozen, which ensures page memcg cannot change.
      
      Another reason: previously, lock_page_memcg()'s move_lock was presumed
      to nest inside lru_lock; but now lru_lock must nest inside (page cache
      lock inside) move_lock, so it becomes possible to use lock_page_memcg()
      to stabilize page memcg before taking its lru_lock.  That is not the
      mechanism used in this series, but it is an option we want to keep open.
      
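      A sketch of the nesting that results in __split_huge_page(), as
      described above (simplified; only the lock ordering is the point):

        /* interrupts stay blocked across the whole split, as before */
        local_irq_disable();
        if (mapping)
                xa_lock(&mapping->i_pages);     /* page cache lock taken first now */
        spin_lock(&pgdat->lru_lock);            /* lru locking narrowed: nests inside */
        /* ... set up the tail pages and move them onto the lru lists ... */
        spin_unlock(&pgdat->lru_lock);
        /* ... unlock the page cache, remap and unfreeze the tails ... */
        local_irq_enable();
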
      [hughd@google.com: rewrite commit log]
      
      Link: https://lkml.kernel.org/r/1604566549-62481-5-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6769834
    • mm/thp: simplify lru_add_page_tail() · 6dbb5741
      Committed by Alex Shi
      Simplify lru_add_page_tail(), there are actually only two cases
      possible: split_huge_page_to_list(), with list supplied and head
      isolated from lru by its caller; or split_huge_page(), with NULL list
      and head on lru - because when head is racily isolated from lru, the
      isolator's reference will stop the split from getting any further than
      its page_ref_freeze().
      
      So decide between the two cases by "list", but add VM_WARN_ON()s to
      verify that they match our lru expectations.
      
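      The simplified helper then has roughly this shape (a sketch, not the
      literal upstream code):

        static void lru_add_page_tail(struct page *head, struct page *tail,
                                      struct lruvec *lruvec, struct list_head *list)
        {
                if (list) {
                        /* page reclaim is reclaiming a huge page */
                        VM_WARN_ON(PageLRU(head));
                        get_page(tail);
                        list_add_tail(&tail->lru, list);
                } else {
                        /* head is still on the lru, with the split frozen */
                        VM_WARN_ON(!PageLRU(head));
                        SetPageLRU(tail);
                        list_add_tail(&tail->lru, &head->lru);
                }
        }
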
      [Hugh Dickins: rewrite commit log]
      
      Link: https://lkml.kernel.org/r/1604566549-62481-4-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dbb5741
    • mm/thp: use head for head page in lru_add_page_tail() · 94866635
      Committed by Alex Shi
      Since the first parameter is always the head page, it is better to make
      that explicit and name it accordingly.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-3-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      94866635
    • mm/thp: move lru_add_page_tail() to huge_memory.c · 88dcb9a3
      Committed by Alex Shi
      Patch series "per memcg lru lock", v21.
      
      This patchset includes 3 parts:
      
       1) some code cleanup and minimum optimization as preparation
      
       2) use TestClearPageLRU as page isolation's precondition
      
       3) replace per node lru_lock with per memcg per node lru_lock
      
      Currently there is one lru_lock per node, pgdat->lru_lock, guarding the
      lru lists, even though the lru lists were moved into the memcg long
      ago.  Keeping a per-node lru_lock is clearly unscalable: pages in every
      memcg have to compete with each other for one big lock.  This patchset
      replaces the per-node lru lock with a per-lruvec (per-memcg, per-node)
      lru_lock to guard the lru lists, making it scale with memcgs and
      gaining performance.
      
      Currently lru_lock guards both the lru list and the page's lru bit,
      which is fine.  But if we want to take a page-specific lruvec lock, we
      need to pin down the page's lruvec/memcg while locking: simply taking
      the lruvec lock first can be undermined by a memcg charge or migration
      of the page.  To fix this, clearing the page's lru bit is used as the
      pin-down action that blocks memcg changes; that is the reason for the
      new atomic helper TestClearPageLRU.  Isolating a page now requires both
      actions: TestClearPageLRU and holding the lru_lock.
      
      The typical user is isolate_migratepages_block() in compaction.c: there
      the lru bit must be cleared before taking the lru lock, which
      serializes page isolation against memcg charge/migration, either of
      which would change the page's lruvec and hence the lru_lock inside it.
      
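      A sketch of that isolation pattern (names as used in this series):

        if (TestClearPageLRU(page)) {
                /* the cleared lru bit pins the page's memcg, so the lruvec,
                 * and hence its lru_lock, can no longer change under us */
                lruvec = lock_page_lruvec_irqsave(page, &flags);
                del_page_from_lru_list(page, lruvec, page_lru(page));
                unlock_page_lruvec_irqrestore(lruvec, flags);
        }
        /* else: someone else already isolated it, skip this page */
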
      The above solution was suggested by Johannes Weiner and builds on his
      new memcg charge path, which is what made this patchset possible.
      (Hugh Dickins tested and contributed much code, from the compaction fix
      to general polish, thanks a lot!)
      
      Daniel Jordan's testing shows a 62% improvement on a modified readtwice
      case on his 2P * 10 core * 2 HT Broadwell box on v18, which is not much
      different from this v20.
      
       https://lore.kernel.org/lkml/20200915165807.kpp7uhiw7l3loofu@ca-dmjordan1.us.oracle.com/
      
      Thanks to Hugh Dickins and Konstantin Khlebnikov, who both proposed
      this idea 8 years ago, and to the others who gave comments as well:
      Daniel Jordan, Mel Gorman, Shakeel Butt, Matthew Wilcox, Alexander
      Duyck, etc.
      
      Thanks for the testing support from Intel 0day and Rong Chen, Fengguang
      Wu, and Yun Wang.  Hugh Dickins also shared his kbuild-swap test case.
      
      This patch (of 19):
      
      lru_add_page_tail() is only used in huge_memory.c; defining it in
      another file under a CONFIG_TRANSPARENT_HUGEPAGE guard just looks odd.
      
      Let's move it into the THP code, and make it static as Hugh Dickins
      suggested.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-1-git-send-email-alex.shi@linux.alibaba.com
      Link: https://lkml.kernel.org/r/1604566549-62481-2-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88dcb9a3
    • mm: huge_memory: convert remaining use of sprintf to sysfs_emit and neatening · bfb0ffeb
      Committed by Joe Perches
      Convert the only use of sprintf with struct kobject * that the cocci
      script could not convert.
      
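      The hand-converted case ends up along these lines (a sketch of the
      pattern, assuming a THP sysfs show function like enabled_show()):

        static ssize_t enabled_show(struct kobject *kobj,
                                    struct kobj_attribute *attr, char *buf)
        {
                const char *output;     /* const char * keeps the strings shared */

                if (test_bit(TRANSPARENT_HUGEPAGE_FLAG, &transparent_hugepage_flags))
                        output = "[always] madvise never";
                else
                        output = "always madvise [never]";

                /* sysfs_emit() knows buf is a PAGE_SIZE sysfs buffer */
                return sysfs_emit(buf, "%s\n", output);
        }
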
      Miscellanea:
      
       - Neaten the uses of a constant string with sysfs_emit to use a const
         char * to reduce overall object size
      
      Link: https://lkml.kernel.org/r/7df6be66bbd68e1a0bca9d35aca1341dbf94d2a7.1605376435.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfb0ffeb
    • mm: use sysfs_emit for struct kobject * uses · ae7a927d
      Committed by Joe Perches
      Patch series "mm: Convert sysfs sprintf family to sysfs_emit", v2.
      
      Use the new sysfs_emit family and not the sprintf family.
      
      This patch (of 5):
      
      Use the sysfs_emit function instead of the sprintf family.
      
      Done with cocci script as in commit 3c6bff3c ("RDMA: Convert sysfs
      kobject * show functions to use sysfs_emit()")
      
      Link: https://lkml.kernel.org/r/cover.1605376435.git.joe@perches.com
      Link: https://lkml.kernel.org/r/9c249215bad6df616ba0410ad980042694970c1b.1605376435.git.joe@perches.com
      Signed-off-by: Joe Perches <joe@perches.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae7a927d
    • mm/rmap: always do TTU_IGNORE_ACCESS · 013339df
      Committed by Shakeel Butt
      Since commit 369ea824 ("mm/rmap: update to new mmu_notifier semantic
      v2"), the code that checks the secondary MMU's page table access bit is
      broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from the
      secondary MMU's page table before the check.  This specifically affects
      secondary MMUs which unmap the memory in
      mmu_notifier_invalidate_range_start(), like KVM.
      
      However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
      of the absence of TTU_IGNORE_ACCESS, and it explicitly performs the
      page table access check before trying to unmap the page.  So, at worst,
      reclaim will miss accesses in a very short window if we remove the page
      table access check from the unmapping code.
      
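      For reference, a sketch of the existing ordering in the reclaim path
      (shrink_page_list()-style flow; simplified):

        references = page_check_references(page, sc);   /* tests and clears access bits */
        if (references == PAGEREF_ACTIVATE || references == PAGEREF_KEEP)
                goto keep_locked;               /* recently referenced: not reclaimed */
        /* only unreferenced pages get this far, so re-checking the access
         * bits inside the unmapping code adds nothing */
        if (!try_to_unmap(page, flags))
                goto activate_locked;
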
      There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for
      memcg reclaim.  In memcg reclaim, page_referenced() only accounts
      accesses from processes in the same memcg as the target page, but the
      unmapping code considers accesses from all processes, decreasing the
      effectiveness of memcg reclaim.
      
      The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
      code.
      
      Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
      Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      013339df
    • mm: memcontrol: add file_thp, shmem_thp to memory.stat · b8eddff8
      Committed by Johannes Weiner
      As huge page usage in the page cache and for shmem files proliferates in
      our production environment, the performance monitoring team has asked for
      per-cgroup stats on those pages.
      
      We already track and export anon_thp per cgroup.  We already track file
      THP and shmem THP per node, so making them per-cgroup is only a matter of
      switching from node to lruvec counters.  All callsites are in places where
      the pages are charged and locked, so page->memcg is stable.
      
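      The change is essentially a counter-scope switch, e.g. where a shmem or
      file THP is accounted (a sketch; the exact callsites are in the patch):

        /* was: __inc_node_page_state(page, NR_SHMEM_THPS); */
        __inc_lruvec_page_state(page, NR_SHMEM_THPS);
        /* the lruvec counter rolls up into both the node stat and memory.stat */
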
      [hannes@cmpxchg.org: add documentation]
        Link: https://lkml.kernel.org/r/20201026174029.GC548555@cmpxchg.org
      
      Link: https://lkml.kernel.org/r/20201022151844.489337-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b8eddff8
  2. 23 November 2020 (1 commit)
    • mm/userfaultfd: do not access vma->vm_mm after calling handle_userfault() · bfe8cc1d
      Committed by Gerald Schaefer
      Alexander reported a syzkaller / KASAN finding on s390, see below for
      complete output.
      
      In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
      freed in some cases.  In the case of userfaultfd_missing(), this will
      happen after calling handle_userfault(), which might have released the
      mmap_lock.  Therefore, the following pte_free(vma->vm_mm, pgtable) will
      access an unstable vma->vm_mm, which could have been freed or re-used
      already.
      
      For all architectures other than s390 this will go w/o any negative
      impact, because pte_free() simply frees the page and ignores the
      passed-in mm.  The implementation for SPARC32 would also access
      mm->page_table_lock for pte_free(), but there is no THP support in
      SPARC32, so the buggy code path will not be used there.
      
      For s390, the mm->context.pgtable_list is being used to maintain the 2K
      pagetable fragments, and operating on an already freed or even re-used
      mm could result in various more or less subtle bugs due to list /
      pagetable corruption.
      
      Fix this by calling pte_free() before handle_userfault(), similar to how
      it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
      non-huge_zero_page case.
      
      Commit 6b251fc9 ("userfaultfd: call handle_userfault() for
      userfaultfd_missing() faults") actually introduced the
      handle_userfault() calls in both do_huge_pmd_anonymous_page() and
      __do_huge_pmd_anonymous_page(), but only in the latter did it put the
      pte_free() before calling handle_userfault().
      
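      A sketch of the fixed ordering in do_huge_pmd_anonymous_page()'s
      userfaultfd_missing() path (simplified):

        if (userfaultfd_missing(vma)) {
                spin_unlock(vmf->ptl);
                /* free the preallocated pagetable while vma->vm_mm is still valid */
                pte_free(vma->vm_mm, pgtable);
                ret = handle_userfault(vmf, VM_UFFD_MISSING);   /* may drop mmap_lock */
                VM_BUG_ON(ret & VM_FAULT_FALLBACK);
                return ret;     /* do not touch vma->vm_mm after this point */
        }
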
        BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
        Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334
      
        CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
        Hardware name: IBM 3906 M04 701 (KVM/Linux)
        Call Trace:
          do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
          create_huge_pmd mm/memory.c:4256 [inline]
          __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
          handle_mm_fault+0x288/0x748 mm/memory.c:4607
          do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
          do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
          pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
          copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
          raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
          _copy_from_user+0x48/0xa8 lib/usercopy.c:16
          copy_from_user include/linux/uaccess.h:192 [inline]
          __do_sys_sigaltstack kernel/signal.c:4064 [inline]
          __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Allocated by task 9334:
          slab_alloc_node mm/slub.c:2891 [inline]
          slab_alloc mm/slub.c:2899 [inline]
          kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
          vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
          __split_vma+0xba/0x560 mm/mmap.c:2742
          split_vma+0xca/0x108 mm/mmap.c:2800
          mlock_fixup+0x4ae/0x600 mm/mlock.c:550
          apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
          do_mlock+0x1aa/0x718 mm/mlock.c:711
          __do_sys_mlock2 mm/mlock.c:738 [inline]
          __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
          system_call+0xe0/0x28c arch/s390/kernel/entry.S:415
      
        Freed by task 9333:
          slab_free mm/slub.c:3142 [inline]
          kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
          __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
          vma_merge+0x87e/0xce0 mm/mmap.c:1209
          userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
          __fput+0x22c/0x7a8 fs/file_table.c:281
          task_work_run+0x200/0x320 kernel/task_work.c:151
          tracehook_notify_resume include/linux/tracehook.h:188 [inline]
          do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
          system_call+0xe6/0x28c arch/s390/kernel/entry.S:416
      
        The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
        The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
        The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
        raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
        raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
        page dumped because: kasan: bad access detected
        page->mem_cgroup:0000000096959501
      
        Memory state around the buggy address:
         00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
         00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
        >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                              ^
         00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
         00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        ==================================================================
      
      Fixes: 6b251fc9 ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
      Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
      Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: <stable@vger.kernel.org>	[4.3+]
      Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bfe8cc1d
  3. 17 October 2020 (6 commits)
  4. 14 October 2020 (1 commit)
  5. 28 September 2020 (1 commit)
    • mm/thp: Split huge pmds/puds if they're pinned when fork() · d042035e
      Committed by Peter Xu
      Pinned pages shouldn't be write-protected when fork() happens, because
      follow up copy-on-write on these pages could cause the pinned pages to
      be replaced by random newly allocated pages.
      
      For huge PMDs, we split the huge pmd if pinning is detected, so that
      further handling is done at the PTE level (with our latest changes,
      each of the small pages will be copied).  We achieve this by letting
      copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall
      through in copy_pmd_range() and finally land in the following
      copy_pte_range() call.
      
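      A sketch of the check added to copy_huge_pmd() (simplified; helper
      names are those of this kernel generation, e.g. page_maybe_dma_pinned()):

        if (unlikely(is_cow_mapping(vma->vm_flags) &&
                     atomic_read(&src_mm->has_pinned) &&
                     page_maybe_dma_pinned(src_page))) {
                /* pinned: don't write-protect, split and redo at the pte level */
                pte_free(dst_mm, pgtable);
                spin_unlock(src_ptl);
                spin_unlock(dst_ptl);
                __split_huge_pmd(vma, src_pmd, addr, false, NULL);
                return -EAGAIN; /* copy_pmd_range() falls through to copy_pte_range() */
        }
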
      Huge PUDs are even more special: so far they do not support anonymous
      pages.  But they can be handled the same way as huge PMDs, even though
      splitting a huge PUD means erasing the PUD entries.  That guarantees
      that the follow-up fault-ins will remap the same pages in either parent
      or child later.
      
      This might not be the most efficient way, but it should be easy and
      clean enough.  It should be fine, since we are handling a very rare
      case, just to make sure userspace that pinned some THPs still works
      even without MADV_DONTFORK after fork().
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d042035e
  6. 20 September 2020 (1 commit)
    • mm/thp: fix __split_huge_pmd_locked() for migration PMD · ec0abae6
      Committed by Ralph Campbell
      A migrating transparent huge page has to already be unmapped.  Otherwise,
      the page could be modified while it is being copied to a new page and data
      could be lost.  The function __split_huge_pmd() checks for a PMD migration
      entry before calling __split_huge_pmd_locked() leading one to think that
      __split_huge_pmd_locked() can handle splitting a migrating PMD.
      
      However, the code always increments the page->_mapcount and adjusts the
      memory control group accounting assuming the page is mapped.
      
      Also, if the PMD entry is a migration PMD entry, the call to
      is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
      of migration_entry_to_pfn(pmd_to_swp_entry(pmd)).  Fix these problems by
      checking for a PMD migration entry.
      
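      A sketch of the fix in __split_huge_pmd_locked() (simplified):

        pmd_migration = is_pmd_migration_entry(old_pmd);
        if (unlikely(pmd_migration)) {
                swp_entry_t entry = pmd_to_swp_entry(old_pmd);

                /* pmd_pfn() is meaningless for a migration entry */
                page = pfn_to_page(swp_offset(entry));
        } else {
                page = pmd_page(old_pmd);
                /* only here: _mapcount and memcg accounting for a mapped page */
        }
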
      Fixes: 84c3fc4e ("mm: thp: check pmd migration entry in common path")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Ben Skeggs <bskeggs@redhat.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec0abae6
  7. 05 September 2020 (1 commit)
  8. 04 September 2020 (1 commit)
    • mm: Preserve the PG_arch_2 flag in __split_huge_page_tail() · 72e6afa0
      Committed by Catalin Marinas
      When a huge page is split into normal pages, part of the head page flags
      are transferred to the tail pages. However, the PG_arch_* flags are not
      part of the preserved set.
      
      PG_arch_2 is used by the arm64 MTE support to mark pages that have valid
      tags. The absence of such flag would cause the arm64 set_pte_at() to
      clear the tags in order to avoid stale tags exposed to user or the
      swapping out hooks to ignore the tags. Not preserving PG_arch_2 on huge
      page splitting leads to tag corruption in the tail pages.
      
      Preserve the newly added PG_arch_2 flag in __split_huge_page_tail().
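
      The fix simply adds PG_arch_2 to the flag mask copied from head to tail
      in __split_huge_page_tail(); roughly (illustrative subset of the mask):

        page_tail->flags |= (head->flags &
                        ((1L << PG_referenced) |
                         (1L << PG_swapbacked) |
                         (1L << PG_active)     |
        #ifdef CONFIG_64BIT
                         (1L << PG_arch_2)     |        /* arm64 MTE: tags are valid */
        #endif
                         (1L << PG_dirty)));
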
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      72e6afa0
  9. 13 August 2020 (2 commits)
  10. 08 August 2020 (3 commits)
  11. 10 June 2020 (2 commits)
  12. 05 June 2020 (1 commit)
  13. 04 June 2020 (8 commits)
  14. 03 June 2020 (1 commit)
    • gup: document and work around "COW can break either way" issue · 17839856
      Committed by Linus Torvalds
      Doing a "get_user_pages()" on a copy-on-write page for reading can be
      ambiguous: the page can be COW'ed at any time afterwards, and the
      direction of a COW event isn't defined.
      
      Yes, whoever writes to it will generally do the COW, but if the thread
      that did the get_user_pages() unmapped the page before the write (and
      that could happen due to memory pressure in addition to any outright
      action), the writer could also just take over the old page instead.
      
      End result: the get_user_pages() call might result in a page pointer
      that is no longer associated with the original VM, and is associated
      with - and controlled by - another VM having taken it over instead.
      
      So when doing a get_user_pages() on a COW mapping, the only really safe
      thing to do would be to break the COW when getting the page, even when
      only getting it for reading.
      
      At the same time, some users simply don't even care.
      
      For example, the perf code wants to look up the page not because it
      cares about the page, but because the code simply wants to look up the
      physical address of the access for informational purposes, and doesn't
      really care about races when a page might be unmapped and remapped
      elsewhere.
      
      This adds logic to force a COW event by setting FOLL_WRITE on any
      copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
      pointer as a result.
      
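      A sketch of that logic (names approximate; the patch itself is mostly
      comments around a check of this shape):

        static inline bool should_force_cow_break(struct vm_area_struct *vma,
                                                  unsigned int flags)
        {
                /* private COW mapping, and the caller will keep the page */
                return is_cow_mapping(vma->vm_flags) && (flags & (FOLL_GET | FOLL_PIN));
        }

        /* in the gup slow path, before following the page: */
        if (should_force_cow_break(vma, foll_flags))
                foll_flags |= FOLL_WRITE;       /* break COW even for a read */
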
      The current semantics end up being:
      
       - __get_user_pages_fast(): no change. If you don't ask for a write,
         you won't break COW. You'd better know what you're doing.
      
       - get_user_pages_fast(): the fast-case "look it up in the page tables
         without anything getting mmap_sem" now refuses to follow a read-only
         page, since it might need COW breaking.  Which happens in the slow
         path - the fast path doesn't know if the memory might be COW or not.
      
       - get_user_pages() (including the slow-path fallback for gup_fast()):
         for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
         very similar semantics to FOLL_FORCE.
      
      If it turns out that we want finer granularity (ie "only break COW when
      it might actually matter" - things like the zero page are special and
      don't need to be broken) we might need to push these semantics deeper
      into the lookup fault path.  So if people care enough, it's possible
      that we might end up adding a new internal FOLL_BREAK_COW flag to go
      with the internal FOLL_COW flag we already have for tracking "I had a
      COW".
      
      Alternatively, if it turns out that different callers might want to
      explicitly control the forced COW break behavior, we might even want to
      make such a flag visible to the users of get_user_pages() instead of
      using the above default semantics.
      
      But for now, this is mostly commentary on the issue (this commit message
      being a lot bigger than the patch, and that patch in turn is almost all
      comments), with that minimal "enable COW breaking early" logic using the
      existing FOLL_WRITE behavior.
      
      [ It might be worth noting that we've always had this ambiguity, and it
        could arguably be seen as a user-space issue.
      
        You only get private COW mappings that could break either way in
        situations where user space is doing cooperative things (ie fork()
        before an execve() etc), but it _is_ surprising and very subtle, and
        fork() is supposed to give you independent address spaces.
      
        So let's treat this as a kernel issue and make the semantics of
        get_user_pages() easier to understand. Note that obviously a true
        shared mapping will still get a page that can change under us, so this
        does _not_ mean that get_user_pages() somehow returns any "stable"
        page ]
      Reported-by: Jann Horn <jannh@google.com>
      Tested-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Kirill Shutemov <kirill@shutemov.name>
      Acked-by: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17839856
  15. 05 May 2020 (1 commit)
  16. 08 April 2020 (1 commit)
    • userfaultfd: wp: support swap and page migration · f45ec5ff
      Committed by Peter Xu
      For both swap and page migration, bit 2 of the entry is used to
      identify whether the entry is uffd write-protected.  It plays a role
      similar to the existing soft-dirty bit in swap entries, but it only
      keeps the uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we recover the uffd-wp bit from a
      swap/migration entry into the PTE, we also need to take care of the
      _PAGE_RW bit and make sure it is cleared; otherwise, even with the
      _PAGE_UFFD_WP bit set, we cannot trap the write at all.
      
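      A sketch of the recovery step, e.g. when do_swap_page() rebuilds the
      PTE (simplified):

        pte = mk_pte(page, vma->vm_page_prot);
        if (pte_swp_uffd_wp(vmf->orig_pte)) {
                /* bit 2 of the swap/migration entry carried the uffd-wp state */
                pte = pte_mkuffd_wp(pte);
                pte = pte_wrprotect(pte);       /* _PAGE_RW must be clear to trap the write */
        }
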
      Previously, change_pte_range() did nothing for uffd if the PTE was a
      swap entry.  That can lead to data mismatch if the page we are going to
      write-protect is swapped out when UFFDIO_WRITEPROTECT is sent.  This
      patch applies/removes the uffd-wp bit for swap entries as well.
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff