1. 27 Feb 2021, 4 commits
  2. 25 Feb 2021, 4 commits
    • mm: rmap: explicitly reset vma->anon_vma in unlink_anon_vmas() · ee8ab190
      Authored by Li Xinhai
      If the vma will continue to be used after unlinking its relevant
      anon_vma, we need to reset the vma->anon_vma pointer to NULL.  Then,
      when a fault happens within this vma again, a new anon_vma will be
      prepared.

      This way, the vma will only be checked for reverse mapping of pages
      which have been faulted in after the unlink_anon_vmas() call.
      
      Currently, the mremap with MREMAP_DONTUNMAP scenario continues to use
      the vma after its page table entries have been moved to a new vma.  In
      other scenarios, the vma itself is freed after calling
      unlink_anon_vmas().
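
      A minimal sketch of the idea (hedged; illustrative, not the exact
      hunk):

        void unlink_anon_vmas(struct vm_area_struct *vma)
        {
                ...     /* detach and free the anon_vma_chain entries */
                /*
                 * The vma may outlive this call (e.g. MREMAP_DONTUNMAP),
                 * so make a later fault prepare a fresh anon_vma.
                 */
                vma->anon_vma = NULL;
        }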
      
      Link: https://lkml.kernel.org/r/20210119075126.3513154-1-lixinhai.lxh@gmail.com
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ee8ab190
    • mm: memcontrol: convert NR_FILE_PMDMAPPED account to pages · 380780e7
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the error can amount
      to GBs of memory.  For example, on a 96-CPU system the per-cpu
      threshold reaches its maximum of 125, and the per-cpu counters can
      cache 23.4375 GB in total.
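
      A quick back-of-the-envelope check of that figure (assuming the old
      scheme, where one counter unit represents one 2 MiB THP):

        125 (max threshold) * 96 CPUs * 2 MiB = 24000 MiB = 23.4375 GiB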
      
      A THP update is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-cpu batching seems
      sensible.  Every THP stats update then overflows the per-cpu counter
      and resorts to an atomic global update, but it makes the THP vmstat
      counters more accurate.
      
      So convert the NR_FILE_PMDMAPPED accounting to pages.  This is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival") and also makes the units of the vmstat counters more uniform.
      The vmstat counters then come in pages, kB and bytes: the B/KB suffix
      tells us that the unit is bytes or kB, and the rest, without a suffix,
      are pages.
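
      A minimal sketch of the direction of the conversion in the rmap
      accounting path (hypothetical hunk, not the exact code;
      __mod_lruvec_page_state() and HPAGE_PMD_NR are assumptions here):

        /* before: one counter unit per PMD-mapped THP */
        __mod_lruvec_page_state(page, NR_FILE_PMDMAPPED, 1);

        /* after: counted in base pages (HPAGE_PMD_NR == 512 with 4K pages) */
        __mod_lruvec_page_state(page, NR_FILE_PMDMAPPED, HPAGE_PMD_NR);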
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-7-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      380780e7
    • mm: memcontrol: convert NR_SHMEM_PMDMAPPED account to pages · a1528e21
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the error can amount
      to GBs of memory.  For example, on a 96-CPU system the per-cpu
      threshold reaches its maximum of 125, and the per-cpu counters can
      cache 23.4375 GB in total.

      A THP update is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-cpu batching seems
      sensible.  Every THP stats update then overflows the per-cpu counter
      and resorts to an atomic global update, but it makes the THP vmstat
      counters more accurate.

      So convert the NR_SHMEM_PMDMAPPED accounting to pages.  This is
      consistent with 8f182270 ("mm/swap.c: flush lru pvecs on compound page
      arrival") and also makes the units of the vmstat counters more uniform.
      The vmstat counters then come in pages, kB and bytes: the B/KB suffix
      tells us that the unit is bytes or kB, and the rest, without a suffix,
      are pages.
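
      A minimal reader-side sketch (hypothetical, not the exact /proc/meminfo
      hunk; global_node_page_state() is an assumption here): once the counter
      is kept in base pages, conversion to kB works the same as for every
      other page-based counter:

        unsigned long pages = global_node_page_state(NR_SHMEM_PMDMAPPED);

        seq_printf(m, "ShmemPmdMapped: %8lu kB\n", pages << (PAGE_SHIFT - 10));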
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-6-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a1528e21
    • mm: memcontrol: convert NR_ANON_THPS account to pages · 69473e5d
      Authored by Muchun Song
      Currently we use struct per_cpu_nodestat to cache the vmstat counters,
      which leads to inaccurate statistics, especially for the THP vmstat
      counters.  On systems with hundreds of processors the error can amount
      to GBs of memory.  For example, on a 96-CPU system the per-cpu
      threshold reaches its maximum of 125, and the per-cpu counters can
      cache 23.4375 GB in total.

      A THP update is already a form of batched addition (it adds 512 pages'
      worth of memory in one go), so skipping the per-cpu batching seems
      sensible.  Every THP stats update then overflows the per-cpu counter
      and resorts to an atomic global update, but it makes the THP vmstat
      counters more accurate.

      So convert the NR_ANON_THPS accounting to pages.  This is consistent
      with 8f182270 ("mm/swap.c: flush lru pvecs on compound page arrival")
      and also makes the units of the vmstat counters more uniform.  The
      vmstat counters then come in pages, kB and bytes: the B/KB suffix
      tells us that the unit is bytes or kB, and the rest, without a suffix,
      are pages.
      
      Link: https://lkml.kernel.org/r/20201228164110.2838-3-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Rafael. J. Wysocki <rafael@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Gupta <pankaj.gupta@cloud.ionos.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69473e5d
  3. 16 Dec 2020, 3 commits
    • mm/lru: revise the comments of lru_lock · 15b44736
      Authored by Hugh Dickins
      Since we changed pgdat->lru_lock to lruvec->lru_lock, it's time to fix
      the now-incorrect comments in the code.  Also fix some ancient
      zone->lru_lock comment errors, etc.
      
      I struggled to understand the comment above move_pages_to_lru() (surely
      it never calls page_referenced()), and eventually realized that most of
      it had got separated from shrink_active_list(): move that comment back.
      
      Link: https://lkml.kernel.org/r/1604566549-62481-20-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15b44736
    • mm/rmap: stop store reordering issue on page->mapping · 16f5e707
      Authored by Alex Shi
      Hugh Dickins and Minchan Kim observed a long-standing issue, discussed
      here, but the fix mentioned in

        https://lore.kernel.org/lkml/20150504031722.GA2768@blaptop/

      was missed.
      
      The store reordering may cause problem in the scenario:
      
      	CPU 0						CPU1
         do_anonymous_page
      	page_add_new_anon_rmap()
      	  page->mapping = anon_vma + PAGE_MAPPING_ANON
      	lru_cache_add_inactive_or_unevictable()
      	  spin_lock(lruvec->lock)
      	  SetPageLRU()
      	  spin_unlock(lruvec->lock)
						/* idle tracking judged it as an LRU
						 * page, so pass the page into
						 * page_idle_clear_pte_refs
						 */
      						page_idle_clear_pte_refs
      						  rmap_walk
      						    if PageAnon(page)
      
      Johannes gave a detailed example of how the store reordering could
      cause trouble: "The concern is that SetPageLRU may get reordered before
      the 'page->mapping' store, which would make CPU 1 look at page->mapping
      after observing PageLRU set on the page.
      
      1. anon_vma + PAGE_MAPPING_ANON
      
         That's the in-order scenario and is fine.
      
      2. NULL
      
         That's possible if the page->mapping store gets reordered to occur
         after SetPageLRU. That's fine too because we check for it.
      
      3. anon_vma without the PAGE_MAPPING_ANON bit
      
         That would be a problem and could lead to all kinds of undesirable
         behavior including crashes and data corruption.
      
         Is it possible? AFAICT the compiler is allowed to tear the store to
         page->mapping and I don't see anything that would prevent it.
      
      That said, I also don't see how the reader testing PageLRU under the
      lru_lock would prevent that in the first place.  AFAICT we need that
      WRITE_ONCE() around the page->mapping assignment."
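
      A sketch of the resulting fix in __page_set_anon_rmap() (hedged; close
      to, but not necessarily, the exact hunk):

        anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
        /*
         * Publish with WRITE_ONCE() so the compiler can neither tear the
         * store nor move it around unexpectedly; the reader side can then
         * safely test the PAGE_MAPPING_ANON bit it reads back.
         */
        WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);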
      
      [alex.shi@linux.alibaba.com: updated for comments change from Johannes]
        Link: https://lkml.kernel.org/r/e66ef2e5-c74c-6498-e8b3-56c37b9d2d15@linux.alibaba.com
      
      Link: https://lkml.kernel.org/r/1604566549-62481-7-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: "Chen, Rong A" <rong.a.chen@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mika Penttilä <mika.penttila@nextfour.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      16f5e707
    • mm/rmap: always do TTU_IGNORE_ACCESS · 013339df
      Authored by Shakeel Butt
      Since commit 369ea824 ("mm/rmap: update to new mmu_notifier semantic
      v2"), the code to check the secondary MMU's page table access bit is
      broken for !(TTU_IGNORE_ACCESS), because the page is unmapped from the
      secondary MMU's page table before the check.  This specifically affects
      secondary MMUs which unmap the memory in
      mmu_notifier_invalidate_range_start(), such as KVM.
      
      However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
      of the absence of TTU_IGNORE_ACCESS, and it explicitly performs the
      page table access check before trying to unmap the page.  So, at worst,
      reclaim will miss accesses in a very short window if we remove the page
      table access check from the unmapping code.

      There is also an unintended consequence of !(TTU_IGNORE_ACCESS) for
      memcg reclaim: in memcg reclaim, page_referenced() only accounts
      accesses from processes in the same memcg as the target page, but the
      unmapping code considers accesses from all processes, decreasing the
      effectiveness of memcg reclaim.
      
      The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
      code.
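
      A hedged sketch of what "always assume TTU_IGNORE_ACCESS" means in
      try_to_unmap_one() (illustrative; not necessarily the exact hunk):

        /* removed: the young-bit check that was only skipped when the
         * caller passed TTU_IGNORE_ACCESS */
        if (!(flags & TTU_IGNORE_ACCESS)) {
                if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                        ret = false;
                        page_vma_mapped_walk_done(&pvmw);
                        break;
                }
        }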
      
      Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
      Fixes: 369ea824 ("mm/rmap: update to new mmu_notifier semantic v2")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      013339df
  4. 15 Nov 2020, 1 commit
    • hugetlbfs: fix anon huge page migration race · 336bf30e
      Authored by Mike Kravetz
      Qian Cai reported the following BUG in [1]
      
        LTP: starting move_pages12
        BUG: unable to handle page fault for address: ffffffffffffffe0
        ...
        RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
        Call Trace:
          rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
          try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
          migrate_pages+0x1005/0x1fb0
          move_pages_and_store_status.isra.47+0xd7/0x1a0
          __x64_sys_move_pages+0xa5c/0x1100
          do_syscall_64+0x5f/0x310
          entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Hugh Dickins diagnosed this as a migration bug caused by code introduced
      to use i_mmap_rwsem for pmd sharing synchronization.  Specifically, the
      routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
      flag to try_to_unmap() while holding i_mmap_rwsem.  This is wrong for
      anon pages as the anon_vma_lock should be held in this case.  Further
      analysis suggested that i_mmap_rwsem was not required to be held at all
      when calling try_to_unmap for anon pages as an anon page could never be
      part of a shared pmd mapping.
      
      Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
      to drop page lock and acquire i_mmap_rwsem is wrong.  There is no way to
      keep mapping valid while dropping page lock.
      
      This patch does the following:
      
       - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
         calling try_to_unmap.
      
       - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
         will now simply do a 'trylock' while still holding the page lock. If
         the trylock fails, it will return NULL. This could impact the
         callers:
      
          - migration calling code will receive -EAGAIN and retry up to the
            hard coded limit (10).
      
          - memory error code will treat the page as BUSY. This will force
            killing (SIGKILL) of any mapping tasks instead of sending SIGBUS.
      
         Do note that this change in behavior only happens when there is a
         race. None of the standard kernel testing suites actually hit this
         race, but it is possible.
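
      A minimal sketch of the new hugetlb_page_mapping_lock_write() behaviour
      described in the second item above (hedged; not the exact hunk):

        /* still holding the page lock; no drop-and-reacquire games */
        if (i_mmap_trylock_write(mapping))
                return mapping;

        return NULL;    /* callers treat this as busy / -EAGAIN */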
      
      [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
      [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Reported-by: Qian Cai <cai@lca.pw>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      336bf30e
  5. 17 Oct 2020, 1 commit
  6. 06 Sep 2020, 1 commit
    • mm/rmap: fixup copying of soft dirty and uffd ptes · ad7df764
      Authored by Alistair Popple
      During memory migration a pte is temporarily replaced with a migration
      swap pte.  Some pte bits from the existing mapping such as the soft-dirty
      and uffd write-protect bits are preserved by copying these to the
      temporary migration swap pte.
      
      However these bits are not stored at the same location for swap and
      non-swap ptes.  Therefore testing these bits requires using the
      appropriate helper function for the given pte type.
      
      Unfortunately several code locations were found where the wrong helper
      function is being used to test soft_dirty and uffd_wp bits which leads to
      them getting incorrectly set or cleared during page-migration.
      
      Fix these by using the correct tests based on pte type.
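
      A hedged sketch of the kind of fix involved when building a migration
      entry (illustrative only; the real change touches several sites):

        /* pteval may be a normal pte or already a swap pte (device private):
         * pick the matching helper when copying bits into the new swap pte */
        if (pte_present(pteval) ? pte_soft_dirty(pteval)
                                : pte_swp_soft_dirty(pteval))
                swp_pte = pte_swp_mksoft_dirty(swp_pte);
        if (pte_present(pteval) ? pte_uffd_wp(pteval)
                                : pte_swp_uffd_wp(pteval))
                swp_pte = pte_swp_mkuffd_wp(swp_pte);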
      
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Fixes: 8c3328f1 ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
      Fixes: f45ec5ff ("userfaultfd: wp: support swap and page migration")
      Signed-off-by: Alistair Popple <alistair@popple.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad7df764
  7. 15 Aug 2020, 2 commits
  8. 13 Aug 2020, 1 commit
    • hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem · 34ae204f
      Authored by Mike Kravetz
      Commit c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
      synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
      in at least read mode.  This is because the explicit locking in
      huge_pmd_share (called by huge_pte_alloc) was removed.  When restructuring
      the code, the call to huge_pte_alloc in the else block at the beginning of
      hugetlb_fault was missed.
      
      Unfortunately, that else clause is exercised when there is no page table
      entry.  This will likely lead to a call to huge_pmd_share.  If
      huge_pmd_share thinks pmd sharing is possible, it will traverse the
      mapping tree (i_mmap) without holding i_mmap_rwsem.  If someone else is
      modifying the tree, bad things such as addressing exceptions or worse
      could happen.
      
      Simply remove the else clause.  It should have been removed previously.
      The code following the else will call huge_pte_alloc with the appropriate
      locking.
      
      To prevent this type of issue in the future, add routines to assert that
      i_mmap_rwsem is held, and call these routines in huge pmd sharing
      routines.
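
      A hedged sketch of the kind of assertion helper added (close to, but
      not necessarily, the exact definitions):

        static inline void i_mmap_assert_locked(struct address_space *mapping)
        {
                lockdep_assert_held(&mapping->i_mmap_rwsem);
        }

        static inline void i_mmap_assert_write_locked(struct address_space *mapping)
        {
                lockdep_assert_held_write(&mapping->i_mmap_rwsem);
        }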
      
      Fixes: c0d0381a ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A.Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34ae204f
  9. 10 Jun 2020, 1 commit
  10. 04 Jun 2020, 2 commits
  11. 08 Apr 2020, 4 commits
    • mm: prevent a warning when casting void* -> enum · 4708f318
      Authored by Palmer Dabbelt
      I recently built the RISC-V port with LLVM trunk, which has introduced
      a new warning when casting from a pointer to an enum of a smaller size.
      This patch simply casts to a long in the middle to stop the warning.
      I'd be surprised if this is the only such cast in the kernel, but it's
      the only one I saw.
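
      A hedged sketch of the pattern being silenced (hypothetical variable
      names):

        void *arg = ...;                            /* opaque callback argument */
        enum ttu_flags flags;

        flags = (enum ttu_flags)arg;                /* warns: void * -> smaller enum */
        flags = (enum ttu_flags)(unsigned long)arg; /* quiet: go through a long first */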
      Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4708f318
    • userfaultfd: wp: support swap and page migration · f45ec5ff
      Authored by Peter Xu
      For both swap and page migration, we use bit 2 of the entry to identify
      whether the entry is uffd write-protected.  It plays a similar role to
      the existing soft-dirty bit in swap entries, but only for keeping the
      uffd-wp tracking for a specific PTE/PMD.
      
      Something special here is that when we want to recover the uffd-wp bit
      from a swap/migration entry to the PTE bit we'll also need to take care of
      the _PAGE_RW bit and make sure it's cleared, otherwise even with the
      _PAGE_UFFD_WP bit we can't trap it at all.
      
      In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
      That can lead to data mismatch if the page that we are going to write
      protect is swapped out when sending the UFFDIO_WRITEPROTECT.  This patch
      also applies/removes the uffd-wp bit even for the swap entries.
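
      A hedged sketch of the swap-entry side of change_pte_range()
      (illustrative, not the exact hunk; uffd_wp and uffd_wp_resolve stand
      for the requested write-protect and resolve operations):

        /* the pte is a swap/migration entry: apply or remove the bit there too */
        if (uffd_wp)
                newpte = pte_swp_mkuffd_wp(newpte);
        else if (uffd_wp_resolve)
                newpte = pte_swp_clear_uffd_wp(newpte);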
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f45ec5ff
    • mm: remove CONFIG_TRANSPARENT_HUGE_PAGECACHE · 396bcc52
      Authored by Matthew Wilcox (Oracle)
      Commit e496cf3d ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
      notes that it should be reverted once the PowerPC problem was fixed.
      The commit fixing the PowerPC problem (953c66c2) did not revert it;
      instead it set CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same value as
      CONFIG_TRANSPARENT_HUGEPAGE.  Checking with Kirill and Aneesh, this was
      an oversight, so remove the Kconfig symbol and undo the work of commit
      e496cf3d.
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      396bcc52
    • Revert "mm/rmap.c: reuse mergeable anon_vma as parent when fork" · 23ab76bf
      Authored by Li Xinhai
      This reverts commit 4e4a9eb9 ("mm/rmap.c: reuse mergeable
      anon_vma as parent when fork").
      
      In dup_mmap(), anon_vma_fork() is called to attach the anon_vma, and
      the parameter 'tmp' (i.e., the new vma of the child) has the same
      ->vm_next and ->vm_prev as its parent vma.  That causes the anon_vma
      used by the parent to be mistakenly shared by the child (in
      anon_vma_clone(), the code added by that commit does this reuse work).
      
      Besides this issue, the design of reusing an anon_vma from a vma which
      has gone through fork should be avoided ([1]).  So, this patch reverts
      that commit and maintains the consistent logic of reusing anon_vma for
      fork/split/merge of vmas.
      
      Reusing anon_vma within the process is fine.  But if a vma has gone
      through fork(), then that vma's anon_vma should not be shared with its
      neighbor vma.  As explained in [1], when a vma has gone through fork(),
      the check for list_is_singular(vma->anon_vma_chain) will be false, so
      the anon_vma is not shared.
      
      With the current issue, one example clarifies things further.  The
      parent process does the following two steps:

      1. p_vma_1 is created and p_anon_vma_1 is prepared;

      2. p_vma_2 is created and shares p_anon_vma_1 (this is allowed,
         because p_vma_1 didn't go through fork()); then the parent process
         does fork():
      
      3. c_vma_1 is dup from p_vma_1, and has its own c_anon_vma_1
         prepared; at this point, c_vma_1->anon_vma_chain has two items, one
         for p_anon_vma_1 and one for c_anon_vma_1;
      
      4. c_vma_2 is dup'ed from p_vma_2; it is not allowed to share
         c_anon_vma_1, because c_vma_1->anon_vma_chain has two items.

      [1] commit d0e9fe17 ("Simplify and comment on anon_vma re-use for
          anon_vma_prepare()") explains the test of "list_is_singular()".
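
      The test referenced in [1] looks roughly like this (hedged sketch of
      the anon_vma reuse path, not the exact code):

        static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
                        struct vm_area_struct *a, struct vm_area_struct *b)
        {
                if (anon_vma_compatible(a, b)) {
                        struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);

                        /* a vma that went through fork() has at least two
                         * entries on its anon_vma_chain, so it is not reused */
                        if (anon_vma && list_is_singular(&old->anon_vma_chain))
                                return anon_vma;
                }
                return NULL;
        }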
      
      Fixes: 4e4a9eb9 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
      Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23ab76bf
  12. 03 Apr 2020, 3 commits
    • hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization · c0d0381a
      Authored by Mike Kravetz
      Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.
      
      While discussing the issue with huge_pte_offset [1], I remembered that
      there were more outstanding hugetlb races.  These issues are:
      
      1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
         invalid via a call to huge_pmd_unshare by another thread.
      2) hugetlbfs page faults can race with truncation causing invalid global
         reserve counts and state.
      
      A previous attempt was made to use i_mmap_rwsem in this manner as
      described at [2].  However, those patches were reverted starting with [3]
      due to locking issues.
      
      To effectively use i_mmap_rwsem to address the above issues it needs to be
      held (in read mode) during page fault processing.  However, during fault
      processing we need to lock the page we will be adding.  Lock ordering
      requires we take page lock before i_mmap_rwsem.  Waiting until after
      taking the page lock is too late in the fault process for the
      synchronization we want to do.
      
      To address this lock ordering issue, the following patches change the lock
      ordering for hugetlb pages.  This is not too invasive as hugetlbfs
      processing is done separate from core mm in many places.  However, I don't
      really like this idea.  Much ugliness is contained in the new routine
      hugetlb_page_mapping_lock_write() of patch 1.
      
      The only other way I can think of to address these issues is by catching
      all the races.  After catching a race, cleanup, backout, retry ...  etc,
      as needed.  This can get really ugly, especially for huge page
      reservations.  At one time, I started writing some of the reservation
      backout code for page faults and it got so ugly and complicated I went
      down the path of adding synchronization to avoid the races.  Any other
      suggestions would be welcome.
      
      [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
      [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
      [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
      [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
      [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/
      
      This patch (of 2):
      
      While looking at BUGs associated with invalid huge page map counts, it was
      discovered and observed that a huge pte pointer could become 'invalid' and
      point to another task's page table.  Consider the following:
      
      A task takes a page fault on a shared hugetlbfs file and calls
      huge_pte_alloc to get a ptep.  Suppose the returned ptep points to a
      shared pmd.
      
      Now, another task truncates the hugetlbfs file.  As part of truncation, it
      unmaps everyone who has the file mapped.  If the range being truncated is
      covered by a shared pmd, huge_pmd_unshare will be called.  For all but the
      last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
      to the pmd.  If the task in the middle of the page fault is not the last
      user, the ptep returned by huge_pte_alloc now points to another task's
      page table or worse.  This leads to bad things such as incorrect page
      map/reference counts or invalid memory references.
      
      To fix, expand the use of i_mmap_rwsem as follows:
      - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
        huge_pmd_share is only called via huge_pte_alloc, so callers of
        huge_pte_alloc take i_mmap_rwsem before calling.  In addition, callers
        of huge_pte_alloc continue to hold the semaphore until finished with
        the ptep.
      - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.
      
      One problem with this scheme is that it requires taking i_mmap_rwsem
      before taking the page lock during page faults.  This is not the order
      specified in the rest of mm code.  Handling of hugetlbfs pages is mostly
      isolated today.  Therefore, we use this alternative locking order for
      PageHuge() pages.
      
               mapping->i_mmap_rwsem
                 hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
                   page->flags PG_locked (lock_page)
      
      To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
      introduced to write lock the i_mmap_rwsem associated with a page.
      
      In most cases it is easy to get address_space via vma->vm_file->f_mapping.
      However, in the case of migration or memory errors for anon pages we do
      not have an associated vma.  A new routine _get_hugetlb_page_mapping()
      will use anon_vma to get address_space in these cases.
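
      A hedged sketch of the resulting caller pattern around page faults
      (illustrative; not the exact hugetlb_fault() hunk):

        /* rule: hold i_mmap_rwsem (read) across huge_pte_alloc() and keep it
         * until we are done with the returned ptep */
        i_mmap_lock_read(mapping);
        ptep = huge_pte_alloc(mm, address, huge_page_size(h));
        ...
        /* fault handling uses ptep here, still under the semaphore */
        i_mmap_unlock_read(mapping);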
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Prakash Sangappa <prakash.sangappa@oracle.com>
      Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0d0381a
    • mm/vma: make is_vma_temporary_stack() available for general use · 222100ee
      Authored by Anshuman Khandual
      Currently the declaration and definition of is_vma_temporary_stack()
      are scattered.  Let's make the is_vma_temporary_stack() helper
      available for general use and drop the declaration from
      include/linux/huge_mm.h, which is no longer required.  While at it,
      rename it to vma_is_temporary_stack() in line with existing helpers.
      This should not cause any functional change.
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      222100ee
    • mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages · 47e29d32
      Authored by John Hubbard
      For huge pages (and in fact, any compound page), the
      GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily: each tail
      page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS
      (1024).  That limits the number of huge pages that can be pinned.
      
      This patch removes that limitation, by using an exact form of pin counting
      for compound pages of order > 1.  The "order > 1" is required because this
      approach uses the 3rd struct page in the compound page, and order 1
      compound pages only have two pages, so that won't work there.
      
      A new struct page field, hpage_pinned_refcount, has been added, replacing
      a padding field in the union (so no new space is used).
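
      A hedged sketch of how the new field is reached (close to, but not
      necessarily, the exact helpers):

        static inline atomic_t *compound_pincount_ptr(struct page *page)
        {
                /* the counter lives in the third struct page of the compound */
                return &page[2].hpage_pinned_refcount;
        }

        static inline int compound_pincount(struct page *page)
        {
                VM_BUG_ON_PAGE(!PageCompound(page), page);
                return atomic_read(compound_pincount_ptr(page));
        }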
      
      This enhancement also has a useful side effect: huge pages and compound
      pages (of order > 1) do not suffer from the "potential false positives"
      problem that is discussed in the page_dma_pinned() comment block.  That is
      because these compound pages have extra space for tracking things, so they
      get exact pin counts instead of overloading page->_refcount.
      
      Documentation/core-api/pin_user_pages.rst is updated accordingly.
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      47e29d32
  13. 02 Dec 2019, 1 commit
  14. 01 Dec 2019, 4 commits
  15. 19 Oct 2019, 1 commit
  16. 25 Sep 2019, 4 commits
  17. 14 Aug 2019, 1 commit
    • mm/hmm: fix bad subpage pointer in try_to_unmap_one · 1de13ee5
      Authored by Ralph Campbell
      When migrating an anonymous private page to a ZONE_DEVICE private page,
      the source page->mapping and page->index fields are copied to the
      destination ZONE_DEVICE struct page and the page_mapcount() is
      increased.  This is so rmap_walk() can be used to unmap and migrate the
      page back to system memory.
      
      However, try_to_unmap_one() computes the subpage pointer from a swap pte
      which computes an invalid page pointer and a kernel panic results such
      as:
      
        BUG: unable to handle page fault for address: ffffea1fffffffc8
      
      Currently, only single pages can be migrated to device private memory so
      no subpage computation is needed and it can be set to "page".
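
      A hedged sketch of the fix in try_to_unmap_one() (not necessarily the
      exact hunk):

        if (is_zone_device_page(page)) {
                /*
                 * Only single (PAGE_SIZE) pages can be migrated to device
                 * private memory today, so no subpage computation is needed.
                 */
                subpage = page;
        }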
      
      [rcampbell@nvidia.com: add comment]
        Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1de13ee5
  18. 15 May 2019, 2 commits