1. 01 Jul, 2021 40 commits
    • M
      mm/zbud: reuse unbuddied[0] as buddied in zbud_pool · f356aeac
      Authored by Miaohe Lin
      Patch series "Cleanups for zbud", v2.
      
      This series contains just cleanups to save some possible memory in
      zbud_pool and avoid exporting any unneeded zbud API.  More details can be
      found in the respective changelogs.
      
      This patch (of 2):
      
      Since commit 9d8c5b52 ("mm: zbud: fix condition check on allocation
      size"), zbud_pool.unbuddied[0] is always unused.  We can reuse it as
      the buddied field to save some memory.
      
      Link: https://lkml.kernel.org/r/20210608114515.206992-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210608114515.206992-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f356aeac
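The reuse described in the commit above can be pictured with a small userspace sketch. It assumes the overlay is done with an anonymous union, and the struct names, the `NCHUNKS` value and the minimal `struct list_head` are stand-ins for illustration, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NCHUNKS 63	/* assumed value, for illustration only */

/* Minimal stand-in for the kernel's doubly linked list head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Before: a separate buddied list next to an array whose slot 0 is unused. */
struct sketch_pool_before {
	struct list_head unbuddied[SKETCH_NCHUNKS];
	struct list_head buddied;
};

/* After: buddied overlays the unused unbuddied[0] slot. */
struct sketch_pool_after {
	union {
		struct list_head buddied;
		struct list_head unbuddied[SKETCH_NCHUNKS];
	};
};

/* The overlay saves exactly one list head per pool. */
static_assert(sizeof(struct sketch_pool_before) ==
	      sizeof(struct sketch_pool_after) + sizeof(struct list_head),
	      "one list_head saved");
```

The saving is one `struct list_head` per pool, which is why the commit message speaks of "some" memory rather than a large gain.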
    • M
      mm/z3fold: use release_z3fold_page_locked() to release locked z3fold page · 28473d91
      Authored by Miaohe Lin
      We should use release_z3fold_page_locked() to release a z3fold page when
      it's locked, although it looks harmless to use release_z3fold_page() now.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-7-linmiaohe@huawei.com
      Fixes: dcf5aedb ("z3fold: stricter locking and more careful reclaim")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      28473d91
    • M
      mm/z3fold: fix potential memory leak in z3fold_destroy_pool() · dac0d1cf
      Authored by Miaohe Lin
      There is a memory leak in z3fold_destroy_pool() as it forgets to call
      free_percpu() on pool->unbuddied.  Call free_percpu() for pool->unbuddied
      to fix this issue.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-6-linmiaohe@huawei.com
      Fixes: d30561c5 ("z3fold: use per-cpu unbuddied lists")
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dac0d1cf
    • M
      mm/z3fold: remove unused function handle_to_z3fold_header() · 767cc6c5
      Authored by Miaohe Lin
      handle_to_z3fold_header() is unused now, so we can remove it.  As a
      result, get_z3fold_header() becomes the only caller of
      __get_z3fold_header() and the argument lock is always true.  Therefore we
      could further fold __get_z3fold_header() into get_z3fold_header() with
      lock = true.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-5-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      767cc6c5
    • M
      mm/z3fold: remove magic number in z3fold_create_pool() · e891f60e
      Authored by Miaohe Lin
      It's meaningless to pass a magic number 2 to __alloc_percpu() as there is
      a minimum alignment size of PCPU_MIN_ALLOC_SIZE (> 2) in it.  Also there
      is no special alignment requirement for unbuddied.  So we could replace
      this magic number with the natural alignment, i.e.  __alignof__(struct
      list_head), to improve readability.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e891f60e
    • M
      mm/z3fold: avoid possible underflow in z3fold_alloc() · 014284a0
      Authored by Miaohe Lin
      It is not enough to just make sure the z3fold header is not larger than
      the page size.  When the z3fold header size is equal to PAGE_SIZE, we would
      underflow when checking the allocation size against PAGE_SIZE -
      ZHDR_SIZE_ALIGNED - CHUNK_SIZE in z3fold_alloc().  Make sure there is
      space remaining for its buddy to fix this theoretical issue.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-3-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      014284a0
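The underflow mentioned in the commit above comes from unsigned arithmetic. The sketch below uses assumed sizes (a 4096-byte page, 64-byte chunks) and hypothetical helper names; it is not the z3fold code, only an illustration of why a header as large as the page makes the subtraction wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE  ((size_t)4096)	/* assumed page size */
#define SKETCH_CHUNK_SIZE ((size_t)64)		/* assumed chunk size */

/*
 * Naive upper bound for an allocation: page minus header minus one chunk
 * reserved for a buddy.  With size_t operands this wraps around when the
 * header is as large as the page.
 */
static size_t sketch_naive_limit(size_t zhdr_size_aligned)
{
	return SKETCH_PAGE_SIZE - zhdr_size_aligned - SKETCH_CHUNK_SIZE;
}

/* Hypothetical guard: the header must leave at least one chunk free. */
static bool sketch_header_leaves_room(size_t zhdr_size_aligned)
{
	return zhdr_size_aligned + SKETCH_CHUNK_SIZE <= SKETCH_PAGE_SIZE;
}
```

With a 64-byte header the limit is 3968 bytes; with a 4096-byte header the naive limit wraps to a huge value, so every size would pass the check, which is the situation the guard rules out.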
    • M
      mm/z3fold: define macro NCHUNKS as TOTAL_CHUNKS - ZHDR_CHUNKS · e3c0db4f
      Authored by Miaohe Lin
      Patch series "Cleanup and fixup for z3fold".
      
      This series contains cleanups to remove an unused function, redefine a
      macro to improve readability and so on.  It also fixes several bugs in
      z3fold, such as a memory leak in z3fold_destroy_pool().  More details can
      be found in the respective changelogs.
      
      This patch (of 6):
      
      To improve code readability, we could define macro NCHUNKS as TOTAL_CHUNKS
      - ZHDR_CHUNKS.  No functional change intended.
      
      Link: https://lkml.kernel.org/r/20210619093151.1492174-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20210619093151.1492174-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e3c0db4f
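To see why such a redefinition can be free of functional change, here is a sketch with assumed values. Only the names NCHUNKS, TOTAL_CHUNKS and ZHDR_CHUNKS come from the commit message; the page size, chunk order, one-chunk header and the "old" form of the macro are assumptions made for the example.

```c
#include <assert.h>

#define SK_PAGE_SHIFT		12			/* assumed: 4 KiB pages */
#define SK_PAGE_SIZE		(1UL << SK_PAGE_SHIFT)
#define SK_NCHUNKS_ORDER	6			/* assumed: 64 chunks per page */
#define SK_CHUNK_SHIFT		(SK_PAGE_SHIFT - SK_NCHUNKS_ORDER)
#define SK_CHUNK_SIZE		(1UL << SK_CHUNK_SHIFT)
#define SK_ZHDR_SIZE_ALIGNED	SK_CHUNK_SIZE		/* assumed: header fits one chunk */

#define SK_ZHDR_CHUNKS		(SK_ZHDR_SIZE_ALIGNED >> SK_CHUNK_SHIFT)
#define SK_TOTAL_CHUNKS		(SK_PAGE_SIZE >> SK_CHUNK_SHIFT)

/* Assumed earlier form: subtract bytes first, then convert to chunks. */
#define SK_NCHUNKS_OLD	((SK_PAGE_SIZE - SK_ZHDR_SIZE_ALIGNED) >> SK_CHUNK_SHIFT)
/* Form named in the commit: subtract chunk counts directly. */
#define SK_NCHUNKS_NEW	(SK_TOTAL_CHUNKS - SK_ZHDR_CHUNKS)

/* Both forms agree because the header size is a multiple of the chunk size. */
static_assert(SK_NCHUNKS_OLD == SK_NCHUNKS_NEW, "no functional change");
```

Under these assumptions both expressions evaluate to 63 usable chunks per page.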
    • D
      fs/proc/kcore: use page_offline_(freeze|thaw) · c6d9eee2
      Authored by David Hildenbrand
      Let's properly synchronize with drivers that set PageOffline().
      Unfreeze/thaw every now and then, so drivers that want to set
      PageOffline() can make progress.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-7-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c6d9eee2
    • D
      virtio-mem: use page_offline_(start|end) when setting PageOffline() · 6cc26d77
      Authored by David Hildenbrand
      Let's properly use page_offline_(begin|end) to synchronize setting
      PageOffline(), so we won't have valid page access to unplugged memory
      regions from /proc/kcore.
      
      Existing balloon implementations usually allow reading inflated memory;
      doing so might result in unnecessary overhead in the hypervisor, which is
      currently the case with virtio-mem.
      
      For future virtio-mem use cases, it will be different when using shmem,
      huge pages, !anonymous private mappings, ...  as backing storage for a VM.
      virtio-mem unplugged memory must no longer be accessed and access might
      result in undefined behavior.  There will be a virtio spec extension to
      document this change, including a new feature flag indicating the changed
      behavior.  We really don't want to race against PFN walkers reading random
      page content.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6cc26d77
    • D
      mm: introduce page_offline_(begin|end|freeze|thaw) to synchronize setting PageOffline() · 82840451
      Authored by David Hildenbrand
      A driver might set a page logically offline -- PageOffline() -- and turn
      the page inaccessible in the hypervisor; after that, access to page
      content can be fatal.  One example is virtio-mem; while unplugged memory
      -- marked as PageOffline() -- can currently be read in the hypervisor,
      this will no longer be the case in the future; for example, when having a
      virtio-mem device backed by huge pages in the hypervisor.
      
      Some special PFN walkers -- i.e., /proc/kcore -- read content of random
      pages after checking PageOffline(); however, these PFN walkers can race
      with drivers that set PageOffline().
      
      Let's introduce page_offline_(begin|end|freeze|thaw) for synchronizing.
      
      page_offline_freeze()/page_offline_thaw() allows for a subsystem to
      synchronize with such drivers, achieving that a page cannot be set
      PageOffline() while frozen.
      
      page_offline_begin()/page_offline_end() is used by drivers that care about
      such races when setting a page PageOffline().
      
      For simplicity, use a rwsem for now; neither drivers nor users are
      performance sensitive.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82840451
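The pairing described in the commit above follows reader/writer semantics: many PFN walkers may hold the "frozen" state at once, while a driver setting PageOffline() needs exclusive access. The following is a single-threaded model of that exclusion rule, with hypothetical names and "try" variants that fail instead of sleeping. It is an illustration of the semantics only, not the rwsem-based kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Model state: number of walkers holding the freeze, and driver activity. */
struct sketch_offline_sync {
	int frozen_walkers;
	bool driver_active;
};

/* Walker side (freeze): shared, allowed unless a driver is mid-update. */
static bool sketch_try_freeze(struct sketch_offline_sync *s)
{
	if (s->driver_active)
		return false;
	s->frozen_walkers++;
	return true;
}

/* Walker side (thaw): drop the shared hold. */
static void sketch_thaw(struct sketch_offline_sync *s)
{
	assert(s->frozen_walkers > 0);
	s->frozen_walkers--;
}

/* Driver side (begin): exclusive, allowed only when nobody is frozen. */
static bool sketch_try_begin(struct sketch_offline_sync *s)
{
	if (s->driver_active || s->frozen_walkers > 0)
		return false;
	s->driver_active = true;
	return true;
}

/* Driver side (end): release exclusive access. */
static void sketch_end(struct sketch_offline_sync *s)
{
	assert(s->driver_active);
	s->driver_active = false;
}
```

In this model a walker that thaws periodically gives a waiting driver a window to proceed, which matches the commit's remark that neither side is performance sensitive.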
    • D
      fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages · 0daa322b
      Authored by David Hildenbrand
      Let's avoid reading:
      
      1) Offline memory sections: the content of offline memory sections is
         stale as the memory is effectively unused by the kernel.  On s390x with
         standby memory, offline memory sections (belonging to offline storage
         increments) are not accessible.  With virtio-mem and the hyper-v
         balloon, we can have unavailable memory chunks that should not be
         accessed inside offline memory sections.  Last but not least, offline
         memory sections might contain hwpoisoned pages which we can no longer
         identify because the memmap is stale.
      
      2) PG_offline pages: logically offline pages that are documented as
         "The content of these pages is effectively stale.  Such pages should
         not be touched (read/write/dump/save) except by their owner.".
         Examples include pages inflated in a balloon or unavailable memory
         ranges inside hotplugged memory sections with virtio-mem or the hyper-v
         balloon.
      
      3) PG_hwpoison pages: Reading pages marked as hwpoisoned can be fatal.
         As documented: "Accessing is not safe since it may cause another
         machine check.  Don't touch!"
      
      Introduce is_page_hwpoison(), adding a comment that it is inherently racy
      but the best we can really do.
      
      Reading /proc/kcore now performs similar checks as when reading
      /proc/vmcore for kdump via makedumpfile: problematic pages are excluded.
      It's also similar to the hibernation code; however, we don't skip
      hwpoisoned pages when processing pages in
      kernel/power/snapshot.c:saveable_page() yet.
      
      Note 1: we can race against memory offlining code, especially memory going
      offline and getting unplugged: however, we will properly tear down the
      identity mapping and handle faults gracefully when accessing this memory
      from kcore code.
      
      Note 2: we can race against drivers setting PageOffline() and turning
      memory inaccessible in the hypervisor.  We'll handle this in a follow-up
      patch.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0daa322b
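The three exclusion rules listed in the commit above amount to a per-page filter. A minimal sketch follows, assuming a simplified page descriptor with three boolean fields. The struct and function names are hypothetical, and the real checks operate on kernel page flags and memory section state.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified view of the state the commit cares about. */
struct sketch_page_state {
	bool section_online;	/* memory section is online */
	bool pg_offline;	/* page is logically offline (PG_offline) */
	bool pg_hwpoison;	/* page is marked hwpoisoned (PG_hwpoison) */
};

/* Returns true only if none of the three exclusion rules applies. */
static bool sketch_safe_to_read(const struct sketch_page_state *p)
{
	if (!p->section_online)
		return false;	/* rule 1: offline memory section */
	if (p->pg_offline)
		return false;	/* rule 2: logically offline page */
	if (p->pg_hwpoison)
		return false;	/* rule 3: hwpoisoned page */
	return true;
}
```

As the commit notes, such checks are inherently racy without further synchronization; the sketch only captures the decision, not the locking.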
    • D
      fs/proc/kcore: pfn_is_ram check only applies to KCORE_RAM · 2711032c
      Authored by David Hildenbrand
      Let's restructure the code, using switch-case, and checking pfn_is_ram()
      only when we are dealing with KCORE_RAM.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2711032c
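The restructuring in the commit above can be sketched as a switch over the region type in which only the RAM case consults the RAM check. Everything here is a hypothetical model: the enum values other than the RAM one, the stand-in check and the counter exist only to make the control flow visible.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical region types; only KCORE_RAM is named in the commit text. */
enum sketch_kcore_type {
	SKETCH_KCORE_TEXT,
	SKETCH_KCORE_VMALLOC,
	SKETCH_KCORE_RAM,
};

static int sketch_ram_checks;	/* counts how often the RAM check ran */

/* Stand-in for pfn_is_ram(): pretend that even PFNs are RAM. */
static bool sketch_pfn_is_ram(unsigned long pfn)
{
	sketch_ram_checks++;
	return (pfn % 2) == 0;
}

/* Returns true if the page should be copied, false if it is zero-filled. */
static bool sketch_should_copy(enum sketch_kcore_type type, unsigned long pfn)
{
	switch (type) {
	case SKETCH_KCORE_RAM:
		/* Only RAM regions consult the RAM check. */
		return sketch_pfn_is_ram(pfn);
	case SKETCH_KCORE_TEXT:
	case SKETCH_KCORE_VMALLOC:
		return true;
	}
	return false;
}
```

Keeping the check inside the RAM case means other region types never pay for it and cannot be rejected by it.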
    • D
      fs/proc/kcore: drop KCORE_REMAP and KCORE_OTHER · 3c36b419
      Authored by David Hildenbrand
      Patch series "fs/proc/kcore: don't read offline sections, logically offline pages and hwpoisoned pages", v3.
      
      Looking for places where the kernel might unconditionally read
      PageOffline() pages, I stumbled over /proc/kcore; turns out /proc/kcore
      needs some more love to not touch some other pages we really don't want to
      read -- i.e., hwpoisoned ones.
      
      Examples for PageOffline() pages are pages inflated in a balloon, memory
      unplugged via virtio-mem, and partially-present sections in memory added
      by the Hyper-V balloon.
      
      When reading pages inflated in a balloon, we essentially produce
      unnecessary load in the hypervisor; holes in partially present sections in
      case of Hyper-V are not accessible and already were a problem for
      /proc/vmcore, fixed in makedumpfile by detecting PageOffline() pages.  In
      the future, virtio-mem might disallow reading unplugged memory -- marked
      as PageOffline() -- in some environments, resulting in undefined behavior
      when accessed; therefore, I'm trying to identify and rework all these
      (corner) cases.
      
      With this series, there is really only access via /dev/mem, /proc/vmcore
      and kdb left after I ripped out /dev/kmem.  kdb is an advanced corner-case
      use case -- we won't care for now if someone explicitly tries to do nasty
      things by reading from/writing to physical addresses we better not touch.
      /dev/mem is a use case we won't support for virtio-mem, at least for now,
      so we'll simply disallow mapping any virtio-mem memory via /dev/mem next.
      /proc/vmcore is really only a problem when dumping the old kernel via
      something that's not makedumpfile (read: basically never), however, we'll
      try sanitizing that as well in the second kernel in the future.
      
      Tested via kcore_dump:
      	https://github.com/schlafwandler/kcore_dump
      
      This patch (of 6):
      
      Commit db779ef6 ("proc/kcore: Remove unused kclist_add_remap()")
      removed the last user of KCORE_REMAP.
      
      Commit 595dd46e ("vfs/proc/kcore, x86/mm/kcore: Fix SMAP fault when
      dumping vsyscall user page") removed the last user of KCORE_OTHER.
      
      Let's drop both types.  While at it, also drop vaddr in "struct
      kcore_list", used by KCORE_REMAP only.
      
      Link: https://lkml.kernel.org/r/20210526093041.8800-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20210526093041.8800-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Aili Yao <yaoaili@kingsoft.com>
      Cc: Jiri Bohac <jbohac@suse.cz>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3c36b419
    • M
      docs: proc.rst: meminfo: briefly describe gaps in memory accounting · 8d719afc
      Authored by Mike Rapoport
      Add a paragraph that explains that the counters in /proc/meminfo may not
      add up to the overall memory usage.
      
      Link: https://lkml.kernel.org/r/20210421061127.1182723-1-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d719afc
    • K
      mm/kconfig: move HOLES_IN_ZONE into mm · 781eb2cd
      Authored by Kefeng Wang
      Commit a55749639dc1 ("ia64: drop marked broken DISCONTIGMEM and
      VIRTUAL_MEM_MAP") dropped VIRTUAL_MEM_MAP, so there is no need for
      HOLES_IN_ZONE on ia64.
      
      Also move HOLES_IN_ZONE into mm/Kconfig, and select it if the architecture
      needs this feature.
      
      Link: https://lkml.kernel.org/r/20210417075946.181402-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Will Deacon <will@kernel.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      781eb2cd
    • M
      mm: workingset: define macro WORKINGSET_SHIFT · 3ebc57f4
      Authored by Miaohe Lin
      The magic number 1 is used in several places in workingset.c.  Define a
      macro WORKINGSET_SHIFT for it to improve code readability.
      
      Link: https://lkml.kernel.org/r/20210624122307.1759342-1-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ebc57f4
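A named shift of this kind typically replaces a bare 1 where a single flag bit is packed below a counter. The sketch below assumes that usage; the helper names and the two-field layout are simplifications for illustration and are not the contents of workingset.c.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_WORKINGSET_SHIFT 1	/* one low bit reserved for the flag */

/* Pack an eviction counter and a one-bit workingset flag into one word. */
static unsigned long sketch_pack(unsigned long eviction, bool workingset)
{
	return (eviction << SKETCH_WORKINGSET_SHIFT) | (unsigned long)workingset;
}

/* Split the word back into its two parts. */
static void sketch_unpack(unsigned long entry, unsigned long *eviction,
			  bool *workingset)
{
	*workingset = entry & ((1UL << SKETCH_WORKINGSET_SHIFT) - 1);
	*eviction = entry >> SKETCH_WORKINGSET_SHIFT;
}
```

With the macro, the pack and unpack sides visibly use the same width, which is harder to verify when each site spells out a literal 1.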
    • Y
      include/trace/events/vmscan.h: remove mm_vmscan_inactive_list_is_low · 764c04a9
      Authored by Yu Zhao
      mm_vmscan_inactive_list_is_low has no users after commit b91ac374
      ("mm: vmscan: enforce inactive:active ratio at the reclaim root").
      
      Remove it.
      
      Link: https://lkml.kernel.org/r/20210614194554.2683395-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      764c04a9
    • Y
      mm/vmscan.c: fix potential deadlock in reclaim_pages() · 2d2b8d2b
      Authored by Yu Zhao
      Theoretically, without the protection of memalloc_noreclaim_save() and
      memalloc_noreclaim_restore(), reclaim_pages() can go into the block
      I/O layer recursively and deadlock.
      
      Querying 'reclaim_pages' in our kernel crash databases didn't yield
      any results. So the deadlock seems unlikely to happen. A possible
      explanation is that the only user of reclaim_pages(), i.e.,
      MADV_PAGEOUT, is usually called before memory pressure builds up,
      e.g., on Android and Chrome OS. Under such a condition, allocations in
      the block I/O layer can be fulfilled without diverting to direct
      reclaim and therefore the recursion is avoided.
      
      Link: https://lkml.kernel.org/r/20210622074642.785473-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20210614194727.2684053-1-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d2b8d2b
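The save/restore pair named in the commit above follows a common scoped-flag pattern: set a per-task flag, remember whether it was already set, and put back the previous state afterwards so nesting works. The sketch below models that pattern in userspace with a single global standing in for the task flags. The flag value and all names are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PF_MEMALLOC 0x800u	/* assumed bit value */

static unsigned int sketch_task_flags;	/* stand-in for per-task flags */

/* Set the flag and return its previous state so callers can nest. */
static unsigned int sketch_noreclaim_save(void)
{
	unsigned int old = sketch_task_flags & SKETCH_PF_MEMALLOC;

	sketch_task_flags |= SKETCH_PF_MEMALLOC;
	return old;
}

/* Put the flag back to whatever the matching save observed. */
static void sketch_noreclaim_restore(unsigned int old)
{
	sketch_task_flags = (sketch_task_flags & ~SKETCH_PF_MEMALLOC) | old;
}

/* An allocation may recurse into reclaim only while the flag is clear. */
static bool sketch_may_enter_reclaim(void)
{
	return !(sketch_task_flags & SKETCH_PF_MEMALLOC);
}
```

Because restore reinstates the saved state rather than clearing the bit unconditionally, an inner save/restore pair does not re-enable reclaim for an outer caller that is still inside its protected region.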
    • A
      userfaultfd/selftests: exercise minor fault handling shmem support · 4a8f021b
      Authored by Axel Rasmussen
      Enable test_uffdio_minor for test_type == TEST_SHMEM, and modify the test
      slightly to pass in / check for the right feature flags.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-11-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a8f021b
    • A
      userfaultfd/selftests: reinitialize test context in each test · 8ba6e864
      Authored by Axel Rasmussen
      Currently, the context (fds, mmap-ed areas, etc.) is global.  Each test
      mutates this state in some way, in some cases really "clobbering it"
      (e.g., the events test mremap-ing area_dst over the top of area_src, or
      the minor faults tests overwriting the count_verify values in the test
      areas).  We run the tests in a particular order, and each test is careful
      to make the right assumptions about its starting state, etc.
      
      But, this is fragile.  It's better for a test's success or failure to not
      depend on what some other prior test case did to the global state.
      
      To that end, clear and reinitialize the test context at the start of each
      test case, so whatever prior test cases did doesn't affect future tests.
      
      This is particularly relevant to this series because the events test's
      mremap of area_dst screws up assumptions the minor fault test was relying
      on.  This wasn't a problem for hugetlb, as we don't mremap in that case.
      
      [peterx@redhat.com: fix conflict between this patch and the uffd pagemap series]
        Link: https://lkml.kernel.org/r/YKQqKrl+/cQ1utrb@t490s
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-10-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ba6e864
    • A
      userfaultfd/selftests: create alias mappings in the shmem test · 5bb23edb
      Authored by Axel Rasmussen
      Previously, we just allocated two shm areas: area_src and area_dst.  With
      this commit, change this so we also allocate area_src_alias, and
      area_dst_alias.
      
      area_*_alias and area_* (respectively) point to the same underlying
      physical pages, but are different VMAs.  In a future commit in this
      series, we'll leverage this setup to exercise minor fault handling support
      for shmem, just like we do in the hugetlb_shared test.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-9-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5bb23edb
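The alias setup described in the commit above (two VMAs backed by the same physical pages) can be reproduced in a few lines of userspace C. This is a minimal sketch, assuming Linux with glibc's memfd_create(); the helper name and the memfd name string are made up for the example and are not taken from the selftest.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the same memfd twice with MAP_SHARED.  On success both pointers refer
 * to the same pages through different mappings.  Returns 0 on success and
 * -1 on failure (a sketch: partial mappings are not cleaned up on error).
 */
static int sketch_map_alias_pair(size_t len, char **area, char **alias)
{
	int fd = memfd_create("uffd-sketch", 0);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	*area = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	*alias = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* the mappings keep the memory alive */
	if (*area == MAP_FAILED || *alias == MAP_FAILED)
		return -1;
	return 0;
}
```

A write through one pointer is visible through the other, which is the property a test needs in order to populate pages via the alias and then fault on the primary mapping.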
    • A
      userfaultfd/selftests: use memfd_create for shmem test type · fa2c2b58
      Authored by Axel Rasmussen
      This is a preparatory commit.  In the future, we want to be able to set up
      alias mappings for area_src and area_dst in the shmem test, like we do in
      the hugetlb_shared test.  With a VMA obtained via mmap(MAP_ANONYMOUS |
      MAP_SHARED), it isn't clear how to do this.
      
      So, mmap() with an fd, so we can create alias mappings.  Use memfd_create
      instead of actually passing in a tmpfs path like hugetlb does, since it's
      more convenient / simpler to run, and works just as well.
      
      Future commits will:
      
      1. Set up the alias mappings.
      2. Extend our tests to actually take advantage of this, to test new
         userfaultfd behavior being introduced in this series.
      
      Also, a small fix in the area we're changing: when the hugetlb setup fails
      in main(), pass in the right argv[] so we actually print out the hugetlb
      file path.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-8-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa2c2b58
    • A
      userfaultfd/shmem: modify shmem_mfill_atomic_pte to use install_pte() · 7d64ae3a
      Axel Rasmussen committed
      In a previous commit, we added the mfill_atomic_install_pte() helper.
      This helper does the job of setting up PTEs for an existing page, to map
      it into a given VMA.  It deals with both the anon and shmem cases, as well
      as the shared and private cases.
      
      In other words, shmem_mfill_atomic_pte() duplicates a case the helper
      already handles.  So, expose the helper, and let shmem_mfill_atomic_pte()
      use it directly, to reduce code duplication.
      
      This requires that we refactor shmem_mfill_atomic_pte() a bit:
      
      Instead of doing accounting (shmem_recalc_inode() et al) part-way through
      the PTE setup, do it afterward.  This frees up mfill_atomic_install_pte()
      from having to care about this accounting, and means we don't need to e.g.
      shmem_uncharge() in the error path.
      
      A side effect is this switches shmem_mfill_atomic_pte() to use
      lru_cache_add_inactive_or_unevictable() instead of just lru_cache_add().
      This wrapper does some extra accounting in an exceptional case, if
      appropriate, so it's actually the more correct thing to use.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-7-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d64ae3a
    • A
      userfaultfd/shmem: advertise shmem minor fault support · 964ab004
      Axel Rasmussen committed
      Now that the feature is fully implemented (the faulting path hooks exist
      so userspace is notified, and the ioctl to resolve such faults is
      available), advertise this as a supported feature.
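
      From userspace, an advertised feature shows up in the features field
      that the UFFDIO_API ioctl writes back.  A hedged sketch of such a check
      follows.  uffd_query_features() and uffd_has_minor_shmem() are
      hypothetical helper names, and the #ifndef fallback assumes the upstream
      UAPI value of UFFD_FEATURE_MINOR_SHMEM for older headers.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/* Fallback for UAPI headers that predate this feature bit. */
#ifndef UFFD_FEATURE_MINOR_SHMEM
#define UFFD_FEATURE_MINOR_SHMEM (1 << 10)
#endif

/*
 * Perform the UFFDIO_API handshake on an open userfaultfd and report the
 * feature bits the kernel wrote back.  Returns 0 on success, -1 on failure.
 */
int uffd_query_features(int uffd, uint64_t *features)
{
	struct uffdio_api api;

	memset(&api, 0, sizeof(api));
	api.api = UFFD_API;	/* no optional features requested */
	if (ioctl(uffd, UFFDIO_API, &api) < 0)
		return -1;
	*features = api.features;
	return 0;
}

/* True if the kernel advertised minor fault support for shmem. */
int uffd_has_minor_shmem(uint64_t features)
{
	return (features & UFFD_FEATURE_MINOR_SHMEM) != 0;
}
```

      This sketch only queries; it requests no optional features itself.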
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-6-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      964ab004
    • A
      userfaultfd/shmem: support UFFDIO_CONTINUE for shmem · 15313257
      Axel Rasmussen committed
      With this change, userspace can resolve a minor fault within a
      shmem-backed area with a UFFDIO_CONTINUE ioctl.  The semantics for this
      match those for hugetlbfs - we look up the existing page in the page
      cache, and install a PTE for it.
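
      From userspace, resolving a minor fault might look like the sketch
      below.  uffd_continue() is a hypothetical wrapper name; the struct and
      ioctl come from <linux/userfaultfd.h>, and the #ifdef covers headers
      that predate UFFDIO_CONTINUE.  The sketch assumes the page contents
      were already made valid through some other mapping of the same shmem
      file.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/userfaultfd.h>

/*
 * Ask the kernel to install PTEs for [addr, addr + len), using pages that
 * are already present in the page cache.
 * Returns 0 on success, -1 with errno set on failure.
 */
int uffd_continue(int uffd, void *addr, size_t len)
{
#ifdef UFFDIO_CONTINUE
	struct uffdio_continue req = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode = 0,
	};

	return ioctl(uffd, UFFDIO_CONTINUE, &req) ? -1 : 0;
#else
	(void)uffd;
	(void)addr;
	(void)len;
	errno = ENOSYS;	/* UAPI headers too old to know the ioctl */
	return -1;
#endif
}
```

      Unlike UFFDIO_COPY, no source buffer is passed: nothing is copied, the
      existing page is simply mapped.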
      
      This commit introduces a new helper: mfill_atomic_install_pte.
      
      Why handle UFFDIO_CONTINUE for shmem in mm/userfaultfd.c, instead of in
      shmem.c?  The existing userfault implementation only relies on shmem.c for
      VM_SHARED VMAs.  However, minor fault handling / CONTINUE work just fine
      for !VM_SHARED VMAs as well.  We'd prefer to handle CONTINUE for shmem in
      one place, regardless of shared/private (to reduce code duplication).
      
      Why add a new mfill_atomic_install_pte helper?  A problem we have with
      continue is that shmem_mfill_atomic_pte() and mcopy_atomic_pte() are
      *close* to what we want, but not exactly.  We do want to set up the PTEs in
      a CONTINUE operation, but we don't want to e.g.  allocate a new page,
      charge it (e.g.  to the shmem inode), manipulate various flags, etc.  Also
      we have the problem stated above: shmem_mfill_atomic_pte() and
      mcopy_atomic_pte() both handle one-half of the problem (shared / private)
      continue cares about.  So, introduce mcontinue_atomic_pte(), to handle all
      of the shmem continue cases.  Introduce the helper so it doesn't duplicate
      code with mcopy_atomic_pte().
      
      In a future commit, shmem_mfill_atomic_pte() will also be modified to use
      this new helper.  However, since this is a bigger refactor, it seems most
      clear to do it as a separate change.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15313257
    • A
      userfaultfd/shmem: support minor fault registration for shmem · c949b097
      Axel Rasmussen committed
      This patch allows shmem-backed VMAs to be registered for minor faults.
      Minor faults are appropriately relayed to userspace in the fault path, for
      VMAs with the relevant flag.
      
      This commit doesn't hook up the UFFDIO_CONTINUE ioctl for shmem-backed
      minor faults, though, so userspace doesn't yet have a way to resolve such
      faults.
      
      Because of this, we also don't yet advertise this as a supported feature.
      That will be done in a separate commit when the feature is fully
      implemented.
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c949b097
    • A
      userfaultfd/shmem: combine shmem_{mcopy_atomic,mfill_zeropage}_pte · 3460f6e5
      Axel Rasmussen committed
      Patch series "userfaultfd: add minor fault handling for shmem", v6.
      
      Overview
      ========
      
      See the series which added minor faults for hugetlbfs [3] for a detailed
      overview of minor fault handling in general.  This series adds the same
      support for shmem-backed areas.
      
      This series is structured as follows:
      
      - Commits 1 and 2 are cleanups.
      - Commits 3 and 4 implement the new feature (minor fault handling for shmem).
      - Commit 5 advertises that the feature is now available since at this point it's
        fully implemented.
      - Commit 6 is a final cleanup, modifying an existing code path to re-use a new
        helper we've introduced.
      - Commits 7, 8, 9, 10 update the userfaultfd selftest to exercise the feature.
      
      Use Case
      ========
      
      In some cases it is useful to have VM memory backed by tmpfs instead of
      hugetlbfs.  So, this feature will be used to support the same VM live
      migration use case described in my original series.
      
      Additionally, Android folks (Lokesh Gidra <lokeshgidra@google.com>) hope
      to optimize the Android Runtime garbage collector using this feature:
      
      "The plan is to use userfaultfd for concurrently compacting the heap.
      With this feature, the heap can be shared-mapped at another location where
      the GC-thread(s) could continue the compaction operation without the need
      to invoke userfault ioctl(UFFDIO_COPY) each time.  OTOH, if and when Java
      threads get faults on the heap, UFFDIO_CONTINUE can be used to resume
      execution.  Furthermore, this feature enables updating references in the
      'non-moving' portion of the heap efficiently.  Without this feature,
      unnecessary page copying (ioctl(UFFDIO_COPY)) would be required."
      
      [1] https://lore.kernel.org/patchwork/cover/1388144/
      [2] https://lore.kernel.org/patchwork/patch/1408161/
      [3] https://lore.kernel.org/linux-fsdevel/20210301222728.176417-1-axelrasmussen@google.com/T/#t
      
      This patch (of 9):
      
      Previously, we did a dance where we had one calling path in userfaultfd.c
      (mfill_atomic_pte), but then we split it into two in shmem_fs.h
      (shmem_{mcopy_atomic,mfill_zeropage}_pte), and then rejoined into a single
      shared function in shmem.c (shmem_mfill_atomic_pte).
      
      This is all a bit overly complex.  Just call the single combined shmem
      function directly, allowing us to clean up various branches, boilerplate,
      etc.
      
      While we're touching this function, two other small cleanup changes:
      - offset is equivalent to pgoff, so we can get rid of offset entirely.
      - Split two VM_BUG_ON cases into two statements. This means the line
        number reported when the BUG is hit specifies exactly which condition
        was true.
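
      The benefit of splitting the check can be shown with a userspace
      stand-in.  DEMO_BUG_ON is a hypothetical macro that records a line
      number instead of crashing, and the conditions a and b are
      placeholders, not the conditions from the patch.

```c
/* Line of the first check that fired, or 0 if none did. */
static int demo_bug_line;

/* Userspace stand-in for VM_BUG_ON(): remember where the check fired. */
#define DEMO_BUG_ON(cond)					\
	do {							\
		if ((cond) && !demo_bug_line)			\
			demo_bug_line = __LINE__;		\
	} while (0)

/* One combined check: the report cannot tell which condition was true. */
int combined_check(int a, int b)
{
	demo_bug_line = 0;
	DEMO_BUG_ON(a || b);
	return demo_bug_line;
}

/* Two checks on separate lines: the line number names the condition. */
int split_check(int a, int b)
{
	demo_bug_line = 0;
	DEMO_BUG_ON(a);
	DEMO_BUG_ON(b);
	return demo_bug_line;
}
```

      combined_check() reports the same line whichever argument is set,
      while split_check() reports a different line for each.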
      
      Link: https://lkml.kernel.org/r/20210503180737.2487560-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210503180737.2487560-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3460f6e5
    • P
      userfaultfd/selftests: add pagemap uffd-wp test · eb3b2e00
      Peter Xu committed
      Add one anonymous specific test to start using pagemap.  With pagemap
      support, we can directly read the uffd-wp bit from pgtable without
      triggering any fault, so it's easier to do sanity checks in unit tests.
      
      Meanwhile this test also leverages the newly introduced MADV_PAGEOUT
      madvise function to test swap ptes with uffd-wp bit set, and across
      fork()s.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eb3b2e00
    • P
      mm/pagemap: export uffd-wp protection information · fb8e37f3
      Peter Xu committed
      Export the PTE/PMD status of uffd-wp to pagemap too.
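
      Userspace can then test the bit by reading the page's 64-bit entry from
      /proc/self/pagemap, without triggering a fault.  A sketch, assuming the
      uffd-wp flag is bit 57 as listed in
      Documentation/admin-guide/mm/pagemap.rst; the helper names are made up
      for illustration.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_UFFD_WP_BIT 57	/* assumed bit position, see pagemap.rst */

/* Extract the uffd-wp flag from one 64-bit pagemap entry. */
int pagemap_entry_uffd_wp(uint64_t entry)
{
	return (int)((entry >> PM_UFFD_WP_BIT) & 1);
}

/*
 * Read the pagemap entry for the page containing addr in the calling
 * process.  Returns 0 on success, -1 on failure.
 */
int pagemap_read_entry(const void *addr, uint64_t *entry)
{
	long page_size = sysconf(_SC_PAGESIZE);
	/* One 8-byte entry per virtual page, indexed by page number. */
	off_t offset = (off_t)((uintptr_t)addr / (uintptr_t)page_size) *
		       (off_t)sizeof(*entry);
	int fd = open("/proc/self/pagemap", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = pread(fd, entry, sizeof(*entry), offset);
	close(fd);
	return n == (ssize_t)sizeof(*entry) ? 0 : -1;
}
```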
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb8e37f3
    • P
      mm/userfaultfd: fail uffd-wp registration if not supported · 00b151f2
      Peter Xu committed
      We should fail uffd-wp registration immediately if the arch does not even
      have CONFIG_HAVE_ARCH_USERFAULTFD_WP defined.  That also blocks the
      relevant ioctls, e.g. UFFDIO_WRITEPROTECT, because they check against
      VM_UFFD_WP, which can only be applied by a successful registration.
      
      Remove the WP feature bit too for those archs when handling UFFDIO_API
      ioctl.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00b151f2
    • P
      mm/userfaultfd: fix uffd-wp special cases for fork() · 8f34f1ea
      Peter Xu committed
      We tried to do something similar in b569a176 ("userfaultfd: wp: drop
      _PAGE_UFFD_WP properly when fork") previously, but it did not get
      everything right.  A few fixes around the code path:
      
      1. We were referencing VM_UFFD_WP vm_flags on the _old_ vma rather
         than the new vma.  That's overlooked in b569a176, so it won't work
         as expected.  Thanks to the recent rework on fork code
         (7a4830c3), we can easily get the new vma now, so switch the
         checks to that.
      
      2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
         huge pmd is a migration huge pmd.  When it happens, instead of using
         pmd_uffd_wp(), we should use pmd_swp_uffd_wp().  The fix is simply to
         handle them separately.
      
      3. We forgot to carry over the uffd-wp bit for a write migration huge pmd
         entry.  This also happens in copy_huge_pmd(), where we converted a
         write huge migration entry into a read one.
      
      4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.
      
      5. In copy_present_page(), when COW is enforced during fork(), we also
         need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the new
         vma and the pte to be copied has the uffd-wp bit set.
      
      Remove the comment in copy_present_pte() about this.  It won't help a huge
      lot to only comment there, but commenting everywhere would be overkill.
      Let's assume the commit messages will help.
      
      [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
        Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
      Fixes: b569a176 ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8f34f1ea
    • P
      mm/thp: simplify copying of huge zero page pmd when fork · 5fc7a5f6
      Peter Xu committed
      Patch series "mm/uffd: Misc fix for uffd-wp and one more test".
      
      This series tries to fix some corner case bugs for uffd-wp on either thp
      or fork().  It then introduces a new test with pagemap/pageout.
      
      Patch layout:
      
      Patch 1:    cleanup for THP, it'll slightly simplify the follow up patches
      Patch 2-4:  misc fixes for uffd-wp here and there; please refer to each patch
      Patch 5:    add pagemap support for uffd-wp
      Patch 6:    add pagemap/pageout test for uffd-wp
      
      The last test introduced can also verify some of the fixes in previous
      patches, as the test will fail without the fixes.  However it's not easy
      to verify all the changes in patch 2-4, but hopefully they can still be
      properly reviewed.
      
      Note that if considering the ongoing uffd-wp shmem & hugetlbfs work, patch
      5 will be incomplete as it's missing e.g.  hugetlbfs part or the special
      swap pte detection.  However that's not needed in this series, and since
      that series is still under review, this series does not depend on that
      one (the last test only runs with anonymous memory, not file-backed).  So
      this series can be merged even before that series.
      
      This patch (of 6):
      
      The huge zero page is handled in a special path in copy_huge_pmd(), but it
      should share most of its code with a normal thp page.  Share more code
      with the normal path by removing the special path.  The only leftover so
      far is the huge zero page refcounting (mm_get_huge_zero_page()), because
      that's done separately with a global counter.
      
      This prepares for a future patch to modify the huge pmd to be installed,
      so that we don't need to duplicate it explicitly into huge zero page case
      too.
      
      Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
      Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>, peterx@redhat.com
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5fc7a5f6
    • P
      userfaultfd/selftests: unify error handling · 42e584ee
      Peter Xu committed
      Introduce err()/_err() and replace all the different ways to fail the
      program, mostly "fprintf" and "perror" with tons of exit() calls.  Always
      stop the test program at any failure.
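
      A sketch of what such a pair of macros might look like.  Only the
      names err() and _err() come from the commit message; the bodies, the
      message format and the open_or_die() call site are assumptions made
      for illustration.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

/* Print a message plus the errno and line of the failing call site. */
#define _err(fmt, ...)							\
	do {								\
		int saved_errno = errno;				\
		fprintf(stderr, "ERROR: " fmt, ##__VA_ARGS__);		\
		fprintf(stderr, " (errno=%d, line=%d)\n",		\
			saved_errno, __LINE__);				\
	} while (0)

/* Same as _err(), then stop the test program immediately. */
#define err(fmt, ...)				\
	do {					\
		_err(fmt, ##__VA_ARGS__);	\
		exit(1);			\
	} while (0)

/* Example call site: one line replaces a perror() plus exit() pair. */
int open_or_die(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		err("open(%s) failed", path);
	return fd;
}
```

      Saving errno first matters because the first fprintf() may itself
      change it before the second one prints it.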
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      42e584ee
    • P
      userfaultfd/selftests: only dump counts if mode enabled · de3ca8e4
      Peter Xu committed
      WP and MINOR modes are conditionally enabled on specific memory types.
      This patch avoids dumping tons of zeros for those cases when the modes are
      not supported at all.
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-5-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      de3ca8e4
    • P
      userfaultfd/selftests: dropping VERIFY check in locking_thread · 4e08e18a
      Peter Xu committed
      The VERIFY check compares the page against all zeros and loops quite a few
      times.  However, after that we verify the same page with count_verify, and
      count_verify can never be zero.  So a zero page will be detected anyway by
      the code below.
      
      There's yet another place we conditionally check the fault flag - just do
      it unconditionally.
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4e08e18a
    • P
      userfaultfd/selftests: remove the time() check on delayed uffd · ba4f8c35
      Peter Xu committed
      There seems to be no guarantee that time() will return the same value for
      the two calls even if there's no delay, e.g. when a fault happens to cross
      a second boundary.  Meanwhile, this message is also not that helpful,
      since a delay could happen for many reasons, e.g. scheduling latency of
      the resolving thread.  It may not mean an issue with uffd.
      
      I have not seen this error triggered in past runs either.  Even if it
      triggers, it'll be drowned in the rest of the test logs.  Remove it.
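
      The second-boundary problem is plain arithmetic, as the sketch below
      shows with clock readings given in nanoseconds.  The helper names are
      made up, and the readings are chosen by hand rather than taken from a
      real run.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Whole-second value that time() would report for a reading in ns. */
uint64_t seconds_of(uint64_t ns)
{
	return ns / NSEC_PER_SEC;
}

/* True if two readings fall into different whole seconds. */
int time_values_differ(uint64_t start_ns, uint64_t end_ns)
{
	return seconds_of(start_ns) != seconds_of(end_ns);
}
```

      Two readings only 200 ns apart (999999900 and 1000000100) differ in
      their whole-second value, while two readings almost a full second
      apart (1000000100 and 1999999900) do not.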
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba4f8c35
    • P
      userfaultfd/selftests: use user mode only · d2c6c06f
      Peter Xu committed
      Patch series "userfaultfd/selftests: A few cleanups", v2.
      
      I wanted to clean up userfaultfd.c fault handling for a long time.  If it's
      not cleaned, when the new code grows the file it'll also grow the size
      that needs to be cleaned...  This is my attempt to clean up the userfaultfd
      selftest on fault handling, to use an err() macro instead of either
      fprintf() or perror() followed by another exit() call.
      
      The huge cleanup is done in the last patch.  The first 4 patches are some
      other standalone cleanups for the same file, so I put them together.
      
      This patch (of 5):
      
      Userfaultfd selftest does not need to handle kernel initiated fault.  Set
      user mode so it can be run even if unprivileged_userfaultfd=0 (which is
      the default).
      
      Link: https://lkml.kernel.org/r/20210412232753.1012412-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Wang Qing <wangqing@vivo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2c6c06f
    • N
      mm/hwpoison: disable pcp for page_handle_poison() · 510d25c9
      Naoya Horiguchi committed
      Recent changes by the patch "mm/page_alloc: allow high-order pages to be
      stored on the per-cpu lists" make the kernel decide whether to use pcp via
      pcp_allowed_order(), which breaks soft-offline for hugetlb pages.
      
      Soft-offline dissolves a migration source page, then removes it from the
      buddy free list, so it is assumed that any subpage of the soft-offlined
      hugepage is recognized as a buddy page just after returning from
      dissolve_free_huge_page().  pcp_allowed_order() returns true for hugetlb,
      so this assumption no longer holds.
      
      So disable pcp during dissolve_free_huge_page() and take_page_off_buddy()
      to prevent soft-offlined hugepages from being linked to pcp lists.
      Soft-offline should not be a common event, so the impact on performance
      should be minimal.  And I think that the optimization of Mel's patch could
      benefit hugetlb, so zone_pcp_disable() is called only in hwpoison
      context.
      
      Link: https://lkml.kernel.org/r/20210617092626.291006-1-nao.horiguchi@gmail.com
      Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      510d25c9
    • M
      hugetlb: address ref count racing in prep_compound_gigantic_page · 7118fc29
      Mike Kravetz committed
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
      if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
      that the extra ref count will be dropped during an RCU grace period.  This
      is not a performance critical code path and the wait should be
      acceptable.  If the ref count is still inflated after the grace period,
      then undo any modifications made and return an error.
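
      The freeze-retry-undo sequence described above can be sketched in user
      space with C11 atomics.  This is only an illustration under stated
      assumptions, not the kernel code: struct fake_page, ref_freeze(),
      wait_for_grace_period() and prep_group() are hypothetical names, the
      grace-period wait is a no-op stand-in for synchronize_rcu(), and the
      -1 return value is a placeholder error code.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only a reference count. */
struct fake_page {
    atomic_int refcount;
};

/* Set the count to 0 only if it is exactly `expected` (compare-and-swap). */
static bool ref_freeze(struct fake_page *p, int expected)
{
    return atomic_compare_exchange_strong(&p->refcount, &expected, 0);
}

/* Hypothetical stand-in for synchronize_rcu(); a no-op in this sketch. */
static void wait_for_grace_period(void)
{
}

/*
 * Turn pages[1..nr-1] into "tail pages" by dropping their count from 1 to 0.
 * pages[0] plays the head page and keeps its reference.
 * Returns 0 on success, -1 (placeholder) if some count stays inflated.
 */
static int prep_group(struct fake_page *pages, int nr)
{
    int i;

    for (i = 1; i < nr; i++) {
        if (!ref_freeze(&pages[i], 1)) {
            /* Someone else holds a reference: wait once, then retry. */
            wait_for_grace_period();
            if (!ref_freeze(&pages[i], 1))
                goto undo;
        }
    }
    return 0;

undo:
    /* Restore the counts already set to zero; the failing page is untouched. */
    for (int j = 1; j < i; j++)
        atomic_store(&pages[j].refcount, 1);
    return -1;
}
```

      With every count at one, prep_group() zeroes all tail counts and returns
      0.  If any tail page carries an extra reference that survives the wait,
      the earlier pages are restored to one and the caller sees the error, so
      it can free the group as described in the next paragraph.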
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7118fc29
    • M
      hugetlb: remove prep_compound_huge_page cleanup · 48b8d744
      Mike Kravetz committed
      Patch series "Fix prep_compound_gigantic_page ref count adjustment".
      
      These patches address the possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative as described
      by Jann Horn in [1].
      
      The first patch simply removes the unnecessary/obsolete helper routine
      prep_compound_huge_page to make the actual fix a little simpler.
      
      The second patch is the actual fix and has a detailed explanation in the
      commit message.
      
      This potential issue has existed for almost 10 years and I am unaware of
      anyone actually hitting the race.  I did not cc stable, but would be happy
      to squash the patches and send to stable if anyone thinks that is a good
      idea.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      This patch (of 2):
      
      I could not think of a reliable way to recreate the issue for testing.
      Rather, I 'simulated errors' to exercise all the error paths.
      
      The routine prep_compound_huge_page is a simple wrapper to call either
      prep_compound_gigantic_page or prep_compound_page.  However, it is only
      called from gather_bootmem_prealloc which only processes gigantic pages.
      Eliminate the routine and call prep_compound_gigantic_page directly.
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-1-mike.kravetz@oracle.com
      Link: https://lkml.kernel.org/r/20210622021423.154662-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48b8d744