1. 06 May 2021, 18 commits
    • mm: memcontrol: rename shrinker_map to shrinker_info · e4262c4f
      Authored by Yang Shi
      The following patch will add nr_deferred to shrinker_map, after which the
      structure will no longer hold just a map.  Rename it to "shrinker_info",
      dropping the "memcg_" prefix as well; this should make the patch adding
      nr_deferred cleaner and more readable, and make review easier.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e4262c4f
    • mm: vmscan: consolidate shrinker_maps handling code · 2bfd3637
      Authored by Yang Shi
      The shrinker map management is not purely memcg specific; it sits at the
      intersection of memory cgroups and shrinkers.  It is the allocation and
      assignment of a data structure, and the only memcg-specific part is that
      the map is stored in a memcg structure.  So move the shrinker_maps
      handling code into vmscan.c for tighter integration with the shrinker
      code, and remove the "memcg_" prefix.  There is no functional change.
      
      Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2bfd3637
    • mm/vmscan: replace implicit RECLAIM_ZONE checks with explicit checks · 202e35db
      Authored by Dave Hansen
      RECLAIM_ZONE was assumed to be unused because it was never explicitly
      used in the kernel.  However, there were a number of places where it was
      checked implicitly by checking 'node_reclaim_mode' for a zero value.
      
      These zero checks are not great because it is not obvious what a zero
      mode *means* in the code.  Replace them with a helper which makes it
      more obvious: node_reclaim_enabled().
      
      This helper also provides a handy place to explicitly check the
      RECLAIM_ZONE bit itself.  Check it explicitly there to make it more
      obvious where the bit can affect behavior.
      
      This should have no functional impact.
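
      The helper is essentially a one-liner; roughly (a sketch matching the
      description above):

        /* Is any node_reclaim_mode bit set? */
        static inline bool node_reclaim_enabled(void)
        {
                return node_reclaim_mode & (RECLAIM_ZONE | RECLAIM_WRITE | RECLAIM_UNMAP);
        }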
      
      Link: https://lkml.kernel.org/r/20210219172559.BF589C44@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Daniel Wagner <dwagner@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      202e35db
    • mm/vmscan: move RECLAIM* bits to uapi header · b6676de8
      Authored by Dave Hansen
      It is currently not obvious that the RECLAIM_* bits are part of the uapi
      since they are defined in vmscan.c.  Move them to a uapi header to make it
      obvious.
      
      This should have no functional impact.
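
      The bits themselves are simple flags behind the vm.zone_reclaim_mode
      sysctl; roughly:

        /* These bit locations are exposed in the vm.zone_reclaim_mode sysctl */
        #define RECLAIM_ZONE  (1 << 0)  /* Run shrink_inactive_list on the zone */
        #define RECLAIM_WRITE (1 << 1)  /* Writeout pages during reclaim */
        #define RECLAIM_UNMAP (1 << 2)  /* Unmap pages during reclaim */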
      
      Link: https://lkml.kernel.org/r/20210219172557.08074910@viggo.jf.intel.com
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Ben Widawsky <ben.widawsky@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Daniel Wagner <dwagner@suse.de>
      Cc: "Tobin C. Harding" <tobin@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b6676de8
    • userfaultfd: add UFFDIO_CONTINUE ioctl · f6191471
      Authored by Axel Rasmussen
      This ioctl is how userspace ought to resolve "minor" userfaults.  The
      idea is, userspace is notified that a minor fault has occurred.  It
      might change the contents of the page using its second non-UFFD mapping,
      or not.  Then, it calls UFFDIO_CONTINUE to tell the kernel "I have
      ensured the page contents are correct, carry on setting up the mapping".
      
      Note that it doesn't make much sense to use UFFDIO_{COPY,ZEROPAGE} for
      MINOR registered VMAs.  ZEROPAGE maps the VMA to the zero page; but in
      the minor fault case, we already have some pre-existing underlying page.
      Likewise, UFFDIO_COPY isn't useful if we have a second non-UFFD mapping.
      We'd just use memcpy() or similar instead.
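
      From userspace, resolving a minor fault is then a single ioctl on the
      page-aligned range; a minimal sketch (uffd, fault_addr, and page_size are
      assumed to come from the usual userfaultfd event loop, with page_size
      being the huge page size for hugetlbfs):

        #include <linux/userfaultfd.h>
        #include <sys/ioctl.h>
        #include <stdio.h>

        static void resolve_minor_fault(int uffd, unsigned long fault_addr,
                                        unsigned long page_size)
        {
                struct uffdio_continue cont = {
                        .range.start = fault_addr & ~(page_size - 1), /* align down */
                        .range.len   = page_size,
                        .mode        = 0,
                };

                if (ioctl(uffd, UFFDIO_CONTINUE, &cont) == -1)
                        perror("UFFDIO_CONTINUE"); /* e.g. EEXIST on a non-minor fault */
        }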
      
      It turns out hugetlb_mcopy_atomic_pte() already does very close to what
      we want, if an existing page is provided via `struct page **pagep`.  We
      already special-case the behavior a bit for the UFFDIO_ZEROPAGE case, so
      just extend that design: add an enum for the three modes of operation,
      and make the small adjustments needed for the MCOPY_ATOMIC_CONTINUE
      case.  (Basically, look up the existing page, and avoid adding the
      existing page to the page cache or calling set_page_huge_active() on
      it.)
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-5-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6191471
    • userfaultfd: hugetlbfs: only compile UFFD helpers if config enabled · 714c1891
      Authored by Axel Rasmussen
      For background, mm/userfaultfd.c provides a general mcopy_atomic
      implementation.  But some types of memory (i.e., hugetlb and shmem) need
      a slightly different implementation, so they provide their own helpers
      for this.  In other words, userfaultfd is the only caller of these
      functions.
      
      This patch achieves two things:
      
      1. Don't spend time compiling code which will end up never being
         referenced anyway (a small build time optimization).
      
      2. In patches later in this series, we extend the signature of these
         helpers with UFFD-specific state (a mode enumeration).  Once this
         happens, we *have to* either not compile the helpers, or
         unconditionally define the UFFD-only state (which seems messier to me).
         This includes the declarations in the headers, as otherwise they'd
         yield warnings about implicitly defining the type of those arguments.
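
      A minimal sketch of the CONFIG_USERFAULTFD guard this patch applies,
      using hugetlb_mcopy_atomic_pte() as the example helper (prototype
      abbreviated from the actual header):

        /* include/linux/hugetlb.h (sketch) */
        #ifdef CONFIG_USERFAULTFD
        int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, pte_t *dst_pte,
                                     struct vm_area_struct *dst_vma,
                                     unsigned long dst_addr, unsigned long src_addr,
                                     struct page **pagep);
        #endif /* CONFIG_USERFAULTFD */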
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-4-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      714c1891
    • userfaultfd: disable huge PMD sharing for MINOR registered VMAs · 0d9cadab
      Authored by Axel Rasmussen
      As the comment says: for the MINOR fault use case, although the page
      might be present and populated in the other (non-UFFD-registered) half
      of the mapping, it may be out of date, and we explicitly want userspace
      to get a minor fault so it can check and potentially update the page's
      contents.
      
      Huge PMD sharing would prevent these faults from occurring for suitably
      aligned areas, so disable it upon UFFD registration.
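
      Concretely, the commit makes minor-registered VMAs take the same "no PMD
      sharing" path as wp-registered ones; roughly:

        /* sketch: sharing is refused for any uffd wp- or minor-registered VMA */
        static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma)
        {
                return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR);
        }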
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-3-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d9cadab
    • userfaultfd: add minor fault registration mode · 7677f7fd
      Authored by Axel Rasmussen
      Patch series "userfaultfd: add minor fault handling", v9.
      
      Overview
      ========
      
      This series adds a new userfaultfd feature, UFFD_FEATURE_MINOR_HUGETLBFS.
      When enabled (via the UFFDIO_API ioctl), this feature means that any
      hugetlbfs VMAs registered with the new UFFDIO_REGISTER_MODE_MINOR mode will
      get events for "minor" faults.  By "minor" fault, I mean the following
      situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s) (shared
      memory).  One of the mappings is registered with userfaultfd (in minor
      mode), and the other is not.  Via the non-UFFD mapping, the underlying
      pages have already been allocated & filled with some contents.  The UFFD
      mapping has not yet been faulted in; when it is touched for the first
      time, this results in what I'm calling a "minor" fault.  As a concrete
      example, when working with hugetlbfs, we have huge_pte_none(), but
      find_lock_page() finds an existing page.
      
      We also add a new ioctl to resolve such faults: UFFDIO_CONTINUE.  The idea
      is, userspace resolves the fault by either a) doing nothing if the
      contents are already correct, or b) updating the underlying contents using
      the second, non-UFFD mapping (via memcpy/memset or similar, or something
      fancier like RDMA, or etc...).  In either case, userspace issues
      UFFDIO_CONTINUE to tell the kernel "I have ensured the page contents are
      correct, carry on setting up the mapping".
      
      Use Case
      ========
      
      Consider the use case of VM live migration (e.g. under QEMU/KVM):
      
      1. While a VM is still running, we copy the contents of its memory to a
         target machine. The pages are populated on the target by writing to the
         non-UFFD mapping, using the setup described above. The VM is still running
         (and therefore its memory is likely changing), so this may be repeated
         several times, until we decide the target is "up to date enough".
      
      2. We pause the VM on the source, and start executing on the target machine.
         During this gap, the VM's user(s) will *see* a pause, so it is desirable to
         minimize this window.
      
      3. Between the last time any page was copied from the source to the target, and
         when the VM was paused, the contents of that page may have changed - and
         therefore the copy we have on the target machine is out of date. Although we
         can keep track of which pages are out of date, for VMs with large amounts of
         memory, it is "slow" to transfer this information to the target machine. We
         want to resume execution before such a transfer would complete.
      
      4. So, the guest begins executing on the target machine. The first time it
         touches its memory (via the UFFD-registered mapping), userspace wants to
         intercept this fault. Userspace checks whether or not the page is up to date,
         and if not, copies the updated page from the source machine, via the non-UFFD
         mapping. Finally, whether a copy was performed or not, userspace issues a
         UFFDIO_CONTINUE ioctl to tell the kernel "I have ensured the page contents
         are correct, carry on setting up the mapping".
      
      We don't have to do all of the final updates on-demand. The userfaultfd manager
      can, in the background, also copy over updated pages once it receives the map of
      which pages are up-to-date or not.
      
      Interaction with Existing APIs
      ==============================
      
      Because registration modes can be combined, a registered VMA could
      potentially receive both missing and minor faults.  I spent some time
      thinking through how the existing API interacts with the new feature:
      
      UFFDIO_CONTINUE cannot be used to resolve non-minor faults, as it does not
      allocate a new page.  If UFFDIO_CONTINUE is used on a non-minor fault:
      
      - For non-shared memory or shmem, -EINVAL is returned.
      - For hugetlb, -EFAULT is returned.
      
      UFFDIO_COPY and UFFDIO_ZEROPAGE cannot be used to resolve minor faults.
      Without modifications, the existing codepath assumes a new page needs to
      be allocated.  This is okay, since userspace must have a second
      non-UFFD-registered mapping anyway, thus there isn't much reason to want
      to use these in any case (just memcpy or memset or similar).
      
      - If UFFDIO_COPY is used on a minor fault, -EEXIST is returned.
      - If UFFDIO_ZEROPAGE is used on a minor fault, -EEXIST is returned (or -EINVAL
        in the case of hugetlb, as UFFDIO_ZEROPAGE is unsupported in any case).
      - UFFDIO_WRITEPROTECT simply doesn't work with shared memory, and returns
        -ENOENT in that case (regardless of the kind of fault).
      
      Future Work
      ===========
      
      This series only supports hugetlbfs.  I have a second series in flight to
      support shmem as well, extending the functionality.  This series is more
      mature than the shmem support at this point, and the functionality works
      fully on hugetlbfs, so this series can be merged first and then shmem
      support will follow.
      
      This patch (of 6):
      
      This feature allows userspace to intercept "minor" faults.  By "minor"
      faults, I mean the following situation:
      
      Let there exist two mappings (i.e., VMAs) to the same page(s).  One of the
      mappings is registered with userfaultfd (in minor mode), and the other is
      not.  Via the non-UFFD mapping, the underlying pages have already been
      allocated & filled with some contents.  The UFFD mapping has not yet been
      faulted in; when it is touched for the first time, this results in what
      I'm calling a "minor" fault.  As a concrete example, when working with
      hugetlbfs, we have huge_pte_none(), but find_lock_page() finds an existing
      page.
      
      This commit adds the new registration mode, and sets the relevant flag on
      the VMAs being registered.  In the hugetlb fault path, if we find that we
      have huge_pte_none(), but find_lock_page() does indeed find an existing
      page, then we have a "minor" fault, and if the VMA has the userfaultfd
      registration flag, we call into userfaultfd to handle it.
      
      This is implemented as a new registration mode, instead of an API feature.
      This is because the alternative implementation has significant drawbacks
      [1].
      
      However, doing it this way requires that we allocate a VM_* flag for the
      new registration mode.  On 32-bit systems, there are no unused bits, so this
      feature is only supported on architectures with
      CONFIG_ARCH_USES_HIGH_VMA_FLAGS.  When attempting to register a VMA in
      MINOR mode on 32-bit architectures, we return -EINVAL.
      
      [1] https://lore.kernel.org/patchwork/patch/1380226/
      
      [peterx@redhat.com: fix minor fault page leak]
        Link: https://lkml.kernel.org/r/20210322175132.36659-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210301222728.176417-1-axelrasmussen@google.com
      Link: https://lkml.kernel.org/r/20210301222728.176417-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7677f7fd
    • mm: make alloc_contig_range handle in-use hugetlb pages · ae37c7ff
      Authored by Oscar Salvador
      alloc_contig_range() will fail if it finds a HugeTLB page within the
      range, without a chance to handle it.  Since HugeTLB pages can be
      migrated like any LRU or movable page, it does not make sense to bail out
      without trying.  Enable the interface to recognize in-use HugeTLB pages so
      we can migrate them, giving the call a much better chance to succeed.
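
      The interface this series settles on lets callers hand in-use pages to
      the migration path; roughly (a sketch of the prototype):

        /* free huge pages are dissolved/replaced; in-use ones are isolated
         * onto @list so alloc_contig_range() can migrate them */
        int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);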
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-7-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae37c7ff
    • mm: make alloc_contig_range handle free hugetlb pages · 369fa227
      Authored by Oscar Salvador
      alloc_contig_range will fail if it ever sees a HugeTLB page within the
      range we are trying to allocate, even when that page is free and can be
      easily reallocated.
      
      This has proved to be problematic for some users of alloc_contig_range,
      e.g.  CMA and virtio-mem, which would see the call fail even when such
      pages lie in ZONE_MOVABLE and are free.
      
      We can do better by trying to replace such a page.
      
      Free hugepages are tricky to handle: to ensure no userspace application
      notices any disruption, we need to replace the current free hugepage with
      a new one.
      
      In order to do that, a new function called alloc_and_dissolve_huge_page is
      introduced.  This function will first try to allocate a fresh hugepage, and
      if it succeeds, it will replace the old one in the free hugepage pool.
      
      The free page replacement is done under hugetlb_lock, so no external users
      of hugetlb will notice the change.  To allocate the new huge page, we use
      alloc_buddy_huge_page(), so we do not have to deal with any counters, and
      prep_new_huge_page() is not called.  This is valuable because if we
      need to free the new page, we only need to call __free_pages().
      
      Once we know that the page to be replaced is a genuine 0-refcounted huge
      page, we remove the old page from the freelist by remove_hugetlb_page().
      Then, we can call __prep_new_huge_page() and
      __prep_account_new_huge_page() for the new huge page to properly
      initialize it and increment the hstate->nr_huge_pages counter (previously
      decremented by remove_hugetlb_page()).  Once done, the page is enqueued by
      enqueue_huge_page() and it is ready to be used.
      
      There is one tricky case: the page's refcount may be 0 because it is in
      the process of being released.  A missing PageHugeFreed bit tells us that
      freeing is in flight, so we retry after dropping the hugetlb_lock.  The
      race window should be small and the next retry should make forward
      progress.
      
      E.g.:

      CPU0                              CPU1
      free_huge_page()                  isolate_or_dissolve_huge_page()
                                          PageHuge() == T
                                          alloc_and_dissolve_huge_page()
                                            alloc_buddy_huge_page()
                                            spin_lock_irq(&hugetlb_lock)
                                            // PageHuge() && !PageHugeFreed &&
                                            // !PageCount()
                                            spin_unlock_irq(&hugetlb_lock)
        spin_lock_irq(&hugetlb_lock)
        1) update_and_free_page()
             PageHuge() == F
             __free_pages()
        2) enqueue_huge_page()
             SetPageHugeFreed()
        spin_unlock_irq(&hugetlb_lock)
                                          spin_lock_irq(&hugetlb_lock)
                                          1) PageHuge() == F
                                             (freed by case #1 on CPU0)
                                          2) PageHuge() == T
                                               PageHugeFreed() == T
                                               -> proceed with replacing the page
      
      In the case above we retry, since the race window is quite small and the
      next attempt is very likely to succeed.
      
      With regard to the allocation, we restrict it to the node the page belongs
      to with __GFP_THISNODE, meaning we do not fall back on other nodes' zones.
      
      Note that gigantic hugetlb pages are fenced off since there is a cyclic
      dependency between them and alloc_contig_range.
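
      Putting the steps above together, the replacement path looks roughly like
      this (a condensed sketch of the flow this log describes, not the literal
      function body; the locking/retry details for the in-flight-free race are
      elided):

        static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page)
        {
                int nid = page_to_nid(old_page);
                gfp_t gfp = htlb_alloc_mask(h) | __GFP_THISNODE; /* stay on this node */
                struct page *new_page;

                /* 1) fresh page; no counters touched, so bailing out only
                 *    needs __free_pages() */
                new_page = alloc_buddy_huge_page(h, gfp, nid, NULL, NULL);
                if (!new_page)
                        return -ENOMEM;

                spin_lock_irq(&hugetlb_lock);
                /* 2) if !PageHugeFreed(old_page), freeing is in flight: drop
                 *    the lock and retry (the race described above) */
                /* 3) detach the genuinely free old page from the freelist */
                remove_hugetlb_page(h, old_page, false);
                /* 4) initialize + account the replacement, then make it available */
                __prep_new_huge_page(new_page);
                __prep_account_new_huge_page(h, nid);
                enqueue_huge_page(h, new_page);
                spin_unlock_irq(&hugetlb_lock);

                /* 5) hand the old page back to the buddy allocator */
                update_and_free_page(h, old_page);
                return 0;
        }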
      
      Link: https://lkml.kernel.org/r/20210419075413.1064-6-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      369fa227
    • hugetlb: add per-hstate mutex to synchronize user adjustments · 29383967
      Authored by Mike Kravetz
      The helper routine hstate_next_node_to_alloc accesses and modifies the
      hstate variable next_nid_to_alloc.  The helper is used by the routines
      alloc_pool_huge_page and adjust_pool_surplus.  adjust_pool_surplus is
      called with hugetlb_lock held.  However, alloc_pool_huge_page cannot be
      called with the hugetlb lock held as it will call the page allocator.
      Two instances of alloc_pool_huge_page could run in parallel, or
      alloc_pool_huge_page could run in parallel with adjust_pool_surplus,
      which may result in the variable next_nid_to_alloc becoming invalid for
      the caller and pages being allocated on the wrong node.
      
      Both alloc_pool_huge_page and adjust_pool_surplus are only called from
      the routine set_max_huge_pages after boot.  set_max_huge_pages is only
      called as the result of a user writing to the proc/sysfs nr_hugepages
      or nr_hugepages_mempolicy file to adjust the number of hugetlb pages.
      
      It makes little sense to allow multiple adjustments to the number of
      hugetlb pages in parallel.  Add a mutex to the hstate and use it to only
      allow one hugetlb page adjustment at a time.  This will synchronize
      modifications to the next_nid_to_alloc variable.
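
      The result is one coarse mutex per hstate, held across the whole
      adjustment; roughly (a sketch -- the resize_lock field name and the
      set_max_huge_pages signature are taken from the kernel around this
      series, not stated in this log):

        struct hstate {
                struct mutex resize_lock;  /* serializes pool-size adjustments */
                int next_nid_to_alloc;
                /* ... */
        };

        static int set_max_huge_pages(struct hstate *h, unsigned long count,
                                      int nid, nodemask_t *nodes_allowed)
        {
                mutex_lock(&h->resize_lock);
                /* alloc_pool_huge_page() / adjust_pool_surplus() now run one
                 * adjustment at a time, keeping next_nid_to_alloc consistent */
                mutex_unlock(&h->resize_lock);
                return 0;
        }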
      
      Link: https://lkml.kernel.org/r/20210409205254.242291-4-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <song.bao.hua@hisilicon.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: HORIGUCHI NAOYA <naoya.horiguchi@nec.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      29383967
    • mm/huge_memory.c: remove unused macro TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG · d4afd60c
      Authored by Miaohe Lin
      Commit 4958e4d8 ("mm: thp: remove debug_cow switch") forgot to
      remove TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG macro.  Remove it here.
      
      Link: https://lkml.kernel.org/r/20210318122722.13135-6-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: Peter Xu <peterx@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: yuleixzhang <yulei.kernel@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d4afd60c
    • hugetlb/userfaultfd: unshare all pmds for hugetlbfs when register wp · 6dfeaff9
      Authored by Peter Xu
      Huge pmd sharing for hugetlbfs is racy with userfaultfd-wp because
      userfaultfd-wp is always based on pgtable entries, so they cannot be
      shared.
      
      Walk the hugetlb range and unshare all such mappings, if any exist, right
      before UFFDIO_REGISTER succeeds and returns to userspace.
      
      This pairs with want_pmd_share() in the hugetlb code so that huge pmd
      sharing is completely disabled for the userfaultfd-wp registered range.
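
      The unshare happens at registration time; a sketch of the hook (the
      helper name is the one this patch introduces, while the exact guard
      condition in userfaultfd_register() is abbreviated here and should be
      treated as an assumption):

        /* fs/userfaultfd.c, on the UFFDIO_REGISTER path (sketch) */
        if (is_vm_hugetlb_page(vma) && uffd_disable_huge_pmd_share(vma))
                hugetlb_unshare_all_pmds(vma);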
      
      Link: https://lkml.kernel.org/r/20210218231206.15524-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dfeaff9
    • mm/hugetlb: move flush_hugetlb_tlb_range() into hugetlb.h · 537cf30b
      Authored by Peter Xu
      Prepare for it to be called outside of mm/hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20210218231204.15474-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      537cf30b
    • hugetlb/userfaultfd: forbid huge pmd sharing when uffd enabled · c1991e07
      Authored by Peter Xu
      Huge pmd sharing could bring problems to userfaultfd.  The thing is that
      userfaultfd runs its logic based on special bits in page table entries;
      however, huge pmd sharing could potentially share page table entries
      across different address ranges.  That could cause issues in either of
      these cases:
      
       - When sharing huge pmd page tables for an uffd write protected range,
         the newly mapped huge pmd range will also be write protected
         unexpectedly, or,
      
       - When we try to write protect a huge pmd shared range, we'll first do
         huge_pmd_unshare() in hugetlb_change_protection(); however, that also
         means the UFFDIO_WRITEPROTECT could be silently skipped for the
         shared region, which could lead to data loss.
      
      While at it, a few other things are done altogether:
      
       - Move want_pmd_share() from mm/hugetlb.c into linux/hugetlb.h, because
         that's definitely something that arch code would like to use too
      
       - ARM64 currently checks directly against
         CONFIG_ARCH_WANT_HUGE_PMD_SHARE when trying to share huge pmds.
         Switch to the want_pmd_share() helper.
      
       - Move vma_shareable() from huge_pmd_share() into want_pmd_share().
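
      After the shuffle, the gate reads roughly as follows (a sketch combining
      the three points above):

        bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr)
        {
        #ifdef CONFIG_USERFAULTFD
                if (uffd_disable_huge_pmd_share(vma))
                        return false;  /* uffd relies on per-range pgtable bits */
        #endif
        #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
                return vma_shareable(vma, addr);
        #else
                return false;
        #endif
        }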
      
      [peterx@redhat.com: fix build with !ARCH_WANT_HUGE_PMD_SHARE]
        Link: https://lkml.kernel.org/r/20210310185359.88297-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210218231202.15426-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
      Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c1991e07
    • hugetlb: pass vma into huge_pte_alloc() and huge_pmd_share() · aec44e0f
      Authored by Peter Xu
      Patch series "hugetlb: Disable huge pmd unshare for uffd-wp", v4.
      
      This series tries to disable huge pmd unsharing of hugetlbfs-backed
      memory for uffd-wp.  Although uffd-wp for hugetlbfs is still at the RFC
      stage, the idea of this series may be needed by multiple tasks (Axel's
      uffd minor fault series, and Mike's soft dirty series), so I picked it
      out of the larger series.
      
      This patch (of 4):
      
      This is preparatory work to allow the per-architecture huge_pte_alloc()
      to behave differently according to VMA attributes.
      
      Pass the vma deeper into huge_pmd_share() so that we can avoid the
      find_vma() call.
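
      The signature change itself is mechanical; before and after, roughly:

        /* before */
        pte_t *huge_pte_alloc(struct mm_struct *mm,
                              unsigned long addr, unsigned long sz);

        /* after: the vma rides along so huge_pmd_share() needn't call find_vma() */
        pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
                              unsigned long addr, unsigned long sz);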
      
      [peterx@redhat.com: build fix]
        Link: https://lkml.kernel.org/r/20210304164653.GB397383@xz-x1
      Link: https://lkml.kernel.org/r/20210218230633.15028-1-peterx@redhat.com
      
      Link: https://lkml.kernel.org/r/20210218230633.15028-2-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Adam Ruprecht <ruprecht@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Cannon Matthews <cannonmatthews@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michal Koutn" <mkoutny@suse.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Oliver Upton <oupton@google.com>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Shawn Anastasio <shawn@anastas.io>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aec44e0f
    • mm: remove nrexceptional from inode · 8bc3c481
      Authored by Matthew Wilcox (Oracle)
      We no longer track anything in nrexceptional, so remove it, saving 8 bytes
      per inode.
      
      Link: https://lkml.kernel.org/r/20201026151849.24232-5-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Vishal Verma <vishal.l.verma@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8bc3c481
    • mm: introduce and use mapping_empty() · 7716506a
      Authored by Matthew Wilcox (Oracle)
      Patch series "Remove nrexceptional tracking", v2.
      
      We actually use nrexceptional for very little these days.  It's a minor
      pain to keep in sync with nrpages, but the pain becomes much bigger with
      the THP patches because we don't know how many indices a shadow entry
      occupies.  It's easier to just remove it than keep it accurate.
      
      Also, we save 8 bytes per inode which is nothing to sneeze at; on my
      laptop, it would improve shmem_inode_cache from 22 to 23 objects per
      16kB, and inode_cache from 26 to 27 objects.  Together, that saves
      a megabyte of memory from a combined usage of 25MB for both caches.
      Unfortunately, ext4 doesn't cross a magic boundary, so it doesn't save
      any memory for ext4.
      
      This patch (of 4):
      
      Instead of checking the two counters (nrpages and nrexceptional), we can
      just check whether i_pages is empty.
      
      Link: https://lkml.kernel.org/r/20201026151849.24232-1-willy@infradead.org
      Link: https://lkml.kernel.org/r/20201026151849.24232-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Tested-by: Vishal Verma <vishal.l.verma@intel.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7716506a
  2. 01 May 2021, 22 commits