1. 16 December 2020, 40 commits
    • arm, arm64: move free_unused_memmap() to generic mm · 4f5b0c17
      Mike Rapoport authored
      ARM and ARM64 free unused parts of the memory map just before the
      initialization of the page allocator. To allow holes in the memory map both
      architectures overload pfn_valid() and define HAVE_ARCH_PFN_VALID.
      
      Allowing holes in the memory map for FLATMEM may be useful for small
      machines, such as ARC and m68k, and will enable those architectures to
      cease using DISCONTIGMEM and still support more than one memory bank.
      
      Move the functions that free unused parts of the memory map to generic
      mm and enable them when HAVE_ARCH_PFN_VALID=y.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-10-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4f5b0c17
    • arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL · 5e545df3
      Mike Rapoport authored
      ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL,
      which in turn enables the memmap_valid_within() function that is intended
      to verify the existence of the struct page associated with a pfn when
      there are holes in the memory map.
      
      However, ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID and
      an arch-specific pfn_valid() implementation that also deals with the
      holes in the memory map.
      
      The only two users of memmap_valid_within() call this function after a
      call to pfn_valid(), so the memmap_valid_within() check becomes redundant.
      
      Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely
      entirely on ARM's implementation of pfn_valid() that is now enabled
      unconditionally.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e545df3
    • ia64: make SPARSEMEM default and disable DISCONTIGMEM · 214496cb
      Mike Rapoport authored
      The SPARSEMEM memory model is suitable for systems with large holes in
      their physical memory layout.  With SPARSEMEM_VMEMMAP enabled it provides
      pfn_to_page() and page_to_pfn() as fast as FLATMEM.
      
      Make it the default memory model for IA-64 and disable DISCONTIGMEM,
      which has been considered obsolete for quite some time.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-8-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      214496cb
    • ia64: forbid using VIRTUAL_MEM_MAP with FLATMEM · ea34f78f
      Mike Rapoport authored
      The virtual memory map was intended to avoid wasting memory on the memory
      map on systems with large holes in the physical memory layout.  It has
      long since been superseded, first by DISCONTIGMEM and then by SPARSEMEM.
      Moreover, SPARSEMEM_VMEMMAP provides the same functionality in a much
      more portable way.
      
      As the first step towards removing VIRTUAL_MEM_MAP, forbid its usage with
      FLATMEM and panic on systems with large holes in the physical memory
      layout that try to run FLATMEM kernels.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ea34f78f
    • ia64: split virtual map initialization out of paging_init() · 1f112129
      Mike Rapoport authored
      For both FLATMEM and DISCONTIGMEM/SPARSEMEM the virtual map
      initialization is spread over paging_init() for no good reason.
      
      Split out the bits related to virtual map initialization into helper
      functions, one for FLATMEM and another for !FLATMEM configurations.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1f112129
    • ia64: discontig: paging_init(): remove local max_pfn calculation · b90b5547
      Mike Rapoport authored
      The maximal PFN in the system is calculated during find_memory() and is
      then stored in max_low_pfn.
      
      Use this value in paging_init() and remove the redundant detection of
      max_pfn in that function.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b90b5547
    • ia64: remove 'ifdef CONFIG_ZONE_DMA32' statements · 5d37fc0b
      Mike Rapoport authored
      After the removal of the SN2 platform (commit cf07cb1f ("ia64: remove
      support for the SGI SN2 platform")), IA-64 always has ZONE_DMA32 and
      there is no point in guarding code with this configuration option.
      
      Remove the ifdefery associated with CONFIG_ZONE_DMA32.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5d37fc0b
    • ia64: remove custom __early_pfn_to_nid() · 03e92a5e
      Mike Rapoport authored
      The ia64 implementation of __early_pfn_to_nid() essentially relies on the
      same data as the generic implementation.
      
      The correspondence between memory ranges and nodes is set in memblock
      during early memory initialization in the register_active_ranges()
      function.
      
      The initialization of sparsemem that requires early_pfn_to_nid() happens
      later and it can use the memblock information like the other architectures.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      03e92a5e
    • alpha: switch from DISCONTIGMEM to SPARSEMEM · 36d40290
      Mike Rapoport authored
      Patch series "arch, mm: deprecate DISCONTIGMEM", v2.
      
      DISCONTIGMEM has generally been considered deprecated for a while now,
      but it is still used by four architectures.  This set replaces
      DISCONTIGMEM with a different way to handle holes in the memory map and
      marks DISCONTIGMEM configuration as BROKEN in Kconfigs of these
      architectures with the intention to completely remove it in several
      releases.
      
      While for 64-bit alpha and ia64 the switch to SPARSEMEM is quite obvious
      and was a matter of moving some bits around, for smaller 32-bit arc and
      m68k SPARSEMEM is not necessarily the best thing to do.
      
      On 32-bit machines SPARSEMEM would require large sections to make the
      section index fit in the page flags, but larger sections mean that more
      memory is wasted on unused memory map.
      
      Besides, pfn_to_page() and page_to_pfn() become less efficient, at least
      on arc.
      
      So I've decided to generalize arm's approach for freeing of unused parts
      of the memory map with FLATMEM and enable it for both arc and m68k.  The
      details are in the description of patches 10 (arc) and 13 (m68k).
      
      This patch (of 13):
      
      Enable SPARSEMEM support on alpha and deprecate DISCONTIGMEM.
      
      The required changes are mostly around moving duplicated definitions of
      page access and address conversion macros to a common place and making sure
      they are available for all memory models.
      
      The DISCONTIGMEM support is marked as BROKEN and will be removed in a
      couple of releases.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201101170454.9567-2-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      36d40290
    • lkdtm: disable KASAN for rodata.o · 6d5a88cd
      Marco Elver authored
      Building lkdtm with KASAN and Clang 11 or later results in the following
      error when attempting to load the module:
      
        kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
        BUG: unable to handle page fault for address: ffffffffc019cd70
        #PF: supervisor instruction fetch in kernel mode
        #PF: error_code(0x0011) - permissions violation
        ...
        RIP: 0010:asan.module_ctor+0x0/0xffffffffffffa290 [lkdtm]
        ...
        Call Trace:
         do_init_module+0x17c/0x570
         load_module+0xadee/0xd0b0
         __x64_sys_finit_module+0x16c/0x1a0
         do_syscall_64+0x34/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that rodata.o generates a dummy function that lives in
      .rodata to validate that .rodata can't be executed; however, Clang 11 adds
      KASAN globals support by generating module constructors to initialize
      globals redzones.  When Clang 11 adds a module constructor to rodata.o, it
      is also added to .rodata: any attempt to call it on initialization results
      in the above error.
      
      Therefore, disable KASAN instrumentation for rodata.o.
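
      A minimal sketch of the fix under the usual kbuild opt-out convention
      (the Makefile path is an assumption):
      
        # drivers/misc/lkdtm/Makefile (assumed location): opt rodata.o out
        # of KASAN instrumentation so no module constructor is generated.
        KASAN_SANITIZE_rodata.o := n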
      
      Link: https://lkml.kernel.org/r/20201214191413.3164796-1-elver@google.com
      Signed-off-by: Marco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d5a88cd
    • kasan: update documentation for generic kasan · 4784be28
      Walter Wu authored
      Generic KASAN now also records the last two workqueue stacks and prints
      them in the KASAN report, so the documentation needs to be updated.
      
      Link: https://lkml.kernel.org/r/20201203023037.30792-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4784be28
    • lib/test_kasan.c: add workqueue test case · 214c783d
      Walter Wu authored
      Add a test to verify that the workqueue stack is recorded and printed in
      the KASAN report.
      
      The KASAN report was as follows (cleaned up slightly):
      
       BUG: KASAN: use-after-free in kasan_workqueue_uaf
      
       Freed by task 54:
        kasan_save_stack+0x24/0x50
        kasan_set_track+0x24/0x38
        kasan_set_free_info+0x20/0x40
        __kasan_slab_free+0x10c/0x170
        kasan_slab_free+0x10/0x18
        kfree+0x98/0x270
        kasan_workqueue_work+0xc/0x18
      
       Last potentially related work creation:
        kasan_save_stack+0x24/0x50
        kasan_record_wq_stack+0xa8/0xb8
        insert_work+0x48/0x288
        __queue_work+0x3e8/0xc40
        queue_work_on+0xf4/0x118
        kasan_workqueue_uaf+0xfc/0x190
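
      A minimal sketch of such a test, using names from the report above; the
      actual lib/test_kasan.c code may differ in detail:
      
        static noinline void kasan_workqueue_work(struct work_struct *work)
        {
                kfree(work);    /* free the work item from its own handler */
        }
        
        static noinline void kasan_workqueue_uaf(void)
        {
                struct workqueue_struct *workqueue;
                struct work_struct *work;
        
                workqueue = create_workqueue("kasan_wq_test");
                work = kmalloc(sizeof(struct work_struct), GFP_KERNEL);
                if (!workqueue || !work)
                        return;
        
                INIT_WORK(work, kasan_workqueue_work);
                queue_work(workqueue, work);
                destroy_workqueue(workqueue);   /* drains the queue; 'work' is freed */
        
                ((volatile struct work_struct *)work)->data;    /* use-after-free read */
        }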
      
      Link: https://lkml.kernel.org/r/20201203022748.30681-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      214c783d
    • kasan: print workqueue stack · ef133461
      Walter Wu authored
      The two aux_stack[] slots are reused to record the call_rcu() call stack
      and the work enqueuing call stacks, so change the auxiliary stack titles
      to a common one and print the stacks in the KASAN report.
      
      Link: https://lkml.kernel.org/r/20201203022715.30635-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef133461
    • workqueue: kasan: record workqueue stack · e89a85d6
      Walter Wu authored
      Patch series "kasan: add workqueue stack for generic KASAN", v5.
      
      Syzbot reports many UAF issues for workqueue, see [1].
      
      In some of these the access/allocation happened in process_one_work(),
      and the free stack in the KASAN report is useless; it doesn't help
      programmers solve workqueue UAF issues.
      
      This patchset improves KASAN reports by making them include the
      workqueue queueing stack.  This is useful for programmers solving
      use-after-free or double-free memory issues.
      
      Generic KASAN records the last two workqueue stacks and prints them in
      the KASAN report.  This is only supported by generic KASAN.
      
      [1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
      [2] https://bugzilla.kernel.org/show_bug.cgi?id=198437
      
      This patch (of 4):
      
      When analyzing use-after-free or double-free issues, recording the
      enqueuing work stacks is helpful to preserve usage history, which
      potentially gives a hint about the affected code.
      
      For workqueues it has turned out to be useful to record the enqueuing
      work call stacks: users can consult the KASAN report to determine the
      root cause, without having to enable debugobjects.
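
      A sketch of the queueing-side hook this describes, assuming the mainline
      helper name kasan_record_aux_stack() (the report above shows it via
      kasan_record_wq_stack()):
      
        static void insert_work(struct pool_workqueue *pwq,
                                struct work_struct *work,
                                struct list_head *head,
                                unsigned int extra_flags)
        {
                /* Record the enqueuing call stack so a later KASAN report
                 * can print where this work was queued from. */
                kasan_record_aux_stack(work);
        
                /* ... original insert_work() body continues unchanged ... */
        }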
      
      Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
      Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.com
      Signed-off-by: Walter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: Marco Elver <elver@google.com>
      Acked-by: Marco Elver <elver@google.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
      Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e89a85d6
    • mm/vmalloc.c: fix kasan shadow poisoning size · c041098c
      Vincenzo Frascino authored
      The size of a vm area can be affected by the presence or absence of the
      guard page.  In particular, when VM_NO_GUARD is not present, the actual
      accessible size has to be considered as the real size minus the guard
      page.
      
      Currently kasan does not take this information into account during the
      poison operation and in particular tries to poison the guard page as
      well.
      
      This approach, even if incorrect, does not cause an issue because the
      tags for the guard page are written in the shadow memory.  With the
      future introduction of the Tag-Based KASAN, the guard page being
      inaccessible by nature, the write tag operation on this page triggers a
      fault.
      
      Fix the kasan shadow poisoning size by invoking get_vm_area_size()
      instead of accessing the field in the data structure directly, to obtain
      the correct value.
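
      A sketch of the fix in the vfree path, assuming the mainline helper
      names:
      
        /* Poison only the accessible part of the area:
         * get_vm_area_size() subtracts the guard page when one is present,
         * whereas area->size would include it. */
        kasan_poison_vmalloc(area->addr, get_vm_area_size(area));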
      
      Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
      Fixes: d98c9e83 ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
      Signed-off-by: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c041098c
    • docs/vm: remove unused 3 items explanation for /proc/vmstat · 56db19fe
      Alex Shi authored
      Commit 5647bc29 ("mm: compaction: Move migration fail/success
      stats to migrate.c") removed three items from /proc/vmstat, but the docs
      still have their explanation.  Let's remove them:
      
      "compact_blocks_moved",
      "compact_pages_moved",
      "compact_pagemigrate_failed",
      
      Link: https://lkml.kernel.org/r/1605520282-51993-1-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56db19fe
    • mm/vmalloc: Fix unlock order in s_stop() · 0a7dd4e9
      Waiman Long authored
      When multiple locks are acquired, they should be released in reverse
      order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
      case.
      
        s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
        s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);
      
      This unlock sequence, though allowed, is not optimal. If a waiter is
      present, mutex_unlock() will need to go through the slowpath of waking
      up the waiter with preemption disabled. Fix that by releasing the
      spinlock first before the mutex.
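
      A sketch of the reordered unlock path (simplified from mm/vmalloc.c):
      
        static void s_stop(struct seq_file *m, void *p)
        {
                /* Release in reverse order of acquisition: drop the
                 * spinlock first so the mutex slowpath wakeup runs with
                 * preemption enabled. */
                spin_unlock(&vmap_area_lock);
                mutex_unlock(&vmap_purge_lock);
        }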
      
      Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
      Fixes: e36176be ("mm/vmalloc: rework vmap_area_lock")
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0a7dd4e9
    • mm/vmalloc: add 'align' parameter explanation for pvm_determine_end_from_reverse · 799fa85d
      Alex Shi authored
      Kernel-doc markup has an issue on pvm_determine_end_from_reverse:
      
        mm/vmalloc.c:3145: warning: Function parameter or member 'align' not described in 'pvm_determine_end_from_reverse'
      
      Add an explanation for it to remove the warning.
      
      Link: https://lkml.kernel.org/r/1605605088-30668-3-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      799fa85d
    • mm/vmalloc: rework the drain logic · 96e2db45
      Uladzislau Rezki (Sony) authored
      The current "lazy drain" model suffers from at least two issues.
      
      The first is related to the unsorted list of vmap areas: in order to
      identify the [min:max] range of areas to be drained, it requires a full
      list scan, which is time consuming if the list is long.
      
      The second, as a next step, is the merging of all fragments with the
      free space, which is also time consuming because it has to iterate over
      the entire list holding the outstanding lazy areas.
      
      See below the "preemptirqsoff" tracer output that illustrates a high
      latency.  It is ~24676us.  Our workloads like audio and video are
      affected by such long latency:
      
      <snip>
        tracer: preemptirqsoff
      
        preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
        --------------------------------------------------------------------
        latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
           -----------------
           | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
           -----------------
         => started at: __purge_vmap_area_lazy
         => ended at:   __purge_vmap_area_lazy
      
                         _------=> CPU#
                        / _-----=> irqs-off
                       | / _----=> need-resched
                       || / _---=> hardirq/softirq
                       ||| / _--=> preempt-depth
                       |||| /     delay
         cmd     pid   ||||| time  |   caller
            \   /      |||||  \    |   /
      crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
      [...]
      crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24683us : <stack trace>
       => free_vmap_area_noflush
       => remove_vm_area
       => __vunmap
       => vfree
       => drm_property_free_blob
       => drm_mode_object_unreference
       => drm_property_unreference_blob
       => __drm_atomic_helper_crtc_destroy_state
       => sde_crtc_destroy_state
       => drm_atomic_state_default_clear
       => drm_atomic_state_clear
       => drm_atomic_state_free
       => complete_commit
       => _msm_drm_commit_work_cb
       => kthread_worker_fn
       => kthread
       => ret_from_fork
      <snip>
      
      To address those two issues we can redesign the purging of the
      outstanding lazy areas.  Instead of queuing vmap areas onto a list, keep
      them in a separate rb-tree; an area is then located in the tree/list in
      ascending order.  This gives us the following advantages (a code sketch
      follows the list):
      
      a) Outstanding vmap areas are merged creating bigger coalesced blocks,
         thus it becomes less fragmented.
      
      b) It is possible to calculate a flush range [min:max] without scanning
         all elements.  It is O(1) access time or complexity;
      
      c) The final merge of areas with the rb-tree that represents a free
         space is faster because of (a).  As a result the lock contention is
         also reduced.
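
      A sketch of advantage (b): with outstanding areas kept in sorted order,
      the flush range falls out of the first and last entries.  The variable
      and list names here are assumptions based on the changelog, not the
      exact patch:
      
        unsigned long start, end;
        
        /* The purge list is sorted by address, so the [min:max] flush
         * range is just its two ends - no full scan required. */
        start = list_first_entry(&purge_list, struct vmap_area, list)->va_start;
        end = list_last_entry(&purge_list, struct vmap_area, list)->va_end;
        flush_tlb_kernel_range(start, end);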
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96e2db45
    • mm/vmalloc: use free_vm_area() if an allocation fails · 8945a723
      Uladzislau Rezki (Sony) authored
      There is a dedicated and separate function that finds and removes a
      continuous kernel virtual area.  As a final step it also releases the
      "area", the descriptor of the corresponding vm_struct.
      
      Use free_vm_area() in __vmalloc_node_range() instead of open-coding
      exactly the same steps, as a cleanup.
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-1-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8945a723
    • mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow · 34fe6537
      Andrew Morton authored
      On a machine with more than 2 TB of memory (for example 3 TB), using
      vmalloc to allocate more than 2 TB overflows array_size.
      
      array_size is an unsigned int and can only be used to allocate less than
      2 TB.  If you pass 2*1024*1024*1024*1024 = 2 * 2^40 in the argument of
      vmalloc, array_size becomes 2*2^31 = 2^32, which cannot be stored in a
      32-bit integer.
      
      The fix is to change the type of array_size to unsigned long.
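
      A self-contained illustration of the wrap-around (a hypothetical
      user-space snippet, not the kernel code):
      
        #include <stdio.h>
        
        int main(void)
        {
                /* 2 TB request with 4 KiB pages: 2^41 / 2^12 = 2^29 pages */
                unsigned long nr_pages = (2UL << 40) >> 12;
        
                /* 2^29 * 8 = 2^32: wraps to 0 in an unsigned int ... */
                unsigned int bad = nr_pages * sizeof(void *);
                /* ... but is preserved in an unsigned long */
                unsigned long good = nr_pages * sizeof(void *);
        
                printf("unsigned int: %u, unsigned long: %lu\n", bad, good);
                return 0;
        }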
      
      [akpm@linux-foundation.org: rework for current mainline]
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023
      Reported-by: <hsinhuiwu@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      34fe6537
    • locking/selftests: add testcases for fs_reclaim · d5037d1d
      Daniel Vetter authored
      Since I butchered this I figured it's better to make sure we have
      testcases for it now.  Since we only have a locking context for __GFP_FS,
      that's the only thing we're testing right now.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-4-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d5037d1d
    • mm: extract might_alloc() debug check · 95d6c701
      Daniel Vetter authored
      Extracted from slab.h, which seems to have the most complete version
      including the correct might_sleep() check.  Roll it out to slob.c.
      
      Motivated by a discussion with Paul about possibly changing call_rcu
      behaviour to allocate memory, but only roughly every 500th call.
      
      There are a lot fewer places in the kernel that care about whether
      allocating memory is allowed or not (due to deadlocks with reclaim code)
      than places that care whether sleeping is allowed.  But debugging these
      also tends to be a lot harder, so nice descriptive checks could come in
      handy.  I might have some use eventually for annotations in drivers/gpu.
      
      Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
      not consult the PF_MEMALLOC flags.  But there is no flag equivalent for
      GFP_NOWAIT, hence this check can't go wrong due to
      memalloc_no*_save/restore contexts.  Willy is working on a patch series
      which might change this:
      
      https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/
      
      I think best would be if that updates gfpflags_allow_blocking(), since
      there's a ton of callers all over the place for that already.
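
      The extracted helper boils down to something like the following sketch
      of the mainline shape (not guaranteed verbatim):
      
        static inline void might_alloc(gfp_t gfp_mask)
        {
                /* lockdep tracking of reclaim recursion */
                fs_reclaim_acquire(gfp_mask);
                fs_reclaim_release(gfp_mask);
        
                /* the might_sleep() check mentioned above */
                might_sleep_if(gfpflags_allow_blocking(gfp_mask));
        }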
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      95d6c701
    • mm: track mmu notifiers in fs_reclaim_acquire/release · f920e413
      Daniel Vetter authored
      fs_reclaim_acquire/release nicely catch recursion issues when allocating
      GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
      the excessive caches in check).  For mmu notifier recursions we do have
      lockdep annotations since 23b68395 ("mm/mmu_notifiers: add a lockdep
      map for invalidate_range_start/end").
      
      But these only fire if a path actually results in some pte invalidation -
      for most small allocations that's very rarely the case.  The other trouble
      is that pte invalidation can happen any time when __GFP_RECLAIM is set.
      Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good
      enough to avoid potential mmu notifier recursion.
      
      I was pondering whether we should just do the general annotation, but
      there's always the risk of false positives.  Plus I'm assuming that the
      core fs and io code is a lot better reviewed and tested than random mmu
      notifier code in drivers.  Hence I decided to only annotate that
      specific case.
      
      Furthermore even if we'd create a lockdep map for direct reclaim, we'd
      still need to explicitly pull in the mmu notifier map - there are a lot
      more places that do pte invalidation than just direct reclaim; these two
      contexts aren't the same.
      
      Note that the mmu notifiers needing their own independent lockdep map is
      also the reason we can't hold them from fs_reclaim_acquire to
      fs_reclaim_release - it would nest with the acquisition in the pte
      invalidation code, causing a lockdep splat.  And we can't remove the
      annotations from pte invalidation and all the other places since they're
      called from many more places than just page reclaim.  Hence we can only
      do the equivalent of might_lock, but on the raw lockdep map.
      
      With this we can also remove the lockdep priming added in 66204f1d
      ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
      more powerful.
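
      A sketch of the resulting annotation; the helper names are assumptions
      modeled on mainline and may not match the final code exactly:
      
        void fs_reclaim_acquire(gfp_t gfp_mask)
        {
                gfp_mask = current_gfp_context(gfp_mask);
        
                if (__need_reclaim(gfp_mask)) {
                        if (gfp_mask & __GFP_FS)
                                __fs_reclaim_acquire();
        
                        /* the equivalent of might_lock() on the raw mmu
                         * notifier lockdep map */
                        lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
                        lock_map_release(&__mmu_notifier_invalidate_range_start_map);
                }
        }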
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.ch
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f920e413
    • mm: forbid splitting special mappings · 871402e0
      Dmitry Safonov authored
      Don't allow splitting of vm_special_mapping's.  It affects vdso/vvar
      areas.  Uprobes have only one page in xol_area, so they aren't affected.
      
      Those restrictions were enforced by checks in .mremap() callbacks.
      Restrict resizing with a generic .split() callback instead, as sketched
      below.
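
      A sketch of that generic callback (assumed shape): the check simply
      refuses any split of a special mapping:
      
        static int special_mapping_split(struct vm_area_struct *vma,
                                         unsigned long addr)
        {
                /* Forbid splitting special mappings - the kernel has
                 * expectations about the number of pages in the mapping. */
                return -EINVAL;
        }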
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.com
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      871402e0
    • mremap: check if it's possible to split original vma · 73d5e062
      Dmitry Safonov authored
      If the original VMA can't be split at the desired address, do_munmap()
      will fail and leave both the newly copied VMA and the old VMA.  De facto
      that's MREMAP_DONTUNMAP behaviour, which is unexpected.
      
      Currently, it may fail this way for hugetlbfs and dax device mappings.
      
      Minimize such unpleasant situations to OOM by checking .may_split()
      before attempting to create a VMA copy.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-6-dima@arista.com
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73d5e062
    • vm_ops: rename .split() callback to .may_split() · dd3b614f
      Dmitry Safonov authored
      Rename the callback to reflect that it's not called *on* or *after* split,
      but rather some time before the splitting to check if it's possible.
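
      After the rename the hook reads as a permission check rather than a
      notification; a sketch of the struct member:
      
        struct vm_operations_struct {
                /* ... */
                /* Called before a VMA is split; return 0 to allow the
                 * split at addr, or an errno to refuse it. */
                int (*may_split)(struct vm_area_struct *area,
                                 unsigned long addr);
                /* ... */
        };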
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.com
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dd3b614f
    • mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Dmitry Safonov authored
      As the kernel expects to see only one such mapping, any further
      operations on the VMA copy may be unexpected by the kernel.  Maybe it's
      erring on the safe side, but there doesn't seem to be any expected
      use-case for this, so restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd544fd1
    • mm/mremap: for MREMAP_DONTUNMAP check security_vm_enough_memory_mm() · ad8ee77e
      Dmitry Safonov authored
      Currently memory is accounted post-mremap() with MREMAP_DONTUNMAP, which
      may break the overcommit policy.  So, check if there's enough memory
      before doing the actual VMA copy.
      
      Don't unset VM_ACCOUNT on MREMAP_DONTUNMAP.  By semantics, such an
      mremap() is actually a memory allocation.  That also simplifies the
      error path a little.
      
      Also, as it's a memory allocation on success, don't reset the hiwater_vm
      value.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-3-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ad8ee77e
    • mm/mremap: account memory on do_munmap() failure · 51df7bcb
      Dmitry Safonov authored
      Patch series "mremap: move_vma() fixes".
      
      This patch (of 6):
      
      move_vma() copies the VMA without accounting for it, then unmaps the old
      part of the VMA.  On failure it unmaps the new VMA.  With hacks,
      accounting in munmap is disabled as it's a copy of an existing VMA.
      
      Account the memory on munmap() failure which was previously copied into
      a new VMA.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-1-dima@arista.com
      Link: https://lkml.kernel.org/r/20201013013416.390574-2-dima@arista.com
      Fixes: commit e2ea83742133 ("[PATCH] mremap: move_vma fixes and cleanup")
      Signed-off-by: Dmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51df7bcb
    • mm: move free_unref_page to mm/internal.h · 0966aeb4
      Matthew Wilcox (Oracle) authored
      Code outside mm/ should not be calling free_unref_page().  Also move
      free_unref_page_list().
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-2-willy@infradead.org
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0966aeb4
    • sparc: fix handling of page table constructor failure · 06517c9a
      Matthew Wilcox (Oracle) authored
      The page has just been allocated, so its refcount is 1.  free_unref_page()
      is for use on pages which have a zero refcount.  Use __free_page() like
      the other implementations of pte_alloc_one().
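
      A sketch of the fix pattern in pte_alloc_one() (simplified; sparc's
      actual code differs in detail):
      
        pgtable_t pte_alloc_one(struct mm_struct *mm)
        {
                struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
        
                if (!page)
                        return NULL;
                if (!pgtable_pte_page_ctor(page)) {
                        /* refcount is still 1: free with __free_page(),
                         * not free_unref_page() */
                        __free_page(page);
                        return NULL;
                }
                /* on sparc64, pgtable_t is pte_t * */
                return (pte_t *)page_address(page);
        }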
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-1-willy@infradead.org
      Fixes: 1ae9ae5f ("sparc: handle pgtable_page_ctor() fail")
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06517c9a
    • mm: mmap_lock: add tracepoints around lock acquisition · 2b5067a8
      Axel Rasmussen authored
      The goal of these tracepoints is to be able to debug lock contention
      issues.  This lock is acquired on most (all?) mmap / munmap / page fault
      operations, so a multi-threaded process which does a lot of these can
      experience significant contention.
      
      We trace just before we start acquisition, when the acquisition returns
      (whether it succeeded or not), and when the lock is released (or
      downgraded).  The events are broken out by lock type (read / write).
      
      The events are also broken out by memcg path.  For container-based
      workloads, users often think of several processes in a memcg as a single
      logical "task", so collecting statistics at this level is useful.
      
      The end goal is to get latency information.  This isn't directly included
      in the trace events.  Instead, users are expected to compute the time
      between "start locking" and "acquire returned", using e.g.  synthetic
      events or BPF.  The benefit we get from this is simpler code.
      
      Because we use tracepoint_enabled() to decide whether or not to trace,
      this patch has effectively no overhead unless tracepoints are enabled at
      runtime.  If tracepoints are enabled, there is a performance impact, but
      how much depends on exactly what e.g.  the BPF program does.
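
      A sketch of the tracepoint_enabled() guard described above (the wrapper
      names follow the mainline pattern but should be treated as assumptions):
      
        static inline void mmap_write_lock(struct mm_struct *mm)
        {
                /* compiles down to a static-key test: effectively free
                 * unless the tracepoint is enabled at runtime */
                if (tracepoint_enabled(mmap_lock_start_locking))
                        __mmap_lock_do_trace_start_locking(mm, true);
        
                down_write(&mm->mmap_lock);
        
                if (tracepoint_enabled(mmap_lock_acquire_returned))
                        __mmap_lock_do_trace_acquire_returned(mm, true, true);
        }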
      
      [axelrasmussen@google.com: fix use-after-free race and css ref leak in tracepoints]
        Link: https://lkml.kernel.org/r/20201130233504.3725241-1-axelrasmussen@google.com
      [axelrasmussen@google.com: v3]
        Link: https://lkml.kernel.org/r/20201207213358.573750-1-axelrasmussen@google.com
      [rostedt@goodmis.org: in-depth examples of tracepoint_enabled() usage, and per-cpu-per-context buffer design]
      
      Link: https://lkml.kernel.org/r/20201105211739.568279-2-axelrasmussen@google.com
      Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2b5067a8
    • mm/page_vma_mapped.c: add colon to fix kernel-doc markups error for check_pte · 777f303c
      Alex Shi authored
      check_pte() needs a correct colon in its kernel-doc markup; otherwise gcc
      emits the following warning with W=1:
      
        mm/page_vma_mapped.c:86: warning: Function parameter or member 'pvmw' not described in 'check_pte'
      
      Link: https://lkml.kernel.org/r/1605597167-25145-1-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777f303c
    • mm/mapping_dirty_helpers: enhance the kernel-doc markups · f5b7e739
      Alex Shi authored
      Add and change the parameter explanations for wp_pte and
      clean_record_pte, to avoid W=1 warnings:
      
        mm/mapping_dirty_helpers.c:34: warning: Function parameter or member 'end' not described in 'wp_pte'
        mm/mapping_dirty_helpers.c:88: warning: Function parameter or member 'end' not described in 'clean_record_pte'
      
      Link: https://lkml.kernel.org/r/1605605088-30668-2-git-send-email-alex.shi@linux.alibaba.com
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5b7e739
    • J
      mm: cleanup: remove unused tsk arg from __access_remote_vm · d3f5ffca
      Committed by John Hubbard
      Despite a comment that said that page fault accounting would be charged to
      whatever task_struct* was passed into __access_remote_vm(), the tsk
      argument was actually unused.
      
      Making page fault accounting actually use this task struct is quite a
      project, so there is no point in keeping the tsk argument.
      
      Delete both the comment, and the argument.
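      
      Sketched from the description above, the interface change looks roughly
      like this (parameter names are illustrative):
      
        /* Before: a task_struct was threaded through but never consumed. */
        int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
                               unsigned long addr, void *buf, int len,
                               unsigned int gup_flags);
        
        /* After: callers pass only the mm they want to access. */
        int __access_remote_vm(struct mm_struct *mm, unsigned long addr,
                               void *buf, int len, unsigned int gup_flags);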
      
      [rppt@linux.ibm.com: changelog addition]
      
      Link: https://lkml.kernel.org/r/20201026074137.4147787-1-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d3f5ffca
    • K
      x86: mremap speedup - Enable HAVE_MOVE_PUD · be37c98d
      Committed by Kalesh Singh
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the
      source and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled, the measurements below show approximately
      a 13x performance improvement on x86.
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping
      a PUD-aligned, 1GB sized region to a PUD-aligned destination.
      The results from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on x86. All times are in nanoseconds.
      
        Control        HAVE_MOVE_PUD
      
        180394         15089
        235728         14056
        238931         25741
        187330         13838
        241742         14187
        177925         14778
        182758         14728
        160872         14418
        205813         15107
        245722         13998
      
        205721.5       15594    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~205 microseconds
      to ~15 microseconds on x86. (~13x speed up).
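      
      For reference, a minimal userspace sketch of such a benchmark (my own
      illustration, not the harness used above; it assumes x86-64 with 4 KiB
      pages and 4-level paging, so a PUD spans 1 GiB):
      
        #define _GNU_SOURCE
        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <time.h>
        
        #define PUD_SIZE (1UL << 30)    /* 1 GiB PUD span: see assumption above */
        
        /* Reserve a PUD-aligned window by over-allocating and trimming. */
        static void *map_pud_aligned(size_t len)
        {
                char *raw = mmap(NULL, len + PUD_SIZE, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
                char *aligned;
        
                if (raw == MAP_FAILED)
                        return MAP_FAILED;
                aligned = (char *)(((uintptr_t)raw + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
                if (aligned != raw)
                        munmap(raw, aligned - raw);              /* trim leading pad */
                munmap(aligned + len, raw + PUD_SIZE - aligned); /* trim trailing pad */
                return aligned;
        }
        
        int main(void)
        {
                void *src = map_pud_aligned(PUD_SIZE);
                void *dst = map_pud_aligned(PUD_SIZE);  /* PUD-aligned destination */
                struct timespec t0, t1;
        
                if (src == MAP_FAILED || dst == MAP_FAILED)
                        return 1;
                memset(src, 1, PUD_SIZE);               /* fault all pages in */
        
                clock_gettime(CLOCK_MONOTONIC, &t0);
                /* MREMAP_FIXED atomically replaces the reservation at dst. */
                if (mremap(src, PUD_SIZE, PUD_SIZE,
                           MREMAP_MAYMOVE | MREMAP_FIXED, dst) == MAP_FAILED)
                        return 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);
        
                printf("mremap took %ld ns\n",
                       (long)(t1.tv_sec - t0.tv_sec) * 1000000000L +
                       (t1.tv_nsec - t0.tv_nsec));
                return 0;
        }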
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Ingo Molnar <mingo@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      be37c98d
    • K
      arm64: mremap speedup - enable HAVE_MOVE_PUD · f5308c89
      Committed by Kalesh Singh
      HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source
      and destination addresses are PUD-aligned.
      
      With HAVE_MOVE_PUD enabled, the measurements below show approximately
      a 19x performance improvement on arm64.
      
      ------- Test Results ---------
      
      The following results were obtained using a 5.4 kernel, by remapping a
      PUD-aligned, 1GB sized region to a PUD-aligned destination.  The results
      from 10 iterations of the test are given below:
      
      Total mremap times for 1GB data on arm64. All times are in nanoseconds.
      
        Control          HAVE_MOVE_PUD
      
        1247761          74271
        1219896          46771
        1094792          59687
        1227760          48385
        1043698          76666
        1101771          50365
        1159896          52500
        1143594          75261
        1025833          61354
        1078125          48697
      
        1134312.6        59395.7    <-- Mean time in nanoseconds
      
      A 1GB mremap completion time drops from ~1.1 milliseconds to ~59
      microseconds on arm64.  (~19x speed up).
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-5-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5308c89
    • K
      mm: speedup mremap on 1GB or larger regions · c49dd340
      Committed by Kalesh Singh
      Android needs to move large memory regions for garbage collection.  The GC
      requires moving physical pages of multi-gigabyte heap using mremap.
      During this move, the application threads have to be paused for
      correctness.  It is critical to keep this pause as short as possible to
      avoid jitters during user interaction.
      
      Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if
      the source and destination addresses are PUD-aligned.  For
      CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD
      entries, since the PUD entry is “folded back” onto the PGD entry.  Add
      HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't
      supported/tested can turn this off by not selecting the config.
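      
      A hedged sketch of the PUD-level move, condensed from the shape of the
      mm/mremap.c change described above (locking and error handling
      simplified):
      
        static bool move_normal_pud(struct vm_area_struct *vma,
                                    unsigned long old_addr, unsigned long new_addr,
                                    pud_t *old_pud, pud_t *new_pud)
        {
                struct mm_struct *mm = vma->vm_mm;
                spinlock_t *old_ptl, *new_ptl;
                pud_t pud;
        
                /* The destination PUD must be empty. */
                if (WARN_ON_ONCE(!pud_none(*new_pud)))
                        return false;
        
                /* Exclusive mmap_lock makes src/dst ptlock ordering safe. */
                old_ptl = pud_lock(mm, old_pud);
                new_ptl = pud_lockptr(mm, new_pud);
                if (new_ptl != old_ptl)
                        spin_lock_nested(new_ptl, SINGLE_DEPTH_NESTING);
        
                /* Move one PUD entry instead of 512 PMDs / 262144 PTEs. */
                pud = *old_pud;
                pud_clear(old_pud);
                set_pud_at(mm, new_addr, new_pud, pud);
                flush_tlb_range(vma, old_addr, old_addr + PUD_SIZE);
        
                if (new_ptl != old_ptl)
                        spin_unlock(new_ptl);
                spin_unlock(old_ptl);
                return true;
        }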
      
      Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.com
      Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: kernel test robot <lkp@intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Gavin Shan <gshan@redhat.com>
      Cc: Hassan Naveed <hnaveed@wavecomp.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jia He <justin.he@arm.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Krzysztof Kozlowski <krzk@kernel.org>
      Cc: Lokesh Gidra <lokeshgidra@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mina Almasry <almasrymina@google.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ram Pai <linuxram@us.ibm.com>
      Cc: Sami Tolvanen <samitolvanen@google.com>
      Cc: Sandipan Das <sandipan@linux.ibm.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Steven Price <steven.price@arm.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c49dd340