1. 16 12月, 2020 40 次提交
    • M
      PM: hibernate: make direct map manipulations more explicit · 2abf962a
      Mike Rapoport 提交于
      When DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP is enabled a page may be
      not present in the direct map and has to be explicitly mapped before it
      could be copied.
      
      Introduce hibernate_map_page() and hibernation_unmap_page() that will
      explicitly use set_direct_map_{default,invalid}_noflush() for
      ARCH_HAS_SET_DIRECT_MAP case and debug_pagealloc_{map,unmap}_pages() for
      DEBUG_PAGEALLOC case.
      
      The remapping of the pages in safe_copy_page() presumes that it only
      changes protection bits in an existing PTE and so it is safe to ignore
      return value of set_direct_map_{default,invalid}_noflush().
      
      Still, add a pr_warn() so that future changes in set_memory APIs will not
      silently break hibernation.
      
      Link: https://lkml.kernel.org/r/20201109192128.960-3-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2abf962a
    • M
      mm: introduce debug_pagealloc_{map,unmap}_pages() helpers · 77bc7fd6
      Mike Rapoport 提交于
      Patch series "arch, mm: improve robustness of direct map manipulation", v7.
      
      During recent discussion about KVM protected memory, David raised a
      concern about usage of __kernel_map_pages() outside of DEBUG_PAGEALLOC
      scope [1].
      
      Indeed, for architectures that define CONFIG_ARCH_HAS_SET_DIRECT_MAP it is
      possible that __kernel_map_pages() would fail, but since this function is
      void, the failure will go unnoticed.
      
      Moreover, there's lack of consistency of __kernel_map_pages() semantics
      across architectures as some guard this function with #ifdef
      DEBUG_PAGEALLOC, some refuse to update the direct map if page allocation
      debugging is disabled at run time and some allow modifying the direct map
      regardless of DEBUG_PAGEALLOC settings.
      
      This set straightens this out by restoring dependency of
      __kernel_map_pages() on DEBUG_PAGEALLOC and updating the call sites
      accordingly.
      
      Since currently the only user of __kernel_map_pages() outside
      DEBUG_PAGEALLOC is hibernation, it is updated to make direct map accesses
      there more explicit.
      
      [1] https://lore.kernel.org/lkml/2759b4bf-e1e3-d006-7d86-78a40348269d@redhat.com
      
      This patch (of 4):
      
      When CONFIG_DEBUG_PAGEALLOC is enabled, it unmaps pages from the kernel
      direct mapping after free_pages().  The pages than need to be mapped back
      before they could be used.  Theese mapping operations use
      __kernel_map_pages() guarded with with debug_pagealloc_enabled().
      
      The only place that calls __kernel_map_pages() without checking whether
      DEBUG_PAGEALLOC is enabled is the hibernation code that presumes
      availability of this function when ARCH_HAS_SET_DIRECT_MAP is set.  Still,
      on arm64, __kernel_map_pages() will bail out when DEBUG_PAGEALLOC is not
      enabled but set_direct_map_invalid_noflush() may render some pages not
      present in the direct map and hibernation code won't be able to save such
      pages.
      
      To make page allocation debugging and hibernation interaction more robust,
      the dependency on DEBUG_PAGEALLOC or ARCH_HAS_SET_DIRECT_MAP has to be
      made more explicit.
      
      Start with combining the guard condition and the call to
      __kernel_map_pages() into debug_pagealloc_map_pages() and
      debug_pagealloc_unmap_pages() functions to emphasize that
      __kernel_map_pages() should not be called without DEBUG_PAGEALLOC and use
      these new functions to map/unmap pages when page allocation debugging is
      enabled.
      
      Link: https://lkml.kernel.org/r/20201109192128.960-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201109192128.960-2-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Albert Ou <aou@eecs.berkeley.edu>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77bc7fd6
    • M
      m68k: deprecate DISCONTIGMEM · fcd353a3
      Mike Rapoport 提交于
      DISCONTIGMEM was intended to provide more efficient support for systems
      with holes in their physical address space that FLATMEM did.
      
      Yet, it's overhead in terms of the memory consumption seems to
      overweight the savings on the unused memory map.
      
      For a ARAnyM system with 16 MBytes of FastRAM configured, the memory
      usage reported after page allocator initialization is
      
        Memory: 23828K/30720K available (3206K kernel code, 535K rwdata, 936K rodata, 768K init, 193K bss, 6892K reserved, 0K cma-reserved)
      
      and with DISCONTIGMEM disabled and with relatively large hole in the memory
      map it is:
      
        Memory: 23864K/30720K available (3197K kernel code, 516K rwdata, 936K rodata, 764K init, 179K bss, 6856K reserved, 0K cma-reserved)
      
      Moreover, since m68k already has custom pfn_valid() it is possible to
      define HAVE_ARCH_PFN_VALID to enable freeing of unused memory map.  The
      minimal size of a hole that can be freed should not be less than
      MAX_ORDER_NR_PAGES so to achieve more substantial memory savings let
      m68k also define custom FORCE_MAX_ZONEORDER.
      
      With FORCE_MAX_ZONEORDER set to 9 memory usage becomes:
      
        Memory: 23880K/30720K available (3197K kernel code, 516K rwdata, 936K rodata, 764K init, 179K bss, 6840K reserved, 0K cma-reserved)
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-14-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcd353a3
    • M
      m68k/mm: enable use of generic memory_model.h for !DISCONTIGMEM · 4bfc848e
      Mike Rapoport 提交于
      The pg_data_map and pg_data_table arrays as well as page_to_pfn() and
      pfn_to_page() are required only for DISCONTIGMEM. Other memory models can
      use the generic definitions in asm-generic/memory_model.h.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-13-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4bfc848e
    • M
      m68k/mm: make node data and node setup depend on CONFIG_DISCONTIGMEM · 6b2ad8d7
      Mike Rapoport 提交于
      The pg_data_t node structures and their initialization currently depends on
      !CONFIG_SINGLE_MEMORY_CHUNK. Since they are required only for DISCONTIGMEM
      make this dependency explicit and replace usage of
      CONFIG_SINGLE_MEMORY_CHUNK with CONFIG_DISCONTIGMEM where appropriate.
      
      The CONFIG_SINGLE_MEMORY_CHUNK was implicitly disabled on the ColdFire MMU
      variant, although it always presumed a single memory bank. As there is no
      actual need for DISCONTIGMEM in this case, make sure that ColdFire MMU
      systems set CONFIG_SINGLE_MEMORY_CHUNK to 'y'.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-12-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2ad8d7
    • M
      arc: use FLATMEM with freeing of unused memory map instead of DISCONTIGMEM · 050b2da2
      Mike Rapoport 提交于
      Currently ARC uses DISCONTIGMEM to cope with sparse physical memory address
      space on systems with 2 memory banks. While DISCONTIGMEM avoids wasting
      memory on unpopulated memory map, it adds both memory and CPU overhead
      relatively to FLATMEM. Moreover, DISCONTINGMEM is generally considered
      deprecated.
      
      The obvious replacement for DISCONTIGMEM would be SPARSEMEM, but it is also
      less efficient than FLATMEM in pfn_to_page() and page_to_pfn() conversions.
      Besides it requires tuning of SECTION_SIZE which is not trivial for
      possible ARC memory configuration.
      
      Since the memory map for both banks is always allocated from the "lowmem"
      bank, it is possible to use FLATMEM for two-bank configuration and simply
      free the unused hole in the memory map. All is required for that is to
      provide ARC-specific pfn_valid() that will take into account actual
      physical memory configuration and define HAVE_ARCH_PFN_VALID.
      
      The resulting kernel image configured with defconfig + HIGHMEM=y is
      smaller:
      
        $ size a/vmlinux b/vmlinux
           text    data     bss     dec     hex filename
        4673503 1245456  279756 6198715  5e95bb a/vmlinux
        4658706 1246864  279756 6185326  5e616e b/vmlinux
      
        $ ./scripts/bloat-o-meter a/vmlinux b/vmlinux
        add/remove: 28/30 grow/shrink: 42/399 up/down: 10986/-29025 (-18039)
        ...
        Total: Before=4709315, After = 4691276, chg -0.38%
      
      Booting nSIM with haps_ns.dts results in the following memory usage
      reports:
      
        a:
        Memory: 1559104K/1572864K available (3531K kernel code, 595K rwdata, 752K rodata, 136K init, 275K bss, 13760K reserved, 0K cma-reserved, 1048576K highmem)
      
        b:
        Memory: 1559112K/1572864K available (3519K kernel code, 594K rwdata, 752K rodata, 136K init, 280K bss, 13752K reserved, 0K cma-reserved, 1048576K highmem)
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-11-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      050b2da2
    • M
      arm, arm64: move free_unused_memmap() to generic mm · 4f5b0c17
      Mike Rapoport 提交于
      ARM and ARM64 free unused parts of the memory map just before the
      initialization of the page allocator. To allow holes in the memory map both
      architectures overload pfn_valid() and define HAVE_ARCH_PFN_VALID.
      
      Allowing holes in the memory map for FLATMEM may be useful for small
      machines, such as ARC and m68k and will enable those architectures to cease
      using DISCONTIGMEM and still support more than one memory bank.
      
      Move the functions that free unused memory map to generic mm and enable
      them in case HAVE_ARCH_PFN_VALID=y.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-10-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f5b0c17
    • M
      arm: remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL · 5e545df3
      Mike Rapoport 提交于
      ARM is the only architecture that defines CONFIG_ARCH_HAS_HOLES_MEMORYMODEL
      which in turn enables memmap_valid_within() function that is intended to
      verify existence  of struct page associated with a pfn when there are holes
      in the memory map.
      
      However, the ARCH_HAS_HOLES_MEMORYMODEL also enables HAVE_ARCH_PFN_VALID
      and arch-specific pfn_valid() implementation that also deals with the holes
      in the memory map.
      
      The only two users of memmap_valid_within() call this function after
      a call to pfn_valid() so the memmap_valid_within() check becomes redundant.
      
      Remove CONFIG_ARCH_HAS_HOLES_MEMORYMODEL and memmap_valid_within() and rely
      entirely on ARM's implementation of pfn_valid() that is now enabled
      unconditionally.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-9-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e545df3
    • M
      ia64: make SPARSEMEM default and disable DISCONTIGMEM · 214496cb
      Mike Rapoport 提交于
      SPARSEMEM memory model suitable for systems with large holes in their
      phyiscal memory layout. With SPARSEMEM_VMEMMAP enabled it provides
      pfn_to_page() and page_to_pfn() as fast as FLATMEM.
      
      Make it the default memory model for IA-64 and disable DISCONTIGMEM which
      is considered obsolete for quite some time.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-8-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      214496cb
    • M
      ia64: forbid using VIRTUAL_MEM_MAP with FLATMEM · ea34f78f
      Mike Rapoport 提交于
      Virtual memory map was intended to avoid wasting memory on the memory map
      on systems with large holes in the physical memory layout. Long ago it been
      superseded first by DISCONTIGMEM and then by SPARSEMEM. Moreover,
      SPARSEMEM_VMEMMAP provide the same functionality in much more portable way.
      
      As the first step to removing the VIRTUAL_MEM_MAP forbid it's usage with
      FLATMEM and panic on systems with large holes in the physical memory
      layout that try to run FLATMEM kernels.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-7-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ea34f78f
    • M
      ia64: split virtual map initialization out of paging_init() · 1f112129
      Mike Rapoport 提交于
      For both FLATMEM and DISCONTIGMEM/SPARSEMEM the virtual map initialization
      is spread over paging_init() for no good reason.
      
      Split out the bits related to virtual map initialization to a helper
      functions, one for FLATMEM and another for !FLATMEM configurations.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-6-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f112129
    • M
      ia64: discontig: paging_init(): remove local max_pfn calculation · b90b5547
      Mike Rapoport 提交于
      The maximal PFN in the system is calculated during find_memory() time and
      it is stored at max_low_pfn then.
      
      Use this value in paging_init() and remove the redundant detection of
      max_pfn in that function.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-5-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b90b5547
    • M
      ia64: remove 'ifdef CONFIG_ZONE_DMA32' statements · 5d37fc0b
      Mike Rapoport 提交于
      After the removal of SN2 platform (commit cf07cb1f ("ia64: remove
      support for the SGI SN2 platform") IA-64 always has ZONE_DMA32 and there is
      no point to guard code with this configuration option.
      
      Remove ifdefery associated with CONFIG_ZONE_DMA32
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-4-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d37fc0b
    • M
      ia64: remove custom __early_pfn_to_nid() · 03e92a5e
      Mike Rapoport 提交于
      The ia64 implementation of __early_pfn_to_nid() essentially relies on the
      same data as the generic implementation.
      
      The correspondence between memory ranges and nodes is set in memblock
      during early memory initialization in register_active_ranges() function.
      
      The initialization of sparsemem that requires early_pfn_to_nid() happens
      later and it can use the memblock information like the other architectures.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-3-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      03e92a5e
    • M
      alpha: switch from DISCONTIGMEM to SPARSEMEM · 36d40290
      Mike Rapoport 提交于
      Patch series "arch, mm: deprecate DISCONTIGMEM", v2.
      
      It's been a while since DISCONTIGMEM is generally considered deprecated,
      but it is still used by four architectures.  This set replaces
      DISCONTIGMEM with a different way to handle holes in the memory map and
      marks DISCONTIGMEM configuration as BROKEN in Kconfigs of these
      architectures with the intention to completely remove it in several
      releases.
      
      While for 64-bit alpha and ia64 the switch to SPARSEMEM is quite obvious
      and was a matter of moving some bits around, for smaller 32-bit arc and
      m68k SPARSEMEM is not necessarily the best thing to do.
      
      On 32-bit machines SPARSEMEM would require large sections to make section
      index fit in the page flags, but larger sections mean that more memory is
      wasted for unused memory map.
      
      Besides, pfn_to_page() and page_to_pfn() become less efficient, at least
      on arc.
      
      So I've decided to generalize arm's approach for freeing of unused parts
      of the memory map with FLATMEM and enable it for both arc and m68k.  The
      details are in the description of patches 10 (arc) and 13 (m68k).
      
      This patch (of 13):
      
      Enable SPARSEMEM support on alpha and deprecate DISCONTIGMEM.
      
      The required changes are mostly around moving duplicated definitions of
      page access and address conversion macros to a common place and making sure
      they are available for all memory models.
      
      The DISCONTINGMEM support is marked as BROKEN an will be removed in a
      couple of releases.
      
      Link: https://lkml.kernel.org/r/20201101170454.9567-1-rppt@kernel.org
      Link: https://lkml.kernel.org/r/20201101170454.9567-2-rppt@kernel.orgSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greg Ungerer <gerg@linux-m68k.org>
      Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Meelis Roos <mroos@linux.ee>
      Cc: Michael Schmitz <schmitzmic@gmail.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      36d40290
    • M
      lkdtm: disable KASAN for rodata.o · 6d5a88cd
      Marco Elver 提交于
      Building lkdtm with KASAN and Clang 11 or later results in the following
      error when attempting to load the module:
      
        kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
        BUG: unable to handle page fault for address: ffffffffc019cd70
        #PF: supervisor instruction fetch in kernel mode
        #PF: error_code(0x0011) - permissions violation
        ...
        RIP: 0010:asan.module_ctor+0x0/0xffffffffffffa290 [lkdtm]
        ...
        Call Trace:
         do_init_module+0x17c/0x570
         load_module+0xadee/0xd0b0
         __x64_sys_finit_module+0x16c/0x1a0
         do_syscall_64+0x34/0x50
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that rodata.o generates a dummy function that lives in
      .rodata to validate that .rodata can't be executed; however, Clang 11 adds
      KASAN globals support by generating module constructors to initialize
      globals redzones.  When Clang 11 adds a module constructor to rodata.o, it
      is also added to .rodata: any attempt to call it on initialization results
      in the above error.
      
      Therefore, disable KASAN instrumentation for rodata.o.
      
      Link: https://lkml.kernel.org/r/20201214191413.3164796-1-elver@google.comSigned-off-by: NMarco Elver <elver@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d5a88cd
    • W
      kasan: update documentation for generic kasan · 4784be28
      Walter Wu 提交于
      Generic KASAN also supports to record the last two workqueue stacks and
      print them in KASAN report.  So that need to update documentation.
      
      Link: https://lkml.kernel.org/r/20201203023037.30792-1-walter-zh.wu@mediatek.comSigned-off-by: NWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: NMarco Elver <elver@google.com>
      Acked-by: NMarco Elver <elver@google.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4784be28
    • W
      lib/test_kasan.c: add workqueue test case · 214c783d
      Walter Wu 提交于
      Adds a test to verify workqueue stack recording and print it in
      KASAN report.
      
      The KASAN report was as follows(cleaned up slightly):
      
       BUG: KASAN: use-after-free in kasan_workqueue_uaf
      
       Freed by task 54:
        kasan_save_stack+0x24/0x50
        kasan_set_track+0x24/0x38
        kasan_set_free_info+0x20/0x40
        __kasan_slab_free+0x10c/0x170
        kasan_slab_free+0x10/0x18
        kfree+0x98/0x270
        kasan_workqueue_work+0xc/0x18
      
       Last potentially related work creation:
        kasan_save_stack+0x24/0x50
        kasan_record_wq_stack+0xa8/0xb8
        insert_work+0x48/0x288
        __queue_work+0x3e8/0xc40
        queue_work_on+0xf4/0x118
        kasan_workqueue_uaf+0xfc/0x190
      
      Link: https://lkml.kernel.org/r/20201203022748.30681-1-walter-zh.wu@mediatek.comSigned-off-by: NWalter Wu <walter-zh.wu@mediatek.com>
      Acked-by: NMarco Elver <elver@google.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      214c783d
    • W
      kasan: print workqueue stack · ef133461
      Walter Wu 提交于
      The aux_stack[2] is reused to record the call_rcu() call stack and
      enqueuing work call stacks.  So that we need to change the auxiliary stack
      title for common title, print them in KASAN report.
      
      Link: https://lkml.kernel.org/r/20201203022715.30635-1-walter-zh.wu@mediatek.comSigned-off-by: NWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: NMarco Elver <elver@google.com>
      Acked-by: NMarco Elver <elver@google.com>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef133461
    • W
      workqueue: kasan: record workqueue stack · e89a85d6
      Walter Wu 提交于
      Patch series "kasan: add workqueue stack for generic KASAN", v5.
      
      Syzbot reports many UAF issues for workqueue, see [1].
      
      In some of these access/allocation happened in process_one_work(), we
      see the free stack is useless in KASAN report, it doesn't help
      programmers to solve UAF for workqueue issue.
      
      This patchset improves KASAN reports by making them to have workqueue
      queueing stack.  It is useful for programmers to solve use-after-free or
      double-free memory issue.
      
      Generic KASAN also records the last two workqueue stacks and prints them
      in KASAN report.  It is only suitable for generic KASAN.
      
      [1] https://groups.google.com/g/syzkaller-bugs/search?q=%22use-after-free%22+process_one_work
      [2] https://bugzilla.kernel.org/show_bug.cgi?id=198437
      
      This patch (of 4):
      
      When analyzing use-after-free or double-free issue, recording the
      enqueuing work stacks is helpful to preserve usage history which
      potentially gives a hint about the affected code.
      
      For workqueue it has turned out to be useful to record the enqueuing work
      call stacks.  Because user can see KASAN report to determine whether it is
      root cause.  They don't need to enable debugobjects, but they have a
      chance to find out the root cause.
      
      Link: https://lkml.kernel.org/r/20201203022148.29754-1-walter-zh.wu@mediatek.com
      Link: https://lkml.kernel.org/r/20201203022442.30006-1-walter-zh.wu@mediatek.comSigned-off-by: NWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: NMarco Elver <elver@google.com>
      Acked-by: NMarco Elver <elver@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NAndrey Konovalov <andreyknvl@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Marco Elver <elver@google.com>
      Cc: Matthias Brugger <matthias.bgg@gmail.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e89a85d6
    • V
      mm/vmalloc.c: fix kasan shadow poisoning size · c041098c
      Vincenzo Frascino 提交于
      The size of vm area can be affected by the presence or not of the guard
      page.  In particular when VM_NO_GUARD is present, the actual accessible
      size has to be considered like the real size minus the guard page.
      
      Currently kasan does not keep into account this information during the
      poison operation and in particular tries to poison the guard page as well.
      
      This approach, even if incorrect, does not cause an issue because the tags
      for the guard page are written in the shadow memory.  With the future
      introduction of the Tag-Based KASAN, being the guard page inaccessible by
      nature, the write tag operation on this page triggers a fault.
      
      Fix kasan shadow poisoning size invoking get_vm_area_size() instead of
      accessing directly the field in the data structure to detect the correct
      value.
      
      Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
      Fixes: d98c9e83 ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
      Signed-off-by: NVincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c041098c
    • A
      docs/vm: remove unused 3 items explanation for /proc/vmstat · 56db19fe
      Alex Shi 提交于
      Commit 5647bc29 ("mm: compaction: Move migration fail/success
      stats to migrate.c"), removed 3 items in /proc/vmstat. but the docs
      still has their explanation. let's remove them.
      
      "compact_blocks_moved",
      "compact_pages_moved",
      "compact_pagemigrate_failed",
      
      Link: https://lkml.kernel.org/r/1605520282-51993-1-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: NZi Yan <ziy@nvidia.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56db19fe
    • W
      mm/vmalloc: Fix unlock order in s_stop() · 0a7dd4e9
      Waiman Long 提交于
      When multiple locks are acquired, they should be released in reverse
      order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
      case.
      
        s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
        s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);
      
      This unlock sequence, though allowed, is not optimal. If a waiter is
      present, mutex_unlock() will need to go through the slowpath of waking
      up the waiter with preemption disabled. Fix that by releasing the
      spinlock first before the mutex.
      
      Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
      Fixes: e36176be ("mm/vmalloc: rework vmap_area_lock")
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Reviewed-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a7dd4e9
    • B
    • A
      mm/vmalloc: add 'align' parameter explanation for pvm_determine_end_from_reverse · 799fa85d
      Alex Shi 提交于
      Kernel-doc markup has a issue on pvm_determine_end_from_reverse:
      
        mm/vmalloc.c:3145: warning: Function parameter or member 'align' not described in 'pvm_determine_end_from_reverse'
      
      Add a explanation for it to remove the warning.
      
      Link: https://lkml.kernel.org/r/1605605088-30668-3-git-send-email-alex.shi@linux.alibaba.comSigned-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      799fa85d
    • U
      mm/vmalloc: rework the drain logic · 96e2db45
      Uladzislau Rezki (Sony) 提交于
      A current "lazy drain" model suffers from at least two issues.
      
      First one is related to the unsorted list of vmap areas, thus in order to
      identify the [min:max] range of areas to be drained, it requires a full
      list scan.  What is a time consuming if the list is too long.
      
      Second one and as a next step is about merging all fragments with a free
      space.  What is also a time consuming because it has to iterate over
      entire list which holds outstanding lazy areas.
      
      See below the "preemptirqsoff" tracer that illustrates a high latency.  It
      is ~24676us.  Our workloads like audio and video are effected by such long
      latency:
      
      <snip>
        tracer: preemptirqsoff
      
        preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
        --------------------------------------------------------------------
        latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
           -----------------
           | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
           -----------------
         => started at: __purge_vmap_area_lazy
         => ended at:   __purge_vmap_area_lazy
      
                         _------=> CPU#
                        / _-----=> irqs-off
                       | / _----=> need-resched
                       || / _---=> hardirq/softirq
                       ||| / _--=> preempt-depth
                       |||| /     delay
         cmd     pid   ||||| time  |   caller
            \   /      |||||  \    |   /
      crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
      [...]
      crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
      crtc_com-261     1...1 24683us : <stack trace>
       => free_vmap_area_noflush
       => remove_vm_area
       => __vunmap
       => vfree
       => drm_property_free_blob
       => drm_mode_object_unreference
       => drm_property_unreference_blob
       => __drm_atomic_helper_crtc_destroy_state
       => sde_crtc_destroy_state
       => drm_atomic_state_default_clear
       => drm_atomic_state_clear
       => drm_atomic_state_free
       => complete_commit
       => _msm_drm_commit_work_cb
       => kthread_worker_fn
       => kthread
       => ret_from_fork
      <snip>
      
      To address those two issues we can redesign a purging of the outstanding
      lazy areas.  Instead of queuing vmap areas to the list, we replace it by
      the separate rb-tree.  In hat case an area is located in the tree/list in
      ascending order.  It will give us below advantages:
      
      a) Outstanding vmap areas are merged creating bigger coalesced blocks,
         thus it becomes less fragmented.
      
      b) It is possible to calculate a flush range [min:max] without scanning
         all elements.  It is O(1) access time or complexity;
      
      c) The final merge of areas with the rb-tree that represents a free
         space is faster because of (a).  As a result the lock contention is
         also reduced.
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: huang ying <huang.ying.caritas@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96e2db45
    • U
      mm/vmalloc: use free_vm_area() if an allocation fails · 8945a723
      Uladzislau Rezki (Sony) 提交于
      There is a dedicated and separate function that finds and removes a
      continuous kernel virtual area.  As a final step it also releases the
      "area", a descriptor of corresponding vm_struct.
      
      Use free_vmap_area() in the __vmalloc_node_range() instead of open coded
      steps which are exactly the same, to perform a cleanup.
      
      Link: https://lkml.kernel.org/r/20201116220033.1837-1-urezki@gmail.comSigned-off-by: NUladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8945a723
    • A
      mm/vmalloc.c:__vmalloc_area_node(): avoid 32-bit overflow · 34fe6537
      Andrew Morton 提交于
      With a machine with 3 TB (more than 2 TB memory).  If you use vmalloc to
      allocate > 2 TB memory, the array_size below will be overflowed.
      
      The array_size is an unsigned int and can only be used to allocate less
      than 2 TB memory.  If you pass 2*1028*1028*1024*1024 = 2 * 2^40 in the
      argument of vmalloc.  The array_size will become 2*2^31 = 2^32.  The 2^32
      cannot be store with a 32 bit integer.
      
      The fix is to change the type of array_size to unsigned long.
      
      [akpm@linux-foundation.org: rework for current mainline]
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=210023
      Reported-by: <hsinhuiwu@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      34fe6537
    • D
      locking/selftests: add testcases for fs_reclaim · d5037d1d
      Daniel Vetter 提交于
      Since I butchered this I figured better to make sure we have testcases for
      this now.  Since we only have a locking context for __GFP_FS that's the
      only thing we're testing right now.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-4-daniel.vetter@ffwll.chSigned-off-by: NDaniel Vetter <daniel.vetter@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d5037d1d
    • D
      mm: extract might_alloc() debug check · 95d6c701
      Daniel Vetter 提交于
      Extracted from slab.h, which seems to have the most complete version
      including the correct might_sleep() check.  Roll it out to slob.c.
      
      Motivated by a discussion with Paul about possibly changing call_rcu
      behaviour to allocate memory, but only roughly every 500th call.
      
      There are a lot fewer places in the kernel that care about whether
      allocating memory is allowed or not (due to deadlocks with reclaim code)
      than places that care whether sleeping is allowed.  But debugging these
      also tends to be a lot harder, so nice descriptive checks could come in
      handy.  I might have some use eventually for annotations in drivers/gpu.
      
      Note that unlike fs_reclaim_acquire/release gfpflags_allow_blocking does
      not consult the PF_MEMALLOC flags.  But there is no flag equivalent for
      GFP_NOWAIT, hence this check can't go wrong due to
      memalloc_no*_save/restore contexts.  Willy is working on a patch series
      which might change this:
      
      https://lore.kernel.org/linux-mm/20200625113122.7540-7-willy@infradead.org/
      
      I think best would be if that updates gfpflags_allow_blocking(), since
      there's a ton of callers all over the place for that already.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-3-daniel.vetter@ffwll.chSigned-off-by: NDaniel Vetter <daniel.vetter@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NPaul E. McKenney <paulmck@kernel.org>
      Reviewed-by: NJason Gunthorpe <jgg@nvidia.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95d6c701
    • D
      mm: track mmu notifiers in fs_reclaim_acquire/release · f920e413
      Daniel Vetter 提交于
      fs_reclaim_acquire/release nicely catch recursion issues when allocating
      GFP_KERNEL memory against shrinkers (which gpu drivers tend to use to keep
      the excessive caches in check).  For mmu notifier recursions we do have
      lockdep annotations since 23b68395 ("mm/mmu_notifiers: add a lockdep
      map for invalidate_range_start/end").
      
      But these only fire if a path actually results in some pte invalidation -
      for most small allocations that's very rarely the case.  The other trouble
      is that pte invalidation can happen any time when __GFP_RECLAIM is set.
      Which means only really GFP_ATOMIC is a safe choice, GFP_NOIO isn't good
      enough to avoid potential mmu notifier recursion.
      
      I was pondering whether we should just do the general annotation, but
      there's always the risk for false positives.  Plus I'm assuming that the
      core fs and io code is a lot better reviewed and tested than random mmu
      notifier code in drivers.  Hence why I decide to only annotate for that
      specific case.
      
      Furthermore even if we'd create a lockdep map for direct reclaim, we'd
      still need to explicit pull in the mmu notifier map - there's a lot more
      places that do pte invalidation than just direct reclaim, these two
      contexts arent the same.
      
      Note that the mmu notifiers needing their own independent lockdep map is
      also the reason we can't hold them from fs_reclaim_acquire to
      fs_reclaim_release - it would nest with the acquistion in the pte
      invalidation code, causing a lockdep splat.  And we can't remove the
      annotations from pte invalidation and all the other places since they're
      called from many other places than page reclaim.  Hence we can only do the
      equivalent of might_lock, but on the raw lockdep map.
      
      With this we can also remove the lockdep priming added in 66204f1d
      ("mm/mmu_notifiers: prime lockdep") since the new annotations are strictly
      more powerful.
      
      Link: https://lkml.kernel.org/r/20201125162532.1299794-2-daniel.vetter@ffwll.chSigned-off-by: NDaniel Vetter <daniel.vetter@intel.com>
      Reviewed-by: NJason Gunthorpe <jgg@nvidia.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Thomas Hellström (Intel) <thomas_os@shipmail.org>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Paul E. McKenney <paulmck@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f920e413
    • D
      mm: forbid splitting special mappings · 871402e0
      Dmitry Safonov 提交于
      Don't allow splitting of vm_special_mapping's.  It affects vdso/vvar
      areas.  Uprobes have only one page in xol_area so they aren't affected.
      
      Those restrictions were enforced by checks in .mremap() callbacks.
      Restrict resizing with generic .split() callback.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-7-dima@arista.comSigned-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      871402e0
    • D
      mremap: check if it's possible to split original vma · 73d5e062
      Dmitry Safonov 提交于
      If original VMA can't be split at the desired address, do_munmap() will
      fail and leave both new-copied VMA and old VMA.  De-facto it's
      MREMAP_DONTUNMAP behaviour, which is unexpected.
      
      Currently, it may fail such way for hugetlbfs and dax device mappings.
      
      Minimize such unpleasant situations to OOM by checking .may_split() before
      attempting to create a VMA copy.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-6-dima@arista.comSigned-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      73d5e062
    • D
      vm_ops: rename .split() callback to .may_split() · dd3b614f
      Dmitry Safonov 提交于
      Rename the callback to reflect that it's not called *on* or *after* split,
      but rather some time before the splitting to check if it's possible.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-5-dima@arista.comSigned-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dd3b614f
    • D
      mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio · cd544fd1
      Dmitry Safonov 提交于
      As kernel expect to see only one of such mappings, any further operations
      on the VMA-copy may be unexpected by the kernel.  Maybe it's being on the
      safe side, but there doesn't seem to be any expected use-case for this, so
      restrict it now.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-4-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd544fd1
    • D
      mm/mremap: for MREMAP_DONTUNMAP check security_vm_enough_memory_mm() · ad8ee77e
      Dmitry Safonov 提交于
      Currently memory is accounted post-mremap() with MREMAP_DONTUNMAP, which
      may break overcommit policy.  So, check if there's enough memory before
      doing actual VMA copy.
      
      Don't unset VM_ACCOUNT on MREMAP_DONTUNMAP.  By semantics, such mremap()
      is actually a memory allocation.  That also simplifies the error-path a
      little.
      
      Also, as it's memory allocation on success don't reset hiwater_vm value.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-3-dima@arista.com
      Fixes: commit e346b381 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
      Signed-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad8ee77e
    • D
      mm/mremap: account memory on do_munmap() failure · 51df7bcb
      Dmitry Safonov 提交于
      Patch series "mremap: move_vma() fixes".
      
      This patch (of 6):
      
      move_vma() copies VMA without adding it to account, then unmaps old part
      of VMA.  On failure it unmaps the new VMA.  With hacks accounting in
      munmap is disabled as it's a copy of existing VMA.
      
      Account the memory on munmap() failure which was previously copied into
      a new VMA.
      
      Link: https://lkml.kernel.org/r/20201013013416.390574-1-dima@arista.com
      Link: https://lkml.kernel.org/r/20201013013416.390574-2-dima@arista.com
      Fixes: commit e2ea83742133 ("[PATCH] mremap: move_vma fixes and cleanup")
      Signed-off-by: NDmitry Safonov <dima@arista.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51df7bcb
    • M
      mm: move free_unref_page to mm/internal.h · 0966aeb4
      Matthew Wilcox (Oracle) 提交于
      Code outside mm/ should not be calling free_unref_page().  Also move
      free_unref_page_list().
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-2-willy@infradead.orgSigned-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0966aeb4
    • M
      sparc: fix handling of page table constructor failure · 06517c9a
      Matthew Wilcox (Oracle) 提交于
      The page has just been allocated, so its refcount is 1.  free_unref_page()
      is for use on pages which have a zero refcount.  Use __free_page() like
      the other implementations of pte_alloc_one().
      
      Link: https://lkml.kernel.org/r/20201125034655.27687-1-willy@infradead.org
      Fixes: 1ae9ae5f ("sparc: handle pgtable_page_ctor() fail")
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      06517c9a
    • A
      mm: mmap_lock: add tracepoints around lock acquisition · 2b5067a8
      Axel Rasmussen 提交于
      The goal of these tracepoints is to be able to debug lock contention
      issues.  This lock is acquired on most (all?) mmap / munmap / page fault
      operations, so a multi-threaded process which does a lot of these can
      experience significant contention.
      
      We trace just before we start acquisition, when the acquisition returns
      (whether it succeeded or not), and when the lock is released (or
      downgraded).  The events are broken out by lock type (read / write).
      
      The events are also broken out by memcg path.  For container-based
      workloads, users often think of several processes in a memcg as a single
      logical "task", so collecting statistics at this level is useful.
      
      The end goal is to get latency information.  This isn't directly included
      in the trace events.  Instead, users are expected to compute the time
      between "start locking" and "acquire returned", using e.g.  synthetic
      events or BPF.  The benefit we get from this is simpler code.
      
      Because we use tracepoint_enabled() to decide whether or not to trace,
      this patch has effectively no overhead unless tracepoints are enabled at
      runtime.  If tracepoints are enabled, there is a performance impact, but
      how much depends on exactly what e.g.  the BPF program does.
      
      [axelrasmussen@google.com: fix use-after-free race and css ref leak in tracepoints]
        Link: https://lkml.kernel.org/r/20201130233504.3725241-1-axelrasmussen@google.com
      [axelrasmussen@google.com: v3]
        Link: https://lkml.kernel.org/r/20201207213358.573750-1-axelrasmussen@google.com
      [rostedt@goodmis.org: in-depth examples of tracepoint_enabled() usage, and per-cpu-per-context buffer design]
      
      Link: https://lkml.kernel.org/r/20201105211739.568279-2-axelrasmussen@google.comSigned-off-by: NAxel Rasmussen <axelrasmussen@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Chinwen Chang <chinwen.chang@mediatek.com>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Laurent Dufour <ldufour@linux.ibm.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b5067a8