1. 07 November 2021, 40 commits
    • mm: remove HARDENED_USERCOPY_FALLBACK · 53944f17
      Authored by Stephen Kitt
      This has served its purpose and is no longer used.  All usercopy
      violations appear to have been handled by now; any remaining instances
      (or new bugs) will cause copies to be rejected.
      
      This isn't a direct revert of commit 2d891fbc ("usercopy: Allow
      strict enforcement of whitelists"); since usercopy_fallback is
      effectively 0, the fallback handling is removed too.
      
      This also removes the usercopy_fallback module parameter on slab_common.
      
      Link: https://github.com/KSPP/linux/issues/153
      Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org
      Signed-off-by: Stephen Kitt <steve@sk2.org>
      Suggested-by: Kees Cook <keescook@chromium.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Joel Stanley <joel@jms.id.au>	[defconfig change]
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: James Morris <jmorris@namei.org>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/highmem: remove deprecated kmap_atomic · d2c20e51
      Authored by Ira Weiny
      kmap_atomic() is being deprecated in favor of kmap_local_page().
      
      Replace the uses of kmap_atomic() within the highmem code.
      
      Profiling clear_huge_page() with ftrace showed a 62% improvement on
      the setup below.
      
      Setup:-
      The data below was collected on Qualcomm's SM7250 SoC with THP enabled
      (kernel v4.19.113), with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
      online and set to maximum frequency, and DDR set to the perf governor.
      
      FTRACE Data:-
      
      Base data:-
      Number of iterations: 48
      Mean of allocation time: 349.5 us
      std deviation: 74.5 us
      
      v4 data:-
      Number of iterations: 48
      Mean of allocation time: 131 us
      std deviation: 32.7 us
      
      The following simple userspace experiment, which allocates 100MB (BUF_SZ)
      of pages and writes to them, gave us good insight: we observed a 42%
      improvement in allocation and write timings.
      -------------------------------------------------------------
      Test code snippet
      -------------------------------------------------------------
            clock_start();
            buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */
      
            for (i = 0; i < BUF_SZ_PAGES; i++) {
                    *((int *)(buf + (i * PAGE_SIZE))) = 1;
            }
            clock_end();
      -------------------------------------------------------------
      
      Malloc test timings for 100MB anon allocation:-
      
      Base data:-
      Number of iterations: 100
      Mean of allocation time: 31831 us
      std deviation: 4286 us
      
      v4 data:-
      Number of iterations: 100
      Mean of allocation time: 18193 us
      std deviation: 4915 us
      
      [willy@infradead.org: fix zero_user_segments()]
        Link: https://lkml.kernel.org/r/YYVhHCJcm2DM2G9u@casper.infradead.org
      
      Link: https://lkml.kernel.org/r/20210204073255.20769-2-prathu.baronia@oneplus.com
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED · f7892d8e
      Authored by David Hildenbrand
      Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
      indicating that we're dealing with a memory region that is never
      indicated in the firmware-provided memory map, but always detected and
      added by a driver.
      
      Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such
      memory regions like ordinary MEMBLOCK_NONE memory regions -- for
      example, when selecting memory regions to add to the vmcore for dumping
      in the crashkernel via for_each_mem_range().
      
      However, kexec_file in particular is not supposed to select such
      memblocks via for_each_free_mem_range() /
      for_each_free_mem_range_reverse() to place kexec images, similar to how
      we handle IORESOURCE_SYSRAM_DRIVER_MANAGED without
      CONFIG_ARCH_KEEP_MEMBLOCK.
      
      We'll make sure that memory hotplug code sets the flag where applicable
      (IORESOURCE_SYSRAM_DRIVER_MANAGED) next.  This prepares architectures
      that need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem
      support.
      
      Note that kexec *must not* indicate this memory to the second kernel and
      *must not* place kexec-images on this memory.  Let's add a comment to
      kexec_walk_memblock(), documenting how we handle MEMBLOCK_DRIVER_MANAGED
      now just like using IORESOURCE_SYSRAM_DRIVER_MANAGED in
      locate_mem_hole_callback() for kexec_walk_resources().
      
      Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
      semantics:
      	MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
      	firmware-provided memory map and added to the system early during
      	boot; kexec *has to* indicate this memory to the second kernel and
      	can place kexec-images on this memory. After memory hotunplug,
      	kexec has to be re-armed. We mostly ignore this flag when
      	"movable_node" is not set on the kernel command line, because
      	then we're told to not care about hotunpluggability of such
      	memory regions.
      
      	MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
      	the firmware-provided memory map; this memory is always detected
      	and added to the system by a driver; memory might not actually be
      	physically hotunpluggable. kexec *must not* indicate this memory to
      	the second kernel and *must not* place kexec-images on this memory.
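      The placement rule above can be sketched in plain C. This is a hedged
      userspace model: the MB_* flags, struct mb_region, and usable_for_kexec()
      are illustrative stand-ins, not the kernel's memblock API.

```c
#include <stdbool.h>

/* Userspace model of the two flag semantics described above. */
enum mb_flags {
    MB_NONE           = 0,
    MB_HOTPLUG        = 1 << 0, /* in the firmware map; may hotunplug later */
    MB_DRIVER_MANAGED = 1 << 1, /* never in the firmware map; driver-added  */
};

struct mb_region {
    unsigned long base;
    unsigned long size;
    enum mb_flags flags;
};

/* kexec must not place images on driver-managed memory; hotpluggable
 * memory that is in the firmware-provided map remains usable. */
static bool usable_for_kexec(const struct mb_region *r)
{
    return !(r->flags & MB_DRIVER_MANAGED);
}
```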
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: allow to specify flags with memblock_add_node() · 952eea9b
      Authored by David Hildenbrand
      We want to specify flags when hotplugging memory.  Let's prepare to pass
      flags to memblock_add_node() by adjusting all existing users.
      
      Note that when hotplugging memory the system is already up and running
      and we might have concurrent memblock users: for example, while we're
      hotplugging memory, kexec_file code might search for suitable memory
      regions to place kexec images.  It's important to add the memory
      directly to memblock via a single call with the right flags, instead of
      adding the memory first and applying the flags later: otherwise, concurrent
      memblock users might temporarily stumble over memblocks with wrong
      flags, which will be important in a follow-up patch that introduces a
      new flag to properly handle add_memory_driver_managed().
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
      Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
      Acked-by: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shahab Vahedi <shahab@synopsys.com>	[arch/arc]
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: improve MEMBLOCK_HOTPLUG documentation · e14b4155
      Authored by David Hildenbrand
      The description of MEMBLOCK_HOTPLUG is currently short and consequently
      misleading: we're actually dealing with a memory region that might get
      hotunplugged later (i.e., the platform+firmware supports it), yet it is
      indicated in the firmware-provided memory map as system ram that will
      just get used by the system for any purpose when not taking special
      care.  The firmware marked this memory region as hot(un)plugged (e.g.,
      hotplugged before reboot), implying that it might get hotunplugged again
      later.
      
      Whether we consider this information depends on the "movable_node"
      kernel command line parameter: only with "movable_node" set will we try
      to keep this memory hotunpluggable, for example, by not serving early
      allocations from this memory region and by letting the buddy allocator
      manage it via ZONE_MOVABLE.
      
      Let's make this clearer by extending the documentation.
      
      Note: kexec *has to* indicate this memory to the second kernel.  With
      "movable_node" set, we don't want to place kexec-images on this memory.
      Without "movable_node" set, we don't care and can place kexec-images on
      this memory.  In both cases, after successful memory hotunplug, kexec
      has to be re-armed to update the memory map for the second kernel and to
      place the kexec-images somewhere else.
      
      Link: https://lkml.kernel.org/r/20211004093605.5830-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Huacai Chen <chenhuacai@kernel.org>
      Cc: Jianyong Wu <Jianyong.Wu@arm.com>
      Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Shahab Vahedi <shahab@synopsys.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vineet Gupta <vgupta@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove stale function declarations · 43e3aa2a
      Authored by David Hildenbrand
      These functions no longer exist.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-6-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove HIGHMEM leftovers · 6b740c6c
      Authored by David Hildenbrand
      We don't support CONFIG_MEMORY_HOTPLUG on 32 bit, and consequently not
      with HIGHMEM.  Let's remove any leftover code -- including the unused
      "status_change_nid_high" field that was part of the memory notifier.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-5-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE · 50f9481e
      Authored by David Hildenbrand
      CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
      CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
      CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.
      
      Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Shuah Khan <skhan@linuxfoundation.org>	[kselftest]
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Oscar Salvador <osalvador@suse.de>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: migrate: make demotion knob depend on migration · 20f9ba4f
      Authored by Yang Shi
      Memory demotion needs to call migrate_pages() to do its job, and it is
      controlled by a knob; however, the knob doesn't depend on
      CONFIG_MIGRATION.  The knob could be turned on even though MIGRATION is
      disabled.  This will not cause any crash, since migrate_pages() would
      just return -ENOSYS, but it is definitely not optimal to go through the
      demotion path and then retry regular swap every time.
      
      It also doesn't make much sense to have the knob visible to users
      when !MIGRATION.  Move the related code from mempolicy.[h|c] to
      migrate.[h|c].
      
      Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Acked-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/migrate: de-duplicate migrate_reason strings · 8eb42bea
      Authored by John Hubbard
      In order to remove the need to manually keep three different files in
      sync, provide a common definition of the mapping between enum
      migrate_reason and the associated strings for each enum item.
      
      1. Use the tracing system's mapping of enums to strings, by redefining
         and reusing the MIGRATE_REASON and supporting macros, and using that
         to populate the string array in mm/debug.c.
      
      2. Move enum migrate_reason to migrate_mode.h. This is not strictly
         necessary for this patch, but migrate mode and migrate reason go
         together, so this will slightly clarify things.
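      The single-source-of-truth idea above can be sketched with a generic
      X-macro. Note this is an illustration, not the kernel's code: the kernel
      reuses the tracing EM()/EMe() macros, and the enum items below are a
      simplified subset of the real enum migrate_reason.

```c
#include <string.h>

/* One list defines the items; both the enum and the string array are
 * generated from it, so they can never fall out of sync. */
#define MIGRATE_REASON_LIST(X) \
    X(MR_COMPACTION)           \
    X(MR_MEMORY_FAILURE)       \
    X(MR_MEMORY_HOTPLUG)       \
    X(MR_DEMOTION)

#define AS_ENUM(name) name,
#define AS_STR(name)  #name,

enum migrate_reason { MIGRATE_REASON_LIST(AS_ENUM) MR_TYPES };

static const char *migrate_reason_names[MR_TYPES] = {
    MIGRATE_REASON_LIST(AS_STR)
};
```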
      
      Link: https://lkml.kernel.org/r/20210922041755.141817-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Weizhao Ouyang <o451686892@gmail.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlbfs: extend the definition of hugepages parameter to support node allocation · b5389086
      Authored by Zhenguo Yao
      We can specify the number of hugepages to allocate at boot, but the
      hugepages are balanced across all nodes at present.  In some scenarios,
      we only need hugepages in one node.  For example: DPDK needs hugepages
      that are in the same node as the NIC.
      
      If DPDK needs four hugepages of 1G size in node1 and the system has 16
      NUMA nodes, we must reserve 64 hugepages on the kernel cmdline, but only
      four hugepages are used.  The others should be freed after boot.  If the
      system memory is low (for example, 64G), it becomes an impossible task.
      
      So extend the hugepages parameter to support specifying hugepages on a
      specific node.  For example, add the following parameter:
      
        hugepagesz=1G hugepages=0:1,1:3
      
      It will allocate 1 hugepage in node0 and 3 hugepages in node1.
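      A hypothetical userspace parser for this node:count syntax could look
      like the following. The kernel's real parser lives in mm/hugetlb.c;
      parse_node_hugepages() and MAX_NODES here are illustrative only.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_NODES 16 /* illustrative NUMA node limit */

/* Parse a "node:count[,node:count...]" string such as "0:1,1:3" into a
 * per-node hugepage count array. Returns 0 on success, -1 on bad input. */
static int parse_node_hugepages(const char *arg, unsigned long counts[MAX_NODES])
{
    memset(counts, 0, MAX_NODES * sizeof(counts[0]));
    while (*arg) {
        char *end;
        unsigned long node, nr;

        node = strtoul(arg, &end, 10);
        if (end == arg || *end != ':' || node >= MAX_NODES)
            return -1; /* malformed pair or node out of range */
        arg = end + 1;
        nr = strtoul(arg, &end, 10);
        if (end == arg)
            return -1;
        if (*end && *end != ',')
            return -1;
        counts[node] = nr;
        arg = *end ? end + 1 : end;
    }
    return 0;
}
```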
      
      Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
      Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: use memblock_free for freeing virtual pointers · 4421cca0
      Authored by Mike Rapoport
      Rename memblock_free_ptr() to memblock_free() and use memblock_free()
      when freeing a virtual pointer, so that memblock_free() becomes the
      counterpart of memblock_alloc().
      
      The callers are updated with the below semantic patch and manual
      addition of (void *) casting to pointers that are represented by
      unsigned long variables.
      
          @@
          identifier vaddr;
          expression size;
          @@
          (
          - memblock_phys_free(__pa(vaddr), size);
          + memblock_free(vaddr, size);
          |
          - memblock_free_ptr(vaddr, size);
          + memblock_free(vaddr, size);
          )
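      The resulting naming scheme can be modeled in userspace as follows.
      This is a sketch, not the kernel implementation: model_pa()/model_va()
      and the PAGE_OFFSET value are stand-ins for the kernel's __pa()/__va(),
      and the recording globals exist only so the pairing can be exercised.

```c
#include <stdint.h>

#define PAGE_OFFSET 0xc0000000UL /* stand-in for the direct-map offset */

/* Stand-ins for the kernel's __pa()/__va() conversions. */
static uintptr_t model_pa(const void *vaddr) { return (uintptr_t)vaddr - PAGE_OFFSET; }
static void *model_va(uintptr_t paddr) { return (void *)(paddr + PAGE_OFFSET); }

static uintptr_t freed_base;
static unsigned long freed_size;

/* The physical-range variant: what used to be called memblock_free(). */
static void memblock_phys_free(uintptr_t base, unsigned long size)
{
    freed_base = base;
    freed_size = size;
}

/* memblock_free() now takes a virtual pointer, pairing with
 * memblock_alloc(), and is equivalent to memblock_phys_free(__pa(ptr)). */
static void memblock_free(void *ptr, unsigned long size)
{
    memblock_phys_free(model_pa(ptr), size);
}
```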
      
      [sfr@canb.auug.org.au: fixup]
        Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: rename memblock_free to memblock_phys_free · 3ecc6834
      Authored by Mike Rapoport
      Since memblock_free() operates on a physical range, make its name
      reflect it and rename it to memblock_phys_free(), so it will be a
      logical counterpart to memblock_phys_alloc().
      
      The callers are updated with the below semantic patch:
      
          @@
          expression addr;
          expression size;
          @@
          - memblock_free(addr, size);
          + memblock_phys_free(addr, size);
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: stop aliasing __memblock_free_late with memblock_free_late · 621d9739
      Authored by Mike Rapoport
      memblock_free_late() is a NOP wrapper for __memblock_free_late(); there
      is no point in keeping this indirection.
      
      Drop the wrapper and rename __memblock_free_late() to
      memblock_free_late().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-5-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memblock: drop memblock_free_early_nid() and memblock_free_early() · fa277171
      Authored by Mike Rapoport
      memblock_free_early_nid() is unused and memblock_free_early() is an
      alias for memblock_free().
      
      Replace calls to memblock_free_early() with calls to memblock_free() and
      remove memblock_free_early() and memblock_free_early_nid().
      
      Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmpressure: fix data-race with memcg->socket_pressure · 7e6ec49c
      Authored by Yuanzheng Song
      When reading memcg->socket_pressure in mem_cgroup_under_socket_pressure()
      and writing memcg->socket_pressure in vmpressure() at the same time, the
      following data-race occurs:
      
        BUG: KCSAN: data-race in __sk_mem_reduce_allocated / vmpressure
      
        write to 0xffff8881286f4938 of 8 bytes by task 24550 on cpu 3:
         vmpressure+0x218/0x230 mm/vmpressure.c:307
         shrink_node_memcgs+0x2b9/0x410 mm/vmscan.c:2658
         shrink_node+0x9d2/0x11d0 mm/vmscan.c:2769
         shrink_zones+0x29f/0x470 mm/vmscan.c:2972
         do_try_to_free_pages+0x193/0x6e0 mm/vmscan.c:3027
         try_to_free_mem_cgroup_pages+0x1c0/0x3f0 mm/vmscan.c:3345
         reclaim_high mm/memcontrol.c:2440 [inline]
         mem_cgroup_handle_over_high+0x18b/0x4d0 mm/memcontrol.c:2624
         tracehook_notify_resume include/linux/tracehook.h:197 [inline]
         exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
         exit_to_user_mode_prepare+0x110/0x170 kernel/entry/common.c:191
         syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
         ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:289
      
        read to 0xffff8881286f4938 of 8 bytes by interrupt on cpu 1:
         mem_cgroup_under_socket_pressure include/linux/memcontrol.h:1483 [inline]
         sk_under_memory_pressure include/net/sock.h:1314 [inline]
         __sk_mem_reduce_allocated+0x1d2/0x270 net/core/sock.c:2696
         __sk_mem_reclaim+0x44/0x50 net/core/sock.c:2711
         sk_mem_reclaim include/net/sock.h:1490 [inline]
         ......
         net_rx_action+0x17a/0x480 net/core/dev.c:6864
         __do_softirq+0x12c/0x2af kernel/softirq.c:298
         run_ksoftirqd+0x13/0x20 kernel/softirq.c:653
         smpboot_thread_fn+0x33f/0x510 kernel/smpboot.c:165
         kthread+0x1fc/0x220 kernel/kthread.c:292
         ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
      
      Fix it by using READ_ONCE() and WRITE_ONCE() to read and write
      memcg->socket_pressure.
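      A userspace sketch of the fix follows: all reads and writes of the field
      go through READ_ONCE()/WRITE_ONCE(), modeled here with volatile casts,
      which matches the kernel macros' behaviour for word-sized fields. The
      struct and function names are illustrative, not the kernel's.

```c
#include <stdbool.h>

#define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

struct memcg_model {
    unsigned long socket_pressure; /* time until which pressure holds */
};

/* Writer side, as in vmpressure(). */
static void set_socket_pressure(struct memcg_model *memcg, unsigned long until)
{
    WRITE_ONCE(memcg->socket_pressure, until);
}

/* Reader side, as in mem_cgroup_under_socket_pressure(); may run
 * concurrently in softirq context. */
static bool under_socket_pressure(struct memcg_model *memcg, unsigned long now)
{
    return READ_ONCE(memcg->socket_pressure) > now;
}
```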
      
      Link: https://lkml.kernel.org/r/20211025082843.671690-1-songyuanzheng@huawei.com
      Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Alex Shi <alexs@kernel.org>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim when no progress is being made · 69392a40
      Authored by Mel Gorman
      Memcg reclaim throttles on congestion if no reclaim progress is made.
      This makes little sense: the lack of progress might be due to writeback
      or a host of other factors.
      
      For !memcg reclaim, it's messy.  Direct reclaim primarily is throttled
      in the page allocator if it is failing to make progress.  Kswapd
      throttles if too many pages are under writeback and marked for immediate
      reclaim.
      
      This patch explicitly throttles if reclaim is failing to make progress.
      
      [vbabka@suse.cz: Remove redundant code]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim and compaction when too many pages are isolated · d818fca1
      Authored by Mel Gorman
      Page reclaim throttles on congestion if too many parallel reclaim
      instances have isolated too many pages.  This makes no sense: excessive
      parallelisation has nothing to do with writeback or congestion.
      
      This patch creates an additional workqueue to sleep on when too many
      pages are isolated.  The throttled tasks are woken when the number of
      isolated pages is reduced or a timeout occurs.  There may be some false
      positive wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will
      throttle again if necessary.
      
      [shy828301@gmail.com: Wake up from compaction context]
      [vbabka@suse.cz: Account number of throttled tasks only for writeback]
      
      Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan: throttle reclaim until some writeback completes if congested · 8cd7c588
      Authored by Mel Gorman
      Patch series "Remove dependency on congestion_wait in mm/", v5.
      
      This series removes all calls to congestion_wait() in mm/ and deletes
      wait_iff_congested().  It's not a clever implementation, but
      congestion_wait() has been broken for a long time [1].
      
      Even if congestion throttling worked, it was never a great idea.  While
      excessive dirty/writeback pages at the tail of the LRU is one reason
      reclaim may be slow, there is also the problem of reclaim failing for
      other reasons (elevated references, too many pages isolated, excessive
      LRU contention, etc.).
      
      This series replaces the "congestion" throttling with 3 different types.
      
       - If there are too many dirty/writeback pages, sleep until a timeout or
         enough pages get cleaned
      
       - If too many pages are isolated, sleep until enough isolated pages are
         either reclaimed or put back on the LRU
      
       - If no progress is being made, direct reclaim tasks sleep until
         another task makes progress with acceptable efficiency.
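      The three throttle types above could be represented as an enum with a
      per-type timeout. This is an illustrative sketch with made-up timeout
      values and names, not the series' actual constants.

```c
/* The three replacement throttle reasons described above. */
enum reclaim_throttle_state {
    THROTTLE_WRITEBACK,  /* too many dirty/writeback pages */
    THROTTLE_ISOLATED,   /* too many pages isolated */
    THROTTLE_NOPROGRESS, /* reclaim is not making progress */
    NR_THROTTLE_STATES,
};

/* Illustrative timeouts only; the series tunes these per reason. */
static const int throttle_timeout_ms[NR_THROTTLE_STATES] = {
    [THROTTLE_WRITEBACK]  = 100,
    [THROTTLE_ISOLATED]   = 100,
    [THROTTLE_NOPROGRESS] = 500,
};
```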
      
      This was initially tested with a mix of workloads that used to trigger
      corner cases that no longer work.  A new test case was created called
      "stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
      created XFS filesystem.  Note that it may be necessary to increase the
      ssh timeout if executing remotely, as ssh itself can get throttled and
      the connection may time out.
      
stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increases.  It
has four types of worker.
      
       - One "anon latency" worker creates small mappings with mmap() and
         times how long it takes to fault the mapping reading it 4K at a time
      
 - X file writers: fio randomly writing X files where the total size
   of the files adds up to the allowed dirty_ratio. fio is allowed to
   run for a warmup period to allow some file-backed pages to
   accumulate. The duration of the warmup is based on the best-case
   linear write speed of the storage.

 - Y file readers: fio randomly reading small files

 - Z anon memory hogs which continually map (100-dirty_ratio)% of memory

 - Total estimated WSS = (100+dirty_ratio)% of memory
      
      X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4
      
      The intent is to maximise the total WSS with a mix of file and anon
      memory where some anonymous memory must be swapped and there is a high
      likelihood of dirty/writeback pages reaching the end of the LRU.
      
      The test can be configured to have no background readers to stress
      dirty/writeback pages.  The results below are based on having zero
      readers.
      
      The short summary of the results is that the series works and stalls
      until some event occurs but the timeouts may need adjustment.
      
      The test results are not broken down by patch as the series should be
      treated as one block that replaces a broken throttling mechanism with a
      working one.
      
      Finally, three machines were tested but I'm reporting the worst set of
      results.  The other two machines had much better latencies for example.
      
First, the "anon latency" worker results
      
        stutterp
                                      5.15.0-rc1             5.15.0-rc1
                                         vanilla mm-reclaimcongest-v5r4
        Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
        Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
        Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
        Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
        Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
        Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
        Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
        Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
        Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
        Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
        Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
        Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
        Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
        Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
        Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
        Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
        Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
        Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
        Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
        Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
        Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
        Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
        Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
        Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)
      
For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series this was lower, but a large number of
throttling events were reaching their timeout, increasing the amount of
inefficient scanning of the LRU.  There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation
versus progress of reclaim.  The variance is also impacted for high
worker counts but in all cases, the differences in latency are not
statistically significant due to very large maximum outliers.  Max-90
shows that 90% of the stalls are comparable but the Max results show
the massive outliers which are increased due to stalling.
      
It is expected that this will be very machine dependent.  Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not.  The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger.  The warmup period calculation is not ideal as it's based
on linear writes whereas fio is randomly writing multiple files from
multiple tasks, so the start state of the test is variable.  For
example, these are the latencies on a single-socket machine that had
more memory
      
        Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
        Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
        Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
        Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
        Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)
      
      The overall system CPU usage and elapsed time is as follows
      
                          5.15.0-rc3  5.15.0-rc3
                             vanilla mm-reclaimcongest-v5r4
        Duration User        6989.03      983.42
        Duration System      7308.12      799.68
        Duration Elapsed     2277.67     2092.98
      
The patches reduce system CPU usage by 89%; the vanilla kernel rarely
stalls and instead burns CPU scanning constantly.
      
      The high-level /proc/vmstats show
      
                                             5.15.0-rc1     5.15.0-rc1
                                                vanilla mm-reclaimcongest-v5r2
        Ops Direct pages scanned          1056608451.00   503594991.00
        Ops Kswapd pages scanned           109795048.00   147289810.00
        Ops Kswapd pages reclaimed          63269243.00    31036005.00
        Ops Direct pages reclaimed          10803973.00     6328887.00
        Ops Kswapd efficiency %                   57.62          21.07
        Ops Kswapd velocity                    48204.98       57572.86
        Ops Direct efficiency %                    1.02           1.26
        Ops Direct velocity                   463898.83      196845.97
      
Kswapd scanned fewer pages but the detailed pattern is different.  The
vanilla kernel scans slowly over time whereas the patched kernel
exhibits bursts of scan activity.  Direct reclaim scanning is reduced
by 52% due to stalling.
      
The pattern for stealing pages is also slightly different.  Both
kernels exhibit spikes, but the vanilla kernel never throttles and so
steadily finds some pages over time, whereas the patched kernel
throttles and then reclaims in spikes.
      
        Ops Percentage direct scans               90.59          77.37
      
With vanilla, direct reclaim accounted for 90.59% of pages scanned
versus 77.37% with the patches, due to throttling.
      
        Ops Page writes by reclaim           2613590.00     1687131.00
      
      Page writes from reclaim context are reduced.
      
        Ops Page writes anon                 2932752.00     1917048.00
      
      And there is less swapping.
      
        Ops Page reclaim immediate         996248528.00   107664764.00
      
      The number of pages encountered at the tail of the LRU tagged for
      immediate reclaim but still dirty/writeback is reduced by 89%.
      
        Ops Slabs scanned                     164284.00      153608.00
      
      Slab scan activity is similar.
      
      ftrace was used to gather stall activity
      
        Vanilla
        -------
            1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
            2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
            8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
           29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
        82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0
      
The vast majority of wait_iff_congested calls do not stall at all.
What is likely happening is that cond_resched() reschedules the task
for a short period when the BDI is not registering congestion (which it
never will in this test setup).
      
            1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
            2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
            4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
          380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
          778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000
      
congestion_wait, when called, always sleeps for at least the full
timeout as there is no trigger to wake it up.
      
      Bottom line: Vanilla will throttle but it's not effective.
      
      Patch series
      ------------
      
      Kswapd throttle activity was always due to scanning pages tagged for
      immediate reclaim at the tail of the LRU
      
            1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
           94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
          112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
      
The majority of events did not stall or stalled only briefly.  Roughly
16% of stalls slept for the full timeout.  For direct reclaim, the
number of stalls for each reason was
      
         6624 reason=VMSCAN_THROTTLE_ISOLATED
        93246 reason=VMSCAN_THROTTLE_NOPROGRESS
        96934 reason=VMSCAN_THROTTLE_WRITEBACK
      
The most common reason to stall was excessive pages tagged for
immediate reclaim at the tail of the LRU, followed by a failure to make
forward progress.  A relatively small number were due to too many pages
being isolated from the LRU by parallel threads.
      
      For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
      
            9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
           12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
           83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
         6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED
      
      Most did not stall at all.  A small number reached the timeout.
      
      For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over
      the map
      
            1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
            1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
            2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
            3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
            4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
            5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
            6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
            7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
            8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
            9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
           10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
           11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
           12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
           13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
           14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
           16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
           17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
           18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
           20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
           21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
           23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
           25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
           26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
           27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
           28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
           29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
           30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
           31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
           32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
           33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
           35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
           36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
           37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
           38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
           40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
           43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
           55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
           56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
           58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
           59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
           61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
           71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
           79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
           82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
           85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
           88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
           90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
           94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
          118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
          119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
          126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
          146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
          148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
          159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
          178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
          183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
          237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
          266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
          313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
          347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
          470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
          559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
          964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
         2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
         7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
        22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
        51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS
      
      The full timeout is often hit but a large number also do not stall at
      all.  The remainder slept a little allowing other reclaim tasks to make
      progress.
      
      While this timeout could be further increased, it could also negatively
      impact worst-case behaviour when there is no prioritisation of what task
      should make progress.
      
      For VMSCAN_THROTTLE_WRITEBACK, the breakdown was
      
            1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
            2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
            3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
            5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
            6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
            7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
           11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
           12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
           16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
           24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
           28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
           30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
           32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
           42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
           77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
           99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
          137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
          190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
          339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
          518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
          852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
         3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
         7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
        83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
      
The majority hit the timeout in direct reclaim context although a
sizable number did not stall at all.  This is very different from
kswapd, where only a tiny percentage of writeback stalls reached the
timeout.
      
      Bottom line, the throttling appears to work and the wakeup events may
      limit worst case stalls.  There might be some grounds for adjusting
      timeouts but it's likely futile as the worst-case scenarios depend on
      the workload, memory size and the speed of the storage.  A better
      approach to improve the series further would be to prioritise tasks
      based on their rate of allocation with the caveat that it may be very
      expensive to track.
      
      This patch (of 5):
      
      Page reclaim throttles on wait_iff_congested under the following
      conditions:
      
       - kswapd is encountering pages under writeback and marked for immediate
         reclaim implying that pages are cycling through the LRU faster than
         pages can be cleaned.
      
       - Direct reclaim will stall if all dirty pages are backed by congested
         inodes.
      
wait_iff_congested is almost completely broken, with few exceptions.
This patch adds a new node-based workqueue and tracks the number of
throttled tasks and pages written back since throttling started.  If
enough pages belonging to the node are written back then the throttled
tasks will wake early.  If not, the throttled tasks sleep until the
timeout expires.
      
      [neilb@suse.de: Uninterruptible sleep and simpler wakeups]
      [hdanton@sina.com: Avoid race when reclaim starts]
      [vbabka@suse.cz: vmstat irq-safe api, clarifications]
      
      Link: https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/ [1]
      Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: NeilBrown <neilb@suse.de>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8cd7c588
    • M
      mm, hugepages: add mremap() support for hugepage backed vma · 550a7d60
Committed by Mina Almasry
Support mremap() for hugepage-backed VMAs by simply repositioning the
page table entries to the new virtual address on mremap().
      
      Hugetlb mremap() support is of course generic; my motivating use case is
      a library (hugepage_text), which reloads the ELF text of executables in
      hugepages.  This significantly increases the execution performance of
      said executables.
      
Restrict the mremap operation on hugepages to at most the size of the
original mapping, as the underlying hugetlb reservation is not yet
capable of handling remapping to a larger size.
      
During the mremap() operation we detect pmd_share()'d mappings and
unshare them.  On subsequent access and fault, the sharing is
established again.
      
Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Ken Chen <kenchen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      550a7d60
    • L
      mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged · bd3400ea
Committed by Liangcai Fan
When transparent huge pages are initialized, min_free_kbytes is
calculated according to what khugepaged expects.

So when transparent huge pages are disabled, min_free_kbytes should be
recalculated instead of being left at the higher value set by
khugepaged.
      
Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd3400ea
    • M
      mm/cma: add cma_pages_valid to determine if pages are in CMA · 9871e2de
Committed by Mike Kravetz
      Add new interface cma_pages_valid() which indicates if the specified
      pages are part of a CMA region.  This interface will be used in a
      subsequent patch by hugetlb code.
      
In order to keep the same amount of DEBUG information, a pr_debug()
call was added to cma_pages_valid().  In the case where the page passed
to cma_release() is not in a CMA region, the debug message is printed
from cma_pages_valid() rather than cma_release().
      
Link: https://lkml.kernel.org/r/20211007181918.136982-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9871e2de
    • M
      hugetlb: add demote hugetlb page sysfs interfaces · 79dfc695
Committed by Mike Kravetz
      Patch series "hugetlb: add demote/split page functionality", v4.
      
      The concurrent use of multiple hugetlb page sizes on a single system is
      becoming more common.  One of the reasons is better TLB support for
      gigantic page sizes on x86 hardware.  In addition, hugetlb pages are
      being used to back VMs in hosting environments.
      
      When using hugetlb pages to back VMs, it is often desirable to
      preallocate hugetlb pools.  This avoids the delay and uncertainty of
      allocating hugetlb pages at VM startup.  In addition, preallocating huge
      pages minimizes the issue of memory fragmentation that increases the
      longer the system is up and running.
      
      In such environments, a combination of larger and smaller hugetlb pages
      are preallocated in anticipation of backing VMs of various sizes.  Over
      time, the preallocated pool of smaller hugetlb pages may become depleted
      while larger hugetlb pages still remain.  In such situations, it is
      desirable to convert larger hugetlb pages to smaller hugetlb pages.
      
      Converting larger to smaller hugetlb pages can be accomplished today
      by first freeing the larger pages to the buddy allocator and then
      allocating the smaller pages.  For example, to convert fifty 1GB
      pages on x86:
      
        gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
        m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
        echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
        echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages
      
      On an idle system this operation is fairly reliable and results are as
      expected.  The number of 2MB pages is increased as expected and the time
      of the operation is a second or two.
      
      However, when there is activity on the system the following issues
      arise:
      
      1) This process can take quite some time, especially if allocation of
         the smaller pages is not immediate and requires migration/compaction.
      
      2) There is no guarantee that the total size of smaller pages allocated
         will match the size of the larger page which was freed. This is
         because the area freed by the larger page could quickly be
         fragmented.
      
      In a test environment with a load that continually fills the page cache
      with clean pages, results such as the following can be observed:
      
        Unexpected number of 2MB pages allocated: Expected 25600, have 19944
        real    0m42.092s
        user    0m0.008s
        sys     0m41.467s
      
      To address these issues, introduce the concept of hugetlb page demotion.
      Demotion provides a means of 'in place' splitting of a hugetlb page to
      pages of a smaller size.  This avoids freeing pages to buddy and then
      trying to allocate from buddy.
      
      Page demotion is controlled via sysfs files that reside in the per-hugetlb
      page size and per node directories.
      
       - demote_size
              Target page size for demotion, a smaller huge page size. File
              can be written to choose a smaller huge page size if multiple are
              available.
      
       - demote
              Writable number of hugetlb pages to be demoted
      
      To demote fifty 1GB huge pages, one would:
      
        cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
        cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
        echo 50 > .../hugepages-1048576kB/demote
      
      Only hugetlb pages which are free at the time of the request can be
      demoted.  Demotion does not add to the complexity of surplus pages and
      honors reserved huge pages.  Therefore, when a value is written to the
      sysfs demote file, that value is only the maximum number of pages which
      will be demoted.  It is possible fewer will actually be demoted.  The
      recently introduced per-hstate mutex is used to synchronize demote
      operations with other operations that modify hugetlb pools.
      
      Real world use cases
      --------------------
      The above scenario describes a real world use case where hugetlb pages
      are used to back VMs on x86.  Both issues of long allocation times and
      not necessarily getting the expected number of smaller huge pages after
      a free and allocate cycle have been experienced.  The occurrence of
      these issues is dependent on other activity within the host and can not
      be predicted.
      
      This patch (of 5):
      
      Two new sysfs files are added to demote hugetlb pages.  These files are
      both per-hugetlb page size and per node.  Files are:
      
        demote_size - The size in Kb that pages are demoted to. (read-write)
        demote - The number of huge pages to demote. (write-only)
      
      By default, demote_size is the next smallest huge page size.  Valid huge
      page sizes less than huge page size may be written to this file.  When
      huge pages are demoted, they are demoted to this size.
      
      Writing a value to demote will result in an attempt to demote that
      number of hugetlb pages to an appropriate number of demote_size pages.
      
      NOTE: Demote interfaces are only provided for huge page sizes if there
      is a smaller target demote huge page size.  For example, on x86 1GB huge
      pages will have demote interfaces.  2MB huge pages will not have demote
      interfaces.
      
      This patch does not provide full demote functionality.  It only provides
      the sysfs interfaces.
      
      It also provides documentation for the new interfaces.
      
      [mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
        Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
      
      Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
      Cc: Nghia Le <nghialm78@gmail.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      79dfc695
    • P
      mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h · 73c54763
      Committed by Peter Xu
      Remove __unmap_hugepage_range() from the header file, because it is only
      used in hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      73c54763
    • M
      mm: fix data race in PagePoisoned() · 477d01fc
      Committed by Marco Elver
      PagePoisoned() accesses page->flags which can be updated concurrently:
      
        | BUG: KCSAN: data-race in next_uptodate_page / unlock_page
        |
        | write (marked) to 0xffffea00050f37c0 of 8 bytes by task 1872 on cpu 1:
        |  instrument_atomic_write           include/linux/instrumented.h:87 [inline]
        |  clear_bit_unlock_is_negative_byte include/asm-generic/bitops/instrumented-lock.h:74 [inline]
        |  unlock_page+0x102/0x1b0           mm/filemap.c:1465
        |  filemap_map_pages+0x6c6/0x890     mm/filemap.c:3057
        |  ...
        | read to 0xffffea00050f37c0 of 8 bytes by task 1873 on cpu 0:
        |  PagePoisoned                   include/linux/page-flags.h:204 [inline]
        |  PageReadahead                  include/linux/page-flags.h:382 [inline]
        |  next_uptodate_page+0x456/0x830 mm/filemap.c:2975
        |  ...
        | CPU: 0 PID: 1873 Comm: systemd-udevd Not tainted 5.11.0-rc4-00001-gf9ce0be7 #1
      
      To avoid the compiler tearing or otherwise optimizing the access, use
      READ_ONCE() to access flags.
      
      Link: https://lore.kernel.org/all/20210826144157.GA26950@xsang-OptiPlex-9020/
      Link: https://lkml.kernel.org/r/20210913113542.2658064-1-elver@google.com
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Signed-off-by: Marco Elver <elver@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Will Deacon <will@kernel.org>
      Cc: Marco Elver <elver@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      477d01fc
    • C
      mm: create a new system state and fix core_kernel_text() · d2635f20
      Committed by Christophe Leroy
      core_kernel_text() considers that until system_state is at least
      SYSTEM_RUNNING, init memory is valid.
      
      But init memory is freed a few lines before setting SYSTEM_RUNNING, so
      we have a small period of time when core_kernel_text() is wrong.
      
      Create an intermediate system state called SYSTEM_FREEING_INIT that is
      set before starting freeing init memory, and use it in
      core_kernel_text() to report init memory invalid earlier.
      
      Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
      Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@ozlabs.org>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d2635f20
    • F
      mm/page_alloc: detect allocation forbidden by cpuset and bail out early · 8ca1b5a4
      Committed by Feng Tang
      There was a report that starting an Ubuntu in docker while using cpuset
      to bind it to movable nodes (a node only has movable zone, like a node
      for hotplug or a Persistent Memory node in normal usage) will fail due
      to memory allocation failure, and then OOM is involved and many other
      innocent processes got killed.
      
      It can be reproduced with command:
      
          $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"
      
      (where node 4 is a movable node)
      
        runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
        CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
        Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
        Call Trace:
         dump_stack+0x6b/0x88
         dump_header+0x4a/0x1e2
         oom_kill_process.cold+0xb/0x10
         out_of_memory.part.0+0xaf/0x230
         out_of_memory+0x3d/0x80
         __alloc_pages_slowpath.constprop.0+0x954/0xa20
         __alloc_pages_nodemask+0x2d3/0x300
         pipe_write+0x322/0x590
         new_sync_write+0x196/0x1b0
         vfs_write+0x1c3/0x1f0
         ksys_write+0xa7/0xe0
         do_syscall_64+0x52/0xd0
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        Mem-Info:
        active_anon:392832 inactive_anon:182 isolated_anon:0
         active_file:68130 inactive_file:151527 isolated_file:0
         unevictable:2701 dirty:0 writeback:7
         slab_reclaimable:51418 slab_unreclaimable:116300
         mapped:45825 shmem:735 pagetables:2540 bounce:0
         free:159849484 free_pcp:73 free_cma:0
        Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
        Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
        lowmem_reserve[]: 0 0 0 0 0
        Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB
      
        oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
        Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
        oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
        oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      The reason is that in this case the target cpuset nodes only have a
      movable zone, while the creation of an OS in docker sometimes needs
      to allocate memory in non-movable zones (dma/dma32/normal), e.g. with
      GFP_HIGHUSER.  The cpuset limit forbids the allocation, so
      out-of-memory killing is triggered even though both normal and
      movable nodes have plenty of free memory.
      
      The OOM killer cannot help to resolve the situation as there is no
      usable memory for the request in the cpuset scope.  The only reasonable
      measure to take is to fail the allocation right away and have the caller
      to deal with it.
      
      So add a check for cases like this in the slowpath of allocation, and
      bail out early returning NULL for the allocation.
      
      As page allocation is one of the hottest paths in the kernel, this
      check would hurt all users with sane cpuset configurations.  So add a
      static branch check and detect the abnormal config in cpuset memory
      binding setup, so that the extra check cost in page allocation is not
      paid by everyone.
      
      [thanks to Michal Hocko and David Rientjes for suggesting not handling
       it inside OOM code, adding cpuset check, refining comments]
      
      Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8ca1b5a4
    • C
      mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation · c00b6b96
      Committed by Chen Wandun
      Commit ffb29b1c ("mm/vmalloc: fix numa spreading for large hash
      tables") can cause significant performance regressions in some
      situations, as Andrew mentioned in [1].  The main case is vmalloc:
      by default vmalloc allocates pages with NUMA_NO_NODE, which results
      in pages being allocated one at a time.
      
      In order to solve this, __alloc_pages_bulk and mempolicy should be
      considered at the same time.
      
      1) If node is specified in memory allocation request, it will alloc all
         pages by __alloc_pages_bulk.
      
      2) If interleaved memory allocation is requested, it will calculate
         how many pages should be allocated on each node, and use
         __alloc_pages_bulk to allocate pages on each node.
      
      [1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: make two functions static]
      [akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
      
      Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c00b6b96
    • K
      kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC · 3252b1d8
      Committed by Kefeng Wang
      With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK the kernel crashes:
      
        Unable to handle kernel paging request at virtual address ffff7000028f2000
        ...
        swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
        [ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
        Internal error: Oops: 96000007 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
        Hardware name: linux,dummy-virt (DT)
        pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
        pc : kasan_check_range+0x90/0x1a0
        lr : memcpy+0x88/0xf4
        sp : ffff80001378fe20
        ...
        Call trace:
         kasan_check_range+0x90/0x1a0
         pcpu_page_first_chunk+0x3f0/0x568
         setup_per_cpu_areas+0xb8/0x184
         start_kernel+0x8c/0x328
      
      The vm area used in vm_area_register_early() has no kasan shadow
      memory.  Add a new kasan_populate_early_vm_area_shadow() function to
      populate the vm area shadow memory and fix the issue.
      
      [wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
        Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com
      
      Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Acked-by: Marco Elver <elver@google.com>		[KASAN]
      Acked-by: Andrey Konovalov <andreyknvl@gmail.com>	[KASAN]
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3252b1d8
    • P
      mm/vmalloc: don't allow VM_NO_GUARD on vmap() · bd1a8fb2
      Committed by Peter Zijlstra
      The vmalloc guard pages are added on top of each allocation, thereby
      isolating any two allocations from one another.  The top guard of the
      lower allocation is the bottom guard of the higher allocation, etc.
      
      Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
      isolating separate allocations.
      
      There are only two in-tree users of this flag, neither of which use it
      through the exported interface.  Ensure it stays this way.
      
      Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Will Deacon <will@kernel.org>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Andrey Konovalov <andreyknvl@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bd1a8fb2
    • L
      include/linux/io-mapping.h: remove fallback for writecombine · 2e86f78b
      Committed by Lucas De Marchi
      The fallback was introduced in commit 80c33624 ("io-mapping: Fixup
      for different names of writecombine") to fix the build on microblaze.
      
      5 years later, it seems all archs now provide a pgprot_writecombine(),
      so just remove the other possible fallbacks.  For microblaze,
      pgprot_writecombine() is available since commit 97ccedd7
      ("microblaze: Provide pgprot_device/writecombine macros for nommu").
      
      This is build-tested on microblaze with a hack to always build
      mm/io-mapping.o and without DIYing on an x86-only macro
      (_PAGE_CACHE_MASK).
      
      Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com
      Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2e86f78b
    • L
      memory: remove unused CONFIG_MEM_BLOCK_SIZE · e26e0cc3
      Committed by Lukas Bulwahn
      Commit 3947be19 ("[PATCH] memory hotplug: sysfs and add/remove
      functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
      utilized anywhere.
      
      It is a good practice to keep the CONFIG_* defines exclusively for the
      Kbuild system.  So, drop this unused definition.
      
      This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
      
      Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com
      Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e26e0cc3
    • P
      mm: add zap_skip_check_mapping() helper · 91b61ef3
      Committed by Peter Xu
      Use the helper for the checks.  Rename "check_mapping" into
      "zap_mapping" because "check_mapping" looks like a bool but in fact it
      stores the mapping itself.  When it's set, we check the mapping (it must
      be non-NULL).  When it's cleared we skip the check, which works like the
      old way.
      
      Move the duplicated comments to the helper too.
      
      Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Liam Howlett <liam.howlett@oracle.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91b61ef3
    • P
      mm: drop first_index/last_index in zap_details · 232a6a1c
      Committed by Peter Xu
      The first_index/last_index parameters in zap_details are actually only
      used in unmap_mapping_range_tree().  At the meantime, this function is
      only called by unmap_mapping_pages() once.
      
      Instead of passing these two variables through the whole stack of page
      zapping code, remove them from zap_details and let them simply be
      parameters of unmap_mapping_range_tree(), which is inlined.
      
      Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      232a6a1c
    • M
      mm: memcontrol: remove the kmem states · e80216d9
      Committed by Muchun Song
      Now the kmem state is only used to indicate whether the kmem is
      offline.  However, we can set ->kmemcg_id to -1 to indicate this
      instead, which lets us remove the kmem states entirely and simplify
      the code.
      
      Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e80216d9
    • C
      mm: simplify bdi refcounting · efee1713
      Committed by Christoph Hellwig
      Move grabbing and releasing the bdi refcount out of the common
      wb_init/wb_exit helpers into code that is only used for the non-default
      memcg driven bdi_writeback structures.
      
      [hch@lst.de: add comment]
        Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
      [akpm@linux-foundation.org: fix typo]
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      efee1713
    • C
      fs: explicitly unregister per-superblock BDIs · 0b3ea092
      Committed by Christoph Hellwig
      Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
      associated with them, and unregister it when the superblock is shut
      down.
      
      Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Miquel Raynal <miquel.raynal@bootlin.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Vignesh Raghavendra <vigneshr@ti.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b3ea092
    • K
      percpu: add __alloc_size attributes for better bounds checking · 17197dd4
      Committed by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate percpu allocator interfaces, to provide additional hinting
      for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Note that due to the implementation of the percpu API, this is unlikely
      to ever actually provide compile-time checking beyond very simple
      non-SMP builds.  But, since they are technically allocators, mark them
      as such.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      17197dd4
    • K
      mm/page_alloc: add __alloc_size attributes for better bounds checking · abd58f38
      Committed by Kees Cook
      As already done in GrapheneOS, add the __alloc_size attribute for
      appropriate page allocator interfaces, to provide additional hinting for
      better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
      compiler optimizations.
      
      Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Co-developed-by: Daniel Micay <danielmicay@gmail.com>
      Signed-off-by: Daniel Micay <danielmicay@gmail.com>
      Cc: Andy Whitcroft <apw@canonical.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
      Cc: Miguel Ojeda <ojeda@kernel.org>
      Cc: Nathan Chancellor <nathan@kernel.org>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexandre Bounine <alex.bou9@gmail.com>
      Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: kernel test robot <lkp@intel.com>
      Cc: Matt Porter <mporter@kernel.crashing.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abd58f38