1. 27 September 2022, 12 commits
    • Y
      mm: multi-gen LRU: minimal implementation · ac35a490
      Committed by Yu Zhao
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless as the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
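
      As a rough illustration of the bookkeeping described above, here is a
      toy userspace C model (all names and the layout are invented for
      illustration and deliberately omit the per-type/per-zone dimensions of
      the real lrugen->nr_pages[]):

        #include <stdio.h>

        #define MAX_NR_GENS 4
        #define MIN_NR_GENS 2

        struct lrugen_model {
                unsigned long max_seq;
                unsigned long min_seq;
                long nr_pages[MAX_NR_GENS];
        };

        /* promotion: only the gen value and the per-gen counters change */
        static void promote(struct lrugen_model *l, int *page_gen)
        {
                int new_gen = l->max_seq % MAX_NR_GENS;

                if (*page_gen == new_gen)
                        return;
                l->nr_pages[*page_gen]--;
                l->nr_pages[new_gen]++;
                *page_gen = new_gen;
        }

        /* aging: increment max_seq when the window shrinks to MIN_NR_GENS */
        static void try_to_age(struct lrugen_model *l)
        {
                if (l->max_seq - l->min_seq + 1 <= MIN_NR_GENS)
                        l->max_seq++;   /* everything not promoted becomes colder */
        }

        int main(void)
        {
                struct lrugen_model l = { .max_seq = 3, .min_seq = 2 };
                int gen = (int)(l.min_seq % MAX_NR_GENS);

                l.nr_pages[gen] = 1;
                try_to_age(&l);         /* max_seq: 3 -> 4 */
                promote(&l, &gen);      /* a hot page moves to gen 4 % 4 = 0 */
                printf("max_seq=%lu page_gen=%d\n", l.max_seq, gen);
                return 0;
        }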
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
      A feedback loop modeled after the PID controller monitors refaults over
      anon and file types and decides which type to evict when both types are
      available from the same generation.
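
      Continuing the toy model above, a sketch of how the eviction side
      advances min_seq only when the oldest generation has drained, while
      keeping at least MIN_NR_GENS generations around (names are again
      illustrative and reuse struct lrugen_model from the previous sketch):

        /* eviction: advance min_seq once lists[min_seq % MAX_NR_GENS] is empty */
        static void try_to_inc_min_seq(struct lrugen_model *l)
        {
                int oldest = l->min_seq % MAX_NR_GENS;

                if (l->nr_pages[oldest] == 0 &&
                    l->max_seq - l->min_seq + 1 > MIN_NR_GENS)
                        l->min_seq++;
        }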
      
      The protection of pages accessed multiple times through file descriptors
      takes place in the eviction path.  Each generation is divided into
      multiple tiers.  A page accessed N times through file descriptors is in
      tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
      bits in folio->flags.  The aforementioned feedback loop also monitors
      refaults over all tiers and decides when to protect pages in which tiers
      (N>1), using the first tier (N=0,1) as a baseline.  The first tier
      contains single-use unmapped clean pages, which are most likely the best
      choices.  In contrast to promotion in the aging path, the protection of a
      page in the eviction path is achieved by moving this page to the next
      generation, i.e., min_seq+1, if the feedback loop decides so.  This
      approach has the following advantages:
      
      1. It removes the cost of activation in the buffered access path by
         inferring whether pages accessed multiple times through file
         descriptors are statistically hot and thus worth protecting in the
         eviction path.
      2. It takes pages accessed through page tables into account and avoids
         overprotecting pages accessed multiple times through file
         descriptors. (Pages accessed through page tables are in the first
         tier, since N=0.)
      3. More tiers provide better protection for pages accessed more than
         twice through file descriptors, when under heavy buffered I/O
         workloads.
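
      To make the tier arithmetic above concrete, here is a small,
      self-contained C sketch of the mapping from reference count N to tier
      (the protection decision itself is driven by the refault feedback loop
      and is not modeled here):

        #include <stdio.h>

        /* tier = order_base_2(N): N=0,1 -> tier 0, N=2 -> 1, N=3..4 -> 2, ... */
        static int tier_of(unsigned int refs)
        {
                int tier = 0;

                while ((1u << tier) < refs)
                        tier++;
                return tier;
        }

        int main(void)
        {
                for (unsigned int refs = 0; refs <= 4; refs++)
                        printf("N=%u -> tier %d\n", refs, tier_of(refs));
                /* a protected page is not activated; it is simply requeued
                 * into generation min_seq + 1 during the eviction
                 */
                return 0;
        }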
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): +[30, 32]%
                      IOPS         BW
            5.19-rc1: 2673k        10.2GiB/s
            patch1-6: 3491k        13.3GiB/s
      
        Single workload:
          memcached (anon): -[4, 6]%
                      Ops/sec      KB/sec
            5.19-rc1: 1161501.04   45177.25
            patch1-6: 1106168.46   43025.04
      
        Configurations:
          CPU: two Xeon 6154
          Mem: total 256G
      
          Node 1 was only used as a ram disk to reduce the variance in the
          results.
      
          patch drivers/block/brd.c <<EOF
          99,100c99,100
          < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
          < 	page = alloc_page(gfp_flags);
          ---
          > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
          > 	page = alloc_pages_node(1, gfp_flags, 0);
          EOF
      
          cat >>/etc/systemd/system.conf <<EOF
          CPUAffinity=numa
          NUMAPolicy=bind
          NUMAMask=0
          EOF
      
          cat >>/etc/memcached.conf <<EOF
          -m 184320
          -s /var/run/memcached/memcached.sock
          -a 0766
          -t 36
          -B binary
          EOF
      
          cat fio.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkfs.ext4 /dev/ram0
          mount -t ext4 /dev/ram0 /mnt
      
          mkdir /sys/fs/cgroup/user.slice/test
          echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
          echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
          fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=5m --group_reporting
      
          cat memcached.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
      Client benchmark results:
        kswapd profiles:
          5.19-rc1
            40.33%  page_vma_mapped_walk (overhead)
            21.80%  lzo1x_1_do_compress (real work)
             7.53%  do_raw_spin_lock
             3.95%  _raw_spin_unlock_irq
             2.52%  vma_interval_tree_iter_next
             2.37%  folio_referenced_one
             2.28%  vma_interval_tree_subtree_search
             1.97%  anon_vma_interval_tree_iter_first
             1.60%  ptep_clear_flush
             1.06%  __zram_bvec_write
      
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
        Configurations:
          CPU: single Snapdragon 7c
          Mem: total 4G
      
          ChromeOS MemoryPressure [1]
      
      [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac35a490
    • Y
      mm: multi-gen LRU: groundwork · ec1c86b2
      Committed by Yu Zhao
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
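
      A brief, self-contained C sketch of the truncation described above;
      MAX_NR_GENS and the stored values come from the text, everything else
      is illustrative:

        #include <stdio.h>

        #define MAX_NR_GENS 4   /* counter needs order_base_2(4 + 1) = 3 bits */

        /* list index used for a given generation sequence number */
        static int lru_gen_from_seq_model(unsigned long seq)
        {
                return seq % MAX_NR_GENS;
        }

        int main(void)
        {
                unsigned long max_seq = 7;
                /* the flags field stores gen + 1, within [1, MAX_NR_GENS],
                 * while the page sits on lrugen->lists[], and 0 otherwise
                 */
                int stored = lru_gen_from_seq_model(max_seq) + 1;

                printf("seq=%lu -> list index %d, stored counter %d\n",
                       max_seq, lru_gen_from_seq_model(max_seq), stored);
                return 0;
        }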
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2
    • Y
      Revert "include/linux/mm_inline.h: fold __update_lru_size() into its sole caller" · aa1b6790
      Committed by Yu Zhao
      This patch undoes the following refactor: commit 289ccba1
      ("include/linux/mm_inline.h: fold __update_lru_size() into its sole
      caller")
      
      The upcoming changes to include/linux/mm_inline.h will reuse
      __update_lru_size().
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-5-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa1b6790
    • Y
      mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG · eed9a328
      Committed by Yu Zhao
      Some architectures support the accessed bit in non-leaf PMD entries, e.g.,
      x86 sets the accessed bit in a non-leaf PMD entry when using it as part of
      linear address translation [1].  Page table walkers that clear the
      accessed bit may use this capability to reduce their search space.
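
      A toy userspace model of the search-space reduction (the struct names
      and layout are invented; the real walkers operate on hardware page
      tables):

        #include <stdbool.h>
        #include <stdio.h>

        #define PTRS_PER_PMD 512

        struct pte_model { bool accessed; };
        struct pmd_model { bool accessed; struct pte_model pte[PTRS_PER_PMD]; };

        /* returns how many PTEs the walker actually had to inspect */
        static int scan_pmd(struct pmd_model *pmd, bool nonleaf_young_supported)
        {
                int scanned = 0;

                /* the non-leaf accessed bit is a summary: if it is clear, no
                 * PTE underneath was used for a translation, so skip them all
                 */
                if (nonleaf_young_supported && !pmd->accessed)
                        return 0;

                for (int i = 0; i < PTRS_PER_PMD; i++) {
                        scanned++;
                        pmd->pte[i].accessed = false;   /* clear and move on */
                }
                return scanned;
        }

        int main(void)
        {
                struct pmd_model pmd = { .accessed = false };

                printf("without the capability: %d PTEs scanned\n", scan_pmd(&pmd, false));
                printf("with the capability:    %d PTEs scanned\n", scan_pmd(&pmd, true));
                return 0;
        }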
      
      Note that:
      1. Although an inline function is preferable, this capability is added
         as a configuration option for consistency with the existing macros.
      2. Due to the little interest in other varieties, this capability was
         only tested on Intel and AMD CPUs.
      
      Thanks to the following developers for their efforts [2][3].
        Randy Dunlap <rdunlap@infradead.org>
        Stephen Rothwell <sfr@canb.auug.org.au>
      
      [1]: Intel 64 and IA-32 Architectures Software Developer's Manual
           Volume 3 (June 2021), section 4.8
      [2] https://lore.kernel.org/r/bfdcc7c8-922f-61a9-aa15-7e7250f04af7@infradead.org/
      [3] https://lore.kernel.org/r/20220413151513.5a0d7a7e@canb.auug.org.au/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-3-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eed9a328
    • Y
      mm: x86, arm64: add arch_has_hw_pte_young() · e1fd09e3
      Committed by Yu Zhao
      Patch series "Multi-Gen LRU Framework", v14.
      
      What's new
      ==========
      1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
         Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
      2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
         machines. The old direct reclaim backoff, which tries to enforce a
         minimum fairness among all eligible memcgs, over-swapped by about
         (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
         pulls the plug on swapping once the target is met, trades some
         fairness for curtailed latency:
         https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
      3. Fixed minor build warnings and conflicts. More comments and nits.
      
      TLDR
      ====
      The current page reclaim is too expensive in terms of CPU usage and it
      often makes poor choices about what to evict. This patchset offers an
      alternative solution that is performant, versatile and
      straightforward.
      
      Patchset overview
      =================
      The design and implementation overview is in patch 14:
      https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
      
      01. mm: x86, arm64: add arch_has_hw_pte_young()
      02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
      Take advantage of hardware features when trying to clear the accessed
      bit in many PTEs.
      
      03. mm/vmscan.c: refactor shrink_node()
      04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
          its sole caller"
      Minor refactors to improve readability for the following patches.
      
      05. mm: multi-gen LRU: groundwork
      Adds the basic data structure and the functions that insert pages to
      and remove pages from the multi-gen LRU (MGLRU) lists.
      
      06. mm: multi-gen LRU: minimal implementation
      A minimal implementation without optimizations.
      
      07. mm: multi-gen LRU: exploit locality in rmap
      Exploits spatial locality to improve efficiency when using the rmap.
      
      08. mm: multi-gen LRU: support page table walks
      Further exploits spatial locality by optionally scanning page tables.
      
      09. mm: multi-gen LRU: optimize multiple memcgs
      Optimizes the overall performance for multiple memcgs running mixed
      types of workloads.
      
      10. mm: multi-gen LRU: kill switch
      Adds a kill switch to enable or disable MGLRU at runtime.
      
      11. mm: multi-gen LRU: thrashing prevention
      12. mm: multi-gen LRU: debugfs interface
      Provide userspace with features like thrashing prevention, working set
      estimation and proactive reclaim.
      
      13. mm: multi-gen LRU: admin guide
      14. mm: multi-gen LRU: design doc
      Add an admin guide and a design doc.
      
      Benchmark results
      =================
      Independent lab results
      -----------------------
      Based on the popularity of searches [01] and the memory usage in
      Google's public cloud, the most popular open-source memory-hungry
      applications, in alphabetical order, are:
            Apache Cassandra      Memcached
            Apache Hadoop         MongoDB
            Apache Spark          PostgreSQL
            MariaDB (MySQL)       Redis
      
      An independent lab evaluated MGLRU with the most widely used benchmark
      suites for the above applications. They posted 960 data points along
      with kernel metrics and perf profiles collected over more than 500
      hours of total benchmark time. Their final reports show that, with 95%
      confidence intervals (CIs), the above applications all performed
      significantly better for at least part of their benchmark matrices.
      
      On 5.14:
      1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
         less wall time to sort three billion random integers, respectively,
         under the medium- and the high-concurrency conditions, when
         overcommitting memory. There were no statistically significant
         changes in wall time for the rest of the benchmark matrix.
      2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
         more transactions per minute (TPM), respectively, under the medium-
         and the high-concurrency conditions, when overcommitting memory.
         There were no statistically significant changes in TPM for the rest
         of the benchmark matrix.
      3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
         and [21.59, 30.02]% more operations per second (OPS), respectively,
         for sequential access, random access and Gaussian (distribution)
         access, when THP=always; 95% CIs [13.85, 15.97]% and
         [23.94, 29.92]% more OPS, respectively, for random access and
         Gaussian access, when THP=never. There were no statistically
         significant changes in OPS for the rest of the benchmark matrix.
      4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
         [2.16, 3.55]% more operations per second (OPS), respectively, for
         exponential (distribution) access, random access and Zipfian
         (distribution) access, when underutilizing memory; 95% CIs
         [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
         respectively, for exponential access, random access and Zipfian
         access, when overcommitting memory.
      
      On 5.15:
      5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
         and [4.11, 7.50]% more operations per second (OPS), respectively,
         for exponential (distribution) access, random access and Zipfian
         (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
         [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
         exponential access, random access and Zipfian access, when swap was
         on.
      6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
         less average wall time to finish twelve parallel TeraSort jobs,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in average wall time for the rest of the
         benchmark matrix.
      7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
         minute (TPM) under the high-concurrency condition, when swap was
         off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
         respectively, under the medium- and the high-concurrency
         conditions, when swap was on. There were no statistically
         significant changes in TPM for the rest of the benchmark matrix.
      8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
         [11.47, 19.36]% more total operations per second (OPS),
         respectively, for sequential access, random access and Gaussian
         (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
         [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
         for sequential access, random access and Gaussian access, when
         THP=never.
      
      Our lab results
      ---------------
      To supplement the above results, we ran the following benchmark suites
      on 5.16-rc7 and found no regressions [10].
            fs_fio_bench_hdd_mq      pft
            fs_lmbench               pgsql-hammerdb
            fs_parallelio            redis
            fs_postmark              stream
            hackbench                sysbenchthread
            kernbench                tpcc_spark
            memcached                unixbench
            multichase               vm-scalability
            mutilate                 will-it-scale
            nginx
      
      [01] https://trends.google.com
      [02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
      [03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
      [04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
      [05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
      [06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
      [07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
      [08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
      [09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
      [10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
      
      Real-world applications
      =======================
      Third-party testimonials
      ------------------------
      Konstantin reported [11]:
         I have Archlinux with 8G RAM + zswap + swap. While developing, I
         have lots of apps opened such as multiple LSP-servers for different
         langs, chats, two browsers, etc... Usually, my system gets quickly
         to a point of SWAP-storms, where I have to kill LSP-servers,
         restart browsers to free memory, etc, otherwise the system lags
         heavily and is barely usable.
         
         1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
         patchset, and I started up by opening lots of apps to create memory
         pressure, and worked for a day like this. Till now I had not a
         single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
         getting to the point of 3G in SWAP before without a single
         SWAP-storm.
      
      Vaibhav from IBM reported [12]:
         In a synthetic MongoDB Benchmark, seeing an average of ~19%
         throughput improvement on POWER10(Radix MMU + 64K Page Size) with
         MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
         three different request distributions, namely, Exponential, Uniform
         and Zipfian.
      
      Shuang from U of Rochester reported [13]:
         With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
         and [9.26, 10.36]% higher throughput, respectively, for random
         access, Zipfian (distribution) access and Gaussian (distribution)
         access, when the average number of jobs per CPU is 1; 95% CIs
         [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
         throughput, respectively, for random access, Zipfian access and
         Gaussian access, when the average number of jobs per CPU is 2.
      
      Daniel from Michigan Tech reported [14]:
         With Memcached allocating ~100GB of byte-addressable Optane,
         performance improvement in terms of throughput (measured as queries
         per second) was about 10% for a series of workloads.
      
      Large-scale deployments
      -----------------------
      We've rolled out MGLRU to tens of millions of ChromeOS users and
      about a million Android users. Google's fleetwide profiling [15] shows
      an overall 40% decrease in kswapd CPU usage, in addition to
      improvements in other UX metrics, e.g., an 85% decrease in the number
      of low-memory kills at the 75th percentile and an 18% decrease in
      app launch time at the 50th percentile.
      
      The downstream kernels that have been using MGLRU include:
      1. Android [16]
      2. Arch Linux Zen [17]
      3. Armbian [18]
      4. ChromeOS [19]
      5. Liquorix [20]
      6. OpenWrt [21]
      7. post-factum [22]
      8. XanMod [23]
      
      [11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
      [12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
      [13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
      [14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
      [15] https://dl.acm.org/doi/10.1145/2749469.2750392
      [16] https://android.com
      [17] https://archlinux.org
      [18] https://armbian.com
      [19] https://chromium.org
      [20] https://liquorix.net
      [21] https://openwrt.org
      [22] https://codeberg.org/pf-kernel
      [23] https://xanmod.org
      
      Summary
      =======
      The facts are:
      1. The independent lab results and the real-world applications
         indicate substantial improvements; there are no known regressions.
      2. Thrashing prevention, working set estimation and proactive reclaim
         work out of the box; there are no equivalent solutions.
      3. There is a lot of new code; no smaller changes have demonstrated
         similar effects.
      
      Our options, accordingly, are:
      1. Given the amount of evidence, the reported improvements will likely
         materialize for a wide range of workloads.
      2. Gauging the interest from the past discussions, the new features
         will likely be put to use for both personal computers and data
         centers.
      3. Based on Google's track record, the new code will likely be well
         maintained in the long term. It'd be more difficult if not
         impossible to achieve similar effects with other approaches.
      
      
      This patch (of 14):
      
      Some architectures automatically set the accessed bit in PTEs, e.g., x86
      and arm64 v8.2.  On architectures that do not have this capability,
      clearing the accessed bit in a PTE usually triggers a page fault following
      the TLB miss of this PTE (to emulate the accessed bit).
      
      Being aware of this capability can help make better decisions, e.g.,
      whether to spread the work out over a period of time to reduce bursty page
      faults when trying to clear the accessed bit in many PTEs.
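
      For example, a caller that clears the accessed bit in many PTEs might
      size its batches differently depending on this capability (a sketch
      only; the batch sizes and the surrounding code are purely illustrative):

        /* if the hardware does not set the accessed bit itself, every
         * cleared bit may later cost a minor fault to re-set it, so
         * spread the clearing out over time
         */
        unsigned long batch = arch_has_hw_pte_young() ? 4096 : 64;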
      
      Note that theoretically this capability can be unreliable, e.g.,
      hotplugged CPUs might be different from builtin ones.  Therefore it should
      not be used in architecture-independent code that involves correctness,
      e.g., to determine whether TLB flushes are required (in combination with
      the accessed bit).
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
      Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Will Deacon <will@kernel.org>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e1fd09e3
    • Y
      delayacct: support re-entrance detection of thrashing accounting · aa1cf99b
      Committed by Yang Yang
      Once upon a time, we only supported accounting the thrashing of the
      page cache.  Then Joonsoo introduced workingset detection for anonymous
      pages and we gained the ability to account their thrashing too [1].
      
      For page cache thrashing accounting, there is no suitable place to do
      it at the fs level, such as swap_readpage(), so we have to do it in
      folio_wait_bit_common().
      
      For anonymous page thrashing accounting, we have to do it in both
      swap_readpage() and folio_wait_bit_common().  Like PSI, thrashing
      accounting should therefore support re-entrance detection.
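
      A sketch of what re-entrance detection can look like, using a per-task
      flag so that only the outermost start/end pair records the delay (the
      field and function names here are illustrative, not the patch's):

        static void thrashing_start(struct task_struct *tsk, bool *was_in)
        {
                *was_in = tsk->in_thrashing;    /* illustrative flag */
                if (*was_in)
                        return;                 /* nested call: already counting */
                tsk->in_thrashing = true;
                /* record the start timestamp here */
        }

        static void thrashing_end(struct task_struct *tsk, bool *was_in)
        {
                if (*was_in)
                        return;                 /* nested call: outer pair accounts */
                tsk->in_thrashing = false;
                /* fold the elapsed time into the thrashing delay here */
        }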
      
      This patch prepares for complete thrashing accounting, and is based on
      the patch "filemap: make the accounting of thrashing more consistent".
      
      [1] commit aae466b0 ("mm/swap: implement workingset detection for anonymous LRU")
      
      Link: https://lkml.kernel.org/r/20220815071134.74551-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reviewed-by: wangyong <wang.yong12@zte.com.cn>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      aa1cf99b
    • P
      mm/swap: cache swap migration A/D bits support · 5154e607
      Committed by Peter Xu
      Introduce a variable swap_migration_ad_supported to cache whether the arch
      supports swap migration A/D bits.
      
      One thing to mention is that SWP_MIG_TOTAL_BITS internally references
      the macro MAX_PHYSMEM_BITS, which expands to a function call on x86
      (and a constant on all other architectures).
      
      It is safe to reference it in swapfile_init() because by the time we
      reach it we are already at initcall level 4, so the 5-level page table
      for x86_64 must have been initialized (right after early_identify_cpu()
      finishes).
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - get_cpu_cap --> fetch from CPUID (including X86_FEATURE_LA57)
            - early_identify_cpu --> clear X86_FEATURE_LA57 (if early lvl5 not enabled (USE_EARLY_PGTABLE_L5))
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      This should slightly speed up the handling of migration swap entries.
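
      The cached check can be as simple as the following sketch, done once in
      swapfile_init() (the exact condition in the patch may differ):

        #ifdef CONFIG_MIGRATION
                if (swapfile_maximum_size >= (1UL << SWP_MIG_TOTAL_BITS))
                        swap_migration_ad_supported = true;
        #endif  /* CONFIG_MIGRATION */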
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-8-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5154e607
    • P
      mm/swap: cache maximum swapfile size when init swap · be45a490
      Committed by Peter Xu
      We used to have max_swapfile_size() fetch the per-arch maximum swapfile
      size.
      
      As the callers of max_swapfile_size() grow, this patch introduces a
      variable "swapfile_maximum_size" and caches the value of the old
      max_swapfile_size(), so that we do not need to recalculate it every
      time.
      
      Caching the value in swapfile_init() is safe because by the time we
      reach that phase, all the relevant information has been initialized.
      The major arch to take care of here is x86, which defines the maximum
      swapfile size based on the L1TF mitigation.
      
      Both X86_BUG_L1TF and l1tf_mitigation should have been set up properly
      by the time swapfile_init() runs.  As a reference, the code path looks
      like this for x86:
      
      - start_kernel
        - setup_arch
          - early_cpu_init
            - early_identify_cpu --> setup X86_BUG_L1TF
        - parse_early_param
          - l1tf_cmdline --> set l1tf_mitigation
        - check_bugs
          - l1tf_select_mitigation --> set l1tf_mitigation
        - arch_call_rest_init
          - rest_init
            - kernel_init
              - kernel_init_freeable
                - do_basic_setup
                  - do_initcalls --> calls swapfile_init() (initcall level 4)
      
      On non-x86 archs the swapfile size depends only on the swp pte format,
      so caching it is safe there too.
      
      While at it, rename max_swapfile_size() to arch_max_swapfile_size(),
      because an arch can define its own function, so an "arch_" prefix is
      more straightforward.  Also export swapfile_maximum_size to replace the
      old usages of max_swapfile_size().
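
      A sketch of the resulting initialization, assuming swapfile_init() is
      registered at initcall level 4 as described above:

        static int __init swapfile_init(void)
        {
                /* cached once; X86_BUG_L1TF and l1tf_mitigation are final here */
                swapfile_maximum_size = arch_max_swapfile_size();
                return 0;
        }
        subsys_initcall(swapfile_init);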
      
      [peterx@redhat.com: declare arch_max_swapfile_size() in swapfile.h]
        Link: https://lkml.kernel.org/r/YxTh1GuC6ro5fKL5@xz-m1.local
      Link: https://lkml.kernel.org/r/20220811161331.37055-7-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      be45a490
    • P
      mm: remember young/dirty bit for page migrations · 2e346877
      Committed by Peter Xu
      When page migration happens, we always ignore the young/dirty bit
      settings in the old pgtable, mark the page as old in the new page table
      using either pte_mkold() or pmd_mkold(), and keep the pte clean.
      
      That's fine functionally, but it's not friendly to page reclaim,
      because the page being moved can be actively accessed during the
      procedure.  Not to mention that hardware setting the young bit can
      bring quite some overhead on some systems; e.g., x86_64 needs a few
      hundred nanoseconds to set the bit.  The same slowdown applies to the
      dirty bit when the memory is first written after the migration.
      
      Actually we can easily remember the A/D bit configuration and recover the
      information after the page is migrated.  To achieve it, define a new set
      of bits in the migration swap offset field to cache the A/D bits for old
      pte.  Then when removing/recovering the migration entry, we can recover
      the A/D bits even if the page changed.
      
      One thing to mention is that here we used max_swapfile_size() to detect
      how many swp offset bits we have, and we'll only enable this feature if we
      know the swp offset is big enough to store both the PFN value and the A/D
      bits.  Otherwise the A/D bits are dropped like before.
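
      A toy model of stashing the A/D bits next to the PFN in the swap offset
      (the bit positions are illustrative, not the kernel's layout):

        #include <stdio.h>

        #define PFN_BITS   52ULL
        #define MIG_YOUNG  (1ULL << PFN_BITS)
        #define MIG_DIRTY  (1ULL << (PFN_BITS + 1))

        static unsigned long long pack(unsigned long long pfn, int young, int dirty)
        {
                return pfn | (young ? MIG_YOUNG : 0) | (dirty ? MIG_DIRTY : 0);
        }

        int main(void)
        {
                unsigned long long off = pack(0x1234, 1, 0);

                printf("pfn=%#llx young=%d dirty=%d\n",
                       off & (MIG_YOUNG - 1), !!(off & MIG_YOUNG), !!(off & MIG_DIRTY));
                return 0;
        }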
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-6-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      2e346877
    • P
      mm/swap: add swp_offset_pfn() to fetch PFN from swap entry · 0d206b5d
      Committed by Peter Xu
      We've got a bunch of special swap entries that store a PFN inside the
      swap offset field.  To fetch the PFN, normally the user just calls
      swp_offset(), assuming that value is the PFN.
      
      Add a helper swp_offset_pfn() to fetch the PFN instead.  It fetches
      only the maximum possible width of a PFN on the host, and a
      BUILD_BUG_ON() in is_pfn_swap_entry(), checked against
      MAX_PHYSMEM_BITS, makes sure the swap offset can always store PFNs
      properly.
      
      One reason to do so is that we never tried to verify whether the swap
      offset can really fit a PFN.  Meanwhile, this patch also prepares for
      the future possibility of storing more information inside the swp
      offset field, so assuming "swp_offset(entry)" to be the PFN will soon
      no longer hold.
      
      Replace many of the swp_offset() callers with swp_offset_pfn() where
      appropriate.  Note that many of the existing users are not candidates
      for the replacement, e.g.:
      
        (1) When the swap entry is not a pfn swap entry at all, or,
        (2) when we want to keep the whole swp_offset but only change the swp type.
      
      The latter can happen when fork() is triggered on a write-migration
      swap entry pte and we want to change only the migration type from write
      to read while keeping the rest; that is "changing the swap type only",
      not "fetching the PFN".  These callers are left aside so that when more
      information is stored within the swp offset, it will be carried over
      naturally in those cases.
      
      While at it, drop hwpoison_entry_to_pfn(), because that is exactly what
      the new swp_offset_pfn() is about.
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-4-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0d206b5d
    • P
      mm/swap: comment all the ifdef in swapops.h · eba4d770
      Committed by Peter Xu
      swapops.h contains quite a few layers of ifdefs, and some of the "else"
      and "endif" directives do not carry a comment naming the macro they
      refer to, which makes them hard to follow.  Add the comments.
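
      For example, the style being added looks like this (an illustrative
      excerpt, not a literal hunk from the patch):

        #ifdef CONFIG_MIGRATION
        /* ... migration entry helpers ... */
        #else   /* CONFIG_MIGRATION */
        /* ... stubs ... */
        #endif  /* CONFIG_MIGRATION */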
      
      Link: https://lkml.kernel.org/r/20220811161331.37055-3-peterx@redhat.com
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Suggested-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Huang Ying <ying.huang@intel.com>
      Reviewed-by: Alistair Popple <apopple@nvidia.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      eba4d770
    • M
      mm, hwpoison: use num_poisoned_pages_sub() to decrease num_poisoned_pages · 21c9e90a
      Committed by Miaohe Lin
      Use num_poisoned_pages_sub() to combine multiple atomic ops into one. Also
      num_poisoned_pages_dec() can be killed as there's no caller now.
      
      Link: https://lkml.kernel.org/r/20220830123604.25763-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      21c9e90a
  2. 12 September 2022, 28 commits
    • D
      mm: fix PageAnonExclusive clearing racing with concurrent RCU GUP-fast · 088b8aa5
      Committed by David Hildenbrand
      commit 6c287605 ("mm: remember exclusively mapped anonymous pages with
      PG_anon_exclusive") made sure that when PageAnonExclusive() has to be
      cleared during temporary unmapping of a page, the PTE is
      cleared/invalidated and the TLB is flushed.
      
      What we want to achieve in all cases is that we cannot end up with a pin on
      an anonymous page that may be shared, because such pins would be
      unreliable and could result in memory corruptions when the mapped page
      and the pin go out of sync due to a write fault.
      
      That TLB flush handling was inspired by an outdated comment in
      mm/ksm.c:write_protect_page(), which similarly required the TLB flush in
      the past to synchronize with GUP-fast. However, ever since general RCU GUP
      fast was introduced in commit 2667f50e ("mm: introduce a general RCU
      get_user_pages_fast()"), a TLB flush is no longer sufficient to handle
      concurrent GUP-fast in all cases -- it only handles traditional IPI-based
      GUP-fast correctly.
      
      Peter Xu (thankfully) questioned whether that TLB flush is really
      required. On architectures that send an IPI broadcast on TLB flush,
      it works as expected. To synchronize with RCU GUP-fast properly, we're
      conceptually fine, however, we have to enforce a certain memory order and
      are missing memory barriers.
      
      Let's document that, avoid the TLB flush where possible and use proper
      explicit memory barriers where required. We shouldn't really care about the
      additional memory barriers here, as we're not on extremely hot paths --
      and we're getting rid of some TLB flushes.
      
      We use a smp_mb() pair for handling concurrent pinning and a
      smp_rmb()/smp_wmb() pair for handling the corner case of only temporary
      PTE changes but permanent PageAnonExclusive changes.
      
      One extreme example, whereby GUP-fast takes a R/O pin and KSM wants to
      convert an exclusive anonymous page to a KSM page, and that page is already
      mapped write-protected (-> no PTE change) would be:
      
      	Thread 0 (KSM)			Thread 1 (GUP-fast)
      
      					(B1) Read the PTE
      					# (B2) skipped without FOLL_WRITE
      	(A1) Clear PTE
      	smp_mb()
      	(A2) Check pinned
      					(B3) Pin the mapped page
      					smp_mb()
      	(A3) Clear PageAnonExclusive
      	smp_wmb()
      	(A4) Restore PTE
      					(B4) Check if the PTE changed
      					smp_rmb()
      					(B5) Check PageAnonExclusive
      
      Thread 1 will properly detect that PageAnonExclusive was cleared and
      back off.
      
      Note that we don't need a memory barrier between checking if the page is
      pinned and clearing PageAnonExclusive, because stores are not
      speculated.
      
      The possible issues due to reordering are of theoretical nature so far
      and attempts to reproduce the race failed.
      
      Especially the "no PTE change" case isn't the common case, because we'd
      need an exclusive anonymous page that's mapped R/O and the PTE is clean
      in KSM code -- and using KSM with page pinning isn't extremely common.
      Further, the clear+TLB flush we used for now implies a memory barrier.
      So the problematic missing part should be the missing memory barrier
      after pinning but before checking if the PTE changed.
      
      Link: https://lkml.kernel.org/r/20220901083559.67446-1-david@redhat.com
      Fixes: 6c287605 ("mm: remember exclusively mapped anonymous pages with PG_anon_exclusive")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Parri <parri.andrea@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "Paul E. McKenney" <paulmck@kernel.org>
      Cc: Christoph von Recklinghausen <crecklin@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      088b8aa5
    • M
      hugetlb: make hugetlb_cma_check() static · 263b8998
      Committed by Miaohe Lin
      Patch series "A few cleanup patches for hugetlb", v2.
      
      This series contains a few cleanup patches to use helper functions to
      simplify the codes, remove unneeded nid parameter and so on. More
      details can be found in the respective changelogs.
      
      
      This patch (of 10):
      
      Make hugetlb_cma_check() static as it's only used inside mm/hugetlb.c.
      
      Link: https://lkml.kernel.org/r/20220901120030.63318-1-linmiaohe@huawei.com
      Link: https://lkml.kernel.org/r/20220901120030.63318-2-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      263b8998
    • Z
      fs/buffer: remove bh_submit_read() helper · 454552d0
      Committed by Zhang Yi
      bh_submit_read() has no user anymore, just remove it.
      
      Link: https://lkml.kernel.org/r/20220901133505.2510834-15-yi.zhang@huawei.com
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      454552d0
    • Z
      fs/buffer: remove ll_rw_block() helper · 79f59784
      Committed by Zhang Yi
      Now that all ll_rw_block() users have been replaced with the new safe
      helpers, just remove it.
      
      Link: https://lkml.kernel.org/r/20220901133505.2510834-13-yi.zhang@huawei.com
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      79f59784
    • Z
      fs/buffer: add some new buffer read helpers · fdee117e
      Committed by Zhang Yi
      The current ll_rw_block() helper is fragile: it assumes that a locked
      buffer is under IO submitted by whoever holds the lock, and it skips
      buffers whose lock it fails to take, so it is only safe on the
      readahead path.  Unfortunately, most filesystems still mistakenly use
      this helper on the sync metadata read path.  There is no guarantee that
      whoever holds the buffer lock always submits IO (e.g.,
      buffer_migrate_folio_norefs() after commit 88dbcbb3 ("blkdev: avoid
      migration stalls for blkdev pages")), which can lead to a
      false-positive -EIO when submitting read IO.
      
      This patch adds some friendly buffer read helpers to prepare for
      replacing ll_rw_block() and similar calls.  Only the bh_readahead_[]
      helpers may be called on the readahead paths.
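
      A sketch of the intended call pattern; the helper names follow this
      series, but the error-handling convention shown here is an assumption:

        /* sync metadata read: wait for the IO and check the result instead
         * of trusting a locked buffer to already be under IO
         */
        if (bh_read(bh, 0) < 0)         /* assumed: negative value on IO error */
                return -EIO;

        /* readahead: best effort, so skipping locked buffers is acceptable */
        bh_readahead(bh, REQ_RAHEAD);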
      
      Link: https://lkml.kernel.org/r/20220901133505.2510834-3-yi.zhang@huawei.com
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fdee117e
    • Z
      fs/buffer: remove __breadahead_gfp() · 214f8796
      Committed by Zhang Yi
      Patch series "fs/buffer: remove ll_rw_block()", v2.
      
      ll_rw_block() skips locked buffers before submitting IO, assuming that
      a locked buffer is under IO.  This assumption is not always true,
      because we cannot guarantee that every buffer lock path submits IO.
      After commit 88dbcbb3 ("blkdev: avoid migration stalls for blkdev
      pages"), buffer_migrate_folio_norefs() became one exceptional case, and
      there may be others.  So ll_rw_block() is not safe on the sync read
      path; we could get a false-positive EIO return value when a filesystem
      reads metadata.  It seems it should only be used on the readahead path.
      
      Unfortunately, many filesystems misuse ll_rw_block() on the sync read
      path.  This patch set removes ll_rw_block() and adds new, friendlier
      helpers, which prevent false-positive EIO on the metadata read path.
      Thanks to Jan for the suggestion; the original discussion is at [1].
      
       patch 1: remove unused helpers in fs/buffer.c
       patch 2: add new bh_read_[*] helpers
       patch 3-11: remove all ll_rw_block() calls in filesystems
       patch 12-14: do some leftover cleanups.
      
      [1]. https://lore.kernel.org/linux-mm/20220825080146.2021641-1-chengzhihao1@huawei.com/
      
      
      This patch (of 14):
      
      No one uses __breadahead_gfp() and sb_breadahead_unmovable() anymore, so
      remove them.
      
      Link: https://lkml.kernel.org/r/20220901133505.2510834-1-yi.zhang@huawei.com
      Link: https://lkml.kernel.org/r/20220901133505.2510834-2-yi.zhang@huawei.com
      Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Gruenbacher <agruenba@redhat.com>
      Cc: Bob Peterson <rpeterso@redhat.com>
      Cc: Evgeniy Dushistov <dushistov@mail.ru>
      Cc: Heming Zhao <ocfs2-devel@oss.oracle.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
      Cc: Mark Fasheh <mark@fasheh.com>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Cc: Zhihao Cheng <chengzhihao1@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      214f8796
    • L
      mm/thp: remove redundant CONFIG_TRANSPARENT_HUGEPAGE · bcd0dea5
      Liu Shixin 提交于
      Simplify the code by removing redundant CONFIG_TRANSPARENT_HUGEPAGE
      checks.
      
      No functional change.
      
      Link: https://lkml.kernel.org/r/20220829095125.3284567-1-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bcd0dea5
    • L
      mm/thp: simplify has_transparent_hugepage by using IS_BUILTIN · a38c94ed
      Liu Shixin 提交于
      Simplify the has_transparent_hugepage() definition by using IS_BUILTIN().
      
      No functional change.
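      
      For illustration only, the simplification presumably reduces the generic
      fallback to something like the following sketch (not necessarily the
      exact hunk):
      
        /* Sketch: the generic fallback can simply mirror the config option. */
        #ifndef has_transparent_hugepage
        #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
        #endif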
      
      Link: https://lkml.kernel.org/r/20220829095709.3287462-1-liushixin2@huawei.com
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a38c94ed
    • K
      mm: fix null-ptr-deref in kswapd_is_running() · b4a0215e
      Kefeng Wang 提交于
      kswapd_run/stop() will set pgdat->kswapd to NULL, which could race with
      kswapd_is_running() in kcompactd():
      
      kswapd_run/stop()                       kcompactd()
                                                kswapd_is_running()
        pgdat->kswapd // error or normal ptr
                                                verify pgdat->kswapd
                                                  // load non-NULL
                                                  // pgdat->kswapd
        pgdat->kswapd = NULL
                                                task_is_running(pgdat->kswapd)
                                                  // NULL pointer dereference
      
      KASAN reports the null-ptr-deref shown below,
      
        vmscan: Failed to start kswapd on node 0
        ...
        BUG: KASAN: null-ptr-deref in kcompactd+0x440/0x504
        Read of size 8 at addr 0000000000000024 by task kcompactd0/37
      
        CPU: 0 PID: 37 Comm: kcompactd0 Kdump: loaded Tainted: G           OE     5.10.60 #1
        Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
        Call trace:
         dump_backtrace+0x0/0x394
         show_stack+0x34/0x4c
         dump_stack+0x158/0x1e4
         __kasan_report+0x138/0x140
         kasan_report+0x44/0xdc
         __asan_load8+0x94/0xd0
         kcompactd+0x440/0x504
         kthread+0x1a4/0x1f0
         ret_from_fork+0x10/0x18
      
      At present, kswapd/kcompactd_run() and kswapd/kcompactd_stop() are
      protected by mem_hotplug_begin/done(), but kcompactd() itself is not.
      There is no need to involve the memory hotplug lock in kcompactd(), so
      let's add a new mutex to protect pgdat->kswapd accesses.
      
      Also, because the kcompactd task checks the state of the kswapd task,
      it's better to call kcompactd_stop() before kswapd_stop() to reduce lock
      contention.
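      
      A minimal sketch of the locking pattern described above follows; the
      mutex and helper shown here are illustrative assumptions, not
      necessarily what the patch adds:
      
        /* Illustrative sketch only: serialize pgdat->kswapd readers/writers. */
        static DEFINE_MUTEX(example_pgdat_kswapd_mutex);       /* hypothetical name */
      
        static bool kswapd_is_running(pg_data_t *pgdat)
        {
                bool running;
      
                mutex_lock(&example_pgdat_kswapd_mutex);
                running = pgdat->kswapd && task_is_running(pgdat->kswapd);
                mutex_unlock(&example_pgdat_kswapd_mutex);
                return running;
        }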
      
      [akpm@linux-foundation.org: add comments]
      Link: https://lkml.kernel.org/r/20220827111959.186838-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Muchun Song <muchun.song@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b4a0215e
    • K
      mm: kill is_memblock_offlined() · 639118d1
      Kefeng Wang 提交于
      Directly check the state of struct memory_block; there is no need for a
      dedicated function.
      
      Link: https://lkml.kernel.org/r/20220827112043.187028-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      639118d1
    • V
      filemap: remove find_get_pages_contig() · 48658d85
      Vishal Moola (Oracle) 提交于
      All callers of find_get_pages_contig() have been removed, so it is no
      longer needed.
      
      Link: https://lkml.kernel.org/r/20220824004023.77310-8-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Chris Mason <clm@fb.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      48658d85
    • V
      filemap: add filemap_get_folios_contig() · 35b47146
      Vishal Moola (Oracle) 提交于
      Patch series "Convert to filemap_get_folios_contig()", v3.
      
      This patch series replaces find_get_pages_contig() with
      filemap_get_folios_contig().
      
      
      This patch (of 7):
      
      This function is meant to replace find_get_pages_contig().
      
      Unlike find_get_pages_contig(), filemap_get_folios_contig() no longer
      takes a target number of pages to find; it returns up to 15 contiguous
      folios.
      
      To be more consistent with filemap_get_folios(),
      filemap_get_folios_contig() now also updates the start index passed in,
      and takes an end index.
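      
      A hedged usage sketch of the new helper, assuming it returns the number
      of folios placed into the batch and advances *start past them:
      
        #include <linux/pagemap.h>
        #include <linux/pagevec.h>
      
        /* Sketch: walk a contiguous range of the page cache in folio batches. */
        static void example_walk_contig(struct address_space *mapping,
                                        pgoff_t start, pgoff_t end)
        {
                struct folio_batch fbatch;
                unsigned int i, nr;
      
                folio_batch_init(&fbatch);
                while ((nr = filemap_get_folios_contig(mapping, &start, end,
                                                       &fbatch))) {
                        for (i = 0; i < nr; i++) {
                                struct folio *folio = fbatch.folios[i];
                                /* ... use folio ... */
                        }
                        folio_batch_release(&fbatch);   /* drop the references */
                }
        }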
      
      Link: https://lkml.kernel.org/r/20220824004023.77310-1-vishal.moola@gmail.com
      Link: https://lkml.kernel.org/r/20220824004023.77310-2-vishal.moola@gmail.com
      Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
      Cc: Chris Mason <clm@fb.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: David Sterba <dsterba@suse.com>
      Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      35b47146
    • L
      page_ext: introduce boot parameter 'early_page_ext' · c4f20f14
      Li Zhe 提交于
      In commit 2f1ee091 ("Revert "mm: use early_pfn_to_nid in
      page_ext_init""), we call page_ext_init() after page_alloc_init_late() to
      avoid a panic problem.  As a result, the current kernel cannot track
      early page allocations even if the page structures have been initialized
      early.
      
      This patch introduces a new boot parameter 'early_page_ext' to resolve
      this problem.  If it is passed to the kernel, page_ext_init() is moved up
      and the 'deferred initialization of struct pages' feature is disabled, so
      that the page allocator is initialized early and the panic above is
      prevented.  This helps us catch early page allocations, which is
      especially useful when the free memory value differs right after
      different kernel boots.
      
      [akpm@linux-foundation.org: fix section issue by removing __meminitdata]
      Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
      Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
      Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c4f20f14
    • S
      memcg: increase MEMCG_CHARGE_BATCH to 64 · 1813e51e
      Shakeel Butt 提交于
      For several years, MEMCG_CHARGE_BATCH has been kept at 32, but with
      bigger machines and network-intensive workloads requiring throughput in
      Gbps, 32 is too small and makes the memcg charging path a bottleneck.
      For now, increase it to 64 for easy acceptance into 6.0.  We will need to
      revisit this in the future for the ever increasing demand for higher
      performance.
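      
      The change itself is presumably a one-liner along these lines (a sketch;
      the header location is an assumption):
      
        /* include/linux/memcontrol.h (sketch) */
        #define MEMCG_CHARGE_BATCH 64U          /* was 32U */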
      
      Please note that the memcg charge path drains the per-cpu memcg charge
      stock, so there should not be any oom behavior change.  It does, however,
      have an impact on rstat flushing and high limit reclaim backoff.
      
      To evaluate the impact of this optimization, on a 72 CPUs machine, we
      ran the following workload in a three level of cgroup hierarchy.
      
       $ netserver -6
       # 36 instances of netperf with following params
       $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
      
      Results (average throughput of netperf):
      Without (6.0-rc1)       10482.7 Mbps
      With patch              17064.7 Mbps (62.7% improvement)
      
      With the patch, the throughput improved by 62.7%.
      
      Link: https://lkml.kernel.org/r/20220825000506.239406-4-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      1813e51e
    • S
      mm: page_counter: rearrange struct page_counter fields · 408587ba
      Shakeel Butt 提交于
      With memcg v2 enabled, memcg->memory.usage is a very hot member for
      workloads doing memcg charging on multiple CPUs concurrently,
      particularly network-intensive workloads.  In addition, there is false
      cache sharing between memory.usage and memory.high on the charge path.
      This patch moves the usage into a separate cacheline and moves all the
      read-mostly fields into another separate cacheline.
      
      To evaluate the impact of this optimization, on a 72 CPUs machine, we ran
      the following workload in a three level of cgroup hierarchy.
      
       $ netserver -6
       # 36 instances of netperf with following params
       $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
      
      Results (average throughput of netperf):
      Without (6.0-rc1)	10482.7 Mbps
      With patch		12413.7 Mbps (18.4% improvement)
      
      With the patch, the throughput improved by 18.4%.
      
      One side-effect of this patch is the increase in the size of struct
      mem_cgroup.  For example with this patch on 64 bit build, the size of
      struct mem_cgroup increased from 4032 bytes to 4416 bytes.  However for
      the performance improvement, this additional size is worth it.  In
      addition there are opportunities to reduce the size of struct mem_cgroup
      like deprecation of kmem and tcpmem page counters and better packing.
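      
      A hedged sketch of the layout idea; the exact field grouping below is
      illustrative, not the final struct:
      
        /* Illustrative sketch only: keep the hot usage counter (and other
         * frequently written fields) away from the read-mostly limits. */
        struct page_counter {
                /* written on every charge/uncharge, by many CPUs */
                atomic_long_t usage ____cacheline_internodealigned_in_smp;
                unsigned long watermark;
                unsigned long failcnt;
      
                /* read-mostly fields, e.g. the configured limits */
                unsigned long min ____cacheline_internodealigned_in_smp;
                unsigned long low;
                unsigned long high;
                unsigned long max;
                struct page_counter *parent;
        };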
      
      Link: https://lkml.kernel.org/r/20220825000506.239406-3-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reviewed-by: Feng Tang <feng.tang@intel.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Michal Koutný" <mkoutny@suse.com>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      408587ba
    • R
      mm: pagewalk: fix documentation of PTE hole handling · e2f8f44b
      Rolf Eike Beer 提交于
      Empty PTEs are passed to the pte_entry callback, not to pte_hole.
      
      Link: https://lkml.kernel.org/r/3695521.kQq0lBPeGt@devpool047
      Signed-off-by: Rolf Eike Beer <eb@emlix.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e2f8f44b
    • C
      mm: fix use-after free of page_ext after race with memory-offline · b1d5488a
      Charan Teja Kalla 提交于
      The below is one path where race between page_ext and offline of the
      respective memory blocks will cause use-after-free on the access of
      page_ext structure.
      
      process1		              process2
      ---------                             ---------
      a) doing /proc/page_owner          doing memory offline
      			           through offline_pages.
      
      b) the PageBuddy check fails,
         thus it proceeds to get the
         page_owner information
         through page_ext access.
      page_ext = lookup_page_ext(page);
      
      				    migrate_pages();
      				    .................
      				Since all pages are successfully
      				migrated as part of the offline
      				operation, send MEM_OFFLINE notification
      				where for page_ext it calls:
      				offline_page_ext()-->
      				__free_page_ext()-->
      				   free_page_ext()-->
      				     vfree(ms->page_ext)
      			           mem_section->page_ext = NULL
      
      c) The check for the PAGE_EXT
         flag in the page_ext->flags
         access results in a
         use-after-free (leading to
         translation faults).
      
      As mentioned above, there is really no synchronization between page_ext
      access and its freeing during memory offline.
      
      The memory offline steps (roughly) on a memory block are as below:
      
      1) Isolate all the pages.
      
      2) while(1)
        try to free the pages to buddy (->free_list[MIGRATE_ISOLATE]).
      
      3) Delete the pages from this buddy list.
      
      4) Then free page_ext.  (Note: the struct page is still alive, as it is
         freed only during hot remove of the memory, which frees the memmap, a
         step the user might not perform.)
      
      This design leads to a state where the struct page is alive but the
      struct page_ext is freed, even though the latter is ideally part of the
      former, just representing extra page flags (check [3] for why this design
      was chosen).
      
      The above-mentioned race is just one example __but the problem persists
      in the other paths involving page_ext->flags access too (e.g.
      page_is_idle())__.
      
      Fix all the paths where offline races with page_ext access by maintaining
      synchronization with rcu lock and is achieved in 3 steps:
      
      1) Invalidate all the page_ext's of the sections of a memory block by
         storing a flag in the LSB of mem_section->page_ext.
      
      2) Wait for all the existing readers to finish working with the
         ->page_ext's, using synchronize_rcu().  Any parallel process that
         starts after this call will not get the page_ext, through
         lookup_page_ext(), for the block on which the parallel offline
         operation is being performed.
      
      3) Now safely free all the sections' ->page_ext's of the block on which
         the offline operation is being performed.
      
      Note: If synchronize_rcu() takes time then optimizations can be done in
      this path through call_rcu()[2].
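      
      On the reader side the pattern presumably ends up looking like the
      sketch below, assuming the helpers introduced here are named along the
      lines of page_ext_get()/page_ext_put() and take/drop the RCU read lock
      around the lookup:
      
        /* Sketch of a reader that is safe against a parallel offline. */
        static bool example_page_has_flag(struct page *page, int flag)
        {
                struct page_ext *page_ext;
                bool ret;
      
                page_ext = page_ext_get(page);  /* rcu_read_lock() + lookup */
                if (!page_ext)                  /* section already invalidated */
                        return false;
      
                ret = test_bit(flag, &page_ext->flags);
                page_ext_put(page_ext);         /* rcu_read_unlock() */
                return ret;
        }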
      
      Thanks to David Hildenbrand for his views/suggestions on the initial
      discussion [1] and to Pavan Kondeti for various inputs on this patch.
      
      [1] https://lore.kernel.org/linux-mm/59edde13-4167-8550-86f0-11fc67882107@quicinc.com/
      [2] https://lore.kernel.org/all/a26ce299-aed1-b8ad-711e-a49e82bdd180@quicinc.com/T/#u
      [3] https://lore.kernel.org/all/6fa6b7aa-731e-891c-3efb-a03d6a700efa@redhat.com/
      
      [quic_charante@quicinc.com: rename label `loop' to `ext_put_continue' per David]
        Link: https://lkml.kernel.org/r/1661496993-11473-1-git-send-email-quic_charante@quicinc.com
      Link: https://lkml.kernel.org/r/1660830600-9068-1-git-send-email-quic_charante@quicinc.com
      Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com>
      Suggested-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Fernand Sieber <sieberf@amazon.com>
      Cc: Minchan Kim <minchan@google.com>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavan Kondeti <quic_pkondeti@quicinc.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b1d5488a
    • K
      mm: kill find_min_pfn_with_active_regions() · fb70c487
      Kefeng Wang 提交于
      find_min_pfn_with_active_regions() is only called from free_area_init(). 
      Open-code the PHYS_PFN(memblock_start_of_DRAM()) into free_area_init(),
      and kill find_min_pfn_with_active_regions().
      
      Link: https://lkml.kernel.org/r/20220815111017.39341-1-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      fb70c487
    • Z
      arch: mm: rename FORCE_MAX_ZONEORDER to ARCH_FORCE_MAX_ORDER · 0192445c
      Zi Yan 提交于
      This Kconfig option is used by individual architectures to set their
      desired MAX_ORDER.  Rename it to reflect its actual use.
      
      Link: https://lkml.kernel.org/r/20220815143959.1511278-1-zi.yan@sent.com
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Acked-by: Guo Ren <guoren@kernel.org>			[csky]
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
      Acked-by: Huacai Chen <chenhuacai@kernel.org>		[LoongArch]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>		[powerpc]
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Taichi Sugaya <sugaya.taichi@socionext.com>
      Cc: Neil Armstrong <narmstrong@baylibre.com>
      Cc: Qin Jian <qinjian@cqplus1.com>
      Cc: Guo Ren <guoren@kernel.org>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Dinh Nguyen <dinguyen@kernel.org>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Ley Foon Tan <ley.foon.tan@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      0192445c
    • H
      memory tiering: adjust hot threshold automatically · c959924b
      Huang Ying 提交于
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patch, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.  If the
      hint page fault latency of a page is less than the hot threshold, we will
      try to promote the page, and the page is called the candidate promotion
      page.
      
      If the number of the candidate promotion pages in the statistics interval
      is much more than the promotion rate limit, the hot threshold will be
      decreased to reduce the number of the candidate promotion pages. 
      Otherwise, the hot threshold will be increased to increase the number of
      the candidate promotion pages.
      
      To make the above method work, in each statistics interval, the total
      number of pages to check (on which the hint page faults occur) and the
      hot/cold distribution need to be stable.  Because the page tables are
      scanned linearly in NUMA balancing, but the hot/cold distribution usually
      isn't uniform along the address space, the statistics interval should be
      larger than the NUMA balancing scan period.  So in this patch, the max
      scan period is used as the statistics interval, and it works well in our
      tests.
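      
      In rough pseudo-C, the control loop described above looks something like
      the following sketch (the function name, ratios and cap are illustrative
      assumptions):
      
        /* Illustrative sketch only: run once per statistics interval. */
        static unsigned int example_adjust_hot_threshold(unsigned int threshold_ms,
                                                         unsigned long nr_candidates,
                                                         unsigned long rate_limit_pages)
        {
                if (nr_candidates > rate_limit_pages * 11 / 10)
                        return max(threshold_ms / 2, 1U);       /* too many: tighten */
                if (nr_candidates < rate_limit_pages * 9 / 10)
                        return min(threshold_ms * 2, 60000U);   /* too few: relax, cap 60s */
                return threshold_ms;
        }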
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-4-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c959924b
    • H
      memory tiering: rate limit NUMA migration throughput · c6833e10
      Huang Ying 提交于
      In NUMA balancing memory tiering mode, if there are hot pages in the slow
      memory node and cold pages in the fast memory node, we need to
      promote/demote hot/cold pages between the fast and slow memory nodes.
      
      One choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by a high promoting/demoting throughput
      will hurt the latency of some workloads, because of access latency
      inflation and slow memory bandwidth contention.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patch as
      the page promotion rate limit mechanism.
      
      The number of candidate pages to be promoted to the fast memory node via
      NUMA balancing is counted; if the count exceeds the limit specified by
      the user, NUMA balancing promotion is stopped until the next second.
      
      A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
      for the users to specify the limit.
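      
      Conceptually the check is something like the sketch below; all names and
      the windowing details are illustrative assumptions, not the actual
      implementation:
      
        /* Illustrative sketch only: allow at most limit_pages promotions per
         * one-second window. */
        static bool example_may_promote(unsigned int nr_pages, unsigned long limit_pages)
        {
                static unsigned long window_end, promoted;
      
                if (time_after(jiffies, window_end)) {
                        window_end = jiffies + HZ;      /* start a new window */
                        promoted = 0;
                }
                if (promoted + nr_pages > limit_pages)
                        return false;                   /* budget spent: skip */
                promoted += nr_pages;
                return true;
        }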
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      c6833e10
    • H
      memory tiering: hot page selection with hint page fault latency · 33024536
      Huang Ying 提交于
      Patch series "memory tiering: hot page selection", v4.
      
      To optimize page placement in a memory tiering system with NUMA
      balancing, the hot pages in the slow memory nodes need to be identified.
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm for identifying hot pages, because pages with quite low access
      frequency may eventually be accessed, given that the NUMA balancing page
      table scanning period can be quite long (e.g. 60 seconds).  So in this
      patchset, we implement a new hot page identification algorithm based on
      the latency between NUMA balancing page table scanning and the hint page
      fault, which is a kind of most frequently accessed (MFU) algorithm.
      
      In NUMA balancing memory tiering mode, if there are hot pages in the slow
      memory node and cold pages in the fast memory node, we need to
      promote/demote hot/cold pages between the fast and slow memory nodes.
      
      One choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by a high promoting/demoting throughput
      will hurt the latency of some workloads, because of access latency
      inflation and slow memory bandwidth contention.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patchset
      as the page promotion rate limit mechanism.
      
      The promotion hot threshold is workload and system configuration
      dependent.  So in this patchset, a method to adjust the hot threshold
      automatically is implemented.  The basic idea is to control the number of
      the candidate promotion pages to match the promotion rate limit.
      
      We used the pmbench memory accessing benchmark to test the patchset on a
      2-socket server system with DRAM and PMEM installed.  The test results
      are as follows:
      
      		pmbench score		promote rate
      		 (accesses/s)			MB/s
      		-------------		------------
      base		  146887704.1		       725.6
      hot selection     165695601.2		       544.0
      rate limit	  162814569.8		       165.2
      auto adjustment	  170495294.0                  136.9
      
      From the results above,
      
      With hot page selection patch [1/3], the pmbench score increases about
      12.8%, and promote rate (overhead) decreases about 25.0%, compared with
      base kernel.
      
      With rate limit patch [2/3], pmbench score decreases about 1.7%, and
      promote rate decreases about 69.6%, compared with hot page selection
      patch.
      
      With the threshold auto adjustment patch [3/3], the pmbench score
      increases about 4.7%, and the promote rate decreases about 17.1%,
      compared with the rate limit patch.
      
      Baolin helped to test the patchset with MySQL on a machine which contains
      1 DRAM node (30G) and 1 PMEM node (126G).
      
      sysbench /usr/share/sysbench/oltp_read_write.lua \
      ......
      --tables=200 \
      --table-size=1000000 \
      --report-interval=10 \
      --threads=16 \
      --time=120
      
      The TPS improved by about 5%.
      
      
      This patch (of 3):
      
      To optimize page placement in a memory tiering system with NUMA
      balancing, the hot pages in the slow memory node need to be identified.
      Essentially, the original NUMA balancing implementation selects the most
      recently accessed (MRU) pages to promote.  But this isn't a perfect
      algorithm for identifying hot pages, because pages with quite low access
      frequency may eventually be accessed, given that the NUMA balancing page
      table scanning period can be quite long (e.g. 60 seconds).  A most
      frequently accessed (MFU) algorithm is better.
      
      So, in this patch we implement a better hot page selection algorithm,
      based on NUMA balancing page table scanning and the hint page fault, as
      follows:
      
      - When the page tables of the processes are scanned to change PTE/PMD
        to be PROT_NONE, the current time is recorded in struct page as scan
        time.
      
      - When the page is accessed, a hint page fault will occur.  The scan
        time is read from the struct page, and the hint page fault latency
        is defined as
      
          hint page fault time - scan time
      
      The shorter the hint page fault latency of a page, the more likely its
      access frequency is high.  So the hint page fault latency is a better
      estimate of whether a page is hot or cold.
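      
      In sketch form (a hypothetical helper, illustrative only), the
      classification is:
      
        /* Illustrative sketch: a page is hot when the hint fault arrives soon
         * after the PROT_NONE scan that armed it. */
        static bool example_page_is_hot(unsigned int scan_time_ms,
                                        unsigned int hot_threshold_ms)
        {
                unsigned int latency_ms = jiffies_to_msecs(jiffies) - scan_time_ms;
      
                return latency_ms < hot_threshold_ms;   /* default threshold: 1000 ms */
        }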
      
      It's hard to find some extra space in struct page to hold the scan time. 
      Fortunately, we can reuse some bits used by the original NUMA balancing.
      
      NUMA balancing uses some bits in struct page to store the page's
      accessing CPU and PID (see page_cpupid_xchg_last()), which are used by
      the multi-stage node selection algorithm to avoid migrating pages shared
      between NUMA nodes back and forth.  But for pages in the slow memory
      node, even if they are accessed by multiple NUMA nodes, as long as the
      pages are hot they need to be promoted to the fast memory node.  So the
      accessing CPU and PID information is unnecessary for the slow memory
      pages, and we can reuse these bits in struct page to record the scan
      time.  For the fast memory pages, these bits are used as before.
      
      For the hot threshold, the default value is 1 second, which works well in
      our performance test.  All pages with hint page fault latency < hot
      threshold will be considered hot.
      
      It's hard for users to determine the hot threshold.  So we don't provide a
      kernel ABI to set it, just provide a debugfs interface for advanced users
      to experiment.  We will continue to work on a hot threshold automatic
      adjustment mechanism.
      
      The downside of the above method is that the response time to workload
      hot spot changes may be much longer.  For example,
      
      - A previous cold memory area becomes hot
      
      - The hint page fault will be triggered.  But the hint page fault
        latency isn't shorter than the hot threshold.  So the pages will
        not be promoted.
      
      - When the memory area is scanned again, maybe after a scan period,
        the hint page fault latency measured will be shorter than the hot
        threshold and the pages will be promoted.
      
      To mitigate this, if there is enough free space in the fast memory node,
      the hot threshold is not used and all pages are promoted upon the hint
      page fault, for fast response.
      
      Thanks to Zhong Jiang, who reported and tested the fix for a bug when
      disabling memory tiering mode dynamically.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-1-ying.huang@intel.com
      Link: https://lkml.kernel.org/r/20220713083954.34196-2-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      33024536
    • P
      mm: add more BUILD_BUG_ONs to gfp_migratetype() · 4d86d4f7
      Peter Collingbourne 提交于
      gfp_migratetype() also expects GFP_RECLAIMABLE and
      GFP_MOVABLE|GFP_RECLAIMABLE to be shiftable into MIGRATE_* enum values, so
      add some more BUILD_BUG_ONs to reflect this assumption.
      
      Link: https://linux-review.googlesource.com/id/Iae64e2182f75c3aca776a486b71a72571d66d83e
      Link: https://lkml.kernel.org/r/20220726230241.3770532-1-pcc@google.com
      Signed-off-by: Peter Collingbourne <pcc@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      4d86d4f7
    • M
      hugetlb_cgroup: remove unneeded return value · 736a8ccc
      Miaohe Lin 提交于
      The return values of set_hugetlb_cgroup() and set_hugetlb_cgroup_rsvd()
      are always ignored.  Remove them to clean up the code.
      
      Link: https://lkml.kernel.org/r/20220729080106.12752-4-linmiaohe@huawei.com
      Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      736a8ccc
    • I
      kfence: add sysfs interface to disable kfence for selected slabs. · b84e04f1
      Imran Khan 提交于
      By default a kfence allocation can happen for any slab object whose size
      is up to PAGE_SIZE, as long as that allocation is the first allocation
      after expiration of the kfence sample interval.  But in certain debugging
      scenarios we may be interested in debugging corruptions involving some
      specific slab objects like dentry or ext4_* etc.  In such cases, limiting
      kfence to allocations involving only specific slab objects will increase
      the probability of catching the issue, since the kfence pool will not be
      consumed by other slab objects.
      
      This patch introduces a sysfs interface
      '/sys/kernel/slab/<name>/skip_kfence' to disable kfence for specific
      slabs.  Having the interface work in this way does not impact the
      current/default behavior of kfence and allows us to use kfence for
      specific slabs (when needed) as well.  The decision to skip/use kfence is
      taken depending on whether the kmem_cache.flags has the (newly
      introduced) SLAB_SKIP_KFENCE flag set or not.
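      
      The allocation-side check presumably reduces to something like this
      sketch (the wrapper name is hypothetical):
      
        /* Sketch: never serve an opted-out cache from the KFENCE pool. */
        static void *example_kfence_alloc(struct kmem_cache *s, size_t size,
                                          gfp_t flags)
        {
                if (unlikely(s->flags & SLAB_SKIP_KFENCE))
                        return NULL;
                return __kfence_alloc(s, size, flags);
        }
      
      Usage would then be along the lines of writing 1 to
      /sys/kernel/slab/<name>/skip_kfence for the cache of interest.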
      
      Link: https://lkml.kernel.org/r/20220814195353.2540848-1-imran.f.khan@oracle.com
      Signed-off-by: Imran Khan <imran.f.khan@oracle.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Marco Elver <elver@google.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      b84e04f1
    • F
      mm/hugetlb: add dedicated func to get 'allowed' nodemask for current process · d2226ebd
      Feng Tang 提交于
      Muchun Song found that after the MPOL_PREFERRED_MANY policy was
      introduced in commit b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for
      multiple preferred nodes"), the semantics of policy_nodemask_current()
      for this new policy changed: it returns 'preferred' nodes instead of
      'allowed' nodes.
      
      With the changed semantics of policy_nodemask_current(), a task with the
      MPOL_PREFERRED_MANY policy could fail to get its reservation even though
      it can fall back to other nodes (either defined by cpusets or all online
      nodes) for that reservation, failing mmap calls unnecessarily early.
      
      The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
      because it, unlike MPOL_BIND, does not pose any actual hard constraint.
      
      Michal suggested that policy_nodemask_current() is only used by hugetlb
      and could be moved into the hugetlb code with a more explicit name to
      enforce the 'allowed' semantics, for which only the MPOL_BIND policy
      matters.
      
      apply_policy_zone() is made extern so it can be called from the hugetlb
      code, and its return value is changed to bool.
      
      [1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/
      
      Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
      Fixes: b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
      Signed-off-by: Feng Tang <feng.tang@intel.com>
      Reported-by: Muchun Song <songmuchun@bytedance.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Ben Widawsky <bwidawsk@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      d2226ebd
    • Y
      mm/vmscan: define macros for refaults in struct lruvec · e9c2dbc8
      Yang Yang 提交于
      The magic numbers 0 and 1 are used in several places in vmscan.c.
      Define macros for them to improve code readability.
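      
      The macros are presumably along these lines (the names below are an
      assumption, chosen after the anon/file refaults they index):
      
        /* Sketch: name the indices used for lruvec->refaults[]. */
        #define WORKINGSET_ANON 0
        #define WORKINGSET_FILE 1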
      
      Link: https://lkml.kernel.org/r/20220808005644.1721066-1-yang.yang29@zte.com.cn
      Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      e9c2dbc8
    • Z
      mm/madvise: introduce MADV_COLLAPSE sync hugepage collapse · 7d8faaf1
      Zach O'Keefe 提交于
      This idea was introduced by David Rientjes[1].
      
      Introduce a new madvise mode, MADV_COLLAPSE, that allows users to request
      a synchronous collapse of memory at their own expense.
      
      The benefits of this approach are:
      
      * CPU is charged to the process that wants to spend the cycles for the
        THP
      * Avoid unpredictable timing of khugepaged collapse
      
      Semantics
      
      This call is independent of the system-wide THP sysfs settings, but will
      fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
      multiple VMAs, the semantics of the collapse over each VMA is independent
      from the others.  This implies a hugepage cannot cross a VMA boundary.  If
      collapse of a given hugepage-aligned/sized region fails, the operation may
      continue to attempt collapsing the remainder of memory specified.
      
      The memory ranges provided must be page-aligned, but are not required to
      be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
      start/end of the range will be clamped to the first/last hugepage-aligned
      address covered by said range.  The memory ranges must span at least one
      hugepage-sized region.
      
      All non-resident pages covered by the range will first be
      swapped/faulted-in, before being internally copied onto a freshly
      allocated hugepage.  Unmapped pages will have their data directly
      initialized to 0 in the new hugepage.  However, for every eligible
      hugepage aligned/sized region to-be collapsed, at least one page must
      currently be backed by memory (a PMD covering the address range must
      already exist).
      
      Allocation for the new hugepage may enter direct reclaim and/or
      compaction, regardless of VMA flags.  When the system has multiple NUMA
      nodes, the hugepage will be allocated from the node providing the most
      native pages.  This operation operates on the current state of the
      specified process and makes no persistent changes or guarantees on how
      pages will be mapped, constructed, or faulted in the future.
      
      Return Value
      
      If all hugepage-sized/aligned regions covered by the provided range were
      either successfully collapsed, or were already PMD-mapped THPs, this
      operation will be deemed successful.  On success, process_madvise(2)
      returns the number of bytes advised, and madvise(2) returns 0.  Else, -1
      is returned and errno is set to indicate the error for the most-recently
      attempted hugepage collapse.  Note that many failures might have occurred,
      since the operation may continue to collapse in the event a single
      hugepage-sized/aligned region fails.
      
      	ENOMEM	Memory allocation failed or VMA not found
      	EBUSY	Memcg charging failed
      	EAGAIN	Required resource temporarily unavailable.  Try again
      		might succeed.
      	EINVAL	Other error: No PMD found, subpage doesn't have Present
      		bit set, "Special" page not backed by struct page, VMA
      		incorrectly sized, address not page-aligned, ...
      
      Most notable here are ENOMEM and EBUSY (new to madvise), which are
      intended to provide the caller with actionable feedback so they may take
      an appropriate fallback measure.
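      
      A minimal userspace usage sketch (MADV_COLLAPSE is assumed to be 25 when
      the libc headers do not define it yet):
      
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>
      
        #ifndef MADV_COLLAPSE
        #define MADV_COLLAPSE 25        /* assumed uapi value */
        #endif
      
        int main(void)
        {
                size_t len = 2UL << 20;                 /* one 2 MiB PMD on x86-64 */
                void *p = aligned_alloc(len, len);
      
                if (!p)
                        return 1;
                memset(p, 0xaa, len);                   /* range must be backed by memory */
                if (madvise(p, len, MADV_COLLAPSE))     /* 0 on success */
                        perror("madvise(MADV_COLLAPSE)");
                free(p);
                return 0;
        }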
      
      Use Cases
      
      Immediate users of this new functionality are malloc() implementations
      that manage memory in hugepage-sized chunks, but sometimes subrelease
      memory back to the system in native-sized chunks via MADV_DONTNEED;
      zapping the pmd.  Later, when the memory is hot, the implementation can
      madvise(MADV_COLLAPSE) to re-back the memory with THPs to regain hugepage
      coverage and dTLB performance.  TCMalloc is such an implementation that
      could benefit from this[2].
      
      Only privately-mapped anon memory is supported for now, but additional
      support for file, shmem, and HugeTLB high-granularity mappings[2] is
      expected.  File and tmpfs/shmem support would permit:
      
      * Backing executable text by THPs.  Current support provided by
        CONFIG_READ_ONLY_THP_FOR_FS may take a long time on a large system which
        might impair services from serving at their full rated load after
        (re)starting.  Tricks like mremap(2)'ing text onto anonymous memory to
        immediately realize iTLB performance prevent page sharing and demand
        paging, both of which increase steady state memory footprint.  With
        MADV_COLLAPSE, we get the best of both worlds: Peak upfront performance
        and lower RAM footprints.
      * Backing guest memory by hugepages after the memory contents have been
        migrated in native-page-sized chunks to a new host, in a
        userfaultfd-based live-migration stack.
      
      [1] https://lore.kernel.org/linux-mm/d098c392-273a-36a4-1a29-59731cdf5d3d@google.com/
      [2] https://github.com/google/tcmalloc/tree/master/tcmalloc
      
      [jrdr.linux@gmail.com: avoid possible memory leak in failure path]
        Link: https://lkml.kernel.org/r/20220713024109.62810-1-jrdr.linux@gmail.com
      [zokeefe@google.com: add missing kfree() to madvise_collapse()]
        Link: https://lore.kernel.org/linux-mm/20220713024109.62810-1-jrdr.linux@gmail.com/
        Link: https://lkml.kernel.org/r/20220713161851.1879439-1-zokeefe@google.com
      [zokeefe@google.com: delay computation of hpage boundaries until use]
        Link: https://lkml.kernel.org/r/20220720140603.1958773-4-zokeefe@google.com
      Link: https://lkml.kernel.org/r/20220706235936.2197195-10-zokeefe@google.com
      Signed-off-by: Zach O'Keefe <zokeefe@google.com>
      Signed-off-by: "Souptick Joarder (HPE)" <jrdr.linux@gmail.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Chris Kennelly <ckennelly@google.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Pasha Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Begunkov <asml.silence@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rongwei Wang <rongwei.wang@linux.alibaba.com>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      7d8faaf1