1. 23 Nov, 2022 1 commit
    • J
      mm: vmscan: fix extreme overreclaim and swap floods · f53af428
      Authored by Johannes Weiner
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
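
The figures in the trace can be cross-checked with a few lines of arithmetic. The sketch below assumes a 4 KiB page size, which the trace does not state; the variable names simply mirror the trace fields.

```python
# Values copied from the shrink_lruvec() trace above.
nr_reclaimed = 4387406
nr_to_reclaim = 1047

# Assumption: 4 KiB pages (the trace does not state the page size).
PAGE_SIZE = 4096

# Overreclaim factor, truncated to an integer as in the trace.
or_factor = nr_reclaimed // nr_to_reclaim

# Convert page counts to human-readable sizes.
requested_mib = nr_to_reclaim * PAGE_SIZE / 2**20
reclaimed_gib = nr_reclaimed * PAGE_SIZE / 2**30

print(f"or_factor={or_factor}")
print(f"requested ~{requested_mib:.1f} MiB, reclaimed ~{reclaimed_gib:.1f} GiB")
```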
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance with their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, so why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
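
The difference between finishing the rescaled remainder and making a single nudge can be illustrated with a toy model. This is only a sketch under loud assumptions, not the kernel code: it collapses the four LRU lists into two (anon and file), pretends every scanned page is reclaimed, takes SWAP_CLUSTER_MAX to be 32, and assumes the nr=[...] array in the trace is ordered inactive anon, active anon, inactive file, active file. The function name shrink() and the single_nudge flag are hypothetical.

```python
SWAP_CLUSTER_MAX = 32  # scan batch size; value assumed, not stated in the text


def shrink(targets, goal, single_nudge):
    """Toy two-list model of the bailout logic; returns pages reclaimed.

    Simplifying assumption: every scanned page is reclaimed.
    """
    remaining = dict(targets)
    reclaimed = 0
    adjusted = False
    while remaining["anon"] or remaining["file"]:
        # Scan one batch from each list that still has a target.
        for lru in remaining:
            batch = min(remaining[lru], SWAP_CLUSTER_MAX)
            remaining[lru] -= batch
            reclaimed += batch
        if reclaimed < goal:
            continue
        # Old behavior: once adjusted, finish the remainder regardless of goal.
        if adjusted and not single_nudge:
            continue
        # One list is exhausted: nothing left to balance, so bail out.
        if not remaining["anon"] or not remaining["file"]:
            break
        # Stop the list with the smaller remainder and scale the other one by
        # the share of the smaller target that was actually scanned.
        small, big = sorted(remaining, key=remaining.get)
        percentage = remaining[small] * 100 // (targets[small] + 1)
        scanned = targets[big] - remaining[big]
        scaled = targets[big] * (100 - percentage) // 100
        remaining[small] = 0
        remaining[big] = scaled - min(scaled, scanned)
        adjusted = True
    return reclaimed


# Scan targets taken from nr=[...] in the trace, goal from nr_to_reclaim.
targets = {"anon": 7161123 + 345, "file": 578 + 1111}
old = shrink(targets, goal=1047, single_nudge=False)
new = shrink(targets, goal=1047, single_nudge=True)
print(f"finish remainder: {old} pages, single nudge: {new} pages")
```

In this model the old policy keeps scanning the rescaled anon target long after the goal is met, whereas the single-nudge policy stops within a few batches of the goal. The real numbers depend on reclaim efficiency, which the model ignores.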
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  2. 08 Oct, 2022 2 commits
  3. 04 Oct, 2022 4 commits
  4. 27 Sep, 2022 13 commits
    • L
      mm/vmscan: use vma iterator instead of vm_next · 78ba531f
      Authored by Liam R. Howlett
      Use the vma iterator in get_next_vma() instead of the linked list.
      
      [yuzhao@google.com: mm/vmscan: use the proper VMA iterator]
        Link: https://lkml.kernel.org/r/Yx+QGOgHg1Wk8tGK@google.com
      Link: https://lkml.kernel.org/r/20220906194824.2110408-68-Liam.Howlett@oracle.com
      Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Tested-by: Yu Zhao <yuzhao@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: SeongJae Park <sj@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • J
      mm/demotion: demote pages according to allocation fallback order · 32008027
      Authored by Jagdish Gediya
      Currently, a higher tier node can only be demoted to selected nodes on the
      next lower tier as defined by the demotion path.  This strict demotion
      order does not work in all use cases (e.g.  some use cases may want to
      allow cross-socket demotion to another node in the same demotion tier as a
      fallback when the preferred demotion node is out of space).  This demotion
      order is also inconsistent with the page allocation fallback order when
      all the nodes in a higher tier are out of space: The page allocation can
      fall back to any node from any lower tier, whereas the demotion order
      doesn't allow that currently.
      
      This patch adds support to get all the allowed demotion targets for a
      memory tier.  The demote_page_list() function is now modified to utilize this
      allowed node mask as the fallback allocation mask.
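
The fallback behavior described above can be sketched as follows. This is an illustration only: pick_demotion_node(), its parameters, and the has_space callback are hypothetical names, and the real selection is done by the page allocator rather than by an explicit loop.

```python
def pick_demotion_node(preferred, allowed, fallback_order, has_space):
    """Return the node to demote to, or None if no allowed node has room.

    preferred:      node id chosen by the demotion path
    allowed:        set of node ids allowed as demotion targets for the tier
    fallback_order: node ids in allocation fallback order (hypothetical input)
    has_space:      callable(node) -> bool, True if the node can take pages
    """
    if has_space(preferred):
        return preferred
    # The preferred node is full: walk the allocation fallback order and take
    # the first node that is both allowed and has space.
    for node in fallback_order:
        if node != preferred and node in allowed and has_space(node):
            return node
    return None


# Node 2 is preferred but full; node 3 is in the same tier and has room.
free_pages = {1: 50, 2: 0, 3: 10}
target = pick_demotion_node(
    preferred=2,
    allowed={2, 3},
    fallback_order=[2, 1, 3],
    has_space=lambda node: free_pages.get(node, 0) > 0,
)
print(target)
```

Node 1 is skipped even though it has free pages, because it is not in the allowed mask.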
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-9-aneesh.kumar@linux.ibm.com
      Signed-off-by: Jagdish Gediya <jvgediya.oss@gmail.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • A
      mm/demotion: move memory demotion related code · 91952440
      Authored by Aneesh Kumar K.V
      This moves memory demotion related code to mm/memory-tiers.c.  No
      functional change in this patch.
      
      Link: https://lkml.kernel.org/r/20220818131042.113280-3-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
      Acked-by: Wei Xu <weixugc@google.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Bharata B Rao <bharata@amd.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Hesham Almatary <hesham.almatary@huawei.com>
      Cc: Jagdish Gediya <jvgediya.oss@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tim Chen <tim.c.chen@intel.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: SeongJae Park <sj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: admin guide · 07017acb
      Authored by Yu Zhao
      Add an admin guide.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-14-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Acked-by: Mike Rapoport <rppt@linux.ibm.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: debugfs interface · d6c3af7d
      Authored by Yu Zhao
      Add /sys/kernel/debug/lru_gen for working set estimation and proactive
      reclaim.  These techniques are commonly used to optimize job scheduling
      (bin packing) in data centers [1][2].
      
      Compared with the page table-based approach and the PFN-based
      approach, this lruvec-based approach has the following advantages:
      1. It offers better choices because it is aware of memcgs, NUMA nodes,
         shared mappings and unmapped page cache.
      2. It is more scalable because it is O(nr_hot_pages), whereas the
         PFN-based approach is O(nr_total_pages).
      
      Add /sys/kernel/debug/lru_gen_full for debugging.
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-13-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: thrashing prevention · 1332a809
      Authored by Yu Zhao
      Add /sys/kernel/mm/lru_gen/min_ttl_ms for thrashing prevention, as
      requested by many desktop users [1].
      
      When set to value N, it prevents the working set of N milliseconds from
      getting evicted.  The OOM killer is triggered if this working set cannot
      be kept in memory.  Based on the average human detectable lag (~100ms),
      N=1000 usually eliminates intolerable lags due to thrashing.  Larger
      values like N=3000 make lags less noticeable at the risk of premature OOM
      kills.
      
      Compared with the size-based approach [2], this time-based approach
      has the following advantages:
      
      1. It is easier to configure because it is agnostic to applications
         and memory sizes.
      2. It is more reliable because it is directly wired to the OOM killer.
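
One plausible reading of the time-based protection is sketched below. It is a toy model, not the kernel implementation: should_trigger_oom() and its flattened input are hypothetical, and treating a value of 0 as "protection disabled" is an assumption.

```python
def should_trigger_oom(oldest_gen_births_ms, now_ms, min_ttl_ms):
    """Return True if nothing older than min_ttl_ms is left to evict.

    oldest_gen_births_ms: birth time (ms) of the oldest generation of each
    lruvec under reclaim (a hypothetical, flattened view of kernel state).
    """
    if min_ttl_ms == 0 or not oldest_gen_births_ms:
        return False  # assumption: 0 disables the protection
    # A generation younger than min_ttl_ms belongs to the protected working
    # set; if every oldest generation is that young, reclaim has no candidates.
    return all(now_ms - birth < min_ttl_ms for birth in oldest_gen_births_ms)


# With N=1000, generations born 500 ms and 200 ms ago are both protected.
print(should_trigger_oom([9500, 9800], now_ms=10000, min_ttl_ms=1000))
# A generation born 2000 ms ago is old enough to evict, so no OOM kill.
print(should_trigger_oom([8000, 9800], now_ms=10000, min_ttl_ms=1000))
```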
      
      [1] https://lore.kernel.org/r/Ydza%2FzXKY9ATRoh6@google.com/
      [2] https://lore.kernel.org/r/20101028191523.GA14972@google.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-12-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: kill switch · 354ed597
      Authored by Yu Zhao
      Add /sys/kernel/mm/lru_gen/enabled as a kill switch. Components that
      can be disabled include:
        0x0001: the multi-gen LRU core
        0x0002: walking page table, when arch_has_hw_pte_young() returns
                true
        0x0004: clearing the accessed bit in non-leaf PMD entries, when
                CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y
        [yYnN]: apply to all the components above
      E.g.,
        echo y >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0007
        echo 5 >/sys/kernel/mm/lru_gen/enabled
        cat /sys/kernel/mm/lru_gen/enabled
        0x0005
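
The mapping between what is written to the file and what is read back can be modeled in a few lines. This is a sketch only: the constant and function names are illustrative, and the model ignores the hardware and config conditions (arch_has_hw_pte_young(), CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG) that decide which components are actually available.

```python
# Component bits as listed in the commit message; the names are illustrative.
LRU_GEN_CORE = 0x0001           # the multi-gen LRU core
LRU_GEN_MM_WALK = 0x0002        # walking page tables
LRU_GEN_NONLEAF_YOUNG = 0x0004  # clearing the accessed bit in non-leaf PMDs
ALL_COMPONENTS = LRU_GEN_CORE | LRU_GEN_MM_WALK | LRU_GEN_NONLEAF_YOUNG


def parse_enabled(text):
    """Turn a value written to the 'enabled' file into a component bitmask."""
    text = text.strip()
    if text in ("y", "Y"):
        return ALL_COMPONENTS
    if text in ("n", "N"):
        return 0
    # Numeric input: accept decimal or 0x-prefixed hex, keep only known bits.
    return int(text, 0) & ALL_COMPONENTS


def format_enabled(mask):
    """Render a bitmask the way the examples above show it when read back."""
    return f"0x{mask:04x}"


for written in ("y", "5", "n"):
    print(written, "->", format_enabled(parse_enabled(written)))
```

Writing 5 selects bits 0x0001 and 0x0004, i.e. the core plus non-leaf PMD clearing, with page table walks turned off.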
      
      NB: the page table walks happen on the scale of seconds under heavy memory
      pressure, in which case the mmap_lock contention is a lesser concern,
      compared with the LRU lock contention and the I/O congestion.  So far the
      only well-known case of the mmap_lock contention happens on Android, due
      to Scudo [1] which allocates several thousand VMAs for merely a few
      hundred MBs.  The SPF and the Maple Tree also have provided their own
      assessments [2][3].  However, if walking page tables does worsen the
      mmap_lock contention, the kill switch can be used to disable it.  In this
      case the multi-gen LRU will suffer a minor performance degradation, as
      shown previously.
      
      Clearing the accessed bit in non-leaf PMD entries can also be disabled,
      since this behavior was not tested on x86 varieties other than Intel and
      AMD.
      
      [1] https://source.android.com/devices/tech/debug/scudo
      [2] https://lore.kernel.org/r/20220128131006.67712-1-michel@lespinasse.org/
      [3] https://lore.kernel.org/r/20220426150616.3937571-1-Liam.Howlett@oracle.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-11-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: optimize multiple memcgs · f76c8337
      Authored by Yu Zhao
      When multiple memcgs are available, it is possible to use generations as a
      frame of reference to make better choices and improve overall performance
      under global memory pressure.  This patch adds a basic optimization to
      select memcgs that can drop single-use unmapped clean pages first.  Doing
      so reduces the chance of going into the aging path or swapping, which can
      be costly.
      
      A typical example that benefits from this optimization is a server running
      mixed types of workloads, e.g., heavy anon workload in one memcg and heavy
      buffered I/O workload in the other.
      
      Though this optimization can be applied to both kswapd and direct reclaim,
      it is only added to kswapd to keep the patchset manageable.  Later
      improvements may cover the direct reclaim path.
      
      While ensuring certain fairness to all eligible memcgs, proportional scans
      of individual memcgs also require proper backoff to avoid overshooting
      their aggregate reclaim target by too much.  Otherwise it can cause high
      direct reclaim latency.  The conditions for backoff are:
      
      1. At low priorities, for direct reclaim, if aging fairness or direct
         reclaim latency is at risk, i.e., aging one memcg multiple times or
         swapping after the target is met.
      2. At high priorities, for global reclaim, if per-zone free pages are
         above respective watermarks.
      
      Server benchmark results:
        Mixed workloads:
          fio (buffered I/O): +[19, 21]%
                      IOPS         BW
            patch1-8: 1880k        7343MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[119, 123]%
                      Ops/sec      KB/sec
            patch1-8: 862768.65    33514.68
            patch1-9: 1911022.12   74234.54
      
        Mixed workloads:
          fio (buffered I/O): +[75, 77]%
                      IOPS         BW
            5.19-rc1: 1279k        4996MiB/s
            patch1-9: 2252k        8796MiB/s
      
          memcached (anon): +[13, 15]%
                      Ops/sec      KB/sec
            5.19-rc1: 1673524.04   65008.87
            patch1-9: 1911022.12   74234.54
      
        Configurations:
          (changes since patch 6)
      
          cat mixed.sh
          modprobe brd rd_nr=2 rd_size=56623104
      
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          mkfs.ext4 /dev/ram1
          mount -t ext4 /dev/ram1 /mnt
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          fio -name=mglru --numjobs=36 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=90m --group_reporting &
          pid=$!
      
          sleep 200
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=50000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
          kill -INT $pid
          wait
      
      Client benchmark results:
        no change (CONFIG_MEMCG=n)
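
The percentage ranges quoted in the server benchmark results above can be reproduced from the raw numbers. The helper name gain_percent() is illustrative; the inputs are copied from the tables.

```python
def gain_percent(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


# (before, after) pairs copied from the benchmark tables above.
results = {
    "fio IOPS (k), patch1-8 -> patch1-9": (1880, 2252),
    "memcached ops/sec, patch1-8 -> patch1-9": (862768.65, 1911022.12),
    "fio IOPS (k), 5.19-rc1 -> patch1-9": (1279, 2252),
    "memcached ops/sec, 5.19-rc1 -> patch1-9": (1673524.04, 1911022.12),
}

for label, (before, after) in results.items():
    print(f"{label}: +{gain_percent(before, after):.1f}%")
```

Each computed gain falls inside the bracketed range quoted next to the corresponding workload.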
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • Y
      mm: multi-gen LRU: support page table walks · bd74fdae
      Authored by Yu Zhao
      To further exploit spatial locality, the aging prefers to walk page tables
      to search for young PTEs and promote hot pages.  A kill switch will be
      added in the next patch to disable this behavior.  When disabled, the
      aging relies on the rmap only.
      
      NB: this behavior has nothing in common with the page table scanning in the
      2.4 kernel [1], which searches page tables for old PTEs, adds cold pages
      to swapcache and unmaps them.
      
      To avoid confusion, the term "iteration" specifically means the traversal
      of an entire mm_struct list; the term "walk" will be applied to page
      tables and the rmap, as usual.
      
      An mm_struct list is maintained for each memcg, and an mm_struct follows
      its owner task to the new memcg when this task is migrated.  Given an
      lruvec, the aging iterates lruvec_memcg()->mm_list and calls
      walk_page_range() with each mm_struct on this list to promote hot pages
      before it increments max_seq.
      
      When multiple page table walkers iterate the same list, each of them gets
      a unique mm_struct; therefore they can run concurrently.  Page table
      walkers ignore any misplaced pages, e.g., if an mm_struct was migrated,
      pages it left in the previous memcg will not be promoted when its current
      memcg is under reclaim.  Similarly, page table walkers will not promote
      pages from nodes other than the one under reclaim.
      
      This patch uses the following optimizations when walking page tables:
      1. It tracks the usage of mm_struct's between context switches so that
         page table walkers can skip processes that have been sleeping since
         the last iteration.
      2. It uses generational Bloom filters to record populated branches so
         that page table walkers can reduce their search space based on the
         query results, e.g., to skip page tables containing mostly holes or
         misplaced pages.
      3. It takes advantage of the accessed bit in non-leaf PMD entries when
         CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG=y.
      4. It does not zigzag between a PGD table and the same PMD table
         spanning multiple VMAs. IOW, it finishes all the VMAs within the
         range of the same PMD table before it returns to a PGD table. This
         improves the cache performance for workloads that have large
         numbers of tiny VMAs [2], especially when CONFIG_PGTABLE_LEVELS=5.
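
The idea behind the generational Bloom filters in item 2 can be sketched as below. This is a generic illustration, not the kernel data structure: the class name, the filter size, the use of two filters selected by sequence number, and the SHA-256 based hashing are all assumptions made for the example.

```python
import hashlib


class GenerationalBloom:
    """Toy generational Bloom filter keyed by an iteration sequence number."""

    def __init__(self, nbits=1 << 15, nr_gens=2):
        self.nbits = nbits
        self.filters = [bytearray(nbits // 8) for _ in range(nr_gens)]

    def _bits(self, key):
        # Derive two bit positions from one digest of the key.
        digest = hashlib.sha256(repr(key).encode()).digest()
        h1 = int.from_bytes(digest[:8], "little") % self.nbits
        h2 = int.from_bytes(digest[8:16], "little") % self.nbits
        return h1, h2

    def _filter(self, seq):
        # Consecutive sequence numbers alternate between the filters.
        return self.filters[seq % len(self.filters)]

    def reset(self, seq):
        f = self._filter(seq)
        f[:] = bytes(len(f))

    def add(self, seq, key):
        f = self._filter(seq)
        for b in self._bits(key):
            f[b // 8] |= 1 << (b % 8)

    def might_contain(self, seq, key):
        f = self._filter(seq)
        return all(f[b // 8] & (1 << (b % 8)) for b in self._bits(key))


bloom = GenerationalBloom()
seq = 7
bloom.reset(seq + 1)
# While iterating at `seq`, record a populated branch for the next iteration.
bloom.add(seq + 1, 0x7F0000200000)
# The next iteration queries its own filter to decide what is worth walking.
print(bloom.might_contain(seq + 1, 0x7F0000200000))
```

A Bloom filter never reports a recorded branch as absent, so a walker that skips on a negative answer cannot miss a branch that was recorded in that filter; false positives only cost an unnecessary walk.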
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[8, 10]%
                      Ops/sec      KB/sec
            patch1-7: 1147696.57   44640.29
            patch1-8: 1245274.91   48435.66
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
          patch1-8
            49.44%  lzo1x_1_do_compress (real work)
             6.19%  page_vma_mapped_walk (overhead)
             5.97%  _raw_spin_unlock_irq
             3.13%  get_pfn_folio
             2.85%  ptep_clear_flush
             2.42%  __zram_bvec_write
             2.08%  do_raw_spin_lock
             1.92%  memmove
             1.44%  alloc_zspage
             1.36%  memset
      
        Configurations:
          no change
      
      Thanks to the following developers for their efforts [3].
        kernel test robot <lkp@intel.com>
      
      [1] https://lwn.net/Articles/23732/
      [2] https://llvm.org/docs/ScudoHardenedAllocator.html
      [3] https://lore.kernel.org/r/202204160827.ekEARWQo-lkp@intel.com/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-9-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      bd74fdae
    • Y
      mm: multi-gen LRU: exploit locality in rmap · 018ee47f
      Yu Zhao committed
      Searching the rmap for PTEs mapping each page on an LRU list (to test and
      clear the accessed bit) can be expensive because pages from different VMAs
      (PA space) are not cache friendly to the rmap (VA space).  For workloads
      mostly using mapped pages, searching the rmap can incur the highest CPU
      cost in the reclaim path.
      
      This patch exploits spatial locality to reduce the trips into the rmap. 
      When shrink_page_list() walks the rmap and finds a young PTE, a new
      function lru_gen_look_around() scans at most BITS_PER_LONG-1 adjacent
      PTEs.  On finding another young PTE, it clears the accessed bit and
      updates the gen counter of the page mapped by this PTE to
      (max_seq%MAX_NR_GENS)+1.
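
      As a rough illustration of the look-around step described above, the
      following userspace sketch models PTEs as a plain array.  The toy_pte
      type, the aligned window and the value of MAX_NR_GENS are assumptions
      made for this example only; the real lru_gen_look_around() operates on
      page tables and is subject to further bounds not modeled here.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define MAX_NR_GENS 4UL /* assumed value for this sketch */

/* Hypothetical stand-in for a PTE and the page it maps. */
struct toy_pte {
	bool young;        /* accessed bit */
	unsigned long gen; /* gen counter of the mapped page */
};

/*
 * Scan the aligned window of BITS_PER_LONG entries that contains `hit`,
 * the young PTE already found by the rmap walk (requires hit < nr).
 * At most BITS_PER_LONG-1 neighbours are visited.  Each young neighbour
 * has its accessed bit cleared and its gen counter set to
 * (max_seq % MAX_NR_GENS) + 1.  Returns the number of neighbours updated.
 */
size_t toy_look_around(struct toy_pte *ptes, size_t nr, size_t hit,
		       unsigned long max_seq)
{
	size_t start = hit - hit % BITS_PER_LONG;
	size_t end = start + BITS_PER_LONG;
	unsigned long new_gen = max_seq % MAX_NR_GENS + 1;
	size_t i, found = 0;

	if (end > nr)
		end = nr;

	for (i = start; i < end; i++) {
		if (i == hit || !ptes[i].young)
			continue; /* the caller handles `hit` itself */
		ptes[i].young = false;
		ptes[i].gen = new_gen;
		found++;
	}
	return found;
}

/* Usage example: two young neighbours inside the window, one outside. */
void toy_look_around_example(void)
{
	struct toy_pte ptes[200] = { { false, 0 } };
	size_t found;

	ptes[3].young = ptes[5].young = ptes[10].young = ptes[150].young = true;
	found = toy_look_around(ptes, 200, 5, 6);

	assert(found == 2);
	assert(!ptes[3].young && ptes[3].gen == 3);
	assert(ptes[150].young); /* outside the window, left untouched */
}
```

      The point of the sketch is that one expensive rmap hit is amortized
      over a cheap linear scan of its neighbours.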
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): no change
      
        Single workload:
          memcached (anon): +[3, 5]%
                      Ops/sec      KB/sec
            patch1-6: 1106168.46   43025.04
            patch1-7: 1147696.57   44640.29
      
        Configurations:
          no change
      
      Client benchmark results:
        kswapd profiles:
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
          patch1-7
            48.16%  lzo1x_1_do_compress (real work)
             8.20%  page_vma_mapped_walk (overhead)
             7.06%  _raw_spin_unlock_irq
             2.92%  ptep_clear_flush
             2.53%  __zram_bvec_write
             2.11%  do_raw_spin_lock
             2.02%  memmove
             1.93%  lru_gen_look_around
             1.56%  free_unref_page_list
             1.40%  memset
      
        Configurations:
          no change
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-8-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Barry Song <baohua@kernel.org>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      018ee47f
    • Y
      mm: multi-gen LRU: minimal implementation · ac35a490
      Yu Zhao committed
      To avoid confusion, the terms "promotion" and "demotion" will be applied
      to the multi-gen LRU, as a new convention; the terms "activation" and
      "deactivation" will be applied to the active/inactive LRU, as usual.
      
      The aging produces young generations.  Given an lruvec, it increments
      max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS.  The aging promotes
      hot pages to the youngest generation when it finds them accessed through
      page tables; the demotion of cold pages happens consequently when it
      increments max_seq.  Promotion in the aging path does not involve any LRU
      list operations, only the updates of the gen counter and
      lrugen->nr_pages[]; demotion, unless as the result of the increment of
      max_seq, requires LRU list operations, e.g., lru_deactivate_fn().  The
      aging has the complexity O(nr_hot_pages), since it is only interested in
      hot pages.
      
      The eviction consumes old generations.  Given an lruvec, it increments
      min_seq when lrugen->lists[] indexed by min_seq%MAX_NR_GENS becomes empty.
      A feedback loop modeled after the PID controller monitors refaults over
      anon and file types and decides which type to evict when both types are
      available from the same generation.
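
      The interplay of the two counters described above can be sketched in
      userspace as follows.  This is a simplified model under stated
      assumptions: a single page type (the real code keeps min_seq[]
      separately for anon and file), MAX_NR_GENS assumed to be 4, and
      hypothetical toy_* names.  The refault feedback loop is not modeled.

```c
#include <assert.h>
#include <stdbool.h>

#define MIN_NR_GENS 2UL
#define MAX_NR_GENS 4UL /* assumed value for this sketch */

/* Hypothetical, single-type stand-in for the per-lruvec state. */
struct toy_lrugen {
	unsigned long max_seq;               /* youngest generation */
	unsigned long min_seq;               /* oldest generation */
	unsigned long nr_pages[MAX_NR_GENS]; /* indexed by seq % MAX_NR_GENS */
};

unsigned long toy_nr_gens(const struct toy_lrugen *g)
{
	return g->max_seq - g->min_seq + 1;
}

/* Aging: produce a new youngest generation once only MIN_NR_GENS remain. */
bool toy_inc_max_seq(struct toy_lrugen *g)
{
	if (toy_nr_gens(g) > MIN_NR_GENS)
		return false;
	g->max_seq++;
	return true;
}

/*
 * Eviction: consume the oldest generation when its list is empty.  The
 * guard that keeps at least MIN_NR_GENS generations is an assumption of
 * this sketch.
 */
bool toy_inc_min_seq(struct toy_lrugen *g)
{
	bool moved = false;

	while (toy_nr_gens(g) > MIN_NR_GENS &&
	       g->nr_pages[g->min_seq % MAX_NR_GENS] == 0) {
		g->min_seq++;
		moved = true;
	}
	return moved;
}

/* Usage example: two empty old generations are consumed, then aging runs. */
void toy_lrugen_example(void)
{
	struct toy_lrugen g = { .max_seq = 3, .min_seq = 0,
				.nr_pages = { 0, 0, 5, 7 } };
	bool aged, evicted;

	aged = toy_inc_max_seq(&g); /* four generations: nothing to do */
	assert(!aged);

	evicted = toy_inc_min_seq(&g);
	assert(evicted && g.min_seq == 2);

	aged = toy_inc_max_seq(&g); /* down to MIN_NR_GENS: age now */
	assert(aged && g.max_seq == 4);
}
```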
      
      The protection of pages accessed multiple times through file descriptors
      takes place in the eviction path.  Each generation is divided into
      multiple tiers.  A page accessed N times through file descriptors is in
      tier order_base_2(N).  Tiers do not have dedicated lrugen->lists[], only
      bits in folio->flags.  The aforementioned feedback loop also monitors
      refaults over all tiers and decides when to protect pages in which tiers
      (N>1), using the first tier (N=0,1) as a baseline.  The first tier
      contains single-use unmapped clean pages, which are most likely the best
      choices.  In contrast to promotion in the aging path, the protection of a
      page in the eviction path is achieved by moving this page to the next
      generation, i.e., min_seq+1, if the feedback loop decides so.  This
      approach has the following advantages:
      
      1. It removes the cost of activation in the buffered access path by
         inferring whether pages accessed multiple times through file
         descriptors are statistically hot and thus worth protecting in the
         eviction path.
      2. It takes pages accessed through page tables into account and avoids
         overprotecting pages accessed multiple times through file
         descriptors. (Pages accessed through page tables are in the first
         tier, since N=0.)
      3. More tiers provide better protection for pages accessed more than
         twice through file descriptors, when under heavy buffered I/O
         workloads.
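
      To make the tier mapping concrete, here is a small userspace sketch of
      order_base_2(N), i.e., ceil(log2(N)) with N=0 and N=1 both mapping to
      0.  The helper names are hypothetical and the loop-based implementation
      is only meant to be readable; it is not the kernel macro.

```c
#include <assert.h>

/* ceil(log2(n)), with n = 0 and n = 1 both yielding 0. */
unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	if (n < 2)
		return 0;
	for (n--; n; n >>= 1)
		order++;
	return order;
}

/* Tier of a page accessed n times through file descriptors. */
unsigned int toy_tier_from_accesses(unsigned long n)
{
	return toy_order_base_2(n);
}

/* Usage example: the first tier covers N=0,1; later tiers widen. */
void toy_tier_example(void)
{
	assert(toy_tier_from_accesses(0) == 0);
	assert(toy_tier_from_accesses(1) == 0);
	assert(toy_tier_from_accesses(2) == 1);
	assert(toy_tier_from_accesses(3) == 2);
	assert(toy_tier_from_accesses(4) == 2);
	assert(toy_tier_from_accesses(5) == 3);
}
```

      Because the tiers grow logarithmically, a few bits in folio->flags are
      enough to distinguish lightly reused pages from heavily reused ones.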
      
      Server benchmark results:
        Single workload:
          fio (buffered I/O): +[30, 32]%
                      IOPS         BW
            5.19-rc1: 2673k        10.2GiB/s
            patch1-6: 3491k        13.3GiB/s
      
        Single workload:
          memcached (anon): -[4, 6]%
                      Ops/sec      KB/sec
            5.19-rc1: 1161501.04   45177.25
            patch1-6: 1106168.46   43025.04
      
        Configurations:
          CPU: two Xeon 6154
          Mem: total 256G
      
          Node 1 was only used as a ram disk to reduce the variance in the
          results.
      
          patch drivers/block/brd.c <<EOF
          99,100c99,100
          < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
          < 	page = alloc_page(gfp_flags);
          ---
          > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
          > 	page = alloc_pages_node(1, gfp_flags, 0);
          EOF
      
          cat >>/etc/systemd/system.conf <<EOF
          CPUAffinity=numa
          NUMAPolicy=bind
          NUMAMask=0
          EOF
      
          cat >>/etc/memcached.conf <<EOF
          -m 184320
          -s /var/run/memcached/memcached.sock
          -a 0766
          -t 36
          -B binary
          EOF
      
          cat fio.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkfs.ext4 /dev/ram0
          mount -t ext4 /dev/ram0 /mnt
      
          mkdir /sys/fs/cgroup/user.slice/test
          echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
          echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
          fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
            --buffered=1 --ioengine=io_uring --iodepth=128 \
            --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
            --rw=randread --random_distribution=random --norandommap \
            --time_based --ramp_time=10m --runtime=5m --group_reporting
      
          cat memcached.sh
          modprobe brd rd_nr=1 rd_size=113246208
          swapoff -a
          mkswap /dev/ram0
          swapon /dev/ram0
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
            --ratio 1:0 --pipeline 8 -d 2000
      
          memtier_benchmark -S /var/run/memcached/memcached.sock \
            -P memcache_binary -n allkeys --key-minimum=1 \
            --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
            --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
      
      Client benchmark results:
        kswapd profiles:
          5.19-rc1
            40.33%  page_vma_mapped_walk (overhead)
            21.80%  lzo1x_1_do_compress (real work)
             7.53%  do_raw_spin_lock
             3.95%  _raw_spin_unlock_irq
             2.52%  vma_interval_tree_iter_next
             2.37%  folio_referenced_one
             2.28%  vma_interval_tree_subtree_search
             1.97%  anon_vma_interval_tree_iter_first
             1.60%  ptep_clear_flush
             1.06%  __zram_bvec_write
      
          patch1-6
            39.03%  lzo1x_1_do_compress (real work)
            18.47%  page_vma_mapped_walk (overhead)
             6.74%  _raw_spin_unlock_irq
             3.97%  do_raw_spin_lock
             2.49%  ptep_clear_flush
             2.48%  anon_vma_interval_tree_iter_first
             1.92%  folio_referenced_one
             1.88%  __zram_bvec_write
             1.48%  memmove
             1.31%  vma_interval_tree_iter_next
      
        Configurations:
          CPU: single Snapdragon 7c
          Mem: total 4G
      
          ChromeOS MemoryPressure [1]
      
      [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-7-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ac35a490
    • Y
      mm: multi-gen LRU: groundwork · ec1c86b2
      Yu Zhao committed
      Evictable pages are divided into multiple generations for each lruvec.
      The youngest generation number is stored in lrugen->max_seq for both
      anon and file types as they are aged on an equal footing. The oldest
      generation numbers are stored in lrugen->min_seq[] separately for anon
      and file types as clean file pages can be evicted regardless of swap
      constraints. These three variables are monotonically increasing.
      
      Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
      in order to fit into the gen counter in folio->flags. Each truncated
      generation number is an index to lrugen->lists[]. The sliding window
      technique is used to track at least MIN_NR_GENS and at most
      MAX_NR_GENS generations. The gen counter stores a value within [1,
      MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
      stores 0.
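
      The truncation described above can be sketched in userspace as
      follows.  MAX_NR_GENS is assumed to be 4 here, and the toy_* helper
      names are hypothetical; the sketch only shows the arithmetic, not how
      the bits are packed into folio->flags.

```c
#include <assert.h>

#define MAX_NR_GENS 4UL /* assumed value for this sketch */

/* ceil(log2(n)), with n = 0 and n = 1 both yielding 0. */
unsigned int toy_order_base_2(unsigned long n)
{
	unsigned int order = 0;

	if (n < 2)
		return 0;
	for (n--; n; n >>= 1)
		order++;
	return order;
}

/* Truncated generation number: an index into lrugen->lists[]. */
unsigned long toy_gen_from_seq(unsigned long seq)
{
	return seq % MAX_NR_GENS;
}

/*
 * Value held by the gen counter while a page is on one of the lists.
 * It lies within [1, MAX_NR_GENS]; 0 is reserved for "not on a list".
 */
unsigned long toy_gen_counter_from_seq(unsigned long seq)
{
	return toy_gen_from_seq(seq) + 1;
}

/* Number of bits needed to hold the values 0..MAX_NR_GENS. */
unsigned int toy_gen_counter_width(void)
{
	return toy_order_base_2(MAX_NR_GENS + 1);
}

/* Usage example: sequence numbers wrap around the list index. */
void toy_gen_example(void)
{
	assert(toy_gen_counter_width() == 3);
	assert(toy_gen_from_seq(7) == 3);
	assert(toy_gen_counter_from_seq(7) == 4);
	assert(toy_gen_from_seq(8) == 0);
	assert(toy_gen_counter_from_seq(8) == 1);
}
```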
      
      There are two conceptually independent procedures: "the aging", which
      produces young generations, and "the eviction", which consumes old
      generations.  They form a closed-loop system, i.e., "the page reclaim". 
      Both procedures can be invoked from userspace for the purposes of working
      set estimation and proactive reclaim.  These techniques are commonly used
      to optimize job scheduling (bin packing) in data centers [1][2].
      
      To avoid confusion, the terms "hot" and "cold" will be applied to the
      multi-gen LRU, as a new convention; the terms "active" and "inactive" will
      be applied to the active/inactive LRU, as usual.
      
      The protection of hot pages and the selection of cold pages are based
      on page access channels and patterns. There are two access channels:
      one through page tables and the other through file descriptors. The
      protection of the former channel is by design stronger because:
      1. The uncertainty in determining the access patterns of the former
         channel is higher due to the approximation of the accessed bit.
      2. The cost of evicting the former channel is higher due to the TLB
         flushes required and the likelihood of encountering the dirty bit.
      3. The penalty of underprotecting the former channel is higher because
         applications usually do not prepare themselves for major page
         faults like they do for blocked I/O. E.g., GUI applications
         commonly use dedicated I/O threads to avoid blocking rendering
         threads.
      
      There are also two access patterns: one with temporal locality and the
      other without.  For the reasons listed above, the former channel is
      assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
      present; the latter channel is assumed to follow the latter pattern unless
      outlying refaults have been observed [3][4].
      
      The next patch will address the "outlying refaults".  Three macros, i.e.,
      LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
      this patch to make the entire patchset less diffy.
      
      A page is added to the youngest generation on faulting.  The aging needs
      to check the accessed bit at least twice before handing this page over to
      the eviction.  The first check takes care of the accessed bit set on the
      initial fault; the second check makes sure this page has not been used
      since then.  This protocol, AKA second chance, requires a minimum of two
      generations, hence MIN_NR_GENS.
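
      A minimal sketch of this second-chance protocol, assuming a toy page
      with a single accessed bit (all names below are hypothetical and no
      generation bookkeeping is modeled):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page with only the state needed for the example. */
struct toy_page {
	bool accessed; /* accessed bit in the mapping PTE */
};

/* Faulting a page in sets the accessed bit. */
void toy_fault_in(struct toy_page *p)
{
	p->accessed = true;
}

/*
 * One aging check: test and clear the accessed bit.  Returns true if
 * the page was idle since the previous check and can therefore be
 * handed over to the eviction.
 */
bool toy_age(struct toy_page *p)
{
	if (p->accessed) {
		p->accessed = false;
		return false;
	}
	return true;
}

/* Usage example: a freshly faulted page survives exactly one check. */
void toy_second_chance_example(void)
{
	struct toy_page p = { false };
	bool idle;

	toy_fault_in(&p);
	idle = toy_age(&p); /* first check consumes the fault's bit */
	assert(!idle);
	idle = toy_age(&p); /* second check: not used since then */
	assert(idle);
}
```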
      
      [1] https://dl.acm.org/doi/10.1145/3297858.3304053
      [2] https://dl.acm.org/doi/10.1145/3503222.3507731
      [3] https://lwn.net/Articles/495543/
      [4] https://lwn.net/Articles/815342/
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Barry Song <baohua@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      ec1c86b2
    • Y
      mm/vmscan.c: refactor shrink_node() · f1e1a7be
      Yu Zhao committed
      This patch refactors shrink_node() to improve readability for the upcoming
      changes to mm/vmscan.c.
      
      Link: https://lkml.kernel.org/r/20220918080010.2920238-4-yuzhao@google.com
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: Barry Song <baohua@kernel.org>
      Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
      Acked-by: Brian Geffon <bgeffon@google.com>
      Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
      Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Acked-by: Steven Barrett <steven@liquorix.net>
      Acked-by: Suleiman Souhlal <suleiman@google.com>
      Tested-by: Daniel Byrne <djbyrne@mtu.edu>
      Tested-by: Donald Carr <d@chaos-reins.com>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
      Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
      Tested-by: Sofia Trinh <sofia.trinh@edi.works>
      Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Larabel <Michael@MichaelLarabel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qi Zheng <zhengqi.arch@bytedance.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f1e1a7be
  5. 12 Sep, 2022 4 commits
  6. 30 Jul, 2022 2 commits
  7. 04 Jul, 2022 11 commits
  8. 29 Jun, 2022 1 commit
  9. 26 May, 2022 1 commit
  10. 20 May, 2022 1 commit
    • M
      mm: don't be stuck to rmap lock on reclaim path · 6d4675e6
      Minchan Kim committed
      The rmap locks (i_mmap_rwsem and anon_vma->root->rwsem) can be contended
      under memory pressure if processes keep working on their vmas (e.g., fork,
      mmap, munmap).  This stalls the reclaim path.  In our real workload
      traces, we see kswapd waiting on the lock for 300ms+ (worst case, a
      second), which pushes other processes into direct reclaim, where they
      also get stuck on the lock.
      
      This patch makes the lru aging path use try_lock mode, like
      shrink_page_list, so the reclaim context keeps working on the next lru
      pages without getting stuck.  If it finds the rmap lock contended, it
      rotates the page back to the head of the lru in both active/inactive
      lrus to keep their behavior consistent, which is a basic starting point
      rather than adding more heuristics.
      
      Since this patch introduces a new "contended" field as an out-param along
      with the try_lock in-param in rmap_walk_control, the structure is no
      longer immutable when try_lock is set, so remove the const keywords on
      rmap related functions.  Since rmap walking is already an expensive
      operation, I doubt the const would bring a sizable benefit (and we didn't
      have it until 5.17).
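
      The try_lock/contended handshake can be sketched in userspace as
      follows.  This is only a model: a C11 atomic_flag stands in for the
      rmap rwsem, the toy_* names are hypothetical, and only the try_lock
      and contended fields are taken from the description above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical subset of rmap_walk_control used by this sketch. */
struct toy_rmap_walk_control {
	bool try_lock;  /* in: do not wait for the lock */
	bool contended; /* out: set when the lock could not be taken */
};

/*
 * Walk the rmap under `lock`.  In try_lock mode a busy lock is reported
 * through rwc->contended and the walk is skipped, so the caller can
 * rotate the page and move on; otherwise the caller waits (modeled by
 * spinning).  Returns true if the walk was performed.
 */
bool toy_rmap_walk(atomic_flag *lock, struct toy_rmap_walk_control *rwc)
{
	if (atomic_flag_test_and_set(lock)) {
		if (rwc->try_lock) {
			rwc->contended = true;
			return false;
		}
		while (atomic_flag_test_and_set(lock))
			; /* stand-in for sleeping on the rwsem */
	}

	/* ... test and clear accessed bits here ... */

	atomic_flag_clear(lock);
	return true;
}

/* Usage example: an uncontended walk, then one that finds the lock held. */
void toy_rmap_walk_example(void)
{
	atomic_flag lock = ATOMIC_FLAG_INIT;
	struct toy_rmap_walk_control rwc = { .try_lock = true,
					     .contended = false };
	bool walked;

	walked = toy_rmap_walk(&lock, &rwc);
	assert(walked && !rwc.contended);

	atomic_flag_test_and_set(&lock); /* someone else holds the lock */
	walked = toy_rmap_walk(&lock, &rwc);
	assert(!walked && rwc.contended);
}
```

      Because the callee writes to the control structure, it cannot be
      passed as const, which matches the reasoning in the paragraph above.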
      
      In a heavy app workload on Android, the trace shows the following
      statistics.  The patch almost removes rmap lock contention from the
      reclaim path.
      
      Martin Liu reported:
      
      Before:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
               1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
                601            0             601   145.544681        28817    198  rmap_walk_file
      
      After:
      
         max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
                NaN          NaN              NaN          NaN          NaN    0.0             NaN
                  0            0                0     0.127645            1     12  rmap_walk_file
      
      [minchan@kernel.org: add comment, per Matthew]
        Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
      Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Martin Liu <liumartin@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      6d4675e6