    mm: multi-gen LRU: minimal implementation · 9323b417
    Committed by Yu Zhao
    mainline inclusion
    from mainline-v6.1-rc1
    commit ac35a490
    category: feature
    bugzilla: https://gitee.com/openeuler/open-source-summer/issues/I55Z0L
    CVE: NA
    Reference: https://android-review.googlesource.com/c/kernel/common/+/2050911/10
    
    ----------------------------------------------------------------------
    
    To avoid confusion, the terms "promotion" and "demotion" will be
    applied to the multi-gen LRU, as a new convention; the terms
    "activation" and "deactivation" will be applied to the active/inactive
    LRU, as usual.
    
    The aging produces young generations. Given an lruvec, it increments
    max_seq when max_seq-min_seq+1 approaches MIN_NR_GENS. The aging
    promotes hot pages to the youngest generation when it finds them
    accessed through page tables; the demotion of cold pages happens
    consequently when it increments max_seq. The aging has the complexity
    O(nr_hot_pages), since it is only interested in hot pages. Promotion
    in the aging path does not require any LRU list operations, only the
    updates of the gen counter and lrugen->nr_pages[]; demotion, unless
    it happens as the result of the increment of max_seq, requires LRU
    list operations, e.g., lru_deactivate_fn().
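
    A self-contained userspace sketch of that bookkeeping follows. It is
    not kernel code: the struct, helpers and the exact aging trigger are
    simplified stand-ins; only the roles of max_seq, min_seq, MIN_NR_GENS,
    MAX_NR_GENS and nr_pages[] follow the description above.

      /* illustrative sketch only, not the kernel implementation */
      #include <stdbool.h>
      #include <stdio.h>

      #define MAX_NR_GENS 4UL
      #define MIN_NR_GENS 2UL

      struct lruvec_sketch {
          unsigned long max_seq;          /* youngest generation */
          unsigned long min_seq;          /* oldest generation */
          long nr_pages[MAX_NR_GENS];     /* per-generation page counts */
      };

      /* run the aging when the number of generations approaches MIN_NR_GENS */
      static bool need_aging(const struct lruvec_sketch *l)
      {
          return l->max_seq - l->min_seq + 1 <= MIN_NR_GENS;
      }

      /*
       * Promotion of a hot page found through page tables: only the page's
       * gen counter (kept in page->flags in the kernel) and nr_pages[]
       * change; no LRU list operation is needed in this path.
       */
      static unsigned long promote(struct lruvec_sketch *l, unsigned long old_gen)
      {
          unsigned long new_gen = l->max_seq % MAX_NR_GENS;

          l->nr_pages[old_gen]--;
          l->nr_pages[new_gen]++;
          return new_gen;     /* the caller stores this back into the page */
      }

      int main(void)
      {
          struct lruvec_sketch l = {
              .max_seq = 3, .min_seq = 2, .nr_pages = { [2] = 1 },
          };

          if (need_aging(&l))
              l.max_seq++;    /* demotes every page not promoted since */

          promote(&l, l.min_seq % MAX_NR_GENS);
          printf("generations now span [%lu, %lu]\n", l.min_seq, l.max_seq);
          return 0;
      }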
    
    The eviction consumes old generations. Given an lruvec, it increments
    min_seq when the lists indexed by min_seq%MAX_NR_GENS become empty. A
    feedback loop modeled after the PID controller monitors refaults over
    anon and file types and decides which type to evict when both types
    are available from the same generation.
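
    A userspace sketch of the type-selection idea follows. It is not the
    kernel's implementation: the real feedback loop is modeled after a PID
    controller and tracks more state; this only shows the proportional
    part, comparing recent refaults per evicted page for the two types.

      /* illustrative sketch only, not the kernel implementation */
      #include <stdbool.h>
      #include <stdio.h>

      struct ctrl_sketch {
          unsigned long refaulted;    /* refaults observed recently */
          unsigned long evicted;      /* pages evicted recently */
      };

      /*
       * Evict anon first if its recent refault rate is no higher than
       * file's; the rates are compared by cross-multiplication (+1 avoids
       * a zero denominator), so no division is needed.
       */
      static bool evict_anon_first(const struct ctrl_sketch *anon,
                                   const struct ctrl_sketch *file)
      {
          return anon->refaulted * (file->evicted + 1) <=
                 file->refaulted * (anon->evicted + 1);
      }

      int main(void)
      {
          struct ctrl_sketch anon = { .refaulted = 10, .evicted = 1000 };
          struct ctrl_sketch file = { .refaulted = 50, .evicted = 1000 };

          printf("evict %s first\n",
                 evict_anon_first(&anon, &file) ? "anon" : "file");
          return 0;
      }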
    
    Each generation is divided into multiple tiers. Tiers represent
    different ranges of numbers of accesses through file descriptors. A
    page accessed N times through file descriptors is in tier
    order_base_2(N). Tiers do not have dedicated lrugen->lists[], only
    bits in page->flags. In contrast to moving across generations, which
    requires the LRU lock, moving across tiers only involves operations on
    page->flags. The feedback loop also monitors refaults over all tiers
    and decides when to protect pages in which tiers (N>1), using the
    first tier (N=0,1) as a baseline. The first tier contains single-use
    unmapped clean pages, which are most likely the best choices. The
    eviction moves a page to the next generation, i.e., min_seq+1, if the
    feedback loop decides so. This approach has the following advantages:
    1. It removes the cost of activation in the buffered access path by
       inferring whether pages accessed multiple times through file
       descriptors are statistically hot and thus worth protecting in the
       eviction path.
    2. It takes pages accessed through page tables into account and avoids
       overprotecting pages accessed multiple times through file
       descriptors. (Pages accessed through page tables are in the first
       tier, since N=0.)
    3. More tiers provide better protection for pages accessed more than
       twice through file descriptors, when under heavy buffered I/O
       workloads.
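
    As a rough illustration of the tier indexing above, the userspace
    sketch below maps an access count N to tier order_base_2(N). It is
    not kernel code: in the kernel the count lives in a few bits of
    page->flags and MAX_NR_TIERS caps the tier; both are mimicked here
    with plain integers.

      /* illustrative sketch only, not the kernel implementation */
      #include <stdio.h>

      #define MAX_NR_TIERS 4U

      /* order_base_2(n): smallest k such that 2^k >= n; 0 for n <= 1 */
      static unsigned int tier_of(unsigned int nr_fd_accesses)
      {
          unsigned int tier = 0;

          while ((1U << tier) < nr_fd_accesses && tier < MAX_NR_TIERS - 1)
              tier++;
          return tier;
      }

      int main(void)
      {
          /* N=0,1 -> tier 0 (the baseline); N=2 -> tier 1; N=3,4 -> tier 2 */
          for (unsigned int n = 0; n <= 4; n++)
              printf("N=%u -> tier %u\n", n, tier_of(n));
          return 0;
      }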
    
    Server benchmark results:
      Single workload:
        fio (buffered I/O): +[38, 40]%
                             IOPS         BW
          5.18-ed464352:     2547k        9989MiB/s
          patch1-6:          3540k        13.5GiB/s
    
      Single workload:
        memcached (anon): +[103, 107]%
                             Ops/sec      KB/sec
          5.18-ed464352:     469048.66    18243.91
          patch1-6:          964656.80    37520.88
    
      Configurations:
        CPU: two Xeon 6154
        Mem: total 256G
    
        Node 1 was only used as a ram disk to reduce the variance in the
        results.
    
        patch drivers/block/brd.c <<EOF
        99,100c99,100
        < 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM;
        < 	page = alloc_page(gfp_flags);
        ---
        > 	gfp_flags = GFP_NOIO | __GFP_ZERO | __GFP_HIGHMEM | __GFP_THISNODE;
        > 	page = alloc_pages_node(1, gfp_flags, 0);
        EOF
    
        cat >>/etc/systemd/system.conf <<EOF
        CPUAffinity=numa
        NUMAPolicy=bind
        NUMAMask=0
        EOF
    
        cat >>/etc/memcached.conf <<EOF
        -m 184320
        -s /var/run/memcached/memcached.sock
        -a 0766
        -t 36
        -B binary
        EOF
    
        cat fio.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkfs.ext4 /dev/ram0
        mount -t ext4 /dev/ram0 /mnt
    
        mkdir /sys/fs/cgroup/user.slice/test
        echo 38654705664 >/sys/fs/cgroup/user.slice/test/memory.max
        echo $$ >/sys/fs/cgroup/user.slice/test/cgroup.procs
        fio -name=mglru --numjobs=72 --directory=/mnt --size=1408m \
          --buffered=1 --ioengine=io_uring --iodepth=128 \
          --iodepth_batch_submit=32 --iodepth_batch_complete=32 \
          --rw=randread --random_distribution=random --norandommap \
          --time_based --ramp_time=10m --runtime=5m --group_reporting
    
        cat memcached.sh
        modprobe brd rd_nr=1 rd_size=113246208
        swapoff -a
        mkswap /dev/ram0
        swapon /dev/ram0
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=P:P -c 1 -t 36 \
          --ratio 1:0 --pipeline 8 -d 2000
    
        memtier_benchmark -S /var/run/memcached/memcached.sock \
          -P memcache_binary -n allkeys --key-minimum=1 \
          --key-maximum=65000000 --key-pattern=R:R -c 1 -t 36 \
          --ratio 0:1 --pipeline 8 --randomize --distinct-client-seed
    
    Client benchmark results:
      kswapd profiles:
        5.18-ed464352
          39.56%  page_vma_mapped_walk
          19.32%  lzo1x_1_do_compress (real work)
           7.18%  do_raw_spin_lock
           4.23%  _raw_spin_unlock_irq
           2.26%  vma_interval_tree_subtree_search
           2.12%  vma_interval_tree_iter_next
           2.11%  folio_referenced_one
           1.90%  anon_vma_interval_tree_iter_first
           1.47%  ptep_clear_flush
           0.97%  __anon_vma_interval_tree_subtree_search
    
        patch1-6
          36.13%  lzo1x_1_do_compress (real work)
          19.16%  page_vma_mapped_walk
           6.55%  _raw_spin_unlock_irq
           4.02%  do_raw_spin_lock
           2.32%  anon_vma_interval_tree_iter_first
           2.11%  ptep_clear_flush
           1.76%  __zram_bvec_write
           1.64%  folio_referenced_one
           1.40%  memmove
           1.35%  obj_malloc
    
      Configurations:
        CPU: single Snapdragon 7c
        Mem: total 4G
    
        Chrome OS MemoryPressure [1]
    
    [1] https://chromium.googlesource.com/chromiumos/platform/tast-tests/
    
    Link: https://lore.kernel.org/r/20220309021230.721028-7-yuzhao@google.com/
    Signed-off-by: Yu Zhao <yuzhao@google.com>
    Acked-by: Brian Geffon <bgeffon@google.com>
    Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
    Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
    Acked-by: Steven Barrett <steven@liquorix.net>
    Acked-by: Suleiman Souhlal <suleiman@google.com>
    Tested-by: Daniel Byrne <djbyrne@mtu.edu>
    Tested-by: Donald Carr <d@chaos-reins.com>
    Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
    Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
    Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
    Tested-by: Sofia Trinh <sofia.trinh@edi.works>
    Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
    Bug: 227651406
    Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
    Change-Id: I3fe4850006d7984cd9f4fd46134b826609dc2f86
    Signed-off-by: YuLinjia <3110442349@qq.com>