1. 18 11月, 2021 6 次提交
    • S
      Documentation/vm: move user guides to admin-guide/mm/ · c2caaf45
      SeongJae Park 提交于
      mainline inclusion
      from mainline-5.16
      commit ad782c48
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
      CVE: NA
      
      -------------------------------------------------
      Most memory management user guide documents are in 'admin-guide/mm/',
      but two of those are in 'vm/'.  This moves the two docs into
      'admin-guide/mm' for easier documents finding.
      
      Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.orgSigned-off-by: NSeongJae Park <sjpark@amazon.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit ad782c48)
      Signed-off-by: NYue Zou <zouyue3@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      c2caaf45
    • G
      mm/damon: grammar s/works/work/ · 9f1fbced
      Geert Uytterhoeven 提交于
      mainline inclusion
      from mainline-5.16
      commit f24b0626
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
      CVE: NA
      
      -------------------------------------------------
      Correct a singular versus plural grammar mistake in the help text for
      the DAMON_VADDR config symbol.
      
      Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org
      Fixes: 3f49584b ("mm/damon: implement primitives for the virtual memory address spaces")
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Reviewed-by: NSeongJae Park <sjpark@amazon.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit f24b0626)
      Signed-off-by: NYue Zou <zouyue3@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      9f1fbced
    • S
      mm/damon/core-test: fix wrong expectations for 'damon_split_regions_of()' · a56fb752
      SeongJae Park 提交于
      mainline inclusion
      from mainline-5.15
      commit 2e014660
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
      CVE: NA
      
      -------------------------------------------------
      Kunit test cases for 'damon_split_regions_of()' expects the number of
      regions after calling the function will be same to their request
      ('nr_sub').  However, the requested number is just an upper-limit,
      because the function randomly decides the size of each sub-region.
      
      This fixes the wrong expectation.
      
      Link: https://lkml.kernel.org/r/20211028090628.14948-1-sj@kernel.org
      Fixes: 17ccae8b ("mm/damon: add kunit tests")
      Signed-off-by: NSeongJae Park <sj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 2e014660)
      Signed-off-by: NYue Zou <zouyue3@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      a56fb752
    • A
      mm/damon: don't use strnlen() with known-bogus source length · 53ead8d9
      Adam Borowski 提交于
      mainline inclusion
      from mainline-5.15-rc3
      commit 892ab4bb
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4GVMK
      CVE: NA
      
      -------------------------------------------------
      gcc knows the true length too, and rightfully complains.
      
      Link: https://lkml.kernel.org/r/20210912204447.10427-1-kilobyte@angband.plSigned-off-by: NAdam Borowski <kilobyte@angband.pl>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 892ab4bb)
      Signed-off-by: NYue Zou <zouyue3@huawei.com>
      Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      53ead8d9
    • B
      sched: Add cluster scheduler level in core and related Kconfig for ARM64 · ac032ae3
      Barry Song 提交于
      mainline inclusion
      from tip/sched/core for v5.16
      commit: 778c558f
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4H33U
      CVE: NA
      Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
      
      ------------------------------------------------------------------------
      
      This patch adds scheduler level for clusters and automatically enables
      the load balance among clusters. It will directly benefit a lot of
      workload which loves more resources such as memory bandwidth, caches.
      
      Testing has widely been done in two different hardware configurations of
      Kunpeng920:
      
       24 cores in one NUMA(6 clusters in each NUMA node);
       32 cores in one NUMA(8 clusters in each NUMA node)
      
      Workload is running on either one NUMA node or four NUMA nodes, thus,
      this can estimate the effect of cluster spreading w/ and w/o NUMA load
      balance.
      
      * Stream benchmark:
      
      4threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     29929.64 (   0.00%)    32932.68 (  10.03%)
      MB/sec scale    29861.10 (   0.00%)    32710.58 (   9.54%)
      MB/sec add      27034.42 (   0.00%)    32400.68 (  19.85%)
      MB/sec triad    27225.26 (   0.00%)    31965.36 (  17.41%)
      
      6threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     40330.24 (   0.00%)    42377.68 (   5.08%)
      MB/sec scale    40196.42 (   0.00%)    42197.90 (   4.98%)
      MB/sec add      37427.00 (   0.00%)    41960.78 (  12.11%)
      MB/sec triad    37841.36 (   0.00%)    42513.64 (  12.35%)
      
      12threads stream (on 1NUMA * 24cores = 24cores)
                      stream                 stream
                      w/o patch              w/ patch
      MB/sec copy     52639.82 (   0.00%)    53818.04 (   2.24%)
      MB/sec scale    52350.30 (   0.00%)    53253.38 (   1.73%)
      MB/sec add      53607.68 (   0.00%)    55198.82 (   2.97%)
      MB/sec triad    54776.66 (   0.00%)    56360.40 (   2.89%)
      
      Thus, it could help memory-bound workload especially under medium load.
      Similar improvement is also seen in lkp-pbzip2:
      
      * lkp-pbzip2 benchmark
      
      2-96 threads (on 4NUMA * 24cores = 96cores)
                        lkp-pbzip2              lkp-pbzip2
                        w/o patch               w/ patch
      Hmean     tput-2   11062841.57 (   0.00%)  11341817.51 *   2.52%*
      Hmean     tput-5   26815503.70 (   0.00%)  27412872.65 *   2.23%*
      Hmean     tput-8   41873782.21 (   0.00%)  43326212.92 *   3.47%*
      Hmean     tput-12  61875980.48 (   0.00%)  64578337.51 *   4.37%*
      Hmean     tput-21 105814963.07 (   0.00%) 111381851.01 *   5.26%*
      Hmean     tput-30 150349470.98 (   0.00%) 156507070.73 *   4.10%*
      Hmean     tput-48 237195937.69 (   0.00%) 242353597.17 *   2.17%*
      Hmean     tput-79 360252509.37 (   0.00%) 362635169.23 *   0.66%*
      Hmean     tput-96 394571737.90 (   0.00%) 400952978.48 *   1.62%*
      
      2-24 threads (on 1NUMA * 24cores = 24cores)
                       lkp-pbzip2               lkp-pbzip2
                       w/o patch                w/ patch
      Hmean     tput-2   11071705.49 (   0.00%)  11296869.10 *   2.03%*
      Hmean     tput-4   20782165.19 (   0.00%)  21949232.15 *   5.62%*
      Hmean     tput-6   30489565.14 (   0.00%)  33023026.96 *   8.31%*
      Hmean     tput-8   40376495.80 (   0.00%)  42779286.27 *   5.95%*
      Hmean     tput-12  61264033.85 (   0.00%)  62995632.78 *   2.83%*
      Hmean     tput-18  86697139.39 (   0.00%)  86461545.74 (  -0.27%)
      Hmean     tput-24 104854637.04 (   0.00%) 104522649.46 *  -0.32%*
      
      In the case of 6 threads and 8 threads, we see the greatest performance
      improvement.
      
      Similar improvement can be seen on lkp-pixz though the improvement is
      smaller:
      
      * lkp-pixz benchmark
      
      2-24 threads lkp-pixz (on 1NUMA * 24cores = 24cores)
                        lkp-pixz               lkp-pixz
                        w/o patch              w/ patch
      Hmean     tput-2   6486981.16 (   0.00%)  6561515.98 *   1.15%*
      Hmean     tput-4  11645766.38 (   0.00%) 11614628.43 (  -0.27%)
      Hmean     tput-6  15429943.96 (   0.00%) 15957350.76 *   3.42%*
      Hmean     tput-8  19974087.63 (   0.00%) 20413746.98 *   2.20%*
      Hmean     tput-12 28172068.18 (   0.00%) 28751997.06 *   2.06%*
      Hmean     tput-18 39413409.54 (   0.00%) 39896830.55 *   1.23%*
      Hmean     tput-24 49101815.85 (   0.00%) 49418141.47 *   0.64%*
      
      * SPECrate benchmark
      
      4,8,16 copies mcf_r(on 1NUMA * 32cores = 32cores)
      		Base     	 	Base
      		Run Time   	 	Rate
      		-------  	 	---------
      4 Copies	w/o 580 (w/ 570)       	w/o 11.1 (w/ 11.3)
      8 Copies	w/o 647 (w/ 605)       	w/o 20.0 (w/ 21.4, +7%)
      16 Copies	w/o 844 (w/ 844)       	w/o 30.6 (w/ 30.6)
      
      32 Copies(on 4NUMA * 32 cores = 128cores)
      [w/o patch]
                       Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        584       87.2  *
      502.gcc_r            32        503       90.2  *
      505.mcf_r            32        745       69.4  *
      520.omnetpp_r        32       1031       40.7  *
      523.xalancbmk_r      32        597       56.6  *
      525.x264_r            1         --            CE
      531.deepsjeng_r      32        336      109    *
      541.leela_r          32        556       95.4  *
      548.exchange2_r      32        513      163    *
      557.xz_r             32        530       65.2  *
       Est. SPECrate2017_int_base              80.3
      
      [w/ patch]
                        Base     Base        Base
      Benchmarks       Copies  Run Time     Rate
      --------------- -------  ---------  ---------
      500.perlbench_r      32        580      87.8 (+0.688%)  *
      502.gcc_r            32        477      95.1 (+5.432%)  *
      505.mcf_r            32        644      80.3 (+13.574%) *
      520.omnetpp_r        32        942      44.6 (+9.58%)   *
      523.xalancbmk_r      32        560      60.4 (+6.714%%) *
      525.x264_r            1         --           CE
      531.deepsjeng_r      32        337      109  (+0.000%) *
      541.leela_r          32        554      95.6 (+0.210%) *
      548.exchange2_r      32        515      163  (+0.000%) *
      557.xz_r             32        524      66.0 (+1.227%) *
       Est. SPECrate2017_int_base              83.7 (+4.062%)
      
      On the other hand, it is slightly helpful to CPU-bound tasks like
      kernbench:
      
      * 24-96 threads kernbench (on 4NUMA * 24cores = 96cores)
                           kernbench              kernbench
                           w/o cluster            w/ cluster
      Min       user-24    12054.67 (   0.00%)    12024.19 (   0.25%)
      Min       syst-24     1751.51 (   0.00%)     1731.68 (   1.13%)
      Min       elsp-24      600.46 (   0.00%)      598.64 (   0.30%)
      Min       user-48    12361.93 (   0.00%)    12315.32 (   0.38%)
      Min       syst-48     1917.66 (   0.00%)     1892.73 (   1.30%)
      Min       elsp-48      333.96 (   0.00%)      332.57 (   0.42%)
      Min       user-96    12922.40 (   0.00%)    12921.17 (   0.01%)
      Min       syst-96     2143.94 (   0.00%)     2110.39 (   1.56%)
      Min       elsp-96      211.22 (   0.00%)      210.47 (   0.36%)
      Amean     user-24    12063.99 (   0.00%)    12030.78 *   0.28%*
      Amean     syst-24     1755.20 (   0.00%)     1735.53 *   1.12%*
      Amean     elsp-24      601.60 (   0.00%)      600.19 (   0.23%)
      Amean     user-48    12362.62 (   0.00%)    12315.56 *   0.38%*
      Amean     syst-48     1921.59 (   0.00%)     1894.95 *   1.39%*
      Amean     elsp-48      334.10 (   0.00%)      332.82 *   0.38%*
      Amean     user-96    12925.27 (   0.00%)    12922.63 (   0.02%)
      Amean     syst-96     2146.66 (   0.00%)     2122.20 *   1.14%*
      Amean     elsp-96      211.96 (   0.00%)      211.79 (   0.08%)
      
      Note this patch isn't an universal win, it might hurt those workload
      which can benefit from packing. Though tasks which want to take
      advantages of lower communication latency of one cluster won't
      necessarily been packed in one cluster while kernel is not aware of
      clusters, they have some chance to be randomly packed. But this
      patch will make them more likely spread.
      Signed-off-by: NBarry Song <song.bao.hua@hisilicon.com>
      Tested-by: NYicong Yang <yangyicong@hisilicon.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NYicong Yang <yangyicong@hisilicon.com>
      Reviewed-by: Ntao zeng <prime.zeng@hisilicon.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      ac032ae3
    • J
      topology: Represent clusters of CPUs within a die · c84cd40a
      Jonathan Cameron 提交于
      mainline inclusion
      from tip/sched/core for v5.15-release
      commit: c5e22fef
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4GEZS
      CVE: NA
      Reference: https://lore.kernel.org/lkml/20210924085104.44806-1-21cnbao@gmail.com/
      
      ------------------------------------------------------------------------
      
      Both ACPI and DT provide the ability to describe additional layers of
      topology between that of individual cores and higher level constructs
      such as the level at which the last level cache is shared.
      In ACPI this can be represented in PPTT as a Processor Hierarchy
      Node Structure [1] that is the parent of the CPU cores and in turn
      has a parent Processor Hierarchy Nodes Structure representing
      a higher level of topology.
      
      For example Kunpeng 920 has 6 or 8 clusters in each NUMA node, and each
      cluster has 4 cpus. All clusters share L3 cache data, but each cluster
      has local L3 tag. On the other hand, each clusters will share some
      internal system bus.
      
      +-----------------------------------+                          +---------+
      |  +------+    +------+             +--------------------------+         |
      |  | CPU0 |    | cpu1 |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   +----+    L3     |         |         |
      |  +------+    +------+   cluster   |    |    tag    |         |         |
      |  | CPU2 |    | CPU3 |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   |    |    L3     |         |         |
      |  +------+    +------+             +----+    tag    |         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |   L3    |
                                                                     |   data  |
      +-----------------------------------+                          |         |
      |  +------+    +------+             |    +-----------+         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             +----+    L3     |         |         |
      |                                   |    |    tag    |         |         |
      |  +------+    +------+             |    |           |         |         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             +--------------------------+         |
      +-----------------------------------|                          |         |
      +-----------------------------------|                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |    +-----------+         |         |
      |  +------+    +------+             |    |           |         |         |
      |                                   +----+    L3     |         |         |
      |  +------+    +------+             |    |    tag    |         |         |
      |  |      |    |      |             |    |           |         |         |
      |  +------+    +------+             |    +-----------+         |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |   +-----------+          |         |
      |  +------+    +------+             |   |           |          |         |
      |                                   |   |    L3     |          |         |
      |  +------+    +------+             +---+    tag    |          |         |
      |  |      |    |      |             |   |           |          |         |
      |  +------+    +------+             |   +-----------+          |         |
      |                                   |                          |         |
      +-----------------------------------+                          |         |
      +-----------------------------------+                          |         |
      |  +------+    +------+             +--------------------------+         |
      |  |      |    |      |             |  +-----------+           |         |
      |  +------+    +------+             |  |           |           |         |
      |                                   |  |    L3     |           |         |
      |  +------+    +------+             +--+    tag    |           |         |
      |  |      |    |      |             |  |           |           |         |
      |  +------+    +------+             |  +-----------+           |         |
      |                                   |                          +---------+
      +-----------------------------------+
      
      That means spreading tasks among clusters will bring more bandwidth
      while packing tasks within one cluster will lead to smaller cache
      synchronization latency. So both kernel and userspace will have
      a chance to leverage this topology to deploy tasks accordingly to
      achieve either smaller cache latency within one cluster or an even
      distribution of load among clusters for higher throughput.
      
      This patch exposes cluster topology to both kernel and userspace.
      Libraried like hwloc will know cluster by cluster_cpus and related
      sysfs attributes. PoC of HWLOC support at [2].
      
      Note this patch only handle the ACPI case.
      
      Special consideration is needed for SMT processors, where it is
      necessary to move 2 levels up the hierarchy from the leaf nodes
      (thus skipping the processor core level).
      
      Note that arm64 / ACPI does not provide any means of identifying
      a die level in the topology but that may be unrelate to the cluster
      level.
      
      [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node
          structure (Type 0)
      [2] https://github.com/hisilicon/hwloc/tree/linux-clusterSigned-off-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
      Signed-off-by: NTian Tao <tiantao6@hisilicon.com>
      Signed-off-by: NBarry Song <song.bao.hua@hisilicon.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20210924085104.44806-2-21cnbao@gmail.comSigned-off-by: NYicong Yang <yangyicong@hisilicon.com>
      Reviewed-by: Ntao zeng <prime.zeng@hisilicon.com>
      Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
      c84cd40a
  2. 15 11月, 2021 34 次提交