    sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs · edd5e1ef
    Submitted by Guan Jing
    mainline inclusion
    from mainline-v5.18-rc1
    commit e496132e
    category: feature
    bugzilla: https://gitee.com/openeuler/kernel/issues/I78WM8
    
    Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.4-rc3&id=e496132ebedd870b67f1f6d2428f9bb9d7ae27fd
    
    --------------------------------
    
    Commit 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA
    nodes") allowed an imbalance between NUMA nodes such that communicating
    tasks would not be pulled apart by the load balancer. This works fine when
    there is a 1:1 relationship between LLC and node but can be suboptimal
    for multiple LLCs if independent tasks prematurely use CPUs sharing cache.
    
    Zen* has multiple LLCs per node with local memory channels and due to
    the allowed imbalance, it's far harder to tune some workloads to run
    optimally than it is on hardware that has 1 LLC per node. This patch
    allows an imbalance to exist up to the point where LLCs should be balanced
    between nodes.
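
    Concretely, the upstream change records a per-domain allowed imbalance
    (imb_numa_nr): with one LLC per node the tolerated imbalance is 25% of
    the node, with multiple LLCs it is roughly one task per LLC. The
    userspace sketch below models that heuristic; it mirrors the upstream
    logic but is not the kernel code verbatim, and the Zen3-like topology
    numbers in main() are illustrative only.

    #include <stdio.h>

    /* Matches NUMA_IMBALANCE_MIN in kernel/sched/fair.c */
    #define NUMA_IMBALANCE_MIN 2

    /*
     * Allowed imbalance for a node spanning node_cpus CPUs with nr_llcs
     * LLCs. One LLC per node: tolerate up to 25% of the node (a cutoff
     * based on SMT-2). Multiple LLCs: tolerate about one task per LLC so
     * independent tasks spread across caches before any LLC is shared.
     */
    static int imb_numa_nr(int node_cpus, int nr_llcs)
    {
            if (nr_llcs == 1)
                    return node_cpus >> 2;
            return nr_llcs;
    }

    /*
     * Ignore a small imbalance while the destination node runs no more
     * tasks than the allowed imbalance; otherwise report it unchanged.
     */
    static long adjust_numa_imbalance(long imbalance, int dst_running,
                                      int imb_nr)
    {
            if (dst_running > imb_nr)
                    return imbalance;
            if (imbalance <= NUMA_IMBALANCE_MIN)
                    return 0;
            return imbalance;
    }

    int main(void)
    {
            /* Hypothetical Zen3-like node: 64 CPUs split across 8 LLCs. */
            int imb = imb_numa_nr(64, 8);

            printf("allowed imbalance: %d tasks\n", imb);
            /* A pair of communicating tasks is left where it is... */
            printf("imbalance 2, 4 running -> %ld\n",
                   adjust_numa_imbalance(2, 4, imb));
            /* ...but once more tasks run than there are LLCs, balance. */
            printf("imbalance 2, 9 running -> %ld\n",
                   adjust_numa_imbalance(2, 9, imb));
            return 0;
    }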
    
    On a Zen3 machine running STREAM parallelised with OMP to have one
    instance per LLC and without binding, the results are
    
                                5.17.0-rc0             5.17.0-rc0
                                   vanilla       sched-numaimb-v6
    MB/sec copy-16    162596.94 (   0.00%)   580559.74 ( 257.05%)
    MB/sec scale-16   136901.28 (   0.00%)   374450.52 ( 173.52%)
    MB/sec add-16     157300.70 (   0.00%)   564113.76 ( 258.62%)
    MB/sec triad-16   151446.88 (   0.00%)   564304.24 ( 272.61%)
    
    STREAM can use directives to force the spread if the OpenMP
    implementation is new enough, but that doesn't help if an application
    uses threads and it's not known in advance how many threads will be
    created.
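
    The directive alluded to here is OpenMP's proc_bind clause (or the
    OMP_PROC_BIND/OMP_PLACES environment variables), available since
    OpenMP 4.0. A minimal illustration, not taken from STREAM itself:

    #define _GNU_SOURCE
    #include <omp.h>
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
            /*
             * proc_bind(spread) asks the runtime to place threads as far
             * apart as possible (e.g. one per LLC) instead of packing
             * them. Build with: gcc -fopenmp spread.c
             */
            #pragma omp parallel proc_bind(spread)
            printf("thread %d on CPU %d\n",
                   omp_get_thread_num(), sched_getcpu());
            return 0;
    }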
    
    Coremark is a CPU and cache intensive benchmark parallelised with
    threads. When running with 1 thread per core, the vanilla kernel
    allows threads to contend on cache. With the patch:
    
                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v5
    Min       Score-16   368239.36 (   0.00%)   389816.06 (   5.86%)
    Hmean     Score-16   388607.33 (   0.00%)   427877.08 *  10.11%*
    Max       Score-16   408945.69 (   0.00%)   481022.17 (  17.62%)
    Stddev    Score-16    15247.04 (   0.00%)    24966.82 ( -63.75%)
    CoeffVar  Score-16        3.92 (   0.00%)        5.82 ( -48.48%)
    
    It can also make a big difference for semi-realistic workloads
    like specjbb which can execute arbitrary numbers of threads without
    advance knowledge of how they should be placed. Even in cases where
    the average performance is neutral, the results are more stable.
    
                                   5.17.0-rc0             5.17.0-rc0
                                      vanilla       sched-numaimb-v6
    Hmean     tput-1      71631.55 (   0.00%)    73065.57 (   2.00%)
    Hmean     tput-8     582758.78 (   0.00%)   556777.23 (  -4.46%)
    Hmean     tput-16   1020372.75 (   0.00%)  1009995.26 (  -1.02%)
    Hmean     tput-24   1416430.67 (   0.00%)  1398700.11 (  -1.25%)
    Hmean     tput-32   1687702.72 (   0.00%)  1671357.04 (  -0.97%)
    Hmean     tput-40   1798094.90 (   0.00%)  2015616.46 *  12.10%*
    Hmean     tput-48   1972731.77 (   0.00%)  2333233.72 (  18.27%)
    Hmean     tput-56   2386872.38 (   0.00%)  2759483.38 (  15.61%)
    Hmean     tput-64   2909475.33 (   0.00%)  2925074.69 (   0.54%)
    Hmean     tput-72   2585071.36 (   0.00%)  2962443.97 (  14.60%)
    Hmean     tput-80   2994387.24 (   0.00%)  3015980.59 (   0.72%)
    Hmean     tput-88   3061408.57 (   0.00%)  3010296.16 (  -1.67%)
    Hmean     tput-96   3052394.82 (   0.00%)  2784743.41 (  -8.77%)
    Hmean     tput-104  2997814.76 (   0.00%)  2758184.50 (  -7.99%)
    Hmean     tput-112  2955353.29 (   0.00%)  2859705.09 (  -3.24%)
    Hmean     tput-120  2889770.71 (   0.00%)  2764478.46 (  -4.34%)
    Hmean     tput-128  2871713.84 (   0.00%)  2750136.73 (  -4.23%)
    Stddev    tput-1       5325.93 (   0.00%)     2002.53 (  62.40%)
    Stddev    tput-8       6630.54 (   0.00%)    10905.00 ( -64.47%)
    Stddev    tput-16     25608.58 (   0.00%)     6851.16 (  73.25%)
    Stddev    tput-24     12117.69 (   0.00%)     4227.79 (  65.11%)
    Stddev    tput-32     27577.16 (   0.00%)     8761.05 (  68.23%)
    Stddev    tput-40     59505.86 (   0.00%)     2048.49 (  96.56%)
    Stddev    tput-48    168330.30 (   0.00%)    93058.08 (  44.72%)
    Stddev    tput-56    219540.39 (   0.00%)    30687.02 (  86.02%)
    Stddev    tput-64    121750.35 (   0.00%)     9617.36 (  92.10%)
    Stddev    tput-72    223387.05 (   0.00%)    34081.13 (  84.74%)
    Stddev    tput-80    128198.46 (   0.00%)    22565.19 (  82.40%)
    Stddev    tput-88    136665.36 (   0.00%)    27905.97 (  79.58%)
    Stddev    tput-96    111925.81 (   0.00%)    99615.79 (  11.00%)
    Stddev    tput-104   146455.96 (   0.00%)    28861.98 (  80.29%)
    Stddev    tput-112    88740.49 (   0.00%)    58288.23 (  34.32%)
    Stddev    tput-120   186384.86 (   0.00%)    45812.03 (  75.42%)
    Stddev    tput-128    78761.09 (   0.00%)    57418.48 (  27.10%)
    
    Similarly, for embarrassingly parallel problems like NPB-ep, there are
    improvements due to better spreading across LLCs when the machine is not
    fully utilised.
    
                                  vanilla       sched-numaimb-v6
    Min       ep.D       31.79 (   0.00%)       26.11 (  17.87%)
    Amean     ep.D       31.86 (   0.00%)       26.17 *  17.86%*
    Stddev    ep.D        0.07 (   0.00%)        0.05 (  24.41%)
    CoeffVar  ep.D        0.22 (   0.00%)        0.20 (   7.97%)
    Max       ep.D       31.93 (   0.00%)       26.21 (  17.91%)
    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
    Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
    Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
    Signed-off-by: Guan Jing <guanjing6@huawei.com>