1. 16 9月, 2020 13 次提交
    • S
      powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju 提交于
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
      core.
      - If primary reference domain happens to be the penultimate domain in
      the associativity domains device-tree property, then there are no
      coregroups. However if its not a penultimate domain, then there are
      coregroups. There can be more than one coregroup. For now we would be
      interested in the last or the smallest coregroups, i.e one sub-group
      per DIE.
      
      Currently there are no firmwares that are exposing this grouping. Hence
      allow the basis for grouping to be abstract.  Once the firmware starts
      using this grouping, code would be added to detect the type of grouping
      and adjust the sd domain flags accordingly.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
      f9f130ff
    • S
      powerpc/smp: Optimize start_secondary · caa8e29d
      Srikar Dronamraju 提交于
      In start_secondary, even if shared_cache was already set, system does a
      redundant match for cpumask. This redundant check can be removed by
      checking if shared_cache is already set.
      
      While here, localize the sibling_mask variable to within the if
      condition.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-7-srikar@linux.vnet.ibm.com
      caa8e29d
    • S
      powerpc/smp: Dont assume l2-cache to be superset of sibling · f6606cfd
      Srikar Dronamraju 提交于
      Current code assumes that cpumask of cpus sharing a l2-cache mask will
      always be a superset of cpu_sibling_mask.
      
      Lets stop that assumption. cpu_l2_cache_mask is a superset of
      cpu_sibling_mask if and only if shared_caches is set.
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200913171038.GB11808@linux.vnet.ibm.com
      f6606cfd
    • S
      powerpc/smp: Move topology fixups into a new function · 3c6032a8
      Srikar Dronamraju 提交于
      Move topology fixup based on the platform attributes into its own
      function which is called just before set_sched_topology.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-5-srikar@linux.vnet.ibm.com
      3c6032a8
    • S
      powerpc/smp: Move powerpc_topology above · 5e93f16a
      Srikar Dronamraju 提交于
      Just moving the powerpc_topology description above.
      This will help in using functions in this file and avoid declarations.
      
      No other functional changes
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-4-srikar@linux.vnet.ibm.com
      5e93f16a
    • S
      powerpc/smp: Merge Power9 topology with Power topology · 2ef0ca54
      Srikar Dronamraju 提交于
      A new sched_domain_topology_level was added just for Power9. However the
      same can be achieved by merging powerpc_topology with power9_topology
      and makes the code more simpler especially when adding a new sched
      domain.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-3-srikar@linux.vnet.ibm.com
      2ef0ca54
    • S
      powerpc/smp: Fix a warning under !NEED_MULTIPLE_NODES · d0fd24bb
      Srikar Dronamraju 提交于
      Fix a build warning in a non CONFIG_NEED_MULTIPLE_NODES
      "error: _numa_cpu_lookup_table_ undeclared"
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: NGautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-2-srikar@linux.vnet.ibm.com
      d0fd24bb
    • S
      powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju 提交于
      Currently Linux kernel with CONFIG_NUMA on a system with multiple
      possible nodes, marks node 0 as online at boot.  However in practice,
      there are systems which have node 0 as memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      with memory and CPUs. The existence of this dummy node which is cpuless and
      memoryless node can confuse users/scripts looking at output of lscpu /
      numactl.
      
      By marking, node 0 as offline, lets stop assuming that node 0 is
      always online. If node 0 has CPU or memory that are online, node 0 will
      again be set as online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a node with online CPUs/memory on node 0.
      (Same o/p with and without patch)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without the 2 commits, Powerpc system might crash.
      
      1. User space applications like Numactl, lscpu, that parse the sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that system was
      not able to use all the resources (i.e missing resources) or the system was
      not setup correctly.
      
      2. Also existence of dummy node also leads to inconsistent information. The
      number of online nodes is inconsistent with the information in the
      device-tree and resource-dump
      
      3. When the dummy node is present, single node non-Numa systems end up showing
      up as NUMA systems and numa_balancing gets enabled. This will mean we take
      the hit from the unnecessary numa hinting faults.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
      e75130f2
    • S
      powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju 提交于
      Node id queried from the static device tree may not
      be correct. For example: it may always show 0 on a shared processor.
      Hence prefer the node id queried from vphn and fallback on the device tree
      based node id if vphn query fails.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
      6398eaa2
    • S
      powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju 提交于
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not any cpus
      or memory attached to it. As per PAPR, node affinity of a cpu is only
      available once its present / online. For all cpus that are possible but
      not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc need
      to make sure all possible but not present cpu_to_node are set to a
      proper node.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
      a874f100
    • S
      powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju 提交于
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware i.e
         PowerVM can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to "ibm,current-associativity-domains" property. If the latter
      property is not available, use "ibm,max-associativity-domain" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned in page 833 / B.5.3 which is covered under under
      "Appendix B. System Binding" section
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on
      "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note the maximum nodes this platform can support is only 2 but the
      possible nodes is set to 32.
      
      This is important because lot of kernel and user space code allocate
      structures for all possible nodes leading to a lot of memory that is
      allocated but not used.
      
      I ran a simple experiment to create and destroy 100 memory cgroups on
      boot on a 8 node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
      67df7784
    • S
      powerpc/topology: Override cpu_smt_mask · f3232321
      Srikar Dronamraju 提交于
      On Power9, a pair of SMT4 cores can be presented by the firmware as a SMT8
      core for backward compatibility reasons, with the fusion of two SMT4 cores.
      Powerpc allows LPARs to be live migrated from Power8 to Power9.  Existing
      software developed/configured for Power8, expects to see a SMT8 core.
      
      In order to maintain userspace backward compatibility (with Power8 chips in
      case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
      has to be set to SMT8 core.
      
      cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
      Hence override the default cpu_smt_mask() to be powerpc specific
      allowing for better scheduling behaviour on Power.
      
      schbench
      (latency measured in usecs, so lesser is better)
      Without patch                   With patch
      Latency percentiles (usec)	Latency percentiles (usec)
      	50.0000th: 34           	50.0000th: 38
      	75.0000th: 47           	75.0000th: 52
      	90.0000th: 54           	90.0000th: 60
      	95.0000th: 57           	95.0000th: 64
      	*99.0000th: 62          	*99.0000th: 72
      	99.5000th: 65           	99.5000th: 75
      	99.9000th: 76           	99.9000th: 3452
      	min=0, max=9205         	min=0, max=9344
      
      schbench (With Cede disabled)
      Without patch                   With patch
      Latency percentiles (usec) 	Latency percentiles (usec)
      	50.0000th: 20           	50.0000th: 21
      	75.0000th: 28           	75.0000th: 29
      	90.0000th: 33           	90.0000th: 34
      	95.0000th: 35           	95.0000th: 37
      	*99.0000th: 40          	*99.0000th: 40
      	99.5000th: 48           	99.5000th: 42
      	99.9000th: 94           	99.9000th: 79
      	min=0, max=791          	min=0, max=791
      
      perf bench sched pipe
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      5.095113      5.595269      5.204842     5.2298776    0.10762713
      
      5.10 - 5.15 : ##################################################   23% (24)
      5.15 - 5.20 : #############################################        21% (22)
      5.20 - 5.25 : ##################################################   23% (24)
      5.25 - 5.30 : #########################                            11% (12)
      5.30 - 5.35 : ##########                                            4% (5)
      5.35 - 5.40 : ########                                              3% (4)
      5.40 - 5.45 : ########                                              3% (4)
      5.45 - 5.50 : ####                                                  1% (2)
      5.50 - 5.55 : ##                                                    0% (1)
      5.55 - 5.60 : ####                                                  1% (2)
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      5.134675      8.524719      5.207658     5.2780985    0.34911969
      
      5.1 - 5.5 : ##################################################   94% (95)
      5.5 - 5.8 : ##                                                    3% (4)
      5.8 - 6.2 :                                                       0% (1)
      6.2 - 6.5 :
      6.5 - 6.8 :
      6.8 - 7.2 :
      7.2 - 7.5 :
      7.5 - 7.8 :
      7.8 - 8.2 :
      8.2 - 8.5 :
      
      perf bench sched pipe (cede disabled)
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      7.884227     12.576538      7.956474     8.0170722    0.46159054
      
      7.9 - 8.4 : ##################################################   99% (100)
      8.4 - 8.8 :
      8.8 - 9.3 :
      9.3 - 9.8 :
      9.8 - 10.2 :
      10.2 - 10.7 :
      10.7 - 11.2 :
      11.2 - 11.6 :
      11.6 - 12.1 :
      12.1 - 12.6 :
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      7.956021      8.217284      8.015615     8.0283866   0.049844967
      
      7.96 - 7.98 : ######################                               12% (13)
      7.98 - 8.01 : ##################################################   28% (29)
      8.01 - 8.03 : ####################################                 20% (21)
      8.03 - 8.06 : #########################                            14% (15)
      8.06 - 8.09 : ######################                               12% (13)
      8.09 - 8.11 : ######                                                3% (4)
      8.11 - 8.14 : ###                                                   1% (2)
      8.14 - 8.17 : ###                                                   1% (2)
      8.17 - 8.19 :
      8.19 - 8.22 : #                                                     0% (1)
      
      Observations: With the patch, the initial run/iteration takes a slight
      longer time. This can be attributed to the fact that now we pick a CPU
      from a idle core which could be sleep mode. Once we remove the cede,
      state the numbers improve in favour of the patch.
      
      ebizzy:
      transactions per second (higher is better)
      without patch
        N           Min           Max        Median           Avg        Stddev
      100       1018433       1304470       1193208     1182315.7     60018.733
      
      1018433 - 1047037 : ######                                                3% (3)
      1047037 - 1075640 : ########                                              4% (4)
      1075640 - 1104244 : ########                                              4% (4)
      1104244 - 1132848 : ###############                                       7% (7)
      1132848 - 1161452 : ####################################                 17% (17)
      1161452 - 1190055 : ##########################                           12% (12)
      1190055 - 1218659 : #############################################        21% (21)
      1218659 - 1247263 : ##################################################   23% (23)
      1247263 - 1275866 : ########                                              4% (4)
      1275866 - 1304470 : ########                                              4% (4)
      
      with patch
        N           Min           Max        Median           Avg        Stddev
      100        967014       1292938       1208819     1185281.8     69815.851
      
       967014 - 999606  : ##                                                    1% (1)
       999606 - 1032199 : ##                                                    1% (1)
      1032199 - 1064791 : ############                                          6% (6)
      1064791 - 1097384 : ##########                                            5% (5)
      1097384 - 1129976 : ##################                                    9% (9)
      1129976 - 1162568 : ####################                                 10% (10)
      1162568 - 1195161 : ##########################                           13% (13)
      1195161 - 1227753 : ############################################         22% (22)
      1227753 - 1260346 : ##################################################   25% (25)
      1260346 - 1292938 : ##############                                        7% (7)
      
      Observations: Not much changes, ebizzy is not much impacted.
      Signed-off-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-2-srikar@linux.vnet.ibm.com
      f3232321
    • V
      powerpc/papr_scm: Fix warning triggered by perf_stats_show() · ca78ef2f
      Vaibhav Jain 提交于
      A warning is reported by the kernel in case perf_stats_show() returns
      an error code. The warning is of the form below:
      
       papr_scm ibm,persistent-memory:ibm,pmemory@44100001:
       	  Failed to query performance stats, Err:-10
       dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
       fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
      
      On investigation it looks like that the compiler is silently
      truncating the return value of drc_pmem_query_stats() from 'long' to
      'int', since the variable used to store the return code 'rc' is an
      'int'. This truncated value is then returned back as a 'ssize_t' back
      from perf_stats_show() to 'dev_attr_show()' which thinks of it as a
      large unsigned number and triggers this warning..
      
      To fix this we update the type of variable 'rc' from 'int' to
      'ssize_t' that prevents the compiler from truncating the return value
      of drc_pmem_query_stats() and returning correct signed value back from
      perf_stats_show().
      
      Fixes: 2d02bf83 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
      Signed-off-by: NVaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200912081451.66225-1-vaibhav@linux.ibm.com
      ca78ef2f
  2. 15 9月, 2020 27 次提交