1. 16 Sep, 2020 8 commits
    • S
      powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju committed
      Currently, a Linux kernel built with CONFIG_NUMA on a system with multiple
      possible nodes marks node 0 as online at boot. However, in practice,
      there are systems where node 0 is memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      with memory and CPUs. The existence of this dummy node, which is cpuless and
      memoryless, can confuse users/scripts looking at the output of lscpu /
      numactl.
      
      By marking node 0 as offline, let's stop assuming that node 0 is
      always online. If node 0 has CPUs or memory that are online, node 0 will
      again be set as online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
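
The two listings above can be compared mechanically. The following is a minimal sketch, not part of the patch: it expands the sysfs-style node lists and reports online nodes that have neither CPUs nor memory. The helper names `parse_node_list` and `dummy_nodes` are hypothetical, and the input strings are copied from the listings.

```python
def parse_node_list(text):
    """Expand a sysfs-style list such as "0,2" or "0-31" into a set of ints."""
    nodes = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            nodes.update(range(int(low), int(high) + 1))
        else:
            nodes.add(int(part))
    return nodes


def dummy_nodes(online, has_cpu, has_memory):
    """Return online nodes that have neither CPUs nor memory."""
    return (parse_node_list(online)
            - parse_node_list(has_cpu)
            - parse_node_list(has_memory))


# Values copied from the "v5.8" and "v5.8 + patch" listings.
before = dummy_nodes(online="0,2", has_cpu="2", has_memory="2")
after = dummy_nodes(online="2", has_cpu="2", has_memory="2")

print(sorted(before))  # [0]
print(sorted(after))   # []
```

On a live system the same strings could be read from the files under /sys/devices/system/node/ instead of being hard-coded.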
      
      Example of a node with online CPUs/memory on node 0.
      (Same o/p with and without patch)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node() of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without these 2 commits, a Powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that the system
      was not able to use all the resources (i.e. missing resources) or that the
      system was not set up correctly.
      
      2. The existence of the dummy node also leads to inconsistent information. The
      number of online nodes is inconsistent with the information in the
      device-tree and resource-dump.
      
      3. When the dummy node is present, single node non-NUMA systems end up showing
      up as NUMA systems and numa_balancing gets enabled. This means we take
      the hit from unnecessary NUMA hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
    • S
      powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju committed
      The node id queried from the static device tree may not
      be correct. For example, it may always show 0 on a shared processor.
      Hence prefer the node id queried from vphn and fall back to the device-tree
      based node id if the vphn query fails.
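
The lookup order described here can be sketched as follows. This is an illustration only, not the kernel code: `pick_node_id` and the two query callables are hypothetical stand-ins, and `NUMA_NO_NODE = -1` is assumed as the "no answer" sentinel.

```python
NUMA_NO_NODE = -1  # assumed sentinel meaning "no node id available"


def pick_node_id(cpu, query_vphn, query_device_tree):
    """Prefer the vphn answer; use the device tree only if vphn has none."""
    nid = query_vphn(cpu)
    if nid != NUMA_NO_NODE:
        return nid
    return query_device_tree(cpu)


# Hypothetical stand-ins: the device tree always reports node 0,
# while vphn knows the real node for cpu 4 but fails for cpu 5.
device_tree = lambda cpu: 0
vphn = lambda cpu: 2 if cpu == 4 else NUMA_NO_NODE

print(pick_node_id(4, vphn, device_tree))  # 2
print(pick_node_id(5, vphn, device_tree))  # 0
```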
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
    • S
      powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju committed
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not have any cpus
      or memory attached to it. As per PAPR, node affinity of a cpu is only
      available once it's present / online. For all cpus that are possible but
      not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc needs
      to make sure cpu_to_node() for all possible but not present cpus is set
      to a proper node.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
    • S
      powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju committed
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware, i.e.
         PowerVM, can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to the "ibm,current-associativity-domains" property. If the latter
      property is not available, use "ibm,max-associativity-domains" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned on page 833 / B.5.3, which is covered under the
      "Appendix B. System Binding" section.
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on
      "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note the maximum nodes this platform can support is only 2 but the
      possible nodes is set to 32.
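
The link between the lsprop cells and the possible-node ranges above can be checked with a short sketch. It is an illustration, not the kernel's parsing code. Which cell holds the node count depends on the platform's associativity depth, which this log does not state, so `NODE_INDEX = 3` is an assumption chosen because it reproduces both ranges shown. The helper names are hypothetical.

```python
def parse_cells(lsprop_line):
    """Turn a line of hexadecimal lsprop cells into a list of ints."""
    return [int(cell, 16) for cell in lsprop_line.split()]


def possible_range(current, maximum, index):
    """Use the current-domains cells if present, else fall back to max."""
    cells = current if current is not None else maximum
    return f"0-{cells[index] - 1}"


# Cell values copied from the lsprop output above.
current = parse_cells("00000005 00000001 00000002 00000002 00000002 00000010")
maximum = parse_cells("00000005 00000001 00000008 00000020 00000020 00000100")

NODE_INDEX = 3  # assumption: cell that carries the node count

print(possible_range(None, maximum, NODE_INDEX))     # 0-31 (max property only)
print(possible_range(current, maximum, NODE_INDEX))  # 0-1  (current preferred)
```

Every cell of the current property is less than or equal to the matching cell of the max property, which agrees with the relationship stated in the commit message.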
      
      This is important because a lot of kernel and user space code allocates
      structures for all possible nodes, leading to a lot of memory that is
      allocated but not used.
      
      I ran a simple experiment to create and destroy 100 memory cgroups on
      boot on an 8 node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
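
The two figures in the observations can be reproduced from the "used" column of the free -k listings. The sketch below only restates that arithmetic; the dictionary keys are hypothetical labels.

```python
# "used" column (in kB) copied from the free -k output above.
before = {"boot": 4106816, "created": 4628416, "destroyed": 4697408}
after = {"boot": 3969472, "created": 4181888, "destroyed": 4232320}

# Saving at boot.
boot_saving = before["boot"] - after["boot"]

# Saving while creating 100 memcgs, with the boot saving factored out.
create_saving = before["created"] - after["created"] - boot_saving

# Equivalent view: compare how much each kernel grew from boot to creation.
growth_before = before["created"] - before["boot"]
growth_after = after["created"] - after["boot"]

print(boot_saving)                   # 137344
print(create_saving)                 # 309184
print(growth_before - growth_after)  # 309184
```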
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
    • S
      powerpc/topology: Override cpu_smt_mask · f3232321
      Srikar Dronamraju committed
      On Power9, a pair of SMT4 cores can be presented by the firmware as an SMT8
      core for backward compatibility reasons, with the fusion of two SMT4 cores.
      Powerpc allows LPARs to be live migrated from Power8 to Power9. Existing
      software developed/configured for Power8 expects to see an SMT8 core.
      
      In order to maintain userspace backward compatibility (with Power8 chips in
      case of Power9) in enterprise Linux systems, the topology_sibling_cpumask
      has to be set to SMT8 core.
      
      cpu_smt_mask() should generally point to the cpu mask of the SMT4 core.
      Hence override the default cpu_smt_mask() to be powerpc specific
      allowing for better scheduling behaviour on Power.
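
The difference between the two masks can be sketched as follows. This is an illustration, not the kernel implementation: `smt_mask_for` is a hypothetical helper, and the split of the eight threads into two interleaved groups is an assumed layout used only for the example.

```python
def smt_mask_for(cpu, sibling_mask, thread_groups):
    """Return the SMT4 group containing cpu, or the full sibling mask
    when no thread-group information is available."""
    for group in thread_groups:
        if cpu in group:
            return set(group) & set(sibling_mask)
    return set(sibling_mask)


# What userspace sees: one fused SMT8 core with eight threads.
sibling_mask = {8, 9, 10, 11, 12, 13, 14, 15}

# Assumed layout of the two underlying SMT4 cores (hypothetical).
thread_groups = [{8, 10, 12, 14}, {9, 11, 13, 15}]

print(sorted(smt_mask_for(10, sibling_mask, thread_groups)))  # [8, 10, 12, 14]
print(sorted(smt_mask_for(9, sibling_mask, thread_groups)))   # [9, 11, 13, 15]
print(len(smt_mask_for(9, sibling_mask, [])))                 # 8
```

With the narrower mask, a scheduler looking for an idle core only needs the four threads of one SMT4 core to be idle rather than all eight threads of the fused core.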
      
      schbench
      (latency measured in usecs, so lesser is better)
      Without patch                   With patch
      Latency percentiles (usec)	Latency percentiles (usec)
      	50.0000th: 34           	50.0000th: 38
      	75.0000th: 47           	75.0000th: 52
      	90.0000th: 54           	90.0000th: 60
      	95.0000th: 57           	95.0000th: 64
      	*99.0000th: 62          	*99.0000th: 72
      	99.5000th: 65           	99.5000th: 75
      	99.9000th: 76           	99.9000th: 3452
      	min=0, max=9205         	min=0, max=9344
      
      schbench (With Cede disabled)
      Without patch                   With patch
      Latency percentiles (usec) 	Latency percentiles (usec)
      	50.0000th: 20           	50.0000th: 21
      	75.0000th: 28           	75.0000th: 29
      	90.0000th: 33           	90.0000th: 34
      	95.0000th: 35           	95.0000th: 37
      	*99.0000th: 40          	*99.0000th: 40
      	99.5000th: 48           	99.5000th: 42
      	99.9000th: 94           	99.9000th: 79
      	min=0, max=791          	min=0, max=791
      
      perf bench sched pipe
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      5.095113      5.595269      5.204842     5.2298776    0.10762713
      
      5.10 - 5.15 : ##################################################   23% (24)
      5.15 - 5.20 : #############################################        21% (22)
      5.20 - 5.25 : ##################################################   23% (24)
      5.25 - 5.30 : #########################                            11% (12)
      5.30 - 5.35 : ##########                                            4% (5)
      5.35 - 5.40 : ########                                              3% (4)
      5.40 - 5.45 : ########                                              3% (4)
      5.45 - 5.50 : ####                                                  1% (2)
      5.50 - 5.55 : ##                                                    0% (1)
      5.55 - 5.60 : ####                                                  1% (2)
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      5.134675      8.524719      5.207658     5.2780985    0.34911969
      
      5.1 - 5.5 : ##################################################   94% (95)
      5.5 - 5.8 : ##                                                    3% (4)
      5.8 - 6.2 :                                                       0% (1)
      6.2 - 6.5 :
      6.5 - 6.8 :
      6.8 - 7.2 :
      7.2 - 7.5 :
      7.5 - 7.8 :
      7.8 - 8.2 :
      8.2 - 8.5 :
      
      perf bench sched pipe (cede disabled)
      usec/ops : lesser is better
      Without patch
        N           Min           Max        Median           Avg        Stddev
      101      7.884227     12.576538      7.956474     8.0170722    0.46159054
      
      7.9 - 8.4 : ##################################################   99% (100)
      8.4 - 8.8 :
      8.8 - 9.3 :
      9.3 - 9.8 :
      9.8 - 10.2 :
      10.2 - 10.7 :
      10.7 - 11.2 :
      11.2 - 11.6 :
      11.6 - 12.1 :
      12.1 - 12.6 :
      
      With patch
        N           Min           Max        Median           Avg        Stddev
      101      7.956021      8.217284      8.015615     8.0283866   0.049844967
      
      7.96 - 7.98 : ######################                               12% (13)
      7.98 - 8.01 : ##################################################   28% (29)
      8.01 - 8.03 : ####################################                 20% (21)
      8.03 - 8.06 : #########################                            14% (15)
      8.06 - 8.09 : ######################                               12% (13)
      8.09 - 8.11 : ######                                                3% (4)
      8.11 - 8.14 : ###                                                   1% (2)
      8.14 - 8.17 : ###                                                   1% (2)
      8.17 - 8.19 :
      8.19 - 8.22 : #                                                     0% (1)
      
      Observations: With the patch, the initial run/iteration takes a slightly
      longer time. This can be attributed to the fact that now we pick a CPU
      from an idle core which could be in sleep mode. Once we remove the cede
      state, the numbers improve in favour of the patch.
      
      ebizzy:
      transactions per second (higher is better)
      without patch
        N           Min           Max        Median           Avg        Stddev
      100       1018433       1304470       1193208     1182315.7     60018.733
      
      1018433 - 1047037 : ######                                                3% (3)
      1047037 - 1075640 : ########                                              4% (4)
      1075640 - 1104244 : ########                                              4% (4)
      1104244 - 1132848 : ###############                                       7% (7)
      1132848 - 1161452 : ####################################                 17% (17)
      1161452 - 1190055 : ##########################                           12% (12)
      1190055 - 1218659 : #############################################        21% (21)
      1218659 - 1247263 : ##################################################   23% (23)
      1247263 - 1275866 : ########                                              4% (4)
      1275866 - 1304470 : ########                                              4% (4)
      
      with patch
        N           Min           Max        Median           Avg        Stddev
      100        967014       1292938       1208819     1185281.8     69815.851
      
       967014 - 999606  : ##                                                    1% (1)
       999606 - 1032199 : ##                                                    1% (1)
      1032199 - 1064791 : ############                                          6% (6)
      1064791 - 1097384 : ##########                                            5% (5)
      1097384 - 1129976 : ##################                                    9% (9)
      1129976 - 1162568 : ####################                                 10% (10)
      1162568 - 1195161 : ##########################                           13% (13)
      1195161 - 1227753 : ############################################         22% (22)
      1227753 - 1260346 : ##################################################   25% (25)
      1260346 - 1292938 : ##############                                        7% (7)
      
      Observations: Not much change; ebizzy is not much impacted.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-2-srikar@linux.vnet.ibm.com
    • S
      sched/topology: Allow archs to override cpu_smt_mask · 3babbe44
      Srikar Dronamraju committed
      cpu_smt_mask tracks topology_sibling_cpumask. This would be good for
      most architectures. One use of cpu_smt_mask() is to
      identify idle cores. On Power9, a pair of SMT4 cores can be presented
      by the firmware as an SMT8 core for backward compatibility reasons.
      
      powerpc allows LPARs to be live migrated from Power8 to Power9. Do
      note Power8 had only SMT8 cores. Existing software which has been
      developed/configured for Power8 would expect to see SMT8 core.
      Maintaining the illusion of SMT8 core is a requirement to make that
      work.
      
      In order to maintain the above userspace backward compatibility with
      previous versions of the processor, from Power9 onwards there is an option
      for the firmware to advertise a pair of SMT4 cores as a fused core, aka an
      SMT8 core. On Power9 this pair shares the L2 cache as well. However, from
      the scheduler's point of view, a core should be determined by SMT4,
      since it's a completely independent unit of compute. Hence allow the
      powerpc architecture to override the default cpu_smt_mask() to point
      to the SMT4 cores in SMT8 mode.
      
      This will ensure the scheduler is always given the right information.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200807074517.27957-1-srikar@linux.vnet.ibm.com
    • W
      drivers/macintosh/smu.c: Fix undeclared symbol warning · 3db8715e
      Wang Wensheng committed
      Building the kernel with `C=2` reports:
      drivers/macintosh/smu.c:1018:30: warning: symbol
      '__smu_get_sdb_partition' was not declared. Should it be static?
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200914122615.65669-1-wangwensheng4@huawei.com
    • V
      powerpc/papr_scm: Fix warning triggered by perf_stats_show() · ca78ef2f
      Vaibhav Jain committed
      A warning is reported by the kernel in case perf_stats_show() returns
      an error code. The warning is of the form below:
      
       papr_scm ibm,persistent-memory:ibm,pmemory@44100001:
       	  Failed to query performance stats, Err:-10
       dev_attr_show: perf_stats_show+0x0/0x1c0 [papr_scm] returned bad count
       fill_read_buffer: dev_attr_show+0x0/0xb0 returned bad count
      
      On investigation it looks like the compiler is silently
      truncating the return value of drc_pmem_query_stats() from 'long' to
      'int', since the variable used to store the return code 'rc' is an
      'int'. This truncated value is then returned as a 'ssize_t'
      from perf_stats_show() to 'dev_attr_show()', which treats it as a
      large unsigned number and triggers this warning.
      
      To fix this we update the type of variable 'rc' from 'int' to
      'ssize_t', which prevents the compiler from truncating the return value
      of drc_pmem_query_stats() and returns the correct signed value from
      perf_stats_show().
      
      Fixes: 2d02bf83 ("powerpc/papr_scm: Fetch nvdimm performance stats from PHYP")
      Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200912081451.66225-1-vaibhav@linux.ibm.com
  2. 15 Sep, 2020 32 commits