1. 26 Aug, 2021 1 commit
  2. 13 Aug, 2021 5 commits
  3. 10 Aug, 2021 1 commit
    •
      powerpc/numa: Consider the max NUMA node for migratable LPAR · 9c7248bb
      Laurent Dufour committed
      When an LPAR is migratable, we should consider the maximum possible NUMA
      node instead of the number of NUMA nodes from the actual system.
      
      The DT property 'ibm,current-associativity-domains' defines the maximum
      number of nodes the LPAR can see when running on that box. But if the
      LPAR is migrated to another box, it may see up to the number of nodes
      defined by 'ibm,max-associativity-domains'. So if an LPAR is migratable,
      that value should be used.
      
      Unfortunately, there is no easy way to know if an LPAR is migratable or
      not. The hypervisor exports the property 'ibm,migratable-partition' when
      it is set up to migrate partitions, but that does not mean that the
      current partition is migratable.
      
      Without this patch, when an LPAR is started on a 2 node box and then
      migrated to a 3 node box, the hypervisor may spread the LPAR's CPUs on
      the 3rd node. In that case if a CPU from that 3rd node is added to the
      LPAR, it will be assigned to the wrong node because the kernel has
      been set to use up to 2 nodes (the configuration of the departure node).
      With this patch applied, the CPU is correctly added to the 3rd node.
      
      Fixes: f9f130ff ("powerpc/numa: Detect support for coregroup")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210511073136.17795-1-ldufour@linux.ibm.com
      9c7248bb
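The selection rule described in this commit message can be sketched in plain C. This is a minimal userspace model, not the kernel code: `struct dt_props`, `enum domains_source` and `pick_domains_source()` are hypothetical names, and the sketch assumes that the mere presence of 'ibm,migratable-partition' is treated conservatively as "may be migrated".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of which device-tree properties are present. */
struct dt_props {
    bool has_current_domains;      /* 'ibm,current-associativity-domains' */
    bool has_migratable_partition; /* 'ibm,migratable-partition' */
};

enum domains_source {
    USE_CURRENT_DOMAINS, /* size the node map from the current box */
    USE_MAX_DOMAINS      /* size the node map from the hypervisor maximum */
};

/*
 * Pick the property used to size the possible-node map.
 * Assumption: if the partition might be migrated, or the "current"
 * property is missing, use the larger "max" value.
 */
enum domains_source pick_domains_source(const struct dt_props *dt)
{
    assert(dt != NULL);
    if (dt->has_current_domains && !dt->has_migratable_partition)
        return USE_CURRENT_DOMAINS;
    return USE_MAX_DOMAINS;
}
```

With this rule, an LPAR started on a 2 node box that exports 'ibm,migratable-partition' keeps room for the 3rd node it may see after migration.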
  4. 27 Nov, 2020 1 commit
    •
      powerpc/numa: Fix a regression on memoryless node 0 · 10f78fd0
      Srikar Dronamraju committed
      Commit e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      offlines node 0 and expects nodes to be subsequently onlined when CPUs
      or nodes are detected.
      
      Commit 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      skips onlining node 0 when CPUs are associated with node 0.
      
      On systems with node 0 having CPUs but no memory, this causes node 0 to
      be marked offline. This causes issues at boot time when trying to set
      the memory node for online CPUs while building the zonelist.
      
      0:mon> t
      [link register   ] c000000000400354 __build_all_zonelists+0x164/0x280
      [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
      [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
      [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
      [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
      [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
      [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
      0:mon> r
      R00 = c000000000400354   R16 = 000000000b57a0e8
      R01 = c00000000161bda0   R17 = 000000000b57a6b0
      R02 = c00000000161ce00   R18 = 000000000b5afee8
      R03 = 0000000000000000   R19 = 000000000b6448a0
      R04 = 0000000000000000   R20 = fffffffffffffffd
      R05 = 0000000000000000   R21 = 0000000001400000
      R06 = 0000000000000000   R22 = 000000001ec00000
      R07 = 0000000000000001   R23 = c000000001175580
      R08 = 0000000000000000   R24 = c000000001651ed8
      R09 = c0000000017e84d8   R25 = c000000001652480
      R10 = 0000000000000000   R26 = c000000001175584
      R11 = c000000c7fac0d10   R27 = c0000000019568d0
      R12 = c000000000400180   R28 = 0000000000000000
      R13 = c000000002200000   R29 = c00000000164dd78
      R14 = 000000000b579f78   R30 = 0000000000000000
      R15 = 000000000b57a2b8   R31 = c000000001175584
      pc  = c000000000400194 local_memory_node+0x24/0x80
      cfar= c000000000074334 mcount+0xc/0x10
      lr  = c000000000400354 __build_all_zonelists+0x164/0x280
      msr = 8000000002001033   cr  = 44002284
      ctr = c000000000400180   xer = 0000000000000001   trap =  380
      dar = 0000000000001388   dsisr = c00000000161bc90
      0:mon>
      
      Fix this by setting the node online while onlining CPUs that belong to
      node 0.
      
      Fixes: e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      Fixes: 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      Reported-by: Milan Mohanty <milmohan@in.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com
      10f78fd0
  5. 14 Oct, 2020 1 commit
    •
      arch, mm: replace for_each_memblock() with for_each_mem_pfn_range() · c9118e6c
      Mike Rapoport committed
      There are several occurrences of the following pattern:
      
      	for_each_memblock(memory, reg) {
      		start_pfn = memblock_region_memory_base_pfn(reg);
      		end_pfn = memblock_region_memory_end_pfn(reg);
      
      		/* do something with start_pfn and end_pfn */
      	}
      
      Rather than iterate over all memblock.memory regions and each time query
      for their start and end PFNs, use the for_each_mem_pfn_range() iterator to
      get simpler and clearer code.
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9118e6c
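The idea behind this conversion can be illustrated outside the kernel. The following is a userspace sketch only: `struct mem_region`, `next_pfn_range()` and `count_pages()` are hypothetical stand-ins, and 4 KiB pages are assumed. It shows an iterator that hands back page-frame ranges directly, so callers no longer convert region bounds themselves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages for this sketch */
#define PFN_UP(x)   (((x) + ((uint64_t)1 << PAGE_SHIFT) - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x) ((x) >> PAGE_SHIFT)

/* Hypothetical stand-in for a memory region: byte base and byte size. */
struct mem_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Yield the next non-empty PFN range, advancing the cursor *i.
 * Returns 1 when a range was produced, 0 when the regions are exhausted.
 */
int next_pfn_range(const struct mem_region *regs, size_t nregs, size_t *i,
                   uint64_t *start_pfn, uint64_t *end_pfn)
{
    while (*i < nregs) {
        const struct mem_region *r = &regs[(*i)++];

        *start_pfn = PFN_UP(r->base);           /* first whole page */
        *end_pfn = PFN_DOWN(r->base + r->size); /* one past last whole page */
        if (*start_pfn < *end_pfn)
            return 1;
        /* region holds no whole page: skip it */
    }
    return 0;
}

/* Caller side: only start/end PFNs are visible, no region arithmetic. */
uint64_t count_pages(const struct mem_region *regs, size_t nregs)
{
    size_t i = 0;
    uint64_t start_pfn, end_pfn, total = 0;

    assert(regs != NULL || nregs == 0);
    while (next_pfn_range(regs, nregs, &i, &start_pfn, &end_pfn))
        total += end_pfn - start_pfn;
    return total;
}
```

For example, a region at 0x1000 of size 0x3000 contributes three pages, while a region at 0x10800 of size 0x1000 contains no whole page and is skipped.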
  6. 06 Oct, 2020 1 commit
    •
      pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Scott Cheloha committed
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE) or
      an invalid nid.
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      It should be noted that the speedup grows more substantial when
      hot-adding LMBs at the end of the drconf range.  This is because we
      are skipping a linear LMB search.
      
      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent.  An add-by-count hot-add operation adds LMBs
      greedily, so LMBs near the start of the drconf range are considered
      first.  On an otherwise idle LPAR with so many LMBs we would expect to
      find the LMBs we need near the start of the drconf range, hence the
      smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
      72cdd117
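The redundancy this commit removes can be modelled in a few lines of C. This is a simplified sketch rather than the pseries code: `struct lmb`, `nid_by_address()`, `nid_for_lmb()`, `MAX_NODES` and `DEFAULT_NID` are hypothetical, with `DEFAULT_NID` standing in for whatever default node the kernel falls back to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUMA_NO_NODE (-1)
#define MAX_NODES 4   /* assumption: node ids 0..3 are valid in this sketch */
#define DEFAULT_NID 0 /* assumption: stand-in for the kernel's default node */

/* Hypothetical, simplified LMB: base address plus its node id. */
struct lmb {
    uint64_t base_addr;
    int nid;
};

/* Old path: only the address is passed, so the LMB is searched for again. */
int nid_by_address(const struct lmb *lmbs, size_t n, uint64_t addr,
                   uint64_t lmb_size)
{
    for (size_t i = 0; i < n; i++) /* linear scan over all LMBs */
        if (addr >= lmbs[i].base_addr && addr < lmbs[i].base_addr + lmb_size)
            return lmbs[i].nid;
    return NUMA_NO_NODE;
}

/* New path: the caller already holds the LMB, so no search is needed. */
int nid_for_lmb(const struct lmb *lmb)
{
    assert(lmb != NULL);
    /* Duplicate the fallback: NUMA_NO_NODE or an invalid nid -> default. */
    if (lmb->nid < 0 || lmb->nid >= MAX_NODES)
        return DEFAULT_NID;
    return lmb->nid;
}
```

The scan in `nid_by_address()` costs more the further the LMB sits from the start of the array, which matches the observation that the speedup grows for LMBs at the end of the drconf range. For the large test, the quoted timings give 12450 s / 7065 s, roughly a 1.76x speedup.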
  7. 16 Sep, 2020 7 commits
    •
      powerpc/smp: Implement cpu_to_coregroup_id · fa35e868
      Srikar Dronamraju committed
      Lookup the coregroup id from the associativity array.
      
      If unable to detect the coregroup id, fall back on the core id.
      This ensures that the sched_domain degenerates and an extra sched domain
      is not created.
      
      Ideally this function should have been implemented in
      arch/powerpc/kernel/smp.c. However, if it is implemented in mm/numa.c, we
      don't need to find the primary domain again.
      
      If the device-tree mentions more than one coregroup, then the kernel
      implements only the last or the smallest coregroup, which currently
      corresponds to the penultimate domain in the device-tree.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
      fa35e868
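The lookup and its fallback can be sketched as follows. This is an illustrative model, not the kernel function: `coregroup_id()` is a hypothetical name, and the associativity array is assumed to be laid out with the number of domains in element 0 followed by one id per domain, the last of which refers to the core.

```c
#include <assert.h>
#include <stddef.h>

/*
 * assoc[0]  : number of domains that follow (assumed layout)
 * assoc[1..]: one id per domain; the last one refers to the core
 * primary_index: position of the primary (NUMA node) domain
 * core_id   : fallback when no coregroup can be determined
 */
int coregroup_id(const int *assoc, int primary_index, int core_id)
{
    assert(primary_index >= 1);

    if (assoc == NULL) /* associativity lookup failed */
        return core_id;

    int last = assoc[0];

    /*
     * A coregroup exists only if there is a domain between the primary
     * domain and the core. Use the penultimate domain as its id.
     */
    if (last > primary_index + 1)
        return assoc[last - 1];

    return core_id; /* no coregroup: the domain degenerates */
}
```

With the array `{5, 0, 1, 2, 7, 42}` and a primary index of 2, the sketch returns 7, the penultimate entry. With a primary index of 4 the primary domain is itself penultimate, so the core id passed in is returned.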
    •
      powerpc/smp: Create coregroup domain · 72730bfc
      Srikar Dronamraju committed
      Add percpu coregroup maps and masks to create coregroup domain.
      If a coregroup doesn't exist, the coregroup domain will be degenerated
      in favour of SMT/CACHE domain. Do note this patch is only creating stubs
      for cpu_to_coregroup_id. The actual cpu_to_coregroup_id implementation
      would be in a subsequent patch.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
      72730bfc
    •
      powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju committed
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
      core.
      - If the primary reference domain happens to be the penultimate domain in
      the associativity domains device-tree property, then there are no
      coregroups. However, if it is not the penultimate domain, then there are
      coregroups. There can be more than one coregroup. For now we are
      interested in the last or the smallest coregroups, i.e. one sub-group
      per DIE.
      
      Currently no firmware exposes this grouping. Hence
      allow the basis for grouping to be abstract.  Once the firmware starts
      using this grouping, code will be added to detect the type of grouping
      and adjust the sd domain flags accordingly.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
      f9f130ff
    •
      powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju committed
      Currently the Linux kernel with CONFIG_NUMA on a system with multiple
      possible nodes marks node 0 as online at boot.  However in practice,
      there are systems which have node 0 as memoryless and cpuless.
      
      This can cause numa_balancing to be enabled on systems with only one node
      with memory and CPUs. The existence of this dummy node, which is cpuless
      and memoryless, can confuse users/scripts looking at the output of lscpu /
      numactl.
      
      By marking node 0 as offline, let's stop assuming that node 0 is
      always online. If node 0 has CPUs or memory that are online, node 0 will
      again be set as online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a node with online CPUs/memory on node 0.
      (Same o/p with and without patch)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node of possible but not present cpus would
      previously return 0. Hence this commit depends on commit ("powerpc/numa: Set
      numa_node for all possible cpus") and commit ("powerpc/numa: Prefer node id
      queried from vphn"). Without the 2 commits, Powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs tend to
      believe there is an extra online node. This tends to confuse users and
      applications. Other user space applications start believing that the system
      was not able to use all the resources (i.e. missing resources) or that the
      system was not set up correctly.
      
      2. The existence of the dummy node also leads to inconsistent information.
      The number of online nodes is inconsistent with the information in the
      device-tree and resource-dump.
      
      3. When the dummy node is present, single node non-NUMA systems end up showing
      up as NUMA systems and numa_balancing gets enabled. This means we take
      the hit from the unnecessary NUMA hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
      e75130f2
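The effect shown in the before/after listings can be reproduced with a toy model. This sketch assumes, as the listings suggest, that numa_balancing is turned on when more than one node is online. `struct node_info`, `count_online_nodes()` and `numa_balancing_wanted()` are hypothetical names, and `force_node0` models the old behaviour of always onlining node 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-node summary. */
struct node_info {
    bool has_cpu;
    bool has_memory;
};

/*
 * Count online nodes. A node is online if it has CPUs or memory;
 * with force_node0 set, node 0 is online regardless (old behaviour).
 */
int count_online_nodes(const struct node_info *nodes, size_t n,
                       bool force_node0)
{
    int online = 0;

    assert(nodes != NULL);
    for (size_t i = 0; i < n; i++) {
        bool populated = nodes[i].has_cpu || nodes[i].has_memory;

        if (populated || (i == 0 && force_node0))
            online++;
    }
    return online;
}

/* Assumption for this sketch: balancing is enabled with >1 online node. */
bool numa_balancing_wanted(int online_nodes)
{
    return online_nodes > 1;
}
```

For the machine in the listings (only node 2 populated), the old behaviour yields two online nodes and balancing enabled, while the patched behaviour yields one online node and balancing disabled.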
    •
      powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju committed
      The node id queried from the static device tree may not
      be correct. For example, it may always show 0 on a shared processor.
      Hence prefer the node id queried from vphn and fall back on the device tree
      based node id if the vphn query fails.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
      6398eaa2
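The preference order amounts to a two-step lookup. A minimal sketch, assuming a failed vphn query is reported as a negative value; `resolve_node_id()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

#define NUMA_NO_NODE (-1)

/*
 * vphn_nid: node id from the vphn query, negative if the query failed
 * dt_nid  : node id from the static device tree (may be NUMA_NO_NODE)
 */
int resolve_node_id(int vphn_nid, int dt_nid)
{
    assert(dt_nid >= NUMA_NO_NODE);

    if (vphn_nid >= 0) /* the vphn answer is preferred when available */
        return vphn_nid;
    return dt_nid;     /* otherwise fall back on the device tree */
}
```

On a shared processor where the device tree reports node 0 but vphn reports node 2, the sketch returns 2. If the vphn query fails, it returns the device tree value unchanged.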
    •
      powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju committed
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 does not have any
      cpus or memory attached to it. As per PAPR, node affinity of a cpu is only
      available once it is present / online. For all cpus that are possible but
      not present, cpu_to_node() would point to node 0.
      
      To ensure a cpuless, memoryless dummy node is not online, powerpc needs
      to make sure cpu_to_node for all possible but not present cpus is set to a
      proper node.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
      a874f100
    •
      powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju committed
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware, i.e.
         PowerVM, can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to that of the "ibm,current-associativity-domains" property. If the
      latter property is not available, use "ibm,max-associativity-domains" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned on page 833 / B.5.3, which is covered under the
      "Appendix B. System Binding" section.
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However the possible number of nodes for a platform may be
      significantly less. Hence set the possible number of nodes based on
      "ibm,current-associativity-domains" property.
      
      Nathan Lynch had raised a valid concern that post LPM (Live Partition
      Migration), a user could DLPAR add processors and memory after LPM
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note the maximum nodes this platform can support is only 2 but the
      possible nodes is set to 32.
      
      This is important because a lot of kernel and user space code allocates
      structures for all possible nodes, leading to a lot of memory that is
      allocated but not used.
      
      I ran a simple experiment to create and destroy 100 memory cgroups on
      boot on an 8 node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        Fixed kernel takes 137344 kb (4106816-3969472) less to boot.
        Fixed kernel takes 309184 kb (4628416-4181888-137344) less to create 100 memcgs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
      67df7784
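The lsprop output quoted in this commit message can be decoded with a short sketch. It assumes the first cell gives the number of levels that follow and that the NUMA node level sits at depth 4 on this machine. The depth value and the `possible_nodes()` helper are assumptions for illustration, chosen because they reproduce the quoted 0-31 and 0-1 ranges.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * cells[0]  : number of associativity levels that follow (assumed layout)
 * cells[1..]: maximum number of domains at each level
 * depth     : level that corresponds to NUMA nodes (assumed to be known)
 * Returns the number of possible nodes, or 0 if depth is out of range.
 */
uint32_t possible_nodes(const uint32_t *cells, size_t ncells, uint32_t depth)
{
    assert(cells != NULL && ncells > 0);

    if (depth == 0 || depth > cells[0] || depth >= ncells)
        return 0;
    return cells[depth];
}
```

Applied to the quoted cells, depth 4 of "ibm,max-associativity-domains" (00000005 00000001 00000008 00000020 00000020 00000100) gives 0x20 = 32 possible nodes, while the same depth of "ibm,current-associativity-domains" (00000005 00000001 00000002 00000002 00000002 00000010) gives 2.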
  8. 08 Aug, 2020 1 commit
  9. 29 Jul, 2020 1 commit
  10. 26 Jul, 2020 1 commit
  11. 16 Jul, 2020 11 commits
  12. 04 Mar, 2020 4 commits
  13. 04 Feb, 2020 1 commit
  14. 04 Jul, 2019 4 commits
    •
      powerpc/mm: Consolidate numa_enable check and min_common_depth check · 495c2ff4
      Aneesh Kumar K.V committed
      If we fail to parse min_common_depth from the device tree, we boot with
      NUMA disabled. Reflect this by setting the numa_enabled variable
      to false. Also, switch all min_common_depth failure checks to
      if (!numa_enabled) checks.
      
      This helps us to avoid checking for both in different code paths.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      495c2ff4
    •
      powerpc/mm: Fix node look up with numa=off boot · f52741c4
      Aneesh Kumar K.V committed
      If we boot with numa=off, we need to make sure we return NUMA_NO_NODE when
      looking up associativity details of resources. Without this, we hit a crash
      like the one below:
      
      BUG: Unable to handle kernel data access at 0x40000000008
      Faulting instruction address: 0xc000000008f31704
      cpu 0x1b: Vector: 380 (Data SLB Access) at [c00000000b9bb320]
          pc: c000000008f31704: _raw_spin_lock+0x14/0x100
          lr: c0000000083f41fc: ____cache_alloc_node+0x5c/0x290
          sp: c00000000b9bb5b0
         msr: 800000010280b033
         dar: 40000000008
        current = 0xc00000000b9a2700
        paca    = 0xc00000000a740c00   irqmask: 0x03   irq_happened: 0x01
          pid   = 1, comm = swapper/27
      Linux version 5.2.0-rc4-00925-g74e188c620b1 (root@linux-d8ip) (gcc version 7.4.1 20190424 [gcc-7-branch revision 270538] (SUSE Linux)) #34 SMP Sat Jun 29 00:41:02 EDT 2019
      enter ? for help
      [link register   ] c0000000083f41fc ____cache_alloc_node+0x5c/0x290
      [c00000000b9bb5b0] 0000000000000dc0 (unreliable)
      [c00000000b9bb5f0] c0000000083f48c8 kmem_cache_alloc_node_trace+0x138/0x360
      [c00000000b9bb670] c000000008aa789c devres_alloc_node+0x4c/0xa0
      [c00000000b9bb6a0] c000000008337218 devm_memremap+0x58/0x130
      [c00000000b9bb6f0] c000000008aed00c devm_nsio_enable+0xdc/0x170
      [c00000000b9bb780] c000000008af3b6c nd_pmem_probe+0x4c/0x180
      [c00000000b9bb7b0] c000000008ad84cc nvdimm_bus_probe+0xac/0x260
      [c00000000b9bb840] c000000008aa0628 really_probe+0x148/0x500
      [c00000000b9bb8d0] c000000008aa0d7c driver_probe_device+0x19c/0x1d0
      [c00000000b9bb950] c000000008aa11bc device_driver_attach+0xcc/0x100
      [c00000000b9bb990] c000000008aa12ec __driver_attach+0xfc/0x1e0
      [c00000000b9bba10] c000000008a9d0a4 bus_for_each_dev+0xb4/0x130
      [c00000000b9bba70] c000000008a9fc04 driver_attach+0x34/0x50
      [c00000000b9bba90] c000000008a9f118 bus_add_driver+0x1d8/0x300
      [c00000000b9bbb20] c000000008aa2358 driver_register+0x98/0x1a0
      [c00000000b9bbb90] c000000008ad7e6c __nd_driver_register+0x5c/0x100
      [c00000000b9bbbf0] c0000000093efbac nd_pmem_driver_init+0x34/0x48
      [c00000000b9bbc10] c0000000080106c0 do_one_initcall+0x60/0x2d0
      [c00000000b9bbce0] c00000000938463c kernel_init_freeable+0x384/0x48c
      [c00000000b9bbdb0] c000000008010a5c kernel_init+0x2c/0x160
      [c00000000b9bbe20] c00000000800ba54 ret_from_kernel_thread+0x5c/0x68
      Reported-and-debugged-by: Vaibhav Jain <vaibhav@linux.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      f52741c4
    •
      powerpc/mm/drconf: Use NUMA_NO_NODE on failures instead of node 0 · ea9f5b70
      Aneesh Kumar K.V committed
      If we fail to parse the associativity array we should default to
      NUMA_NO_NODE instead of node 0. The rest of the code falls back to the
      right default if it finds the NUMA node value NUMA_NO_NODE.
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      ea9f5b70
    •
      powerpc/pseries: Provide vcpu dispatch statistics · d62c8dee
      Naveen N. Rao committed
      For Shared Processor LPARs, the POWER Hypervisor maintains a
      relatively static mapping of the LPAR processors (vcpus) to physical
      processor chips (representing the "home" node) and tries to always
      dispatch vcpus on their associated physical processor chip. However,
      under certain scenarios, vcpus may be dispatched on a different
      processor chip (away from its home node). The actual physical
      processor number on which a certain vcpu is dispatched is available to
      the guest in the 'processor_id' field of each DTL entry.
      
      The guest can discover the home node of each vcpu through the
      H_HOME_NODE_ASSOCIATIVITY(flags=1) hcall. The guest can also discover
      the associativity of physical processors, as represented in the DTL
      entry, through the H_HOME_NODE_ASSOCIATIVITY(flags=2) hcall.
      
      These can then be compared to determine if the vcpu was dispatched on
      its home node or not. If the vcpu was not dispatched on the home node,
      it is possible to determine if the vcpu was dispatched in a different
      chip, socket or drawer.
      
      Introduce a procfs file /proc/powerpc/vcpudispatch_stats that can be
      used to obtain these statistics. Writing '1' to this file enables
      collecting the statistics, while writing '0' disables the statistics.
      The statistics themselves are available by reading the procfs file. By
      default, the DTLB log for each vcpu is processed 50 times a second so
      as not to miss any entries. This processing frequency can be changed
      through /proc/powerpc/vcpudispatch_stats_freq.
      Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      d62c8dee
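From user space, the interface described in this commit message is driven by plain file writes and reads. The helpers below are a hedged sketch: `set_dispatch_stats()` and `dump_dispatch_stats()` are hypothetical names, and the path is passed in by the caller, which would be /proc/powerpc/vcpudispatch_stats on a system that provides it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1" (enable) or "0" (disable) to the control file at `path`.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
int set_dispatch_stats(const char *path, int enable)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(enable ? "1" : "0", f) >= 0;
    if (fclose(f) != 0) /* write errors may only surface on close */
        ok = 0;
    return ok ? 0 : -1;
}

/*
 * Copy the contents of the file at `path` to `out`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened.
 */
long dump_dispatch_stats(const char *path, FILE *out)
{
    char buf[4096];
    size_t n;
    long total = 0;

    assert(path != NULL && out != NULL);

    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
        fwrite(buf, 1, n, out);
        total += (long)n;
    }
    fclose(f);
    return total;
}
```

A typical session would call `set_dispatch_stats(path, 1)`, let the workload run, call `dump_dispatch_stats(path, stdout)`, and finally call `set_dispatch_stats(path, 0)`. Writing the procfs file normally requires root privileges.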