1. 23 Dec, 2021 1 commit
  2. 15 Nov, 2021 2 commits
  3. 26 Aug, 2021 4 commits
    • powerpc/numa: Update cpu_cpu_map on CPU online/offline · 9a245d0e
      Srikar Dronamraju committed
      cpu_cpu_map holds all the CPUs in the DIE. However, on PowerPC this
      mask is not updated when CPUs are onlined or offlined; it is only
      updated when CPUs are added or removed. So when onlining/offlining
      of CPUs and adding/removing of CPUs happen simultaneously, the
      cpumasks end up broken.
      
      WARNING: CPU: 13 PID: 1142 at kernel/sched/topology.c:898
      build_sched_domains+0xd48/0x1720
      Modules linked in: rpadlpar_io rpaphp mptcp_diag xsk_diag tcp_diag
      udp_diag raw_diag inet_diag unix_diag af_packet_diag netlink_diag
      bonding tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
      nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
      nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
      rfkill nf_tables nfnetlink pseries_rng xts vmx_crypto uio_pdrv_genirq
      uio binfmt_misc ip_tables xfs libcrc32c dm_service_time sd_mod t10_pi sg
      ibmvfc scsi_transport_fc ibmveth dm_multipath dm_mirror dm_region_hash
      dm_log dm_mod fuse
      CPU: 13 PID: 1142 Comm: kworker/13:2 Not tainted 5.13.0-rc6+ #28
      Workqueue: events cpuset_hotplug_workfn
      NIP:  c0000000001caac8 LR: c0000000001caac4 CTR: 00000000007088ec
      REGS: c00000005596f220 TRAP: 0700   Not tainted  (5.13.0-rc6+)
      MSR:  8000000000029033 <SF,EE,ME,IR,DR,RI,LE>  CR: 48828222  XER:
      00000009
      CFAR: c0000000001ea698 IRQMASK: 0
      GPR00: c0000000001caac4 c00000005596f4c0 c000000001c4a400 0000000000000036
      GPR04: 00000000fffdffff c00000005596f1d0 0000000000000027 c0000018cfd07f90
      GPR08: 0000000000000023 0000000000000001 0000000000000027 c0000018fe68ffe8
      GPR12: 0000000000008000 c00000001e9d1880 c00000013a047200 0000000000000800
      GPR16: c000000001d3c7d0 0000000000000240 0000000000000048 c000000010aacd18
      GPR20: 0000000000000001 c000000010aacc18 c00000013a047c00 c000000139ec2400
      GPR24: 0000000000000280 c000000139ec2520 c000000136c1b400 c000000001c93060
      GPR28: c00000013a047c20 c000000001d3c6c0 c000000001c978a0 000000000000000d
      NIP [c0000000001caac8] build_sched_domains+0xd48/0x1720
      LR [c0000000001caac4] build_sched_domains+0xd44/0x1720
      Call Trace:
      [c00000005596f4c0] [c0000000001caac4] build_sched_domains+0xd44/0x1720 (unreliable)
      [c00000005596f670] [c0000000001cc5ec] partition_sched_domains_locked+0x3ac/0x4b0
      [c00000005596f710] [c0000000002804e4] rebuild_sched_domains_locked+0x404/0x9e0
      [c00000005596f810] [c000000000283e60] rebuild_sched_domains+0x40/0x70
      [c00000005596f840] [c000000000284124] cpuset_hotplug_workfn+0x294/0xf10
      [c00000005596fc60] [c000000000175040] process_one_work+0x290/0x590
      [c00000005596fd00] [c0000000001753c8] worker_thread+0x88/0x620
      [c00000005596fda0] [c000000000181704] kthread+0x194/0x1a0
      [c00000005596fe10] [c00000000000ccec] ret_from_kernel_thread+0x5c/0x70
      Instruction dump:
      485af049 60000000 2fa30800 409e0028 80fe0000 e89a00f8 e86100e8 38da0120
      7f88e378 7ce53b78 4801fb91 60000000 <0fe00000> 39000000 38e00000 38c00000
      
      Fix this by updating cpu_cpu_map aka cpumask_of_node() on every CPU
      online/offline.
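
      A minimal sketch of the idea (not the literal patch): keep the
      per-node cpumask behind cpumask_of_node() in sync from the CPU
      hotplug path. The helper name below is illustrative; only
      node_to_cpumask_map, cpumask_set_cpu/cpumask_clear_cpu and
      cpu_to_node are existing kernel symbols.

      	/* Illustrative helper, called from the CPU online/offline path */
      	static void numa_update_cpu_mask(unsigned int cpu, bool online)
      	{
      		int nid = cpu_to_node(cpu);

      		if (online)
      			cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
      		else
      			cpumask_clear_cpu(cpu, node_to_cpumask_map[nid]);
      	}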
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210826100521.412639-5-srikar@linux.vnet.ibm.com
      9a245d0e
    • powerpc/numa: Print debug statements only when required · 544a09ee
      Srikar Dronamraju committed
      Currently, a debug message gets printed on every attempt to add (or
      remove) a CPU, even when the CPU has already been added to (or
      removed from) the node, which makes the message redundant.
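
      A hedged sketch of the change: only emit the debug print when the
      node's cpumask actually changes. The helper shape is illustrative;
      cpumask_test_cpu/cpumask_set_cpu and node_to_cpumask_map are real
      kernel symbols.

      	static void map_cpu_to_node(int cpu, int node)
      	{
      		/* Already in the node's mask: nothing changed, stay quiet */
      		if (cpumask_test_cpu(cpu, node_to_cpumask_map[node]))
      			return;

      		cpumask_set_cpu(cpu, node_to_cpumask_map[node]);
      		pr_debug("adding cpu %d to node %d\n", cpu, node);
      	}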
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210826100521.412639-4-srikar@linux.vnet.ibm.com
      544a09ee
    • powerpc/numa: convert printk to pr_xxx · 506c2075
      Srikar Dronamraju committed
      Convert the remaining printk calls to pr_xxx. One advantage is that
      all prints will now carry the "numa:" prefix supplied by pr_fmt().
      
      [ convert printk(KERN_ERR) to pr_warn : Suggested by Laurent Dufour ]
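
      A small sketch of the resulting pattern (the helper and message text
      are illustrative, not hunks from the patch): with pr_fmt() defined at
      the top of the file, every pr_xxx() call picks up the prefix
      automatically.

      	#define pr_fmt(fmt) "numa: " fmt

      	#include <linux/printk.h>

      	static void report_missing_associativity(int cpu)
      	{
      		/* was printk(KERN_ERR ...); now prints "numa: ..." */
      		pr_warn("no NUMA associativity found for cpu %d\n", cpu);
      	}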
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Rebase onto powerpc/next, s/WARNING/Warning/]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210826100521.412639-3-srikar@linux.vnet.ibm.com
      506c2075
    • powerpc/numa: Drop dbg in favour of pr_debug · 544af642
      Srikar Dronamraju committed
      powerpc supports numa=debug, which is not documented. This option
      was used to print early debug output. However, something more
      flexible can be achieved by using CONFIG_DYNAMIC_DEBUG.

      Hence drop dbg (and numa=debug) in favour of pr_debug.
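
      A hedged sketch of the replacement (the call site and message are
      illustrative): the former dbg() sites become pr_debug(), which with
      CONFIG_DYNAMIC_DEBUG can be switched on at runtime instead of via a
      boot option.

      	#include <linux/printk.h>

      	/* Enable at runtime with dynamic debug, e.g.:
      	 *   echo 'file arch/powerpc/mm/numa.c +p' \
      	 *     > /sys/kernel/debug/dynamic_debug/control
      	 */
      	static void report_cpu_node(int cpu, int nid)
      	{
      		pr_debug("cpu %d associated with node %d\n", cpu, nid);
      	}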
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Rebase on to powerpc/next form2 affinity changes]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210826100521.412639-2-srikar@linux.vnet.ibm.com
      544af642
  4. 13 Aug, 2021 5 commits
  5. 10 Aug, 2021 1 commit
    • powerpc/numa: Consider the max NUMA node for migratable LPAR · 9c7248bb
      Laurent Dufour committed
      When an LPAR is migratable, we should consider the maximum possible
      NUMA node instead of the number of NUMA nodes on the actual system.

      The DT property 'ibm,current-associativity-domains' defines the
      maximum number of nodes the LPAR can see when running on that box.
      But if the LPAR is migrated to another box, it may see up to the
      number of nodes defined by 'ibm,max-associativity-domains'. So if an
      LPAR is migratable, that value should be used.

      Unfortunately, there is no easy way to know whether an LPAR is
      migratable. The hypervisor exposes the property
      'ibm,migratable-partition' when it is set up to migrate partitions,
      but that does not mean the current partition is migratable.

      Without this patch, when an LPAR is started on a 2-node box and then
      migrated to a 3-node box, the hypervisor may spread the LPAR's CPUs
      over the 3rd node. In that case, if a CPU from that 3rd node is added
      to the LPAR, it will be assigned to the wrong node because the kernel
      was set up to use at most 2 nodes (the configuration of the departure
      node). With this patch applied, the CPU is correctly added to the 3rd
      node.
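
      A hedged sketch of the selection logic described above (the helper
      shape is illustrative; the property names come from this commit, and
      of_root, of_get_property() and of_property_read_bool() are real
      kernel APIs):

      	#include <linux/of.h>

      	/* Pick which /rtas associativity-domains property bounds the
      	 * possible nodes: trust "current" only if the partition cannot
      	 * migrate to a bigger box. */
      	static const __be32 *assoc_domains_prop(struct device_node *rtas)
      	{
      		const __be32 *prop = NULL;

      		if (!of_property_read_bool(of_root, "ibm,migratable-partition"))
      			prop = of_get_property(rtas,
      					"ibm,current-associativity-domains", NULL);
      		if (!prop)
      			prop = of_get_property(rtas,
      					"ibm,max-associativity-domains", NULL);
      		return prop;
      	}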
      
      Fixes: f9f130ff ("powerpc/numa: Detect support for coregroup")
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20210511073136.17795-1-ldufour@linux.ibm.com
      9c7248bb
  6. 27 Nov, 2020 1 commit
    • powerpc/numa: Fix a regression on memoryless node 0 · 10f78fd0
      Srikar Dronamraju committed
      Commit e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      offlines node 0 and expects nodes to be subsequently onlined when CPUs
      or nodes are detected.
      
      Commit 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      skips onlining node 0 when CPUs are associated with node 0.
      
      On systems where node 0 has CPUs but no memory, this causes node 0
      to be marked offline, which leads to issues at boot time when trying
      to set the memory node for online CPUs while building the zonelist.
      
      0:mon> t
      [link register   ] c000000000400354 __build_all_zonelists+0x164/0x280
      [c00000000161bda0] c0000000016533c8 node_states+0x20/0xa0 (unreliable)
      [c00000000161bdc0] c000000000400384 __build_all_zonelists+0x194/0x280
      [c00000000161be30] c000000001041800 build_all_zonelists_init+0x4c/0x118
      [c00000000161be80] c0000000004020d0 build_all_zonelists+0x190/0x1b0
      [c00000000161bef0] c000000001003cf8 start_kernel+0x18c/0x6a8
      [c00000000161bf90] c00000000000adb4 start_here_common+0x1c/0x3e8
      0:mon> r
      R00 = c000000000400354   R16 = 000000000b57a0e8
      R01 = c00000000161bda0   R17 = 000000000b57a6b0
      R02 = c00000000161ce00   R18 = 000000000b5afee8
      R03 = 0000000000000000   R19 = 000000000b6448a0
      R04 = 0000000000000000   R20 = fffffffffffffffd
      R05 = 0000000000000000   R21 = 0000000001400000
      R06 = 0000000000000000   R22 = 000000001ec00000
      R07 = 0000000000000001   R23 = c000000001175580
      R08 = 0000000000000000   R24 = c000000001651ed8
      R09 = c0000000017e84d8   R25 = c000000001652480
      R10 = 0000000000000000   R26 = c000000001175584
      R11 = c000000c7fac0d10   R27 = c0000000019568d0
      R12 = c000000000400180   R28 = 0000000000000000
      R13 = c000000002200000   R29 = c00000000164dd78
      R14 = 000000000b579f78   R30 = 0000000000000000
      R15 = 000000000b57a2b8   R31 = c000000001175584
      pc  = c000000000400194 local_memory_node+0x24/0x80
      cfar= c000000000074334 mcount+0xc/0x10
      lr  = c000000000400354 __build_all_zonelists+0x164/0x280
      msr = 8000000002001033   cr  = 44002284
      ctr = c000000000400180   xer = 0000000000000001   trap =  380
      dar = 0000000000001388   dsisr = c00000000161bc90
      0:mon>
      
      Fix this by setting the node online while onlining CPUs that belong
      to node 0.
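
      A minimal sketch of the fix (the helper name is illustrative;
      node_online(), node_set_online(), set_cpu_numa_node() and
      cpumask_set_cpu() are real kernel APIs):

      	static void map_cpu_to_online_node(int cpu, int nid)
      	{
      		/* Onlining a CPU of node 0 must also bring node 0 online */
      		if (!node_online(nid))
      			node_set_online(nid);

      		set_cpu_numa_node(cpu, nid);
      		cpumask_set_cpu(cpu, node_to_cpumask_map[nid]);
      	}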
      
      Fixes: e75130f2 ("powerpc/numa: Offline memoryless cpuless node 0")
      Fixes: 6398eaa2 ("powerpc/numa: Prefer node id queried from vphn")
      Reported-by: Milan Mohanty <milmohan@in.ibm.com>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20201127053738.10085-1-srikar@linux.vnet.ibm.com
      10f78fd0
  7. 14 Oct, 2020 1 commit
    • arch, mm: replace for_each_memblock() with for_each_mem_pfn_range() · c9118e6c
      Mike Rapoport committed
      There are several occurrences of the following pattern:
      
      	for_each_memblock(memory, reg) {
      		start_pfn = memblock_region_memory_base_pfn(reg);
      		end_pfn = memblock_region_memory_end_pfn(reg);
      
      		/* do something with start_pfn and end_pfn */
      	}
      
      Rather than iterate over all memblock.memory regions and query their
      start and end PFNs each time, use the for_each_mem_pfn_range()
      iterator to get simpler and clearer code.
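
      For comparison, the same loop after the conversion would look roughly
      like this (a sketch mirroring the fragment above;
      for_each_mem_pfn_range() is the real memblock iterator and also
      yields the nid for free):

      	#include <linux/memblock.h>

      	unsigned long start_pfn, end_pfn;
      	int i, nid;

      	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
      		/* do something with start_pfn and end_pfn */
      	}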
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>	[.clang-format]
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Emil Renner Berthing <kernel@esmil.dk>
      Cc: Hari Bathini <hbathini@linux.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Paul Walmsley <paul.walmsley@sifive.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Stafford Horne <shorne@gmail.com>
      Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Link: https://lkml.kernel.org/r/20200818151634.14343-12-rppt@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9118e6c
  8. 06 Oct, 2020 1 commit
    • pseries/hotplug-memory: hot-add: skip redundant LMB lookup · 72cdd117
      Scott Cheloha committed
      During memory hot-add, dlpar_add_lmb() calls memory_add_physaddr_to_nid()
      to determine which node id (nid) to use when later calling __add_memory().
      
      This is wasteful.  On pseries, memory_add_physaddr_to_nid() finds an
      appropriate nid for a given address by looking up the LMB containing the
      address and then passing that LMB to of_drconf_to_nid_single() to get the
      nid.  In dlpar_add_lmb() we get this address from the LMB itself.
      
      In short, we have a pointer to an LMB and then we are searching for
      that LMB *again* in order to find its nid.
      
      If we call of_drconf_to_nid_single() directly from dlpar_add_lmb() we
      can skip the redundant lookup.  The only error handling we need to
      duplicate from memory_add_physaddr_to_nid() is the fallback to the
      default nid when of_drconf_to_nid_single() returns -1 (NUMA_NO_NODE)
      or an invalid nid.
      
      Skipping the extra lookup makes hot-add operations faster, especially
      on machines with many LMBs.
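
      A hedged sketch of that shape (not the literal hunk; struct drmem_lmb,
      of_drconf_to_nid_single(), node_possible() and first_online_node are
      existing kernel symbols, the wrapper itself is illustrative):

      	static int lmb_to_nid(struct drmem_lmb *lmb)
      	{
      		int nid = of_drconf_to_nid_single(lmb);

      		/* mirror memory_add_physaddr_to_nid()'s fallback */
      		if (nid < 0 || !node_possible(nid))
      			nid = first_online_node;

      		return nid;
      	}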
      
      Consider an LPAR with 126976 LMBs.  In one test, hot-adding 126000
      LMBs on an unpatched kernel took ~3.5 hours while a patched kernel
      completed the same operation in ~2 hours:
      
      Unpatched (12450 seconds):
      Sep  9 04:06:31 ltc-brazos1 drmgr[810169]: drmgr: -c mem -a -q 126000
      Sep  9 04:06:31 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  9 07:34:01 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      Patched (7065 seconds):
      Sep  8 21:49:57 ltc-brazos1 drmgr[877703]: drmgr: -c mem -a -q 126000
      Sep  8 21:49:57 ltc-brazos1 kernel: pseries-hotplug-mem: Attempting to hot-add 126000 LMB(s)
      [...]
      Sep  8 23:27:42 ltc-brazos1 kernel: pseries-hotplug-mem: Memory at 20000000 (drc index 80000002) was hot-added
      
      It should be noted that the speedup grows more substantial when
      hot-adding LMBs at the end of the drconf range.  This is because we
      are skipping a linear LMB search.
      
      To see the distinction, consider a smaller hot-add test on the same
      LPAR.  A perf-stat run with 10 iterations showed that hot-adding 4096
      LMBs completed less than 1 second faster on a patched kernel:
      
      Unpatched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,753.42 msec task-clock                #    0.992 CPUs utilized            ( +-  0.55% )
                   4,708      context-switches          #    0.045 K/sec                    ( +-  0.69% )
                   2,444      cpu-migrations            #    0.023 K/sec                    ( +-  1.25% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.22% )
         445,902,503,057      cycles                    #    4.257 GHz                      ( +-  0.55% )  (66.67%)
           8,558,376,740      stalled-cycles-frontend   #    1.92% frontend cycles idle     ( +-  0.88% )  (49.99%)
         300,346,181,651      stalled-cycles-backend    #   67.36% backend cycles idle      ( +-  0.76% )  (50.01%)
         258,091,488,691      instructions              #    0.58  insn per cycle
                                                        #    1.16  stalled cycles per insn  ( +-  0.22% )  (66.67%)
          70,568,169,256      branches                  #  673.660 M/sec                    ( +-  0.17% )  (50.01%)
           3,100,725,426      branch-misses             #    4.39% of all branches          ( +-  0.20% )  (49.99%)
      
                 105.583 +- 0.589 seconds time elapsed  ( +-  0.56% )
      
      Patched:
       Performance counter stats for 'drmgr -c mem -a -q 4096' (10 runs):
      
              104,055.69 msec task-clock                #    0.993 CPUs utilized            ( +-  0.32% )
                   4,606      context-switches          #    0.044 K/sec                    ( +-  0.20% )
                   2,463      cpu-migrations            #    0.024 K/sec                    ( +-  0.93% )
                     394      page-faults               #    0.004 K/sec                    ( +-  0.25% )
         442,951,129,921      cycles                    #    4.257 GHz                      ( +-  0.32% )  (66.66%)
           8,710,413,329      stalled-cycles-frontend   #    1.97% frontend cycles idle     ( +-  0.47% )  (50.06%)
         299,656,905,836      stalled-cycles-backend    #   67.65% backend cycles idle      ( +-  0.39% )  (50.02%)
         252,731,168,193      instructions              #    0.57  insn per cycle
                                                        #    1.19  stalled cycles per insn  ( +-  0.20% )  (66.66%)
          68,902,851,121      branches                  #  662.173 M/sec                    ( +-  0.13% )  (49.94%)
           3,100,242,882      branch-misses             #    4.50% of all branches          ( +-  0.15% )  (49.98%)
      
                 104.829 +- 0.325 seconds time elapsed  ( +-  0.31% )
      
      This is consistent.  An add-by-count hot-add operation adds LMBs
      greedily, so LMBs near the start of the drconf range are considered
      first.  On an otherwise idle LPAR with so many LMBs we would expect to
      find the LMBs we need near the start of the drconf range, hence the
      smaller speedup.
      Signed-off-by: Scott Cheloha <cheloha@linux.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200916145122.3408129-1-cheloha@linux.ibm.com
      72cdd117
  9. 16 Sep, 2020 7 commits
    • powerpc/smp: Implement cpu_to_coregroup_id · fa35e868
      Srikar Dronamraju committed
      Look up the coregroup id from the associativity array.

      If unable to detect the coregroup id, fall back to the core id. This
      way we ensure the sched_domain degenerates and an extra sched domain
      is not created.

      Ideally this function should have been implemented in
      arch/powerpc/kernel/smp.c. However, if it is implemented in
      mm/numa.c, we don't need to find the primary domain again.

      If the device-tree mentions more than one coregroup, then the kernel
      implements only the last or the smallest coregroup, which currently
      corresponds to the penultimate domain in the device-tree.
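
      A hedged sketch of the lookup described above (vphn_get_associativity
      is an assumed helper name; cpu_to_core_id() and of_read_number() are
      real kernel APIs):

      	int cpu_to_coregroup_id(int cpu)
      	{
      		__be32 associativity[64] = { 0 };
      		int nr_levels;

      		if (vphn_get_associativity(cpu, associativity))
      			return cpu_to_core_id(cpu);	/* fallback */

      		nr_levels = of_read_number(associativity, 1);
      		/* last/smallest coregroup: penultimate associativity level */
      		return of_read_number(&associativity[nr_levels - 1], 1);
      	}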
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-11-srikar@linux.vnet.ibm.com
      fa35e868
    • powerpc/smp: Create coregroup domain · 72730bfc
      Srikar Dronamraju committed
      Add percpu coregroup maps and masks to create the coregroup domain.
      If a coregroup doesn't exist, the coregroup domain will be degenerated
      in favour of the SMT/CACHE domain. Do note this patch only creates a
      stub for cpu_to_coregroup_id; the actual cpu_to_coregroup_id
      implementation comes in a subsequent patch.
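
      A hedged sketch of the stub and mask wiring (illustrative; the
      DEFINE_PER_CPU/cpumask_var_t pattern mirrors the existing sibling
      maps in the powerpc smp code, cpu_to_core_id() is a real kernel API):

      	static DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);

      	static const struct cpumask *cpu_coregroup_mask(int cpu)
      	{
      		return per_cpu(cpu_coregroup_map, cpu);
      	}

      	int cpu_to_coregroup_id(int cpu)
      	{
      		/* stub: collapse onto the core so the domain degenerates */
      		return cpu_to_core_id(cpu);
      	}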
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-10-srikar@linux.vnet.ibm.com
      72730bfc
    • powerpc/numa: Detect support for coregroup · f9f130ff
      Srikar Dronamraju committed
      Add support for grouping cores based on the device-tree classification.
      - The last domain in the associativity domains always refers to the
      core.
      - If the primary reference domain happens to be the penultimate
      domain in the associativity-domains device-tree property, then there
      are no coregroups. However, if it is not the penultimate domain, then
      there are coregroups. There can be more than one coregroup; for now
      we are only interested in the last, i.e. the smallest, coregroup: one
      sub-group per DIE (see the sketch below).

      Currently no firmware exposes this grouping. Hence allow the basis
      for the grouping to be abstract.  Once firmware starts using this
      grouping, code would be added to detect the type of grouping and
      adjust the sd domain flags accordingly.
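
      A hedged one-line encoding of the detection rule above (the variable
      names are assumptions, not the kernel's):

      	/* coregroups exist iff the primary reference domain is NOT the
      	 * penultimate entry of ibm,associativity-reference-points */
      	static bool has_coregroup(int primary_domain_index, int nr_domains)
      	{
      		return primary_domain_index != nr_domains - 2;
      	}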
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200810071834.92514-8-srikar@linux.vnet.ibm.com
      f9f130ff
    • powerpc/numa: Offline memoryless cpuless node 0 · e75130f2
      Srikar Dronamraju committed
      Currently, a Linux kernel with CONFIG_NUMA on a system with multiple
      possible nodes marks node 0 as online at boot.  However, in practice
      there are systems where node 0 is memoryless and cpuless.

      This can cause numa_balancing to be enabled on systems with only one
      node that has memory and CPUs. The existence of this cpuless,
      memoryless dummy node can confuse users/scripts looking at the output
      of lscpu / numactl.

      By marking node 0 as offline, let's stop assuming that node 0 is
      always online. If node 0 has CPUs or memory that are online, node 0
      will again be set online.
      
      v5.8
       available: 2 nodes (0,2)
       node 0 cpus:
       node 0 size: 0 MB
       node 0 free: 0 MB
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31490 MB
       node distances:
       node   0   2
         0:  10  20
         2:  20  10
      
      proc and sys files
      ------------------
       /sys/devices/system/node/online:            0,2
       /proc/sys/kernel/numa_balancing:            1
       /sys/devices/system/node/has_cpu:           2
       /sys/devices/system/node/has_memory:        2
       /sys/devices/system/node/has_normal_memory: 2
       /sys/devices/system/node/possible:          0-31
      
      v5.8 + patch
      ------------------
       available: 1 nodes (2)
       node 2 cpus: 0 1 2 3 4 5 6 7
       node 2 size: 32625 MB
       node 2 free: 31487 MB
       node distances:
       node   2
         2:  10
      
      proc and sys files
      ------------------
      /sys/devices/system/node/online:            2
      /proc/sys/kernel/numa_balancing:            0
      /sys/devices/system/node/has_cpu:           2
      /sys/devices/system/node/has_memory:        2
      /sys/devices/system/node/has_normal_memory: 2
      /sys/devices/system/node/possible:          0-31
      
      Example of a system with online CPUs/memory on node 0.
      (Same output with and without the patch.)
      numactl -H
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
      node 0 size: 32482 MB
      node 0 free: 22994 MB
      node 1 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
      node 1 size: 0 MB
      node 1 free: 0 MB
      node 2 cpus: 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143
      node 2 size: 0 MB
      node 2 free: 0 MB
      node 3 cpus: 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
      node 3 size: 0 MB
      node 3 free: 0 MB
      node distances:
      node   0   1   2   3
        0:  10  20  40  40
        1:  20  10  40  40
        2:  40  40  10  20
        3:  40  40  20  10
      
      Note: On Powerpc, cpu_to_node() of possible but not present cpus
      would previously return 0. Hence this commit depends on commit
      ("powerpc/numa: Set numa_node for all possible cpus") and commit
      ("powerpc/numa: Prefer node id queried from vphn"). Without those two
      commits, a Powerpc system might crash.
      
      1. User space applications like numactl and lscpu that parse sysfs
      tend to believe there is an extra online node. This confuses users
      and applications. Other user space applications start believing that
      the system was not able to use all of its resources (i.e. resources
      are missing) or that the system was not set up correctly.

      2. The existence of the dummy node also leads to inconsistent
      information: the number of online nodes is inconsistent with the
      information in the device-tree and resource dump.

      3. When the dummy node is present, single-node non-NUMA systems end
      up showing up as NUMA systems and numa_balancing gets enabled. This
      means we take the hit from unnecessary NUMA hinting faults.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-4-srikar@linux.vnet.ibm.com
      e75130f2
    • powerpc/numa: Prefer node id queried from vphn · 6398eaa2
      Srikar Dronamraju committed
      The node id queried from the static device tree may not be correct.
      For example, it may always show 0 on a shared processor. Hence prefer
      the node id queried from VPHN, and fall back to the device-tree-based
      node id if the VPHN query fails.
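
      A hedged sketch of that preference order (vphn_get_nid is an assumed
      helper name; NUMA_NO_NODE and of_node_to_nid() are real kernel
      symbols):

      	static int lookup_cpu_nid(int cpu, struct device_node *cpu_node)
      	{
      		int nid = vphn_get_nid(cpu);		/* prefer VPHN */

      		if (nid == NUMA_NO_NODE)
      			nid = of_node_to_nid(cpu_node);	/* device-tree fallback */

      		return nid;
      	}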
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-3-srikar@linux.vnet.ibm.com
      6398eaa2
    • powerpc/numa: Set numa_node for all possible cpus · a874f100
      Srikar Dronamraju committed
      A Powerpc system with multiple possible nodes and with CONFIG_NUMA
      enabled always used to have a node 0, even if node 0 has no cpus or
      memory attached to it. As per PAPR, the node affinity of a cpu is
      only available once it is present/online. For all cpus that are
      possible but not present, cpu_to_node() would point to node 0.

      To ensure a cpuless, memoryless dummy node is not online, powerpc
      needs to make sure the cpu_to_node() of all possible but not present
      cpus is set to a proper node.
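
      A hedged sketch of that idea (the fallback to first_online_node is an
      assumption of this sketch; the loop helpers, cpu_present() and
      set_cpu_numa_node() are real kernel APIs):

      	static void __init map_absent_cpus(void)
      	{
      		unsigned int cpu;

      		for_each_possible_cpu(cpu) {
      			if (cpu_present(cpu))
      				continue;
      			/* possible-but-not-present: pick a sane node */
      			set_cpu_numa_node(cpu, first_online_node);
      		}
      	}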
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200818081104.57888-2-srikar@linux.vnet.ibm.com
      a874f100
    • powerpc/numa: Restrict possible nodes based on platform · 67df7784
      Srikar Dronamraju committed
      As per draft LoPAPR (Revision 2.9_pre7), section B.5.3 "Run Time
      Abstraction Services (RTAS) Node" available at:
        https://openpowerfoundation.org/wp-content/uploads/2020/07/LoPAR-20200611.pdf
      
      ... there are 2 device tree properties:
      
        "ibm,max-associativity-domains"
         which defines the maximum number of domains that the firmware,
         i.e. PowerVM, can support.
      
      and:
      
        "ibm,current-associativity-domains"
         which defines the maximum number of domains that the current
         platform can support.
      
      The value of "ibm,max-associativity-domains" is always greater than or
      equal to "ibm,current-associativity-domains" property. If the latter
      property is not available, use "ibm,max-associativity-domain" as a
      fallback. In this yet to be released LoPAPR, "ibm,current-associativity-domains"
      is mentioned in page 833 / B.5.3 which is covered under under
      "Appendix B. System Binding" section
      
      Currently powerpc uses the "ibm,max-associativity-domains" property
      while setting the possible number of nodes. This is currently set at
      32. However, the possible number of nodes for a platform may be
      significantly lower. Hence set the possible number of nodes based on
      the "ibm,current-associativity-domains" property (see the sketch
      below).
      
      Nathan Lynch had raised a valid concern that after LPM (Live
      Partition Migration), a user could DLPAR-add processors and memory
      with "new" associativity properties:
        https://lore.kernel.org/linuxppc-dev/871rljfet9.fsf@linux.ibm.com/t/#u
      
      He also pointed out that "ibm,max-associativity-domains" has the same
      contents on all currently available PowerVM systems, unlike
      "ibm,current-associativity-domains" and hence may be better able to
      handle the new NUMA associativity properties.
      
      However, with the recent commit dbce4562 ("powerpc/numa: Limit
      possible nodes to within num_possible_nodes"), all new NUMA
      associativity properties are capped to the initially set nr_node_ids.
      Hence this commit should be safe with any new DLPAR add post LPM.
      
        $ lsprop /proc/device-tree/rtas/ibm,*associ*-domains
        /proc/device-tree/rtas/ibm,current-associativity-domains
        		 00000005 00000001 00000002 00000002 00000002 00000010
        /proc/device-tree/rtas/ibm,max-associativity-domains
        		 00000005 00000001 00000008 00000020 00000020 00000100
      
        $ cat /sys/devices/system/node/possible ##Before patch
        0-31
      
        $ cat /sys/devices/system/node/possible ##After patch
        0-1
      
      Note that the maximum number of nodes this platform can support is
      only 2, but the possible nodes were set to 32.

      This is important because a lot of kernel and user space code
      allocates structures for all possible nodes, leading to a lot of
      memory that is allocated but never used.
      
      I ran a simple experiment creating and destroying 100 memory cgroups
      at boot on an 8-node machine (Power8 Alpine).
      
      Before patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4106816   518820608       22272      570752   516606720
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4628416   518246464       22336      623296   516058688
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4697408   518173760       22400      627008   515987904
        Swap:       4194240           0     4194240
      
      After patch:
        free -k at boot
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     3969472   518933888       22272      594816   516731776
        Swap:       4194240           0     4194240
      
        free -k after creating 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4181888   518676096       22208      640192   516496448
        Swap:       4194240           0     4194240
      
        free -k after destroying 100 memory cgroups
                      total        used        free      shared  buff/cache   available
        Mem:      523498176     4232320   518619904       22272      645952   516443264
        Swap:       4194240           0     4194240
      
      Observations:
        The fixed kernel takes 137344 kB (4106816-3969472) less memory to boot.
        The fixed kernel takes 309184 kB (4628416-4181888-137344) less memory to create 100 memcgs.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      [mpe: Reformat change log a bit for readability]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200817055257.110873-1-srikar@linux.vnet.ibm.com
      67df7784
  10. 08 Aug, 2020 1 commit
  11. 29 Jul, 2020 1 commit
  12. 26 Jul, 2020 1 commit
  13. 16 Jul, 2020 11 commits
  14. 04 Mar, 2020 3 commits