- 19 Oct, 2016 1 commit
-
-
Committed by Michael Ellerman

At boot we dump the NUMA memory topology in dump_numa_memory_topology(), at KERN_DEBUG level, resulting in output like:

  Node 0 Memory: 0x0-0x100000000
  Node 1 Memory: 0x100000000-0x200000000

Which is nice enough, but immediately after that we iterate over each node and call setup_node_data(), which also prints out the node ranges, at KERN_INFO, giving e.g.:

  numa: Initmem setup node 0 [mem 0x00000000-0xffffffff]
  numa: Initmem setup node 1 [mem 0x100000000-0x1ffffffff]

Additionally dump_numa_memory_topology() does not use KERN_CONT correctly, resulting in split output lines on recent kernels.

So drop dump_numa_memory_topology() as superfluous chatter.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 23 Jul, 2016 1 commit
-
-
Install the callbacks via the state machine. On the boot cpu the callback is invoked manually because cpuhp is not up yet and everything must be preinitialized before additional CPUs are up.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bharata B Rao <bharata@linux.vnet.ibm.com>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Christophe Jaillet <christophe.jaillet@wanadoo.fr>
Cc: Anton Blanchard <anton@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160718140727.GA13132@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-
- 14 Jun, 2016 2 commits
-
-
Committed by Bharata B Rao

memory_hotplug_max() uses hot_add_drconf_memory_max() to get the maximum addressable memory by referring to the ibm,dynamic-memory property. There are three problems with the current approach:

1. hot_add_drconf_memory_max() assumes that ibm,dynamic-memory includes all the LMBs of the guest, but that is not true for PowerKVM, which populates only DR LMBs (LMBs that can be hotplugged/removed) in that property.

2. hot_add_drconf_memory_max() multiplies lmb-size with lmb-count to arrive at the max possible address. Since ibm,dynamic-memory doesn't include RMA LMBs, the address thus obtained will be less than the actual max address. For example, if the max possible memory size is 32G, with an lmb-size of 256MB there can be 127 LMBs in ibm,dynamic-memory (1 LMB for RMA, which won't be present here). hot_add_drconf_memory_max() would then return the max addressable memory as 127 * 256MB = 31.75GB, whereas the max address should have been 32G, which is what ibm,lrdr-capacity shows.

3. In PowerKVM, there can be a gap between the end of boot time RAM and the beginning of the hotplug RAM area. So just multiplying lmb-count with lmb-size will not provide the correct max possible address for PowerKVM.

This patch fixes 1 by using the ibm,lrdr-capacity property to return the max addressable memory whenever the property is present. Then it fixes 2 & 3 by fetching the address of the last LMB in the ibm,dynamic-memory property.

Fixes: cd34206e ("powerpc: Add memory_hotplug_max()")
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
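The arithmetic in problem 2 and the "last LMB" fix can be sketched in plain C. This is only an illustration under stated assumptions: `struct lmb_sketch` and both function names are hypothetical, and the sketch assumes the end of the last LMB is its base address plus the LMB size. It is not the kernel's actual implementation.

```c
#include <stdint.h>

/* Hypothetical, simplified LMB descriptor: only the base address matters here. */
struct lmb_sketch {
    uint64_t base_addr;
};

/* Old approach: assumes the property lists every LMB, starting at address 0. */
static uint64_t max_addr_by_count(uint64_t lmb_size, uint64_t lmb_count)
{
    return lmb_size * lmb_count;
}

/* Approach described in the message: derive the max from the last LMB.
 * Assumption: the max address is that LMB's base address plus one LMB size. */
static uint64_t max_addr_by_last_lmb(const struct lmb_sketch *lmbs,
                                     uint64_t lmb_count, uint64_t lmb_size)
{
    if (lmb_count == 0)
        return 0;
    return lmbs[lmb_count - 1].base_addr + lmb_size;
}
```

With 127 LMBs of 256MB whose bases start one LMB above zero (the RMA LMB being absent from the list), the count-based result falls one LMB short of 32G, while the last-LMB result does not. The second form also tolerates a gap between boot RAM and the hotplug area, since it never assumes the LMBs are contiguous from zero.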
-
Committed by Bharata B Rao

Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 06 Nov, 2015 1 commit
-
-
Committed by Raghavendra K T

With setup_nr_nodes(), we have already initialized node_possible_map, so it is safe to use for_each_node here.

There are many places in the kernel that use a hardcoded 'for' loop with nr_node_ids, because all other architectures have numa nodes populated serially. That should be the reason we had maintained the same for powerpc.

But, since sparse numa node ids are possible on powerpc, we unnecessarily allocate memory for non-existent numa nodes.

For example, on a system with 0,1,16,17 as numa nodes, nr_node_ids=18 and we allocate memory for nodes 2-15. With this patch we allocate memory only for existing numa nodes.

The patch is boot tested on a 4 node tuleta, confirming with printks that it works as expected.

Signed-off-by: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Anton Blanchard <anton@samba.org>
Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Greg Kurz <gkurz@linux.vnet.ibm.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
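The difference between a dense loop bounded by nr_node_ids and an iteration over the possible-node map can be sketched as follows. This is a user-space illustration only: the 32-bit mask stands in for node_possible_map, and every name here is hypothetical rather than a kernel API.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_NODES 32 /* hypothetical small limit for this sketch */

/* One past the highest node id set in the mask, like nr_node_ids. */
static unsigned int sketch_nr_node_ids(uint32_t possible_mask)
{
    unsigned int nr = 0;

    for (unsigned int nid = 0; nid < SKETCH_MAX_NODES; nid++)
        if (possible_mask & (UINT32_C(1) << nid))
            nr = nid + 1;
    return nr;
}

/* Dense loop: one allocation per id in [0, nr_node_ids). */
static size_t sketch_allocs_dense(uint32_t possible_mask)
{
    return sketch_nr_node_ids(possible_mask);
}

/* Sparse loop: one allocation per id actually set in the mask. */
static size_t sketch_allocs_sparse(uint32_t possible_mask)
{
    size_t n = 0;

    for (unsigned int nid = 0; nid < SKETCH_MAX_NODES; nid++)
        if (possible_mask & (UINT32_C(1) << nid))
            n++;
    return n;
}
```

For a mask with nodes 0, 1, 16 and 17 set, the dense variant performs 18 allocations and the sparse variant performs 4.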
-
- 15 Oct, 2015 1 commit
-
-
Committed by Christophe Jaillet

of_get_next_parent can be used to simplify the while() loop and avoid the need for a temp variable.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 18 Aug, 2015 1 commit
-
-
Committed by Nikunj A Dadhania

In some situations, a NUMA guest that supports the ibm,dynamic-memory-reconfiguration node will end up having flat NUMA distances between nodes. This is because of two problems in the current code.

1) Different representations of associativity lists.

   There is an assumption about the associativity list in initialize_distance_lookup_table(). The associativity list has two forms:

   a) [cpu,memory]@x/ibm,associativity has the following format:
      <N> <N integers>

   b) ibm,dynamic-reconfiguration-memory/ibm,associativity-lookup-arrays:
      <M> <N> <M associativity lists each having N integers>
      M = the number of associativity lists
      N = the number of entries per associativity list

   Fix initialize_distance_lookup_table() so that it does not assume "case a". And update the caller to skip the length field before sending the associativity list.

2) Distance table not getting updated from the drconf path.

   The node distance table will not get initialized in certain cases, as the ibm,dynamic-reconfiguration-memory path does not initialize the lookup table.

   Call initialize_distance_lookup_table() from the drconf path with the appropriate associativity list.

Reported-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Nikunj A Dadhania <nikunj@linux.vnet.ibm.com>
Acked-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
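The two associativity layouts can be made concrete with a small parsing sketch. It assumes the cells have already been converted to host byte order (real device-tree cells are big-endian), and both function names are hypothetical. Each returns a pointer to the bare list of N entries, which is what a callee that no longer assumes "case a" would receive.

```c
#include <stddef.h>
#include <stdint.h>

/* Form (a): <N> <N integers>.
 * Returns the entries with the leading length cell skipped. */
static const uint32_t *assoc_entries_form_a(const uint32_t *prop, uint32_t *n)
{
    *n = prop[0];
    return &prop[1];
}

/* Form (b): <M> <N> <M lists of N integers>.
 * Returns list number `index`, or NULL if index is out of range. */
static const uint32_t *assoc_entries_form_b(const uint32_t *prop,
                                            uint32_t index, uint32_t *n)
{
    uint32_t m = prop[0];

    *n = prop[1];
    if (index >= m)
        return NULL;
    return &prop[2 + (size_t)index * (*n)];
}
```

Because both helpers hand back a list without any header cells, one consumer can process entries from either source identically.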
-
- 23 Mar, 2015 1 commit
-
-
Committed by Nishanth Aravamudan

Raghu noticed an issue with excessive memory allocation on power with a simple cgroup test, specifically, in mem_cgroup_css_alloc -> for_each_node -> alloc_mem_cgroup_per_zone_info(), which ends up blowing up the kmalloc-2048 slab (on the order of 200MB for 400 cgroup directories).

The underlying issue is that NODES_SHIFT on power is 8 (256 NUMA nodes possible), which defines node_possible_map, which in turn defines the value of nr_node_ids in setup_nr_node_ids and the iteration of for_each_node.

In practice, we never see a system with 256 NUMA nodes, and in fact, we do not support node hotplug on power in the first place, so the nodes that are online when we come up are the nodes that will be present for the lifetime of this kernel. So let's, at least, drop the NUMA possible map down to the online map at runtime. This is similar to what x86 does in its initialization routines.

mem_cgroup_css_alloc should also be fixed to only iterate over memory-populated nodes and handle hotplug, but that is a separate change.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Anton Blanchard <anton@samba.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 18 Mar, 2015 3 commits
-
-
Committed by Greg Kurz

The goal behind this patch is to be able to write userland tests for the VPHN parsing code.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Greg Kurz

The first argument to vphn_unpack_associativity() is a const long *, but the parsing code actually expects __be64 values. Let's move the endian fixing down for consistency.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Greg Kurz

The number of values returned by the H_HOME_NODE_ASSOCIATIVITY h_call deserves to be explicitly defined, for a better understanding of the code.

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 25 Nov, 2014 1 commit
-
-
Committed by Grant Likely

The OF_RECONFIG notifier callback uses a different structure depending on whether it is a node change or a property change. This is silly, and not very safe. Rework the code to use the same data structure regardless of the type of notifier.

Signed-off-by: Grant Likely <grant.likely@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Cc: <linuxppc-dev@lists.ozlabs.org>
-
- 10 Nov, 2014 2 commits
-
-
Committed by Anton Blanchard

We did part of sparse initialisation in setup_arch and part in initmem_init. Put them together.

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Anton Blanchard

At the moment we transition from the memblock allocator to the bootmem allocator. Getting rid of the bootmem allocator removes a bunch of complicated code (most of which I owe the dubious honour of being responsible for writing).

Signed-off-by: Anton Blanchard <anton@samba.org>
Tested-by: Emil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 29 Oct, 2014 2 commits
-
-
Committed by Nishanth Aravamudan

We received a report of a warning in kernel/sched/core.c where the sched group was NULL on an LPAR after a topology update. This seems to occur because after the topology update has moved the CPUs, cpu_to_node is still returning the old value, which ends up breaking the consistency of the NUMA topology in the per-cpu maps. Ensure that we update the per-cpu fields when we re-map CPUs.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Nishanth Aravamudan

There isn't any need to keep referring to update->cpu, as we've already checked cpu == update->cpu at this point.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 16 Oct, 2014 1 commit
-
-
Committed by Greg Kurz

The associativity domain numbers are obtained from the hypervisor through registers and written into memory by the guest: the packed array passed to vphn_unpack_associativity() is then native-endian, unlike what was assumed in the following commit:

  commit b08a2a12
  Author: Alistair Popple <alistair@popple.id.au>
  Date: Wed Aug 7 02:01:44 2013 +1000

      powerpc: Make NUMA device node code endian safe

This issue fills the topology with bogus data and makes it unusable. It may lead to severe performance breakdowns.

We should ideally patch the vphn_unpack_associativity() function to do the 64-bit loads, but this requires some more brainstorming.

In the meantime, let's go for a suboptimal and temporary bug fix: this patch converts each 64-bit value of the packed array to big endian, as expected by the current parsing code in vphn_unpack_associativity().

Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
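The temporary fix, converting each native-endian 64-bit value so that its in-memory bytes are big-endian, can be sketched portably in user space. The kernel has its own conversion helpers for this. The two functions below are hypothetical stand-ins, written so the result is the same on little-endian and big-endian hosts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return a value whose in-memory byte layout is the big-endian
 * encoding of x, regardless of the host's endianness. */
static uint64_t sketch_to_be64(uint64_t x)
{
    unsigned char bytes[8];
    uint64_t out;

    for (int i = 0; i < 8; i++)
        bytes[i] = (unsigned char)(x >> (56 - 8 * i));
    memcpy(&out, bytes, sizeof(out));
    return out;
}

/* Convert every element of a packed array in place. */
static void sketch_pack_to_be64(uint64_t *packed, size_t count)
{
    for (size_t i = 0; i < count; i++)
        packed[i] = sketch_to_be64(packed[i]);
}
```

After the conversion, a parser that reads the array byte by byte (or as big-endian 16-bit fields) sees the most significant byte of each register first, which is the layout the message says the existing parsing code expects.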
-
- 13 Oct, 2014 2 commits
-
-
Committed by Nishanth Aravamudan

We have hit a few customer issues with the topology update code (VPHN and PRRN). It would be nice to be able to debug the notifications coming from the hypervisor in both cases to the LPAR, as well as to disable responding to the notifications at boot-time, to narrow down the source of the problems. Add a basic level of such functionality, similar to the numa= command-line parameter.

We already have a toggle in /proc/powerpc/topology_updates that allows run-time enabling/disabling, so the updates can be started at run-time if desired. But the bugs we've run into have occurred during boot or very shortly after coming to login, and have resulted in a broken NUMA topology.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Nishanth Aravamudan

proc_create can fail; we should check the return value and pass up the failure.

Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 25 Sep, 2014 3 commits
-
-
Committed by Li Zhong

This patch changes some error handling logic in numa_setup_cpu() for the case where the cpu node is not found:

- If the cpu is possible, but not present, -1 is kept in numa_cpu_lookup_table, so that later, if the cpu is added, we can set the correct numa information for it.

- If the cpu is present, then we set the first online node in numa_cpu_lookup_table instead of 0 (in case 0 might not be an online node?).

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
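The fallback rule described in the message can be written as a small decision function. This is a sketch of the rule as stated, not the kernel's numa_setup_cpu(). The function name and parameters are hypothetical, and a negative `found_nid` stands for "cpu node not found".

```c
/* Decide which node id to record for a cpu.
 *   found_nid         node id from the lookup, negative if not found
 *   cpu_present       nonzero if the cpu is present, zero if only possible
 *   first_online_nid  id of the first online node
 * Returns the node id to store, or -1 to leave the entry unset. */
static int sketch_pick_node(int found_nid, int cpu_present,
                            int first_online_nid)
{
    if (found_nid >= 0)
        return found_nid;        /* lookup succeeded: use it as-is */
    if (!cpu_present)
        return -1;               /* possible but absent: decide later */
    return first_online_nid;     /* present: fall back to an online node */
}
```

Keeping -1 for absent cpus is what allows a later hot-add to fill in the real node instead of inheriting a guessed one.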
-
Committed by Li Zhong

As Nish suggested, it makes more sense to init the numa node information for present cpus at boot time, which could also avoid the WARN_ON(1) in numa_setup_cpu().

With this change, we also need to change smp_prepare_cpus() to set up numa information only on present cpus.

For those possible, but not present cpus, their numa information will be set up after they are started, as the original code did before commit 2fabf084.

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: Cyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
Committed by Li Zhong

With commit 2fabf084 ("powerpc: reorder per-cpu NUMA information's initialization"), during boot time, cpu_numa_callback() is called earlier (before they come online) for each cpu, and verify_cpu_node_mapping() uses cpu_to_node() to check whether siblings are in the same node.

It skips the check for siblings that are not online yet. So the only check done here is for the boot cpu, which is online at that time. But the per-cpu numa_node that cpu_to_node() uses hasn't been set up yet (it will be set up in smp_prepare_cpus()).

So I saw something like the following reported:

  [ 0.000000] CPU thread siblings 1/2/3 and 0 don't belong to the same node!

As we don't actually do the checking during this early stage, maybe we could directly call numa_setup_cpu() in do_init_bootmem().

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Acked-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
-
- 20 Sep, 2014 1 commit
-
-
Committed by Scott Wood

Commit 1c98025c "powerpc: Dynamic DMA zone limits" updated how zones are created in paging_init(), but missed the NUMA version of paging_init(). This was noticed via a linker error, since dma_pfn_limit_to_zone() was, like the non-NUMA paging_init(), limited by #ifndef CONFIG_NEED_MULTIPLE_NODES.

It turns out that the NUMA paging_init() was not actually doing anything different from the standard paging_init(), other than a couple of debug prints, a couple of 32-bit-only ifdef sections, and a call to mark_nonram_nosave(). It's not clear whether mark_nonram_nosave() is inherently wrong to do for NUMA, or just not useful on targets that have NUMA, but for now I'm preserving the existing behavior.

Fixes: 1c98025c "powerpc: Dynamic DMA zone limits"
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Scott Wood <scottwood@freescale.com>
-
- 13 Aug, 2014 1 commit
-
-
Committed by Nishanth Aravamudan

There is an issue currently where NUMA information is used on powerpc (and possibly ia64) before it has been read from the device-tree, which leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

On NUMA powerpc, a non-boot CPU's cpu_to_node/cpu_to_mem is only accurate after start_secondary(), similar to ia64, which is invoked via smp_init().

Commit 6ee0578b ("workqueue: mark init_workqueues() as early_initcall()") made init_workqueues() be invoked via do_pre_smp_initcalls(), which is obviously before the secondary processors are online.

Additionally, the following commits changed init_workqueues() to use cpu_to_node to determine the node to use for kthread_create_on_node:

  bce90380 ("workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]")
  f3f90ad4 ("workqueue: determine NUMA node of workers accourding to the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to a high number of slab deactivations (http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory node mapping as early as possible, which on powerpc is do_init_bootmem(). Currently that function initializes the mapping for the boot CPU, but we extend it to set up the mapping for all possible CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the per-cpu values for all possible CPUs. That ensures that before the early_initcalls run (and really as early as possible), the per-cpu NUMA mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a guest topology of:

  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
  node 0 size: 0 MB
  node 0 free: 0 MB
  node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
  node 1 size: 16336 MB
  node 1 free: 15329 MB
  node distances:
  node   0   1
    0:  10  40
    1:  40  10

the slab consumption decreases from

  Slab:             932416 kB
  SUnreclaim:       902336 kB

to

  Slab:             395264 kB
  SUnreclaim:       359424 kB

And we see a corresponding increase in the slab efficiency from

  slab                  mem      objs     slabs
                        used     active   active
  ------------------------------------------------------------
  kmalloc-16384       337 MB    11.28%   100.00%
  task_struct         288 MB     9.93%   100.00%

to

  slab                  mem      objs     slabs
                        used     active   active
  ------------------------------------------------------------
  kmalloc-16384        37 MB   100.00%   100.00%
  task_struct          31 MB   100.00%   100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d8 "powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261 "powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also helped improve memory consumption in these kinds of environments.

Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 05 Aug, 2014 1 commit
-
-
Committed by Andrey Utkin

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631
Reported-by: David Binderman <dcb314@hotmail.com>
Signed-off-by: Andrey Utkin <andrey.krieger.utkin@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 19 Apr, 2014 1 commit
-
-
Committed by Mike Qiu

  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  ...
  Building modules, stage 2.
  WARNING: 1 bad relocations
  c0000000013d6a30 R_PPC64_ADDR64    uprobes_fetch_type_table
  WRAP    arch/powerpc/boot/zImage.pseries
  WRAP    arch/powerpc/boot/zImage.epapr
  MODPOST 1849 modules
  ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
  make[1]: *** [__modpost] Error 1
  make: *** [modules] Error 2
  make: *** Waiting for unfinished jobs....

The reason is that the symbol "__node_distance" has not been exported in powerpc.

Signed-off-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 Apr, 2014 1 commit
-
-
Committed by Michael Wang

Since v1: Edited the comment according to Srivatsa's suggestion.

During testing, we encountered the WARN below, followed by an Oops:

  WARNING: at kernel/sched/core.c:6218
  ...
  NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
  LR [c000000000101358] .build_sched_domains+0xec8/0x1200
  PACATMSCRATCH [800000000000f032]
  Call Trace:
  [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
  [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
  [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
  [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
  ...
  Oops: Kernel access of bad area, sig: 11 [#1]
  ...
  NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
  LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
  PACATMSCRATCH [8000000000029032]
  Call Trace:
  [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
  [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
  [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
  [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
  [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
  ...

This was caused by 'sd->groups == NULL' after building groups, which in turn was caused by an empty 'sd->span'.

The cpu's domain contained nothing because the cpu was assigned to a wrong node, due to the following unfortunate sequence of events:

1. The hypervisor sent a topology update to the guest OS, to notify changes to the cpu-node mapping. However, the update was actually redundant, i.e., the "new" mapping was exactly the same as the old one.

2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting the 'for-loop' in arch_update_cpu_topology().

3. So we ended up calling stop-machine() with an empty cpumask list, which made stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0, as the cpu to run the payload (the update_cpu_topology() function).

4. This causes update_cpu_topology() to be run by CPU0. And since 'updates' is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology() finds update->cpu as well as update->new_nid to be 0. In other words, we end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.

Along with the subsequent wrong updates, this causes the sched-domain rebuild code to break and crash the system.

Fix this by skipping the topology update in cases where we find that the topology has not actually changed (i.e., spurious updates).

CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Robert Jennings <rcj@linux.vnet.ibm.com>
CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
CC: Alistair Popple <alistair@popple.id.au>
Suggested-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
Reviewed-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
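The guard described in the fix, doing nothing when no cpu actually changed node, can be sketched in user space as follows. A 64-bit word stands in for the 'updated_cpus' cpumask and plain arrays stand in for the cpu-to-node mapping. The function name and the 64-cpu limit are assumptions of this sketch, not kernel details.

```c
#include <stdint.h>

#define SKETCH_MAX_CPUS 64 /* hypothetical limit: one bit per cpu */

/* Apply a proposed cpu->node mapping.
 * Returns 1 if anything changed, 0 for a spurious (redundant) update. */
static int sketch_apply_topology_update(int *cpu_to_nid, const int *new_nid,
                                        unsigned int ncpus)
{
    uint64_t updated_cpus = 0;
    unsigned int cpu;

    if (ncpus > SKETCH_MAX_CPUS)
        ncpus = SKETCH_MAX_CPUS;

    /* Collect only the cpus whose node really differs. */
    for (cpu = 0; cpu < ncpus; cpu++)
        if (new_nid[cpu] != cpu_to_nid[cpu])
            updated_cpus |= UINT64_C(1) << cpu;

    /* The guard: an empty mask means there is no work to hand off,
     * so return before any "run on these cpus" step is attempted. */
    if (updated_cpus == 0)
        return 0;

    for (cpu = 0; cpu < ncpus; cpu++)
        if (updated_cpus & (UINT64_C(1) << cpu))
            cpu_to_nid[cpu] = new_nid[cpu];

    return 1;
}
```

The early return is the point of the sketch: without it, the later step would run with an empty set of targets, which is the situation steps 3 and 4 above describe.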
-
- 29 Jan, 2014 1 commit
-
-
Committed by Joe Perches

This should have been octal.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
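The message does not show the value that was changed, so the following is only a generic illustration of why a decimal literal is wrong where an octal permission mode is meant. The value 644 is a hypothetical example, and the helper name is made up for this sketch.

```c
/* In C, a leading 0 makes an integer literal octal.
 * 0644 (octal) is rw-r--r--, i.e. decimal 420.
 * 644 (decimal) is 01204 in octal, so its low nine permission
 * bits are 0204 (-w----r--), which is rarely what was intended. */
#define SKETCH_MODE_OCTAL   0644
#define SKETCH_MODE_DECIMAL 644

/* Keep only the nine rwxrwxrwx permission bits of a mode value. */
static unsigned int sketch_perm_bits(unsigned int mode)
{
    return mode & 0777;
}
```

Passing `SKETCH_MODE_DECIMAL` where a file mode is expected therefore yields different permission bits than `SKETCH_MODE_OCTAL`.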
-
- 22 Jan, 2014 1 commit
-
-
Committed by Tang Chen

[sfr@canb.auug.org.au: fix powerpc build]
Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 Jan, 2014 2 commits
-
-
Committed by Srivatsa S. Bhat

There have been some weird bugs in the past where the kernel tried to associate threads of the same core to different NUMA nodes, and things went haywire after that point (as expected).

But unfortunately, root-causing such issues has been quite challenging, due to the lack of appropriate debug checks in the kernel. These bugs usually lead to some odd soft-lockups in the scheduler's build-sched-domain code in the CPU hotplug path, which makes it very hard to trace them back to the incorrect cpu-to-node mappings.

So add appropriate debug checks to catch such invalid cpu-to-node mappings as early as possible.

Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Srivatsa S. Bhat

On POWER platforms, the hypervisor can notify the guest kernel about dynamic changes in the cpu-numa associativity (VPHN topology update). Hence the cpu-to-node mappings that we got from the firmware during boot may no longer be valid after such updates. This is handled using the arch_update_cpu_topology() hook in the scheduler, and the sched-domains are rebuilt according to the new mappings.

But unfortunately, at the moment, CPU hotplug ignores these updated mappings and instead queries the firmware for the cpu-to-numa relationships and uses them during CPU online. So the kernel can end up assigning wrong NUMA nodes to CPUs during subsequent CPU hotplug online operations (after booting).

Further, a particularly problematic scenario can result from this bug: On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8) threads per core. The switch to Single-Threaded (ST) mode is performed by offlining all except the first CPU thread in each core. Switching back to SMT mode involves onlining those other threads back, in each core.

Now consider this scenario:

1. During boot, the kernel gets the cpu-to-node mappings from the firmware and assigns the CPUs to NUMA nodes appropriately, during CPU online.

2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and communicates this update to the kernel. The kernel in turn updates its cpu-to-node associations and rebuilds its sched domains. Everything is fine so far.

3. Now, the user switches the machine from SMT to ST mode (say, by running ppc64_cpu --smt=1). This involves offlining all except 1 thread in each core.

4. The user then tries to switch back from ST to SMT mode (say, by running ppc64_cpu --smt=4), and this involves onlining those threads back. Since CPU hotplug ignores the new mappings, it queries the firmware and tries to associate the newly onlined sibling threads to the old NUMA nodes. This results in sibling threads within the same core getting associated with different NUMA nodes, which is incorrect.

The scheduler's build-sched-domains code gets thoroughly confused with this and enters an infinite loop and causes soft-lockups, as explained in detail in commit 3be7db6a (powerpc: VPHN topology change updates all siblings).

So to fix this, use the numa_cpu_lookup_table to remember the updated cpu-to-node mappings, and use them during CPU hotplug online operations. Further, we also need to ensure that all threads in a core are assigned to a common NUMA node, irrespective of whether all those threads were online during the topology update. To achieve this, we take care not to use cpu_sibling_mask() since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread() and set up the mappings manually using the 'threads_per_core' value for that particular platform. This helps us ensure that we don't hit this bug with any combination of CPU hotplug and SMT mode switching.

Cc: stable@vger.kernel.org
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
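The hotplug-invariant approach, deriving the core's first thread from the cpu number and then walking 'threads_per_core' entries, can be sketched as below. The sketch assumes that sibling threads have contiguous cpu numbers and that threads_per_core is a power of two. The function names and the plain int array standing in for numa_cpu_lookup_table are hypothetical.

```c
/* First thread of the core that `cpu` belongs to.
 * Assumes threads_per_core is a power of two and that a core's
 * threads are numbered contiguously. */
static unsigned int sketch_first_thread(unsigned int cpu,
                                        unsigned int threads_per_core)
{
    return cpu & ~(threads_per_core - 1);
}

/* Record `nid` for every thread of cpu's core, whether or not
 * those threads are currently online. */
static void sketch_set_core_node(int *lookup, unsigned int ncpus,
                                 unsigned int cpu,
                                 unsigned int threads_per_core, int nid)
{
    unsigned int first = sketch_first_thread(cpu, threads_per_core);

    for (unsigned int i = 0; i < threads_per_core && first + i < ncpus; i++)
        lookup[first + i] = nid;
}
```

Because the walk depends only on the cpu number and threads_per_core, it gives the same result in ST mode as in SMT mode, which a mask of currently online siblings would not.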
-
- 13 Nov, 2013 1 commit
-
-
Committed by Xishi Qiu

Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn + pgdat->node_spanned_pages". Simplify the code, no functional change.

Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
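The equivalence the message relies on can be shown with a stand-in structure. The struct and helper below are hypothetical user-space mirrors of the two fields named in the message, not the kernel's definitions.

```c
/* Hypothetical stand-in holding only the two fields the message names. */
struct sketch_pgdat {
    unsigned long node_start_pfn;
    unsigned long node_spanned_pages;
};

/* One past the last page frame number spanned by the node:
 * the expression the helper replaces, wrapped in a function. */
static unsigned long sketch_pgdat_end_pfn(const struct sketch_pgdat *pgdat)
{
    return pgdat->node_start_pfn + pgdat->node_spanned_pages;
}
```

Callers then write `sketch_pgdat_end_pfn(pgdat)` instead of repeating the sum, which is the "no functional change" simplification described.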
-
- 30 Oct, 2013 1 commit
-
-
Committed by Robert Jennings

Simple fixes for sparse warnings in this file.

Resolves:
  arch/powerpc/mm/numa.c:198:24: warning: Using plain integer as NULL pointer
  arch/powerpc/mm/numa.c:1157:5: warning: symbol 'hot_add_node_scn_to_nid' was not declared. Should it be static?
  arch/powerpc/mm/numa.c:1238:28: warning: Using plain integer as NULL pointer
  arch/powerpc/mm/numa.c:1538:6: warning: symbol 'topology_schedule_update' was not declared. Should it be static?

Signed-off-by: Robert C Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 14 Aug, 2013 2 commits
-
-
Committed by Alistair Popple

The device tree is big endian so make sure we byteswap on little endian. We assume any pHyp calls also return big endian results in memory.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Anton Blanchard

Although the shared_proc field in the lppaca works today, it is not architected. A shared processor partition will always have a non-zero yield_count, so use that instead. Create a wrapper so users don't have to know about the details.

In order for older kernels to continue to work on KVM we need to set the shared_proc bit. While here, remove the ugly bitfield.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 01 Aug, 2013 1 commit
-
-
Committed by Robert Jennings
When an associativity level change is found for one thread, the sibling threads need to be updated as well. This is done today for PRRN in stage_topology_update() but is missing for VPHN in update_cpu_associativity_changes_mask(). This patch will correctly update all thread siblings during a topology change.

Without this patch a topology update can result in a CPU in init_sched_groups_power() getting stuck indefinitely in a loop. This loop is built in build_sched_groups(). As a result of the thread moving to a node separate from its siblings the struct sched_group will have its next pointer set to point to itself rather than the sched_group struct of the next thread. This happens because we have a domain without the SD_OVERLAP flag, which is correct, and a topology that doesn't conform with reality (threads on the same core assigned to different numa nodes). When this list is traversed by init_sched_groups_power() it will reach the thread's sched_group structure and loop indefinitely; the cpu will be stuck at this point.

The bug was exposed when VPHN was enabled in commit b7abef04 (v3.9).

Cc: <stable@vger.kernel.org> [v3.9+]
Reported-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
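The idea of updating every sibling when one thread changes can be sketched with plain bitmasks. This is an illustration only, not the patch: it assumes at most 64 CPUs, a fixed `THREADS_PER_CORE` of 4, and siblings numbered contiguously, and the names `sibling_mask()` and `mark_core_changed()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THREADS_PER_CORE 4u /* assumption for illustration only */

/* Bitmask covering every thread on the same core as 'cpu', assuming
 * sibling threads are numbered contiguously. */
static uint64_t sibling_mask(unsigned int cpu)
{
	assert(cpu < 64);
	unsigned int first = cpu - (cpu % THREADS_PER_CORE);
	return ((UINT64_C(1) << THREADS_PER_CORE) - 1) << first;
}

/* When one thread reports an associativity change, flag the whole core
 * so that no thread is left behind on the old node. */
static uint64_t mark_core_changed(uint64_t changes, unsigned int cpu)
{
	return changes | sibling_mask(cpu);
}
```

Flagging only the single bit for `cpu` would reproduce the situation the commit message describes, where threads of one core end up assigned to different NUMA nodes.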
-
- 01 Jul, 2013 2 commits
-
-
Committed by Nathan Fontenot
The topology update code that updates the cpu node registration in sysfs should not be called while in stop_machine(). The register/unregister calls take a lock and may sleep. This patch moves these calls outside of the call to stop_machine().

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Paul Gortmaker
The __cpuinit type of throwaway sections might have made sense some time ago when RAM was more constrained, but now the savings do not offset the cost and complications. For example, the fix in commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time") is a good example of the nasty type of bugs that can be created with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go the way of devinit and be phased out. Once all the users are gone, we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros. There are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
- 30 Apr, 2013 2 commits
-
-
Committed by Nathan Fontenot
Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined. The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP system doesn't really make sense.

This patch ifdef's out the code making the affinity updates for PRRN events to fix the following build break.

arch/powerpc/mm/numa.c: In function ‘stage_topology_update’:
arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’
arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast
make[1]: *** [arch/powerpc/mm/numa.o] Error 1

Signed-off-by: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-
Committed by Stephen Rothwell
After merging the cgroup tree, today's linux-next build (powerpc ppc64_defconfig) failed like this:

arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]

Caused by commit 30c05350 ("powerpc/pseries: Use stop machine to update cpu maps") from the powerpc tree interacting with (probably) commit ff794dea ("cpuset: remove include of cgroup.h from cpuset.h") from the cgroup tree. Removing includes from header files is fraught with danger ... The former should have added an include of linux/slab.h to arch/powerpc/mm/numa.c.

I have added the following merge fix patch for today (but it should be applied to the powerpc tree ASAP).

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Apr 2013 14:01:44 +1000
Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including slab.h

fixes these build errors:

arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]

Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
-