提交 · 4b6cfb2a8cd7520e8a747718e5c1da047697ca31 · openeuler / raspberrypi-kernel

18 3月, 2015 3 次提交

powerpc/vphn: move VPHN parsing logic to a separate file · 4b6cfb2a

由 Greg Kurz 提交于 2月 23, 2015

The goal behind this patch is to be able to write userland tests for the
VPHN parsing code.
Suggested-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

4b6cfb2a

powerpc/vphn: move endianness fixing to vphn_unpack_associativity() · b1fc9484

由 Greg Kurz 提交于 2月 23, 2015

The first argument to vphn_unpack_associativity() is a const long *, but the
parsing code expects __be64 values actually. Let's move the endian fixing
down for consistency.
Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

b1fc9484

powerpc/vphn: clarify the H_HOME_NODE_ASSOCIATIVITY API · 9d0e6d9d

由 Greg Kurz 提交于 2月 23, 2015

The number of values returned by the H_HOME_NODE_ASSOCIATIVITY h_call deserves
to be explicitly defined, for a better understanding of the code.
Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

9d0e6d9d

25 11月, 2014 1 次提交

of/reconfig: Always use the same structure for notifiers · f5242e5a

由 Grant Likely 提交于 11月 24, 2014

The OF_RECONFIG notifier callback uses a different structure depending
on whether it is a node change or a property change. This is silly, and
not very safe. Rework the code to use the same data structure regardless
of the type of notifier.
Signed-off-by: NGrant Likely <grant.likely@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Pantelis Antoniou <pantelis.antoniou@konsulko.com>
Cc: <linuxppc-dev@lists.ozlabs.org>

f5242e5a

10 11月, 2014 2 次提交

powerpc: Move sparse_init() into initmem_init · 21098b9e

由 Anton Blanchard 提交于 9月 17, 2014

We did part of sparse initialisation in setup_arch and part in
initmem_init. Put them together.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Tested-by: NEmil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

21098b9e

powerpc: Remove bootmem allocator · 10239733

由 Anton Blanchard 提交于 9月 17, 2014

At the moment we transition from the memblock alloctor to the bootmem
allocator. Gitting rid of the bootmem allocator removes a bunch of
complicated code (most of which I owe the dubious honour of being
responsible for writing).
Signed-off-by: NAnton Blanchard <anton@samba.org>
Tested-by: NEmil Medve <Emilian.Medve@Freescale.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

10239733

29 10月, 2014 2 次提交

powerpc/numa: ensure per-cpu NUMA mappings are correct on topology update · 2c0a33f9

由 Nishanth Aravamudan 提交于 10月 17, 2014

We received a report of warning in kernel/sched/core.c where the sched
group was NULL on an LPAR after a topology update. This seems to occur
because after the topology update has moved the CPUs, cpu_to_node is
returning the old value still, which ends up breaking the consistency of
the NUMA topology in the per-cpu maps. Ensure that we update the per-cpu
fields when we re-map CPUs.
Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2c0a33f9

powerpc/numa: use cached value of update->cpu in update_cpu_topology · 49f8d8c0

由 Nishanth Aravamudan 提交于 10月 17, 2014

There isn't any need to keep referring to update->cpu, as we've already
checked cpu == update->cpu at this point.
Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

49f8d8c0

16 10月, 2014 1 次提交

powerpc/vphn: NUMA node code expects big-endian · 5c9fb189

由 Greg Kurz 提交于 10月 15, 2014

The associativity domain numbers are obtained from the hypervisor through
registers and written into memory by the guest: the packed array passed to
vphn_unpack_associativity() is then native-endian, unlike what was assumed
in the following commit:

commit b08a2a12
Author: Alistair Popple <alistair@popple.id.au>
Date:   Wed Aug 7 02:01:44 2013 +1000

    powerpc: Make NUMA device node code endian safe

This issue fills the topology with bogus data and makes it unusable. It may
lead to severe performance breakdowns.

We should ideally patch the vphn_unpack_associativity() function to do the
64-bit loads, but this requires some more brain storming.

In the meantime, let's go for a suboptimal and temporary bug fix: this patch
converts each 64-bit value of the packed array to big endian, as expected by
the current parsing code in vphn_unpack_associativity().
Signed-off-by: NGreg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

5c9fb189

13 10月, 2014 2 次提交

powerpc/numa: Add ability to disable and debug topology updates · 2d73bae1

由 Nishanth Aravamudan 提交于 10月 10, 2014

We have hit a few customer issues with the topology update code (VPHN
and PRRN). It would be nice to be able to debug the notifications coming
from the hypervisor in both cases to the LPAR, as well as to disable
responding to the notifications at boot-time, to narrow down the source
of the problems. Add a basic level of such functionality, similar to the
numa= command-line parameter. We already have a toggle in
/proc/powerpc/topology_updates that allows run-time enabling/disabling,
so the updates can be started at run-time if desired. But the bugs we've
run into have occured during boot or very shortly after coming to login,
and have resulted in a broken NUMA topology.
Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2d73bae1

powerpc/numa: check error return from proc_create · 2d15b9b4

由 Nishanth Aravamudan 提交于 10月 09, 2014

proc_create can fail, we should check the return value and pass up the
failure.
Suggested-by: NMichael Ellerman <mpe@ellerman.id.au>
Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

2d15b9b4

25 9月, 2014 3 次提交

powerpc: some changes in numa_setup_cpu() · 297cf502

由 Li Zhong 提交于 8月 27, 2014

this patches changes some error handling logics in numa_setup_cpu(),
when cpu node is not found, so:

if the cpu is possible, but not present, -1 is kept in numa_cpu_lookup_table,
so later, if the cpu is added, we could set correct numa information for it.

if the cpu is present, then we set the first online node to
numa_cpu_lookup_table instead of 0 ( in case 0 might not be an online node? )

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

297cf502

powerpc: Only set numa node information for present cpus at boottime · bc3c4327

由 Li Zhong 提交于 8月 27, 2014

As Nish suggested, it makes more sense to init the numa node informatiion
for present cpus at boottime, which could also avoid WARN_ON(1) in
numa_setup_cpu().

With this change, we also need to change the smp_prepare_cpus() to set up
numa information only on present cpus.

For those possible, but not present cpus, their numa information
will be set up after they are started, as the original code did before commit
2fabf084.

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Tested-by: NCyril Bur <cyril.bur@au1.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

bc3c4327

powerpc: Fix warning reported by verify_cpu_node_mapping() · 70ad2375

由 Li Zhong 提交于 8月 27, 2014

With commit 2fabf084 ("powerpc: reorder per-cpu NUMA information's
initialization"), during boottime, cpu_numa_callback() is called
earlier(before their online) for each cpu, and verify_cpu_node_mapping()
uses cpu_to_node() to check whether siblings are in the same node.

It skips the checking for siblings that are not online yet. So the only
check done here is for the bootcpu, which is online at that time. But
the per-cpu numa_node cpu_to_node() uses hasn't been set up yet (which
will be set up in smp_prepare_cpus()).

So I saw something like following reported:
[    0.000000] CPU thread siblings 1/2/3 and 0 don't belong to the same
node!

As we don't actually do the checking during this early stage, so maybe
we could directly call numa_setup_cpu() in do_init_bootmem().

Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NLi Zhong <zhong@linux.vnet.ibm.com>
Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>

70ad2375

20 9月, 2014 1 次提交

powerpc/mm: Use common paging_init() for NUMA · 6db35ad2

由 Scott Wood 提交于 9月 18, 2014

Commit 1c98025c "powerpc: Dynamic DMA
zone limits" updated how zones are created in paging_init(), but missed
the NUMA version of paging_init().  This was noticed via a linker
error, since dma_pfn_limit_to_zone() was, like the non-NUMA
paging_init(), limited by #ifndef CONFIG_NEED_MULTIPLE_NODES.

It turns out that the NUMA paging_init() was not actually doing
anything different from the standard paging_init(), other than a couple
debug prints, a couple 32-bit-only ifdef sections, and a call to
mark_nonram_nosave().  It's not clear whether mark_nonram_nosave() is
inherently wrong to do for NUMA, or just not useful on targets that
have NUMA, but for now I'm preserving the existing behavior.

Fixes: 1c98025c "powerpc: Dynamic DMA zone limits"
Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NScott Wood <scottwood@freescale.com>

6db35ad2

13 8月, 2014 1 次提交

powerpc: reorder per-cpu NUMA information's initialization · 2fabf084

由 Nishanth Aravamudan 提交于 7月 17, 2014

There is an issue currently where NUMA information is used on powerpc
(and possibly ia64) before it has been read from the device-tree, which
leads to large slab consumption with CONFIG_SLUB and memoryless nodes.

NUMA powerpc non-boot CPU's cpu_to_node/cpu_to_mem is only accurate
after start_secondary(), similar to ia64, which is invoked via
smp_init().

Commit 6ee0578b ("workqueue: mark init_workqueues() as
early_initcall()") made init_workqueues() be invoked via
do_pre_smp_initcalls(), which is obviously before the secondary
processors are online.

Additionally, the following commits changed init_workqueues() to use
cpu_to_node to determine the node to use for kthread_create_on_node:

bce90380 ("workqueue: add wq_numa_tbl_len and
wq_numa_possible_cpumask[]")
f3f90ad4 ("workqueue: determine NUMA node of workers accourding to
the allowed cpumask")

Therefore, when init_workqueues() runs, it sees all CPUs as being on
Node 0. On LPARs or KVM guests where Node 0 is memoryless, this leads to
a high number of slab deactivations
(http://www.spinics.net/lists/linux-mm/msg67489.html).

Fix this by initializing the powerpc-specific CPU<->node/local memory
node mapping as early as possible, which on powerpc is
do_init_bootmem(). Currently that function initializes the mapping for
the boot CPU, but we extend it to setup the mapping for all possible
CPUs. Then, in smp_prepare_cpus(), we can correspondingly set the
per-cpu values for all possible CPUs. That ensures that before the
early_initcalls run (and really as early as possible), the per-cpu NUMA
mapping is accurate.

While testing memoryless nodes on PowerKVM guests with a fix to the
workqueue logic to use cpu_to_mem() instead of cpu_to_node(), with a
guest topology of:

available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49
node 0 size: 0 MB
node 0 free: 0 MB
node 1 cpus: 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99
node 1 size: 16336 MB
node 1 free: 15329 MB
node distances:
node   0   1
  0:  10  40
  1:  40  10

the slab consumption decreases from

Slab:             932416 kB
SUnreclaim:       902336 kB

to

Slab:             395264 kB
SUnreclaim:       359424 kB

And we a corresponding increase in the slab efficiency from

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                       337 MB   11.28%  100.00%
task_struct                         288 MB    9.93%  100.00%

to

slab                                   mem     objs    slabs
                                      used   active   active
------------------------------------------------------------
kmalloc-16384                        37 MB  100.00%  100.00%
task_struct                          31 MB  100.00%  100.00%

Powerpc didn't support memoryless nodes until recently (64bb80d8
"powerpc/numa: Enable CONFIG_HAVE_MEMORYLESS_NODES" and 8c272261
"powerpc/numa: Enable USE_PERCPU_NUMA_NODE_ID"). Those commits also
helped improve memory consumption with these kind of environments.
Signed-off-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

2fabf084

05 8月, 2014 1 次提交

powerpc/mm/numa: Fix break placement · b00fc6ec

由 Andrey Utkin 提交于 8月 04, 2014

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=81631Reported-by: NDavid Binderman <dcb314@hotmail.com>
Signed-off-by: NAndrey Utkin <andrey.krieger.utkin@gmail.com>
CC: <stable@vger.kernel.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

b00fc6ec

19 4月, 2014 1 次提交

powerpc/mm: fix ".__node_distance" undefined · 12c743eb

由 Mike Qiu 提交于 4月 18, 2014

  CHK     include/config/kernel.release
  CHK     include/generated/uapi/linux/version.h
  CHK     include/generated/utsrelease.h
  ...
  Building modules, stage 2.
WARNING: 1 bad relocations
c0000000013d6a30 R_PPC64_ADDR64    uprobes_fetch_type_table
  WRAP    arch/powerpc/boot/zImage.pseries
  WRAP    arch/powerpc/boot/zImage.epapr
  MODPOST 1849 modules
ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
make[1]: *** [__modpost] Error 1
make: *** [modules] Error 2
make: *** Waiting for unfinished jobs....

The reason is symbol "__node_distance" not been exported in powerpc.
Signed-off-by: NMike Qiu <qiudayu@linux.vnet.ibm.com>
Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
Cc: Alistair Popple <alistair@popple.id.au>
Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

12c743eb

09 4月, 2014 1 次提交

power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update · 9a013361

由 Michael Wang 提交于 4月 08, 2014

Since v1:
	Edited the comment according to Srivatsa's suggestion.

During the testing, we encounter below WARN followed by Oops:

	WARNING: at kernel/sched/core.c:6218
	...
	NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
	LR [c000000000101358] .build_sched_domains+0xec8/0x1200
	PACATMSCRATCH [800000000000f032]
	Call Trace:
	[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
	[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
	[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
	[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
	...
	Oops: Kernel access of bad area, sig: 11 [#1]
	...
	NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
	LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
	PACATMSCRATCH [8000000000029032]
	Call Trace:
	[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
	[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
	[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
	[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
	[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
	...

This was caused by that 'sd->groups == NULL' after building groups, which
was caused by the empty 'sd->span'.

The cpu's domain contained nothing because the cpu was assigned to a wrong
node, due to the following unfortunate sequence of events:

1. The hypervisor sent a topology update to the guest OS, to notify changes
   to the cpu-node mapping. However, the update was actually redundant - i.e.,
   the "new" mapping was exactly the same as the old one.

2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
   the 'for-loop' in arch_update_cpu_topology().

3. So we ended up calling stop-machine() with an empty cpumask list, which made
   stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
   the cpu to run the payload (the update_cpu_topology() function).

4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
   is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
   finds update->cpu as well as update->new_nid to be 0. In other words, we
   end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.

Along with the following wrong updating, it causes the sched-domain rebuild
code to break and crash the system.

Fix this by skipping the topology update in cases where we find that
the topology has not actually changed in reality (ie., spurious updates).

CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
CC: Stephen Rothwell <sfr@canb.auug.org.au>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Robert Jennings <rcj@linux.vnet.ibm.com>
CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
CC: Alistair Popple <alistair@popple.id.au>
Suggested-by: N"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
Reviewed-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

9a013361

29 1月, 2014 1 次提交

powerpc/numa: Fix decimal permissions · 316d7188

由 Joe Perches 提交于 1月 28, 2014

This should have been octal.
Signed-off-by: NJoe Perches <joe@perches.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

316d7188

22 1月, 2014 1 次提交

memblock: make memblock_set_node() support different memblock_type · e7e8de59

由 Tang Chen 提交于 1月 21, 2014

[sfr@canb.auug.org.au: fix powerpc build]
Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
Cc: Chen Tang <imtangchen@gmail.com>
Cc: Gong Chen <gong.chen@linux.intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jiang Liu <jiang.liu@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Liu Jiang <jiang.liu@huawei.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Wen Congyang <wency@cn.fujitsu.com>
Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e7e8de59

15 1月, 2014 2 次提交

powerpc: Add debug checks to catch invalid cpu-to-node mappings · 68fb18aa

由 Srivatsa S. Bhat 提交于 12月 30, 2013

There have been some weird bugs in the past where the kernel tried to associate
threads of the same core to different NUMA nodes, and things went haywire after
that point (as expected).

But unfortunately, root-causing such issues have been quite challenging, due to
the lack of appropriate debug checks in the kernel. These bugs usually lead to
some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
hotplug path, which makes it very hard to trace it back to the incorrect
cpu-to-node mappings.

So add appropriate debug checks to catch such invalid cpu-to-node mappings
as early as possible.
Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

68fb18aa

powerpc: Fix the setup of CPU-to-Node mappings during CPU online · d4edc5b6

由 Srivatsa S. Bhat 提交于 12月 30, 2013

On POWER platforms, the hypervisor can notify the guest kernel about dynamic
changes in the cpu-numa associativity (VPHN topology update). Hence the
cpu-to-node mappings that we got from the firmware during boot, may no longer
be valid after such updates. This is handled using the arch_update_cpu_topology()
hook in the scheduler, and the sched-domains are rebuilt according to the new
mappings.

But unfortunately, at the moment, CPU hotplug ignores these updated mappings
and instead queries the firmware for the cpu-to-numa relationships and uses
them during CPU online. So the kernel can end up assigning wrong NUMA nodes
to CPUs during subsequent CPU hotplug online operations (after booting).

Further, a particularly problematic scenario can result from this bug:
On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
threads per core. The switch to Single-Threaded (ST) mode is performed by
offlining all except the first CPU thread in each core. Switching back to
SMT mode involves onlining those other threads back, in each core.

Now consider this scenario:

1. During boot, the kernel gets the cpu-to-node mappings from the firmware
and assigns the CPUs to NUMA nodes appropriately, during CPU online.

2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
communicates this update to the kernel. The kernel in turn updates its
cpu-to-node associations and rebuilds its sched domains. Everything is
fine so far.

3. Now, the user switches the machine from SMT to ST mode (say, by running
ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
core.

4. The user then tries to switch back from ST to SMT mode (say, by running
ppc64_cpu --smt=4), and this involves onlining those threads back. Since
CPU hotplug ignores the new mappings, it queries the firmware and tries to
associate the newly onlined sibling threads to the old NUMA nodes. This
results in sibling threads within the same core getting associated with
different NUMA nodes, which is incorrect.

The scheduler's build-sched-domains code gets thoroughly confused with this
and enters an infinite loop and causes soft-lockups, as explained in detail
in commit 3be7db6a (powerpc: VPHN topology change updates all siblings).

So to fix this, use the numa_cpu_lookup_table to remember the updated
cpu-to-node mappings, and use them during CPU hotplug online operations.
Further, we also need to ensure that all threads in a core are assigned to a
common NUMA node, irrespective of whether all those threads were online during
the topology update. To achieve this, we take care not to use cpu_sibling_mask()
since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
and set up the mappings manually using the 'threads_per_core' value for that
particular platform. This helps us ensure that we don't hit this bug with any
combination of CPU hotplug and SMT mode switching.

Cc: stable@vger.kernel.org
Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

d4edc5b6

13 11月, 2013 1 次提交

mm: use pgdat_end_pfn() to simplify the code in arch · 6408068e

由 Xishi Qiu 提交于 11月 12, 2013

Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages".  Simplify the code, no functional change.
Signed-off-by: NXishi Qiu <qiuxishi@huawei.com>
Cc: James Hogan <james.hogan@imgtec.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

6408068e

30 10月, 2013 1 次提交

powerpc: Fix warnings for arch/powerpc/mm/numa.c · ec32dd66

由 Robert Jennings 提交于 10月 28, 2013

Simple fixes for sparse warnings in this file.

Resolves:
arch/powerpc/mm/numa.c:198:24:
        warning: Using plain integer as NULL pointer

arch/powerpc/mm/numa.c:1157:5:
       warning: symbol 'hot_add_node_scn_to_nid' was not declared.
                Should it be static?

arch/powerpc/mm/numa.c:1238:28:
       warning: Using plain integer as NULL pointer

arch/powerpc/mm/numa.c:1538:6:
       warning: symbol 'topology_schedule_update' was not declared.
                Should it be static?
Signed-off-by: NRobert C Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

ec32dd66

14 8月, 2013 2 次提交

powerpc: Make NUMA device node code endian safe · b08a2a12

由 Alistair Popple 提交于 8月 07, 2013

The device tree is big endian so make sure we byteswap on little
endian. We assume any pHyp calls also return big endian results in
memory.
Signed-off-by: NAlistair Popple <alistair@popple.id.au>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

b08a2a12

powerpc: Stop using non-architected shared_proc field in lppaca · f13c13a0

由 Anton Blanchard 提交于 8月 07, 2013

Although the shared_proc field in the lppaca works today, it is
not architected. A shared processor partition will always have a non
zero yield_count so use that instead. Create a wrapper so users
don't have to know about the details.

In order for older kernels to continue to work on KVM we need
to set the shared_proc bit. While here, remove the ugly bitfield.
Signed-off-by: NAnton Blanchard <anton@samba.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

f13c13a0

01 8月, 2013 1 次提交

powerpc: VPHN topology change updates all siblings · 3be7db6a

由 Robert Jennings 提交于 7月 24, 2013

When an associativity level change is found for one thread, the
siblings threads need to be updated as well. This is done today
for PRRN in stage_topology_update() but is missing for VPHN in
update_cpu_associativity_changes_mask(). This patch will correctly
update all thread siblings during a topology change.

Without this patch a topology update can result in a CPU in
init_sched_groups_power() getting stuck indefinitely in a loop.

This loop is built in build_sched_groups(). As a result of the thread
moving to a node separate from its siblings the struct sched_group will
have its next pointer set to point to itself rather than the sched_group
struct of the next thread. This happens because we have a domain without
the SD_OVERLAP flag, which is correct, and a topology that doesn't conform
with reality (threads on the same core assigned to different numa nodes).
When this list is traversed by init_sched_groups_power() it will reach
the thread's sched_group structure and loop indefinitely; the cpu will
be stuck at this point.

The bug was exposed when VPHN was enabled in commit b7abef04 (v3.9).

Cc: <stable@vger.kernel.org> [v3.9+]
Reported-by: NJan Stancek <jstancek@redhat.com>
Signed-off-by: NRobert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

3be7db6a

01 7月, 2013 2 次提交

powerpc/numa: Do not update sysfs cpu registration from invalid context · dd023217

由 Nathan Fontenot 提交于 6月 24, 2013

The topology update code that updates the cpu node registration in sysfs
should not be called while in stop_machine(). The register/unregister
calls take a lock and may sleep.

This patch moves these calls outside of the call to stop_machine().
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
CC: <stable@vger.kernel.org>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

dd023217

powerpc: Delete __cpuinit usage from all users · 061d19f2

由 Paul Gortmaker 提交于 6月 24, 2013

The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros.  There
are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

061d19f2

30 4月, 2013 3 次提交

powerpc/pseries: Correct builds break when CONFIG_SMP not defined · 601abdc3

由 Nathan Fontenot 提交于 4月 29, 2013

Correct build failure for powerpc/pseries builds with CONFIG_SMP not defined.

The function cpu_sibling_mask has no meaning (or definition) when CONFIG_SMP
is not defined. Additionally, the updating of NUMA affinity for a CPU in a UP
system doesn't really make sense.

This patch ifdef's out the code making the affinity updates for PRRN events to
fix the following build break.

arch/powerpc/mm/numa.c: In function ‘stage_topology_update’:
arch/powerpc/mm/numa.c:1535: error: implicit declaration of function ‘cpu_sibling_mask’
arch/powerpc/mm/numa.c:1535: warning: passing argument 3 of ‘cpumask_or’ makes pointer from integer without a cast
make[1]: *** [arch/powerpc/mm/numa.o] Error 1
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

601abdc3

powerpc: Fix build failure after merge of the cgroup tree · 9f3a90e8

由 Stephen Rothwell 提交于 4月 28, 2013

After merging the cgroup tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/mm/numa.c: In function 'arch_update_cpu_topology':
arch/powerpc/mm/numa.c:1465:2: error: implicit declaration of function 'kzalloc' [-Werror=implicit-function-declaration]
arch/powerpc/mm/numa.c:1465:10: error: assignment makes pointer from integer without a cast [-Werror]
arch/powerpc/mm/numa.c:1497:2: error: implicit declaration of function 'kfree' [-Werror=implicit-function-declaration]

Caused by commit 30c05350 ("powerpc/pseries: Use stop machine to
update cpu maps") from the powerpc tree interacting with (probably)
commit ff794dea ("cpuset: remove include of cgroup.h from cpuset.h")
from the cgroup tree. Removing includes from header files is fraught
with danger ...

The former should have added an include of linux/slab.h to
arch/powerpc/mm/numa.c.

I have added the following merge fix patch for today (but it should be
applied to the powerpc tree ASAP).

From: Stephen Rothwell <sfr@canb.auug.org.au>
Date: Mon, 29 Apr 2013 14:01:44 +1000
Subject: [PATCH] powerpc: numa.c: using kzalloc/kfree requires including
slab.h

fixes these build errors:

9f3a90e8

powerpc/mm/numa: use setup_nr_node_ids() instead of opencoding. · f9d531b8

由 Cody P Schafer 提交于 4月 29, 2013

[sfr@canb.auug.org.au: add missing semicolon]
Signed-off-by: NCody P Schafer <cody@linux.vnet.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f9d531b8

26 4月, 2013 6 次提交

powerpc/pseries: Add /proc interface to control topology updates · e04fa612

由 Nathan Fontenot 提交于 4月 24, 2013

There are instances in which we do not want topology updates to occur.
In order to allow this a /proc interface (/proc/powerpc/topology_updates)
is introduced so that topology updates can be enabled and disabled.

This patch also adds a prrn_is_enabled() call so that PRRN events are
handled in the kernel only if topology updating is enabled.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

e04fa612

powerpc/pseries: RE-enable Virtual Processor Home Node updating · b7abef04

由 Jesse Larrew 提交于 4月 24, 2013

The new PRRN firmware feature provides a more convenient and event-driven
interface than VPHN for notifying Linux of changes to the NUMA affinity of
platform resources. However, for practical reasons, it may not be feasible
for some customers to update to the latest firmware. For these customers,
the VPHN feature supported on previous firmware versions may still be the
best option.

The VPHN feature was previously disabled due to races with the load
balancing code when accessing the NUMA cpu maps, but the new stop_machine()
approach protects the NUMA cpu maps from these concurrent accesses. It
should be safe to re-enable this feature now.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

b7abef04

powerpc/pseries: Update NUMA VDSO information when updating CPU maps · 176bbf14

由 Jesse Larrew 提交于 4月 24, 2013

The following patch adds vdso_getcpu_init(), which stores the NUMA node for
a cpu in SPRG3:

Commit 18ad51dd ("powerpc: Add VDSO version of getcpu") adds
vdso_getcpu_init(), which stores the NUMA node for a cpu in SPRG3.

This patch ensures that this information is also updated when the NUMA
affinity of a cpu changes.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

176bbf14

powerpc/pseries: Use stop machine to update cpu maps · 30c05350

由 Nathan Fontenot 提交于 4月 24, 2013

The new PRRN firmware feature allows CPU and memory resources to be
transparently reassigned across NUMA boundaries. When this happens, the
kernel must update the node maps to reflect the new affinity information.

Although the NUMA maps can be protected by locking primitives during the
update itself, this is insufficient to prevent concurrent accesses to these
structures. Since cpumask_of_node() hands out a pointer to these
structures, they can still be modified outside of the lock. Furthermore,
tracking down each usage of these pointers and adding locks would be quite
invasive and difficult to maintain.

The approach used is to make a list of affected cpus and call stop_machine
to have the update routine run on each of the affected cpus allowing them
to update themselves. Each cpu finds itself in the list of cpus and makes
the appropriate updates. We need to have each cpu do this for themselves to
handle calls to vdso_getcpu_init() added in a subsequent patch.

Situations like these are best handled using stop_machine(). Since the NUMA
affinity updates are exceptionally rare events, this approach has the
benefit of not adding any overhead while accessing the NUMA maps during
normal operation.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

30c05350

powerpc/pseries: Update CPU maps when device tree is updated · 5d88aa85

由 Jesse Larrew 提交于 4月 24, 2013

Platform events such as partition migration or the new PRRN firmware
feature can cause the NUMA characteristics of a CPU to change, and these
changes will be reflected in the device tree nodes for the affected
CPUs.

This patch registers a handler for Open Firmware device tree updates
and reconfigures the CPU and node maps whenever the associativity
changes. Currently, this is accomplished by marking the affected CPUs in
the cpu_associativity_changes_mask and allowing
arch_update_cpu_topology() to retrieve the new associativity information
using hcall_vphn().

Protecting the NUMA cpu maps from concurrent access during an update
operation will be addressed in a subsequent patch in this series.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

5d88aa85

powerpc/pseries: Update numa.c to use updated firmware_has_feature() · 8002b0c5

由 Nathan Fontenot 提交于 4月 24, 2013

Update the numa code to use the updated firmware_has_feature() when checking
for type 1 affinity.
Signed-off-by: NNathan Fontenot <nfont@linux.vnet.ibm.com>
Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>

8002b0c5

18 4月, 2013 1 次提交

powerpc: fix annotation of fake_numa_create_new_node() · 55671f3c

由 Stephen Rothwell 提交于 3月 25, 2013

This function has always been marked as __cpuinit, but is only called
from functions marked as __init and references an __initdata variable.
So change its annotation to __init.

Fixes this build warning:

WARNING: arch/powerpc/mm/built-in.o(.cpuinit.text+0x86): Section mismatch in reference from the function .fake_numa_create_new_node() to the variable .init.data:cmdline
The function __cpuinit .fake_numa_create_new_node() references
a variable __initdata cmdline.
If cmdline is only used by .fake_numa_create_new_node then
annotate cmdline with a matching annotation.
Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: NMichael Ellerman <michael@ellerman.id.au>

55671f3c