1. 28 7月, 2014 4 次提交
  2. 11 7月, 2014 1 次提交
  3. 21 6月, 2014 3 次提交
    • L
      powerpc/booke64: wrap tlb lock and search in htw miss with FTR_SMT · 7c480050
      Laurentiu Tudor 提交于
      Virtualized environments may expose a e6500 dual-threaded core
      as two single-threaded e6500 cores. Take advantage of this
      and get rid of the tlb lock and the trap-causing tlbsx in
      the htw miss handler by guarding with CPU_FTR_SMT, as it's
      already being done in the bolted tlb1 miss handler.
      
      As seen in the results below, measurements done with lmbench
      random memory access latency test running under Freescale's
      Embedded Hypervisor, there is a ~34% improvement.
      
      Memory latencies in nanoseconds - smaller is better
          (WARNING - may not be correct, check graphs)
      ----------------------------------------------------
      Host       Mhz   L1 $   L2 $    Main mem    Rand mem
      ---------  ---   ----   ----    --------    --------
      smt       1665 1.8020   13.2    83.0         1149.7
      nosmt     1665 1.8020   13.2    83.0          758.1
      Signed-off-by: NLaurentiu Tudor <Laurentiu.Tudor@freescale.com>
      Cc: Scott Wood <scottwood@freescale.com>
      [scottwood@freescale.com: commit message tweak]
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      7c480050
    • S
      powerpc/e6500: hw tablewalk: fix recursive tlb lock on cpu 0 · 1cb4ed92
      Scott Wood 提交于
      Commit 82d86de2 "TLB lock recursive"
      introduced a bug whereby cpu 0 uses the same value for "lock held" as
      is used to indicate that the lock is free.  This means that cpu 1 can
      acquire the lock whenever it wants, regardless of whether cpu 0 has it
      locked, which in turn means we can get duplicate TLB entries.
      
      Add one to the CPU value to ensure we do not use zero as a "lock held"
      value.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Reported-by: NEd Swarthout <ed.swarthout@freescale.com>
      1cb4ed92
    • S
      powerpc/e6500: hw tablewalk: clear TID in kernel indirect entries · bbd08c72
      Scott Wood 提交于
      Previously TID was being cleared before the tlbsx, but not after.  This
      can lead to a multiway hit between a TLB entry with TID=0 (previously
      inserted when PID=0) and a TLB entry with TID!=0 that matches PID.
      This can theoretically result in undefined behavior, though we probably
      get lucky due to the details of the overlap.  It also results in the
      inability to use multihit detection to detect other conflicting TLB
      entries, as well as poorer TLB utilization due to duplicating kernel
      TLB entries.
      
      Rather than try to patch up MAS1 after tlbsx, the entire value is
      saved/restored as with MAS2.
      
      I observed a slight improvement in TLB miss performance with this patch
      applied.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Reported-by: NEd Swarthout <ed.swarthout@freescale.com>
      bbd08c72
  4. 06 6月, 2014 1 次提交
    • M
      powerpc/mm: Check paca psize is up to date for huge mappings · 09567e7f
      Michael Ellerman 提交于
      We have a bug in our hugepage handling which exhibits as an infinite
      loop of hash faults. If the fault is being taken in the kernel it will
      typically trigger the softlockup detector, or the RCU stall detector.
      
      The bug is as follows:
      
       1. mmap(0xa0000000, ..., MAP_FIXED | MAP_HUGE_TLB | MAP_ANONYMOUS ..)
       2. Slice code converts the slice psize to 16M.
       3. The code on lines 539-540 of slice.c in slice_get_unmapped_area()
          synchronises the mm->context with the paca->context. So the paca slice
          mask is updated to include the 16M slice.
       3. Either:
          * mmap() fails because there are no huge pages available.
          * mmap() succeeds and the mapping is then munmapped.
          In both cases the slice psize remains at 16M in both the paca & mm.
       4. mmap(0xa0000000, ..., MAP_FIXED | MAP_ANONYMOUS ..)
       5. The slice psize is converted back to 64K. Because of the check on line 539
          of slice.c we DO NOT update the paca->context. The paca slice mask is now
          out of sync with the mm slice mask.
       6. User/kernel accesses 0xa0000000.
       7. The SLB miss handler slb_allocate_realmode() **uses the paca slice mask**
          to create an SLB entry and inserts it in the SLB.
      18. With the 16M SLB entry in place the hardware does a hash lookup, no entry
          is found so a data access exception is generated.
      19. The data access handler calls do_page_fault() -> handle_mm_fault().
      10. __handle_mm_fault() creates a THP mapping with do_huge_pmd_anonymous_page().
      11. The hardware retries the access, there is still nothing in the hash table
          so once again a data access exception is generated.
      12. hash_page() calls into __hash_page_thp() and inserts a mapping in the
          hash. Although the THP mapping maps 16M the hashing is done using 64K
          as the segment page size.
      13. hash_page() returns immediately after calling __hash_page_thp(), skipping
          over the code at line 1125. Resulting in the mismatch between the
          paca->context and mm->context not being detected.
      14. The hardware retries the access, the hash it generates using the 16M
          SLB entry does NOT match the hash we inserted.
      15. We take another data access and go into __hash_page_thp().
      16. We see a valid entry in the hpte_slot_array and so we call updatepp()
          which succeeds.
      17. Goto 14.
      
      We could fix this in two ways. The first would be to remove or modify
      the check on line 539 of slice.c.
      
      The second option is to cause the check of paca psize in hash_page() on
      line 1125 to also be done for THP pages.
      
      We prefer the latter, because the check & update of the paca psize is
      not done until we know it's necessary. It's also done only on the
      current cpu, so we don't need to IPI all other cpus.
      
      Without further rearranging the code, the simplest fix is to pull out
      the code that checks paca psize and call it in two places. Firstly for
      THP/hugetlb, and secondly for other mappings as before.
      
      Thanks to Dave Jones for trinity, which originally found this bug.
      Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
      Reviewed-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: stable@vger.kernel.org [v3.11+]
      09567e7f
  5. 05 6月, 2014 1 次提交
  6. 30 5月, 2014 1 次提交
    • A
      KVM: PPC: Book3S PR: Rework SLB switching code · d8d164a9
      Alexander Graf 提交于
      On LPAR guest systems Linux enables the shadow SLB to indicate to the
      hypervisor a number of SLB entries that always have to be available.
      
      Today we go through this shadow SLB and disable all ESID's valid bits.
      However, pHyp doesn't like this approach very much and honors us with
      fancy machine checks.
      
      Fortunately the shadow SLB descriptor also has an entry that indicates
      the number of valid entries following. During the lifetime of a guest
      we can just swap that value to 0 and don't have to worry about the
      SLB restoration magic.
      
      While we're touching the code, let's also make it more readable (get
      rid of rldicl), allow it to deal with a dynamic number of bolted
      SLB entries and only do shadow SLB swizzling on LPAR systems.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      d8d164a9
  7. 23 5月, 2014 1 次提交
    • S
      powerpc/fsl-booke64: Set vmemmap_psize to 4K · e57eeae4
      Scott Wood 提交于
      The only way Freescale booke chips support mappings larger than 4K
      is via TLB1.  The only way we support (direct) TLB1 entries is via
      hugetlb, which is not what map_kernel_page() does when given a large
      page size.
      
      Without this, a kernel with CONFIG_SPARSEMEM_VMEMMAP enabled crashes on
      boot with messages such as:
      
      PID hash table entries: 4096 (order: 3, 32768 bytes)
      Sorting __ex_table...
      BUG: Bad page state in process swapper  pfn:00a2f
      page:8000040000023a48 count:0 mapcount:0 mapping:0000040000ffce48 index:0x40000ffbe50
      page flags: 0x40000ffda40(active|arch_1|private|private_2|head|tail|swapcache|mappedtodisk|reclaim|swapbacked|unevictable|mlocked)
      page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
      bad because of flags:
      page flags: 0x311840(active|private|private_2|swapcache|unevictable|mlocked)
      Modules linked in:
      CPU: 0 PID: 0 Comm: swapper Not tainted 3.15.0-rc1-00003-g7fa250c #299
      Call Trace:
      [c00000000098ba20] [c000000000008b3c] .show_stack+0x7c/0x1cc (unreliable)
      [c00000000098baf0] [c00000000060aa50] .dump_stack+0x88/0xb4
      [c00000000098bb70] [c0000000000c0468] .bad_page+0x144/0x1a0
      [c00000000098bc10] [c0000000000c0628] .free_pages_prepare+0x164/0x17c
      [c00000000098bcc0] [c0000000000c24cc] .free_hot_cold_page+0x48/0x214
      [c00000000098bd60] [c00000000086c318] .free_all_bootmem+0x1fc/0x354
      [c00000000098be70] [c00000000085da84] .mem_init+0xac/0xdc
      [c00000000098bef0] [c0000000008547b0] .start_kernel+0x21c/0x4d4
      [c00000000098bf90] [c000000000000448] .start_here_common+0x20/0x58
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      e57eeae4
  8. 01 5月, 2014 2 次提交
  9. 30 4月, 2014 1 次提交
  10. 29 4月, 2014 1 次提交
    • A
      KVM guest: Make pv trampoline code executable · b18db0b8
      Alexander Graf 提交于
      Our PV guest patching code assembles chunks of instructions on the fly when it
      encounters more complicated instructions to hijack. These instructions need
      to live in a section that we don't mark as non-executable, as otherwise we
      fault when jumping there.
      
      Right now we put it into the .bss section where it automatically gets marked
      as non-executable. Add a check to the NX setting function to ensure that we
      leave these particular pages executable.
      Signed-off-by: NAlexander Graf <agraf@suse.de>
      b18db0b8
  11. 28 4月, 2014 1 次提交
  12. 23 4月, 2014 3 次提交
  13. 19 4月, 2014 1 次提交
    • M
      powerpc/mm: fix ".__node_distance" undefined · 12c743eb
      Mike Qiu 提交于
        CHK     include/config/kernel.release
        CHK     include/generated/uapi/linux/version.h
        CHK     include/generated/utsrelease.h
        ...
        Building modules, stage 2.
      WARNING: 1 bad relocations
      c0000000013d6a30 R_PPC64_ADDR64    uprobes_fetch_type_table
        WRAP    arch/powerpc/boot/zImage.pseries
        WRAP    arch/powerpc/boot/zImage.epapr
        MODPOST 1849 modules
      ERROR: ".__node_distance" [drivers/block/nvme.ko] undefined!
      make[1]: *** [__modpost] Error 1
      make: *** [modules] Error 2
      make: *** Waiting for unfinished jobs....
      
      The reason is symbol "__node_distance" not been exported in powerpc.
      Signed-off-by: NMike Qiu <qiudayu@linux.vnet.ibm.com>
      Acked-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
      Cc: Robert Jennings <rcj@linux.vnet.ibm.com>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Mike Qiu <qiudayu@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      12c743eb
  14. 09 4月, 2014 2 次提交
    • M
      power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update · 9a013361
      Michael Wang 提交于
      Since v1:
      	Edited the comment according to Srivatsa's suggestion.
      
      During the testing, we encounter below WARN followed by Oops:
      
      	WARNING: at kernel/sched/core.c:6218
      	...
      	NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
      	LR [c000000000101358] .build_sched_domains+0xec8/0x1200
      	PACATMSCRATCH [800000000000f032]
      	Call Trace:
      	[c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
      	[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
      	[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
      	[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
      	...
      	Oops: Kernel access of bad area, sig: 11 [#1]
      	...
      	NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
      	LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
      	PACATMSCRATCH [8000000000029032]
      	Call Trace:
      	[c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
      	[c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
      	[c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
      	[c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
      	[c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
      	...
      
      This was caused by that 'sd->groups == NULL' after building groups, which
      was caused by the empty 'sd->span'.
      
      The cpu's domain contained nothing because the cpu was assigned to a wrong
      node, due to the following unfortunate sequence of events:
      
      1. The hypervisor sent a topology update to the guest OS, to notify changes
         to the cpu-node mapping. However, the update was actually redundant - i.e.,
         the "new" mapping was exactly the same as the old one.
      
      2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
         the 'for-loop' in arch_update_cpu_topology().
      
      3. So we ended up calling stop-machine() with an empty cpumask list, which made
         stop-machine internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
         the cpu to run the payload (the update_cpu_topology() function).
      
      4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
         is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
         finds update->cpu as well as update->new_nid to be 0. In other words, we
         end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
      
      Along with the following wrong updating, it causes the sched-domain rebuild
      code to break and crash the system.
      
      Fix this by skipping the topology update in cases where we find that
      the topology has not actually changed in reality (ie., spurious updates).
      
      CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      CC: Paul Mackerras <paulus@samba.org>
      CC: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      CC: Stephen Rothwell <sfr@canb.auug.org.au>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Robert Jennings <rcj@linux.vnet.ibm.com>
      CC: Jesse Larrew <jlarrew@linux.vnet.ibm.com>
      CC: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      CC: Alistair Popple <alistair@popple.id.au>
      Suggested-by: N"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NMichael Wang <wangyun@linux.vnet.ibm.com>
      Reviewed-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      9a013361
    • A
      powerpc/mm: NUMA pte should be handled via slow path in get_user_pages_fast() · 1dc954bd
      Aneesh Kumar K.V 提交于
      We need to handle numa pte via the slow path
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      1dc954bd
  15. 24 3月, 2014 1 次提交
  16. 20 3月, 2014 3 次提交
    • S
      powerpc/booke64: Critical and machine check exception support · 609af38f
      Scott Wood 提交于
      Add special state saving for critical and machine check exceptions.
      
      Most of this code could be used to handle debug exceptions taken from
      kernel space, but actually doing so is outside the scope of this patch.
      
      The various critical and machine check exceptions now point to their
      real handlers, rather than hanging the kernel.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      609af38f
    • S
      powerpc/booke64: Use SPRG_TLB_EXFRAME on bolted handlers · a3dc6207
      Scott Wood 提交于
      While bolted handlers (including e6500) do not need to deal with a TLB
      miss recursively causing another TLB miss, nested TLB misses can still
      happen with crit/mc/debug exceptions -- so we still need to honor
      SPRG_TLB_EXFRAME.
      
      We don't need to spend time modifying it in the TLB miss fastpath,
      though -- the special level exception will handle that.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      Cc: Mihai Caraman <mihai.caraman@freescale.com>
      Cc: kvm-ppc@vger.kernel.org
      a3dc6207
    • S
      powerpc/e6500: Make TLB lock recursive · 82d86de2
      Scott Wood 提交于
      Once special level interrupts are supported, we may take nested TLB
      misses -- so allow the same thread to acquire the lock recursively.
      
      The lock will not be effective against the nested TLB miss handler
      trying to write the same entry as the interrupted TLB miss handler, but
      that's also a problem on non-threaded CPUs that lack TLB write
      conditional.  This will be addressed in the patch that enables crit/mc
      support by invalidating the TLB on return from level exceptions.
      Signed-off-by: NScott Wood <scottwood@freescale.com>
      82d86de2
  17. 07 3月, 2014 2 次提交
  18. 17 2月, 2014 1 次提交
  19. 11 2月, 2014 1 次提交
    • M
      powerpc: Fix kdump hang issue on p8 with relocation on exception enabled. · 429d2e83
      Mahesh Salgaonkar 提交于
      On p8 systems, with relocation on exception feature enabled we are seeing
      kdump kernel hang at interrupt vector 0xc*4400. The reason is, with this
      feature enabled, exception are raised with MMU (IR=DR=1) ON with the
      default offset of 0xc*4000. Since exception is raised in virtual mode it
      requires the vector region to be executable without which it fails to
      fetch and execute instruction at 0xc*4xxx. For default kernel since kernel
      is loaded at real 0, the htab mappings sets the entire kernel text region
      executable. But for relocatable kernel (e.g. kdump case) we only copy
      interrupt vectors down to real 0 and never marked that region as
      executable because in p7 and below we always get exception in real mode.
      
      This patch fixes this issue by marking htab mapping range as executable
      that overlaps with the interrupt vector region for relocatable kernel.
      
      Thanks to Ben who helped me to debug this issue and find the root cause.
      Signed-off-by: NMahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      429d2e83
  20. 29 1月, 2014 3 次提交
  21. 22 1月, 2014 1 次提交
    • T
      memblock: make memblock_set_node() support different memblock_type · e7e8de59
      Tang Chen 提交于
      [sfr@canb.auug.org.au: fix powerpc build]
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "Rafael J . Wysocki" <rjw@sisk.pl>
      Cc: Chen Tang <imtangchen@gmail.com>
      Cc: Gong Chen <gong.chen@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jiang Liu <jiang.liu@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Liu Jiang <jiang.liu@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Renninger <trenn@suse.de>
      Cc: Toshi Kani <toshi.kani@hp.com>
      Cc: Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
      Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e7e8de59
  22. 18 1月, 2014 1 次提交
  23. 15 1月, 2014 4 次提交
    • G
      powerpc: Make add_system_ram_resources() __init · 4f770924
      Geert Uytterhoeven 提交于
      add_system_ram_resources() is a subsys_initcall.
      Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      4f770924
    • S
      powerpc: Add debug checks to catch invalid cpu-to-node mappings · 68fb18aa
      Srivatsa S. Bhat 提交于
      There have been some weird bugs in the past where the kernel tried to associate
      threads of the same core to different NUMA nodes, and things went haywire after
      that point (as expected).
      
      But unfortunately, root-causing such issues have been quite challenging, due to
      the lack of appropriate debug checks in the kernel. These bugs usually lead to
      some odd soft-lockups in the scheduler's build-sched-domain code in the CPU
      hotplug path, which makes it very hard to trace it back to the incorrect
      cpu-to-node mappings.
      
      So add appropriate debug checks to catch such invalid cpu-to-node mappings
      as early as possible.
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      68fb18aa
    • S
      powerpc: Fix the setup of CPU-to-Node mappings during CPU online · d4edc5b6
      Srivatsa S. Bhat 提交于
      On POWER platforms, the hypervisor can notify the guest kernel about dynamic
      changes in the cpu-numa associativity (VPHN topology update). Hence the
      cpu-to-node mappings that we got from the firmware during boot, may no longer
      be valid after such updates. This is handled using the arch_update_cpu_topology()
      hook in the scheduler, and the sched-domains are rebuilt according to the new
      mappings.
      
      But unfortunately, at the moment, CPU hotplug ignores these updated mappings
      and instead queries the firmware for the cpu-to-numa relationships and uses
      them during CPU online. So the kernel can end up assigning wrong NUMA nodes
      to CPUs during subsequent CPU hotplug online operations (after booting).
      
      Further, a particularly problematic scenario can result from this bug:
      On POWER platforms, the SMT mode can be switched between 1, 2, 4 (and even 8)
      threads per core. The switch to Single-Threaded (ST) mode is performed by
      offlining all except the first CPU thread in each core. Switching back to
      SMT mode involves onlining those other threads back, in each core.
      
      Now consider this scenario:
      
      1. During boot, the kernel gets the cpu-to-node mappings from the firmware
         and assigns the CPUs to NUMA nodes appropriately, during CPU online.
      
      2. Later on, the hypervisor updates the cpu-to-node mappings dynamically and
         communicates this update to the kernel. The kernel in turn updates its
         cpu-to-node associations and rebuilds its sched domains. Everything is
         fine so far.
      
      3. Now, the user switches the machine from SMT to ST mode (say, by running
         ppc64_cpu --smt=1). This involves offlining all except 1 thread in each
         core.
      
      4. The user then tries to switch back from ST to SMT mode (say, by running
         ppc64_cpu --smt=4), and this involves onlining those threads back. Since
         CPU hotplug ignores the new mappings, it queries the firmware and tries to
         associate the newly onlined sibling threads to the old NUMA nodes. This
         results in sibling threads within the same core getting associated with
         different NUMA nodes, which is incorrect.
      
         The scheduler's build-sched-domains code gets thoroughly confused with this
         and enters an infinite loop and causes soft-lockups, as explained in detail
         in commit 3be7db6a (powerpc: VPHN topology change updates all siblings).
      
      So to fix this, use the numa_cpu_lookup_table to remember the updated
      cpu-to-node mappings, and use them during CPU hotplug online operations.
      Further, we also need to ensure that all threads in a core are assigned to a
      common NUMA node, irrespective of whether all those threads were online during
      the topology update. To achieve this, we take care not to use cpu_sibling_mask()
      since it is not hotplug invariant. Instead, we use cpu_first_sibling_thread()
      and set up the mappings manually using the 'threads_per_core' value for that
      particular platform. This helps us ensure that we don't hit this bug with any
      combination of CPU hotplug and SMT mode switching.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      d4edc5b6
    • P
      powerpc: Delete non-required instances of include <linux/init.h> · c141611f
      Paul Gortmaker 提交于
      None of these files are actually using any __init type directives
      and hence don't need to include <linux/init.h>.  Most are just a
      left over from __devinit and __cpuinit removal, or simply due to
      code getting copied from one driver to the next.
      
      The one instance where we add an include for init.h covers off
      a case where that file was implicitly getting it from another
      header which itself didn't need it.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org>
      c141611f